Azure DocumentDB (with MongoDB compatibility) Connection Timeouts

Phillip Stenger 5 Reputation points

We have started to experience connection timeouts for our Azure DocumentDB MongoDB resource. We have an app running in Container Apps that connects to the database over Private Link, and it is suddenly unable to connect. Connecting from my local machine, which previously worked, also fails.

The issue started early this morning, around 8:15 AM. The only changes in the Activity Log this morning were a "Create role assignment" and a "Create or update resource diagnostic setting", both of which happened as part of a Terraform deployment. Both the app and my local machine use password authentication, so the "role assignment" operation is unlikely to have affected them.

This is the error I get from my local machine: Unable to connect: connect ETIMEDOUT

In my container app: pymongo.errors.ServerSelectionTimeoutError: [redacted].mongocluster.cosmos.azure.com:10260: timed out, Timeout: 30s,

I also noticed an unusual CPU spike to 60% at 5:40 AM this morning. The previous maximum over the past couple of months was ~12%. I could not find any requests in the database logs at that time.

  1. Manoj Kumar Boyini 17,060 Reputation points Microsoft External Staff Moderator

    Hi @Phillip Stenger

    Please share the details requested in the private message so that we can investigate further.

  2. Manoj Kumar Boyini 17,060 Reputation points Microsoft External Staff Moderator

    Hi @Phillip Stenger

    Based on our analysis and input from the Product Engineering team, the issue was intermittent: the service experienced a temporary availability disruption and then recovered automatically, with no manual intervention required.

    The Product team also confirmed that the service has now returned to a healthy state, with connectivity restored, and the cluster is currently operating as expected.

    As part of our ongoing improvements, the engineering team is continuously enhancing automated detection and recovery mechanisms to better identify such situations earlier and minimize potential impact on incoming requests.

    We sincerely apologize for any inconvenience caused.

    If you encounter a similar issue in the future, please feel free to reach out to us, and we will engage the Product team on priority for faster resolution.



2 answers

  1. Manoj Kumar Boyini 17,060 Reputation points Microsoft External Staff Moderator

    Hi @Phillip Stenger

    Root Cause:  

    Upon investigation, the product team determined that the root cause of the incident was the unexpected termination of the internal container hosting the Azure DocumentDB process. Metrics showed that the health status of the cluster was stable, and resources such as CPU and memory did not indicate sustained pressure, but a 'kill' event was recorded against the internal container at the onset of the issue. This event interrupted the primary process, making the cluster unavailable even though the underlying VM remained healthy throughout.

    The Gateway Availability metric also showed a degraded state during the incident, but a gap in telemetry complicated immediate detection and alerting. There was no evidence of a code defect causing the disruption. Instead, the incident resulted from a backend operational event within the container infrastructure, in which the process was lost for reasons that remain under review. The absence of related logs and core dump data limited the ability to pinpoint why the container was terminated.

    High availability was not enabled for this cluster, which increased reliance on the health of the single node and left no automated resilience in case of process disruption.

    Mitigation and Next Steps: 

    After the unavailability was identified, the product team's engineers reconfigured the backend infrastructure, which restored cluster connectivity. Metrics after the intervention showed successful requests and stable health, confirming that the service was available to the customer again. The team is working to improve monitoring sensitivity and close the observed telemetry gaps, which delayed incident detection and escalation. Work is underway to tune the Gateway Availability monitor so that such incidents are surfaced promptly in the future.

    We regret the inconvenience experienced due to this unexpected database interruption. As ongoing corrective actions, we recommend enabling high availability for clusters where production workloads depend on continuous access, reducing exposure to single-node failures. The engineering team is also reviewing container lifecycle controls and diagnostic data retention to aid rapid root cause identification. Thank you for your patience as we reinforce our operational defenses to prevent a recurrence of this disruption.   

    Please let us know if you have any questions or concerns.

    1. Manoj Kumar Boyini 17,060 Reputation points Microsoft External Staff Moderator

      Hi @Phillip Stenger

      I hope you had a chance to review the information shared earlier, and I hope this information has been helpful! If you still have questions, please let us know what is needed in the comments so the question can be answered.


  2. Vinodh247-1375 43,181 Reputation points Volunteer Moderator

    Hi,

    Thanks for reaching out to Microsoft Q&A.

    This is not an authentication issue. It is a connectivity failure.

    Most likely causes:

    • Private Link/DNS issue (endpoint not resolving to private IP)

    • Firewall or network rules reset (public access blocked or IP not allowed)

    • Private endpoint not in Approved state

    • Possible Cosmos DB backend failover/service issue (CPU spike is a clue)

    Key signal: both the local machine and the container app time out, and there are no DB logs, which suggests requests are not reaching Cosmos DB.

    Immediate checks:

    • Run nslookup <account>.mongocluster.cosmos.azure.com

    • Verify Private Endpoint - Connected

    • Temporarily enable public access to isolate issue

    • Check Azure Service Health
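
The DNS and reachability checks above can be sketched in a few lines of Python using only the standard library. This is a rough illustration, not an official diagnostic: the function name `check_endpoint` is hypothetical, the host is the same `<account>` placeholder used above, and port 10260 is taken from the error message in the question. It resolves the hostname, reports whether each address is private (as expected over Private Link), and attempts a plain TCP connect with a short timeout.

```python
import ipaddress
import socket


def check_endpoint(host: str, port: int, timeout: float = 5.0) -> dict:
    """Resolve `host`, then try a plain TCP connect to each resolved address.

    Hypothetical helper for illustration only. It does not speak the MongoDB
    protocol or TLS; it only separates DNS failures from network timeouts.
    """
    result = {"host": host, "port": port, "addresses": [], "error": None}
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # The name did not resolve at all: a DNS problem, not a timeout.
        result["error"] = f"DNS resolution failed: {exc}"
        return result

    seen = set()
    for family, socktype, proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip in seen:
            continue
        seen.add(ip)
        # Strip any IPv6 zone suffix (e.g. "%eth0") before classifying.
        is_private = ipaddress.ip_address(ip.split("%")[0]).is_private
        entry = {"ip": ip, "is_private": is_private, "tcp": None}
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            entry["tcp"] = "connected"
        except socket.timeout:
            # No answer within `timeout`: traffic is dropped or the server is down.
            entry["tcp"] = "timed out"
        except OSError as exc:
            # For example, connection refused or network unreachable.
            entry["tcp"] = f"failed: {exc}"
        result["addresses"].append(entry)
    return result


# Example (replace the placeholder with the real cluster hostname):
# print(check_endpoint("<account>.mongocluster.cosmos.azure.com", 10260))
```

If the name resolves to the expected address but the TCP connect still times out from more than one network, DNS is probably not the cause, and the problem more likely lies in network rules or in the service itself.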

    Conclusion: Focus on DNS + networking + private endpoint, not role assignment.


    1. Phillip Stenger 5 Reputation points

      Thank you, Vinod, but this is not helpful. The database has been working properly for months, and without any known changes, connection attempts have started to time out. From what I can tell, the DNS lookup is resolving correctly. I have the same problem when I try to connect using mongosh, so it is not likely to be a private endpoint/DNS issue.

