CosmosDB connectivity issues recently affecting a number of our accounts in our TestTRS01 subscription

Scott Edden 20 Reputation points • Microsoft Employee

Problem Description

Newly created Cosmos DB containers on both accounts immediately enter a 410 Gone / substatus 1000 state and never recover without external intervention. The containers remain in this state for hours to weeks. The control-plane API reports "Collection is not yet available for read. Please retry in some time." indefinitely.

This is a recurring pattern affecting both accounts simultaneously, suggesting a degraded storage node in West US that both accounts are being assigned to when new containers are created.

Pattern

Our dev labs delete all databases daily (cleanup cycle) and recreate them via service migrations. After cleanup:

Databases are deleted ✓
Migrations create new databases and containers ✓
New containers immediately enter 410 Gone (substatus 1000)
They remain in this state indefinitely — we have observed durations of 4 hours, 6 days, and 2 weeks on separate occurrences
The control-plane endpoint (az cosmosdb sql container show) returns "Collection is not yet available for read. Please retry in some time."
The only remediation we have found is deleting the container/database and hoping it gets assigned to a different partition on recreation
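
The pattern above can be caught automatically so a lab run fails fast instead of waiting on a container that never becomes readable. Below is a minimal sketch, assuming a caller-supplied `probe` callable that returns `True` once a read against the new container succeeds. `wait_until_ready`, `probe`, and the timeout values are hypothetical names and defaults for illustration, not part of any Cosmos DB SDK.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll probe() until it returns True or the deadline passes.

    Returns True if the container became readable in time, False otherwise.
    The probe is an assumption: in practice it would wrap whatever read the
    migration performs first and return False on 410 / substatus 1000.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            # Deadline exceeded: treat as stuck so the caller can delete and
            # recreate, or collect diagnostics for a support case.
            return False
        sleep(interval_s)
```

A `False` result is the signal to stop retrying in the client and move to the delete-and-recreate workaround or to escalation. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.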

Error Details (from SDK diagnostics)

ClassName: CosmosException
statusCode: 410
substatus: 1000
error: "Gone — The requested resource is no longer available at the server"
operationType: ReadFeed
resourceType: StoredProcedure
connectionMode: DIRECT (RNTBD)

Control-plane (az cosmosdb sql container show):

"Collection is not yet available for read. Please retry in some time."
ActivityId: [varies per attempt]
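
To separate a brief initialization delay from the persistent state described here, the status code and substatus from the diagnostics can be combined with the container's age. The following is a rough sketch. The `Failure` record, the `classify` function, and the 15-minute budget are all assumptions made for illustration, and the budget in particular is not a documented Cosmos DB threshold.

```python
from dataclasses import dataclass

GONE = 410             # statusCode from the SDK diagnostics above
SUBSTATUS_1000 = 1000  # substatus from the SDK diagnostics above


@dataclass(frozen=True)
class Failure:
    status_code: int
    sub_status: int
    seconds_since_create: float  # age of the container when the error was seen


def classify(failure: Failure, transient_budget_s: float = 900.0) -> str:
    """Label a failure as 'other', 'possibly-transient', or 'stuck'.

    transient_budget_s is a hypothetical cutoff chosen for this sketch.
    """
    if failure.status_code != GONE or failure.sub_status != SUBSTATUS_1000:
        return "other"
    if failure.seconds_since_create <= transient_budget_s:
        return "possibly-transient"
    return "stuck"
```

With these assumptions, a 410/1000 seen one minute after creation is labeled `possibly-transient`, while the same error four hours after creation is labeled `stuck`.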

Key Observations

Both accounts affected simultaneously — yam-npe-n4cilab3 and yam-npe-n4cilab3-2 are separate accounts in the same resource group and region, both exhibiting the same problem. This strongly suggests a shared degraded storage node/cluster in West US.

Happens immediately on creation — The 410 state begins as soon as the container is created by migrations; the containers never become healthy.

Does not self-heal — Incidents have persisted for 4 hours, 6 days, and 2 weeks. This is not transient initialization delay.

Recurring — We observe this pattern repeatedly after each daily cleanup cycle, suggesting the accounts are consistently being assigned partitions on the same degraded node.

Other containers on the same accounts are healthy — Only specific newly created containers are affected; existing containers on the same accounts work normally.

Ask

Identify the degraded storage node(s) in West US that are serving these partition IDs and investigate why partitions assigned to them immediately enter an unrecoverable 410 Gone state.

Migrate the affected partitions (or the accounts themselves) off the degraded node so that newly created containers become available normally.

Advise on whether there is a way to request partition reassignment without deleting and recreating the entire Cosmos account, given the ~30 minute recreation cost impacts our daily lab automation.

Answer accepted by question author

Pilladi Padma Sai Manisha 10,190 Reputation points • Microsoft External Staff • Moderator

Hi Scott Edden,
Thank you for reaching out to Microsoft Q&A!

We investigated the issue from the backend and performed the necessary service-side remediation. Following these backend operations, the affected Cosmos DB containers are now provisioning and becoming available as expected, and the environment is currently functioning normally.

Based on our findings, this was related to an internal service condition affecting the impacted resources. The issue has been addressed, and we are no longer observing the prolonged 410 (Substatus 1000) state for newly created containers.

Could you please validate from your side by creating new containers and confirm whether the behavior has been resolved? If you continue to encounter the issue, please share the latest Activity IDs, timestamps (UTC), and affected container names so that we can perform further investigation.
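
If the issue recurs, the requested details can be captured at the moment of failure rather than reconstructed later. The sketch below assumes the Activity ID, status code, and substatus have already been read from the SDK exception. `FailureEvidence`, `record_failure`, and `to_report` are hypothetical helpers, not SDK functions.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class FailureEvidence:
    account: str
    database: str
    container: str
    activity_id: str
    status_code: int
    sub_status: int
    observed_at_utc: str  # ISO 8601 timestamp, always in UTC


def record_failure(
    account: str,
    database: str,
    container: str,
    activity_id: str,
    status_code: int,
    sub_status: int,
    now: Optional[datetime] = None,
) -> FailureEvidence:
    """Build one evidence record, stamping it with the current UTC time."""
    moment = now if now is not None else datetime.now(timezone.utc)
    stamp = moment.astimezone(timezone.utc).isoformat()
    return FailureEvidence(
        account, database, container, activity_id, status_code, sub_status, stamp
    )


def to_report(records: Iterable[FailureEvidence]) -> str:
    """Serialize records as JSON suitable for pasting into a reply or ticket."""
    return json.dumps([asdict(r) for r in records], indent=2)
```

Recording the timestamp in UTC at capture time avoids ambiguity when lab hosts run in different time zones.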

Thank you for your patience while we worked on this issue.

1 additional answer

Sina Salam 30,166 Reputation points • Volunteer Moderator

Hello Scott Edden,

Welcome to Microsoft Q&A, and thank you for posting your question here.

I understand that you are experiencing Cosmos DB connectivity issues affecting a number of accounts in your TestTRS01 subscription.

This is a service-side Azure Cosmos DB container provisioning/readiness failure, not a normal client retry issue. Cosmos DB documents that 410 Gone is retryable/transient and that collection creation can temporarily report “create in progress,” but a container that remains unavailable for hours or days is not expected behavior. Because physical partitions are an internal, fully managed implementation, there is no customer-facing capability to identify the exact backend storage node or force partition reassignment.

The best next step is to open an Azure Support technical ticket immediately from the affected Cosmos DB account/subscription and classify it as a service-side availability/provisioning issue affecting newly created containers. Preserve one affected container/database for backend investigation, supply UTC timestamps, ActivityIds, and full SDK diagnostics, and ask Microsoft to investigate and remediate the backend service infrastructure responsible for the stuck container creation/readiness path. Public Cosmos DB guidance covers account/region failover, restore, multi-region deployment, and zone redundancy, but not customer-controlled physical partition migration. References: https://learn.microsoft.com/en-us/azure/cosmos-db/conceptual-resilient-sdk-application, https://learn.microsoft.com/en-us/rest/api/cosmos-db/http-status-codes-for-cosmosdb, https://learn.microsoft.com/en-us/azure/cosmos-db/partitioning

I hope this is helpful! Do not hesitate to let me know if you have any other questions, steps or clarifications.

Please don't forget to close the thread by upvoting this answer and accepting it if it was helpful.

URL: https://learn.microsoft.com/en-us/answers/questions/5899029/cosmosdb-connectivity-issues-recently-affecting-a