Unable to connect to Cosmos DB today

Humberto Oliveros 0 Reputation points
2025-02-13T15:22:12.6066667+00:00

Our application wasn't able to connect to Cosmos DB today, 2025-02-13, from around 3:00 AM to 3:30 AM Pacific Time (GMT-8).

We're using the Mongo API. The errors are:

MongoDB.Driver.MongoWriteException: A write operation resulted in an error. WriteError: { Category : "Uncategorized", Code : 1, Message : "[ActivityId=c47959d9-de6d-496c-b502-c8cf521d1d79] Error=1, Details='Response status code does not indicate success: ServiceUnavailable (503); Substatus: 20001; ActivityId: c47959d9-de6d-496c-b502-c8cf521d1d79; Reason: (Service is currently unavailable. More info: https://aka.ms/cosmosdb-tsg-service-unavailable. The SDK failed to connect to the service. Please check your networking configuration.

A bit more detail from further down the stack trace:

"TransportException":"A client transport error occurred: SSL negotiation timed out. (Time: 2025-02-13T11:16:29.3056860Z, activity ID: c47959d9-de6d-496c-b502-c8cf521d1d79, error code: SslNegotiationTimeout [0x0008], base error: HRESULT 0x80131500, URI: rntbd://cdb-ms-prod-westus2-be13.documents.azure.com:14004/, connection: 10.0.1.19:51372 -> 40.64.135.11:14004, payload sent: False)

Read-only operations against Cosmos DB succeeded; only write operations failed. We saw the same behaviour not only from our application, which runs as a Function App, but also from regular Mongo tools such as MongoDB Compass.

Server Side Retry is enabled for our Cosmos DB.

The Service Health dashboard doesn't show any events around that time. It does show one event two days earlier, for a European DB cluster.

This meant our application was effectively frozen. We had to fully restart the application to get it back in operation.

Was there an outage at that time?

If so, is it possible to receive alerts when that happens, so our staff / application can be aware?

We do have alerts enabled via Application Insights, but those are sent every hour, which is not useful in this case: they arrived after the outage had already finished.

With Server Side Retry enabled, is it possible to increase the retry parameters?

Azure Cosmos DB

1 answer

  1. Vijayalaxmi Kattimani 1,330 Reputation points Microsoft Vendor
    2025-02-14T04:01:03.39+00:00

    Hi Humberto Oliveros,

    Welcome to the Microsoft Q&A Platform! Thank you for asking your question here.

    We apologize for the inconvenience caused.

    As we understand it, your application encountered a service interruption with Cosmos DB on 2025-02-13 between 3:00 AM and 3:30 AM Pacific Time. Unfortunately, I couldn't find any specific internal documentation or emails regarding an outage during that time.

    However, I can provide some general information and troubleshooting steps that might help.

    • Verify Substatus Code: The error message you received includes a substatus code (20001), which indicates client-side connectivity issues. This can be due to network conditions or transient connectivity problems.
    • Check Network Configuration: Ensure that your network configuration is correct and that all required ports are enabled. Transient connectivity issues can cause timeouts and can be safely retried following the design recommendations.
    • Service Health Dashboard: Although the Service Health dashboard didn't show any events at the time of the outage, it's always a good idea to check for any ongoing issues. You can also monitor the Azure status page for updates. https://azure.status.microsoft/en-gb/status
    • Retry Policies: With Server Side Retry enabled, you can increase the retry parameters to handle transient errors better. Ensure that your application design follows the guide for designing resilient applications with Azure Cosmos DB SDKs.
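
A minimal sketch of the client-side retry idea mentioned above, written in Python for illustration only. `TransientServiceError` and `retry_with_backoff` are hypothetical names, not part of any Cosmos DB or MongoDB driver; in a real application you would catch whatever exception your driver raises for a 503 response. The delays and attempt counts are assumptions, not recommended values.

```python
import random
import time


class TransientServiceError(Exception):
    """Hypothetical stand-in for a driver error such as a 503 write failure."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Call operation(); on TransientServiceError, wait and try again.

    The wait ceiling doubles on each attempt (base_delay, 2x, 4x, ...) up to
    max_delay, and the actual wait is a random value below that ceiling so
    that many clients do not retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientServiceError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the error to the caller.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

A wrapper like this is only safe around writes that are idempotent (for example, an upsert keyed on a known `_id`), because a write whose response was lost may already have been applied before the retry runs.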

    To receive more timely alerts, you can consider the following options:

    1. Azure Monitor Alerts: Configure Azure Monitor to send alerts based on specific metrics or logs. You can set up alerts to trigger immediately when certain conditions are met, rather than waiting for hourly Application Insights alerts.
    2. Custom Alerts: Implement custom alerting mechanisms within your application to detect and notify you of service interruptions in real-time.
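
The custom alerting option could look roughly like the sketch below: count recent write failures in a sliding time window and notify once when a threshold is crossed. `FailureRateAlert`, its parameters, and the `notify` callback are all hypothetical names chosen for this example; the threshold and window are assumptions you would tune for your own workload.

```python
import time
from collections import deque


class FailureRateAlert:
    """Fire notify() once when too many failures land inside a time window."""

    def __init__(self, notify, threshold=5, window_seconds=60.0,
                 clock=time.monotonic):
        self.notify = notify            # Callback taking one message string.
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.clock = clock
        self._failures = deque()        # Timestamps of recent failures.
        self._alerting = False          # True after an alert, until re-armed.

    def _prune(self, now):
        # Drop failures that have aged out of the window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()

    def record_failure(self, detail=""):
        now = self.clock()
        self._failures.append(now)
        self._prune(now)
        if len(self._failures) >= self.threshold and not self._alerting:
            self._alerting = True
            self.notify(
                f"{len(self._failures)} failures in "
                f"{self.window_seconds:.0f}s: {detail}"
            )

    def record_success(self):
        # Re-arm only once the window holds no recent failures.
        self._prune(self.clock())
        if not self._failures:
            self._alerting = False
```

In use, the application would call `record_failure(str(error))` in the handler for a failed write and `record_success()` after a successful one, with `notify` wired to whatever channel reaches your staff (email, chat webhook, paging service).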

    Please refer to the link below for more information:

    https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/troubleshoot-service-unavailable

    I hope this response addresses your query and helps you overcome your challenges.

    If this answers your query, please click Accept Answer and select Yes for "Was this answer helpful". If you have any further questions, do let us know.

