production down

Dani Anjaya 0 Reputation points

"Face API resource 'rise-try-me' was moved from subscription 12e86485... to b657b2a6... Resource shows Active/Succeeded, NetworkRuleSet DefaultAction=Allow, PublicNetworkAccess=Enabled, but all data-plane requests to the endpoint are connection-reset at the AI gateway with no HTTP response (HTTP/2 stream CANCEL / TCP RST after request sent). Suspect gateway hostname-to-backend mapping not re-registered after subscription move. Production service down."

  1. SRILAKSHMI C 19,110 Reputation points Microsoft External Staff Moderator

    Hello @Dani Anjaya

    Could you please fill in the requested details using the link provided in the private message? Once we receive the completed information, I will review it with the relevant team and get back to you as soon as I have an update.

    Thank you for your cooperation and patience.


3 answers

  1. Dani Anjaya 0 Reputation points

    All three re-sync methods attempted — issue persists, suspect data-plane suspension

    Hi, thank you for the suggestions. I have now attempted all three of the recommended synchronization methods, and unfortunately, none of them restored data-plane connectivity:

    1. Metadata modification — Added a new resource tag via Update-AzTag (confirmed applied: tag visible on the resource). No change after propagation wait.
    2. Network rule toggle — Switched NetworkRuleSet DefaultAction from Allow → Deny, waited 90 seconds, then reverted to Allow (confirmed: DefaultAction: Allow, verified via Get-AzCognitiveServicesAccount). No change.
    3. Key regeneration — Regenerated Key2 via New-AzCognitiveServicesAccountKey (completed successfully). No change.

    All three operations succeeded at the control plane, but data-plane routing was not restored.

    Current symptoms remain exactly as before:

    • Resource: Face API "rise-try-me" (RG: mcpp-purchase, Southeast Asia), state Active/Succeeded, SKU S0, PublicNetworkAccess: Enabled, DefaultAction: Allow
    • DNS resolves correctly through the AI gateway chain (*.ai-gateway.southeastasia-01.azure-api.net → Traffic Manager → regional APIM)
    • TLS handshake completes, request is fully sent, then the connection is reset with no HTTP response (HTTP/2: stream CANCEL err 8; HTTP/1.1: TCP RST / "Connection reset by peer")
    • Requests with an intentionally invalid subscription key are also reset — no 401 is returned, which indicates the gateway is not routing the request to the backend at all (authentication is never reached)
    • The same behavior existed before the resource was moved cross-subscription (from 12e86485-... to b657b2a6-...), so the block followed the resource through the move
    • Additionally, a newly created Face resource ("rise-prod") in the destination subscription has been stuck in "Creating" provisioning state for over an hour, which may indicate a related backend/regional issue

    Given that control-plane operations succeed but no data-plane request ever reaches the backend (not even far enough to fail authentication), this looks like a gateway backend-mapping failure or a data-plane suspension that cannot be resolved from the customer side.
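    For anyone who wants to reproduce the invalid-key check described above, here is a minimal Python sketch using only the standard library. It is an illustration under assumptions, not an official diagnostic: the function names `classify` and `probe` are made up for this post, the host you pass in is your own resource endpoint, and the `/face/v1.0/detect` path and sample body are assumed values. `Ocp-Apim-Subscription-Key` is the usual Azure AI services key header. Note that `http.client` speaks HTTP/1.1, so a reset should surface as a connection-reset error rather than an HTTP/2 stream CANCEL.

```python
import http.client
import json
from typing import Optional


def classify(status: Optional[int], error: Optional[BaseException]) -> str:
    """Map one probe outcome to where the request appears to have stopped."""
    if status is not None:
        if status in (401, 403):
            # An auth rejection means something behind the gateway answered.
            return "reached-auth"
        return "http-response"
    # RemoteDisconnected is a ConnectionResetError subclass, listed for clarity.
    if isinstance(error, (ConnectionResetError, http.client.RemoteDisconnected)):
        # Connection dropped with no HTTP response at all.
        return "reset-before-response"
    if isinstance(error, OSError):
        # Timeouts, TLS errors, DNS failures and similar.
        return "network-or-tls-failure"
    return "unknown"


def probe(host: str, key: str, path: str = "/face/v1.0/detect",
          timeout: float = 10.0) -> str:
    """Send one request with the given key and classify the result.

    `path` and the JSON body are assumptions for illustration only.
    """
    body = json.dumps({"url": "https://example.com/face.jpg"})
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/json",
    }
    conn = http.client.HTTPSConnection(host, timeout=timeout)
    try:
        conn.request("POST", path, body=body, headers=headers)
        return classify(conn.getresponse().status, None)
    except (OSError, http.client.HTTPException) as exc:
        return classify(None, exc)
    finally:
        conn.close()
```

    Calling `probe("<your-endpoint-host>", "deliberately-invalid-key")` against a healthy resource would be expected to return "reached-auth" (a 401). In the situation described here it would be expected to return "reset-before-response", which is consistent with the request never getting as far as authentication.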

    This is a production service with enrolled customer face data, currently down. Could you please advise whether this can be escalated for backend investigation, or confirm that a Microsoft support ticket is the appropriate next step? I have the full diagnostic timeline available if needed.

  2. KUM 0 Reputation points

    Same issue here with OCR document requests; our production is also down.

  3. Jose Benjamin Solis Nolasco 8,401 Reputation points Volunteer Moderator

    Welcome to Microsoft Q&A

    Hello Dani Anjaya, I hope you are doing well.

    Based on the symptoms described, this could indicate synchronization latency or a gap between the Azure Resource Manager (ARM) control plane and the data-plane gateway routing infrastructure following a cross-subscription migration. While ARM reflects an updated state (Active/Succeeded), the data-plane gateway may not yet have picked up the change.

    To address this condition, administrators frequently choose to trigger a manual state refresh to push a new configuration payload to the resource provider. If you decide to proceed with troubleshooting, the following optional methods are common practices for forcing a synchronization event:

    • Key Regeneration: You may choose to navigate to Keys and Endpoint and regenerate an access key. If selected, please note that your consuming applications must be manually updated with the new key string to prevent authentication failures.
    • Network Rule Toggle: You may opt to navigate to Networking and temporarily adjust the firewall rules (e.g., switching from All networks to Selected networks, saving the change, and then reverting the configuration).
    • Metadata Modification: Alternatively, you may apply a modification to the resource metadata by adding or updating a resource tag within the Tags menu.

    Please evaluate these options carefully against your organization's change management policies and production impact thresholds.

    😊 If my answer helped you resolve your issue, please consider marking it as the correct answer. This helps others in the community find solutions more easily. Thanks!
