East US 2 VMSS: "VMSS name redacted" is deallocated

Anand Kittur (Tata Consultancy Services Limited) 0 Reputation points Microsoft External Staff

East US 2 VMSS: "VMSS name redacted" is deallocated a short time after it comes up. I need help troubleshooting this.


2 answers

  1. Ankit Yadav 14,455 Reputation points Microsoft External Staff Moderator

    Issue Description: All nodes in the Service Fabric cluster became unavailable because the underlying VM Scale Set (VMSS) instances were deallocated.

    Findings: Service Fabric (SFRP model) does not control or deallocate VMSS resources. Any changes to the VMSS state must come from customer actions, automation, or platform-level processes.

    Root Cause (most likely):

    • The subscription might be labeled as non-production, which allows platform-driven capacity reclamation and can result in VM deallocation.
    • Alternatively, deallocation may have been triggered by customer-managed actions, such as manual changes, automation, ARM templates, or autoscale settings.

    Recommended Actions:

    • Check Activity Logs to determine who or what initiated the change (user, automation, or system process).
    • Verify the subscription classification (Production or Non-Production).
    • Review autoscale and deployment settings.
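    To act on the Activity Log recommendation from the command line, the sketch below filters exported Activity Log events down to successful deallocate operations and counts them per caller. It assumes the events were exported as JSON with something like `az monitor activity-log list --resource-id <vmss-resource-id> --offset 7d`, and that each event carries `operationName.value`, `status.value`, `caller`, and `eventTimestamp`; the function names are hypothetical. A caller that looks like an email address is usually a user, while a GUID often points to a service principal or other automation.

```python
from collections import Counter


def deallocate_events(events):
    """Return (timestamp, caller, operation) for each successful deallocate.

    Assumes Activity Log events shaped like the JSON printed by
    `az monitor activity-log list`; only the fields read here are used.
    """
    hits = []
    for event in events:
        operation = (event.get("operationName") or {}).get("value", "")
        status = (event.get("status") or {}).get("value", "")
        # One operation logs several entries (e.g. Started, then Succeeded);
        # keeping only "Succeeded" avoids counting the same action twice.
        if operation.lower().endswith("/deallocate/action") and status == "Succeeded":
            hits.append((event.get("eventTimestamp", ""), event.get("caller", ""), operation))
    return sorted(hits)


def callers(events):
    """Count successful deallocate operations per caller."""
    return Counter(caller for _, caller, _ in deallocate_events(events))
```

    Usage: load the export with `events = json.load(open("activity.json"))`, then `print(callers(events))` to see which identity issued the deallocations.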

    Conclusion: This issue is not caused by the Service Fabric service, but rather by subscription configuration or external actions affecting the VMSS.

  2. Marcus Pantel 95 Reputation points

    Hi Anand.

    If the instances of a VMSS (e.g., "VMSS name redacted") are deallocated shortly after startup, the cause is typically an automated platform action rather than a random crash, since a guest OS crash on its own does not deallocate a VM. Check the following areas:

    1. Identify the Initiator (Activity Log)

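    Navigate to the VMSS in the Azure Portal and open its Activity Log. Find the deallocate event for the scale set or its instances and check who initiated it:

    "Initiated by: Autoscale": Your scaling rules may be too aggressive (for example, scale-out and scale-in thresholds set too close together). Review the thresholds and increase the "Cooldown" period.

    "Initiated by: Azure Infrastructure": If you are using Spot instances with the "Deallocate" eviction policy, this indicates an eviction, typically due to capacity constraints in East US 2 or the current price exceeding your maximum price.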

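    To see why a short cooldown combined with tight thresholds makes a scale set grow and shrink repeatedly, here is a toy model. It is not the Azure Monitor autoscale algorithm (which aggregates metrics over a time window and has its own flapping protection); it replays a fixed per-minute CPU trace against hypothetical `Rules` and records each scale action.

```python
from dataclasses import dataclass


@dataclass
class Rules:
    scale_out_above: float  # CPU % above which one instance is added
    scale_in_below: float   # CPU % below which one instance is removed
    cooldown_min: int       # minutes to wait after any scale action
    minimum: int            # lower bound on instance count
    maximum: int            # upper bound on instance count


def simulate(cpu_by_minute, rules, start_count):
    """Return a list of (minute, new_count) for every scale action taken."""
    count = start_count
    last_action = None
    actions = []
    for minute, cpu in enumerate(cpu_by_minute):
        # While cooling down, the metric is ignored entirely.
        if last_action is not None and minute - last_action < rules.cooldown_min:
            continue
        delta = 0
        if cpu > rules.scale_out_above and count < rules.maximum:
            delta = 1
        elif cpu < rules.scale_in_below and count > rules.minimum:
            delta = -1
        if delta:
            count += delta
            last_action = minute
            actions.append((minute, count))
    return actions
```

    With the alternating trace `[80, 20, 80, 20, 80, 20]`, thresholds of 70/30, and a 1-minute cooldown, the model scales every minute; raising the cooldown to 5 minutes leaves only two actions. A real trace would also change as the instance count changes, which this sketch ignores.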
    2. Automatic Repairs & Health Probes

    If Automatic Repairs are enabled, Azure will automatically repair (by default, replace) instances that fail health checks.

    Check the health signal the scale set uses: a Load Balancer health probe or the Application Health extension.

    If your application takes a long time to initialize, increase the grace period in the Automatic Repairs settings so that Azure does not mark the VM as "Unhealthy" and repair it before the application is ready.

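    In the Automatic Repairs policy, the grace period is expressed as an ISO 8601 duration (the `gracePeriod` property, for example `PT30M`). The sketch below parses time-only durations of that form and compares the result with a measured application startup time. The 20% safety margin and the function names are assumptions for illustration, not Azure defaults.

```python
import re
from datetime import timedelta

# Time-only ISO 8601 durations: PT<h>H<m>M<s>S, each part optional.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_grace_period(value):
    """Parse a duration such as 'PT30M' or 'PT1H15M' into a timedelta."""
    match = _DURATION.match(value)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def repair_risk(grace_period, startup_seconds, margin=1.2):
    """True if the app may still be starting when the grace period ends.

    `margin` (hypothetical) pads the measured startup time for slow boots.
    """
    return startup_seconds * margin > parse_grace_period(grace_period).total_seconds()
```

    For example, an application that needs 15 minutes to start is flagged with `PT10M` but not with `PT30M`.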
    3. Provisioning Timeouts

    Check whether the instance reaches the "Succeeded" provisioning state. If it stays in "Creating" and is then deallocated, a Custom Script Extension or other custom configuration step may be failing or timing out, causing the platform to roll back or stop the instance.

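    The provisioning and power state of an instance appear as status codes in its instance view, for example `ProvisioningState/succeeded` and `PowerState/deallocated`. The sketch below reads a parsed instance view (such as the JSON from `az vmss get-instance-view --resource-group <rg> --name <vmss> --instance-id <id>`) and flags instances that were deallocated without reaching "Succeeded", along with any extensions that report a failed provisioning state. The dict shape and helper names are assumptions based on that output.

```python
def _state(code, prefix):
    """Return the state word after `prefix`, e.g. 'failed' from 'ProvisioningState/failed/X'."""
    parts = code.split("/")
    if len(parts) >= 2 and parts[0].lower() == prefix.lower():
        return parts[1].lower()
    return None


def instance_states(instance_view):
    """Return (provisioning_state, power_state) from an instance view dict."""
    provisioning = power = None
    for status in instance_view.get("statuses", []):
        code = status.get("code", "")
        provisioning = _state(code, "ProvisioningState") or provisioning
        power = _state(code, "PowerState") or power
    return provisioning, power


def failed_extensions(instance_view):
    """Names of extensions whose statuses report a failed provisioning state."""
    return [
        ext.get("name", "?")
        for ext in instance_view.get("extensions", [])
        if any(_state(s.get("code", ""), "ProvisioningState") == "failed"
               for s in ext.get("statuses", []))
    ]


def looks_like_provisioning_failure(instance_view):
    """True if the instance is deallocated but never reached 'succeeded'."""
    provisioning, power = instance_states(instance_view)
    return power == "deallocated" and provisioning != "succeeded"
```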
    Recommended Action

    Review the Resource Health blade for the affected instance. It usually indicates whether the deallocation was due to a probe failure, a user action, or a Spot eviction.

    I hope this clarifies the current status! If this helps, please mark this as the "Accepted Answer" so other community members can find it easily.

    Best regards,
    Marcus
