East US 2 VMSS: "VMSS name redacted" is deallocated
East US 2 VMSS: "VMSS name redacted" gets deallocated after a brief window. I need help troubleshooting this.
2 answers
-
Ankit Yadav 14,455 Reputation points • Microsoft External Staff • Moderator
Issue Description: All nodes in the Service Fabric cluster became unavailable because the underlying VM Scale Set (VMSS) instances were deallocated.
Findings: Service Fabric (SFRP model) does not control or deallocate VMSS resources. Any changes to the VMSS state must come from customer actions, automation, or platform-level processes.
Root Cause (most likely):
- The subscription might be labeled as non-production, which allows platform-driven capacity reclamation and can result in VM deallocation.
- Alternatively, deallocation may have been triggered by customer-managed actions, such as manual changes, automation, ARM templates, or autoscale settings.
Recommended Actions:
- Check Activity Logs to determine who or what initiated the change (user, automation, or system process).
- Verify the subscription classification (Production or Non-Production).
- Review autoscale and deployment settings.
Conclusion: This issue is not caused by the Service Fabric service, but rather by subscription configuration or external actions affecting the VMSS.
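To illustrate the Activity Log check recommended above, here is a minimal Python sketch that scans an exported Activity Log for deallocate operations and reports who initiated them. It assumes the log was exported as JSON (for example with `az monitor activity-log list`) and that each entry carries `eventTimestamp`, `caller`, `operationName.value`, and `resourceId` fields. The sample entries, callers, and resource names are hypothetical placeholders, not data from this incident.

```python
import json
from collections import Counter

# Deallocate operations at the scale-set level and the instance level.
# Compared case-insensitively.
DEALLOCATE_OPS = {
    "microsoft.compute/virtualmachinescalesets/deallocate/action",
    "microsoft.compute/virtualmachinescalesets/virtualmachines/deallocate/action",
}


def find_deallocations(events):
    """Return (timestamp, caller, resource_id) tuples for deallocate events, oldest first."""
    hits = []
    for event in events:
        op = (event.get("operationName") or {}).get("value", "")
        if op.lower() in DEALLOCATE_OPS:
            hits.append((
                event.get("eventTimestamp", ""),
                event.get("caller") or "<unknown>",
                event.get("resourceId", ""),
            ))
    # ISO 8601 timestamps in the same format sort correctly as strings.
    return sorted(hits)


def count_by_caller(hits):
    """Count deallocate events per initiating identity."""
    return Counter(caller for _, caller, _ in hits)


# Hypothetical sample shaped like an exported Activity Log.
SAMPLE = json.loads("""
[
  {"eventTimestamp": "2024-01-01T10:05:00Z",
   "caller": "automation@example.com",
   "operationName": {"value": "Microsoft.Compute/virtualMachineScaleSets/deallocate/action"},
   "resourceId": "/subscriptions/<sub-id>/resourceGroups/rg-example/providers/Microsoft.Compute/virtualMachineScaleSets/vmss-example"},
  {"eventTimestamp": "2024-01-01T10:00:00Z",
   "caller": "user@example.com",
   "operationName": {"value": "Microsoft.Compute/virtualMachineScaleSets/write"},
   "resourceId": "/subscriptions/<sub-id>/resourceGroups/rg-example/providers/Microsoft.Compute/virtualMachineScaleSets/vmss-example"},
  {"eventTimestamp": "2024-01-01T09:00:00Z",
   "caller": "automation@example.com",
   "operationName": {"value": "Microsoft.Compute/virtualMachineScaleSets/virtualMachines/deallocate/action"},
   "resourceId": "/subscriptions/<sub-id>/resourceGroups/rg-example/providers/Microsoft.Compute/virtualMachineScaleSets/vmss-example/virtualMachines/0"}
]
""")

if __name__ == "__main__":
    hits = find_deallocations(SAMPLE)
    for timestamp, caller, resource_id in hits:
        print(timestamp, caller, resource_id)
    print(dict(count_by_caller(hits)))
```

If one caller accounts for every deallocation, that identity (a user, a service principal used by automation, or a platform process) is the place to start investigating.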
-
Marcus Pantel 95 Reputation points
Hi Anand.
If a VMSS instance (e.g., "VMSS name redacted") deallocates shortly after startup, the cause is typically an automated platform action rather than a random crash. Check the following areas:
- Identify the Initiator (Activity Log)
Navigate to the VMSS in the Azure Portal and check the Activity Log. Look for the "Deallocate Virtual Machine" event:
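"Initiated by: Autoscale": A scale-in rule is removing the instance (note that scale-in normally deletes instances rather than deallocating them). If your scaling rules are too aggressive, relax the thresholds or increase the "Cool down" period.
"Initiated by: Azure Infrastructure": If you are using Spot VMs, this may indicate an eviction, either due to capacity constraints in East US 2 or because the current price exceeded your maximum price. With the Deallocate eviction policy, evicted instances are deallocated rather than deleted.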
- Automatic Repairs & Health Probes
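If Automatic Repairs are enabled, Azure repairs instances that fail health checks. By default it deletes the unhealthy instance and creates a replacement; restart and reimage are the other repair actions.
Look at the Health Probes (Load Balancer) or the Application Health Extension to see which instances are being reported as unhealthy.
If your application takes a long time to initialize, increase the grace period in the automatic repairs policy so that Azure does not act on a VM marked "Unhealthy" before it has finished starting.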
- Provisioning Timeouts
Check if the VM reaches the "Succeeded" provisioning state. If it stays in "Creating" and then deallocates, a Custom Script Extension or specialized configuration might be failing or timing out, causing the platform to roll back or stop the instance.
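The point about giving a slow-starting application enough time before health checks count against it can be sanity-checked with a small sketch. It assumes the delay in question is the automatic repairs grace period, written as an ISO 8601 duration such as `PT30M`, and that you have measured how long your application takes to become healthy. The function names and the 1.5x safety margin are illustrative assumptions, not Azure recommendations.

```python
import re
from datetime import timedelta

# Matches simple time-only ISO 8601 durations such as PT30M or PT1H15M.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_iso_duration(text):
    """Parse a time-only ISO 8601 duration (hours, minutes, seconds) into a timedelta."""
    match = _DURATION.match(text)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (int(group or 0) for group in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def grace_period_is_sufficient(grace_period, startup_seconds, margin=1.5):
    """Return True if the grace period covers the observed startup time plus a margin.

    margin is a hypothetical safety factor; tune it to your workload.
    """
    return parse_iso_duration(grace_period).total_seconds() >= startup_seconds * margin


if __name__ == "__main__":
    # Example: an application that takes about 8 minutes to report healthy.
    for candidate in ("PT10M", "PT30M"):
        print(candidate, grace_period_is_sufficient(candidate, startup_seconds=480))
```

If the check fails for your current setting, lengthening the grace period is usually a safer first step than disabling repairs or health monitoring altogether.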
Recommended Action
Review the Resource Health blade for the specific instance. It usually indicates whether the change in availability was user-initiated or platform-initiated (for example, a Spot eviction), which narrows down where to look next.
I hope this clarifies the current status! If this helps, please mark this as the "Accepted Answer" so other community members can find it easily.
Best regards,
Marcus
