Azure Batch: node in unusable state. What needs to be done to get it running?

In Azure Batch, my nodes are in the Unusable state. What needs to be done to get them running? I am using the Ubuntu 22.04 (Jammy) image.

  1. Himanshu Shekhar 6,710 Reputation points • Microsoft External Staff • Moderator

    Hey @Batcha A, Mohammed Hakkim, it seems like you're facing an issue with Azure Batch nodes stuck in the "Unusable" state while using an Ubuntu 22.04 Jammy image.

    This can be frustrating, but there are several steps you can take to troubleshoot and hopefully resolve the issue.

    Here's what you can try:

    1. Check for Configuration Issues: Nodes can get stuck due to various configuration issues:
      • Virtual Network Configuration: Ensure that your virtual network is configured correctly. Missing outbound NSG rules or user-defined routes (UDRs) could cause this issue. Make sure you have proper routes pointing to the BatchNodeManagement service tag.
      • Disk Space: Check if the disk on the nodes is full. If so, you can locate node-specific task files and free up space. You might consider using the "File - List From Compute Node" API to manage files.
      • Custom Image Issues: If you're using a custom image, verify that it's configured correctly and has all required components for Azure Batch nodes to operate properly.
    2. Inspect Node Status and Logs: Use the Azure portal to navigate to the Nodes tab of your Batch account and check for any errors or warnings. If there are issues, resolve them as needed.
    3. Restart Unusable Nodes: If the nodes are still in an unusable state after checking configurations, try restarting them through the Azure portal. Wait a few minutes to see if their status changes to "Running."
    4. Resize the Pool: If restarting doesn't help, consider resizing your pool. Increasing the target number of nodes can cause new nodes to be allocated, which may help bypass the unusable nodes.
    5. Review Resource Availability: Check for capacity issues in the selected Azure region for the VM size you're using. Sometimes, changing the VM size or selecting a different region can help.
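
    For the disk space check in step 1, a minimal Python sketch that could be run on the node itself (for example over SSH) is shown below. It assumes Python 3 is available on the node. The function name disk_usage_report and the 90% threshold are illustrative choices, not part of any Batch API. AZ_BATCH_NODE_ROOT_DIR is the environment variable Batch sets for tasks; outside a task it is normally unset, so the sketch falls back to the root filesystem.

```python
import os
import shutil


def disk_usage_report(path="/", warn_percent=90.0):
    """Return (percent_used, is_nearly_full) for the filesystem holding `path`.

    `warn_percent` is an illustrative threshold, not a Batch-defined limit.
    """
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some virtual filesystems report zero size; treat them as empty.
        return 0.0, 0.0 >= warn_percent
    percent_used = 100.0 * usage.used / usage.total
    return percent_used, percent_used >= warn_percent


if __name__ == "__main__":
    # Inside a Batch task this points at the Batch directories on the node;
    # in an interactive session it is usually unset, so "/" is used instead.
    root = os.environ.get("AZ_BATCH_NODE_ROOT_DIR", "/")
    percent, nearly_full = disk_usage_report(root)
    status = "nearly full" if nearly_full else "ok"
    print(f"{root}: {percent:.1f}% used ({status})")
```

    If the report shows the disk is nearly full, the task files mentioned above are the first place to look for space to reclaim.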


  2. Himanshu Shekhar 6,710 Reputation points • Microsoft External Staff • Moderator

    @Batcha A, Mohammed Hakkim, did you get a chance to see my response? If you have any further queries, let me know.

  3. Himanshu Shekhar 6,710 Reputation points • Microsoft External Staff • Moderator

    @Batcha A, Mohammed Hakkim

    The BatchAgentInstallationFailure error means the Batch node agent/extension cannot be installed on the VM, so the node is marked unusable.

    In a setup like yours, with private endpoints for Batch and node management, this is almost always a networking problem.

    Since you already ruled out disk, image, and quota, please check the following on the subnets where your Batch nodes and private endpoints are deployed:

    • Verify that the nodeManagement private endpoint is present and approved, and that DNS for the node management endpoint FQDN resolves to the private IP of that endpoint. From a VM in the Batch node subnet, run nslookup <node-management-endpoint> and a TCP 443 connectivity test (for example, nc -v <endpoint> 443 or Test-NetConnection <endpoint> -Port 443).

    • Ensure your NSG/UDR configuration allows outbound TCP 443 from the node subnet to the node management endpoint (via the BatchNodeManagement.<region> service tag or directly to the private endpoint IP), and that no deny rule or forced-tunneling route is blocking this traffic.

    • If you are using a custom Ubuntu 22.04 image from a gallery, confirm that the OS SKU selected when creating the pool matches the SKU of the image. If it does not, recreate the pool with the correct SKU to avoid agent installation failures.
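
    The DNS and TCP 443 checks from the first bullet can also be scripted. The following is a minimal Python sketch using only the standard library, assuming Python 3 is available on a VM in the Batch node subnet. The function names are illustrative, the endpoint hostname must be supplied by you on the command line, and lookups are limited to IPv4 for simplicity.

```python
import ipaddress
import socket
import sys


def resolve_ips(hostname):
    """Return the sorted IPv4 addresses that `hostname` resolves to."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, proto=socket.IPPROTO_TCP
    )
    return sorted({info[4][0] for info in infos})


def is_private(ip):
    """True if `ip` is in a private (non-public) address range."""
    return ipaddress.ip_address(ip).is_private


def can_connect(host, port=443, timeout=5.0):
    """True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_endpoint(hostname, port=443):
    """Summarise DNS resolution and TCP reachability for one endpoint."""
    ips = resolve_ips(hostname)
    return {
        "ips": ips,
        "all_private": all(is_private(ip) for ip in ips),
        "tcp_ok": can_connect(hostname, port),
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_endpoint.py <node-management-endpoint>
    print(check_endpoint(sys.argv[1]))
```

    With a private endpoint in place, all_private would be expected to be True. A public IP in the result suggests the private DNS zone is not being used from that subnet, and tcp_ok being False points toward an NSG or route problem.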

    After fixing DNS/NSG/UDR or the OS SKU mismatch, restart the nodes or recreate the pool and the nodes should move from unusable to running.
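
    To illustrate how a deny rule can block this traffic even when an allow rule exists, here is a simplified Python model of NSG rule ordering. NSG rules are processed in priority order, lowest number first, and the first match decides the outcome. Everything else in the sketch is an assumption for illustration: the rule names and priorities are hypothetical, destinations are matched as plain strings rather than real service tag address ranges, and source, protocol, and the default rules are left out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class OutboundRule:
    name: str
    priority: int                     # lower number is evaluated first
    destination: str                  # service tag label, or "*" for any
    ports: Optional[Tuple[int, ...]]  # destination ports, or None for any
    allow: bool


def evaluate(rules, destination, port):
    """Return the first rule, by ascending priority, that matches the flow.

    Returns None when no rule in `rules` matches.
    """
    for rule in sorted(rules, key=lambda r: r.priority):
        dest_match = rule.destination in ("*", destination)
        port_match = rule.ports is None or port in rule.ports
        if dest_match and port_match:
            return rule
    return None


# Hypothetical rule set: the deny rule has the lower priority number,
# so it matches before the allow rule is ever considered.
rules = [
    OutboundRule("deny-all-outbound", 100, "*", None, False),
    OutboundRule("allow-batch-443", 200, "BatchNodeManagement", (443,), True),
]
blocked = evaluate(rules, "BatchNodeManagement", 443)
print(blocked.name, "allow" if blocked.allow else "deny")
```

    In this model, giving the allow rule a priority number below 100 would let the flow through, which is the kind of ordering to look for when reviewing the outbound rules attached to the node subnet.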

    Microsoft references:

    1. Azure Batch node gets stuck in the Unusable state because of configuration issues - https://learn.microsoft.com/en-us/troubleshoot/azure/hpc/batch/azure-batch-node-unusable-state
    2. Create a simplified node communication pool without public IP addresses - https://learn.microsoft.com/en-us/azure/batch/simplified-node-communication-pool-no-public-ip


2 answers

  1. Batcha A, Mohammed Hakkim 0 Reputation points

    After updating the NSG rule with the right source and destination, the nodes started running.

    1. Himanshu Shekhar 6,710 Reputation points • Microsoft External Staff • Moderator

      Batcha A, Mohammed Hakkim - Just checking whether the response provided was helpful. Please let me know if you have any queries.


  2. Batcha A, Mohammed Hakkim 0 Reputation points

    Hi Himanshu,

    The Azure Batch account has been configured with private endpoints for Batch and node management. Access rules have also been configured, with a set of allowed IP addresses added in the firewall. The Batch nodes are created in one subnet and the private endpoints in another subnet of the same VNet. Networking is managed from the HCP console, through which outbound NSG rules have been created and attached to both subnets: outbound 443 to the node management service tag, and outbound 443 and 445 to the Storage service tag.

    There is no disk space or image issue. The quota has already been increased for the nodes. The VM size used is D2sV5.

    Below is the error message:

    Code: BatchAgentInstallationFailure
    Message: The batch agent extension provisioning has failed on compute node
    Values: Timestamp - 1/9/2026 2:35:47 PM

    1. Himanshu Shekhar 6,710 Reputation points • Microsoft External Staff • Moderator

      Batcha A, Mohammed Hakkim - Just checking whether the response provided was helpful. Please let me know if you have any queries.

