Azure batch node stuck in reboot

Rahul Navin 0 Reputation points

I run my Azure Batch node 24 hours a day and reboot it once a day. The reboot has now been stuck for the last 6 hours. Is there any way to fix it?

  1. Hemalatha 14,525 Reputation points Microsoft External Staff Moderator

    Hello Rahul,

    Just checking whether the information provided was helpful. Please let me know if you have any queries.

    Additionally, could you please check your private messages and provide the necessary details?

  2. Rahul Navin 0 Reputation points

    Hi, thank you for the suggestions. I tried every possible solution and still face the same issue, so I have to create a new pool.

    Node type: Windows

    Disk size: checked, plenty of free space left

    Reimage: failed (error: node is in rebooting state)

    Resize: also failed

    I am now trying to delete that pool; the deletion has been running for the last 24 hours or so.

    Is there any way to reach technical support to sort this out?

    I created a ticket but have had no response from technical support yet.

  3. Hemalatha 14,525 Reputation points Microsoft External Staff Moderator

    Hi Rahul,
    Could you please share the ticket number here.

  4. Rahul Navin 0 Reputation points

    ticket number - 2605220040005369



3 answers

  1. Hemalatha 14,525 Reputation points Microsoft External Staff Moderator

    Hello Rahul,

    Thank you for your patience while we investigated this issue.

    Based on the backend investigation, the node entered a stuck state during the reboot and recovery process. During automated recovery, Azure Batch attempted to perform pool deployment cleanup and recovery operations; however, these operations could not be completed because networking resources associated with the BYOvNet pool were still reported as being in use. As a result, node recovery could not complete successfully, and subsequent pool operations remained in a stuck state.

    Additionally, the resize failure was attributed to the Batch account temporarily reaching a quota/resource allocation limit for compute resources in the region. When this limit is reached, Azure Batch may be unable to allocate or recover additional nodes successfully, which can cause operations such as reboot, resize, or delete to remain in a pending state until backend reconciliation is completed.
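The quota point above can be checked before a resize is attempted. What follows is a minimal sketch in Python, not an Azure SDK call: the `PoolUsage` class and the `quota_headroom` function are hypothetical names, and the quota and per-pool numbers are made-up inputs that would come from the Batch account's quota page in practice.

```python
from dataclasses import dataclass


@dataclass
class PoolUsage:
    """Hypothetical summary of one pool: node count and vCPUs per node."""
    pool_id: str
    dedicated_nodes: int
    vcpus_per_node: int


def dedicated_cores_in_use(pools):
    """Sum the dedicated vCPUs consumed by all pools in the account."""
    return sum(p.dedicated_nodes * p.vcpus_per_node for p in pools)


def quota_headroom(core_quota, pools, extra_nodes=0, vcpus_per_extra_node=0):
    """Return the dedicated cores left after an optional planned resize.

    A negative result means the planned resize would exceed the quota.
    """
    planned = extra_nodes * vcpus_per_extra_node
    return core_quota - dedicated_cores_in_use(pools) - planned


# Illustrative numbers only.
pools = [
    PoolUsage("render-pool", dedicated_nodes=4, vcpus_per_node=4),
    PoolUsage("etl-pool", dedicated_nodes=1, vcpus_per_node=8),
]
print(quota_headroom(core_quota=24, pools=pools))  # 0
print(quota_headroom(24, pools, extra_nodes=1, vcpus_per_extra_node=4))  # -4
```

A headroom of zero or less is a hint that a resize or node recovery may not be able to allocate capacity until usage drops or the quota is raised.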

    To help reduce the likelihood of similar issues in the future, we recommend:

    • Periodically refreshing long-running pools.

    • Monitoring node and pool health on a regular basis.

    • Reviewing and validating networking resource configurations associated with BYOvNet pools.

    • Monitoring Batch account quotas and resource utilization regularly.

    • Configuring alerts for nodes that remain in rebooting, starting, or unusable states for extended periods.

    • Considering a quota increase if additional compute capacity may be required in the future.
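The alerting recommendation in the list above can be prototyped with a small check over node records. This is a sketch under stated assumptions: each node is a plain dict with `id`, `state`, and `stateTransitionTime` keys (an assumed shape, modelled on how the Batch service reports node state), the 30-minute threshold is arbitrary, and fetching the node list is left out.

```python
from datetime import datetime, timedelta, timezone

# States the recommendation above suggests alerting on.
WATCHED_STATES = {"rebooting", "starting", "unusable"}


def find_stuck_nodes(nodes, now, max_age=timedelta(minutes=30)):
    """Return ids of nodes that have sat in a watched state longer than max_age.

    Timestamps are ISO 8601 strings with an explicit UTC offset
    ("+00:00"), which datetime.fromisoformat parses on all
    supported Python 3 versions.
    """
    stuck = []
    for node in nodes:
        if node["state"] not in WATCHED_STATES:
            continue
        entered = datetime.fromisoformat(node["stateTransitionTime"])
        if now - entered > max_age:
            stuck.append(node["id"])
    return stuck


# Illustrative data only.
nodes = [
    {"id": "node-a", "state": "rebooting",
     "stateTransitionTime": "2024-01-01T00:00:00+00:00"},
    {"id": "node-b", "state": "idle",
     "stateTransitionTime": "2024-01-01T00:00:00+00:00"},
    {"id": "node-c", "state": "starting",
     "stateTransitionTime": "2024-01-01T05:50:00+00:00"},
]
now = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)
print(find_stuck_nodes(nodes, now))  # ['node-a']
```

Here `node-c` has only been starting for ten minutes, so it is not reported; `node-a` has been rebooting for six hours, which matches the situation described in the question.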

    For additional reference, please review the below documentation:

    Hope this helps! Please let me know if you have any queries.

    1. Hemalatha 14,525 Reputation points Microsoft External Staff Moderator

      Hello Rahul

      If the above information was helpful, could you please accept the answer and upvote it? Thanks.

  2. kagiyama yutaka 3,685 Reputation points

    I think Batch doesn’t support recovering a node once it’s in Rebooting. Just delete it and let the pool replace it. If the delete hangs, that’s a stuck allocation state on the service side and only MS support can clear it.
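A minimal sketch of the delete-and-replace approach, written as a Python helper that only builds Azure CLI argument lists and executes nothing. The function name `plan_node_removal` and the dict shape for `pool` are hypothetical. The command and flag names (`az batch pool autoscale disable`, `az batch node delete`, `--node-list`, `--node-deallocation-option`) are given as assumptions about the current Azure CLI and should be confirmed with `az batch node delete --help` before use.

```python
def plan_node_removal(pool, node_id, deallocation_option="requeue"):
    """Build the az CLI invocations needed to remove one node from a pool.

    `pool` is a dict with "id", "allocationState" and "enableAutoScale",
    an assumed shape for this sketch. Returns a list of argument lists.
    """
    # Node removal is only accepted while the pool is steady.
    if pool["allocationState"] != "steady":
        raise RuntimeError(
            f"pool {pool['id']} is {pool['allocationState']}; wait for steady"
        )
    commands = []
    if pool.get("enableAutoScale"):
        # Disable autoscale first so it does not fight the manual removal.
        # Re-check the allocation state after this step before continuing.
        commands.append(
            ["az", "batch", "pool", "autoscale", "disable",
             "--pool-id", pool["id"]]
        )
    commands.append([
        "az", "batch", "node", "delete",
        "--pool-id", pool["id"],
        "--node-list", node_id,
        "--node-deallocation-option", deallocation_option,
    ])
    return commands


# Illustrative ids only.
pool = {"id": "pool-1", "allocationState": "steady", "enableAutoScale": True}
for cmd in plan_node_removal(pool, "tvm-123"):
    print(" ".join(cmd))
```

If the pool is not steady, the helper raises instead of emitting commands. A pool that stays in a non-steady state is the case described above where only support can clear it.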

  3. Sina Salam 30,166 Reputation points Volunteer Moderator

    Hello Rahul Navin,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that your Azure Batch node is stuck in the Rebooting state.

    Follow the steps below, in order, to resolve it:

    A. Recover service immediately:

    1. Stop trying to reimage or repeatedly reboot the stuck node. For a node already stuck in Rebooting, the reliable action is to replace/remove it. Reimage is documented for Idle/Running nodes, not for a node already in Rebooting. - https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.batch.pooloperations.reimage?view=azure-dotnet, https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.batch.pooloperations.reimageasync?view=azure-dotnet
    2. Make sure the pool allocation state is steady; if autoscale is enabled, temporarily disable it.
      Batch node delete runs only when the pool allocation state is steady. - https://learn.microsoft.com/en-us/cli/azure/batch/node?view=azure-cli-latest
    3. Delete the stuck node and requeue any work. Use Batch node deletion with a deallocation option that requeues work, so unfinished tasks are rescheduled to healthy nodes. The CLI supports node delete and deallocation options. - https://learn.microsoft.com/en-us/cli/azure/batch/node?view=azure-cli-latest
    4. If this is a single-node pool, add capacity first, then remove the bad node. Batch best practices explicitly recommend multiple nodes when you need deterministic progress, because individual nodes are not guaranteed to remain available. - https://learn.microsoft.com/en-us/azure/batch/best-practices

    B. Obtain root cause:

    5. Upload the Batch service logs from the affected node. This is the official path for debugging broken nodes and for escalation if needed. Use az batch node service-logs upload or the equivalent PowerShell/REST operation. - https://learn.microsoft.com/en-us/cli/azure/batch/node/service-logs?view=azure-cli-latest, https://learn.microsoft.com/en-us/rest/api/batchservice/nodes/upload-node-logs?view=rest-batchservice-2025-06-01, https://learn.microsoft.com/en-us/powershell/module/az.batch/start-azbatchcomputenodeservicelogupload?view=azps-15.5.0
    6. Check the start task output on the node. Batch stores start-task content in the node’s startup directory, and application/task diagnostics go to stdout.txt / stderr.txt. That is the place to confirm whether startup is hanging or failing. - https://learn.microsoft.com/en-us/azure/batch/files-and-directories, https://learn.microsoft.com/en-us/azure/batch/error-handling

    C. Remove the recurrence trigger:

    7. Start tasks rerun on reboot/reimage, and Microsoft requires them to be idempotent; repeated forced reboot is a direct multiplier for this failure. - https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.batch.starttask?view=azure-dotnet, https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.batch.protocol.models.starttask?view=azure-dotnet
    8. Microsoft’s troubleshooting guidance is explicit: heavy package installation in a start task causes restart/reimage delays and failures; preinstall the runtime/packages into the image instead. - https://learn.microsoft.com/en-us/troubleshoot/azure/hpc/batch/batch-node-creation-delay-restart-reimage
    9. Task working directories are retained and can consume disk; the default retention is seven days unless node/job lifecycle removes them sooner. Persist outputs to Azure Storage and reduce retention to what your workload truly needs. - https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.batch.taskconstraints.retentiontime?view=azure-dotnet, https://learn.microsoft.com/en-us/azure/batch/files-and-directories, https://video2.skills-academy.com/en-us/azure/batch/batch-task-output
    10. For newer pools, RDP/SSH is not automatically opened; configure Batch pool endpoints/NSG/NAT explicitly before assuming remote login is possible. - https://learn.microsoft.com/en-us/azure/batch/pool-endpoint-configuration, https://techcommunity.microsoft.com/blog/azurepaasblog/configure-remote-access-to-compute-nodes-in-an-azure-batch-pool-using-azure-port/4368870
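Step 7 asks for idempotent start tasks. One common way to get there is a marker file, sketched below in Python. The names `run_start_task` and `setup.done` are hypothetical, and `state_dir` stands in for a directory that survives reboots on the node; which directory that is in a given pool is an assumption to verify.

```python
import tempfile
from pathlib import Path


def run_start_task(state_dir, install):
    """Run `install` once per node; later reboots skip it.

    A marker file under `state_dir` records that setup finished.
    Returns True if install ran, False if it was skipped.
    """
    marker = Path(state_dir) / "setup.done"
    if marker.exists():
        return False
    install()
    # Write the marker only after install succeeds, so a failed or
    # interrupted run is retried on the next reboot.
    marker.write_text("ok")
    return True


# Simulate two consecutive boots of the same node.
calls = []
with tempfile.TemporaryDirectory() as d:
    first = run_start_task(d, lambda: calls.append("install"))
    second = run_start_task(d, lambda: calls.append("install"))
print(first, second, len(calls))  # True False 1
```

With this pattern a daily reboot does not repeat the heavy setup. A reimage, which wipes the marker along with the rest of the disk, still triggers a fresh install.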

    I hope this is helpful! Do not hesitate to let me know if you have any other questions, steps or clarifications.


    Please don't forget to close out the thread by upvoting and accepting this as an answer if it is helpful.
