Azure Batch Nodes Starting With Programs/Packages Missing

Brian Bertrand 21 Reputation points

Hi there,

I've recently run into a problem where our Batch system (which has been running for 5+ years) is randomly spawning nodes that our tasks cannot run on. Tasks fail with exit codes such as 255, 127, etc. Sometimes Python is missing, sometimes it's a GDAL library, etc.

Prior to this issue, around the end of January, the nodes began spawning in an unusable state because they ran out of disk space. Increasing the OS disk to 128 GB seemed to fix this. Looking at the node image offer, it was indeed updated on Jan 29th, so clearly something changed.

I've tried figuring out how to set my pool to use the old version, but it seems rather convoluted.

I have no way to reach out to Microsoft about this issue, as their new support system seems to be AI-only for Batch issues. Unfortunately, it isn't very helpful.

Has anyone else run into this issue? Any suggestions to stop this from happening?

Some pool info (latest version):
Publisher microsoft-dsvm
Offer Ubuntu-hpc
Sku 2404
Version 22.04.2026021901

  1. Himanshu Shekhar 6,710 Reputation points Microsoft External Staff Moderator

    Just checking whether the response provided was helpful. Please let me know if you have any questions. I have also reached out for additional details via private message.



2 answers

  1. Brendan-8792 0 Reputation points

    I am also experiencing this same issue. We are using the same Ubuntu HPC base image and run containerized workloads using custom Docker images that contain all the dependencies we need to run (interestingly we also use Python and GDAL). We serve the images from Azure Container Registry. They are updated fairly regularly via CI/CD and used in long-lived pools where the image configuration is set in tasks (which is used preferentially to the pool image configuration).

  2. Himanshu Shekhar 6,710 Reputation points Microsoft External Staff Moderator

    Brian Bertrand, thank you for the detailed context. Based on our review, this behavior is expected with Azure Batch pools that reference Marketplace images using the latest version, especially for long‑running production systems.

    Your Batch pools are using microsoft-dsvm:ubuntu-hpc:2404:latest.

    This image was updated around Jan 29, which introduced changes to the base OS footprint and preinstalled packages.

    As a result:

    Some nodes entered Unusable state due to OS disk exhaustion (resolved by increasing OS disk to 128 GB).

    Newer nodes no longer consistently include runtime dependencies (e.g., Python, GDAL), causing task failures with exit codes 127 / 255.

    This is by design: Azure Marketplace images are serviced and updated automatically, and dependency immutability is not guaranteed when using latest.

    Your workload relied on implicit availability of system libraries from the Marketplace image. When the image was updated, those assumptions no longer held, leading to non‑deterministic node behavior during scale‑out.
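
One way to make that implicit dependency explicit is a pre-flight check run as the pool's start task, so that a node missing Python or GDAL fails at startup rather than when a task lands on it. The sketch below is an illustration and not part of the original answer: the command and library names are assumptions and should be replaced with whatever your tasks actually invoke. Exit code 127 is what a shell typically returns for "command not found", which is consistent with the failures described above.

```python
import shutil
import sys
from ctypes.util import find_library

# Hypothetical requirement lists (assumptions); replace with what your tasks call.
REQUIRED_COMMANDS = ["python3", "gdalinfo"]
REQUIRED_LIBRARIES = ["gdal"]


def find_missing(commands, libraries):
    """Return the names of commands and shared libraries that cannot be found."""
    # shutil.which returns None when the executable is not on PATH.
    missing = [cmd for cmd in commands if shutil.which(cmd) is None]
    # find_library returns None when the shared library cannot be located.
    missing += [f"lib{name}" for name in libraries if find_library(name) is None]
    return missing


def main():
    """Report missing dependencies; return 1 if any are missing, else 0."""
    missing = find_missing(REQUIRED_COMMANDS, REQUIRED_LIBRARIES)
    if missing:
        print("Missing dependencies: " + ", ".join(missing), file=sys.stderr)
        return 1
    print("All dependencies present")
    return 0
```

Calling `sys.exit(main())` at the end of the script and setting `waitForSuccess` on the start task should keep Batch from scheduling tasks on a node where the check fails. If Python itself is missing, the script cannot start at all, which also fails the start task.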

    Recommended way to permanently stop this: for production Batch workloads, Microsoft recommends one of the following supported patterns:

    1. Use a custom image via Azure Compute Gallery (recommended).
        1. Create a VM from a known‑good Ubuntu‑HPC image.
        2. Install and validate all required dependencies (Python, GDAL, etc.).
        3. Capture it into Azure Compute Gallery and point the Batch pool to a specific image version.
       This guarantees runtime stability and prevents breaking changes from Marketplace updates.
    2. Containerize the workload. Run Batch tasks inside Docker/Singularity containers. This fully decouples your application runtime from the host OS and avoids image drift issues.
    3. Avoid relying on latest Marketplace images. Pinning a Marketplace image version can be used temporarily, but it is not recommended long‑term, as older versions may be retired without notice.
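
To make the difference between these options concrete, here is a minimal sketch of the `imageReference` block of a pool definition in each case. It assumes the field names used by the Batch service REST API (`publisher`, `offer`, `sku`, `version`, `virtualMachineImageId`). The publisher, offer, SKU, and version values are copied from the question, and the gallery resource ID is a placeholder. The helper functions are hypothetical and only build dictionaries; they do not call Azure.

```python
import json


def marketplace_image_reference(publisher, offer, sku, version="latest"):
    """Build an imageReference for a Marketplace image.

    Leaving version as "latest" means new nodes pick up image updates automatically.
    """
    return {"publisher": publisher, "offer": offer, "sku": sku, "version": version}


def gallery_image_reference(image_version_id):
    """Build an imageReference that points at an Azure Compute Gallery image version."""
    return {"virtualMachineImageId": image_version_id}


# Floating reference: the behavior described in the answer above.
floating = marketplace_image_reference("microsoft-dsvm", "ubuntu-hpc", "2404")

# Temporary pin to the version string reported in the question.
pinned = marketplace_image_reference(
    "microsoft-dsvm", "ubuntu-hpc", "2404", version="22.04.2026021901"
)

# Custom image: placeholder resource ID, fill in your own values.
gallery = gallery_image_reference(
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.Compute/galleries/<gallery-name>"
    "/images/<image-definition>/versions/<image-version>"
)

print(json.dumps({"floating": floating, "pinned": pinned, "gallery": gallery}, indent=2))
```

Only the pinned and gallery forms keep newly allocated nodes on a known image. An existing pool's image generally cannot be changed in place, so switching usually means creating a new pool with the desired reference.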

    Use Azure Batch to run container workloads - https://docs.azure.cn/en-us/batch/batch-docker-container-workloads

    Use the Azure Compute Gallery to create a custom image pool - https://learn.microsoft.com/en-us/azure/batch/batch-sig-images
