Automated ML job - Error loading data schema Timeout exceeded

Onyango, David 31 Reputation points

I am getting the error "Error loading data schema, please go back and choose another data. Timeout of 20000ms exceeded." while submitting an Automated ML job for image classification against a source with approximately 1,700 training images. The soon-to-be-decommissioned Azure Custom Vision was able to handle this, plus an additional 700 images used for testing and validation.

  1. Anonymous

    Hi Onyango, David

    Welcome to Microsoft Q&A, and thank you for reaching out.

    This error occurs before the AutoML job even starts, during the step where Azure ML scans your dataset and infers the schema. When the dataset contains hundreds or thousands of image files (in your case, about 1,700 training images), Azure ML must enumerate every file and request metadata from the storage account. Azure ML imposes backend execution timeouts (commonly 20–60 seconds; your error message reports 20,000 ms), and these limits cannot be changed, so schema loading may fail when the dataset contains many small files.

    Image classification datasets stored as folders of hundreds of small files trigger large numbers of:

    • Directory lookups
    • Blob metadata reads
    • File open/read calls

    Azure ML’s storage-layer guidance confirms that many small files create high request overhead, making it easier to hit storage limits, bandwidth limits, or request throttles — all of which slow down schema loading.

    Network/security issues can make schema loading even slower

    If your workspace uses VNet + private endpoints or storage firewall restrictions, Azure ML may struggle to read dataset files fast enough. Azure ML documentation notes that dataset preview and schema loading can fail if the storage account does not allow required access paths. The fix is to temporarily enable:

    • Public Network Access: Enabled, or
    • Allow trusted Microsoft services

    The most reliable fix: materialize the dataset into fewer, larger files

    Microsoft's recommendation for handling timeout‑prone datasets is to pre-materialize the dataset into a small number of larger files so Azure ML avoids scanning thousands of separate images at schema time. Examples that work well:

    • ZIP file containing images
    • Parquet file containing encoded image bytes
    • TFRecord file for vision data

    Azure ML engineering explicitly states that materializing and registering the dataset before submitting the AutoML job is the most stable solution.

    Additional optimizations to avoid future schema failures

    You can further improve schema loading reliability by:

    • Flattening the folder structure (avoid deeply nested directories)
    • Avoiding mounts for huge numbers of small files (downloads perform better than mount-on-open behaviors)
    • Using premium storage if you have very high file count workloads
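
    To judge whether flattening is worth the effort, it can help to measure a local copy of the dataset first. The sketch below is only an illustration using the Python standard library: `summarize_tree` is a hypothetical helper name, and it inspects a local folder, not the datastore itself.

```python
import os


def summarize_tree(root):
    """Return (file_count, max_depth) for the directory tree under root.

    Depth 0 means files sit directly in root; depth 2 means root/a/b/file.
    """
    root = os.path.abspath(root)
    file_count = 0
    max_depth = 0
    for current, _dirs, files in os.walk(root):
        rel = os.path.relpath(current, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        if files:
            # Only directories that actually hold files count toward depth.
            max_depth = max(max_depth, depth)
        file_count += len(files)
    return file_count, max_depth
```

    A high file count combined with deep nesting suggests the layout may be contributing to slow enumeration.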

    Recommended practical workflow for your 1,700-image dataset:

    To avoid the AutoML “Error loading data schema” completely:

    1. Zip your 1,700 images, upload the ZIP to your datastore.
    2. Create a data asset pointing to the ZIP.
    3. Use AutoML Vision with the ZIP as input — Azure ML will unpack internally.
    4. (Optional) Convert images → TFRecords or parquet if you want maximum scalability.

    I hope this helps. Do let me know if you have any further queries.

    Thank you!

  2. Onyango, David 31 Reputation points

    In the case of zipping the files, how will it interpret the .jsonl placed in the storage container along with the images it references?

  3. Anonymous

    Hi Onyango, David

    Azure AutoML does NOT read images from inside a ZIP archive. AutoML for Images requires that each JSONL entry’s image_url points to an AzureML datastore path (an azureml:// URI), not a file path inside a ZIP. Zipping is only a workaround to reduce schema-load timeouts, but AutoML does not open and enumerate the ZIP’s internal files. [learn.microsoft.com]

    Each JSON line must reference the full AzureML datastore path where the actual image file exists:

    { "image_url": "azureml://subscriptions/<sub>/resourcegroups/<rg>/workspaces/<ws>/datastores/<datastore>/paths/images/image001.jpg", "label": "cat" }
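
    Writing roughly 1,700 such lines by hand is impractical, so here is a minimal sketch of generating them. It assumes you already have a list of (relative path, label) pairs; `DATASTORE_ROOT`, `make_jsonl_line`, and `write_annotations` are hypothetical names, and the `<sub>`, `<rg>`, `<ws>`, and `<datastore>` placeholders must be replaced with your own values.

```python
import json
from pathlib import PurePosixPath

# Placeholder prefix copied from the example above; fill in your own values.
DATASTORE_ROOT = (
    "azureml://subscriptions/<sub>/resourcegroups/<rg>"
    "/workspaces/<ws>/datastores/<datastore>/paths"
)


def make_jsonl_line(relative_path, label):
    """Return one JSONL record for an image stored at relative_path."""
    # Normalize Windows separators and drop any leading slash.
    clean = PurePosixPath(relative_path.replace("\\", "/").lstrip("/"))
    record = {"image_url": f"{DATASTORE_ROOT}/{clean}", "label": label}
    return json.dumps(record)


def write_annotations(items, out_path):
    """Write (relative_path, label) pairs to out_path, one JSON object per line."""
    with open(out_path, "w", encoding="utf-8") as handle:
        for relative_path, label in items:
            handle.write(make_jsonl_line(relative_path, label) + "\n")
```

    For example, `write_annotations([("images/image001.jpg", "cat")], "train_annotations.jsonl")` would produce a one-line file in the same shape as the record shown above.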
    

    If you upload a ZIP archive to the datastore to reduce file-count–related timeouts, you must explicitly unzip the archive (e.g., in a preprocessing step or manually) so that the images exist as individual files in the datastore before AutoML starts. AutoML will not internally unzip or look inside the ZIP. The JSONL requires that each image exists as an accessible datastore path. [learn.microsoft.com]
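
    As one possible shape for the unzip step, the sketch below extracts only image files from an archive using Python's standard `zipfile` module. It runs against a local or mounted path; how the extracted files then reach the datastore is not covered here. `extract_images` and the `IMAGE_SUFFIXES` set are assumptions made for illustration.

```python
import zipfile
from pathlib import Path

# Assumed set of image extensions; adjust to match your data.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".gif"}


def extract_images(zip_path, dest_dir):
    """Extract image members of zip_path into dest_dir.

    Returns the sorted archive-relative paths of the extracted files.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            if member.is_dir():
                continue
            if Path(member.filename).suffix.lower() not in IMAGE_SUFFIXES:
                continue  # skip non-image members such as notes or manifests
            archive.extract(member, dest)
            extracted.append(member.filename)
    return sorted(extracted)
```

    The returned relative paths can be reused when building the JSONL entries, so the annotations and the extracted files stay in step.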

    The MLTable you use for AutoML Vision points to the JSONL file. That JSONL file must reference real, accessible cloud image paths. The AutoML Vision pipeline (experimental) loads images from those URIs only. If the images remain inside a ZIP, they simply cannot be resolved. Therefore, uploading ZIP → unzipping → generating JSONL with datastore URIs is the correct workflow. [learn.microsoft.com]
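
    To make the MLTable-to-JSONL relationship concrete, here is a small sketch that writes an MLTable file next to the annotations. The YAML layout follows the pattern I recall from the Azure ML documentation for AutoML image tasks (`read_json_lines` plus converting `image_url` to `stream_info`); treat it as an assumption and verify it against the current docs. `write_mltable` and the file name `train_annotations.jsonl` are hypothetical.

```python
from pathlib import Path

# Assumed MLTable layout for an AutoML image task; check the current docs.
MLTABLE_TEMPLATE = """\
paths:
  - file: ./{jsonl_name}
transformations:
  - read_json_lines:
      encoding: utf8
      invalid_lines: error
      include_path_column: false
  - convert_column_types:
      - columns: image_url
        column_type: stream_info
"""


def write_mltable(folder, jsonl_name):
    """Write a file named MLTable into folder, pointing at jsonl_name."""
    target = Path(folder) / "MLTable"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(
        MLTABLE_TEMPLATE.format(jsonl_name=jsonl_name), encoding="utf-8"
    )
    return target
```

    The MLTable only references the JSONL file; the image locations come entirely from the image_url values inside that JSONL.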

    So in the ZIP scenario:

    • AutoML does NOT interpret JSONL paths as ZIP-internal references.
    • The JSONL is interpreted normally (each image_url must point to a datastore path).
    • The ZIP is only a temporary storage optimization.
    • Before training, images must be extracted so the JSONL can correctly point to them.

    If JSONL points to images still inside a ZIP, AutoML will fail because the files cannot be located.
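
    A quick pre-flight check can catch such entries before submitting the job. The sketch below applies only the rule stated in this reply (each image_url must be an azureml:// URI and must not point into a ZIP); it does not contact Azure or confirm that the files exist. `find_unresolvable_entries` is a hypothetical helper.

```python
import json


def find_unresolvable_entries(jsonl_lines):
    """Return (line_number, reason) pairs for entries that fail the checks."""
    problems = []
    for number, line in enumerate(jsonl_lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        url = record.get("image_url") if isinstance(record, dict) else None
        if not isinstance(url, str) or not url.startswith("azureml://"):
            problems.append((number, "image_url is not an azureml:// URI"))
        elif ".zip/" in url.lower() or url.lower().endswith(".zip"):
            problems.append((number, "image_url points into a ZIP archive"))
    return problems
```

    Passing an open file handle works as well as a list of strings, for example `find_unresolvable_entries(open("train_annotations.jsonl", encoding="utf-8"))`.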

    If you’d like, I can provide a correct end-to-end workflow showing how to:

    1. Upload ZIP →
    2. Unzip inside the datastore →
    3. Generate JSONL automatically →
    4. Build MLTable for AutoML Vision.

  4. Onyango, David 31 Reputation points

    I think you have misunderstood. The images are already in the datastore, along with the JSONL that points to the relative path of each image in the datastore.

  5. Anonymous

    Hi Onyango, David

    When you zip a JSONL file together with the images it references, the system does not “scan” the storage container or resolve URLs dynamically. Instead, interpretation depends on how the JSONL references the images and how the zip is unpacked by the service.

    The JSONL is interpreted exactly as written. Zipping files together does not automatically rewrite paths or “link” images unless the JSONL references them correctly.

    If your JSONL references images using relative paths, and those files exist in the same ZIP, everything works.

