Vision fine-tuning training-file preprocessing fails with HTTP 500 "Infrastructure Issues"


Giannis 0 Reputation points

Hello,

I am consistently running into a backend infrastructure error when running supervised vision fine-tuning jobs on Azure OpenAI. I need help identifying the root cause, as the user-facing logs do not provide any actionable details.

The Problem

When the job attempts to preprocess my training file, it runs for about 60–80 minutes and then completely fails with an HTTP 500 error.

Small training files: Preprocess successfully.
Validation files: Preprocess successfully.

Large training files: Consistently fail during preprocessing.

The API error tells me to check the Logs tab, but the UI Logs/Events tab only says: "File Preprocessing failed for file training file," with no actual error trace.

My Ask

Could someone from the backend team look up the Job ID below and check the internal logs to tell me why this large file is failing to preprocess?

Failing Job Information

Job ID: ftjob-e0c116095e0a4da5a9a8ec5ccd80c4fd

Training File ID: file-323036aa3f0f4c019c00d5498416abbc

Validation File ID: file-f8a5247df8604db28ede600d41871fd9

Hyperparameters: n_epochs=3, batch_size=8, learning_rate_multiplier=2.0, seed=42, suffix: sampled-62

Timeline: Preprocessing started at epoch 1780648527 and failed at 1780651980 (~57 minutes).

Exact API Error Response:
status: failed
error: code='500', message='File preprocessing failed. For more details, please check the Logs tab.', param='Infrastructure Issues'


1 answer


Rayyan Fawad 1,075 Reputation points

An HTTP 500 with an "Infrastructure Issues" tag around the 60-minute mark often points to a gateway timeout or a background out-of-memory (OOM) crash during the image preprocessing stage. Since your small files and validation files pass without any issues, your file format is most likely correct; the backend pipeline is probably hitting a resource or execution limit when it tries to unpack, decode, and resize a large batch of high-resolution images all at once.

To bypass this infrastructure bottleneck, you can try a couple of quick adjustments:

Downsize/Compress Your Source Images: If your training images are shot in extremely high resolution or raw formats, try bulk-resizing them down locally (e.g., to a maximum of 512x512 or 1024x1024 pixels) and compressing the file before uploading. The model will downsample them anyway, and this drastically cuts down the backend's unpacking and CPU processing time.

Streamline the Training Dataset: Check whether you can trim the total number of images in the large training file to keep the preprocessing window comfortably under what looks like a 60-minute container timeout.
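
One way to try the trimming suggestion is to write a smaller random sample of the training file and submit that instead. The following is a minimal sketch, assuming the training file is a JSONL file with one example per line; the function name `write_subset` and the file paths are hypothetical, not part of any Azure OpenAI tooling.

```python
import random


def write_subset(src_path, dst_path, n_examples, seed=42):
    """Copy a random sample of n_examples JSONL lines from src_path to dst_path.

    Returns the number of examples written. Blank lines are skipped.
    """
    with open(src_path, "r", encoding="utf-8") as src:
        lines = [line for line in src if line.strip()]

    rng = random.Random(seed)  # fixed seed so the same subset can be regenerated
    # Sample without replacement; keep everything if the file is already small enough.
    picked = lines if n_examples >= len(lines) else rng.sample(lines, n_examples)

    with open(dst_path, "w", encoding="utf-8") as dst:
        for line in picked:
            dst.write(line if line.endswith("\n") else line + "\n")
    return len(picked)
```

Calling `write_subset("train.jsonl", "train_half.jsonl", n_examples=...)` with progressively larger sizes would give a rough idea of where preprocessing starts to fail.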

If you've already optimized the file sizes and it still fails, a backend engineer will need to pull your Job ID (ftjob-e0c116095e0a4da5a9a8ec5ccd80c4fd) and scale up the processing container's memory allocation for your tenant.

Giannis 0 Reputation points

Thanks Rayyan — I implemented suggestion #1 (downsize) thoroughly, and the result actually rules out image size as the cause, so I think this needs #3 (backend escalation).

What I did (suggestion #1): I bulk-resized every source image to GPT-4.1's own high-detail ceiling — short side 768 px, long side ≤ 2048 px — locally before uploading. This is lossless with respect to what the model ingests (the service downsamples to the same dimensions server-side), and it cut raw image data ~10× (full-res → 1,663 MB for the training split). Images are referenced by SAS URL, not inlined.
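
For readers who want to reproduce the resize rule described above, here is a minimal sketch of the dimension calculation only, assuming the constraints are exactly as stated in this post (short side at most 768 px, long side at most 2048 px, never upscaling). The function name `target_size` is hypothetical, and the service's actual server-side resizing may differ.

```python
def target_size(width, height, short_side=768, long_side_max=2048):
    """Return (new_width, new_height) that satisfy both limits.

    The image is scaled down by a single factor so that the short side is at
    most short_side and the long side is at most long_side_max. Images that
    already fit are returned unchanged (no upscaling).
    """
    scale = min(
        1.0,
        short_side / min(width, height),
        long_side_max / max(width, height),
    )
    return max(1, round(width * scale)), max(1, round(height * scale))


print(target_size(4000, 3000))  # (1024, 768)
print(target_size(6000, 1000))  # (2048, 341)
print(target_size(500, 400))    # (500, 400)
```

The resulting tuple can be passed to an image library's resize call, for example Pillow's `Image.resize`, before uploading.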

Result — it still fails, and the failure signature shifted in a telling way:

New job (resized 768 px images): ftjob-a862d4f725034e16830cc010779185a3

Training file: file-3b8adb1cd53048be95940d24a6899342

Validation file: file-2587cbbe8200430e8b5ce4d9d62e2459

Model: gpt-4.1-2025-04-14

Preprocessing started 1780912391, failed 1780916516 → ~68.75 min

Same error: code='500', param='Infrastructure Issues', message='File preprocessing failed...'

Original job (full-res images): ftjob-e0c116095e0a4da5a9a8ec5ccd80c4fd — failed at ~57 min.
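
The two durations can be checked directly from the epoch timestamps reported for each job. This is just the arithmetic, written out as a small sketch; the helper name `elapsed_minutes` is made up for illustration.

```python
def elapsed_minutes(start_epoch, end_epoch):
    """Elapsed wall-clock time between two Unix timestamps, in minutes."""
    return (end_epoch - start_epoch) / 60


original = elapsed_minutes(1780648527, 1780651980)  # full-res job
resized = elapsed_minutes(1780912391, 1780916516)   # resized 768 px job

print(f"original: {original:.2f} min, resized: {resized:.2f} min")
# original: 57.55 min, resized: 68.75 min
```

The resized job ran about 11 minutes longer before failing than the full-res job did.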

Could a backend engineer please:

Pull ftjob-a862d4f725034e16830cc010779185a3 (and ftjob-e0c116095e0a4da5a9a8ec5ccd80c4fd) and check the preprocessing container logs for the OOM/timeout, and

Scale up the preprocessing container's memory/timeout for this tenant?

And two questions so I can engineer around it if a raise isn't possible:

What is the documented hard limit for vision FT preprocessing — is it bounded by number of examples, total decoded pixels, or a fixed preprocessing wall-clock (the ~60-min container timeout)?

Is there a supported path for large vision FT datasets (e.g., an extended preprocessing budget, or a region with higher limits)?
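
To make the first question concrete, it may help to report the example count and image count of the failing file alongside the job IDs. Below is a minimal sketch, assuming the training file uses the chat-style JSONL layout in which each line has a `messages` list and images appear as content parts with `"type": "image_url"`; the function name `count_examples_and_images` is hypothetical.

```python
import json


def count_examples_and_images(jsonl_path):
    """Count training examples and image_url content parts in a JSONL file."""
    examples = 0
    images = 0
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            examples += 1
            for message in json.loads(line).get("messages", []):
                content = message.get("content")
                # Plain-string content has no images; only list content has parts.
                if isinstance(content, list):
                    images += sum(
                        1
                        for part in content
                        if isinstance(part, dict) and part.get("type") == "image_url"
                    )
    return examples, images
```

Running this on both a passing small file and the failing large file would show how far apart they are on each axis.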

URL: https://learn.microsoft.com/en-us/answers/questions/5912295/vision-fine-tuning-training-file-preprocessing-fai