Request for GPU VM Access for LLM Deployment on Sponsorship Subscription
I am using an Azure Sponsorship subscription and need to deploy a fine-tuned Llama 3.1 8B model for an academic/project use case.
I previously requested access to the NCasT4_v3 GPU VM family, but the request was denied due to limited availability for Sponsorship subscriptions.
Could you please advise which GPU VM families are available for my subscription, or recommend an alternative GPU-enabled VM series that can be approved for running a quantized Llama 3.1 8B model with LoRA adapters?
My goal is model inference and API deployment, not large-scale training.
Thank you for your guidance.
2 answers
Himanshu Shekhar 6,710 Reputation points • Microsoft External Staff • Moderator
For Azure Sponsorship subscriptions, GPU quota approvals are subject to both subscription eligibility and regional capacity availability, and Microsoft does not publish a fixed list of GPU families that are guaranteed to be approved.
Recent guidance indicates that high-demand GPU families such as NCasT4_v3 may be restricted or denied for some Sponsorship subscriptions.
For your use case (quantized Llama 3.1 8B with LoRA adapters, inference/API deployment only), a single-GPU size from the NCasT4_v3 (T4), NCv3 (V100), or NV-series families would generally be appropriate if quota is available in your region.
The exact GPU family that can be approved for your Sponsorship subscription can only be confirmed by the Azure quota/capacity review process. If VM quota requests continue to be denied, consider:
- Deploying through Azure AI Foundry managed compute (if available to your subscription).
- Using serverless/model-as-a-service offerings where GPU capacity is managed by Azure.
- Submitting quota requests for multiple GPU families and regions to identify available capacity.
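One way to see which GPU families already have vCPU quota in a region, before filing requests across several regions, is to export usage with `az vm list-usage --location <region> --output json` and filter it. This is a minimal sketch, assuming each entry has `name.value`, `currentValue` and `limit` fields and that GPU family names start with `standardNC`, `standardND`, `standardNG` or `standardNV`; the name pattern and the `gpu_quota.py` file name are assumptions to check against your own output.

```python
import json
import re
import sys
from typing import Any

# Assumption: GPU quota families have internal names such as
# "standardNCASv3_T4Family" or "standardNCSv3Family", i.e. "standard"
# followed by an N-series GPU prefix (NC, ND, NG or NV).
GPU_FAMILY = re.compile(r"^standardN[CDGV]", re.IGNORECASE)


def gpu_quota(usages: list[dict[str, Any]]) -> list[tuple[str, int, int]]:
    """Return (family, used vCPUs, vCPU limit) for GPU families.

    Rows are sorted by remaining vCPUs (limit - used), largest first.
    """
    rows = []
    for usage in usages:
        family = usage["name"]["value"]
        if GPU_FAMILY.match(family):
            rows.append((family, int(usage["currentValue"]), int(usage["limit"])))
    return sorted(rows, key=lambda row: row[2] - row[1], reverse=True)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: az vm list-usage --location eastus --output json > usage.json
    #          python gpu_quota.py usage.json
    with open(sys.argv[1], encoding="utf-8") as f:
        for family, used, limit in gpu_quota(json.load(f)):
            print(f"{family}: {limit - used} of {limit} vCPUs free")
```

A family with a limit of 0 has no quota yet and still needs a request; a nonzero limit allows VMs of that family up to the remaining vCPUs, although regional capacity is still checked when the VM is actually deployed.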
Since your requirement is inference rather than model training, a single T4-class or V100-class GPU with 16 GB of VRAM is typically sufficient for a quantized Llama 3.1 8B deployment. NCasT4_v3 size details: https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/ncast4v3-series?tabs=sizebasic
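The 16 GB figure can be sanity-checked with rough arithmetic: weight memory is parameters × bits ÷ 8, and the KV cache grows with the number of tokens held in context. The sketch below uses Llama 3.1 8B's published architecture (about 8.03B parameters, 32 layers, 8 key/value heads, head dimension 128); the 2 GB `overhead_gb` allowance for the CUDA context, activations and LoRA adapter weights is an assumed placeholder, not a measured value.

```python
# Llama 3.1 8B architecture constants (from the published model configuration).
PARAMS = 8.03e9   # total parameters (~8.03 billion)
LAYERS = 32       # transformer layers
KV_HEADS = 8      # key/value heads (grouped-query attention)
HEAD_DIM = 128    # 4096 hidden size / 32 attention heads


def weights_gb(bits_per_param: float) -> float:
    """Approximate weight memory in GB (1e9 bytes) at a given precision."""
    return PARAMS * bits_per_param / 8 / 1e9


def kv_cache_gb(tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache for `tokens` tokens summed over all concurrent requests (fp16 by default)."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value  # one K and one V per layer
    return tokens * per_token / 1e9


def estimate(bits_per_param: float, tokens: int, gpu_gb: float = 16.0,
             overhead_gb: float = 2.0) -> tuple[float, bool]:
    """Return (estimated GB, fits on GPU).

    overhead_gb is an assumed allowance for the CUDA context, activations
    and LoRA adapter weights.
    """
    total = weights_gb(bits_per_param) + kv_cache_gb(tokens) + overhead_gb
    return total, total <= gpu_gb


for bits in (4, 8, 16):
    total, ok = estimate(bits, tokens=8192)
    print(f"{bits:>2}-bit weights, 8192-token cache: ~{total:.1f} GB, fits in 16 GB: {ok}")
```

Under these assumptions, 4-bit weights plus an 8,192-token fp16 KV cache come to about 7 GB, while unquantized fp16 weights alone (about 16 GB) already exhaust a 16 GB T4 or V100, which is why quantization matters on these sizes.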
If NCasT4_v3 is unavailable, please request NCv3 (V100) or NV-series quota as alternative GPU families, and consider Azure AI Foundry managed deployment options for the fastest path to hosting the model.
AI answer
The available GPU VM families and specific SKUs for a sponsorship subscription are controlled by Azure capacity and policy, and this information is not listed in the provided context. The context only describes general GPU options and capabilities, not which series are allowed or disallowed per subscription type.
From the context, the following GPU-capable options are documented as suitable for AI/LLM-style workloads in general (not sponsorship-specific):
- NCasT4_v3-series (NVIDIA T4)
- Powered by NVIDIA Tesla T4 GPUs and AMD EPYC 7V12 CPUs.
- Up to 4 T4 GPUs (16 GB each), up to 64 vCPUs, and 440 GiB RAM.
- Designed for AI services such as real-time inferencing and interactive graphics.
- CUDA drivers can be installed via the Azure NVIDIA GPU driver extension; GRID drivers can be installed manually for graphics workloads.
- This series is explicitly called out as a recommended upgrade target for older ND-series workloads and is suitable for standard GPU compute workloads based on CUDA, TensorRT, ONNX, etc.
- Other N-series GPU options (general guidance)
- NCv3 (NVIDIA V100) and NCasT4_v3 (NVIDIA T4) are listed as general-purpose GPU compute options, and Azure Stack Hub's GPU VM sizes are aligned with them.
- NP-series VMs are FPGA-based rather than GPU-based and are being retired; workloads migrating from them must be ported to GPU frameworks.
- GPU containers and AKS for inference
- Azure Container Instances with V100 GPUs (NCv3-backed) can be used for containerized GPU inference workloads, with 1–4 GPUs and up to 24 vCPUs and 448 GB RAM.
- AKS GPU node pools (for example, NV or ND series) are recommended for scalable training and inference, with guidance to use lower GPU-count SKUs and distributed inference for cost-effective LLM deployment.
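Because Azure VM quota is granted as vCPUs per family, choosing the smallest size that still meets the memory requirement keeps a quota request as small as possible. The sketch below hard-codes the four NCasT4_v3 sizes (4, 8, 16 and 64 vCPUs; 28, 56, 110 and 440 GiB RAM; one T4 on the three smaller sizes and four on the largest) as documented at the time of writing; the `smallest_fit` helper is a hypothetical illustration, not an Azure API.

```python
from typing import NamedTuple, Optional


class Size(NamedTuple):
    name: str
    vcpus: int        # counts against the family's vCPU quota
    ram_gib: int      # host memory
    gpus: int         # number of T4 GPUs
    gpu_mem_gb: int   # memory per GPU


# NCasT4_v3 sizes as listed on the Azure size page; re-check before relying on them.
NCAST4_V3 = [
    Size("Standard_NC4as_T4_v3", 4, 28, 1, 16),
    Size("Standard_NC8as_T4_v3", 8, 56, 1, 16),
    Size("Standard_NC16as_T4_v3", 16, 110, 1, 16),
    Size("Standard_NC64as_T4_v3", 64, 440, 4, 16),
]


def smallest_fit(sizes: list[Size], gpu_gb_per_gpu: float, ram_gib: float = 0,
                 gpus: int = 1) -> Optional[Size]:
    """Return the size with the fewest vCPUs that meets every requirement, or None."""
    candidates = [
        s for s in sizes
        if s.gpu_mem_gb >= gpu_gb_per_gpu and s.ram_gib >= ram_gib and s.gpus >= gpus
    ]
    return min(candidates, key=lambda s: s.vcpus, default=None)


choice = smallest_fit(NCAST4_V3, gpu_gb_per_gpu=7.1, ram_gib=32)
print(choice.name if choice else "no NCasT4_v3 size fits")
```

For example, a quantized model that needs about 7 GB of GPU memory and modest host RAM fits the 4-vCPU Standard_NC4as_T4_v3, so a 4-vCPU request for the NCasT4_v3 family would cover a single VM; asking for 32 GiB of host RAM moves the choice up to Standard_NC8as_T4_v3.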
However, the context does not specify:
- Which GPU VM series are allowed or blocked specifically for sponsorship subscriptions.
- Any alternative GPU families that are guaranteed to be approvable for sponsorship subscriptions.
Because of that, the only supported guidance from the context is:
- NCasT4_v3, NCv3, and other N-series GPUs are technically suitable for LLM inference (including quantized Llama 3.x 8B) from a capability perspective.
- For scalable inference/API deployment, AKS with GPU node pools or container-based GPU options (ACI) are recommended patterns.
To determine which GPU VM families are actually available and approvable for a specific sponsorship subscription, it is necessary to work with Azure Support or the subscription’s support channel; this selection and approval process is not documented in the provided context.
References:
- Azure Confidential GPU options
- 'NC' sub-family GPU accelerated VM size series
- Migration Guide for GPU Compute Workloads in Azure
- Use Azure Kubernetes Service to host GPU-based workloads
- Deploy container instances that use GPU resources
- Graphics processing unit (GPU) virtual machine (VM) on Azure Stack Hub
AI-generated content may be incorrect.
