Voozh

If you've spent any time in the local LLM space, you're almost certainly familiar with the hardware ceiling. The most interesting open-source models keep getting bigger, and the gap between what's published on Hugging Face and what you can actually load into VRAM at home has generally been growing, sans the handful of releases a year that run on anything and are genuinely impressive. Sure, you can download a 230B mixture-of-experts model for free, but it's not free to run. You need a workstation that costs as much as a car, and even then, you're often quantizing the thing into oblivion just to fit it.

That's why I've been trying out Nvidia's hosted endpoints on the Build Nvidia site. It's not a new platform, but the lineup of models has quietly grown into something I think a lot of people have missed. You sign up for the developer program, generate an API key, and you get free access to dozens of the largest open-source models out there, served from Nvidia's own DGX Cloud hardware. Not every model in the catalog is on the free tier, and a few are flagged for upcoming deprecation, but the variety of free models is wide enough to cover most of what you'd actually want to try. You don't need to submit a credit card, there's no per-token billing, and more importantly, there's no GPU cost that you have to pay yourself.

There are presumably rate limits somewhere, and you have to trust Nvidia with the request, but it works. I've been using it through my own coding harness for weeks now to test models most people can't ever run at home, and it's surprisingly good.

The lineup is far bigger than people give it credit for

Most of the models I actually want to try are already there

The catalog runs to over a hundred models at this point, with 50 of them (at the time of writing) carrying the "Free Endpoint" tag, but the count doesn't matter as much as the curation does. For example, MiniMax M2.7 is there, which I've actually tested in the past with a MiniMax subscription given that it's one of Claude's most credible open-weight competitors. So is Step-3.5 Flash, the Stepfun model that I loved on the Lenovo ThinkStation PGX. Nvidia's own Nemotron family is there too, a set of reasoning and agent-focused open models that Nvidia has been refining as a showcase for what its own training stack can do.

As for some of the other models, GLM-4.7 arrived recently, though it's nowhere near as powerful for coding as GLM-5.1. Kimi K2 Thinking, Qwen3-Coder-480B, DeepSeek V3.2, Llama 4 Maverick, Mistral Large 3, Devstral 2, ByteDance's Seed-OSS, and Google's Gemma 3 family are all in there too, so there's deep variety in terms of what you can try out. A handful of the older Mistral and DeepSeek entries carry deprecation notices, so the lineup isn't static, but new free endpoints have been landing faster than old ones leave.

The best part about this is that all of these models are massive. MiniMax M2.7 is a 230 billion parameter sparse MoE with 10B active per token, and Step-3.5 Flash is a 196B model with 11B active and a 256K context window. These are actually usable models that can power real work, but they're also the kind of models that need a serious server to host. There are some limitations compared to running it locally, with the most notable being that these are only inference endpoints, so you can't fine-tune or modify a model. What Nvidia hosts is what you get, but for the kind of testing and evaluation work people might want to do, it's more than good enough.

These models are still open-weight models, and you can still pull them from Hugging Face, run them on your own hardware if you have it, and fine-tune them under whatever license they ship with. Nvidia's hosting is just a convenience layer on top that allows you to "try before you buy" in a sense. Unlike when a new GPT version releases, it's OpenAI that hosts a model, and that's the only place the model exists. When Nvidia hosts MiniMax M2.7, the weights are published, the architecture is documented, and the only thing you're really paying for, if you eventually decide to self-host it, is the GPU power to run it. There's no version of this particular deal where the free tier locks you in.

Setup is basically an OpenAI-compatible endpoint

If your tool speaks OpenAI, it speaks Nvidia

The whole thing is built around an OpenAI-compatible API, which makes it trivially easy to drop into existing tooling. You sign up at build.nvidia.com with the free Nvidia Developer account, you generate an API key prefixed with "nvapi-", and that's about it. The base URL is "https://integrate.api.nvidia.com/v1" and the rest of the calls look exactly like what you'd send to any OpenAI-compatible service.

In practice, that means configuring something like this in your coding harness:

"nvidia-build": {
 "baseUrl": "https://integrate.api.nvidia.com/v1",
 "api": "openai-completions",
 "apiKey": "nvapi-XXXXXXXXXXX",
 "models": [
 {
 "id": "stepfun-ai/step-3.5-flash",
 "contextWindow": 200000
 },
 {
 "id": "minimaxai/minimax-m2.7",
 "contextWindow": 200000
 }
 ]
}

After that, if you want to change the model, you just change the id or model name to be whatever's on the catalog page. Tool calling has worked cleanly in everything I've thrown at it so far, including the agentic stuff like MiniMax M2.7 and Step-3.5 Flash.

I run these models through Pi, but any other coding harness capable of using OpenAI-style completions, such as OpenCode, will work as well. This is the same harness I've pointed at MiniMax's own API, GLM-5.1, and some of my self-hosted models. Swapping in Nvidia's endpoint is a base URL change and an API key, nothing more. If you're already using something like Aider, Continue, or any other tool that lets you specify a custom OpenAI-compatible backend, the configuration takes about a minute.

There is, in theory, a rate limit you have to live with on the free tier, but I haven't run into one yet across weeks of regular use, and Nvidia doesn't publish one. For interactive coding work or one-off evaluation runs, whatever ceiling exists hasn't been something I've come across. You're not going to use this to serve a production app because it can be quite slow (especially at peak times), and that's fine. That's not what it's for.

Nvidia's free tier used to work on a credit system, where you'd get an allotment when you signed up and could request more if you needed them, and that system has quietly evolved over the past year. The current state, as of this April, is more or less a forever-free plan with rate limits as the main constraint, rather than per-token credits. There's still a credit pool tied to the developer program for some of the larger or more specialized models, but for the headline open-weight stuff, throughput is the only thing you really run up against. I get why Nvidia runs it this way, as they sell the GPUs these models run on, and this really serves as a showcase or a sales funnel. It just happens to also be useful to anyone who just wants to try a significantly larger model than their hardware allows for.

There are other places that will rent you access to the same models, of course, and they won't have the same throughput problems. OpenRouter aggregates a lot of these same open-weight models behind a single billing account, Groq runs a smaller curated set on its own LPU silicon at speeds most GPU stacks struggle to match, and Together AI has been running an open-model API for years. What Nvidia has, that the others don't quite, is the combination of an obvious hardware match for how these models are deployed in production, a free tier generous enough to be useful, and a curated catalog that tends to pick up new releases close to the day they drop. None of those alone makes it the obvious pick, but together, they make it extremely enticing.

Hosted endpoints don't replace local, they extend what you can test

Running 196B at home is still a non-starter for most people

I run local LLMs daily, and I've been pretty open that for the kind of work I actually do, the smaller models are usually fine. Qwen3-Coder-Next on my own hardware is still my go-to for anything where privacy matters or where I don't want a network dependency in the loop. Local always wins when local is good enough.

But the problem is that local isn't always good enough, and the gap shows up exactly at the model sizes Nvidia is hosting. There's no realistic way for me to run MiniMax M2.7 on the ThinkStation PGX. The full model needs serious hardware, and using the 3-bit quantization to run it on that machine strips away enough of its intelligence that I'm not actually testing the same model anymore. If I want to know how M2.7 actually performs on a task I care about, I need it served at full precision. Nvidia's endpoint is the easiest way to get that without paying MiniMax directly or setting up an H100 cluster.

This is the most practical way to use the free API, as it's a way to evaluate the largest open-weight models honestly, against the same prompts and tasks I throw at my local stack, without having to pretend that the heavily quantized model is the real thing. I've used it to play around with Step-3.5 Flash without having to worry about an out of memory error, and I've played around with other models I wouldn't normally be bothered to configure locally. As well, I discovered Nvidia's free platform after I started testing out MiniMax, but I'd have just tested it on Nvidia's API first before taking the plunge on the base monthly subscription.

There's one group of people this is especially useful to, and it's the people pricing up an Nvidia DGX Spark, a high-VRAM workstation, or a Mac Studio for local AI work. If you're about to spend a few thousand dollars on a box for the express purpose of running these models, doing it without first trying the actual models on Nvidia's free endpoints is a mistake. You can spend a weekend running Step-3.5 Flash, MiniMax M2.7, and a few Nemotron variants through your own real workflows, see which ones actually fit how you work, and then size your hardware around that answer instead. That's a buying decision you can't undo cheaply, but testing it is totally free.

The only downside is that nothing here is private. You're sending your prompts to Nvidia's infrastructure, usage is logged, and models each have their own data policies. For evaluation and exploratory work that's completely fine, but for anything that touches real data or anything you'd consider sensitive, it's the wrong tool.

Free access to frontier-scale open weights is a bigger deal than it sounds

More than enough for what most people actually need

The last few years of local AI have largely been a story of clever compression without reducing output quality and of finding ways to run smaller and smaller versions of bigger and bigger models. However, the new architectures and the capabilities of these models that can truly catch you off guard are the ones that arrive at a scales that doesn't fit on home hardware.

A free, OpenAI-compatible endpoint serving the full-precision weights of those open models, hosted by the company that makes the GPUs they run on, flys in the face of the closed-model status quo. You get to try the big stuff, while comparing it to what you can already run locally. And when you're done with your testing, you get to make an informed call about which model is worth your hardware budget or worth paying for on OpenRouter before you spend anything at all.

URL: https://www.xda-developers.com/ive-been-running-some-biggest-open-weight-llms-free-nvidia-cloud/

⇱ I've been running some of the biggest open-weight LLMs for free on Nvidia's cloud