VOOZH about

URL: https://www.hardware-corner.net/how-to-test-gpus-for-local-llms/

⇱ How I Test GPUs for Local LLMs Before I Buy One


How I Test GPUs for Local LLMs Before I Buy One

By Allan Witt | Updated: January 22, 2026

πŸ‘ gpu with instance renting for testing with llm

I see this question a lot on Reddit and Discord. You want to jump into local LLMs. Maybe it is about privacy, maybe you want to tinker, maybe you are just tired of API limits. You already checked GPU tier lists, VRAM tables, and how much context different models can fit per card. Still, you want to test things yourself before spending real money. That is exactly what I do.

Recent price hikes made me reevaluate my options. The RTX 3090 still shows up as the value king for local LLMs, so I decided to use it as an example. Instead of relying on benchmarks, I wanted to test it the way I actually work, using my own workflow and measuring the things that really matter for inference.

This guide is how I test GPUs before I buy them. I do this regularly, and this is based on hands-on testing, not theory.

Why I Test Instead of Trusting Benchmarks

Most online benchmarks do not reflect real local LLM usage. They often use short contexts, synthetic tests, or old inference engines. As a local LLM user, I care about VRAM limits, max usable context, prompt processing speed when the KV cache is already full, and token generation speed once everything is warmed up.

I also pay attention to what other users report. Reddit is usually the first place where real problems show up. Interestingly, even in unrelated research discussions, you can spot signals. 

Renting GPUs Instead of Guessing

The two cheapest ways I know to test GPUs properly are Vast.ai and Runpod.io. These are GPU on-demand platforms where you rent real consumer GPUs by the hour. You load a Docker template, run your own inference stack, and measure everything yourself.

For this test I use an RTX 3090, the latest llama.cpp build, and Open WebUI. Everything runs on Linux. I always recommend Linux if you are serious about local LLMs. It is faster, tooling is native, and debugging is easier. I recently switched my desktop fully to Linux. Even games just work now.

Renting an RTX 3090 on Vast.ai

I like Vast.ai because it is fast, cheap, and has a lot of consumer GPUs available. You can test almost anything there.

First, create an account on Vast.ai and load five dollars. At current prices, that gives roughly a full day of testing time.

From the left sidebar, go to Templates and search for β€œOpen WebUI (Ollama)”. We are not actually using Ollama, but this template already includes everything we need.

πŸ‘ Screenshot of the Vast.ai interface for choosing a Docker template to test a GPU with an LLM

After clicking the play button, you will be redirected to the search view. At the top, filter by NVIDIA, then RTX, then RTX 30 Series, and select RTX 3090. Pick an instance with good download speed so model downloads are not painful. Rent the instance.

πŸ‘ rtx 3090 gpu instances available for rent on vast ai

Once the instance is running, go to Instances in the left menu. Wait until the blue button says Open. Click the small terminal icon to the right. Vast.ai will show you SSH credentials. Copy the direct SSH command and connect from your local terminal.

πŸ‘ active rtx 3090 instance on vast ai

Installing llama.cpp with CUDA

After connecting to the instance, I remove the preinstalled llama.cpp directory to avoid conflicts.

rm -rf llama.cpp

Then I clone and build llama.cpp with CUDA enabled.

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

This gives me a clean, up-to-date inference engine built specifically for the GPU I am testing.

Downloading a Test Model

For this test I use gpt-oss 20B from the Unsloth repo. It is fast, behaves well, and can push a 24 GB GPU. I use the original MXFP4 version to stress the system.

cd /workspace/llama.cpp/models/
wget https://huggingface.co/unsloth/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-F16.gguf

Running llama-server and Connecting Open WebUI

To test real user experience, I always start with llama-server and connect it to Open WebUI. This shows me how the system behaves in interactive use, not just raw benchmarks.

cd /workspace/llama.cpp/build/bin/
./llama-server \
 --model /workspace/llama.cpp/models/gpt-oss-20b-F16.gguf \
 --ctx-size 131072 \
 --flash-attn 1 \
 --n-gpu-layers 99 \
 --port 10000 \
 --chat-template jinja

Once the server is running, I note the port.

Back in the Vast.ai console, I click the blue Open button again and open Open WebUI. I create a local account and log in. Then I go bottom left, click the user name, open admin panel, open settings, and navigate to connections.

πŸ‘ open webui user menu

πŸ‘ open webui admin settings

Under the OpenAI API section, I add a new connection with the small plus button. The address is http://127.0.0.1:10000 and set the API key as β€œnone”.

πŸ‘ opan webui connection settings for openaip compatible api

Now I can start a new chat (top left) and test different context sizes. I start small, then progressively push larger prompts. gpt-oss 20B supports up to 131k context here, so this quickly shows how far the RTX 3090 can really go.

In the llama-server console, I watch lines like these:

prompt eval time = 2378.51 ms / 10310 tokens (0.23 ms per token, 4334.64 tokens per second)
eval time = 4698.38 ms / 675 tokens (6.96 ms per token, 143.67 tokens per second)
total time = 7076.89 ms / 10985 tokens

This already tells me a lot about prompt processing versus generation speed.

Standardized GPU Testing with llama-bench

Interactive testing is useful, but I always follow up with llama-bench. This gives me repeatable numbers across GPUs.

Here is how I test different context lengths with the same model. Stop the llama-server and run this in the the terminal.

./llama-bench \
 -m /workspace/llama.cpp/models/gpt-oss-20b-F16.gguf \
 -fa 1 \
 -d 4096,8196,16384,32768,45062 \
 -p 1024 \
 -n 128 \
 -ngl 99

This test pre-fills the KV cache with multiple context sizes. After that, it adds another 1024 tokens to simulate a real prompt and measures prompt processing speed. Then it generates 128 tokens and measures generation speed for each context size.

This closely matches real-world usage where the KV cache is already heavily used and you keep adding more context and generating responses.

Testing GPUs on Runpod.io

After finishing my tests on Vast.ai, I usually move to Runpod.io. The workflow is very similar, but the hardware selection is different and sometimes you can access newer GPUs earlier. This makes Runpod useful when I want to see what the current top-end consumer hardware can actually do for local LLM inference.

Head to the Runpod.io website, create an account, and load ten dollars. This is the minimum deposit. Once logged in, go to the Pods section from the left sidebar. You will see a list of available pods.

At the top navigation menu, switch from Secure Cloud to Community Cloud. This is where most consumer GPUs live. For this test, I use the RTX 5090. With 32 GB of VRAM, it is currently the top consumer GPU for local inference, and a good reference point even if you do not plan to buy one.

πŸ‘ runpod io instance selection screen

Select the RTX 5090, then scroll down to the Template section. Click Change Template and search for PyTorch 2.8.0. Choose the official Runpod PyTorch 2.8.0 template.

πŸ‘ runpod io template selection

Before deploying, you need to expose the port that Open WebUI will use. Click Edit Template and add a comma-separated list of ports under β€œExpose HTTP Ports (Max 10)”, then click Save. I use port 8088 for Open WebUI.

πŸ‘ expose port on runpod io gpu instance

Scroll to the bottom and click Deploy On-Demand. Now wait for the pod to load. Once it is ready, Runpod will show SSH credentials. Copy the credentials for SSH over exposed TCP and connect using your local terminal.

πŸ‘ running gpu pod instance on runpod io

Building llama.cpp on Runpod

After connecting, navigate to the workspace directory and clone llama.cpp.

cd /workspace
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

This builds llama.cpp with CUDA support, just like on Vast.ai.

Installing Open WebUI

Unlike the Vast.ai template, this one does not include Open WebUI. You need to install it manually.

pip install open-webui
pip install hf_transfer

Once installed, start the Open WebUI server and bind it to the port you exposed earlier.

open-webui serve --port 8088

Leave this running.

Downloading and Running a Larger Model

Open another terminal session to the same pod. This time, I use Qwen3 Coder 30B A3B in Q4_K_XL quantization. It is a good stress test for both VRAM and context length.

cd /workspace/llama.cpp/models/
wget https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF/resolve/main/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf

Qwen3 Coder supports up to 262k tokens of context, but with 32 GB of VRAM on the RTX 5090 you can realistically load around 147k. For this test, I use 131072 tokens.

cd /workspace/llama.cpp/build/bin/
./llama-server \
 --model /workspace/llama.cpp/models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf \
 --ctx-size 131072 \
 --flash-attn 1 \
 --n-gpu-layers 99 \
 --port 10000

Connecting Open WebUI to llama.cpp

Go back to the Runpod console, open Pods, and click on your active pod. Under HTTP Services, you should see β€œPort 8088 HTTP Service”. Open it in a new browser tab.

πŸ‘ runpod io http service console screen

Create a local account and log in. Then connect Open WebUI to the llama.cpp server the same way you did on Vast.ai, using http://127.0.0.1:10000 as the API endpoint and any value as the key.

At this point, you can test real interaction, large contexts, and observe prompt processing and token generation behavior directly.

Running llama-bench on Runpod

If you want standardized numbers, llama-bench works exactly the same here.

cd /workspace/llama.cpp/build/bin
./llama-bench \
 -m /workspace/llama.cpp/models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf \
 -fa 1 \
 -d 4096,8196,16384,32768,45062 \
 -p 1024 \
 -n 128 \
 -ngl 99

This lets you compare results directly with your Vast.ai tests or with other GPUs you have already evaluated.

Deciding If the GPU Is Worth It

After this, the decision is usually clear. Does the GPU load enough context for my use case. Is prompt processing fast enough once the KV cache is full. Is token generation speed acceptable for long sessions.

That is how I decide if a GPU is worth buying. Not by specs alone, but by running my own models, my own context sizes, and my own workflows.

When I am done, I stop and destroy the instance in the Vast.ai console. Do not forget that part.

Read more: Run LLMs Locally