This benchmark measures Llama 2-7B on latency, cost, and throughput across deployment modes to gauge whether it's production-ready for your workload.
In this article, we benchmark the performance of Llama-2-7B in terms of latency, cost, and requests per second. This helps evaluate whether it is a good choice for your business requirements. Please note that we don't cover qualitative performance in this article; there are different methods for comparing LLMs on quality, which are covered elsewhere.
In this blog, we have benchmarked the Llama-2-7B model from NousResearch. This is a pre-trained version of Llama-2 with 7 billion parameters.
Meta developed and publicly released the Llama 2 family of large language models (LLMs), a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters.
The key factors across which we benchmarked are:
GPU Type: A10 24GB and A100 40GB
Prompt Length: 1500 input + 100 output tokens, and 50 input + 500 output tokens
For benchmarking, we used Locust, an open-source load-testing tool. Locust works by creating users/workers that send requests in parallel. At the beginning of each test, we set the Number of Users and the Spawn Rate. The Number of Users is the maximum number of users that can run concurrently, whereas the Spawn Rate is how many users are spawned per second.
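To make the relationship between the two settings concrete, here is a minimal sketch in plain Python. It is a simplified model of the ramp-up, not Locust's own code, and the function names are hypothetical: it assumes users are added at a constant rate until the cap is reached.

```python
def active_users(elapsed_s: float, num_users: int, spawn_rate: float) -> int:
    """Approximate number of running users `elapsed_s` seconds into a test.

    Simplified model (an assumption, not Locust internals): users are added
    at `spawn_rate` per second until `num_users` are running.
    """
    if elapsed_s < 0:
        raise ValueError("elapsed_s must be non-negative")
    return min(num_users, int(spawn_rate * elapsed_s))


def ramp_up_seconds(num_users: int, spawn_rate: float) -> float:
    """Time needed to reach the full Number of Users at the given Spawn Rate."""
    if spawn_rate <= 0:
        raise ValueError("spawn_rate must be positive")
    return num_users / spawn_rate
```

For example, with 10 users and a spawn rate of 2 users per second, the test reaches full concurrency after 5 seconds and stays at 10 users from then on.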
In each benchmarking test for a deployment config, we started from 1 user and kept increasing the Number of Users gradually for as long as the RPS kept increasing steadily. During the test, we also plotted the response times (in ms) and the total requests per second.
In each of the 2 deployment configurations, we used the Hugging Face text-generation-inference model server, version 0.9.4. The following parameters were passed to the text-generation-inference image for the different model configurations:
| PARAMETERS | LLAMA-2-7B ON A100 | LLAMA-2-7B ON A10G |
|---|---|---|
| Max Batch Prefill Tokens | 6100 | 10000 |
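As a rough illustration of how such a parameter might reach the server, the sketch below builds launcher arguments for each GPU configuration. The helper name and the exact model repository id (`NousResearch/Llama-2-7b-hf`) are assumptions made for this example; the article only states that the model comes from NousResearch. `--model-id` and `--max-batch-prefill-tokens` are text-generation-inference launcher options, but check them against the version you deploy.

```python
from typing import Dict, List

# Assumed repository id; the article only says "Llama-2-7B from NousResearch".
MODEL_ID = "NousResearch/Llama-2-7b-hf"

# Values taken from the parameter table above.
MAX_BATCH_PREFILL_TOKENS: Dict[str, int] = {"A100": 6100, "A10G": 10000}


def tgi_launch_args(model_id: str, max_batch_prefill_tokens: int) -> List[str]:
    """Build the argument list passed to the text-generation-inference launcher."""
    return [
        "--model-id", model_id,
        "--max-batch-prefill-tokens", str(max_batch_prefill_tokens),
    ]


# One argument list per GPU configuration.
ARGS_BY_GPU = {
    gpu: tgi_launch_args(MODEL_ID, tokens)
    for gpu, tokens in MAX_BATCH_PREFILL_TOKENS.items()
}
```

Keeping the per-GPU values in one mapping makes it easy to see which settings differ between the two deployments.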
Latency, RPS, and Cost
We calculate the best latency by sending only one request at a time. To increase throughput, we send requests to the LLM in parallel. The max throughput is the highest request rate the model can process without a significant deterioration in latency.
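The definition of max throughput can be sketched as a small helper. This is only one possible reading: the `tolerance` factor that decides what counts as "significant" deterioration is an assumption, and the sample measurements are made-up numbers for illustration, not results from this benchmark.

```python
from typing import List, NamedTuple


class LoadStep(NamedTuple):
    users: int        # concurrent users during this step
    rps: float        # requests per second achieved
    latency_s: float  # response time in seconds


def max_sustainable_rps(steps: List[LoadStep], tolerance: float = 1.5) -> float:
    """Highest RPS whose latency stays within `tolerance` x the best latency.

    The best latency is taken from the step with the fewest users, i.e. the
    closest thing to sending one request at a time.
    """
    if not steps:
        raise ValueError("steps must not be empty")
    if tolerance < 1:
        raise ValueError("tolerance must be at least 1")
    best_latency = min(steps, key=lambda s: s.users).latency_s
    return max(s.rps for s in steps if s.latency_s <= tolerance * best_latency)


# Made-up measurements, for illustration only.
EXAMPLE_STEPS = [
    LoadStep(users=1, rps=1.0, latency_s=1.0),
    LoadStep(users=2, rps=1.9, latency_s=1.05),
    LoadStep(users=4, rps=3.5, latency_s=1.2),
    LoadStep(users=8, rps=3.6, latency_s=2.4),
]
```

With the default tolerance, the last step is excluded because its latency has more than doubled, so the helper reports the throughput of the 4-user step.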
Tokens Per Second
LLMs process input tokens and generate output tokens differently, so we calculated the processing rates for input tokens and output tokens separately.
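One simple way to derive separate rates, not necessarily the calculation used for this benchmark, is to multiply the sustained request rate by the number of input and output tokens per request. The function name below is hypothetical.

```python
from typing import Tuple


def token_rates(rps: float, input_tokens: int, output_tokens: int) -> Tuple[float, float]:
    """Return (input tokens/s, output tokens/s) processed at a given request rate.

    Assumes every request carries the same number of input and output tokens.
    """
    if rps < 0:
        raise ValueError("rps must be non-negative")
    return rps * input_tokens, rps * output_tokens


# Example: 0.9 requests/s with 1500 input and 100 output tokens per request.
input_rate, output_rate = token_rates(0.9, 1500, 100)
```

Reporting the two rates separately keeps a long-prompt workload from being compared directly with a long-generation workload.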
A10 24GB GPU (1500 input + 100 output tokens)
The graphs above show that the best response time (at 1 user) is 4.1 seconds. As we increase the number of users to send more traffic to the model, throughput rises to 0.9 RPS without a significant increase in latency. Beyond 0.9 RPS, latency increases drastically, which means requests are being queued.
A10 24GB GPU (50 input + 500 output tokens)
The graphs above show that the best response time (at 1 user) is 15 seconds. As we increase the number of users to send more traffic to the model, throughput rises to 0.9 RPS without a significant increase in latency. Beyond 0.9 RPS, latency increases drastically, which means requests are being queued.
A100 40GB GPU (1500 input + 100 output tokens)
The graphs above show that the best response time (at 1 user) is 2 seconds. As we increase the number of users to send more traffic to the model, throughput rises to 3.6 RPS without a significant increase in latency. Beyond 3.6 RPS, latency increases drastically, which means requests are being queued.
A100 40GB GPU (50 input + 500 output tokens)
The graphs above show that the best response time (at 1 user) is 8.5 seconds. As we increase the number of users to send more traffic to the model, throughput rises to 3.5 RPS without a significant increase in latency. Beyond 3.5 RPS, latency increases drastically, which means requests are being queued.
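The max-throughput figures can be turned into a rough cost per request once a GPU hourly price is known. The sketch below shows the arithmetic only: the function name is hypothetical, the price is an input you supply, and no real GPU price is implied.

```python
def cost_per_1k_requests(gpu_price_per_hour: float, max_rps: float) -> float:
    """Cost of serving 1000 requests on one GPU running at its max throughput.

    `gpu_price_per_hour` is whatever your provider charges; it is not a
    figure from this benchmark.
    """
    if max_rps <= 0:
        raise ValueError("max_rps must be positive")
    requests_per_hour = max_rps * 3600
    return gpu_price_per_hour / requests_per_hour * 1000
```

Because the cost scales inversely with throughput, a GPU that costs more per hour can still be cheaper per request if it sustains a proportionally higher RPS.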
Hopefully, this will help you decide whether Llama-2-7B suits your use case and what costs you can expect to incur while hosting it.