URL: https://www.hardware-corner.net/guides/llama-cpp-vs-ik_llama-cpp/

I Benchmarked Llama.cpp vs. ik_llama.cpp on a 70B Model with an RTX 3090 – Here’s What I Found

Author: Allan Witt

For local LLM enthusiasts, the choice of inference engine can be just as crucial as the hardware it runs on. While llama.cpp has become the de facto standard, a fork named ik_llama.cpp has been gaining attention for its focus on performance, especially in hybrid CPU/GPU setups. Although many of ik_llama.cpp’s listed improvements target Mixture-of-Experts (MoE) models, I wanted to see how it compared to the mainline llama.cpp using a large, dense model.

I put both engines to the test on my rig with Meta’s Llama 3.3 70B model to see what, if any, practical difference it makes for a home setup.

The Test System

My hardware consists of a dual-purpose server and AI machine built for value and performance. The core of the system is a 20-core Xeon E5-2673 v4 CPU with 128 GB of RAM. The main GPU is a single NVIDIA RTX 3090. On the software side, the system runs Ubuntu 24.04 LTS with the CUDA 12.4 drivers.

For this comparison, I used two different 4-bit quantizations of the Llama 3.3 70B model: Unsloth’s dynamic UD-Q4_K_XL and an IQ4_NL quant. The goal was to test how each inference engine handled different quantization strategies across both short (~1k token) and long (~8k token) context lengths.

Performance at a Glance

Here is a summary of the performance results, focusing on prompt processing (ingest speed) and token generation (output speed), both measured in tokens per second (t/s).

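| Model Quantization | Context Size | Inference Engine | Prompt Processing (t/s) | Token Generation (t/s) |
|--------------------|--------------|------------------|-------------------------|------------------------|
| UD-Q4_K_XL         | ~1k          | ik_llama.cpp     | 14.57                   | 2.16                   |
| UD-Q4_K_XL         | ~8k          | ik_llama.cpp     | 14.44                   | 2.10                   |
| UD-Q4_K_XL         | ~1k          | llama.cpp        | 123.28                  | 2.03                   |
| UD-Q4_K_XL         | ~8k          | llama.cpp        | 127.86                  | 1.47                   |
| IQ4_NL             | ~1k          | ik_llama.cpp     | 138.04                  | 2.29                   |
| IQ4_NL             | ~8k          | ik_llama.cpp     | 146.29                  | 2.04                   |
| IQ4_NL             | ~1k          | llama.cpp        | 132.66                  | 2.22                   |
| IQ4_NL             | ~8k          | llama.cpp        | 138.12                  | 1.66                   |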
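To put the prompt-processing figures in practical terms, the following minimal sketch converts the ~8k-context speeds from the table into an approximate wait before the first output token appears. It takes the table values at face value and assumes, purely for illustration, that "~8k" means exactly 8192 tokens and that ingest speed is constant over the whole prompt; the function name `ingest_seconds` is made up for this example.

```python
# Prompt-processing speeds (t/s) copied from the ~8k context rows of the table.
PROMPT_TPS_8K = {
    ("UD-Q4_K_XL", "ik_llama.cpp"): 14.44,
    ("UD-Q4_K_XL", "llama.cpp"): 127.86,
    ("IQ4_NL", "ik_llama.cpp"): 146.29,
    ("IQ4_NL", "llama.cpp"): 138.12,
}

# Assumption: "~8k" is treated as exactly 8192 tokens for illustration.
PROMPT_TOKENS = 8192


def ingest_seconds(prompt_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to process a prompt at a constant ingest speed."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return prompt_tokens / tokens_per_second


for (quant, engine), tps in PROMPT_TPS_8K.items():
    secs = ingest_seconds(PROMPT_TOKENS, tps)
    print(f"{quant:<11} {engine:<13} {secs:7.1f} s to ingest the prompt")
```

Under these assumptions, every configuration above 120 t/s ingests the prompt in about a minute, while the single slow configuration takes several minutes.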

Analysis of the Results: Llama.cpp vs. ik_llama.cpp

The numbers suggest a mix of outcomes, depending on the quantization type and context size.

For prompt processing, ik_llama.cpp paired with the IQ4_NL quantization reached the highest observed speed, 146.29 t/s on the 8k context. That is about 6% faster than llama.cpp in the same configuration (138.12 t/s). The result was reversed with the Unsloth UD-Q4_K_XL model: llama.cpp ingested prompts at 123–128 t/s, while ik_llama.cpp is listed at roughly 14.5 t/s in the table above. This may indicate engine-specific optimizations for certain quant types in llama.cpp that aren’t present in the fork.
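The relative gaps in prompt processing can be checked with a few lines of arithmetic. This is a small sketch using only the table values; the helper name `percent_diff` is hypothetical, and llama.cpp is used as the baseline.

```python
# (quant, context) -> (ik_llama.cpp t/s, llama.cpp t/s), copied from the table.
PROMPT_TPS = {
    ("UD-Q4_K_XL", "~1k"): (14.57, 123.28),
    ("UD-Q4_K_XL", "~8k"): (14.44, 127.86),
    ("IQ4_NL", "~1k"): (138.04, 132.66),
    ("IQ4_NL", "~8k"): (146.29, 138.12),
}


def percent_diff(ik_tps: float, mainline_tps: float) -> float:
    """Difference of ik_llama.cpp relative to llama.cpp, in percent.

    Positive means ik_llama.cpp is faster, negative means it is slower.
    """
    return (ik_tps - mainline_tps) / mainline_tps * 100.0


for (quant, context), (ik_tps, mainline_tps) in PROMPT_TPS.items():
    diff = percent_diff(ik_tps, mainline_tps)
    print(f"{quant:<11} {context:<4} ik_llama.cpp vs llama.cpp: {diff:+.1f}%")
```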

Token generation speeds were generally close across both engines, especially on shorter contexts, with both hovering around 2.0–2.3 t/s. On the longer 8k context, llama.cpp’s performance dropped more noticeably (1.47–1.66 t/s), while ik_llama.cpp maintained more stable speeds (2.04–2.10 t/s). This suggests that ik_llama.cpp may be handling longer context workloads slightly more efficiently under hybrid CPU/GPU conditions.
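How much each engine slows down between the short and long context can be computed directly from the table. The sketch below does that; `slowdown_percent` is a name invented for this example, and the numbers are the token-generation figures reported above.

```python
# (engine, quant) -> (t/s at ~1k context, t/s at ~8k context), from the table.
GEN_TPS = {
    ("ik_llama.cpp", "UD-Q4_K_XL"): (2.16, 2.10),
    ("llama.cpp", "UD-Q4_K_XL"): (2.03, 1.47),
    ("ik_llama.cpp", "IQ4_NL"): (2.29, 2.04),
    ("llama.cpp", "IQ4_NL"): (2.22, 1.66),
}


def slowdown_percent(short_tps: float, long_tps: float) -> float:
    """Percentage of generation speed lost going from short to long context."""
    return (short_tps - long_tps) / short_tps * 100.0


for (engine, quant), (short_tps, long_tps) in GEN_TPS.items():
    loss = slowdown_percent(short_tps, long_tps)
    print(f"{engine:<13} {quant:<11} loses {loss:.1f}% from ~1k to ~8k")
```

On these figures llama.cpp gives up roughly a quarter of its generation speed at 8k with either quant, while ik_llama.cpp loses noticeably less.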

The Quantization Question: IQ vs. K-Quants

The choice between quantization types like “I-quants” and “K-quants” adds another dimension. Generally, I-quants (IQ) aim to reduce model size and preserve quality but may increase compute demands. K-quants are a more established option and often favored for their balance of speed and simplicity, especially in CPU-heavy environments. In this test, IQ quants performed well with ik_llama.cpp, particularly for ingest speed, without major tradeoffs in generation performance.

Unsloth’s “dynamic” quantization takes a different approach by using variable precision across layers to retain accuracy. In this test, its performance depended heavily on the engine: the UD-Q4_K_XL quant ingested prompts faster on llama.cpp than on ik_llama.cpp, while the IQ4_NL quant showed the opposite pattern.

The Fork in the Road

The existence of llama.cpp and ik_llama.cpp presents a choice for enthusiasts. The developer of ik_llama.cpp forked the project to focus on performance and quantization-specific features, without adopting the broader feature set (such as vision or speech support) seen in the mainline project. Mainline llama.cpp remains a full-featured general-purpose engine, while ik_llama.cpp focuses on runtime efficiency in hybrid setups.

For my hardware, the takeaway is that ik_llama.cpp offers slightly more consistent token generation speed with long contexts, especially when using IQ quants. It seems to manage the load between VRAM and system RAM more evenly in that scenario. On the other hand, llama.cpp produced better ingest speeds with the UD-Q4_K_XL quant and was overall more predictable in performance across configurations.

Ultimately, the differences between the two engines aren’t dramatic, but they may matter depending on workload and priorities. If you’re looking to optimize for a specific use case, it’s worth testing both with your own models and hardware.
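For readers who want to run their own comparison, here is a minimal sketch that builds matching benchmark commands for both engines. The article does not say which tool produced the numbers above, so this is an assumption: it relies on the `llama-bench` utility that ships with llama.cpp and on its `-m`, `-p`, `-n`, and `-ngl` options, and it assumes the ik_llama.cpp build provides the same tool. The binary paths, the model path, and the GPU layer count are placeholders, not values from the test system. The script only prints the commands and does not run them.

```python
import shlex

# Hypothetical locations of each engine's llama-bench binary.
ENGINES = {
    "llama.cpp": "./llama.cpp/build/bin/llama-bench",
    "ik_llama.cpp": "./ik_llama.cpp/build/bin/llama-bench",
}

# Hypothetical model path and GPU offload setting; adjust to fit your VRAM.
MODEL = "models/your-model.gguf"
GPU_LAYERS = 40


def bench_command(binary: str, model: str, prompt_tokens: int,
                  gen_tokens: int, gpu_layers: int) -> list[str]:
    """Build an argument list for one llama-bench run."""
    return [
        binary,
        "-m", model,                # model file
        "-p", str(prompt_tokens),   # prompt length to process
        "-n", str(gen_tokens),      # number of tokens to generate
        "-ngl", str(gpu_layers),    # layers offloaded to the GPU
    ]


for name, binary in ENGINES.items():
    for prompt_tokens in (1024, 8192):
        cmd = bench_command(binary, MODEL, prompt_tokens, 128, GPU_LAYERS)
        print(f"# {name}, {prompt_tokens}-token prompt")
        print(shlex.join(cmd))
```

Keeping the model file, prompt lengths, and offload setting identical across both engines is what makes the resulting numbers comparable.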

Allan Witt

Allan Witt is the co-founder and Editor-in-Chief of Hardware-Corner.net. Computers and the web have fascinated him since childhood. In 2011, he began training as an IT specialist at a mid-sized company while launching a tech blog on the side, quickly discovering a passion for writing about hardware and technology.

After completing his training, Allan worked as a system administrator for two years. Alongside that, he started building and upgrading custom gaming PCs at a local hardware shop. What began as a part-time project grew into a full-time career. Today, his work also focuses on building and optimizing PC systems for local AI and LLM workloads, combining hands-on experience with a passion for making complex tech easy to understand.
