GLM-5.2 GGUF Benchmarks!
We ran KLD (KL Divergence) benchmarks to gauge the accuracy of our quantizations of GLM-5.2-GGUF. Dynamic 4-bit UD-Q4_K_XL and dynamic 5-bit UD-Q5_K_XL are generally lossless, and smaller quants also work great!
On pure top-1 accuracy, dynamic 1-bit gets around 76.2% while being 86% smaller! Dynamic 2-bit gets around 82% while being 84% smaller.
[Plot: glm52_top1_acc_anchored_vs_gb (top-1 accuracy vs. disk size)]
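For readers unfamiliar with the two metrics, here is a minimal sketch of how mean KL divergence and top-1 agreement could be computed from per-token logits. It assumes "top-1 accuracy" means how often the quantized model picks the same most-likely token as the unquantized reference; the post does not spell that out. The function names and toy logits are hypothetical, not the actual benchmark pipeline.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def compare(ref_logits, quant_logits):
    """Return (mean KLD, top-1 agreement) over all token positions."""
    klds, matches = [], 0
    for ref, quant in zip(ref_logits, quant_logits):
        klds.append(kl_divergence(softmax(ref), softmax(quant)))
        # Top-1 agreement: both models rank the same token first.
        matches += ref.index(max(ref)) == quant.index(max(quant))
    n = len(klds)
    return sum(klds) / n, matches / n

# Toy data (hypothetical): 3 token positions, vocabulary of 3 tokens.
ref = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.0], [1.0, 1.2, 3.0]]
quant = [[1.9, 1.1, 0.1], [2.0, 1.5, 0.0], [1.0, 1.1, 3.1]]
mean_kld, top1 = compare(ref, quant)
print(f"mean KLD = {mean_kld:.4f} nats, top-1 agreement = {top1:.1%}")
```

In this toy data the two models disagree on the top token at one of three positions, so top-1 agreement is 2/3. A lower mean KLD means the quantized distribution stays closer to the reference.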
More details in our Guide: https://unsloth.ai/docs/models/glm-5.2
You can now run GLM-5.2 in Unsloth Studio: https://github.com/unslothai/unsloth
The disk space x-axis doesn't seem to match the actual HF-reported sizes, unless I'm missing something?
@fraserprice Oh you're right haha. I think the plot is in GiB, so it is still correct since all of the sizes are in GiB. I guess we'll re-label it.
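To make the GB vs. GiB gap concrete, here is a small conversion sketch. It assumes the HF-reported sizes are decimal gigabytes (10^9 bytes) while the plot uses binary gibibytes (2^30 bytes); the 100 GiB figure is a made-up example, not a real quant size.

```python
GB = 10**9    # decimal gigabyte, in bytes
GIB = 2**30   # binary gibibyte, in bytes

def gib_to_gb(gib):
    """Convert a size in GiB to decimal GB."""
    return gib * GIB / GB

def gb_to_gib(gb):
    """Convert a size in decimal GB to GiB."""
    return gb * GB / GIB

example_gib = 100.0  # hypothetical size read off a GiB-labelled axis
print(f"{example_gib} GiB = {gib_to_gb(example_gib):.1f} GB")  # 100.0 GiB = 107.4 GB
```

The same file shows a number about 7.4% larger in GB than in GiB, which would be enough to explain an axis that looks slightly off against the listed sizes.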
Getting very bad speeds with 2x RTX PRO 6000s and 512GB of 6400 MHz DDR5 RAM ...
(base) mukul@jarvis:~/dev-ai/llama.cpp$ CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES="0,2" ./build/bin/llama-server \
--model /media/mukul/data/models/unsloth/GLM-5.2-GGUF/UD-Q4_K_XL/GLM-5.2-UD-Q4_K_XL-00001-of-00011.gguf \
--alias unsloth/GLM-5.2 \
--ctx-size 262144 \
-fa on \
-np 1 -kvu \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--fit on \
-b 4096 -ub 4096 \
--parallel 1 \
--threads 56 \
--jinja \
--host 0.0.0.0 \
--port 10002
0.00.396.981 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.396.984 I device_info:
0.00.538.569 I - CUDA0 : NVIDIA RTX PRO 6000 Blackwell Workstation Edition (97249 MiB, 96657 MiB free)
0.00.685.104 I - CUDA1 : NVIDIA RTX PRO 6000 Blackwell Workstation Edition (97249 MiB, 96675 MiB free)
0.00.685.112 I - CPU : Intel(R) Xeon(R) w9-3495X (515257 MiB, 515257 MiB free)
0.00.685.160 I system_info: n_threads = 56 (n_threads_batch = 56) / 112 | CUDA : ARCHS = 750,800,860,890,900,1200,1210 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.00.685.240 I srv init: using 111 threads for HTTP server
0.00.685.568 I srv start: binding port with default address family
0.00.686.706 I srv llama_server: loading model
0.00.686.710 I srv load_model: loading model '/media/mukul/data/models/unsloth/GLM-5.2-GGUF/UD-Q4_K_XL/GLM-5.2-UD-Q4_K_XL-00001-of-00011.gguf'
0.00.686.724 I common_init_result: fitting params to device memory ...
0.00.686.725 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.24.429.073 W load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
0.24.429.077 W load: special_eom_id is not in special_eog_ids - the tokenizer config may be incorrect
0.24.550.379 W llama_model_loader: tensor overrides to CPU are used with mmap enabled - consider using --no-mmap for better performance
0.24.694.443 W model has unused tensor blk.78.attn_norm.weight (size = 24576 bytes) -- ignoring
0.24.694.446 W model has unused tensor blk.78.attn_q_a_norm.weight (size = 8192 bytes) -- ignoring
0.24.694.448 W model has unused tensor blk.78.attn_kv_a_norm.weight (size = 2048 bytes) -- ignoring
0.24.694.450 W model has unused tensor blk.78.attn_q_a.weight (size = 13369344 bytes) -- ignoring
0.24.694.452 W model has unused tensor blk.78.attn_q_b.weight (size = 35651584 bytes) -- ignoring
0.24.694.454 W model has unused tensor blk.78.attn_kv_a_mqa.weight (size = 3760128 bytes) -- ignoring
0.24.694.456 W model has unused tensor blk.78.attn_k_b.weight (size = 6684672 bytes) -- ignoring
0.24.694.458 W model has unused tensor blk.78.attn_v_b.weight (size = 8912896 bytes) -- ignoring
0.24.694.460 W model has unused tensor blk.78.attn_output.weight (size = 106954752 bytes) -- ignoring
0.24.694.463 W model has unused tensor blk.78.ffn_norm.weight (size = 24576 bytes) -- ignoring
0.24.694.465 W model has unused tensor blk.78.indexer.k_norm.weight (size = 512 bytes) -- ignoring
0.24.694.467 W model has unused tensor blk.78.indexer.k_norm.bias (size = 512 bytes) -- ignoring
0.24.694.469 W model has unused tensor blk.78.indexer.proj.weight (size = 786432 bytes) -- ignoring
0.24.694.472 W model has unused tensor blk.78.indexer.attn_k.weight (size = 835584 bytes) -- ignoring
0.24.694.474 W model has unused tensor blk.78.indexer.attn_q_b.weight (size = 8912896 bytes) -- ignoring
0.24.694.476 W model has unused tensor blk.78.ffn_gate_inp.weight (size = 6291456 bytes) -- ignoring
0.24.694.593 W model has unused tensor blk.78.ffn_gate_exps.weight (size = 1811939328 bytes) -- ignoring
0.24.694.596 W model has unused tensor blk.78.ffn_down_exps.weight (size = 2214592512 bytes) -- ignoring
0.24.694.598 W model has unused tensor blk.78.ffn_up_exps.weight (size = 1811939328 bytes) -- ignoring
0.24.694.600 W model has unused tensor blk.78.ffn_gate_shexp.weight (size = 13369344 bytes) -- ignoring
0.24.694.602 W model has unused tensor blk.78.ffn_down_shexp.weight (size = 13369344 bytes) -- ignoring
0.24.694.604 W model has unused tensor blk.78.ffn_up_shexp.weight (size = 13369344 bytes) -- ignoring
0.24.694.607 W model has unused tensor blk.78.nextn.eh_proj.weight (size = 80216064 bytes) -- ignoring
0.24.694.609 W model has unused tensor blk.78.nextn.enorm.weight (size = 24576 bytes) -- ignoring
0.24.694.612 W model has unused tensor blk.78.nextn.hnorm.weight (size = 24576 bytes) -- ignoring
0.24.694.619 W model has unused tensor blk.78.nextn.shared_head_norm.weight (size = 24576 bytes) -- ignoring
2.23.258.600 W llama_context: n_ctx_seq (262144) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
2.24.076.740 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
2.25.936.288 I srv load_model: initializing slots, n_slots = 1
2.27.201.994 W common_speculative_init: no implementations specified for speculative decoding
2.27.202.002 I slot load_model: id 0 | task -1 | new slot, n_ctx = 262144
2.27.202.165 I srv load_model: prompt cache is enabled, size limit: 8192 MiB
2.27.202.167 I srv load_model: use `--cache-ram 0` to disable the prompt cache
2.27.202.168 I srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
2.27.202.168 I srv load_model: context checkpoints enabled, max = 32, min spacing = 256
2.27.202.337 I srv init: idle slots will be saved to prompt cache and cleared upon starting a new task
2.27.228.389 I init: chat template, example_format: '[gMASK]<sop><|system|>Reasoning Effort: Max<|system|>You are a helpful assistant<|user|>Hello<|assistant|><think></think>Hi there<|user|>How are you?<|assistant|><think>'
2.27.247.196 I srv init: init: chat template, thinking = 1
2.27.247.217 I srv llama_server: model loaded
2.27.247.220 I srv llama_server: server is listening on http://0.0.0.0:10002
2.27.247.223 I srv update_slots: all slots are idle
3.31.305.955 I srv params_from_: Chat format: peg-native
3.31.322.814 I slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
3.31.322.816 I srv get_availabl: updating prompt cache
3.31.322.820 I srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
3.31.322.822 I srv update: - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 262144 tokens, 8589934592 est)
3.31.322.823 I srv get_availabl: prompt cache update took 0.01 ms
3.31.323.567 I reasoning-budget: activated, budget=2147483647 tokens
3.31.323.587 I slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
4.32.538.505 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 4096, progress = 0.33, t = 61.21 s / 66.91 tokens per second
5.12.022.168 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 8192, progress = 0.67, t = 100.70 s / 81.35 tokens per second
5.38.789.669 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 8863, progress = 0.72, t = 127.47 s / 69.53 tokens per second
6.37.430.688 I reasoning-budget: deactivated (natural end)
6.44.682.812 I slot print_timing: id 0 | task 0 | n_decoded = 100, tg = 2.89 t/s
6.47.967.359 I slot print_timing: id 0 | task 0 | n_decoded = 111, tg = 2.93 t/s
6.51.071.939 I slot print_timing: id 0 | task 0 | n_decoded = 121, tg = 2.95 t/s
6.52.862.794 I slot print_timing: id 0 | task 0 | prompt eval time = 158735.98 ms / 12232 tokens ( 12.98 ms per token, 77.06 tokens per second)
6.52.862.797 I slot print_timing: id 0 | task 0 | eval time = 42803.21 ms / 127 tokens ( 337.03 ms per token, 2.97 tokens per second)
6.52.862.798 I slot print_timing: id 0 | task 0 | total time = 201539.18 ms / 12359 tokens
6.52.862.799 I slot print_timing: id 0 | task 0 | graphs reused = 125
6.52.863.713 I slot release: id 0 | task 0 | stop processing: n_tokens = 12358, truncated = 0
6.52.863.724 I srv update_slots: all slots are idle
6.52.984.634 I srv params_from_: Chat format: peg-native
6.53.000.188 I slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.713 (> 0.100 thold), f_keep = 0.717
6.53.001.961 I reasoning-budget: activated, budget=2147483647 tokens
6.53.002.012 I slot launch_slot_: id 0 | task 131 | processing task, is_child = 0
7.24.683.468 I reasoning-budget: deactivated (natural end)
7.42.269.207 I slot print_timing: id 0 | task 131 | prompt eval time = 31681.46 ms / 3563 tokens ( 8.89 ms per token, 112.46 tokens per second)
7.42.269.211 I slot print_timing: id 0 | task 131 | eval time = 17585.72 ms / 48 tokens ( 366.37 ms per token, 2.73 tokens per second)
7.42.269.211 I slot print_timing: id 0 | task 131 | total time = 49267.17 ms / 3611 tokens
7.42.269.212 I slot print_timing: id 0 | task 131 | graphs reused = 171
7.42.270.126 I slot release: id 0 | task 131 | stop processing: n_tokens = 12475, truncated = 0
7.42.270.212 I srv update_slots: all slots are idle
7.53.779.290 I srv params_from_: Chat format: peg-native
7.53.796.320 I slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.677 (> 0.100 thold), f_keep = 0.711
7.53.798.019 I reasoning-budget: activated, budget=2147483647 tokens
7.53.798.130 I slot launch_slot_: id 0 | task 180 | processing task, is_child = 0
8.11.072.889 I slot print_timing: id 0 | task 180 | prompt processing, n_tokens = 96, progress = 0.68, t = 17.27 s / 5.56 tokens per second
^C^CReceived second interrupt, terminating immediately.
(base) mukul@jarvis:~/dev-ai/llama.cpp$
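To compare runs like the one above without reading the log by eye, here is a rough sketch that pulls the final `print_timing` lines out of the server output and recomputes tokens per second. The regex is written against the line format shown in this log only and may need adjusting for other llama.cpp versions; `summarize` is a hypothetical helper name.

```python
import re

# Matches the "prompt eval time" and "eval time" summary lines seen in the log above.
TIMING_RE = re.compile(
    r"task (?P<task>\d+) \| (?P<phase>prompt eval|eval) time =\s*"
    r"(?P<ms>[\d.]+) ms /\s*(?P<tokens>\d+) tokens"
)

def summarize(log_text):
    """Return a list of (task id, phase, tokens per second) tuples."""
    rows = []
    for m in TIMING_RE.finditer(log_text):
        ms, tokens = float(m["ms"]), int(m["tokens"])
        rows.append((int(m["task"]), m["phase"], tokens / (ms / 1000.0)))
    return rows

# Two lines copied from the log above.
sample = """\
6.52.862.794 I slot print_timing: id 0 | task 0 | prompt eval time = 158735.98 ms / 12232 tokens ( 12.98 ms per token, 77.06 tokens per second)
6.52.862.797 I slot print_timing: id 0 | task 0 | eval time = 42803.21 ms / 127 tokens ( 337.03 ms per token, 2.97 tokens per second)
"""

for task, phase, tps in summarize(sample):
    print(f"task {task} {phase}: {tps:.2f} tokens/s")
```

For task 0 this reproduces the figures the server printed: about 77 tokens/s for prompt processing and about 3 tokens/s for generation.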
