VOOZH about

URL: https://huggingface.co/mratsim/GLM-4-32B-0414.w4a16-gptq

⇱ mratsim/GLM-4-32B-0414.w4a16-gptq · Hugging Face


GLM-4-32B-0414 Quantized with GPTQ (4-Bit weight-only, W4A16)

This repo contains GLM-4-32B-0414 quantized with asymmetric GPTQ to 4-bit to make it suitable for consumer hardware.

The model was calibrated with 2048 samples of max sequence length 4096 from the dataset mit-han-lab/pile-val-backup.

This is my very first quantized model, I welcome suggestions. 2048/4096 were chosen over the default of 512/2048 to minimize overfitting risk and maximize convergence. They also happen to fit in my GPU.

Original Model:

📥 Usage & Running Instructions

The model was tested with vLLM, here is a script suitable for 32GB VRAM GPUs.

export MODEL="mratsim/GLM-4-32B-0414.w4a16-gptq"
vllm serve "${MODEL}" \
 --served-model-name glm-4-32b \
 --gpu-memory-utilization 0.90 \
 --enable-prefix-caching \
 --enable-chunked-prefill \
 --max-model-len 130000 \
 --max_num_seqs 256 \
 --generation-config "${MODEL}" \
 --enable-auto-tool-choice --tool-call-parser pythonic \
 --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}'

🔬 Quantization method

The llmcompressor library was used with the following recipe for asymmetric GPTQ:

default_stage:
 default_modifiers:
 GPTQModifier:
 dampening_frac: 0.005
 config_groups:
 group_0:
 targets: [Linear]
 weights: {num_bits: 4, type: int, symmetric: false, group_size: 128, strategy: group,
 dynamic: false, observer: minmax}
 ignore: [lm_head]

and calibrated on 2048 samples, 4096 sequence length of mit-han-lab/pile-val-backup

Downloads last month
108,096
Safetensors
Model size
33B params
Tensor type
I64
·
I32
·
BF16
·

Model tree for mratsim/GLM-4-32B-0414.w4a16-gptq

Quantized
(6)
this model

Dataset used to train mratsim/GLM-4-32B-0414.w4a16-gptq

Collection including mratsim/GLM-4-32B-0414.w4a16-gptq