Voozh

This is an FP8 Dynamic quantization produced with llmcompressor v0.4.0.

You can refer to the CPU offloading example, but to quantize on an H100 node we used the following setup to avoid out-of-memory (OOM) errors:

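from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

# The original (unquantized) model linked at the end of this page.
model_name = "allenai/Llama-3.1-Tulu-3-405B"

# Build the model skeleton on the meta device so no weights are allocated yet.
config = AutoConfig.from_pretrained(model_name)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)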

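# Cap each of the 8 GPUs at 60GiB and let the remainder spill over to CPU RAM.
max_memory = {
    0: "60GiB",
    1: "60GiB",
    2: "60GiB",
    3: "60GiB",
    4: "60GiB",
    5: "60GiB",
    6: "60GiB",
    7: "60GiB",
    "cpu": "1500GiB",
}

# Never split a single decoder layer across devices.
device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["LlamaDecoderLayer"],
)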
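To see why the "cpu" entry matters in the budget above, here is a rough back-of-the-envelope sketch. It assumes the nominal 405B parameter count from the model name and 2 bytes per parameter (BF16 weights); both are assumptions for illustration, not measured values, and the helper names are hypothetical. It ignores activations and any other runtime overhead.

```python
GIB = 1024 ** 3


def weight_footprint_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate size of the raw weights in GiB."""
    return num_params * bytes_per_param / GIB


def build_max_memory(num_gpus: int, per_gpu: str, cpu: str) -> dict:
    """Build a max_memory dict shaped like the one passed to infer_auto_device_map."""
    budget = {gpu_index: per_gpu for gpu_index in range(num_gpus)}
    budget["cpu"] = cpu
    return budget


# Assumption: nominal parameter count taken from the "405B" in the model name.
num_params = 405e9
# Assumption: the unquantized weights are stored in BF16 (2 bytes per parameter).
bf16_gib = weight_footprint_gib(num_params, 2)

# GPU budget from the setup above: 8 GPUs capped at 60GiB each.
gpu_budget_gib = 8 * 60

# Whatever does not fit under the GPU caps has to be placed on the CPU.
overflow_gib = max(0.0, bf16_gib - gpu_budget_gib)

max_memory = build_max_memory(8, "60GiB", "1500GiB")

print(f"Estimated BF16 weights: {bf16_gib:.0f} GiB")
print(f"Total GPU budget:       {gpu_budget_gib} GiB")
print(f"Offloaded to CPU:       {overflow_gib:.0f} GiB")
```

Under these assumptions the weights alone exceed the combined GPU caps by a wide margin, so a large share of the layers ends up mapped to CPU memory, which is what the generous "cpu" limit is for.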

Original model here: https://huggingface.co/allenai/Llama-3.1-Tulu-3-405B