Llama-2-7b-HiCI-16k-SFT
Model Description
This is a HiCI supervised fine-tuning (SFT) adapter checkpoint for Llama-2-7B. It extends the context window to 16K tokens and is fine-tuned on LongAlpaca-12k for instruction following.
It was produced in two stages:
- HiCI pre-training on RedPajama at 16K context (adapter: Llama-2-7b-hici-16k-none-8Gpus)
- Supervised fine-tuning on LongAlpaca-12k, resuming from the pre-training checkpoint
The adapter contains three components: LoRA adapters (q/k/v/o_proj), HiCI module weights (LocalConstructor + GlobalIntegrator), and fine-tuned embedding + LayerNorm weights.
Paper: HiCI (arXiv 2603.20843)
HiCI Architecture
Three-stage hierarchy per transformer layer:
- Local Construction — M learnable query slots attend to each segment via bottleneck cross-attention → local summary L_i
- Global Integration — multi-view statistics (mean/max/min/std/ℓ2-norm) → shared compression → attention-based selection → gated expansion → G
- Top-down Broadcast — per-segment attention with augmented KV=[G, L_i, segment tokens]; queries from segment tokens only
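As a rough illustration of the "multi-view statistics" step in Global Integration, the sketch below pools a set of local summary vectors into the five views named above. It is not the HiCI implementation: the helper name `multi_view_stats` is hypothetical, and pooling per feature dimension across slots (with a population standard deviation) is an assumption made for the example.

```python
import math


def multi_view_stats(vectors):
    """Per-dimension mean, max, min, std and l2-norm over equal-length vectors.

    Assumption: statistics are pooled across slots, one value per feature
    dimension. The real HiCI module may pool along a different axis.
    """
    n = len(vectors)
    dims = list(zip(*vectors))  # transpose: one tuple per feature dimension
    mean = [sum(d) / n for d in dims]
    return {
        "mean": mean,
        "max": [max(d) for d in dims],
        "min": [min(d) for d in dims],
        "std": [math.sqrt(sum((x - m) ** 2 for x in d) / n)
                for d, m in zip(dims, mean)],
        "l2": [math.sqrt(sum(x * x for x in d)) for d in dims],
    }


# Two toy "local summaries" with two features each.
stats = multi_view_stats([[1.0, 2.0], [3.0, 6.0]])
print(stats["mean"], stats["std"])
```

In the architecture described above, these views would then feed the shared compression and attention-based selection that produce G; those steps are not sketched here.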
Input (16K tokens) → 4 segments × 4K
Stage 1: 8 local slots per segment → L_i
Stage 2: multi-view stats → K=4 global slots G
Stage 3: Q=[chunk], KV=[G, L_i, chunk] → Flash Attention
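The layout above can be checked with a little arithmetic. The sketch below is illustrative only: `segment_bounds` and `kv_length` are hypothetical helpers, not part of the HiCI code, and the example assumes the input length is an exact multiple of the segment size.

```python
def segment_bounds(seq_len, segment_len):
    """Return (start, end) token offsets for each fixed-size segment."""
    if seq_len % segment_len != 0:
        raise ValueError("seq_len must be a multiple of segment_len")
    return [(s, s + segment_len) for s in range(0, seq_len, segment_len)]


def kv_length(segment_len, local_slots, global_slots):
    """Keys/values visible to one segment in Stage 3: [G, L_i, segment tokens]."""
    return global_slots + local_slots + segment_len


bounds = segment_bounds(16_384, 4_096)
per_segment_kv = kv_length(4_096, local_slots=8, global_slots=4)
print(len(bounds), bounds[0], bounds[-1], per_segment_kv)
```

Under these assumptions each segment attends over its own 4,096 tokens plus 12 slot vectors, so the added attention cost per segment is small relative to the segment itself.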
Trainable Components
adapter_model.bin (~28 MB)
└── LoRA Adapters (r=8, alpha=16): q_proj, k_proj, v_proj, o_proj
trainable_params.bin (~2 GB)
├── local_constructor.* — Local Construction modules (32 layers)
├── global_integrator.* — Global Integration modules (32 layers)
├── input_layernorm / post_attention_layernorm — LayerNorm weights (32 layers)
├── model.embed_tokens.weight — Token embeddings
└── model.norm.weight — Final LayerNorm
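To see how a checkpoint's weights split across these components, one option is to group parameter names by the substrings listed above. This is a minimal sketch: the helpers `classify_key` and `summarize_keys` are hypothetical, and the example key names (in particular the suffixes after `local_constructor` and `global_integrator`) are invented for illustration.

```python
from collections import Counter

# Substrings taken from the component listing above; labels are arbitrary.
COMPONENT_MARKERS = [
    ("local_constructor", "local_constructor"),
    ("global_integrator", "global_integrator"),
    ("post_attention_layernorm", "layernorm"),
    ("input_layernorm", "layernorm"),
    ("embed_tokens", "embeddings"),
    ("model.norm", "final_norm"),
]


def classify_key(key):
    """Map a parameter name to a component label, or 'other' if unmatched."""
    for marker, label in COMPONENT_MARKERS:
        if marker in key:
            return label
    return "other"


def summarize_keys(keys):
    """Count parameter tensors per component."""
    return Counter(classify_key(k) for k in keys)


example_keys = [
    "model.layers.0.self_attn.local_constructor.queries",      # hypothetical name
    "model.layers.0.self_attn.global_integrator.gate.weight",  # hypothetical name
    "model.layers.0.input_layernorm.weight",
    "model.layers.0.post_attention_layernorm.weight",
    "model.embed_tokens.weight",
    "model.norm.weight",
]
summary = summarize_keys(example_keys)
print(summary)
```

If trainable_params.bin is a plain PyTorch state dict (an assumption, not confirmed here), the keys of `torch.load("trainable_params.bin", map_location="cpu")` could be passed to `summarize_keys` in place of `example_keys`.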
Training Details
Stage 1 — HiCI Pre-training (continued pre-training)
- Base Model: meta-llama/Llama-2-7b-hf
- Context Length: 16,384 tokens (16K)
- Dataset: RedPajama-Data-1T-Sample
- Steps: 1,000
- LR: 2e-5 (base), 2e-4 (HiCI modules)
- Hardware: 8× H100 80GB, DeepSpeed Stage 2
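The two learning rates listed above imply that parameters are split into groups before the optimizer is built. A minimal sketch of such a split, assuming HiCI parameters can be recognized by the `local_constructor` and `global_integrator` name prefixes; `split_lr_groups` is a hypothetical helper and the actual training script may group parameters differently.

```python
def split_lr_groups(param_names, base_lr=2e-5, hici_lr=2e-4,
                    hici_markers=("local_constructor", "global_integrator")):
    """Partition parameter names into a base group and a HiCI group, each with its own LR."""
    base, hici = [], []
    for name in param_names:
        is_hici = any(marker in name for marker in hici_markers)
        (hici if is_hici else base).append(name)
    return [
        {"names": base, "lr": base_lr},
        {"names": hici, "lr": hici_lr},
    ]


groups = split_lr_groups([
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.local_constructor.queries",  # hypothetical name
    "model.layers.0.self_attn.global_integrator.proj",     # hypothetical name
    "model.norm.weight",
])
for g in groups:
    print(g["lr"], g["names"])
```

With PyTorch, the same partition over `model.named_parameters()` can be turned into per-group dictionaries with `"params"` and `"lr"` keys and passed to an optimizer such as `torch.optim.AdamW`.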
Stage 2 — SFT (instruction fine-tuning)
- Base Model: meta-llama/Llama-2-7b-hf
- Resumed from: Llama-2-7b-hici-16k pre-training checkpoint (step 1000)
- Context Length: 16,384 tokens (16K)
- Dataset: LongAlpaca-12k (12,000 long-context instruction samples)
- Epochs: 10 (max_steps=2,000)
- Segments: 4 × 4,096 tokens (fixed group size for irregular SFT sequence lengths)
- Local Representation Slots (M): 8 per segment
- Global Representation Slots (K): 4
- HiCI Attention Heads: 8, Bottleneck dim: 512, Shared compress dim: 128
- LoRA: r=8, alpha=16, target: q/k/v/o_proj
- Checkpoint: step 2,000
- Batch: per_device=1, grad_accum=8 (effective batch=8 per GPU; 64 per optimizer step across 8 GPUs)
- LR: 2e-5 (base/LoRA), 2e-4 (HiCI modules), grad clip=0.3
- Precision: bf16
- Hardware: 8× H100 80GB, DeepSpeed Stage 2
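The batch, step, and epoch figures above can be related with a short calculation. The sketch assumes standard data-parallel training, where each of the 8 GPUs processes its own micro-batch, and a dataset of exactly 12,000 samples; the function names are hypothetical.

```python
def global_batch_size(per_device, grad_accum, num_devices):
    """Sequences consumed per optimizer step under data parallelism."""
    return per_device * grad_accum * num_devices


def epochs_covered(steps, batch_size, dataset_size):
    """Number of passes over the dataset after a given number of optimizer steps."""
    return steps * batch_size / dataset_size


gbs = global_batch_size(per_device=1, grad_accum=8, num_devices=8)
epochs = epochs_covered(steps=2_000, batch_size=gbs, dataset_size=12_000)
print(gbs, round(epochs, 2))
```

Under these assumptions, 2,000 steps correspond to roughly ten passes over LongAlpaca-12k, which is consistent with the epoch count listed above.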
Usage
Requires llama_attn_hici.py from the HiCI repo.
import torch
import transformers
from peft import PeftModel
import llama_attn_hici as hici_attn

# 1. Replace attention with HiCI BEFORE loading model
hici_attn.MIXED_GROUP_TRAINING = False
hici_attn.replace_llama_attn(use_flash_attn=True, use_full=False, use_hierarchical_forward=True)

# 2. Load base model
base_model = transformers.AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16, device_map="auto",
)

# 3. Register HiCI modules (must match training config)
hici_attn.register_hici_to_model(base_model, num_memory_slots=8, global_slots=4, num_heads=8, bottleneck_dim=512)

# 4. Load LoRA adapter + HiCI weights
model = PeftModel.from_pretrained(base_model, "ZengXiangyu/Llama-2-7b-HiCI-16k-SFT")

# 5. Tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained("ZengXiangyu/Llama-2-7b-HiCI-16k-SFT")

# 6. Inference: use Llama-2 instruction format
prompt = "[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\n{user_question} [/INST]"
inputs = tokenizer(prompt.format(user_question="Summarize the following text: ..."),
                   return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
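For long inputs it can help to build the prompt with a small helper and check the token budget before calling `generate`. The sketch below reuses the prompt template from the usage example. `build_prompt` and `fits_context` are hypothetical helpers, and treating prompt tokens plus new tokens as bounded by the 16,384-token training context is an assumption, not a documented limit of the model.

```python
LLAMA2_TEMPLATE = "[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"


def build_prompt(user, system="You are a helpful assistant."):
    """Fill the Llama-2 instruction template used in the usage example."""
    return LLAMA2_TEMPLATE.format(system=system, user=user)


def fits_context(num_prompt_tokens, max_new_tokens, context_len=16_384):
    """True if prompt plus generated tokens stay within the assumed context length."""
    return num_prompt_tokens + max_new_tokens <= context_len


prompt_text = build_prompt("Summarize the following text: ...")
print(prompt_text)
print(fits_context(num_prompt_tokens=15_000, max_new_tokens=512))
```

In practice `num_prompt_tokens` would come from the tokenizer output, for example the length of `inputs["input_ids"][0]` in the usage example.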
Citation
@article{zeng2026hici,
title={HiCI: Hierarchical Construction-Integration for Long-Context Attention},
author={Zeng, Xiangyu and Xu, Qi and Wang, Yunke and Xu, Chang},
journal={arXiv preprint arXiv:2603.20843},
year={2026}
}
License
This model follows the Llama 2 Community License.