URL: https://huggingface.co/nazdef/1gpu-llm-small-en-it-base

1gpu-llm Small EN/IT Base

This repository is the current ready-to-use base release for the 1gpu-llm small EN/IT family.

1gpu-llm is a family of language models trained from scratch on a single consumer GPU.

For this release family, the reference training hardware is:

  • GPU: NVIDIA GeForce RTX 4060 Ti 16GB
  • training setup: single GPU
  • practical wall-clock target: about 1 day of training on this hardware to reach a checkpoint of the current small base release class

Concretely, this release packages the GPT2PreLN decay-only winner at step_8600:

  • family name: 1gpu-llm
  • model tier: small
  • languages: English + Italian
  • context window: 2500 tokens
  • architecture: GPT-2-style decoder with pre-layernorm blocks
  • architecture config: architecture = gpt2, block_type = gpt2_prelayernorm
  • parameter count: 136,128,000 parameters (~136.128M)
  • released checkpoint: step_8600.pt
  • checkpoint role: best decay-only checkpoint so far in the small GPT2PreLN line

This is a base model, not an instruction-tuned chat model.

Provenance

  • parent checkpoint: step_8000.pt
  • parent run: 202606212315_fresh-gpt2small-gpt2preln-k20-wsd-lr2e-4-7k-final2e5-webwiki
  • decay-only continuation run: 20260622_resume-gpt2small-gpt2preln-k20-wsds800-final2e5-webwiki-step8000-dense50
  • released checkpoint: step_8600.pt

Practical reading:

  • the no-decay parent established the strong GPT2PreLN baseline
  • the short decay-only continuation from step_8000 to step_8600 produced the best resumed-tail checkpoint
  • this repo is the current public “small base” release for the family

Training Data

This model was trained on the bilingual EN/IT web + wiki dataset:

  • dataset id on disk: 202605141153_fineweb50_wiki50_50en_50it_score100_2500context_5Btokens_tok_20260515_en50it50_webwiki_stratified_500M
  • context window during training: 2500 tokens
  • packing length: 2500
  • mixing strategy: source_balanced
  • validation ratio: 0.05

Train-split token inventory from the dataset summary:

  • train tokens: 6,899,597,399
  • English train tokens: 3,593,711,492
  • Italian train tokens: 3,305,883,508

Main source groups:

  • English FineWeb-HQ (epfml/FineWeb-HQ)
  • Italian FineWeb2-HQ (epfml/FineWeb2-HQ)
  • English Wiki40B (google/wiki40b)
  • Italian Wiki40B (google/wiki40b)

How Many Tokens This Checkpoint Saw

The released checkpoint is at step_8600.

Training math:

  • sequence length: 2500
  • batch size: 6
  • grad accumulation: 16
  • tokens per optimizer step: 240,000

So this checkpoint saw approximately:

  • 2.064B tokens total by step_8600
  • 144M extra tokens during the decay-only continuation from step_8000 to step_8600
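
The token counts above follow directly from the training math. A minimal sketch of that arithmetic, assuming every sequence is fully packed to 2500 tokens and the batch shape stays constant across both runs (the function names are illustrative, not taken from the training code):

```python
# Values copied from the "Training math" list above.
SEQUENCE_LENGTH = 2500
BATCH_SIZE = 6
GRAD_ACCUMULATION = 16


def tokens_per_step(seq_len: int, batch_size: int, grad_accum: int) -> int:
    """Tokens consumed by one optimizer step (assumes fully packed sequences)."""
    return seq_len * batch_size * grad_accum


def tokens_seen(step: int, per_step: int) -> int:
    """Cumulative tokens after `step` optimizer steps at a constant batch shape."""
    return step * per_step


per_step = tokens_per_step(SEQUENCE_LENGTH, BATCH_SIZE, GRAD_ACCUMULATION)
total = tokens_seen(8600, per_step)
tail = tokens_seen(8600, per_step) - tokens_seen(8000, per_step)

print(f"tokens per optimizer step: {per_step:,}")
print(f"tokens by step_8600: {total:,}")
print(f"tokens in the 8000 -> 8600 tail: {tail:,}")
```

This reproduces the 240,000 tokens per step, 2.064B total, and 144M tail figures quoted in this section.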

Why This Checkpoint Was Chosen

Within each of the two decay-only tails, the repo-native benchmark selected one winner:

  • 7k tail winner: step_7500
  • 8k tail winner: step_8600

The final selection metric is val_loss_mixed (lower is better), and step_8600 wins:

  • step_7500: val_loss_mixed = 4.8401
  • step_8600: val_loss_mixed = 4.7964

So this release is not “latest for the sake of latest”; it is the best benchmark-selected decay-only checkpoint.
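
The final comparison amounts to picking the candidate with the lowest val_loss_mixed. A minimal sketch of that rule, using the two values quoted above (the helper name `select_checkpoint` is hypothetical and not part of the repo-native benchmark):

```python
from typing import Mapping

# val_loss_mixed per tail winner, copied from the list above; lower is better.
candidates = {"step_7500": 4.8401, "step_8600": 4.7964}


def select_checkpoint(losses: Mapping[str, float]) -> str:
    """Return the checkpoint name with the lowest validation loss."""
    return min(losses, key=losses.get)


winner = select_checkpoint(candidates)
margin = max(candidates.values()) - min(candidates.values())
print(f"winner: {winner}, margin: {margin:.4f}")
```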

Main Metrics for step_8600

  • val_loss_mixed = 4.7964
  • val_loss_en = 4.8075
  • val_loss_it = 3.6415
  • ppl_mixed = 121.0694
  • ppl_en = 122.4273
  • ppl_it = 38.1498
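
The perplexity figures are consistent with the usual relation ppl = exp(loss). A small check, assuming the reported losses are mean per-token cross-entropies in nats (small differences are expected because the losses are rounded to four decimals):

```python
import math

# Values copied from the metrics list above.
val_losses = {"mixed": 4.7964, "en": 4.8075, "it": 3.6415}
reported_ppl = {"mixed": 121.0694, "en": 122.4273, "it": 38.1498}

for split, loss in val_losses.items():
    ppl = math.exp(loss)  # perplexity from a natural-log cross-entropy
    print(f"{split}: exp({loss}) = {ppl:.2f} (reported {reported_ppl[split]})")
```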

Behavior snapshot:

  • loop_rate = 0.525
  • distinct_2 = 0.4591
  • repeated_4gram_rate = 0.925
  • language_consistency_en = 0.975
  • language_consistency_it = 0.825

Source losses:

  • books_en = 4.8621
  • books_it = 4.5495
  • code = 8.1807
  • web_en = 5.9236
  • web_it = 5.7604
  • wiki_en = 3.4361
  • wiki_it = 3.4338

Short honest read:

  • benchmark-wise, this is the best small decay-only checkpoint so far
  • behavior is improved but not magically perfect
  • decoding choices matter a lot for actual generation quality

Probe Snapshot at step_8600

  • The capital of Italy is -> expected Rome
    • correct_token_rank = 7
    • correct_token_probability = 0.01446533203125
    • perplexity_on_target_sequence = 69.13080168776372
  • A small language model should -> expected be
    • correct_token_rank = 1
    • correct_token_probability = 0.46484375
    • perplexity_on_target_sequence = 2.1512605042016806
  • La capitale d'Italia è -> expected Roma
    • correct_token_rank = 6
    • correct_token_probability = 0.0257568359375
    • perplexity_on_target_sequence = 38.824644549763036
  • Un piccolo modello linguistico dovrebbe -> expected essere
    • correct_token_rank = 1
    • correct_token_probability = 0.4453125
    • perplexity_on_target_sequence = 2.245614035087719
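
For all four probes, the reported perplexity_on_target_sequence equals 1 / correct_token_probability, which is what one would expect when the target is a single token. The sketch below shows how rank, probability, and single-token perplexity could be derived from next-token logits. It is an illustration under that single-token assumption, not the repo's probe code, and the toy logits are hypothetical:

```python
import math


def probe_metrics(logits, target_index):
    """Rank, probability, and single-token perplexity of the target under softmax."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    p = probs[target_index]
    rank = 1 + sum(1 for q in probs if q > p)  # rank 1 means most likely token
    return {"rank": rank, "probability": p, "perplexity": 1.0 / p}


# Hypothetical 4-token vocabulary, only to exercise the function.
toy_logits = [2.0, 0.5, 1.0, -1.0]
print(probe_metrics(toy_logits, target_index=0))

# (probability, perplexity) pairs copied from the probe snapshot above.
reported = [
    (0.01446533203125, 69.13080168776372),
    (0.46484375, 2.1512605042016806),
    (0.0257568359375, 38.824644549763036),
    (0.4453125, 2.245614035087719),
]
for p, ppl in reported:
    print(f"1 / {p} = {1.0 / p:.6f} (reported {ppl:.6f})")
```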

Practical read:

  • the two procedural prompts rank the expected token first, with probability around 0.45
  • the factual prompts place the expected token at rank 7 (EN) and rank 6 (IT), much healthier than the bad pre-LN phases we saw before the decay-only cleanup
  • this is a respectable probe snapshot, not proof of omniscience

Recommended Decoding

The repo-native decoding sweep was run on this exact checkpoint.

Winner:

  • tuning winner: balanced
  • holdout winner: balanced

Recommended generation params:

  • do_sample = true
  • temperature = 0.8
  • top_k = 50
  • top_p = 0.95
  • repetition_penalty = 1.1
  • no_repeat_ngram_size = 0
  • max_new_tokens = 64
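
To make the sampling parameters concrete, here is a simplified pure-Python sketch of how temperature, top_k, and top_p reshape a next-token distribution before sampling. It is an illustration only: it is not the Transformers implementation, it omits repetition_penalty, and the toy logits are hypothetical.

```python
import math
import random


def filter_distribution(logits, temperature=0.8, top_k=50, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k, and top-p filtering."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [x / temperature for x in logits]
    # top-k: keep only the k highest-scoring tokens, best first.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index, p in zip(order, probs):
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {index: p / norm for index, p in kept}


toy_logits = [3.0, 2.0, 1.0, 0.0, -5.0]  # hypothetical 5-token vocabulary
kept = filter_distribution(toy_logits, top_k=3)
token = random.choices(list(kept), weights=list(kept.values()), k=1)[0]
print(kept, token)
```

With these settings, low-probability tail tokens are dropped before sampling, which is one plausible reason the preset reduces looping compared with unfiltered sampling.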

Holdout metrics for the recommended preset:

  • completion_rate = 1.0
  • distinct_2 = 0.9244
  • language_consistency_mean = 0.9762
  • loop_rate = 0.0
  • repeated_4gram_rate = 0.1667
  • language_switch_rate_mean = 0.0
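
This card does not define the metrics above. As a rough guide, the sketch below uses common definitions: distinct_2 as unique bigrams divided by total bigrams, and repeated_4gram_rate as the share of generations containing at least one repeated 4-gram. Both definitions, the whitespace tokenization, and the sample texts are assumptions; the repo-native benchmark may compute these differently.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(tokens, n=2):
    """Unique n-grams divided by total n-grams (0.0 for too-short inputs)."""
    grams = ngrams(tokens, n)
    return len(set(grams)) / len(grams) if grams else 0.0


def has_repeated_ngram(tokens, n=4):
    """True if any n-gram occurs more than once."""
    grams = ngrams(tokens, n)
    return len(set(grams)) < len(grams)


def repeated_ngram_rate(generations, n=4):
    """Share of generations that contain at least one repeated n-gram."""
    flags = [has_repeated_ngram(text.split(), n) for text in generations]
    return sum(flags) / len(flags) if flags else 0.0


# Hypothetical generations, only to exercise the functions.
samples = [
    "the cat sat on the mat and the cat sat on the mat",
    "a small language model should be evaluated carefully",
]
for text in samples:
    print(round(distinct_n(text.split()), 4), has_repeated_ngram(text.split()))
print(repeated_ngram_rate(samples))
```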

Both generation_config.json and recommended_decoding_params.json are included in the repo.

Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo_id = "nazdef/1gpu-llm-small-en-it-base"

# Download (or load from cache) the tokenizer and the Transformers-format weights.
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Tokenize without automatic special tokens, then prepend BOS explicitly.
prompt = "A small language model should"
prompt_ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
bos = torch.tensor([[tokenizer.bos_token_id]], dtype=prompt_ids["input_ids"].dtype)
input_ids = torch.cat([bos, prompt_ids["input_ids"]], dim=1)
attention_mask = torch.ones_like(input_ids)

# Sample with the recommended "balanced" decoding preset.
outputs = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    do_sample=True,
    max_new_tokens=64,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Files Included

  • original .pt checkpoint
  • exported checkpoint-native .safetensors weights plus metadata sidecar
  • standard Transformers model.safetensors
  • Transformers config.json
  • tokenizer files
  • training config
  • resumed-run telemetry (best_validation.json, metrics.jsonl, eval_metrics.jsonl, probe_generations.jsonl)
  • repo-native tail benchmark bundle (eval_summary.json, comparison.json, benchmark_report.md, benchmark_metrics.json, benchmark_scores.json, benchmark_source_losses.json)
  • decoding search bundle (decoding_summary.json, decoding_report.md, tuning_leaderboard.csv, holdout_leaderboard.csv, tuning_generations.jsonl, holdout_generations.jsonl)
  • probe summary (probe_step8600_summary.json)
  • recommended generation settings (generation_config.json, recommended_decoding_params.json)
  • release note (release_note.md)

Intended Use

Use this model as:

  • the current small bilingual base checkpoint of the 1gpu-llm family
  • a starting point for decoding experiments, evaluation, and downstream adaptation
  • a compact EN/IT base model for local single-GPU experimentation

Limitations

  • this is a benchmark-selected base checkpoint, not an instruction-following assistant
  • free-form generations still depend strongly on decoding parameters
  • repetition and factual brittleness are improved, not solved forever by divine intervention
  • English and Italian are both supported, but quality is uneven across prompts and domains

License

  • released under CC BY-SA 4.0
  • trained on English FineWeb-HQ, Italian FineWeb2-HQ, English Wiki40B, Italian Wiki40B
  • FineWeb-HQ and FineWeb2-HQ are released under ODC-By v1.0 and are subject to the Common Crawl terms
  • Wiki40B is derived from Wikipedia and inherits its CC BY-SA license
  • users are responsible for compliant downstream use

Summary

If you want the current best small decay-only GPT2PreLN checkpoint from this project in a form that is actually ready to load and use, this is the one.
