1gpu-llm Small EN/IT Base
This repository is the current ready-to-use base release for the 1gpu-llm small EN/IT family.
1gpu-llm is a family of language models trained from scratch on a single consumer GPU.
For this release family, the reference training hardware is:
- GPU: NVIDIA GeForce RTX 4060 Ti 16GB
- training setup: single GPU
- practical wall-clock target: about 1 day on this hardware to reach a checkpoint of the current small base release class
Concretely, this release packages the GPT2PreLN decay-only winner at step_8600:
- family name: 1gpu-llm
- model tier: small
- languages: English + Italian
- context window: 2500 tokens
- architecture: GPT-2-style decoder with pre-layernorm blocks
- architecture config: architecture: gpt2, block_type: gpt2_prelayernorm
- parameter count: 136,128,000 parameters (~136.128M)
- released checkpoint: step_8600.pt
- checkpoint role: best decay-only checkpoint so far in the small GPT2PreLN line
This is a base model, not an instruction-tuned chat model.
Provenance
- parent checkpoint: step_8000.pt
- parent run: 202606212315_fresh-gpt2small-gpt2preln-k20-wsd-lr2e-4-7k-final2e5-webwiki
- decay-only continuation run: 20260622_resume-gpt2small-gpt2preln-k20-wsds800-final2e5-webwiki-step8000-dense50
- released checkpoint: step_8600.pt
Practical reading:
- the no-decay parent established the strong GPT2PreLN baseline
- the short decay-only continuation from step_8000 to step_8600 produced the best resumed-tail checkpoint
- this repo is the current public “small base” release for the family
Training Data
This model was trained on the bilingual EN/IT web + wiki dataset:
- dataset id on disk: 202605141153_fineweb50_wiki50_50en_50it_score100_2500context_5Btokens_tok_20260515_en50it50_webwiki_stratified_500M
- context window during training: 2500 tokens
- packing length: 2500
- mixing strategy: source_balanced
- validation ratio: 0.05
Train-split token inventory from the dataset summary:
- train tokens: 6,899,597,399
- English train tokens: 3,593,711,492
- Italian train tokens: 3,305,883,508
Main source groups:
- English FineWeb-HQ (epfml/FineWeb-HQ)
- Italian FineWeb2-HQ (epfml/FineWeb2-HQ)
- English Wiki40B (google/wiki40b)
- Italian Wiki40B (google/wiki40b)
How Many Tokens This Checkpoint Saw
The released checkpoint is at step_8600.
Training math:
- sequence length: 2500
- batch size: 6
- grad accumulation: 16
- tokens per optimizer step: 240,000
So this checkpoint saw approximately:
- 2.064B tokens total by step_8600
- 144M extra tokens during the decay-only continuation from step_8000 to step_8600
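The training math above is plain arithmetic and can be re-derived in a few lines. This is a minimal sketch that uses only numbers quoted in this card; the variable names are illustrative and not taken from the training code, and it assumes every optimizer step consumed a full batch of full-length packed sequences.

```python
# Values quoted in this card; names are illustrative, not from the training code.
SEQ_LEN = 2500
BATCH_SIZE = 6
GRAD_ACCUM = 16
PARENT_STEP = 8000
RELEASED_STEP = 8600
TRAIN_TOKENS = 6_899_597_399  # train-split inventory from the dataset summary

# One optimizer step = GRAD_ACCUM micro-batches of BATCH_SIZE sequences each.
tokens_per_step = SEQ_LEN * BATCH_SIZE * GRAD_ACCUM
total_tokens = tokens_per_step * RELEASED_STEP
decay_tail_tokens = tokens_per_step * (RELEASED_STEP - PARENT_STEP)

# Rough share of the train split seen so far (assumes no sequence was repeated).
fraction_of_train_split = total_tokens / TRAIN_TOKENS

print(f"tokens per optimizer step: {tokens_per_step:,}")
print(f"tokens seen by step_{RELEASED_STEP}: {total_tokens:,}")
print(f"tokens in the decay-only tail: {decay_tail_tokens:,}")
print(f"share of the train split: {fraction_of_train_split:.1%}")
```

Under those assumptions the checkpoint has seen roughly 30% of the train split, so it is still well short of one full pass over the data.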
Why This Checkpoint Was Chosen
Inside the two decay-only tails, the repo-native benchmark selected:
- 7k tail winner: step_7500
- 8k tail winner: step_8600
The final selection metric is val_loss_mixed, and step_8600 wins:
- step_7500: val_loss_mixed = 4.8401
- step_8600: val_loss_mixed = 4.7964
So this release is not “latest for the sake of latest”; it is the best benchmark-selected decay-only checkpoint.
Main Metrics for step_8600
- val_loss_mixed = 4.7964
- val_loss_en = 4.8075
- val_loss_it = 3.6415
- ppl_mixed = 121.0694
- ppl_en = 122.4273
- ppl_it = 38.1498
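The perplexity figures are consistent with perplexity being the exponential of the validation loss, which holds if the loss is a mean per-token cross-entropy in nats. The card does not state the formula, so treat that as an assumption; the sketch below only checks that the reported pairs agree with it up to rounding.

```python
import math

def perplexity_from_loss(val_loss):
    """Perplexity implied by a mean per-token cross-entropy given in nats."""
    return math.exp(val_loss)

# (val_loss, reported perplexity) pairs copied from this card.
reported = {
    "mixed": (4.7964, 121.0694),
    "en": (4.8075, 122.4273),
    "it": (3.6415, 38.1498),
}

for name, (val_loss, reported_ppl) in reported.items():
    recomputed = perplexity_from_loss(val_loss)
    # Small gaps are expected because the losses are rounded to 4 decimals.
    print(f"{name}: recomputed={recomputed:.4f} reported={reported_ppl:.4f}")
```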
Behavior snapshot:
- loop_rate = 0.525
- distinct_2 = 0.4591
- repeated_4gram_rate = 0.925
- language_consistency_en = 0.975
- language_consistency_it = 0.825
Source losses:
- books_en = 4.8621
- books_it = 4.5495
- code = 8.1807
- web_en = 5.9236
- web_it = 5.7604
- wiki_en = 3.4361
- wiki_it = 3.4338
Short honest read:
- benchmark-wise, this is the best small decay-only checkpoint so far
- behavior is improved but not magically perfect
- decoding choices matter a lot for actual generation quality
Probe Snapshot at step_8600
- The capital of Italy is -> expected Rome
  - correct_token_rank = 7
  - correct_token_probability = 0.01446533203125
  - perplexity_on_target_sequence = 69.13080168776372
- A small language model should -> expected be
  - correct_token_rank = 1
  - correct_token_probability = 0.46484375
  - perplexity_on_target_sequence = 2.1512605042016806
- La capitale d'Italia è -> expected Roma
  - correct_token_rank = 6
  - correct_token_probability = 0.0257568359375
  - perplexity_on_target_sequence = 38.824644549763036
- Un piccolo modello linguistico dovrebbe -> expected essere
  - correct_token_rank = 1
  - correct_token_probability = 0.4453125
  - perplexity_on_target_sequence = 2.245614035087719
Practical read:
- the procedural prompts are clean: the expected token is top-1 in both languages
- the factual EN/IT prompts still miss top-1 (rank 7 for Rome, rank 6 for Roma) but are much healthier than the bad pre-LN phases we saw before the decay-only cleanup
- this is a respectable probe snapshot, not proof of omniscience
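In all four probes the reported perplexity_on_target_sequence equals the reciprocal of correct_token_probability, which is what sequence perplexity reduces to when the target is a single token. Treating each target as one token is an inference from the numbers, not something the card states. A minimal sketch of that check, with an illustrative helper name:

```python
import math

def sequence_perplexity(token_probs):
    """Perplexity of a target sequence: exp of the mean negative log-probability."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# expected token -> (correct_token_probability, perplexity_on_target_sequence)
probes = {
    "Rome": (0.01446533203125, 69.13080168776372),
    "be": (0.46484375, 2.1512605042016806),
    "Roma": (0.0257568359375, 38.824644549763036),
    "essere": (0.4453125, 2.245614035087719),
}

for target, (prob, reported_ppl) in probes.items():
    # With a one-token target the mean is over a single term, so ppl = 1 / prob.
    recomputed = sequence_perplexity([prob])
    print(f"{target}: recomputed={recomputed:.6f} reported={reported_ppl:.6f}")
```

For a multi-token target the same helper would take one probability per target token, and the result would no longer be a simple reciprocal.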
Recommended Decoding
The repo-native decoding sweep was run on this exact checkpoint.
Winner:
- tuning winner: balanced
- holdout winner: balanced
Recommended generation params:
- do_sample = true
- temperature = 0.8
- top_k = 50
- top_p = 0.95
- repetition_penalty = 1.1
- no_repeat_ngram_size = 0
- max_new_tokens = 64
Holdout metrics for the recommended preset:
- completion_rate = 1.0
- distinct_2 = 0.9244
- language_consistency_mean = 0.9762
- loop_rate = 0.0
- repeated_4gram_rate = 0.1667
- language_switch_rate_mean = 0.0
Both generation_config.json and recommended_decoding_params.json are included in the repo.
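The recommended preset maps one-to-one onto keyword arguments of the Transformers generate method. The sketch below copies the values from this card into a plain dictionary; the names RECOMMENDED_DECODING and with_overrides are illustrative, and the contents of recommended_decoding_params.json are not reproduced here because its schema is not shown in this card.

```python
# Recommended "balanced" preset, copied from this card.
RECOMMENDED_DECODING = {
    "do_sample": True,
    "temperature": 0.8,
    "top_k": 50,
    "top_p": 0.95,
    "repetition_penalty": 1.1,
    "no_repeat_ngram_size": 0,  # 0 leaves n-gram blocking disabled
    "max_new_tokens": 64,
}

def with_overrides(**overrides):
    """Return a copy of the preset with selected values replaced."""
    unknown = set(overrides) - set(RECOMMENDED_DECODING)
    if unknown:
        raise KeyError(f"not part of the preset: {sorted(unknown)}")
    return {**RECOMMENDED_DECODING, **overrides}

# Example: keep the preset but allow longer completions.
longer = with_overrides(max_new_tokens=128)
print(longer)

# With a loaded model and input_ids (see Quick Start), this would be used as:
# outputs = model.generate(input_ids=input_ids, **longer)
```

Keeping the preset in one place makes it easier to change a single knob during decoding experiments without drifting from the benchmarked settings by accident.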
Quick Start
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "nazdef/1gpu-llm-small-en-it-base"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

prompt = "A small language model should"

# Tokenize without automatic special tokens, then prepend BOS by hand.
prompt_ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
bos = torch.tensor([[tokenizer.bos_token_id]], dtype=prompt_ids["input_ids"].dtype)
input_ids = torch.cat([bos, prompt_ids["input_ids"]], dim=1)
attention_mask = torch.ones_like(input_ids)

# Sample with the recommended decoding parameters.
outputs = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    do_sample=True,
    max_new_tokens=64,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Files Included
- original .pt checkpoint
- exported checkpoint-native .safetensors weights plus metadata sidecar
- standard Transformers model.safetensors
- Transformers config.json
- tokenizer files
- training config
- resumed-run telemetry (best_validation.json, metrics.jsonl, eval_metrics.jsonl, probe_generations.jsonl)
- repo-native tail benchmark bundle (eval_summary.json, comparison.json, benchmark_report.md, benchmark_metrics.json, benchmark_scores.json, benchmark_source_losses.json)
- decoding search bundle (decoding_summary.json, decoding_report.md, tuning_leaderboard.csv, holdout_leaderboard.csv, tuning_generations.jsonl, holdout_generations.jsonl)
- probe summary probe_step8600_summary.json
- recommended generation settings (generation_config.json, recommended_decoding_params.json)
- release note release_note.md
Intended Use
Use this model as:
- the current small bilingual base checkpoint of the 1gpu-llm family
- a starting point for decoding experiments, evaluation, and downstream adaptation
- a compact EN/IT base model for local single-GPU experimentation
Limitations
- this is a benchmark-selected base checkpoint, not an instruction-following assistant
- free-form generations still depend strongly on decoding parameters
- repetition and factual brittleness are improved, not solved forever by divine intervention
- English and Italian are both supported, but quality is uneven across prompts and domains
License
- released under CC BY-SA 4.0
- trained on English FineWeb-HQ, Italian FineWeb2-HQ, English Wiki40B, Italian Wiki40B
- FineWeb/FineWeb2-HQ are under ODC-By v1.0 and subject to Common Crawl terms
- Wiki40B is derived from Wikipedia and inherits CC BY-SA
- users are responsible for compliant downstream use
Summary
If you want the current best small decay-only GPT2PreLN checkpoint from this project in a form that is actually ready to load and use, this is the one.
