1gpu-llm Small EN/IT Base

This repository is the current ready-to-use base release for the 1gpu-llm small EN/IT family.

1gpu-llm is a family of language models trained from scratch on a single consumer GPU.

For this release family, the reference training hardware is:

GPU: NVIDIA GeForce RTX 4060 Ti 16GB
training setup: single GPU
practical wall-clock target for the small tier: about 1 day on this hardware to reach a model in the class of the current small base release

Concretely, this release packages the GPT2PreLN decay-only winner at step_8600:

family name: 1gpu-llm
model tier: small
languages: English + Italian
context window: 2500 tokens
architecture: GPT-2-style decoder with pre-layernorm blocks
architecture config: `architecture: gpt2`, `block_type: gpt2_prelayernorm`
parameter count: 136,128,000 parameters (~136.128M)
released checkpoint: step_8600.pt
checkpoint role: best decay-only checkpoint so far in the small GPT2PreLN line

This is a base model, not an instruction-tuned chat model.

Provenance

parent checkpoint: step_8000.pt
parent run: 202606212315_fresh-gpt2small-gpt2preln-k20-wsd-lr2e-4-7k-final2e5-webwiki
decay-only continuation run: 20260622_resume-gpt2small-gpt2preln-k20-wsds800-final2e5-webwiki-step8000-dense50
released checkpoint: step_8600.pt

Practical reading:

the no-decay parent established the strong GPT2PreLN baseline
the short decay-only continuation from step_8000 to step_8600 produced the best resumed-tail checkpoint
this repo is the current public “small base” release for the family

Training Data

This model was trained on the bilingual EN/IT web + wiki dataset:

dataset id on disk: 202605141153_fineweb50_wiki50_50en_50it_score100_2500context_5Btokens_tok_20260515_en50it50_webwiki_stratified_500M
context window during training: 2500 tokens
packing length: 2500
mixing strategy: source_balanced
validation ratio: 0.05

Train-split token inventory from the dataset summary:

train tokens: 6,899,597,399
English train tokens: 3,593,711,492
Italian train tokens: 3,305,883,508

Main source groups:

English FineWeb-HQ (epfml/FineWeb-HQ)
Italian FineWeb2-HQ (epfml/FineWeb2-HQ)
English Wiki40B (google/wiki40b)
Italian Wiki40B (google/wiki40b)

How Many Tokens This Checkpoint Saw

The released checkpoint is at step_8600.

Training math:

sequence length: 2500
batch size: 6
grad accumulation: 16
tokens per optimizer step: 240,000

So this checkpoint saw approximately:

2.064B tokens total by step_8600
144M extra tokens during the decay-only continuation from step_8000 to step_8600
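
The arithmetic above can be reproduced in a few lines. This is a minimal sketch that only uses numbers quoted in this card; it assumes every optimizer step consumed a full batch of fully packed 2500-token sequences, and the function names are illustrative, not taken from the training code.

```python
# Values copied from this model card.
SEQ_LEN = 2500
BATCH_SIZE = 6
GRAD_ACCUM = 16
TRAIN_TOKENS = 6_899_597_399  # train-split token inventory


def tokens_per_step(seq_len: int, batch_size: int, grad_accum: int) -> int:
    """Tokens consumed by one optimizer step (assumes full, packed batches)."""
    return seq_len * batch_size * grad_accum


def tokens_seen(step: int, per_step: int) -> int:
    """Cumulative tokens after `step` optimizer steps."""
    return step * per_step


per_step = tokens_per_step(SEQ_LEN, BATCH_SIZE, GRAD_ACCUM)
total = tokens_seen(8600, per_step)
tail = total - tokens_seen(8000, per_step)
share = total / TRAIN_TOKENS  # fraction of the train split, if nothing repeats

print(per_step)          # 240000
print(total)             # 2064000000
print(tail)              # 144000000
print(f"{share:.1%}")    # 29.9%
```

Under these assumptions the checkpoint has seen well under one pass over the train split.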

Why This Checkpoint Was Chosen

Inside the two decay-only tails, the repo-native benchmark selected:

7k tail winner: step_7500
8k tail winner: step_8600

The final selection metric is val_loss_mixed, and step_8600 wins:

step_7500: val_loss_mixed = 4.8401
step_8600: val_loss_mixed = 4.7964

So this release is not “latest for the sake of latest”; it is the best benchmark-selected decay-only checkpoint.
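
The selection rule amounts to taking the minimum of val_loss_mixed over the tail winners. A minimal sketch, using only the two values quoted above (the dictionary layout is an assumption, not the format of the benchmark bundle):

```python
# val_loss_mixed for the two tail winners, copied from this card.
candidates = {
    "step_7500": 4.8401,
    "step_8600": 4.7964,
}

# Lower validation loss is better, so the winner is the argmin.
winner = min(candidates, key=candidates.get)
margin = max(candidates.values()) - min(candidates.values())

print(winner)            # step_8600
print(round(margin, 4))  # 0.0437
```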

Main Metrics for `step_8600`

val_loss_mixed = 4.7964
val_loss_en = 4.8075
val_loss_it = 3.6415
ppl_mixed = 121.0694
ppl_en = 122.4273
ppl_it = 38.1498
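
The perplexities listed here are consistent with the usual relation ppl = exp(loss), with the loss in nats per token. The sketch below checks that against the numbers in this card; small differences are expected because the losses are rounded to four decimals.

```python
import math

# Losses and perplexities copied from this card.
losses = {"mixed": 4.7964, "en": 4.8075, "it": 3.6415}
reported_ppl = {"mixed": 121.0694, "en": 122.4273, "it": 38.1498}

# Perplexity is the exponential of the mean cross-entropy loss.
computed_ppl = {name: math.exp(loss) for name, loss in losses.items()}

for name in losses:
    diff = abs(computed_ppl[name] - reported_ppl[name])
    print(f"{name}: computed={computed_ppl[name]:.2f} "
          f"reported={reported_ppl[name]:.2f} abs_diff={diff:.4f}")
```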

Behavior snapshot:

loop_rate = 0.525
distinct_2 = 0.4591
repeated_4gram_rate = 0.925
language_consistency_en = 0.975
language_consistency_it = 0.825

Source losses:

books_en = 4.8621
books_it = 4.5495
code = 8.1807
web_en = 5.9236
web_it = 5.7604
wiki_en = 3.4361
wiki_it = 3.4338

Short honest read:

benchmark-wise, this is the best small decay-only checkpoint so far
behavior is improved but not magically perfect
decoding choices matter a lot for actual generation quality

Probe Snapshot at `step_8600`

The capital of Italy is -> expected Rome
- correct_token_rank = 7
- correct_token_probability = 0.01446533203125
- perplexity_on_target_sequence = 69.13080168776372
A small language model should -> expected be
- correct_token_rank = 1
- correct_token_probability = 0.46484375
- perplexity_on_target_sequence = 2.1512605042016806
La capitale d'Italia è -> expected Roma
- correct_token_rank = 6
- correct_token_probability = 0.0257568359375
- perplexity_on_target_sequence = 38.824644549763036
Un piccolo modello linguistico dovrebbe -> expected essere
- correct_token_rank = 1
- correct_token_probability = 0.4453125
- perplexity_on_target_sequence = 2.245614035087719

Practical read:

the procedural prompts are clean and top-1
the factual EN/IT prompts fare much better than in the weak pre-LN phases seen before the decay-only cleanup, though the expected token is still only rank 6 to 7
this is a respectable probe snapshot, not proof of omniscience
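
For single-token targets like these, the three probe numbers can all be derived from the next-token logits. The reported perplexity_on_target_sequence values match 1 / correct_token_probability. Below is a minimal pure-Python sketch; `probe_next_token` and the toy logits are hypothetical and are not the repo's probe code.

```python
import math


def probe_next_token(logits, target_id):
    """Return (rank, probability, perplexity) of `target_id` under softmax(logits).

    Rank is 1-based: rank 1 means the target is the most likely token.
    For a one-token target, perplexity reduces to 1 / probability.
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    prob = exps[target_id] / sum(exps)
    rank = 1 + sum(1 for x in logits if x > logits[target_id])
    return rank, prob, 1.0 / prob


# Toy 4-token vocabulary; the target is the second most likely token.
rank, prob, ppl = probe_next_token([2.0, 1.0, 0.0, -1.0], target_id=1)
print(rank, round(prob, 4), round(ppl, 4))

# Consistency check against two probes reported in this card.
print(1 / 0.01446533203125)  # compare with 69.1308... for "Rome"
print(1 / 0.46484375)        # compare with 2.1512... for "be"
```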

Recommended Decoding

The repo-native decoding sweep was run on this exact checkpoint.

Winner:

tuning winner: balanced
holdout winner: balanced

Recommended generation params:

do_sample = true
temperature = 0.8
top_k = 50
top_p = 0.95
repetition_penalty = 1.1
no_repeat_ngram_size = 0
max_new_tokens = 64
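
To make the effect of temperature, top_k, and top_p concrete, here is a simplified pure-Python sketch of how these three settings narrow the next-token distribution before sampling. It is an illustration under stated assumptions, not the Transformers implementation: the helper names are hypothetical, the logits are toy values, and repetition_penalty is not modeled.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def filtered_distribution(logits, temperature=0.8, top_k=50, top_p=0.95):
    """Apply temperature, then top-k, then top-p; return {token_id: prob}."""
    scaled = [x / temperature for x in logits]  # <1.0 sharpens the distribution
    # Top-k: keep the k highest-scoring token ids.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    order = order[:top_k]
    probs = softmax([scaled[i] for i in order])
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id, p in zip(order, probs):
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break
    z = sum(p for _, p in kept)  # renormalize over the surviving tokens
    return {token_id: p / z for token_id, p in kept}


def sample(dist, rng):
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


toy_logits = [4.0, 3.0, 2.0, 0.0, -5.0]
dist = filtered_distribution(toy_logits, temperature=0.8, top_k=3, top_p=0.95)
tight = filtered_distribution(toy_logits, temperature=0.8, top_k=3, top_p=0.9)
print(sorted(dist))   # token ids that survive top_k=3, top_p=0.95
print(sorted(tight))  # a lower top_p prunes the tail further
print(sample(dist, random.Random(0)))
```

Lowering top_p or top_k removes low-probability tokens, which trades diversity for fewer off-track continuations.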

Holdout metrics for the recommended preset:

completion_rate = 1.0
distinct_2 = 0.9244
language_consistency_mean = 0.9762
loop_rate = 0.0
repeated_4gram_rate = 0.1667
language_switch_rate_mean = 0.0

Both generation_config.json and recommended_decoding_params.json are included in the repo.

Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo_id = "nazdef/1gpu-llm-small-en-it-base"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Tokenize without automatic special tokens, then prepend BOS explicitly.
prompt = "A small language model should"
prompt_ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
bos = torch.tensor([[tokenizer.bos_token_id]], dtype=prompt_ids["input_ids"].dtype)
input_ids = torch.cat([bos, prompt_ids["input_ids"]], dim=1)
attention_mask = torch.ones_like(input_ids)

# Generate with the recommended "balanced" decoding preset.
outputs = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    do_sample=True,
    max_new_tokens=64,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Files Included

original .pt checkpoint
exported checkpoint-native .safetensors weights plus metadata sidecar
standard Transformers model.safetensors
Transformers config.json
tokenizer files
training config
resumed-run telemetry (best_validation.json, metrics.jsonl, eval_metrics.jsonl, probe_generations.jsonl)
repo-native tail benchmark bundle (eval_summary.json, comparison.json, benchmark_report.md, benchmark_metrics.json, benchmark_scores.json, benchmark_source_losses.json)
decoding search bundle (decoding_summary.json, decoding_report.md, tuning_leaderboard.csv, holdout_leaderboard.csv, tuning_generations.jsonl, holdout_generations.jsonl)
probe summary (probe_step8600_summary.json)
recommended generation settings (generation_config.json, recommended_decoding_params.json)
release note (release_note.md)

Intended Use

Use this model as:

the current small bilingual base checkpoint of the 1gpu-llm family
a starting point for decoding experiments, evaluation, and downstream adaptation
a compact EN/IT base model for local single-GPU experimentation

Limitations

this is a benchmark-selected base checkpoint, not an instruction-following assistant
free-form generations still depend strongly on decoding parameters
repetition and factual brittleness are improved, not solved forever by divine intervention
English and Italian are both supported, but quality is uneven across prompts and domains

License

the model is released under CC BY-SA 4.0
it was trained on English FineWeb-HQ, Italian FineWeb2-HQ, English Wiki40B, and Italian Wiki40B
FineWeb-HQ and FineWeb2-HQ are under ODC-By v1.0 and subject to the Common Crawl terms
Wiki40B is derived from Wikipedia and inherits CC BY-SA
users are responsible for compliant downstream use

Summary

If you want the current best small decay-only GPT2PreLN checkpoint from this project in a form that is actually ready to load and use, this is the one.

URL: https://huggingface.co/nazdef/1gpu-llm-small-en-it-base
