GPT2Medium EN/IT NanoChat GPT2PreLN step_13500
This repository publishes the best scalar checkpoint from the medium GPT2PreLN anchor20k run evaluated in the full repo-native sweep currently recorded in this workspace.
This is an ordinary checkpoint release, not the definitive public-facing family base release.
Released checkpoint summary:
- released checkpoint: step_13500.pt
- parent run: stable-recipe-gpt2medium-gpt2preln-k20-wsd-lr2e-4-anchor20k-final2e5-webwiki
- languages: English + Italian
- context window: 2500 tokens
- architecture: GPT-2-style decoder with pre-layernorm blocks
- architecture config: architecture: gpt2, block_type: gpt2_prelayernorm
- training-time parameter count: 337,671,424
- published Transformers-export parameter count: 337,639,424
- hardware class: single consumer GPU (RTX 4060 Ti 16GB) for training, CPU-benchmarked selection
This is a base model, not an instruction-tuned chat model.
Why This Checkpoint Was Chosen
The full checkpoint sweep across the evaluated medium run selected step_13500 as the best checkpoint on the main benchmark metric:
- step_13500: val_loss_mixed = 4.4652
- step_14500: val_loss_mixed = 4.5796
- step_9500: val_loss_mixed = 4.5982
- step_12500: val_loss_mixed = 4.6220
- step_13000: val_loss_mixed = 4.6226
So this release is not “latest wins because timestamp”. It is the current benchmark-selected medium winner for this run family.
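The selection rule described above amounts to taking the minimum of val_loss_mixed over the evaluated checkpoints. A minimal sketch, using only the five values listed above; the helper name rank_checkpoints is hypothetical and is not taken from the repository's sweep tooling:

```python
# Validation losses copied from the sweep results listed above.
val_loss_mixed = {
    "step_13500": 4.4652,
    "step_14500": 4.5796,
    "step_9500": 4.5982,
    "step_12500": 4.6220,
    "step_13000": 4.6226,
}


def rank_checkpoints(losses):
    """Return (name, loss) pairs sorted from lowest to highest loss."""
    return sorted(losses.items(), key=lambda item: item[1])


ranking = rank_checkpoints(val_loss_mixed)
best_name, best_loss = ranking[0]
runner_up_name, runner_up_loss = ranking[1]
margin = runner_up_loss - best_loss

print(f"best: {best_name} ({best_loss:.4f})")
print(f"margin over {runner_up_name}: {margin:.4f}")
```

Sorting by loss rather than by step number is what makes the choice benchmark-driven: a later checkpoint only wins if its loss is actually lower.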
Best-Behavior Shortlist
The scalar winner is step_13500, but the behavior shortlist is:
- step_13500
- step_14000
- step_14500
Why step_14000 remained interesting:
- loop_rate = 0.250
- distinct_2 = 0.6081
- repeated_4gram_rate = 0.725
- language_consistency_it = 0.875
Why step_14500 is a good compromise:
- val_loss_mixed = 4.5796
- distinct_2 = 0.5694
- repeated_4gram_rate = 0.750
- language_consistency_it = 0.850
Practical reading:
- step_13500 is the best scalar checkpoint
- step_14000 is one of the cleanest checkpoints on repetition/behavior proxies
- step_14500 trades a little scalar loss for a still-healthy qualitative profile
Main Metrics for step_13500
- val_loss_mixed = 4.4652
- val_loss_en = 4.3219
- val_loss_it = 3.5930
- ppl_mixed = 86.9371
- ppl_en = 75.3304
- ppl_it = 36.3446
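The reported perplexities are consistent with ppl = exp(val_loss), assuming the losses are mean per-token cross-entropy in nats (the usual convention, though the card does not state it). A small check under that assumption; the differences in the third decimal are what one would expect if the perplexities were computed from unrounded losses:

```python
import math

# Values copied from the metrics listed above.
losses = {"mixed": 4.4652, "en": 4.3219, "it": 3.5930}
reported_ppl = {"mixed": 86.9371, "en": 75.3304, "it": 36.3446}

# Recompute perplexity from each rounded loss and compare to the reported value.
recomputed_ppl = {name: math.exp(loss) for name, loss in losses.items()}

for name in losses:
    diff = abs(recomputed_ppl[name] - reported_ppl[name])
    print(f"{name}: exp(loss) = {recomputed_ppl[name]:.4f}, "
          f"reported = {reported_ppl[name]:.4f}, |diff| = {diff:.4f}")
```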
Behavior snapshot:
- loop_rate = 0.325
- distinct_2 = 0.5268
- repeated_4gram_rate = 0.800
- language_consistency_en = 0.975
- language_consistency_it = 0.850
- cloze_en_contains = 0.16
- cloze_it_contains = 0.30
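This card does not define the repetition proxies, so the following is only an illustrative sketch of how metrics with these names are commonly computed. It is not the repository's evaluation code. The assumptions are: whitespace tokenization, distinct_2 as unique bigrams divided by total bigrams, and repeated_4gram_rate as the fraction of generations containing at least one repeated 4-gram. All function names are hypothetical.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(tokens, n=2):
    """Unique n-grams divided by total n-grams (assumed definition)."""
    grams = ngrams(tokens, n)
    return len(set(grams)) / len(grams) if grams else 0.0


def has_repeated_ngram(tokens, n=4):
    """True if any n-gram occurs more than once in the token list."""
    grams = ngrams(tokens, n)
    return len(set(grams)) < len(grams)


def repeated_ngram_rate(samples, n=4):
    """Fraction of samples containing a repeated n-gram (assumed definition)."""
    flags = [has_repeated_ngram(sample.split(), n) for sample in samples]
    return sum(flags) / len(flags) if flags else 0.0


# Toy generations, not model outputs: the second one loops.
samples = ["the cat sat on the mat", "a b c d a b c d"]
looping_distinct_2 = distinct_n(samples[1].split(), n=2)
rate = repeated_ngram_rate(samples, n=4)
print(f"distinct_2 of looping sample: {looping_distinct_2:.4f}")
print(f"repeated_4gram_rate: {rate:.3f}")
```

Under these definitions, lower distinct_2 and higher repeated_4gram_rate both point toward loopier generations, which matches how the shortlist above reads the numbers.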
Source losses:
- books_en = 4.4590
- books_it = 4.5447
- code = 7.2659
- web_en = 5.9069
- web_it = 5.4595
- wiki_en = 3.0141
- wiki_it = 3.0585
Honest short read:
- this is the best scalar medium checkpoint seen so far in this run family
- it is healthier than the later collapsed tail
- it is still a pretrained checkpoint with visible repetition artifacts, not a magically polished model
Training Data
This model was trained on the bilingual EN/IT web + wiki dataset:
- dataset id on disk: 202605141153_fineweb50_wiki50_50en_50it_score100_2500context_5Btokens_tok_20260515_en50it50_webwiki_stratified_500M
- context window during training: 2500 tokens
- packing length: 2500
- mixing strategy: source_balanced
- validation ratio: 0.05
Main source groups:
- English FineWeb-HQ (epfml/FineWeb-HQ)
- Italian FineWeb2-HQ (epfml/FineWeb2-HQ)
- English Wiki40B (google/wiki40b)
- Italian Wiki40B (google/wiki40b)
How Many Tokens This Checkpoint Saw
Training math:
- sequence length: 2500
- batch size: 2
- grad accumulation: 48
- tokens per optimizer step: 239,904
So step_13500 saw approximately:
- 3.238704B tokens total
- K = 9.5913 tokens per parameter relative to the native training-time parameter count
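The arithmetic behind these figures can be reproduced from the numbers above. Note that 239,904 equals 2 × 48 × 2499, not 2 × 48 × 2500. One plausible reading is that each 2500-token packed sequence yields 2499 next-token targets, but that interpretation is an assumption and is not stated in this card.

```python
# Values copied from the training math above.
batch_size = 2
grad_accumulation = 48
tokens_per_step = 239_904
steps = 13_500
train_params = 337_671_424  # native training-time parameter count

sequences_per_step = batch_size * grad_accumulation        # 96
tokens_per_sequence = tokens_per_step // sequences_per_step  # 2499

total_tokens = tokens_per_step * steps
tokens_per_param = total_tokens / train_params

print(f"sequences per optimizer step: {sequences_per_step}")
print(f"tokens counted per sequence:  {tokens_per_sequence}")
print(f"total tokens seen:            {total_tokens / 1e9:.6f}B")
print(f"tokens per parameter (K):     {tokens_per_param:.4f}")
```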
Included Files
This release bundle includes:
- step_13500.pt
- step_13500.safetensors
- model.safetensors
- config.json
- tokenizer files
- training_config.yaml
- run telemetry (best_validation.json, metrics.jsonl, eval_metrics.jsonl, probe_generations.jsonl)
- benchmark bundle (eval_summary.json, comparison.json, benchmark_report.md, benchmark_metrics.json, benchmark_scores.json, benchmark_source_losses.json)
Quick Start
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
repo_id = "nazdef/gpt2medium-en-it-nanochat-gpt2preln-lr2e4-bs2-ga48-wsd-anchor20k-final2e5-webwiki-step13500"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
prompt = "La capitale d'Italia è"
prompt_ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
bos = torch.tensor([[tokenizer.bos_token_id]], dtype=prompt_ids["input_ids"].dtype)
input_ids = torch.cat([bos, prompt_ids["input_ids"]], dim=1)
attention_mask = torch.ones_like(input_ids)
outputs = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    do_sample=True,
    max_new_tokens=64,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
License
This release is published under CC-BY-SA-4.0, adopted as the practical downstream posture for the mixed training corpus used here.
The training mix includes:
- FineWeb-HQ / FineWeb2-HQ web data
- Wiki40B English and Italian slices
Downstream users are responsible for checking whether their use, redistribution, or derivative packaging remains compatible with the obligations of the upstream datasets and their terms.
