GPT2Medium EN/IT NanoChat GPT2PreLN `step_13500`

This repository publishes the best scalar checkpoint (lowest val_loss_mixed) from the medium GPT2PreLN anchor20k run, as selected by the full repo-native checkpoint sweep currently recorded in this workspace.

This is an ordinary checkpoint release, not the definitive public-facing family base release.

Released checkpoint summary:

released checkpoint: step_13500.pt
parent run: stable-recipe-gpt2medium-gpt2preln-k20-wsd-lr2e-4-anchor20k-final2e5-webwiki
languages: English + Italian
context window: 2500 tokens
architecture: GPT-2-style decoder with pre-layernorm blocks
architecture config: architecture: gpt2, block_type: gpt2_prelayernorm
training-time parameter count: 337,671,424
published Transformers-export parameter count: 337,639,424
hardware class: single consumer GPU (RTX 4060 Ti 16GB) for training; checkpoint selection benchmarked on CPU

This is a base model, not an instruction-tuned chat model.

Why This Checkpoint Was Chosen

The full checkpoint sweep across the evaluated medium run selected step_13500 as the best checkpoint on the main benchmark metric:

step_13500: val_loss_mixed = 4.4652
step_14500: val_loss_mixed = 4.5796
step_9500: val_loss_mixed = 4.5982
step_12500: val_loss_mixed = 4.6220
step_13000: val_loss_mixed = 4.6226

So this release was not picked simply because it is the latest checkpoint by timestamp. It is the current benchmark-selected medium winner for this run family.
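The selection rule described above amounts to taking the minimum of val_loss_mixed over the evaluated checkpoints. A minimal sketch, using only the five values listed here; the helper name `rank_checkpoints` is illustrative and not part of the repository's tooling:

```python
# val_loss_mixed values copied from the sweep list above.
val_loss_mixed = {
    "step_13500": 4.4652,
    "step_14500": 4.5796,
    "step_9500": 4.5982,
    "step_12500": 4.6220,
    "step_13000": 4.6226,
}


def rank_checkpoints(losses):
    """Return (name, loss) pairs sorted from lowest to highest loss."""
    return sorted(losses.items(), key=lambda item: item[1])


ranking = rank_checkpoints(val_loss_mixed)
best_name, best_loss = ranking[0]
# Gap between the winner and the runner-up on the same metric.
margin = ranking[1][1] - best_loss

print(best_name, best_loss, round(margin, 4))
```

The gap to the runner-up (step_14500) is about 0.11 nats, which is large compared with the spread among the remaining listed checkpoints.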

Best-Behavior Shortlist

The scalar winner is step_13500, but the behavior shortlist is:

step_13500
step_14000
step_14500

Why step_14000 remained interesting:

loop_rate = 0.250
distinct_2 = 0.6081
repeated_4gram_rate = 0.725
language_consistency_it = 0.875

Why step_14500 is a good compromise:

val_loss_mixed = 4.5796
distinct_2 = 0.5694
repeated_4gram_rate = 0.750
language_consistency_it = 0.850

Practical reading:

step_13500 is the best scalar checkpoint
step_14000 is one of the cleanest checkpoints on repetition/behavior proxies
step_14500 trades a little scalar loss for a still-healthy qualitative profile
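This card does not spell out how distinct_2 and repeated_4gram_rate are computed. The sketch below shows one common way such proxies are defined, purely as an assumption: distinct-2 as unique bigrams over total bigrams pooled across samples, and the repeated 4-gram rate as the fraction of samples containing at least one repeated 4-gram. The function names and the whitespace tokenization are illustrative, not the repository's actual implementation.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(samples, n=2):
    """Unique n-grams divided by total n-grams, pooled over all samples."""
    all_ngrams = [gram for tokens in samples for gram in ngrams(tokens, n)]
    if not all_ngrams:
        return 0.0
    return len(set(all_ngrams)) / len(all_ngrams)


def repeated_ngram_rate(samples, n=4):
    """Fraction of samples in which some n-gram occurs more than once."""
    if not samples:
        return 0.0
    flagged = 0
    for tokens in samples:
        grams = ngrams(tokens, n)
        if len(grams) != len(set(grams)):
            flagged += 1
    return flagged / len(samples)


# Toy generations (whitespace-tokenized), not real model output.
samples = [
    "il gatto dorme sul divano e il cane dorme fuori".split(),
    "la la la la la la la la".split(),
]

print(distinct_n(samples, n=2))
print(repeated_ngram_rate(samples, n=4))
```

Under these assumed definitions, a higher distinct_2 and a lower repeated_4gram_rate both point toward less repetitive generations, which is the direction in which step_14000 differs from step_13500.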

Main Metrics for `step_13500`

val_loss_mixed = 4.4652
val_loss_en = 4.3219
val_loss_it = 3.5930
ppl_mixed = 86.9371
ppl_en = 75.3304
ppl_it = 36.3446
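The perplexity figures are consistent with the usual relation ppl = exp(loss), with the loss in nats per token. A short check, assuming that relation; small differences in the last digits are expected because the published losses are rounded to four decimals:

```python
import math

# Losses and perplexities copied from the metrics above.
losses = {"mixed": 4.4652, "en": 4.3219, "it": 3.5930}
reported_ppl = {"mixed": 86.9371, "en": 75.3304, "it": 36.3446}


def perplexity(loss):
    """Perplexity for a mean cross-entropy loss expressed in nats."""
    return math.exp(loss)


for name, loss in losses.items():
    recomputed = perplexity(loss)
    print(f"{name}: recomputed={recomputed:.2f} reported={reported_ppl[name]:.2f}")
```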

Behavior snapshot:

loop_rate = 0.325
distinct_2 = 0.5268
repeated_4gram_rate = 0.800
language_consistency_en = 0.975
language_consistency_it = 0.850
cloze_en_contains = 0.16
cloze_it_contains = 0.30

Source losses:

books_en = 4.4590
books_it = 4.5447
code = 7.2659
web_en = 5.9069
web_it = 5.4595
wiki_en = 3.0141
wiki_it = 3.0585

Honest short read:

this is the best scalar medium checkpoint seen so far in this run family
it is healthier than the later collapsed tail
it is still a pretrained checkpoint with visible repetition artifacts, not a magically polished model

Training Data

This model was trained on the bilingual EN/IT web + wiki dataset:

dataset id on disk: 202605141153_fineweb50_wiki50_50en_50it_score100_2500context_5Btokens_tok_20260515_en50it50_webwiki_stratified_500M
context window during training: 2500 tokens
packing length: 2500
mixing strategy: source_balanced
validation ratio: 0.05

Main source groups:

English FineWeb-HQ (epfml/FineWeb-HQ)
Italian FineWeb2-HQ (epfml/FineWeb2-HQ)
English Wiki40B (google/wiki40b)
Italian Wiki40B (google/wiki40b)

How Many Tokens This Checkpoint Saw

Training math:

sequence length: 2500
batch size: 2
grad accumulation: 48
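tokens per optimizer step: 239,904 (equal to 2 × 48 × 2,499, which is consistent with counting 2,499 next-token targets per 2,500-token sequence rather than the naive 2 × 48 × 2,500 = 240,000)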

So step_13500 saw approximately:

3.238704B tokens total
K = 9.5913 tokens per parameter relative to the native training-time parameter count
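The arithmetic behind these figures can be reproduced as follows. One assumption is needed to match the published per-step count: each 2,500-token sequence is taken to contribute 2,499 next-token targets. The card does not state this explicitly, so treat it as an inference from the numbers.

```python
# Values copied from the training math above.
seq_len = 2500
batch_size = 2
grad_accum = 48
step = 13_500
train_params = 337_671_424  # training-time parameter count

# Assumption: seq_len - 1 targets per sequence (next-token prediction),
# which is what reproduces the published 239,904 tokens per step.
tokens_per_step = batch_size * grad_accum * (seq_len - 1)
total_tokens = tokens_per_step * step
tokens_per_param = total_tokens / train_params

print(tokens_per_step)
print(total_tokens / 1e9)
print(round(tokens_per_param, 4))
```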

Included Files

This release bundle includes:

step_13500.pt
step_13500.safetensors
model.safetensors
config.json
tokenizer files
training_config.yaml
run telemetry (best_validation.json, metrics.jsonl, eval_metrics.jsonl, probe_generations.jsonl)
benchmark bundle (eval_summary.json, comparison.json, benchmark_report.md, benchmark_metrics.json, benchmark_scores.json, benchmark_source_losses.json)

Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo_id = "nazdef/gpt2medium-en-it-nanochat-gpt2preln-lr2e4-bs2-ga48-wsd-anchor20k-final2e5-webwiki-step13500"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

prompt = "La capitale d'Italia è"
# Tokenize without automatic special tokens, then prepend BOS explicitly.
prompt_ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
bos = torch.tensor([[tokenizer.bos_token_id]], dtype=prompt_ids["input_ids"].dtype)
input_ids = torch.cat([bos, prompt_ids["input_ids"]], dim=1)
# Single unpadded sequence, so every position is attended to.
attention_mask = torch.ones_like(input_ids)

# Sampled decoding with a mild repetition penalty.
outputs = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    do_sample=True,
    max_new_tokens=64,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

License

This release is published under CC-BY-SA-4.0, chosen as the practical downstream posture for the mixed training corpus used here.

The training mix includes:

FineWeb-HQ / FineWeb2-HQ web data
Wiki40B English and Italian slices

Downstream users are responsible for checking whether their use, redistribution, or derivative packaging remains compatible with the obligations of the upstream datasets and their terms.
