GPT2Medium EN/IT NanoChat GPT2PreLN step_13500

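This repository publishes the best scalar checkpoint, meaning the one that scores best on the main benchmark metric (val_loss_mixed), from the medium GPT2PreLN anchor20k run. The ranking comes from the full repo-native checkpoint sweep currently recorded in this workspace.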

This is an ordinary checkpoint release, not the definitive public-facing family base release.

Released checkpoint summary:

  • released checkpoint: step_13500.pt
  • parent run: stable-recipe-gpt2medium-gpt2preln-k20-wsd-lr2e-4-anchor20k-final2e5-webwiki
  • languages: English + Italian
  • context window: 2500 tokens
  • architecture: GPT-2-style decoder with pre-layernorm blocks
  • architecture config: architecture = gpt2, block_type = gpt2_prelayernorm
  • training-time parameter count: 337,671,424
  • published Transformers-export parameter count: 337,639,424
  • hardware class: single consumer GPU (RTX 4060 Ti 16GB) for training; checkpoint selection benchmarks run on CPU

This is a base model, not an instruction-tuned chat model.

Why This Checkpoint Was Chosen

The full checkpoint sweep across the evaluated medium run selected step_13500 as the best checkpoint on the main benchmark metric:

  • step_13500: val_loss_mixed = 4.4652
  • step_14500: val_loss_mixed = 4.5796
  • step_9500: val_loss_mixed = 4.5982
  • step_12500: val_loss_mixed = 4.6220
  • step_13000: val_loss_mixed = 4.6226

So this release was not picked simply because it is the most recent checkpoint. It is the current benchmark-selected medium winner for this run family.
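
The selection rule above reduces to taking the minimum of val_loss_mixed over the evaluated checkpoints. A minimal sketch, using only the five values listed in this card; the dictionary name and layout are illustrative and are not the repo's actual sweep format:

```python
# Illustrative only: values copied from the list above. The repo's own
# sweep output format is not shown in this card.
val_loss_mixed = {
    "step_13500": 4.4652,
    "step_14500": 4.5796,
    "step_9500": 4.5982,
    "step_12500": 4.6220,
    "step_13000": 4.6226,
}

# Rank checkpoints by validation loss, lowest (best) first.
ranking = sorted(val_loss_mixed.items(), key=lambda item: item[1])
best_step, best_loss = ranking[0]

# Gap between the winner and the runner-up.
margin = ranking[1][1] - best_loss

print(best_step, best_loss)
print(f"margin over runner-up: {margin:.4f}")
```

On these numbers the gap between step_13500 and the runner-up (step_14500) is about 0.11 nats, which is larger than the spread among the remaining four checkpoints.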

Best-Behavior Shortlist

The scalar winner is step_13500, but the behavior shortlist is:

  • step_13500
  • step_14000
  • step_14500

Why step_14000 remained interesting:

  • loop_rate = 0.250
  • distinct_2 = 0.6081
  • repeated_4gram_rate = 0.725
  • language_consistency_it = 0.875

Why step_14500 is a good compromise:

  • val_loss_mixed = 4.5796
  • distinct_2 = 0.5694
  • repeated_4gram_rate = 0.750
  • language_consistency_it = 0.850

Practical reading:

  • step_13500 is the best scalar checkpoint
  • step_14000 is one of the cleanest checkpoints on repetition/behavior proxies
  • step_14500 trades a little scalar loss for a still-healthy qualitative profile

Main Metrics for step_13500

  • val_loss_mixed = 4.4652
  • val_loss_en = 4.3219
  • val_loss_it = 3.5930
  • ppl_mixed = 86.9371
  • ppl_en = 75.3304
  • ppl_it = 36.3446
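
The perplexities above are consistent with ppl = exp(val_loss), assuming the losses are mean per-token cross-entropy in nats (the card does not state this explicitly). A quick check, with the reported values copied from the list above:

```python
import math

# Values copied from the card; losses are rounded to 4 decimals, so the
# recomputed perplexities only match the reported ones approximately.
val_loss = {"mixed": 4.4652, "en": 4.3219, "it": 3.5930}
reported_ppl = {"mixed": 86.9371, "en": 75.3304, "it": 36.3446}

computed_ppl = {name: math.exp(loss) for name, loss in val_loss.items()}

for name in val_loss:
    diff = abs(computed_ppl[name] - reported_ppl[name])
    print(f"ppl_{name}: computed {computed_ppl[name]:.2f}, "
          f"reported {reported_ppl[name]:.4f}, abs diff {diff:.4f}")
```

The small residual differences are what one would expect from rounding the losses before exponentiating.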

Behavior snapshot:

  • loop_rate = 0.325
  • distinct_2 = 0.5268
  • repeated_4gram_rate = 0.800
  • language_consistency_en = 0.975
  • language_consistency_it = 0.850
  • cloze_en_contains = 0.16
  • cloze_it_contains = 0.30
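
This card does not define the repetition proxies, so the following is only a sketch under common definitions: distinct_2 as the ratio of unique to total bigrams in a sample, and repeated_4gram_rate as the fraction of samples that contain at least one 4-gram more than once. The function names, the whitespace tokenization, and the per-sample aggregation are all assumptions and may differ from the repo's implementation.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(tokens, n=2):
    """Unique n-grams divided by total n-grams (0.0 for too-short input)."""
    grams = ngrams(tokens, n)
    return len(set(grams)) / len(grams) if grams else 0.0


def has_repeated_ngram(tokens, n=4):
    """True if any n-gram occurs more than once in the sample."""
    counts = Counter(ngrams(tokens, n))
    return any(count > 1 for count in counts.values())


def repeated_ngram_rate(samples, n=4):
    """Fraction of samples containing at least one repeated n-gram."""
    flags = [has_repeated_ngram(sample, n) for sample in samples]
    return sum(flags) / len(flags) if flags else 0.0


# Toy samples (hypothetical, whitespace-tokenized for simplicity).
samples = [
    "la capitale d'Italia è Roma e la capitale d'Italia è Roma".split(),
    "the quick brown fox jumps over the lazy dog".split(),
]

print(distinct_n(samples[0], n=2))
print(repeated_ngram_rate(samples, n=4))
```

Under these definitions, lower distinct_2 and higher repeated_4gram_rate both indicate more repetitive generations, which matches how the card uses them when comparing checkpoints.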

Source losses:

  • books_en = 4.4590
  • books_it = 4.5447
  • code = 7.2659
  • web_en = 5.9069
  • web_it = 5.4595
  • wiki_en = 3.0141
  • wiki_it = 3.0585

Honest short read:

  • this is the best scalar medium checkpoint seen so far in this run family
  • it is healthier than the collapsed tail of later checkpoints in this run
  • it is still a pretrained checkpoint with visible repetition artifacts, not a magically polished model

Training Data

This model was trained on the bilingual EN/IT web + wiki dataset:

  • dataset id on disk: 202605141153_fineweb50_wiki50_50en_50it_score100_2500context_5Btokens_tok_20260515_en50it50_webwiki_stratified_500M
  • context window during training: 2500 tokens
  • packing length: 2500
  • mixing strategy: source_balanced
  • validation ratio: 0.05

Main source groups:

  • English FineWeb-HQ (epfml/FineWeb-HQ)
  • Italian FineWeb2-HQ (epfml/FineWeb2-HQ)
  • English Wiki40B (google/wiki40b)
  • Italian Wiki40B (google/wiki40b)

How Many Tokens This Checkpoint Saw

Training math:

  • sequence length: 2500
  • batch size: 2
  • grad accumulation: 48
  • tokens per optimizer step: 239,904 (= 2 × 48 × 2,499, i.e. 2,499 counted tokens per 2,500-token sequence)

So step_13500 saw approximately:

  • 3.238704B tokens total (13,500 steps × 239,904 tokens per step)
  • K = 9.5913 tokens per parameter, relative to the native training-time parameter count (337,671,424)
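
The figures above can be reproduced with a few lines of arithmetic. Note that 239,904 equals 2 × 48 × 2,499 rather than 2 × 48 × 2,500. One plausible reading, which is an assumption and not something this card states, is that each packed 2500-token sequence contributes 2,499 next-token targets. The variable names below are illustrative.

```python
seq_len = 2500           # packing length from the card
batch_size = 2
grad_accum = 48
steps = 13_500           # step_13500
train_params = 337_671_424  # training-time parameter count from the card

# Assumption: one token per sequence is lost to the input/target shift,
# which is the only way the card's 239,904 figure is reproduced here.
counted_per_seq = seq_len - 1

tokens_per_step = batch_size * grad_accum * counted_per_seq
total_tokens = tokens_per_step * steps
tokens_per_param = total_tokens / train_params

print(f"tokens per optimizer step: {tokens_per_step:,}")
print(f"total tokens at step {steps}: {total_tokens / 1e9:.6f}B")
print(f"K (tokens per parameter): {tokens_per_param:.4f}")
```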

Included Files

This release bundle includes:

  • step_13500.pt
  • step_13500.safetensors
  • model.safetensors
  • config.json
  • tokenizer files
  • training_config.yaml
  • run telemetry (best_validation.json, metrics.jsonl, eval_metrics.jsonl, probe_generations.jsonl)
  • benchmark bundle (eval_summary.json, comparison.json, benchmark_report.md, benchmark_metrics.json, benchmark_scores.json, benchmark_source_losses.json)

Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo_id = "nazdef/gpt2medium-en-it-nanochat-gpt2preln-lr2e4-bs2-ga48-wsd-anchor20k-final2e5-webwiki-step13500"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Tokenize the prompt without automatic special tokens,
# then prepend the BOS token explicitly.
prompt = "La capitale d'Italia è"
prompt_ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
bos = torch.tensor([[tokenizer.bos_token_id]], dtype=prompt_ids["input_ids"].dtype)
input_ids = torch.cat([bos, prompt_ids["input_ids"]], dim=1)
attention_mask = torch.ones_like(input_ids)

# Sample a short continuation. This is a base model, so expect a
# plain text continuation rather than a chat-style answer.
outputs = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    do_sample=True,
    max_new_tokens=64,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

License

This release is published under CC-BY-SA-4.0, chosen as a practical license for downstream use given the mixed training corpus used here.

The training mix includes:

  • FineWeb-HQ / FineWeb2-HQ web data
  • Wiki40B English and Italian slices

Downstream users are responsible for checking whether their use, redistribution, or derivative packaging remains compatible with the obligations of the upstream datasets and their terms.
