Veyra 30M Base 2.5B Checkpoint !! 5B CHECKPOINT OUT NOW !!
This is an early Veyra-30M base checkpoint trained for approximately 2.5B pretraining tokens.
It is not instruction tuned and should not be evaluated like a finished chat assistant. It is expected to hallucinate, repeat, fail simple factual/math prompts, and continue text in odd ways. This checkpoint is uploaded for transparency, reproducibility, and milestone tracking before further continuation training.
Training summary
Approximate training stages:
- 1B tokens: Cosmopedia v2 bootstrap pretraining.
- +1.5B tokens: mixed continuation using Cosmopedia-v2 repository configs including cosmopedia-v2, fineweb-edu-dedup, and python-edu.
- Total: about 2.5B pretraining tokens.
Architecture
Veyra-30M is a small attention-sparse decoder-only language model.
Key details:
- Exact parameters: 31,988,224 / 31.99M
- Vocabulary: 8,192 tokens
- Hidden size: 512
- Layers: 8
- Attention heads: 8 query heads, 2 KV heads
- MLP intermediate size: 2048
- Activation: SwiGLU
- Normalization: RMSNorm
- Position encoding: RoPE
- Tied token embeddings / LM head
- Context in this checkpoint: 512 tokens
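The exact parameter count can be cross-checked against the listed dimensions. The sketch below is not taken from the repository's code. It assumes a head dimension of 512 / 8 = 64, no bias terms, one RMSNorm weight vector per sub-block plus a final norm, and attention in only 4 of the 8 layers. That last assumption is a guess at what "attention-sparse" means, chosen only because it reproduces the stated total.

```python
# Values taken from the architecture list above.
VOCAB = 8_192
HIDDEN = 512
LAYERS = 8
Q_HEADS = 8
KV_HEADS = 2
MLP_DIM = 2_048

# Assumptions, NOT stated in the model card:
HEAD_DIM = HIDDEN // Q_HEADS  # 64
ATTN_LAYERS = 4               # guess: only half of the layers carry attention


def count_parameters(attn_layers=ATTN_LAYERS):
    """Count weights assuming no biases and tied embedding / LM head."""
    embed = VOCAB * HIDDEN  # counted once because the LM head is tied

    q_proj = HIDDEN * Q_HEADS * HEAD_DIM
    kv_proj = 2 * HIDDEN * KV_HEADS * HEAD_DIM  # K and V share 2 heads each
    o_proj = Q_HEADS * HEAD_DIM * HIDDEN
    attn = q_proj + kv_proj + o_proj

    mlp = 3 * HIDDEN * MLP_DIM  # SwiGLU: gate, up, and down projections
    norm = HIDDEN               # one RMSNorm weight vector

    layer_with_attn = attn + norm + mlp + norm
    layer_without_attn = mlp + norm  # assumed: no attention norm either
    final_norm = HIDDEN

    return (
        embed
        + attn_layers * layer_with_attn
        + (LAYERS - attn_layers) * layer_without_attn
        + final_norm
    )


total = count_parameters()
print(f"{total:,}")  # 31,988,224
```

With attention in all 8 layers the same formula gives 34,611,712, so the 4-of-8 layout should be read as one configuration consistent with the published number, not as a confirmed description of the model.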
Loading
This repository uses custom Transformers code.
Minimal usage:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo = "veyra-ai/veyra-30m-base-2.5b-tokens"

# trust_remote_code=True is required because the repository ships custom model code.
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
# On older Transformers releases this argument is named torch_dtype.
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True, dtype=torch.float32)
model.eval()

prompt = "Photosynthesis is the process by which"
# Raw completion: do not add special tokens around the prompt.
input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        input_ids,
        do_sample=True,
        temperature=0.5,
        top_k=30,
        repetition_penalty=1.15,
        no_repeat_ngram_size=2,
        max_new_tokens=80,
    )

print(tokenizer.decode(out[0], skip_special_tokens=True))
For raw completion prompts, use add_special_tokens=False.
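Because this checkpoint was trained with a 512-token context, it may help to keep the prompt plus the requested new tokens inside that window. The helper below is a hypothetical sketch (the name fit_to_context is not part of the repository) that keeps only the most recent prompt tokens. The card does not describe how the model behaves beyond 512 tokens, so this is a conservative choice, not a documented requirement.

```python
CONTEXT_LEN = 512  # context length stated for this checkpoint


def fit_to_context(token_ids, max_new_tokens, context_len=CONTEXT_LEN):
    """Keep the most recent prompt tokens so prompt + generation fits the window.

    token_ids is a plain list of token ids (hypothetical helper, not repo code).
    """
    if not 0 < max_new_tokens < context_len:
        raise ValueError("max_new_tokens must be between 1 and context_len - 1")
    budget = context_len - max_new_tokens  # tokens left for the prompt
    return token_ids[-budget:]


# A 600-token prompt with 80 new tokens is cut to its last 432 tokens.
long_prompt = list(range(600))
trimmed = fit_to_context(long_prompt, max_new_tokens=80)
print(len(trimmed))  # 432
```

Truncating from the left keeps the text closest to the continuation point, which is usually what matters for a base model that simply continues its input.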
Optimizer
Training used:
- CosineGatedAdam / CGA-v0 on 2D projection matrices
- AdamW on embeddings, norms, tied head, and auxiliary parameters
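The split described above can be sketched as two parameter groups in PyTorch. This is an illustration under stated assumptions, not the training code: ToyModel and split_parameters are hypothetical, the name-based rule for spotting embeddings and the head is a guess, the learning rates are placeholders, and torch.optim.AdamW stands in for CosineGatedAdam because the card does not describe its implementation.

```python
import torch
from torch import nn


class ToyModel(nn.Module):
    """Hypothetical stand-in with the same kinds of parameters, not Veyra itself."""

    def __init__(self, vocab=32, hidden=16):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.proj = nn.Linear(hidden, hidden, bias=False)  # a 2D projection matrix
        self.norm = nn.LayerNorm(hidden)                   # stand-in for RMSNorm
        self.lm_head = nn.Linear(hidden, vocab, bias=False)
        self.lm_head.weight = self.embed.weight            # tied embeddings / head


def split_parameters(model):
    """Return (projection matrices, everything else)."""
    matrices, others = [], []
    # named_parameters() yields a tied weight only once by default.
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Assumed naming convention; adjust to the real module names.
        is_embedding_or_head = name.startswith(("embed", "lm_head"))
        if param.ndim == 2 and not is_embedding_or_head:
            matrices.append(param)
        else:
            others.append(param)
    return matrices, others


model = ToyModel()
matrices, others = split_parameters(model)

# Placeholder: AdamW is used here ONLY because CGA-v0 is not publicly specified.
matrix_optimizer = torch.optim.AdamW(matrices, lr=1e-3)
# Embeddings, norms, and the tied head go to AdamW, as in the list above.
other_optimizer = torch.optim.AdamW(others, lr=1e-3)
```

In a training loop both optimizers would be stepped after each backward pass, so every parameter is updated by exactly one of them.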
Intended use
This checkpoint is primarily for:
- continued pretraining
- research / ablations
- tracking Veyra training milestones
- testing tiny model behavior
It is not intended for production use or reliable factual answering.
Known limitations
This model can:
- hallucinate confidently
- repeat phrases
- fail arithmetic
- fail simple factual questions
- produce fake code
- continue in textbook-like or tutorial-like styles
Further continuation pretraining and post-training are planned.
