A newer version of this model is available: veyra-ai/Veyra-30M-Base

Veyra 30M Base 2.5B Checkpoint (a 5B checkpoint is now available)

This is an early Veyra-30M base checkpoint trained for approximately 2.5B pretraining tokens.

It is not instruction tuned and should not be evaluated like a finished chat assistant. It is expected to hallucinate, repeat, fail simple factual/math prompts, and continue text in odd ways. This checkpoint is uploaded for transparency, reproducibility, and milestone tracking before further continuation training.

Training summary

Approximate training stages:

1B tokens: Cosmopedia v2 bootstrap pretraining.
+1.5B tokens: mixed continuation using Cosmopedia-v2 repository configs including cosmopedia-v2, fineweb-edu-dedup, and python-edu.
Total: about 2.5B pretraining tokens.

Architecture

Veyra-30M is a small attention-sparse decoder-only language model.

Key details:

Exact parameters: 31,988,224 (about 31.99M)
Vocabulary: 8,192 tokens
Hidden size: 512
Layers: 8
Attention heads: 8 query heads, 2 KV heads
MLP intermediate size: 2048
Activation: SwiGLU
Normalization: RMSNorm
Position encoding: RoPE
Tied token embeddings / LM head
Context in this checkpoint: 512 tokens
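As a sanity check on the figures above, the parameter count can be reproduced from the listed dimensions. The sketch below is only an estimate under stated assumptions: no bias terms, one RMSNorm weight vector per sublayer plus a final norm, a head dimension of hidden size divided by query heads, and tied embeddings counted once. The card does not say what "attention-sparse" means in detail; the layout with attention in 4 of the 8 blocks is a hypothetical one that happens to match the stated exact count, not a confirmed description of the model.

```python
# Dimensions taken from the "Key details" list.
VOCAB = 8192
HIDDEN = 512
LAYERS = 8
Q_HEADS = 8
KV_HEADS = 2
MLP_DIM = 2048
HEAD_DIM = HIDDEN // Q_HEADS  # assumption: 512 / 8 = 64


def attention_params():
    """Weights of one grouped-query attention block, assuming no biases."""
    q = HIDDEN * Q_HEADS * HEAD_DIM
    k = HIDDEN * KV_HEADS * HEAD_DIM
    v = HIDDEN * KV_HEADS * HEAD_DIM
    o = Q_HEADS * HEAD_DIM * HIDDEN
    return q + k + v + o


def mlp_params():
    """Weights of one SwiGLU MLP: gate, up, and down projections."""
    return 3 * HIDDEN * MLP_DIM


def total_params(attention_layers):
    """Total count when only `attention_layers` blocks contain attention."""
    embed = VOCAB * HIDDEN                       # tied with the LM head
    mlp = LAYERS * (mlp_params() + HIDDEN)       # MLP plus its RMSNorm
    attn = attention_layers * (attention_params() + HIDDEN)  # attention plus its RMSNorm
    final_norm = HIDDEN
    return embed + mlp + attn + final_norm


dense = total_params(attention_layers=LAYERS)  # every block has attention
sparse = total_params(attention_layers=4)      # hypothetical: half the blocks
print(f"dense:  {dense:,}")
print(f"sparse: {sparse:,}")
```

Under these assumptions a fully dense stack comes to about 34.6M parameters, while the hypothetical 4-attention-layer layout gives 31,988,224.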

Loading

This repository uses custom Transformers code, so loading requires trust_remote_code=True.

Minimal usage:

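from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo = "veyra-ai/Veyra-30M-Base-2.5B-Tokens"

# trust_remote_code=True is needed because the model class ships with the repository.
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
# `dtype` is the newer argument name; older Transformers releases use `torch_dtype`.
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True, dtype=torch.float32)
model.eval()

prompt = "Photosynthesis is the process by which"
# Encode without special tokens so the model sees a raw completion prompt.
input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")

# Sample a short continuation; the penalties reduce the repetition typical of tiny models.
with torch.no_grad():
    out = model.generate(
        input_ids,
        do_sample=True,
        temperature=0.5,
        top_k=30,
        repetition_penalty=1.15,
        no_repeat_ngram_size=2,
        max_new_tokens=80,
    )

print(tokenizer.decode(out[0], skip_special_tokens=True))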

For raw completion prompts, pass add_special_tokens=False when encoding so the tokenizer does not add special tokens to the prompt.
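Because this checkpoint was trained with a 512-token context, the prompt and the generated tokens together should stay within that budget. The helper below is a minimal sketch of one way to do that by keeping only the most recent prompt tokens; `fit_prompt` is a hypothetical name, not part of this repository or of Transformers.

```python
CONTEXT_LEN = 512  # context length stated for this checkpoint


def fit_prompt(token_ids, max_new_tokens, context_len=CONTEXT_LEN):
    """Keep the most recent prompt tokens so prompt + generation fits the context."""
    if max_new_tokens >= context_len:
        raise ValueError("max_new_tokens must be smaller than the context length")
    budget = context_len - max_new_tokens  # tokens left for the prompt
    return list(token_ids)[-budget:]


# Example with placeholder token ids: 600 prompt tokens, 80 new tokens requested.
trimmed = fit_prompt(range(600), max_new_tokens=80)
print(len(trimmed))
```

With a tensor returned by tokenizer.encode(..., return_tensors="pt"), the same trimming can be written as input_ids[:, -budget:].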

Optimizer

Training used:

CosineGatedAdam / CGA-v0 on 2D projection matrices
AdamW on embeddings, norms, tied head, and auxiliary parameters
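The split described above can be illustrated with a small routing function. This is a sketch only: the parameter names are hypothetical, the name-based rule for spotting the embedding and tied head is an assumption, and the CosineGatedAdam update itself is not shown because the card does not describe it. The function works on (name, shape) pairs so it runs without loading the model.

```python
def split_for_optimizers(named_shapes):
    """Route 2D projection matrices to one group and everything else to AdamW."""
    matrix_group, adamw_group = [], []
    for name, shape in named_shapes:
        # Embeddings and the tied head are 2D too, but the card assigns them to AdamW.
        is_embedding_or_head = "embed" in name or "lm_head" in name
        if len(shape) == 2 and not is_embedding_or_head:
            matrix_group.append(name)
        else:
            adamw_group.append(name)
    return matrix_group, adamw_group


# Hypothetical parameter names with shapes derived from the architecture list.
example = [
    ("embed_tokens.weight", (8192, 512)),
    ("layers.0.attn.q_proj.weight", (512, 512)),
    ("layers.0.attn.k_proj.weight", (128, 512)),
    ("layers.0.mlp.gate_proj.weight", (2048, 512)),
    ("layers.0.attn_norm.weight", (512,)),
    ("final_norm.weight", (512,)),
]

matrix_names, adamw_names = split_for_optimizers(example)
print("matrix group:", matrix_names)
print("AdamW group: ", adamw_names)
```

With a loaded model, the same rule could be applied to the (name, tuple(param.shape)) pairs from model.named_parameters(), after checking the actual parameter names in the repository's custom code.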

Intended use

This checkpoint is primarily for:

continued pretraining
research / ablations
tracking Veyra training milestones
testing tiny model behavior

It is not intended for production use or reliable factual answering.

Known limitations

This model can:

hallucinate confidently
repeat phrases
fail arithmetic
fail simple factual questions
produce fake code
continue in textbook-like or tutorial-like styles

Further continuation pretraining and post-training are planned.


URL: https://huggingface.co/veyra-ai/Veyra-30M-Base-2.5B-Tokens
