URL: https://huggingface.co/veyra-ai/Veyra-30M-Base-2.5B-Tokens

veyra-ai/Veyra-30M-Base-2.5B-Tokens


A newer version of this model is available: veyra-ai/Veyra-30M-Base

Veyra 30M Base 2.5B Checkpoint !! 5B CHECKPOINT OUT NOW !!

This is an early Veyra-30M base checkpoint trained for approximately 2.5B pretraining tokens.

It is not instruction tuned and should not be evaluated like a finished chat assistant. It is expected to hallucinate, repeat, fail simple factual/math prompts, and continue text in odd ways. This checkpoint is uploaded for transparency, reproducibility, and milestone tracking before further continuation training.

Training summary

Approximate training stages:

  • 1B tokens: Cosmopedia v2 bootstrap pretraining.
  • +1.5B tokens: mixed continuation pretraining on configs from the Cosmopedia-v2 repository, including cosmopedia-v2, fineweb-edu-dedup, and python-edu.
  • Total: about 2.5B pretraining tokens.
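
As a rough sanity check on the stage budgets above, the sketch below adds the two stages and converts the total into full 512-token sequences and tokens per parameter. It treats the approximate figures as exact, assumes sequences are packed to the full context length, and borrows the context length and parameter count from the Architecture details on this card. The variable names are illustrative.

```python
# Stage budgets as stated on the card (approximate values treated as exact).
BOOTSTRAP_TOKENS = 1_000_000_000      # Cosmopedia v2 bootstrap
CONTINUATION_TOKENS = 1_500_000_000   # mixed continuation
CONTEXT_LENGTH = 512                  # tokens per sequence in this checkpoint
PARAMETERS = 31_988_224               # exact count listed under Architecture

total_tokens = BOOTSTRAP_TOKENS + CONTINUATION_TOKENS
# Assumption: every sequence is packed to the full context length.
full_sequences = total_tokens // CONTEXT_LENGTH
tokens_per_parameter = total_tokens / PARAMETERS

print(f"total tokens:         {total_tokens:,}")
print(f"512-token sequences:  {full_sequences:,}")
print(f"tokens per parameter: {tokens_per_parameter:.1f}")
```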

Architecture

Veyra-30M is a small attention-sparse decoder-only language model.

Key details:

  • Exact parameter count: 31,988,224 (about 31.99M)
  • Vocabulary: 8,192 tokens
  • Hidden size: 512
  • Layers: 8
  • Attention heads: 8 query heads, 2 KV heads
  • MLP intermediate size: 2048
  • Activation: SwiGLU
  • Normalization: RMSNorm
  • Position encoding: RoPE
  • Tied token embeddings / LM head
  • Context in this checkpoint: 512 tokens
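
The listed sizes can be turned into a parameter count with a few lines of arithmetic. The sketch below assumes bias-free linear layers, a head dimension of hidden size divided by query heads, one RMSNorm before each sublayer plus a final norm, and tied embeddings counted once. The card does not say which layers carry attention, so the `attention_layers` argument is an assumption used to explore what "attention-sparse" might mean; `count_parameters` is an illustrative helper, not code from the repository.

```python
def count_parameters(vocab=8192, hidden=512, layers=8, attention_layers=8,
                     query_heads=8, kv_heads=2, intermediate=2048):
    """Count weights for a bias-free GQA + SwiGLU decoder with tied embeddings."""
    head_dim = hidden // query_heads
    kv_dim = kv_heads * head_dim
    # Q and output projections are hidden x hidden; K and V are hidden x kv_dim.
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim
    # SwiGLU uses gate, up, and down projections.
    mlp = 3 * hidden * intermediate
    norm = hidden  # one RMSNorm weight vector

    embedding = vocab * hidden  # counted once because the LM head is tied
    with_attention = attention + mlp + 2 * norm  # pre-attention and pre-MLP norms
    without_attention = mlp + norm               # assumed MLP-only block, one norm
    blocks = (attention_layers * with_attention
              + (layers - attention_layers) * without_attention)
    return embedding + blocks + norm  # trailing term is the final norm


dense = count_parameters(attention_layers=8)
sparse = count_parameters(attention_layers=4)
print(f"attention in all 8 layers:  {dense:,}")
print(f"attention in 4 of 8 layers: {sparse:,}")
```

Under these assumptions, attention in 4 of the 8 layers gives 31,988,224, which matches the exact count above, while attention in all 8 layers gives 34,611,712, close to the 34.6M that the Hub reports for the safetensors file. The agreement is suggestive only; the card does not state the layer layout or explain the difference between the two figures.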

Loading

This repository uses custom Transformers code, so loading requires trust_remote_code=True.

Minimal usage:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo = "veyra-ai/Veyra-30M-Base-2.5B-Tokens"

# trust_remote_code=True is needed because the model class lives in the repository.
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
# `dtype` is the argument name in recent Transformers releases; older ones call it `torch_dtype`.
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True, dtype=torch.float32)
model.eval()  # inference mode

prompt = "Photosynthesis is the process by which"
# Raw completion prompt: no special tokens are added around the text.
input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        input_ids,
        do_sample=True,
        temperature=0.5,
        top_k=30,
        repetition_penalty=1.15,
        no_repeat_ngram_size=2,
        max_new_tokens=80,
    )

print(tokenizer.decode(out[0], skip_special_tokens=True))

For raw completion prompts, use add_special_tokens=False.
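
Because this checkpoint has a 512-token context, a long prompt plus max_new_tokens can exceed it. The helper below is a minimal sketch of one way to budget for that, assuming the limit applies to prompt and generated tokens combined and that dropping the oldest prompt tokens is acceptable. `fit_prompt_to_context` is a hypothetical name, not part of the repository or of Transformers, and it works on plain lists of token IDs.

```python
CONTEXT_LENGTH = 512  # context of this checkpoint, per the card


def fit_prompt_to_context(token_ids, max_new_tokens, context_length=CONTEXT_LENGTH):
    """Keep the most recent prompt tokens so prompt + generation fits the context."""
    if not 0 < max_new_tokens < context_length:
        raise ValueError("max_new_tokens must be between 1 and context_length - 1")
    prompt_budget = context_length - max_new_tokens
    # Left truncation: the tokens nearest the continuation point are kept.
    return list(token_ids[-prompt_budget:])


long_prompt = list(range(600))  # stand-in for 600 token IDs
trimmed = fit_prompt_to_context(long_prompt, max_new_tokens=80)
print(len(trimmed), trimmed[0], trimmed[-1])
```

With the loading example, the trimmed list can be wrapped as torch.tensor([trimmed]) before it is passed to model.generate.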

Optimizer

Training used:

  • CosineGatedAdam / CGA-v0 on 2D projection matrices
  • AdamW on embeddings, norms, tied head, and auxiliary parameters
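
The split described above can be sketched as a routing rule over parameter names and shapes. This is an illustration only: the parameter names are hypothetical, the rule (2D weights outside the embedding and head go to the CGA group, everything else to AdamW) is an assumption about how the split was made, and CosineGatedAdam itself is not implemented here.

```python
def route_parameters(named_shapes, embedding_keywords=("embed", "lm_head")):
    """Split parameter names into a matrix group and an AdamW group."""
    matrix_group, adamw_group = [], []
    for name, shape in named_shapes.items():
        is_embedding = any(key in name for key in embedding_keywords)
        if len(shape) == 2 and not is_embedding:
            matrix_group.append(name)   # 2D projection matrices
        else:
            adamw_group.append(name)    # embeddings, norms, tied head, the rest
    return matrix_group, adamw_group


# Hypothetical parameter names; shapes follow the sizes listed under Architecture.
example = {
    "embed_tokens.weight": (8192, 512),
    "layers.0.attn.q_proj.weight": (512, 512),
    "layers.0.attn.k_proj.weight": (128, 512),
    "layers.0.mlp.gate_proj.weight": (2048, 512),
    "layers.0.mlp_norm.weight": (512,),
    "final_norm.weight": (512,),
}
matrix_group, adamw_group = route_parameters(example)
print("CGA group:  ", matrix_group)
print("AdamW group:", adamw_group)
```

With a loaded PyTorch model, the same rule could be applied to {name: tuple(p.shape) for name, p in model.named_parameters()}.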

Intended use

This checkpoint is primarily for:

  • continued pretraining
  • research / ablations
  • tracking Veyra training milestones
  • testing tiny model behavior

It is not intended for production use or reliable factual answering.

Known limitations

This model can:

  • hallucinate confidently
  • repeat phrases
  • fail arithmetic
  • fail simple factual questions
  • produce fake code
  • continue in textbook-like or tutorial-like styles
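
To put a number on the phrase repetition listed above, one simple option is the fraction of n-grams in a generated sample that already appeared earlier in it. The sketch below is an assumed, illustrative metric: `repeated_ngram_fraction` is a hypothetical helper, and whitespace splitting stands in for the model's tokenizer. It is not an evaluation the authors report.

```python
def repeated_ngram_fraction(text, n=3):
    """Return the share of n-grams that duplicate an earlier n-gram in the text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    seen, repeats = set(), 0
    for gram in ngrams:
        if gram in seen:
            repeats += 1
        seen.add(gram)
    return repeats / len(ngrams)


looping = "the cell is the cell is the cell is the cell is"
varied = "plants convert light into chemical energy stored as sugar"
looping_score = repeated_ngram_fraction(looping)
varied_score = repeated_ngram_fraction(varied)
print(looping_score, varied_score)
```

A score near 0 means almost every n-gram is new, and a score near 1 means the sample is mostly looping.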

Further continuation pretraining and post-training are planned.

Downloads last month: 3,086
Safetensors model size: 34.6M params
Tensor type: F32