URL: https://huggingface.co/shivendrra/BiosaicTokenizer

Biosaic Tokenizer

Overview

Biosaic (Bio-Mosaic) is a tokenizer library built for Enigma2. It contains a tokenizer and an embedder for DNA and amino-acid (protein) sequences. Its encoders, based on the VQ-VAE and Evoformer architectures, convert sequences into embeddings and back for use in model training.

Features

  • Tokenization: converts sequences into k-mers (DNA only).
  • Encoding: converts sequences into embeddings for classification and training.
  • Easy to use: the library is basic and simple to use.
  • SoTA encoders: the Evoformer and VQ-VAE models are inspired by AlphaFold-2.
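The k-mer tokenization mentioned in the feature list can be sketched in a few lines. This is a minimal illustration, not Biosaic's actual API: the function name `kmer_tokenize`, the default `k`, and the `stride` parameter are all hypothetical, and the library may use a different window size or non-overlapping windows.

```python
def kmer_tokenize(sequence: str, k: int = 4, stride: int = 1) -> list[str]:
    """Split a DNA string into k-mers (hypothetical helper, not the Biosaic API)."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    # Slide a window of width k across the sequence; a trailing window that
    # would run past the end is dropped, so short inputs give an empty list.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


tokens = kmer_tokenize("ATGCGTA", k=3)
print(tokens)  # ['ATG', 'TGC', 'GCG', 'CGT', 'GTA']
```

With `stride=1` the k-mers overlap; setting `stride=k` gives non-overlapping chunks instead.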

Models

It has two different models:

  • for DNA tokenization and encoding: VQ-VAE
  • for protein encodings: EvoFormer

The VQ-VAE has around 160M parameters (for now it is only around 40M, for test runs). The EvoFormer has around 136M parameters (still in training).

Config:

For VQ-VAE:

class ModelConfig:
 d_model: int = 768
 in_dim: int = 4
 beta: float = 0.15
 dropout: float = 0.25
 n_heads: int = 16
 n_layers: int = 12

For EvoFormer:

import torch

class ModelConfig:
 DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 A = 4 # DNA alphabet size
 C = 21 # amino-acid alphabet size (21 letters for proteins, 4 for DNA)
 d_msa = 768
 d_pair = 256
 n_heads = 32
 n_blocks = 28

Training:

Batch training is preferred for both the VQ-VAE and EvoFormer models. Each model has its own separate Dataset class that takes the sequences as strings, one-hot encodes the DNA sequences first, and then fills them into batches according to the train and validation splits; the validation split is around 20% of the full dataset.
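The data preparation described above (one-hot encoding, a roughly 20% validation split, then batching) might look like the following standard-library sketch. Everything here is an assumption made for illustration: the helper names, the `ATGC` channel ordering, the all-zero vector for unknown bases, and the shuffling seed are not taken from the Biosaic Dataset class, which works on tensors.

```python
import random

DNA_ALPHABET = "ATGC"  # assumed channel order; 4 channels, matching in_dim = 4


def one_hot(seq):
    """Map each base to a 4-dim indicator vector; unknown bases (e.g. N) stay all zeros."""
    index = {base: i for i, base in enumerate(DNA_ALPHABET)}
    rows = []
    for base in seq.upper():
        row = [0.0] * len(DNA_ALPHABET)
        if base in index:
            row[index[base]] = 1.0
        rows.append(row)
    return rows


def train_val_split(sequences, val_fraction=0.2, seed=0):
    """Shuffle the sequences and hold out val_fraction of them for validation."""
    items = list(sequences)
    random.Random(seed).shuffle(items)
    n_val = int(len(items) * val_fraction)
    return items[n_val:], items[:n_val]


def batches(sequences, batch_size):
    """Yield lists of one-hot encoded sequences, batch_size at a time."""
    for start in range(0, len(sequences), batch_size):
        yield [one_hot(s) for s in sequences[start:start + batch_size]]


data = ["ATGC", "GGCA", "TTAA", "CCGT", "ACGT", "TGCA", "GATC", "CTAG", "AACC", "GGTT"]
train, val = train_val_split(data)          # 8 training and 2 validation sequences
first_batch = next(batches(train, batch_size=6))
```

In a PyTorch setup the same logic would typically live in a `torch.utils.data.Dataset` subclass, with the nested lists replaced by tensors.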

For VQ-VAE:

class TrainConfig:
 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 learning_rate = 1e-4 # bumped from 1e-5
 weight_decay = 1e-4
 amsgrad = True
 warmup_epochs = 50 # linear warm‑up
 epochs = 2000
 eval_interval = 100
 eval_iters = 30
 batch_size = 6
 block_size = 256
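The `warmup_epochs = 50` setting refers to a linear warm-up of the learning rate. A minimal sketch of such a schedule is shown below; the function name `warmup_factor` is hypothetical, and holding the rate constant after warm-up is an assumption, since the config does not say what happens once warm-up ends.

```python
def warmup_factor(epoch: int, warmup_epochs: int = 50) -> float:
    """Multiplier applied to the base learning rate (hypothetical helper).

    Ramps linearly from 1/warmup_epochs up to 1.0 over the warm-up epochs,
    then stays at 1.0 (assumed; the source does not specify a decay).
    """
    if warmup_epochs <= 0:
        return 1.0
    return min(1.0, (epoch + 1) / warmup_epochs)


base_lr = 1e-4  # learning_rate from the config above
for epoch in (0, 24, 49, 100):
    print(epoch, base_lr * warmup_factor(epoch))
```

A function of this shape can be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, which multiplies the optimizer's base learning rate by the returned factor.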

For EvoFormer:

class TrainConfig:
 DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 LR = 1e-4
 WD = 1e-4
 AMS = True
 WARMUP = 50
 EPOCHS = 500
 BATCH = 8
 MSA_SEQ = 32 # number of sequences in each MSA
 L_SEQ = 256 # length of each sequence
 EVAL_ITERS = 5
 EVAL_INTV = 50
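To get a feel for what these settings imply, the sketch below computes the sizes of the two representations an Evoformer-style model carries. It assumes the AlphaFold-2 layout, with an MSA representation of shape (batch, sequences, length, d_msa) and a pair representation of shape (batch, length, length, d_pair); whether Biosaic uses exactly this layout is not confirmed by the source, and the helper names are hypothetical.

```python
def evoformer_shapes(batch=8, msa_seq=32, l_seq=256, d_msa=768, d_pair=256):
    """Return (msa_shape, pair_shape) under the assumed AlphaFold-2 layout."""
    msa_shape = (batch, msa_seq, l_seq, d_msa)
    pair_shape = (batch, l_seq, l_seq, d_pair)
    return msa_shape, pair_shape


def size_mib(shape, bytes_per_element=4):
    """Memory of one float32 tensor of the given shape, in MiB."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_element / 2**20


msa_shape, pair_shape = evoformer_shapes()
print(msa_shape, size_mib(msa_shape))    # (8, 32, 256, 768) 192.0
print(pair_shape, size_mib(pair_shape))  # (8, 256, 256, 256) 512.0
```

The pair representation grows quadratically with `L_SEQ`, which is one plausible reason for keeping `BATCH` small.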