MioCodec-25Hz-44.1kHz-v2: Lightweight Neural Audio Codec for Efficient Spoken Language Modeling
MioCodec-25Hz-44.1kHz-v2 is an upsampled, high-fidelity version of the MioCodec-25Hz-24kHz model.
By integrating an UpsamplerBlock inspired by Inworld TTS-1 into the decoder, this model reconstructs 44.1 kHz audio from the standard 25 Hz token stream.
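As a rough illustration of what the decoder has to do, the arithmetic below works out how many output samples each 25 Hz token must expand into. The helper name `samples_per_token` is hypothetical and not part of the MioCodec API; the rates are the ones stated above.

```python
def samples_per_token(sample_rate_hz: int, token_rate_hz: float) -> float:
    """Number of output waveform samples covered by one codec token."""
    return sample_rate_hz / token_rate_hz


# 44.1 kHz output from a 25 Hz token stream (this model)
v2 = samples_per_token(44_100, 25)
# 24 kHz output from the same token stream (MioCodec-25Hz-24kHz)
base = samples_per_token(24_000, 25)

print(v2, base)   # 1764.0 960.0
print(v2 / base)  # 1.8375
```

Each token therefore has to account for about 1.84 times as many samples as in the 24 kHz model, which is the gap the added upsampling stage is there to bridge.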
What's New in v2
This model is a fine-tuned version of MioCodec-25Hz-24kHz with the following architectural enhancements:
- 44.1 kHz Output: Reconstructs audio at 44.1 kHz, covering a wider frequency band than the base 24 kHz model.
- UpsamplerBlock + SnakeBeta: We adopted the UpsamplerBlock architecture from Inworld TTS-1 and enhanced it by integrating SnakeBeta activations. This combination allows the decoder to effectively predict and generate high-frequency components, enabling clear 44.1 kHz reconstruction from the lower-resolution input.
- Token Compatibility: During fine-tuning, the content branch was frozen. This means the discrete tokens generated by this model are identical to those from MioCodec-25Hz-24kHz. You can take any TTS model trained on the 24 kHz tokens and simply swap the codec to this v2 model during inference to instantly upgrade the audio quality to 44.1 kHz.
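For readers unfamiliar with the SnakeBeta activation mentioned in the list above, here is a minimal scalar sketch. It assumes the definition used in BigVGAN, x + (1/β)·sin²(αx); this card does not state how MioCodec parameterizes α and β (per channel, log scale, and so on), so treat the function below as an illustration rather than the model's actual layer.

```python
import math


def snake_beta(x: float, alpha: float, beta: float, eps: float = 1e-9) -> float:
    """SnakeBeta as defined in BigVGAN: x + (1 / beta) * sin^2(alpha * x).

    alpha sets the frequency of the periodic term and beta its magnitude.
    In a real network both are learnable parameters; here they are plain
    floats for illustration (an assumption, not the MioCodec implementation).
    """
    return x + (math.sin(alpha * x) ** 2) / (beta + eps)


# The output is the input plus a non-negative periodic bump.
for x in (-1.0, 0.0, 0.5, 2.0):
    print(x, round(snake_beta(x, alpha=1.0, beta=1.0), 4))
```

The periodic term gives the decoder an inductive bias toward oscillating signals, which is the usual motivation for using Snake-style activations in waveform generators.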
Model Comparison
| Model | Token Rate | Vocab Size | Bit Rate | Sample Rate | SSL Encoder | Vocoder | Parameters | Highlights |
|---|---|---|---|---|---|---|---|---|
| MioCodec-25Hz-44.1kHz-v2 | 25 Hz | 12,800 | 341 bps | 44.1 kHz | WavLM-base+ | - (iSTFTHead) | 133M | Fast inference, good quality |
| MioCodec-25Hz-24kHz | 25 Hz | 12,800 | 341 bps | 24 kHz | WavLM-base+ | - (iSTFTHead) | 132M | Lightweight, fast inference |
| MioCodec-25Hz-44.1kHz | 25 Hz | 12,800 | 341 bps | 44.1 kHz | WavLM-base+ | MioVocoder (Jointly Tuned) | 118M (w/o vocoder) | High-quality, high sample rate |
| kanade-25hz | 25 Hz | 12,800 | 341 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 118M (w/o vocoder) | Original 25Hz model |
| kanade-12.5hz | 12.5 Hz | 12,800 | 171 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 120M (w/o vocoder) | Original 12.5Hz model |
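The Bit Rate column can be reproduced from the Token Rate and Vocab Size columns. The sketch below assumes the figure counts only the content token stream, with each token carrying log2(vocab size) bits, and ignores the global embedding; the card does not spell out the formula, but the numbers match. The function name `bit_rate_bps` is hypothetical.

```python
import math


def bit_rate_bps(token_rate_hz: float, vocab_size: int) -> float:
    """Bits per second for a stream of tokens drawn from a fixed vocabulary."""
    return token_rate_hz * math.log2(vocab_size)


print(round(bit_rate_bps(25, 12_800)))    # 341
print(round(bit_rate_bps(12.5, 12_800)))  # 171
```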
Quick Start
Installation
# Install via pip
pip install git+https://github.com/Aratako/MioCodec
# Or using uv
uv add git+https://github.com/Aratako/MioCodec
Basic Inference
Basic usage for encoding and decoding audio:
from miocodec import MioCodecModel, load_audio
import soundfile as sf
# 1. Load model
model = MioCodecModel.from_pretrained("Aratako/MioCodec-25Hz-44.1kHz-v2").eval().cuda()
# 2. Load audio
waveform = load_audio("input.wav", sample_rate=model.config.sample_rate).cuda()
# 3. Encode Audio
features = model.encode(waveform)
# 4. Decode to Waveform (directly, no vocoder needed)
resynth = model.decode(
    content_token_indices=features.content_token_indices,
    global_embedding=features.global_embedding,
)
# 5. Save
sf.write("output.wav", resynth.cpu().numpy(), model.config.sample_rate)
Voice Conversion (Zero-shot)
MioCodec allows you to swap speaker identities by combining the content tokens of a source with the global embedding of a reference.
source = load_audio("source_content.wav", sample_rate=model.config.sample_rate).cuda()
reference = load_audio("target_speaker.wav", sample_rate=model.config.sample_rate).cuda()
# Perform conversion
vc_wave = model.voice_conversion(source, reference)
sf.write("converted.wav", vc_wave.cpu().numpy(), model.config.sample_rate)
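The description above suggests that voice conversion amounts to decoding the source's content tokens with the reference's global embedding. The sketch below spells that out using only the `encode` and `decode` calls shown in Basic Inference. Whether `model.voice_conversion` does exactly this internally is an assumption, and `convert_manually` and `_StubCodec` are hypothetical names; the stub exists only so the example runs without downloading weights.

```python
from types import SimpleNamespace


def convert_manually(model, source, reference):
    """Decode the source's content tokens with the reference's global embedding."""
    src = model.encode(source)
    ref = model.encode(reference)
    return model.decode(
        content_token_indices=src.content_token_indices,
        global_embedding=ref.global_embedding,
    )


class _StubCodec:
    """Hypothetical stand-in that mimics the encode/decode interface."""

    def encode(self, waveform):
        return SimpleNamespace(
            content_token_indices=waveform["tokens"],
            global_embedding=waveform["speaker"],
        )

    def decode(self, content_token_indices, global_embedding):
        return {"tokens": content_token_indices, "speaker": global_embedding}


out = convert_manually(
    _StubCodec(),
    source={"tokens": [3, 1, 4], "speaker": "A"},
    reference={"tokens": [2, 7], "speaker": "B"},
)
print(out)  # {'tokens': [3, 1, 4], 'speaker': 'B'}
```

With the real model, the same function would be called as `convert_manually(model, source, reference)` on the waveforms loaded above.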
Acknowledgements
- Codec Architecture: Based on the brilliant work of kanade-tokenizer.
- Decoder Design: Inspired by XCodec2 and Inworld TTS-1.
Citation
@misc{miocodec-25hz-44.1khz-v2,
author = {Chihiro Arata},
title = {MioCodec: High-Fidelity Neural Audio Codec for Efficient Spoken Language Modeling},
year = {2026},
publisher = {Hugging Face},
journal = {Hugging Face repository},
howpublished = {\url{https://huggingface.co/Aratako/MioCodec-25Hz-44.1kHz-v2}}
}