ViiTorVoice-NAR Local Models
GitHub
Hugging Face Demo
This directory contains the local model files used by viitor-ai/viitor-voice-nar.
ViiTorVoice-NAR is a non-autoregressive speech generation model for voice cloning, local speech editing, and emotion / paralinguistic speech control. The files in this directory are split by function so each model component can be loaded independently.
Directory
local_models/
├── aligner/
│   └── Qwen3-ForcedAligner-0.6B/
├── assets/
│   └── dualcodec_silence_2s.pt
├── dualcodec/
│   ├── dualcodec_ckpts/
│   └── w2v-bert-2.0/
└── llm/
    └── 0p6_emotion/
Model Components
| Component | Path | Purpose |
|---|---|---|
| ViiTorVoice-NAR LLM | llm/0p6_emotion/ | Generates target speech tokens from text, prompt speech tokens, edit masks, duration conditions, and emotion or non-verbal tags. |
| DualCodec | dualcodec/dualcodec_ckpts/ | Converts waveform audio into discrete speech codebook tokens and decodes generated tokens back into waveform audio. |
| W2V-BERT 2.0 | dualcodec/w2v-bert-2.0/ | Extracts semantic speech features used by the DualCodec encoder. |
| Qwen3 Forced Aligner | aligner/Qwen3-ForcedAligner-0.6B/ | Aligns speech audio with text and provides timestamps for local speech editing. |
| Runtime Assets | assets/ | Stores small auxiliary files, such as precomputed silence tokens used during generation or padding. |
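Because each component lives at a fixed relative path, it can help to verify the layout before loading anything. The following is a minimal sketch, not the project's own loader: the `COMPONENT_PATHS` keys and the `resolve_components` helper are hypothetical names introduced for illustration, and only the relative paths come from the table above.

```python
from pathlib import Path

# Relative paths copied from the component table; the dictionary keys are
# illustrative labels, not identifiers used by the upstream project.
COMPONENT_PATHS = {
    "llm": "llm/0p6_emotion",
    "dualcodec": "dualcodec/dualcodec_ckpts",
    "w2v_bert": "dualcodec/w2v-bert-2.0",
    "aligner": "aligner/Qwen3-ForcedAligner-0.6B",
    "assets": "assets",
}


def resolve_components(root):
    """Return {component: Path under root}, raising if any directory is missing."""
    root = Path(root)
    resolved = {name: root / rel for name, rel in COMPONENT_PATHS.items()}
    missing = [str(path) for path in resolved.values() if not path.is_dir()]
    if missing:
        raise FileNotFoundError("Missing model directories: " + ", ".join(missing))
    return resolved
```

Calling `resolve_components("local_models")` either returns the resolved paths or fails early with a list of the directories that were not found, which is usually easier to diagnose than a loader error deep inside one submodel.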
Main Uses
- Voice cloning: synthesize new speech from target text while preserving the speaker characteristics of prompt audio.
- Local speech editing: replace only the changed region of an utterance while keeping the rest of the audio stable.
- Emotion and paralinguistic control: condition generation with tags such as emotion labels or non-verbal vocal events.
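This README does not describe how the project builds its edit masks. As a rough illustration of the idea behind local speech editing only, the sketch below marks which token frames overlap a time span to be replaced. It assumes the span is given in seconds (for example, from aligner timestamps) and that speech tokens arrive at a fixed frame rate. The `edit_mask` function and its parameters are hypothetical, and the real frame rate depends on the codec configuration.

```python
import math


def edit_mask(num_frames, start_s, end_s, frame_rate_hz):
    """Return a list of booleans; True marks frames overlapping [start_s, end_s).

    frame_rate_hz is an assumed constant token rate, not a value taken
    from the ViiTorVoice-NAR or DualCodec configuration.
    """
    # Round outward so that partially covered frames are regenerated too.
    first = max(0, math.floor(start_s * frame_rate_hz))
    last = min(num_frames, math.ceil(end_s * frame_rate_hz))
    return [first <= i < last for i in range(num_frames)]
```

Frames marked `False` would be kept from the original utterance, and frames marked `True` would be regenerated, which matches the goal of changing only the edited region.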
Notes
- Keep the directory structure unchanged unless the loading code is updated as well.
- Model weights are large binary files and are usually kept out of regular git history, for example via Git LFS or a separate download step.
- Check the upstream project and each submodel for license and usage terms.
