Multi-lingual NVASR
Multi-lingual Nonverbal Vocalization Automatic Speech Recognition
Demo Page
Dataset
Model
Multi-lingual NVASR is a speech recognition model fine-tuned from SenseVoice-Small for transcribing both regular speech and nonverbal vocalizations (NVVs) with a unified paralinguistic label taxonomy. It is a core component of the NV-Bench evaluation pipeline.
Highlights
- Multi-lingual Support: Chinese (zh), English (en)
- NVV-Aware Transcription: Accurately transcribes nonverbal vocalizations (laughter, coughs, sighs, etc.) as structured tags within text
- High-Quality General ASR: Maintains competitive CER on standard ASR benchmarks while significantly outperforming baselines on NVV-specific tasks
- Unified Label Taxonomy: Consistent paralinguistic labels across all supported languages
NVV Taxonomy
NVVs are organized into three functional levels:
| Function | Categories |
|---|---|
| Vegetative | [Cough], [Sigh], [Breathing] |
| Affect Burst | [Surprise-oh], [Surprise-ah], [Dissatisfaction-hnn], [Laughter] |
| Conversational Grunt | [Uhm], [Question-en/oh/ah/ei/huh], [Confirmation-en] |
Mandarin supports 13 NVV categories; English supports 7 categories.
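As a rough illustration of how the taxonomy can be used downstream, the sketch below stores the table as a lookup and pulls bracketed tags out of a transcript. The names `NVV_TAXONOMY`, `TAG_TO_FUNCTION`, and `find_nvv_tags` are hypothetical helpers, not part of the released model. Expanding `Question-en/oh/ah/ei/huh` into five separate tags is an assumption about the shorthand in the table, and the per-language subsets are not listed here, so they are not modeled.

```python
import re

# Tag names copied from the taxonomy table above.
NVV_TAXONOMY = {
    "Vegetative": ["Cough", "Sigh", "Breathing"],
    "Affect Burst": ["Surprise-oh", "Surprise-ah", "Dissatisfaction-hnn", "Laughter"],
    "Conversational Grunt": [
        "Uhm",
        # Assumption: "Question-en/oh/ah/ei/huh" abbreviates five separate tags.
        "Question-en", "Question-oh", "Question-ah", "Question-ei", "Question-huh",
        "Confirmation-en",
    ],
}

# Reverse lookup: tag name -> functional level.
TAG_TO_FUNCTION = {
    tag: function for function, tags in NVV_TAXONOMY.items() for tag in tags
}

# Matches anything written as [Something] without nested brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_nvv_tags(text):
    """Return (tag, function) pairs in order of appearance.

    Tags that are not in the table map to None.
    """
    return [(tag, TAG_TO_FUNCTION.get(tag)) for tag in TAG_PATTERN.findall(text)]


if __name__ == "__main__":
    print(find_nvv_tags("well [Laughter] I see [Cough]"))
```

Keeping unknown tags (mapped to `None`) rather than dropping them makes it easier to spot labels that fall outside the unified taxonomy.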
Usage
Quick Start with FunASR
from funasr import AutoModel

model = AutoModel(model="path/to/Multi-lingual-NVASR")

# Single file inference
res = model.generate(
    input="example/zh.mp3",
    language="auto",
    use_itn=True,
)
print(res[0]["text"])
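To transcribe several files, the same call can be wrapped in a small loop. The sketch below is a minimal illustration, assuming only the `generate(input=..., language=..., use_itn=...)` call and the `res[0]["text"]` result shape shown in the quick start; `transcribe_files` is a hypothetical helper, not a FunASR function.

```python
def transcribe_files(model, paths, language="auto"):
    """Run single-file inference on each path and collect the transcripts.

    `model` is expected to behave like the AutoModel instance from the
    quick start: generate() returns a list whose first item has a "text" key.
    """
    transcripts = {}
    for path in paths:
        res = model.generate(input=path, language=language, use_itn=True)
        transcripts[path] = res[0]["text"]
    return transcripts


# Usage with the model loaded in the quick start (not executed here):
# texts = transcribe_files(model, ["example/zh.mp3", "example/en.mp3"])
# for path, text in texts.items():
#     print(path, text)
```

Passing the model in as an argument keeps the helper independent of how the model was loaded.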
Evaluation Metrics
Multi-lingual NVASR supports the following evaluation metrics used in the NV-Bench pipeline:
| Metric | Description |
|---|---|
| OCER / OWER | Overall Character/Word Error Rate (text + NVV tags) |
| PCER / PWER | Paralinguistic CER/WER (NVV tags only) |
| CER / WER | Text-only error rate (NVV tags removed) |
"Our NVASR model maintains high-quality general ASR while significantly outperforming baselines on NVV-specific tasks." (NV-Bench)
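The three character-level metrics differ only in which tokens are compared. The sketch below illustrates that idea under stated assumptions: each `[Tag]` counts as a single token, every other non-space character counts as one token, and the rate is the Levenshtein distance divided by the reference length. The exact text normalization and tokenization used by NV-Bench are not described here, so the numbers from this sketch may not match the official scores; `tokenize`, `edit_distance`, `error_rate`, and `nv_metrics` are hypothetical names.

```python
import re

TAG_RE = re.compile(r"\[[^\[\]]+\]")


def tokenize(text):
    """Split text into tokens: one per [Tag], one per other non-space character."""
    return re.findall(r"\[[^\[\]]+\]|\S", text)


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by reference length."""
    if not ref_tokens:
        # Assumption: an empty reference scores 0 only if the hypothesis is empty too.
        return 0.0 if not hyp_tokens else float("inf")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def nv_metrics(ref, hyp):
    """Compute OCER (all tokens), PCER (tags only), and CER (tags removed)."""
    ref_all, hyp_all = tokenize(ref), tokenize(hyp)

    def is_tag(token):
        return TAG_RE.fullmatch(token) is not None

    return {
        "OCER": error_rate(ref_all, hyp_all),
        "PCER": error_rate([t for t in ref_all if is_tag(t)],
                           [t for t in hyp_all if is_tag(t)]),
        "CER": error_rate([t for t in ref_all if not is_tag(t)],
                          [t for t in hyp_all if not is_tag(t)]),
    }


if __name__ == "__main__":
    print(nv_metrics("hi [Laughter] ok", "hi [Cough] ok"))
```

For the word-level variants (OWER / PWER / WER), the same functions would apply with a tokenizer that splits on whitespace instead of individual characters.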
File Structure
Multi-lingual NVASR/
├── model.pt                        # Model weights (~2.8 GB)
├── config.yaml                     # Model architecture configuration
├── configuration.json              # FunASR pipeline configuration
├── am.mvn                          # Acoustic model mean-variance normalization
├── paralingustic_tokenizer.model   # SentencePiece tokenizer with NVV vocabulary
└── example/                        # Example audio files
    ├── zh.mp3                      # Chinese example
    └── en.mp3                      # English example
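Before loading the model from a local directory, it can help to confirm that the download is complete. The sketch below checks for the top-level files named in the listing above; `EXPECTED_FILES` and `missing_model_files` are hypothetical names, and the check only tests that each file exists, not that its contents are valid.

```python
from pathlib import Path

# File names copied from the listing above (spelling kept as-is).
EXPECTED_FILES = [
    "model.pt",
    "config.yaml",
    "configuration.json",
    "am.mvn",
    "paralingustic_tokenizer.model",
]


def missing_model_files(model_dir):
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files("path/to/Multi-lingual-NVASR")
    print("missing files:", missing or "none")
```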
Related Resources
- NV-Bench Project Page: https://nvbench.github.io
- NV-Bench Dataset: Hugging Face
- SenseVoice: GitHub
Citation
If you use this model, please cite:
@article{ni2026nv,
title={NV-Bench: Benchmark of Nonverbal Vocalization Synthesis for Expressive Text-to-Speech Generation},
author={Ni, Qinke and Liao, Huan and Chen, Dekun and Wang, Yuxiang and Wu, Zhizheng},
journal={arXiv preprint arXiv:2603.15352},
year={2026}
}
License
This project is licensed under the CC BY-NC 4.0 license.