Multi-lingual NVASR

Multi-lingual Nonverbal Vocalization Automatic Speech Recognition

Multi-lingual NVASR is a speech recognition model fine-tuned from SenseVoice-Small for transcribing both regular speech and nonverbal vocalizations (NVVs) with a unified paralinguistic label taxonomy. It is a core component of the NV-Bench evaluation pipeline.

Highlights

🗣️ Multi-lingual Support — Chinese (zh), English (en)
🎯 NVV-Aware Transcription — Accurately transcribes nonverbal vocalizations (laughter, coughs, sighs, etc.) as structured tags within text
📊 High-Quality General ASR — Maintains competitive CER on standard ASR benchmarks while significantly outperforming baselines on NVV-specific tasks
🏷️ Unified Label Taxonomy — Consistent paralinguistic labels across all supported languages

NVV Taxonomy

NVVs are organized into three functional levels:

| Function | Categories |
| --- | --- |
| Vegetative | `[Cough]`, `[Sigh]`, `[Breathing]` |
| Affect Burst | `[Surprise-oh]`, `[Surprise-ah]`, `[Dissatisfaction-hnn]`, `[Laughter]` |
| Conversational Grunt | `[Uhm]`, `[Question-en/oh/ah/ei/huh]`, `[Confirmation-en]` |

Mandarin supports 13 NVV categories; English supports 7 categories.

Usage

Quick Start with FunASR

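from funasr import AutoModel

# Load the model from a local directory that holds the files listed under "File Structure"
model = AutoModel(model="path/to/Multi-lingual-NVASR")

# Single file inference
res = model.generate(
    input="example/zh.mp3",
    language="auto",  # let the model detect the language
    use_itn=True,     # apply inverse text normalization to the output
)
# generate() returns one result dict per input; "text" holds the transcript
print(res[0]["text"])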
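
Because NVVs are emitted as bracketed tags inside the transcript, downstream code will often want to separate them from the spoken text. The sketch below does this with a regular expression on a plain string. The sample transcript is made up for illustration, and the helper `split_transcript` is hypothetical rather than part of FunASR. It assumes tags look like `[Laughter]` or `[Surprise-oh]`, as in the taxonomy table.

```python
import re
from typing import List, Tuple

# Assumption: an NVV tag is a capitalized label in square brackets,
# optionally followed by a lowercase suffix, e.g. "[Laughter]" or "[Surprise-oh]".
NVV_TAG = re.compile(r"\[[A-Z][A-Za-z]*(?:-[a-z]+)?\]")


def split_transcript(text: str) -> List[Tuple[str, str]]:
    """Split a transcript into ("speech", chunk) and ("nvv", tag) segments, in order."""
    segments = []
    pos = 0
    for match in NVV_TAG.finditer(text):
        chunk = text[pos:match.start()].strip()
        if chunk:
            segments.append(("speech", chunk))
        segments.append(("nvv", match.group()))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("speech", tail))
    return segments


# Made-up transcript standing in for res[0]["text"].
sample = "[Sigh] I missed the bus again [Laughter]"
print(split_transcript(sample))
# [('nvv', '[Sigh]'), ('speech', 'I missed the bus again'), ('nvv', '[Laughter]')]
```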

Evaluation Metrics

Transcripts produced by Multi-lingual NVASR are scored in the NV-Bench pipeline with the following metrics:

| Metric | Description |
| --- | --- |
| OCER / OWER | Overall character/word error rate (text + NVV tags) |
| PCER / PWER | Paralinguistic CER/WER (NVV tags only) |
| CER / WER | Text-only error rate (NVV tags removed) |
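
To make the three metric families concrete, here is a minimal Python sketch that computes the character-level variants for one reference/hypothesis pair. It is an illustration under stated assumptions, not the NV-Bench scoring code: each NVV tag counts as a single token, spoken text is tokenized per character with whitespace dropped, and the error rate is the Levenshtein distance divided by the reference length. The function names and the example strings are hypothetical.

```python
import re

# Assumption: NVV tags are bracketed labels such as "[Laughter]".
TAG = re.compile(r"\[[^\[\]]+\]")


def tokenize(text, keep_text=True, keep_tags=True):
    """Turn a transcript into tokens: one per NVV tag, one per non-space character."""
    tokens, pos = [], 0
    for m in TAG.finditer(text):
        if keep_text:
            tokens += [c for c in text[pos:m.start()] if not c.isspace()]
        if keep_tags:
            tokens.append(m.group())
        pos = m.end()
    if keep_text:
        tokens += [c for c in text[pos:] if not c.isspace()]
    return tokens


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (single-row dynamic programming)."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            prev_diag, row[j] = row[j], min(
                row[j] + 1,            # deletion
                row[j - 1] + 1,        # insertion
                prev_diag + (r != h),  # substitution (or match when equal)
            )
    return row[-1]


def error_rate(ref, hyp, **kwargs):
    """Edit distance over reference length; an empty reference uses length 1 by convention."""
    ref_tokens = tokenize(ref, **kwargs)
    hyp_tokens = tokenize(hyp, **kwargs)
    return edit_distance(ref_tokens, hyp_tokens) / max(len(ref_tokens), 1)


ref = "[Laughter] 你好啊 [Sigh]"
hyp = "[Laughter] 你好 [Cough]"

print(f"OCER: {error_rate(ref, hyp):.3f}")                   # text + tags  -> 0.400
print(f"PCER: {error_rate(ref, hyp, keep_text=False):.3f}")  # tags only    -> 0.500
print(f"CER:  {error_rate(ref, hyp, keep_tags=False):.3f}")  # text only    -> 0.333
```

The word-level variants (OWER, PWER, WER) would follow the same pattern with whitespace-delimited words in place of characters.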

Our NVASR model maintains high-quality general ASR while significantly outperforming baselines on NVV-specific tasks. — NV-Bench

File Structure

Multi-lingual NVASR/
├── model.pt # Model weights (~2.8 GB)
├── config.yaml # Model architecture configuration
├── configuration.json # FunASR pipeline configuration
├── am.mvn # Acoustic model mean-variance normalization
├── paralingustic_tokenizer.model # SentencePiece tokenizer with NVV vocabulary
├── example/ # Example audio files
│ ├── zh.mp3 # Chinese example
│ ├── en.mp3 # English example

Related Resources

NV-Bench Project Page: https://nvbench.github.io
NV-Bench Dataset: Hugging Face
SenseVoice: GitHub

Citation

If you use this model, please cite:

@article{ni2026nv,
 title={NV-Bench: Benchmark of Nonverbal Vocalization Synthesis for Expressive Text-to-Speech Generation},
 author={Ni, Qinke and Liao, Huan and Chen, Dekun and Wang, Yuxiang and Wu, Zhizheng},
 journal={arXiv preprint arXiv:2603.15352},
 year={2026}
}

License

This project is licensed under the CC BY-NC 4.0 license.

Paper for CharlesNi/Multilingual-NVASR

Paper • 2603.15352 • Published Mar 18

URL: https://huggingface.co/CharlesNi/Multilingual-NVASR