USAD: Universal Speech and Audio Representation via Distillation

Universal Speech and Audio Distillation (USAD) is a unified speech, sound, and music encoder distilled from domain-specific teachers. Trained on 126k hours of mixed data, USAD delivers competitive performance across diverse benchmarks (SUPERB, HEAR, and AudioSet) with a single model.

👀 Read Full Paper

🛠️ GitHub

🗂️ Models

All USAD models are transformer encoders operating at a 50 Hz frame rate. The teacher models are WavLM Base+ and ATST Frame.

Model	Parameters	Hidden Dim	Layers
USAD Small	24M	384	12
USAD Base	94M	768	12
USAD Large	330M	1024	24
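As a rough illustration of what a 50 Hz frame rate means for 16 kHz input, the sketch below converts between sample counts, frame counts, and timestamps. The helper names are hypothetical and not part of the USAD code; the actual number of frames the model returns may differ by a frame or so depending on how its front end pads and strides.

```python
# Minimal sketch: relate 16 kHz samples to 50 Hz output frames.
# Helper names are hypothetical; exact framing in USAD may differ slightly.
SAMPLE_RATE = 16_000   # Hz, input audio rate used in the usage example
FRAME_RATE = 50        # Hz, encoder output rate stated in this card
SAMPLES_PER_FRAME = SAMPLE_RATE // FRAME_RATE  # 320 samples, i.e. 20 ms per frame


def approx_num_frames(num_samples: int) -> int:
    """Approximate number of encoder frames for a waveform of num_samples."""
    return num_samples // SAMPLES_PER_FRAME


def frame_to_seconds(frame_index: int) -> float:
    """Start time (in seconds) of a given output frame."""
    return frame_index / FRAME_RATE


# A 10-second clip at 16 kHz has 160,000 samples.
print(approx_num_frames(10 * SAMPLE_RATE))
print(frame_to_seconds(50))
```

This kind of mapping is useful when aligning frame-level features with timestamped labels.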

🚀 How To Use

Installation

pip install -U torch torchaudio transformers

Load Model and Extract Features

import torch
from transformers import AutoModel

# Load pre-trained model
model = AutoModel.from_pretrained("MIT-SLS/USAD-Base", trust_remote_code=True).cuda().eval()

# Load audio and resample to 16kHz
wav = model.load_audio("path/to/audio").unsqueeze(0) # (batch_size, wav_len)
# wav is a float tensor on the same device as the model
# You can also load waveforms directly with torchaudio.load

# Extract features
with torch.no_grad():
    results = model(wav)

# results["x"]: model final output (batch_size, seq_len)
# results["mel"]: mel fbank (batch_size, seq_len * 2, mel_dim)
# results["hidden_states"]: list of (batch_size, seq_len, encoder_dim)
# results["ffn"]: list of (batch_size, seq_len, encoder_dim)

See usad_model.py for more details about the model.
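For clip-level tasks, one common way to use the per-layer outputs is to mix the layers with softmax-normalized weights and then average over time. The sketch below assumes `results["hidden_states"]` is a list of tensors of shape (batch_size, seq_len, encoder_dim), as listed above. The function `pool_hidden_states` is hypothetical and is not claimed to be the evaluation protocol used by the authors.

```python
import torch


def pool_hidden_states(hidden_states, layer_weights=None):
    """Mix layers with softmax weights, then mean-pool over time.

    hidden_states: list of tensors, each (batch_size, seq_len, encoder_dim)
    layer_weights: optional 1-D tensor of logits, one per layer
                   (defaults to uniform weighting)
    Returns a tensor of shape (batch_size, encoder_dim).
    Hypothetical helper; it does not mask padded frames.
    """
    stacked = torch.stack(list(hidden_states), dim=0)    # (layers, B, T, D)
    if layer_weights is None:
        layer_weights = torch.zeros(stacked.shape[0])    # uniform after softmax
    weights = torch.softmax(layer_weights, dim=0)
    weights = weights.to(dtype=stacked.dtype, device=stacked.device)
    mixed = (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # (B, T, D)
    return mixed.mean(dim=1)                                   # (B, D)


# Usage with model outputs (assumes `results` from the snippet above):
#   embedding = pool_hidden_states(results["hidden_states"])
```

In a downstream setup, `layer_weights` could be a learnable parameter trained together with a small classifier while the encoder stays frozen.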

📖 Citation

@inproceedings{chang2025usad,
 title={{USAD}: Universal Speech and Audio Representation via Distillation},
 author={Chang, Heng-Jui and Bhati, Saurabhchand and Glass, James and Liu, Alexander H.},
 booktitle={IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
 year={2025}
}

🙏 Acknowledgement

Our implementation is based on the awesome facebookresearch/fairseq, cwx-worst-one/EAT, and sooftware/conformer repositories.


Safetensors model size: 97.2M params (tensor type: F32)


Paper: 2506.18843, published Jun 23, 2025

URL: https://huggingface.co/MIT-SLS/USAD-Base
