Voozh

Contains approximately 10k hours of audio sourced from YouTube
- Videos are chosen at random, and scraped on a channel basis
- Includes news, vlogs, entertainment, stories, health
Columns
- transcript_whisper: Transcribed using Scrya/whisper-large-v2-cantonese, with alvanlii/whisper-small-cantonese as the draft model for speculative decoding
- transcript_sensevoice: Transcribed using FunAudioLLM/SenseVoiceSmall
  - Output is converted to Traditional Chinese using OpenCC
  - Event tags are split out into event_sensevoice
  - Emotion tags are split out into emotion_sensevoice
- snr: Signal-to-noise ratio, estimated with ylacombe/brouhaha-best
- c50: Speech clarity (C50), estimated with ylacombe/brouhaha-best
- emotion: Emotion label, predicted by emotion2vec/emotion2vec_plus_large
- Note that id does not reflect the order of the audio segments within the same video
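
The snr and c50 columns can be used to drop noisy or reverberant segments. The following is a minimal sketch, not part of the dataset tooling: the function name is_clean and both thresholds are illustrative assumptions, and the values are assumed to be in dB. Rows are shown as plain dictionaries so the example runs on its own.

```python
def is_clean(row, min_snr=20.0, min_c50=30.0):
    """Return True if a row meets both quality thresholds.

    The thresholds are hypothetical defaults; tune them on your own data.
    Rows with a missing snr or c50 value are rejected.
    """
    snr = row.get("snr")
    c50 = row.get("c50")
    if snr is None or c50 is None:
        return False
    return snr >= min_snr and c50 >= min_c50


# Made-up rows for illustration only; real rows also carry audio and transcripts.
rows = [
    {"id": 0, "snr": 35.2, "c50": 48.0},
    {"id": 1, "snr": 8.1, "c50": 52.3},   # low SNR
    {"id": 2, "snr": 27.9, "c50": 12.4},  # low C50
]

kept_ids = [r["id"] for r in rows if is_clean(r)]
```

The same predicate can be passed to whatever row-level filter your data loader provides.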
Processing
- The full audio is segmented with WhisperX, using Scrya/whisper-large-v2-cantonese
  - Segments are shorter than 30 s and are split by speaker
- Preliminary filtering removes phrases such as:
  - like/subscribe to YouTube channel
  - subtitles by [xxxx]
- Additional filtering is recommended for your own use
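
For additional phrase-based filtering, one option is a small list of regular expressions applied to a transcript column. The sketch below is an assumption about how such a filter might look, not the filter used to build the dataset: the patterns, the name is_boilerplate, and the choice to flag a whole segment on any match are all hypothetical.

```python
import re

# Hypothetical patterns; extend these after inspecting your own samples.
BOILERPLATE_PATTERNS = [
    re.compile(r"訂閱"),                      # "subscribe"
    re.compile(r"字幕由"),                    # "subtitles by ..."
    re.compile(r"subscribe", re.IGNORECASE),  # English variant
]


def is_boilerplate(text):
    """Return True if the transcript matches any boilerplate pattern."""
    return any(p.search(text) for p in BOILERPLATE_PATTERNS)


# Made-up transcripts for illustration only.
samples = [
    "記得訂閱我哋嘅頻道",
    "今日天氣好好",
    "字幕由 xxxx 提供",
]

flags = [is_boilerplate(s) for s in samples]
```

Substring patterns like these are deliberately broad and can flag legitimate speech that happens to mention subscriptions or subtitles, so it is worth reviewing a sample of the matches before dropping them.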
Note: An earlier version of the dataset contained duplicated data. If you downloaded it before Nov-7-2024, I recommend downloading it again.