Cantonese YouTube Pseudo-Transcription Dataset
- Contains approximately 10k hours of audio sourced from YouTube
- Videos are chosen at random, and scraped on a channel basis
- Includes news, vlogs, entertainment, stories, health
- Columns
  - `transcript_whisper`: Transcribed using Scrya/whisper-large-v2-cantonese with alvanlii/whisper-small-cantonese for speculative decoding
  - `transcript_sensevoice`: Transcribed using FunAudioLLM/SenseVoiceSmall
    - used OpenCC to convert to Traditional Chinese
    - isolated event tags to `event_sensevoice`
    - isolated emotion tags to `emotion_sensevoice`
  - `snr`: Signal-to-noise ratio, extracted from ylacombe/brouhaha-best
  - `c50`: Speech clarity, extracted from ylacombe/brouhaha-best
  - `emotion`: Emotion, extracted from emotion2vec/emotion2vec_plus_large
  - Note that `id` does not reflect the ordering of the audio within the same video
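
The `snr` and `c50` columns can serve as a simple quality gate. The following is a minimal sketch, assuming both columns hold numeric values in decibels and that rows are available as plain dictionaries; the threshold values and the helper names `keep_row` and `filter_rows` are hypothetical and not part of the dataset.

```python
from typing import Iterable, Iterator, Optional

# Hypothetical thresholds (assumed to be in dB); tune them on a sample of your data.
MIN_SNR_DB = 20.0
MIN_C50_DB = 30.0


def keep_row(row: dict, min_snr: float = MIN_SNR_DB, min_c50: float = MIN_C50_DB) -> bool:
    """Return True when both quality estimates are present and meet the thresholds."""
    snr: Optional[float] = row.get("snr")
    c50: Optional[float] = row.get("c50")
    if snr is None or c50 is None:
        # Rows with a missing estimate are dropped rather than guessed at.
        return False
    return snr >= min_snr and c50 >= min_c50


def filter_rows(
    rows: Iterable[dict],
    min_snr: float = MIN_SNR_DB,
    min_c50: float = MIN_C50_DB,
) -> Iterator[dict]:
    """Yield only the rows that pass the quality gate, preserving input order."""
    for row in rows:
        if keep_row(row, min_snr, min_c50):
            yield row
```

The same predicate could also be passed to `Dataset.filter` from the Hugging Face `datasets` library if the data is loaded that way.
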
- Processing
  - The full audio is split using WhisperX, using Scrya/whisper-large-v2-cantonese
    - it is split in <30s chunks and according to speakers
  - Preliminary filtering includes filtering out phrases like:
    - like/subscribe to YouTube channel
    - subtitles by [xxxx]
  - Additional filtering is recommended for your own use
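
For the additional filtering recommended above, a phrase-based pass over the transcript columns might look like the sketch below. The patterns are illustrative stand-ins and not the filters actually applied to the dataset; the Chinese patterns and the helper names `is_boilerplate` and `drop_boilerplate` are assumptions made for the example.

```python
import re
from typing import Iterable, Iterator, Sequence

# Illustrative patterns only; the dataset's real filter list is not published here.
BOILERPLATE_PATTERNS = [
    re.compile(r"\b(like|subscribe)\b.*\bchannel\b", re.IGNORECASE),
    re.compile(r"\bsubtitles\s+by\b", re.IGNORECASE),
    re.compile(r"訂閱"),    # "subscribe" (assumed example)
    re.compile(r"字幕由"),  # "subtitles by" (assumed example)
]

TRANSCRIPT_COLUMNS = ("transcript_whisper", "transcript_sensevoice")


def is_boilerplate(text: str) -> bool:
    """Return True if any boilerplate pattern occurs anywhere in the text."""
    return any(pattern.search(text) for pattern in BOILERPLATE_PATTERNS)


def drop_boilerplate(
    rows: Iterable[dict],
    columns: Sequence[str] = TRANSCRIPT_COLUMNS,
) -> Iterator[dict]:
    """Yield rows whose transcript columns contain no boilerplate phrase."""
    for row in rows:
        # Missing or empty transcripts are treated as empty strings.
        texts = [row.get(column) or "" for column in columns]
        if not any(is_boilerplate(text) for text in texts):
            yield row
```

Checking both transcript columns means a row is dropped when either model produced the phrase, which is the stricter of the two possible choices.
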
- Note: An earlier version of the dataset contained duplicated data. If you downloaded it before Nov-7-2024, I recommend re-downloading it
