The Dataset Viewer has been disabled on this dataset.

Transcription

Scripts for transcribing audio files using HF Buckets and Jobs.

Quick Start

Scripts run directly from their Hub URL — no clone or local checkout needed:

# 1. Download audio from Internet Archive straight into a bucket
hf jobs uv run \
 -v hf://buckets/user/audio-files:/output \
 https://huggingface.co/datasets/uv-scripts/transcription/raw/main/download-ia.py \
 SUSPENSE /output

# 2. Transcribe — audio bucket in, transcript bucket out
hf jobs uv run --flavor l4x1 -s HF_TOKEN \
 -e UV_TORCH_BACKEND=cu128 \
 -v hf://buckets/user/audio-files:/input:ro \
 -v hf://buckets/user/transcripts:/output \
 https://huggingface.co/datasets/uv-scripts/transcription/raw/main/cohere-transcribe.py \
 /input /output --language en --compile

No download/upload step. Buckets are mounted directly as volumes via hf-mount.

Local dev: if you've cloned this repo, swap the URL for the local filename (e.g. cohere-transcribe.py /input /output ...).

Scripts

Transcription

Script	Model	Backend	Output	Speed
`cohere-transcribe.py`	Cohere Transcribe (2B)	transformers	`.txt`	161x RT (A100)
`cohere-transcribe-vllm.py`	Cohere Transcribe (2B)	vLLM nightly	`.txt`	214x RT (A100)
`easytranscriber-transcribe.py`	Cohere Transcribe 2B (default) or Whisper variants	easytranscriber	JSON word timestamps (+ optional `.txt` / `.srt`)	42.9x RT (L4)

cohere-transcribe.py (recommended for plain text) — uses model.transcribe() with automatic long-form chunking, overlap, and reassembly. Stable dependencies.

cohere-transcribe-vllm.py — experimental vLLM variant. Faster but requires nightly vLLM and has minor duplication at chunk boundaries.

easytranscriber-transcribe.py — when you need word-level timestamps (subtitles, search indexing, forced alignment). Runs VAD → ASR → wav2vec2 emissions → forced alignment. Defaults to the Cohere backend so you get the same model as the other scripts with alignment on top; swap to --backend ct2 + a Whisper model for languages Cohere doesn't cover (e.g. Swedish via KBLab/kb-whisper-large).

Options — `cohere-transcribe.py` / `cohere-transcribe-vllm.py`

Flag	Default	Description
`--language`	required	en, de, fr, it, es, pt, el, nl, pl, ar, vi, zh, ja, ko
`--compile`	off	torch.compile encoder (one-time warmup, faster after)
`--batch-size`	16	Batch size for inference
`--max-files`	all	Limit files to process (for testing)

Options — `easytranscriber-transcribe.py`

Flag	Default	Description
`--language`	required	ISO 639-1 code. Cohere supports the same 14 languages as above; ct2/hf support any Whisper language
`--backend`	`cohere`	`cohere`, `ct2` (CTranslate2 Whisper, fastest for Whisper), or `hf` (transformers)
`--transcription-model`	Cohere 2B / distil-whisper-large-v3.5	HF model ID; override to use KB-Whisper, Whisper-large-v3, etc.
`--emissions-model`	per-language default	wav2vec2 for forced alignment: en→`wav2vec2-base-960h`, sv→`voxrex-swedish`, else→`facebook/mms-1b-all`
`--vad`	`silero`	`silero` (no auth) or `pyannote` (requires accepting terms + HF_TOKEN)
`--tokenizer-lang`	derived from `--language`	NLTK Punkt language name for sentence tokenization
`--emit-txt`	off	Also write `.txt` transcripts alongside the JSON alignments
`--emit-srt`	off	Also write `.srt` subtitles derived from alignment segments
`--batch-size-features`	8	Feature-extraction batch size
`--batch-size-transcribe`	16	ASR batch size (where backend supports it)
`--max-files`	all	Limit files to process (for testing)

Benchmarks

CBS Suspense (1940s radio drama), 66 episodes, 33 hours of audio.

cohere-transcribe.py (plain text):

GPU	Time	RTFx
A100-SXM4-80GB	12.3 min	161x realtime
L4	~64s / 30 min episode	28x realtime

easytranscriber-transcribe.py (JSON alignments + optional .txt/.srt; VAD → ASR → wav2vec2 → forced alignment):

GPU	Time	RTFx	Output
L4	46.2 min	42.9x realtime	66 JSON + SRT + TXT (42,633 segments, 295k words)

Data

Script	Description
`download-ia.py`	Download audio from Internet Archive into a mounted bucket

Notes

Gated model: Accept terms at the model page before use.
Tokenizer workaround: cohere-transcribe.py applies a one-line patch for a tokenizer compat issue. Will be removed once upstream fixes land (model discussion).
easytranscriber: the Cohere backend requires transformers>=5.4.0 (pinned in the script). Pyannote VAD is gated — accept terms at pyannote/segmentation-3.0 and pyannote/speaker-diarization-3.1 if using --vad pyannote. Otherwise stick with the default Silero VAD.

Downloads last month: 48

URL: https://huggingface.co/datasets/uv-scripts/transcription

⇱ uv-scripts/transcription · Datasets at Hugging Face

Transcription

Quick Start

Scripts

Transcription

Options — `cohere-transcribe.py` / `cohere-transcribe-vllm.py`

Options — `easytranscriber-transcribe.py`

Benchmarks

Data

Notes

URL: https://huggingface.co/datasets/uv-scripts/transcription

⇱ uv-scripts/transcription · Datasets at Hugging Face

Transcription

Quick Start

Scripts

Transcription

Options — cohere-transcribe.py / cohere-transcribe-vllm.py

Options — easytranscriber-transcribe.py

Benchmarks

Data

Notes

Options — `cohere-transcribe.py` / `cohere-transcribe-vllm.py`

Options — `easytranscriber-transcribe.py`