OpenAI Whisper is a powerful speech recognition system that can transcribe audio files with impressive accuracy. When combined with NVIDIA GPU acceleration through CUDA, Whisper can process audio files significantly faster than CPU-only processing. This guide demonstrates how to install and use Whisper with GPU support on Debian and Ubuntu Linux systems.
In this tutorial you will learn:
How to install OpenAI Whisper with GPU support
How to verify GPU acceleration is working
How to transcribe audio files using the command line
Software Requirements and Linux Command Line Conventions
| Category | Requirements, Conventions or Software Version Used |
| --- | --- |
| System | Debian or Ubuntu Linux with NVIDIA GPU |
| Software | NVIDIA drivers, CUDA toolkit, PyTorch with CUDA support, FFmpeg |
| Other | Python 3.8 or higher, pip package manager |
| Conventions | # – requires given linux commands to be executed with root privileges, either directly as the root user or by use of the sudo command; $ – requires given linux commands to be executed as a regular non-privileged user |
Prerequisites
Before installing Whisper with GPU support, you must have the following components installed and configured on your system:
NVIDIA Drivers: Your system needs properly installed NVIDIA proprietary drivers. Follow the driver installation guide for your distribution.
VERIFY GPU SETUP
Before proceeding, verify that your GPU is detected and CUDA is working by running nvidia-smi and checking that PyTorch can access CUDA with: python3 -c "import torch; print(torch.cuda.is_available())"
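The two checks above can also be scripted. The following is a minimal sketch, assuming only the Python standard library plus an optional PyTorch install; the function name check_gpu_setup is illustrative and not part of any library.

```python
import shutil


def check_gpu_setup():
    """Report whether nvidia-smi is on PATH and whether PyTorch can see CUDA."""
    status = {
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
        "torch_installed": False,
        "cuda_available": False,
    }
    try:
        import torch  # imported lazily so the check still runs without PyTorch
    except ImportError:
        return status
    status["torch_installed"] = True
    status["cuda_available"] = bool(torch.cuda.is_available())
    return status


# Print one line per check, e.g. "cuda_available: True" on a working setup.
for name, value in check_gpu_setup().items():
    print(f"{name}: {value}")
```

If cuda_available is False while nvidia_smi_found is True, the driver is probably fine and the PyTorch build is the more likely culprit.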
Install FFmpeg: Whisper requires FFmpeg to process audio files in various formats
# apt install ffmpeg
This package provides the necessary audio codec support for Whisper to handle MP3, WAV, and other common audio formats.
Install OpenAI Whisper: Use pip to install Whisper with all its dependencies
$ pip install openai-whisper
The installation will download and install Whisper along with required packages including tiktoken, numba, and other dependencies. This may take a few minutes depending on your internet connection.
Verify Whisper GPU Access: Confirm that Whisper can detect and use your GPU
When you load a Whisper model and print its device, the output should display cuda:0, indicating that Whisper will use your NVIDIA GPU for processing. If you see cpu instead, review your PyTorch CUDA installation.
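A minimal sketch of such a check from Python, assuming the openai-whisper package's load_model function and the device attribute of the returned model. The helper names describe_device and load_and_describe are hypothetical, and this is not necessarily the command the guide originally used.

```python
def describe_device(model):
    """Return the device a loaded Whisper model lives on, e.g. 'cuda:0' or 'cpu'."""
    return str(model.device)


def load_and_describe(model_name="base"):
    """Load a Whisper model (downloaded on first use) and report its device."""
    import whisper  # lazy import: requires the openai-whisper package

    return describe_device(whisper.load_model(model_name))
```

Calling print(load_and_describe()) loads the base model, so the first run includes the model download.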
INSTALLATION COMPLETE
Whisper is now installed and configured to use GPU acceleration. You can proceed to transcribe audio files.
Basic Whisper Usage
Whisper provides a simple command-line interface for transcribing audio files. The basic syntax is straightforward and accepts various audio formats including MP3, WAV, M4A, and others.
Download Test Audio File: First, download a sample speech audio file for testing
The test file used in the following steps, OSR_us_000_0010_8k.wav, is a free audio sample containing English speech from the Open Speech Repository.
Basic Transcription: Transcribe the test audio file using the base model
$ whisper OSR_us_000_0010_8k.wav --model base --device cuda
This command transcribes the audio file using the base model with GPU acceleration. Whisper will automatically detect the language and create several output files including text, SRT subtitles, and VTT format. The transcription should complete in just a few seconds with GPU acceleration.
Specify Language: When you know the audio language, specifying it can improve accuracy and speed
$ whisper OSR_us_000_0010_8k.wav --model base --device cuda --language English
By specifying the language, Whisper skips the language detection phase and starts transcription immediately.
Choose Output Format: Control which output formats are generated
$ whisper OSR_us_000_0010_8k.wav --model base --device cuda --output_format txt
Available formats are txt, vtt, srt, tsv, json, and all. The option takes a single value; all is the default and writes every format.
Use Different Models: Whisper offers several model sizes with different accuracy-speed tradeoffs
$ whisper OSR_us_000_0010_8k.wav --model small --device cuda
Available models from smallest to largest: tiny, base, small, medium, large. Larger models provide better accuracy but require more GPU memory and processing time.
Translate to English: Automatically translate non-English audio to English text
$ whisper your-audio-file.mp3 --model base --device cuda --task translate
This is useful when you have audio in another language but want English text output. Replace your-audio-file.mp3 with your actual audio file.
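The same operations are also available from Python. The sketch below assumes the load_model and transcribe functions of the openai-whisper package; the wrapper name transcribe_file and its defaults are illustrative choices, not part of Whisper.

```python
def transcribe_file(path, model_name="base", device="cuda",
                    language=None, task="transcribe"):
    """Transcribe (or translate to English) one audio file and return the text.

    language=None lets Whisper detect the language; task is "transcribe"
    or "translate", mirroring the --language and --task CLI options.
    """
    import whisper  # lazy import: requires the openai-whisper package

    model = whisper.load_model(model_name, device=device)
    result = model.transcribe(path, language=language, task=task)
    return result["text"].strip()
```

For example, transcribe_file("OSR_us_000_0010_8k.wav") corresponds to the basic transcription command, and passing task="translate" corresponds to the translation command.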
Understanding Whisper Models
Whisper provides five different model sizes, each offering different trade-offs between accuracy, speed, and memory requirements:
| Model | Parameters | VRAM Required | Use Case |
| --- | --- | --- | --- |
| tiny | 39M | ~1 GB | Fast processing, lower accuracy, real-time applications |
| base | 74M | ~1 GB | Good balance for most tasks, recommended starting point |
| small | 244M | ~2 GB | Better accuracy, still relatively fast |
| medium | 769M | ~5 GB | High accuracy, slower processing |
| large | 1550M | ~10 GB | Best accuracy, requires powerful GPU |
CHECK AVAILABLE VRAM
To check how much VRAM your GPU has, run nvidia-smi and look at the “Memory-Usage” column. For example, “2048MiB / 8192MiB” means 2 GB is in use out of 8 GB total, leaving about 6 GB free. Choose a model that fits your available VRAM: tiny/base (~1 GB), small (~2 GB), medium (~5 GB), large (~10 GB).
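The selection rule above can be sketched in code. This example assumes nvidia-smi's --query-gpu and --format=csv,noheader,nounits options for machine-readable output and uses the approximate VRAM figures from the table; the function names and the "largest model that fits" policy are illustrative assumptions.

```python
import subprocess

# Approximate VRAM needs in MiB, taken from the table above (1 GB ~ 1024 MiB).
MODEL_VRAM_MIB = [
    ("large", 10 * 1024),
    ("medium", 5 * 1024),
    ("small", 2 * 1024),
    ("base", 1 * 1024),
    ("tiny", 1 * 1024),
]


def free_vram_mib(query_output):
    """Parse 'total, used' MiB lines (one per GPU) and return free MiB on the first GPU."""
    first_line = query_output.strip().splitlines()[0]
    total, used = (int(field) for field in first_line.split(","))
    return total - used


def pick_model(free_mib):
    """Return the largest model whose approximate requirement fits, else None."""
    for name, needed in MODEL_VRAM_MIB:
        if free_mib >= needed:
            return name
    return None


def query_gpu_memory():
    """Ask nvidia-smi for total and used memory in MiB as plain CSV."""
    return subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
```

On a machine with a GPU, pick_model(free_vram_mib(query_gpu_memory())) returns a model name to pass to --model. The figures are approximate, so leaving some headroom is sensible.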
MODEL SELECTION TIP
Start with the base model for testing. If accuracy is insufficient, try the small or medium model. Only use the large model if you have sufficient GPU memory and require the highest possible accuracy.
GPU vs CPU Performance Comparison
One of the main advantages of using GPU acceleration with Whisper is the dramatic speed improvement over CPU processing. To demonstrate this difference, we can transcribe the same audio file using both CPU and GPU, then compare the processing times.
CPU Transcription: Time the transcription process using CPU
$ time whisper OSR_us_000_0010_8k.wav --model base --device cpu 2> /dev/null
This command transcribes the audio file using only the CPU and measures the total time taken. The 2> /dev/null redirect discards Whisper's warnings on standard error, so the timing summary printed by the shell is easier to read.
GPU Transcription: Time the same transcription using GPU acceleration
$ time whisper OSR_us_000_0010_8k.wav --model base --device cuda 2> /dev/null
This performs the identical transcription but leverages your NVIDIA GPU through CUDA. The speed difference is immediately noticeable.
The performance difference becomes even more pronounced with larger models and longer audio files. For the base model, GPU processing is often around 10-15x faster than CPU, although the exact ratio depends on the GPU, the CPU, and the length of the audio. This speed advantage makes GPU acceleration essential for processing large volumes of audio files or when working with higher accuracy models like medium or large.
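To put a number on the comparison for your own hardware, the two timed runs can be scripted. This is a rough sketch that assumes the whisper command is on PATH; the helper names are hypothetical, and wall-clock timing of a whole process also includes model loading and CUDA start-up.

```python
import subprocess
import time


def time_command(argv):
    """Run a command, discard its output, and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(argv, stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL, check=True)
    return time.perf_counter() - start


def speedup(cpu_seconds, gpu_seconds):
    """How many times faster the GPU run was than the CPU run."""
    return cpu_seconds / gpu_seconds


def compare_devices(audio_path, model="base"):
    """Time the same transcription on cpu and cuda; return (cpu_s, gpu_s, ratio)."""
    base = ["whisper", audio_path, "--model", model]
    cpu_s = time_command(base + ["--device", "cpu"])
    gpu_s = time_command(base + ["--device", "cuda"])
    return cpu_s, gpu_s, speedup(cpu_s, gpu_s)
```

Run each model once beforehand so that the one-time model download does not distort the first measurement.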
Advanced Options
Whisper supports numerous command-line options to customize transcription behavior. Here are the most useful ones:
--model MODEL: Choose the model size (tiny, base, small, medium, large)
--device cuda: Force GPU usage (though Whisper uses GPU by default when available)
--language LANGUAGE: Specify the audio language to skip detection
--task transcribe|translate: Either transcribe in original language or translate to English
--output_format FORMAT: Choose output format (txt, vtt, srt, tsv, json, or all, which is the default)
--output_dir DIRECTORY: Specify where to save output files
--verbose False: Reduce console output during processing
--temperature 0: Use greedy decoding at temperature 0 for more consistent results (0 is already the default starting temperature)
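When several of these options are combined in a script, assembling the argument list in one place keeps the calls readable. The sketch below only uses flags from the list above; the function build_whisper_command is a hypothetical helper, not part of Whisper.

```python
def build_whisper_command(audio_path, model="base", device=None, language=None,
                          task=None, output_format=None, output_dir=None):
    """Assemble a whisper CLI argument list from the options listed above.

    Options left as None are omitted so the CLI falls back to its defaults.
    """
    argv = ["whisper", audio_path, "--model", model]
    optional = {
        "--device": device,
        "--language": language,
        "--task": task,
        "--output_format": output_format,
        "--output_dir": output_dir,
    }
    for flag, value in optional.items():
        if value is not None:
            argv += [flag, str(value)]
    return argv
```

The resulting list can be passed to subprocess.run(argv, check=True) to start the transcription.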
Monitoring GPU Usage
To verify that Whisper is actually using your GPU during transcription, you can monitor GPU activity in real-time:
$ watch -n 1 nvidia-smi
Run this command in a separate terminal window while Whisper is processing audio. You should see GPU utilization increase and memory usage spike during transcription. This confirms that GPU acceleration is working properly.
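If you would rather log GPU activity than watch it, a small poller gives similar information. This sketch assumes nvidia-smi's --query-gpu=utilization.gpu,memory.used query fields and reads only the first GPU; the function names are illustrative.

```python
import subprocess
import time


def parse_gpu_sample(line):
    """Parse one 'utilization %, memory used MiB' CSV line into two ints."""
    utilization, memory_used = (int(field) for field in line.split(","))
    return utilization, memory_used


def sample_gpu():
    """Read current utilization and memory use of the first GPU from nvidia-smi."""
    output = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_sample(output.strip().splitlines()[0])


def watch_gpu(samples=10, interval=1.0):
    """Print a few samples, one per interval, similar to watch -n 1."""
    for _ in range(samples):
        utilization, memory_used = sample_gpu()
        print(f"GPU {utilization}% busy, {memory_used} MiB in use")
        time.sleep(interval)
```

Call watch_gpu() in a second terminal while a transcription is running; utilization and memory use should rise noticeably.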
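Troubleshooting
CUDA Not Available: If Whisper falls back to the CPU, check whether PyTorch can access CUDA by running python3 -c "import torch; print(torch.cuda.is_available())"
If this returns False, your PyTorch installation does not have CUDA support. Reinstall PyTorch with CUDA support as described in the prerequisites.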
FFmpeg Not Found Error: If you encounter “No such file or directory: 'ffmpeg'”
# apt install ffmpeg
Whisper requires FFmpeg to decode audio files. Install it using your distribution’s package manager.
Out of Memory Error: If you see CUDA out of memory errors
$ whisper audiofile.mp3 --model tiny --device cuda
Try using a smaller model that requires less GPU memory. The tiny or base models work well on GPUs with 4GB or less VRAM.
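The "try a smaller model" advice can be automated. The sketch below assumes that PyTorch reports CUDA out-of-memory conditions as a RuntimeError (or a subclass of it) whose message contains "out of memory"; the transcribe argument is a placeholder for whatever wrapper you use around Whisper.

```python
def transcribe_with_fallback(path, transcribe,
                             models=("medium", "small", "base", "tiny")):
    """Try each model in turn, moving to a smaller one on CUDA out-of-memory.

    `transcribe` is any callable taking (path, model_name) and returning text.
    Returns a (model_name, text) tuple for the first model that succeeds.
    """
    for name in models:
        try:
            return name, transcribe(path, name)
        except RuntimeError as error:
            # Assumption: OOM errors mention "out of memory"; re-raise anything else.
            if "out of memory" not in str(error).lower():
                raise
    raise RuntimeError("no model in the list fit in GPU memory")
```

In a long-running process, memory held by the failed attempt may need to be released before the next try, so running each attempt as a separate whisper command is the simpler option.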
Model Download Issues: If model downloads fail or are interrupted
Whisper downloads models to ~/.cache/whisper/ on first use. Delete this directory and try again if downloads are corrupted:
$ rm -rf ~/.cache/whisper/
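Deleting the whole directory forces every model to be downloaded again. A more selective approach is sketched below, assuming the cache location given above; the helper names are hypothetical, and the file names in your cache may differ from the examples.

```python
from pathlib import Path


def _cache_path(cache_dir=None):
    """Resolve the cache directory, defaulting to ~/.cache/whisper."""
    return Path(cache_dir) if cache_dir else Path.home() / ".cache" / "whisper"


def list_cached_models(cache_dir=None):
    """Return sorted (file name, size in bytes) pairs for files in the cache."""
    cache = _cache_path(cache_dir)
    if not cache.is_dir():
        return []
    return sorted((p.name, p.stat().st_size) for p in cache.iterdir() if p.is_file())


def remove_cached_model(file_name, cache_dir=None):
    """Delete one cached file; return True if it existed."""
    target = _cache_path(cache_dir) / file_name
    if target.is_file():
        target.unlink()
        return True
    return False
```

A file that is much smaller than the others of its kind is a likely sign of an interrupted download; removing just that file makes Whisper fetch it again on the next run.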
Conclusion
OpenAI Whisper with GPU acceleration provides fast and accurate speech-to-text transcription on Debian and Ubuntu systems. By leveraging NVIDIA CUDA, transcription tasks that would take minutes on CPU can complete in seconds. The command-line interface is straightforward, making it accessible for users who need to transcribe audio files without complex setup or programming knowledge. Start with the base model and adjust to larger models if you need better accuracy or smaller models if you need faster processing.