If, like me, you tend to forget most of a meeting by the time it ends, the recording becomes the only way to recap the discussion. But going through an hour-long meeting recording is tedious. That’s where transcription can ease the process.

Transcribing a long meeting recording usually means uploading your audio to a remote server, relying on cloud processing, and trusting third-party services with potentially sensitive conversations. Cloud transcription services are convenient, but they come with trade-offs in privacy, cost, and network dependency. While dealing with this problem, I came across Whisper, an open-source speech-to-text model from OpenAI released under the MIT License.

Local AI models are changing the equation, and modern GPUs are powerful enough to run reasonably large models locally. Since I already had Whisper and a capable GPU on my PC, I built a small dashboard around OpenAI’s speech-to-text model that runs entirely offline. The results were surprisingly fast. A 30-minute audio clip was processed in just under a minute without any external API calls.

Running Whisper locally is faster than most people expect

A 30-minute recording processed in under a minute

Credit: Shekhar Vaidya

It is commonly assumed that transcription always requires cloud infrastructure, and this was largely true when models were large and slow. The situation is very different today: modern GPUs are far more powerful, and local models are significantly better optimized. Local speech-to-text models can now process an hour-long recording much faster than real-time playback.

I tested this myself and got surprisingly good results. For context, I used faster-whisper, a re-implementation of OpenAI’s Whisper model using CTranslate2, and my PC has a Ryzen 7 7700X CPU and an RTX 4070 Ti GPU. I used a 30-minute, 34-second audio file (1,834 seconds), which was processed in 55.85 seconds, or roughly 33× real-time speed.

Translating these benchmarks into real-world expectations, an hour-long meeting recording can be processed in under two minutes and a 90-minute podcast in under three. In other words, transcription finishes in a small fraction of the playback time.
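The arithmetic behind these figures is easy to check. The short sketch below recomputes the speed factor from the measured run and projects it onto longer recordings. The function names are my own, and the projection assumes the speed factor stays constant for longer files, which real runs may not match exactly.

```python
def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Return how many seconds of audio are processed per second of wall-clock time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def estimated_processing_seconds(audio_seconds: float, factor: float) -> float:
    """Project processing time for another recording at the same speed factor."""
    return audio_seconds / factor


# Measured run from the article: 30 min 34 s of audio processed in 55.85 s.
measured = speed_factor(30 * 60 + 34, 55.85)

for label, minutes in [("60-minute meeting", 60), ("90-minute podcast", 90)]:
    eta = estimated_processing_seconds(minutes * 60, measured)
    print(f"{label}: about {eta:.0f} s at {measured:.1f}x real time")
```

Under that constant-speed assumption, the estimates come out at roughly 110 seconds for the meeting and 164 seconds for the podcast.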

A simple dashboard turns Whisper into a practical transcription tool

Upload audio, pick a model, and watch real-time performance

The local Whisper model is easy to use and runs fine from a terminal window, but I built a small dashboard to manage the workflow. The dashboard let me test Whisper performance locally and monitor transcription speed through a simple interface instead of running CLI commands.

The flow is simple. First, upload the audio, either by selecting the recording through File Explorer or by dragging it onto the interface. Next, choose the Whisper model you would like to use, such as Tiny, Base, Small, Medium, or Large, depending on your hardware headroom. Then select whether to process it on your CPU or GPU. Once processing starts, the results begin appearing in real time.
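On the backend, the two dashboard choices could map to load-time options for faster-whisper. The sketch below is one plausible way to validate them. The function name, the label-to-size table, and the compute type picked for each device are my assumptions, not the project's actual code.

```python
# Dashboard labels mapped to faster-whisper model size names.
# Mapping "Large" to "large-v3" is an assumption; other large variants exist.
MODEL_SIZES = {
    "Tiny": "tiny",
    "Base": "base",
    "Small": "small",
    "Medium": "medium",
    "Large": "large-v3",
}

# A common pairing: half precision on the GPU, 8-bit integers on the CPU.
COMPUTE_TYPES = {"cuda": "float16", "cpu": "int8"}


def build_model_options(model_label: str, device: str) -> dict:
    """Validate the dashboard selections and return keyword arguments for the model."""
    if model_label not in MODEL_SIZES:
        raise ValueError(f"unknown model: {model_label!r}")
    if device not in COMPUTE_TYPES:
        raise ValueError(f"unknown device: {device!r}")
    return {
        "model_size_or_path": MODEL_SIZES[model_label],
        "device": device,
        "compute_type": COMPUTE_TYPES[device],
    }


options = build_model_options("Small", "cuda")
print(options)
```

With faster-whisper installed, the returned dictionary could be passed straight through as `WhisperModel(**options)`.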

On completion, you get a structured, timestamped transcript along with essential metrics such as audio duration, processing time, speed factor, word count, segment count, language, and the model and device used. The transcript is easy to read, and you can export it as TXT, SRT, or VTT.
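SRT and VTT are both plain-text subtitle formats that differ mainly in their header and timestamp punctuation, so exporting timestamped segments takes very little code. Here is a minimal sketch; the `Segment` class and function names are hypothetical stand-ins rather than the dashboard's own code.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio
    text: str


def format_timestamp(seconds: float, decimal_marker: str) -> str:
    """Format seconds as HH:MM:SS<marker>mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{ms:03d}"


def to_srt(segments: list[Segment]) -> str:
    """SRT: numbered cues, with a comma before the milliseconds."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg.start, ",")
        end = format_timestamp(seg.end, ",")
        cues.append(f"{index}\n{start} --> {end}\n{seg.text.strip()}\n")
    return "\n".join(cues)


def to_vtt(segments: list[Segment]) -> str:
    """WebVTT: a WEBVTT header line, with a dot before the milliseconds."""
    cues = []
    for seg in segments:
        start = format_timestamp(seg.start, ".")
        end = format_timestamp(seg.end, ".")
        cues.append(f"{start} --> {end}\n{seg.text.strip()}\n")
    return "WEBVTT\n\n" + "\n".join(cues)


demo = [Segment(0.0, 4.2, "Welcome, everyone."), Segment(4.2, 9.75, "Let's get started.")]
print(to_srt(demo))
print(to_vtt(demo))
```

A TXT export is simpler still: join the segment texts with newlines and drop the timestamps.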

The stack behind the project is surprisingly modern

FastAPI, faster-whisper, and SSE handle the heavy lifting

Credit: Shekhar Vaidya/XDA

As mentioned earlier, the local Whisper model can run directly from a terminal as well, but I have a knack for building applications with good UI and UX. I chose the tech stack to be lightweight and modular. The full-stack application pairs a FastAPI backend running faster-whisper with a React frontend.

Why did I choose faster-whisper over the original Whisper from OpenAI? It is an optimized re-implementation of the original model built on CTranslate2. It is considerably faster and more memory-friendly than OpenAI’s reference implementation, making it practical for local deployments.
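For readers who have not used the library, a transcription call with faster-whisper can be sketched as follows. The helper names and default arguments are my own choices for illustration, and the timing covers decoding only, not model loading, which may differ from how the dashboard measures it.

```python
import time


def summarize(texts: list[str], audio_seconds: float, elapsed: float, language: str) -> dict:
    """Collect the transcript and basic metrics into one dictionary."""
    full_text = " ".join(texts)
    return {
        "text": full_text,
        "audio_seconds": audio_seconds,
        "processing_seconds": elapsed,
        "speed_factor": audio_seconds / elapsed if elapsed > 0 else 0.0,
        "word_count": len(full_text.split()),
        "segment_count": len(texts),
        "language": language,
    }


def transcribe_with_timing(audio_path: str, model_size: str = "small",
                           device: str = "cuda", compute_type: str = "float16") -> dict:
    """Transcribe one file with faster-whisper and report basic metrics."""
    # Imported lazily so this module still loads where faster-whisper is missing.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device=device, compute_type=compute_type)

    started = time.perf_counter()
    segments, info = model.transcribe(audio_path)
    # `segments` is a generator: decoding actually happens while iterating over it.
    texts = [segment.text.strip() for segment in segments]
    elapsed = time.perf_counter() - started

    return summarize(texts, info.duration, elapsed, info.language)
```

Calling `transcribe_with_timing("meeting.mp3")` (a placeholder file name) would need faster-whisper installed and, with these defaults, a CUDA-capable GPU. Passing `device="cpu"` and `compute_type="int8"` is the usual fallback.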

The FastAPI and Uvicorn backend handles audio uploads, model execution, the transcription pipeline, and metrics generation. I used ffprobe, part of the FFmpeg toolkit, to read the audio length before transcription starts, which makes real-time speed calculations possible. Although Whisper has built-in audio length detection, I chose this approach because I found it more reliable and faster.
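Reading a file's duration with ffprobe takes a single command, which can be wrapped in Python as sketched below. The helper names are hypothetical, and this is one common ffprobe invocation rather than necessarily the one the dashboard uses.

```python
import subprocess


def build_ffprobe_command(audio_path: str) -> list[str]:
    """Ask ffprobe to print only the container duration, in seconds."""
    return [
        "ffprobe",
        "-v", "error",                                 # suppress banner and warnings
        "-show_entries", "format=duration",            # request just the duration field
        "-of", "default=noprint_wrappers=1:nokey=1",   # print the bare value
        audio_path,
    ]


def parse_duration(ffprobe_output: str) -> float:
    """Convert ffprobe's text output to seconds; raises ValueError if it is not a number."""
    return float(ffprobe_output.strip())


def probe_duration(audio_path: str) -> float:
    """Run ffprobe (must be on PATH) and return the audio length in seconds."""
    result = subprocess.run(
        build_ffprobe_command(audio_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if ffprobe fails
    )
    return parse_duration(result.stdout)
```

Knowing the duration up front is what lets a progress bar or speed meter compare the timestamp of the latest segment against the total length while transcription is still running.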

The React frontend, built with Vite and Tailwind CSS, keeps the interface modern and responsive. It renders the dashboard and displays transcript segments as they arrive. Server-Sent Events (SSE) and the browser’s EventSource API let the speed meter and metrics panel update in real time.
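On the server side, SSE is just a long-lived HTTP response with the `text/event-stream` content type, where each message is a few text lines followed by a blank line. Here is a minimal sketch of how a FastAPI backend might stream segments. The event names, payload fields, and the `/events` route are assumptions for illustration.

```python
import json
from typing import Iterable, Iterator


def format_sse(event: str, payload: dict) -> str:
    """Encode one Server-Sent Event: an event name, a JSON data line, a blank line."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def transcript_events(segments: Iterable[dict], audio_seconds: float) -> Iterator[str]:
    """Yield one `segment` event per transcript segment, then a final `done` event."""
    count = 0
    for seg in segments:
        count += 1
        progress = min(seg["end"] / audio_seconds, 1.0) if audio_seconds > 0 else 0.0
        yield format_sse("segment", {**seg, "progress": round(progress, 3)})
    yield format_sse("done", {"segment_count": count})


def create_app():
    """Build a FastAPI app exposing the stream (requires fastapi to be installed)."""
    from fastapi import FastAPI
    from fastapi.responses import StreamingResponse

    app = FastAPI()

    @app.get("/events")  # hypothetical route name
    def events():
        # Placeholder data; a real endpoint would stream live transcription output.
        demo = [{"start": 0.0, "end": 4.0, "text": "Hello."}]
        return StreamingResponse(transcript_events(demo, 8.0), media_type="text/event-stream")

    return app
```

In the browser, an `EventSource` pointed at that route would receive these messages through listeners registered for the `segment` and `done` event names.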

A few small tweaks make Whisper transcripts much easier to read

Simple backend tweaks dramatically improve transcript readability

Credit: Shekhar Vaidya/XDA

Whisper’s default output already works well, but a couple of small practical tweaks made the tool easier to use and the transcript easier to read.

By default, Whisper splits its output into very small fragments wherever the speaker pauses. A new fragment appears every few seconds, often as a broken sentence, which results in a messy transcript. To make the transcripts easier to read, I implemented sentence-based segmenting. Instead of emitting each 2–3 second segment as it arrives, the system now buffers these chunks in memory until sentence-ending punctuation appears or a time threshold (30 seconds in my code) is reached.

Here is how the code looked:

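# Intelligent Sentence-Based Segmenting
# `segments` is the iterable of transcribed chunks, `Segment` is a small mutable
# container with start, end, and text fields, and `results` collects the merged output.
current_merged_segment = None

for segment in segments:
    if current_merged_segment is None:
        # Start a new buffer from the first chunk of a sentence
        current_merged_segment = Segment(start=segment.start, end=segment.end, text=segment.text.strip())
    else:
        # Buffer and append the tiny audio chunk
        current_merged_segment.end = segment.end
        current_merged_segment.text += " " + segment.text.strip()

    # Wait for a full stop or a 30-second failsafe before broadcasting to the UI
    text_stripped = current_merged_segment.text.strip()
    if text_stripped.endswith(('.', '?', '!')) or (current_merged_segment.end - current_merged_segment.start >= 30.0):
        results.append(current_merged_segment)
        current_merged_segment = None  # Reset buffer

# Flush whatever is still buffered if the audio ends mid-sentence
if current_merged_segment is not None:
    results.append(current_merged_segment)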
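To see the buffering in action without running a model, the same idea can be packaged as a generator and fed a few hand-made fragments. This is an illustrative variant, not the dashboard's code: the `Segment` dataclass, the function name, and the sample text are made up, and it also emits any trailing text left in the buffer when the input ends.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

SENTENCE_ENDINGS = (".", "?", "!")


@dataclass
class Segment:
    start: float
    end: float
    text: str


def merge_into_sentences(chunks: Iterable[Segment], max_seconds: float = 30.0) -> Iterator[Segment]:
    """Merge short chunks until a sentence ends or the buffer spans max_seconds."""
    buffer = None
    for chunk in chunks:
        text = chunk.text.strip()
        if buffer is None:
            buffer = Segment(chunk.start, chunk.end, text)
        else:
            buffer.end = chunk.end
            buffer.text = f"{buffer.text} {text}".strip()
        if buffer.text.endswith(SENTENCE_ENDINGS) or buffer.end - buffer.start >= max_seconds:
            yield buffer
            buffer = None
    if buffer is not None:
        yield buffer  # flush an unfinished sentence at the end of the audio


fragments = [
    Segment(0.0, 2.1, " So the first thing"),
    Segment(2.1, 4.8, " we need to decide is the launch date."),
    Segment(4.8, 6.0, " Any objections?"),
    Segment(6.0, 8.5, " I think we"),
]
merged = list(merge_into_sentences(fragments))
for seg in merged:
    print(f"[{seg.start:.1f} -> {seg.end:.1f}] {seg.text}")
```

The four fragments collapse into three segments: one full sentence spanning the first two chunks, the short question, and the unfinished tail.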

Another major pain point of running a local AI model on a Windows system is getting the CUDA paths right. Local ML on Windows typically requires manually setting the CUDA paths and editing environment variables. To avoid that, I installed the CUDA libraries via pip and added an automatic DLL discovery step. At startup, the script finds the cuBLAS and cuDNN libraries in the Python site-packages directories and injects them into the runtime DLL search path.

Here is how the code looked:

# Auto-discovery for CUDA DLLs if installed via pip (common for Windows)
import os
import site

if os.name == 'nt':
    # Find all site-packages directories, including Python 3.13 quirks
    packages_dirs = list(site.getsitepackages())
    if hasattr(site, 'getusersitepackages'):
        packages_dirs.append(site.getusersitepackages())

    for base_dir in packages_dirs:
        for pkg in ['nvidia/cublas/bin', 'nvidia/cudnn/bin', 'nvidia/cudnn/lib']:
            dll_path = os.path.abspath(os.path.join(base_dir, pkg))
            if os.path.exists(dll_path):
                # Dynamically inject into Windows DLL search path and system PATH
                os.add_dll_directory(dll_path)
                os.environ['PATH'] = dll_path + os.pathsep + os.environ.get('PATH', '')

Whisper proves local AI can replace cloud tools

We have come a long way: a tedious task like speech-to-text, which once depended on cloud infrastructure, can now run entirely on personal hardware without contacting external servers. Building this small dashboard showed me that powerful models can be turned into practical tools without large infrastructure.

For anyone who deals with meetings, lectures, or seminars regularly, running Whisper locally on personal hardware can be surprisingly efficient compared to cloud-based services. No wait time. No network dependency. No billing stress.