Last indexed: 19 April 2025 (0747dc)

Audio Manipulation

This page documents how to capture, save, and manipulate audio data using the SpeechRecognition library. It covers the core audio handling capabilities, including recording from audio sources, converting between formats, and saving audio data to files.

For information about calibrating speech detection sensitivity, see Energy Threshold Adjustment.

Audio Data Structure

The SpeechRecognition library represents audio using the AudioData class, which encapsulates the raw audio frames along with metadata about sample rate and sample width.
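As a rough illustration of what that metadata allows, the sketch below derives the sample count and duration from an AudioData-like object. It assumes the object exposes `frame_data`, `sample_rate`, and `sample_width` attributes named after the constructor parameters, and that the audio is mono; `describe_audio` is a hypothetical helper, not part of the library.

```python
from types import SimpleNamespace


def describe_audio(audio):
    """Summarize an AudioData-like object.

    Relies only on the frame_data, sample_rate and sample_width
    attributes, and assumes single-channel (mono) audio.
    """
    bytes_per_second = audio.sample_rate * audio.sample_width
    return {
        "sample_rate_hz": audio.sample_rate,
        "sample_width_bytes": audio.sample_width,
        "num_samples": len(audio.frame_data) // audio.sample_width,
        "duration_seconds": len(audio.frame_data) / bytes_per_second,
    }


# Stand-in object so the sketch runs without the library installed;
# a real AudioData instance is assumed to expose the same attributes.
one_second_of_silence = SimpleNamespace(
    frame_data=b"\x00\x00" * 16000, sample_rate=16000, sample_width=2
)
print(describe_audio(one_second_of_silence))
```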

Sources: speech_recognition/__init__.py28

AudioData is imported into speech_recognition/__init__.py from a separate module, so its structure is inferred here from how it is used throughout the codebase.

Audio Capture

Audio can be captured from various sources such as a microphone or an audio file. The Recognizer class provides methods for this purpose.

Capturing from Microphone

The most common way to capture audio is the listen() method, which uses the energy threshold to detect where a phrase starts and ends. The alternative is the record() method, which records for a fixed duration.

Sources: speech_recognition/__init__.py333-365 speech_recognition/__init__.py442-568

The key difference between these methods:

record() - Records for a specified duration or until there is no more audio input
listen() - Waits for speech to begin, then records until the person stops speaking
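The difference can be sketched with two small wrappers. The helper names and default values are illustrative assumptions; the `timeout`, `phrase_time_limit`, and `duration` keyword arguments are believed to match the library's `listen()` and `record()` signatures, but check the version you have installed. `main()` is only defined, not called, because it needs a working microphone.

```python
def capture_phrase(recognizer, source, timeout=5, phrase_time_limit=10):
    """Wait for speech, then record until the speaker pauses."""
    return recognizer.listen(
        source, timeout=timeout, phrase_time_limit=phrase_time_limit
    )


def capture_fixed(recognizer, source, seconds=3):
    """Record for a fixed number of seconds, whether or not anyone speaks."""
    return recognizer.record(source, duration=seconds)


def main():
    # Imported here so the helpers above can be read and tested without
    # the library; Microphone additionally requires PyAudio.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        phrase = capture_phrase(recognizer, source)
        clip = capture_fixed(recognizer, source)
    print(len(phrase.frame_data), len(clip.frame_data))
```

With a `timeout` set, `listen()` is expected to raise an error if no speech starts in time, so a real program would normally wrap the call in a `try` block.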

Capturing from Audio Files

The AudioFile class allows you to use an existing audio file as an audio source.
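A minimal sketch of reading a whole file follows. `write_silent_wav` is a hypothetical helper that only creates a test input with the standard `wave` module; `load_audio_file` shows the assumed pattern of opening `AudioFile` as a context manager and calling `record()` with no duration.

```python
import wave


def write_silent_wav(path, seconds=1, sample_rate=16000):
    """Create a mono 16-bit PCM WAV file of silence to use as test input."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))


def load_audio_file(path):
    """Read an entire audio file into an AudioData object."""
    import speech_recognition as sr  # third-party: SpeechRecognition

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        # With no duration given, record() reads until the file ends.
        return recognizer.record(source)
```

With the library installed, calling `write_silent_wav("silence.wav")` and then `load_audio_file("silence.wav")` should return about one second of audio.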

Sources: speech_recognition/__init__.py202-215 speech_recognition/__init__.py333-365

Audio Format Conversion

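The AudioData class provides methods that return the audio as bytes in several formats: raw PCM (get_raw_data()), WAV (get_wav_data()), AIFF (get_aiff_data()), and FLAC (get_flac_data()).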

Sources: examples/write_audio.py13-27

Example: Converting Between Formats

The following table lists the method to call for each target format:

From	To	Method
AudioData	Raw bytes	`audio_data.get_raw_data()`
AudioData	WAV	`audio_data.get_wav_data()`
AudioData	AIFF	`audio_data.get_aiff_data()`
AudioData	FLAC	`audio_data.get_flac_data()`

Each conversion method accepts optional parameters that adjust the output: convert_rate sets the output sample rate in Hz and convert_width sets the output sample width in bytes.
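As a sketch of how those parameters might be used, the helper below requests 16 kHz, 16-bit output in two formats. `convert_for_export` and its defaults are illustrative; the `convert_rate` and `convert_width` keyword names are assumed to match the installed version of the library.

```python
def convert_for_export(audio, rate_hz=16000, width_bytes=2):
    """Return (raw, wav) bytes resampled to rate_hz with width_bytes samples."""
    raw = audio.get_raw_data(convert_rate=rate_hz, convert_width=width_bytes)
    wav = audio.get_wav_data(convert_rate=rate_hz, convert_width=width_bytes)
    return raw, wav
```

Converting once at export time keeps the original AudioData untouched, so the same capture can be exported at several rates.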

Manipulating Audio Properties

When obtaining audio data, you can specify or adjust various properties:

Sample Rate

The sample rate (in Hz) is the number of audio samples taken per second. A higher sample rate can represent higher frequencies (up to half the sample rate) but produces proportionally more data to store, transmit, and process.

Sample Width

The sample width (in bytes) determines the precision of each audio sample. The SpeechRecognition library typically uses 16-bit (2-byte) samples.

Channels

The library generally works with mono audio (1 channel). If stereo audio is provided, it's automatically converted to mono.

Sources: speech_recognition/__init__.py89-92

Saving Audio to Files

To save captured audio, convert it to the desired format and write the resulting bytes to a file opened in binary mode.
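One way to sketch this is a helper that picks the conversion method from the file suffix. `FORMAT_METHODS` and `save_audio` are hypothetical names introduced for illustration; only the four `get_*_data()` methods come from the library.

```python
from pathlib import Path

# File suffix -> name of the AudioData method that produces that format.
FORMAT_METHODS = {
    ".raw": "get_raw_data",
    ".wav": "get_wav_data",
    ".aiff": "get_aiff_data",
    ".flac": "get_flac_data",
}


def save_audio(audio, path):
    """Write audio to path, choosing the format from the file suffix."""
    path = Path(path)
    method_name = FORMAT_METHODS.get(path.suffix.lower())
    if method_name is None:
        raise ValueError(f"unsupported suffix: {path.suffix!r}")
    data = getattr(audio, method_name)()
    path.write_bytes(data)  # conversion methods return bytes, so write binary
    return len(data)
```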

Sources: examples/write_audio.py13-27

Common Audio Manipulation Tasks

Here are some common tasks involving audio manipulation with the SpeechRecognition library:

Recording and Saving Audio
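The sketch below is loosely modeled on the flow in examples/write_audio.py: capture one phrase, then write it out as WAV. The function name and the default file name are illustrative assumptions, and `main()` is defined but not called because it needs a microphone and PyAudio.

```python
def record_and_save(recognizer, source, wav_path="microphone-results.wav"):
    """Capture one phrase from source and write it to wav_path as WAV."""
    audio = recognizer.listen(source)
    with open(wav_path, "wb") as wav_file:
        wav_file.write(audio.get_wav_data())
    return audio


def main():
    import speech_recognition as sr  # Microphone also needs PyAudio

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        print("Say something!")
        record_and_save(recognizer, source)
```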

Sources: examples/write_audio.py8-19

Recording for a Specific Duration

To record for a fixed duration rather than waiting for a phrase, pass a duration (in seconds) to record().
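A fixed duration is also a convenient way to split a long file into pieces. The sketch below assumes that consecutive `record()` calls on the same open source continue where the previous call stopped, and that an exhausted source yields empty `frame_data`. Chunk boundaries are approximate because audio is read in fixed-size buffers. `record_in_chunks` and the file name are hypothetical.

```python
def record_in_chunks(recognizer, source, chunk_seconds=5):
    """Yield consecutive AudioData chunks of about chunk_seconds each.

    The source must stay open (inside its `with` block) while iterating.
    Intended for file sources; a live microphone never runs out of audio.
    """
    while True:
        chunk = recognizer.record(source, duration=chunk_seconds)
        if not chunk.frame_data:  # empty result: the source is exhausted
            break
        yield chunk


def main(path="speech.wav"):  # hypothetical file name
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        for index, chunk in enumerate(record_in_chunks(recognizer, source)):
            print(index, len(chunk.frame_data))
```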

Sources: speech_recognition/__init__.py333-365

Recording with an Offset

To skip the beginning of a source, pass an offset (in seconds) to record(). Recording starts once that much audio has been read and discarded.
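Combining `offset` with `duration` extracts a window from the middle of a source. A minimal sketch, where `record_window` is a hypothetical wrapper and both values are in seconds:

```python
def record_window(recognizer, source, start_seconds, length_seconds=None):
    """Skip start_seconds of audio, then record length_seconds (or to the end)."""
    return recognizer.record(
        source, offset=start_seconds, duration=length_seconds
    )
```

For example, `record_window(recognizer, source, 2.0, 5.0)` would skip roughly the first two seconds and keep roughly the next five.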

Sources: speech_recognition/__init__.py333-365

Converting Between Formats

Any AudioData instance can be converted to raw, WAV, AIFF, or FLAC bytes, regardless of whether it was captured from a microphone or read from a file.
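To check what a conversion produced, the WAV bytes can be inspected with the standard `wave` module. In the sketch below, `wav_header_info` is a hypothetical helper that would accept the output of `get_wav_data()`, and `make_wav_bytes` is a stdlib-only stand-in for that output so the example runs without the library.

```python
import io
import wave


def wav_header_info(wav_bytes):
    """Read channel count, sample width, rate and frame count from WAV bytes."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width": wav.getsampwidth(),
            "sample_rate": wav.getframerate(),
            "frames": wav.getnframes(),
        }


def make_wav_bytes(frame_data, sample_rate, sample_width):
    """Wrap raw mono PCM frames in a WAV container using only the stdlib."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(frame_data)
    return buffer.getvalue()


print(wav_header_info(make_wav_bytes(b"\x00\x00" * 8000, 8000, 2)))
```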

Sources: examples/write_audio.py13-27

Audio Processing Flow

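Audio data typically flows through the SpeechRecognition library as follows: an audio source (a microphone or an AudioFile) is read by record() or listen(), which returns an AudioData instance. That instance is then converted with get_raw_data(), get_wav_data(), get_aiff_data(), or get_flac_data(), and the resulting bytes are written to a file.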

Sources: speech_recognition/__init__.py333-365 speech_recognition/__init__.py442-568 examples/write_audio.py13-27

Behind the Scenes: AudioData Creation

When audio is captured using record() or listen(), an AudioData instance is created with the following parameters:

frame_data: The raw audio frames captured (as bytes)
sample_rate: The sample rate of the audio source (in Hz)
sample_width: The sample width of the audio source (in bytes)
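The same three parameters can be supplied by hand to wrap audio that did not come from record() or listen(). The sketch below synthesizes a test tone with the standard library and passes it to the constructor. `sine_frames` and `build_audio_data` are hypothetical helpers, and the positional argument order is assumed to follow the list above.

```python
import math
import struct


def sine_frames(frequency_hz=440.0, seconds=1.0, sample_rate=16000):
    """Generate mono 16-bit little-endian PCM frames for a sine tone."""
    total = int(seconds * sample_rate)
    samples = (
        int(32767 * 0.5 * math.sin(2 * math.pi * frequency_hz * n / sample_rate))
        for n in range(total)
    )
    return b"".join(struct.pack("<h", s) for s in samples)


def build_audio_data(frame_data, sample_rate=16000, sample_width=2):
    """Wrap raw frames in an AudioData instance (requires SpeechRecognition)."""
    import speech_recognition as sr

    return sr.AudioData(frame_data, sample_rate, sample_width)
```

With the library installed, `build_audio_data(sine_frames())` should give an object that can be converted and saved like any captured audio.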

Sources: speech_recognition/__init__.py362-364

Advanced: Energy-Based Audio Processing

The listen() method uses energy thresholds to determine when speech starts and ends. This process involves:

Calculating the energy (RMS) of each audio chunk
Comparing it to the energy threshold
Detecting when energy rises above threshold (speech starts)
Detecting when energy falls below threshold for sufficient time (speech ends)
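The four steps above can be sketched in pure Python. This is a simplified illustration, not the library's actual loop: it assumes 16-bit little-endian mono chunks, a fixed threshold, and a pause measured in whole chunks. `rms_energy`, `find_phrase`, and their default values are hypothetical.

```python
import math
import struct


def rms_energy(chunk):
    """Root-mean-square amplitude of a chunk of 16-bit little-endian PCM."""
    count = len(chunk) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, chunk[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def find_phrase(chunks, energy_threshold=300.0, pause_chunks=3):
    """Return (start, end) chunk indices of the first phrase, or None.

    Speech starts at the first chunk whose energy exceeds the threshold and
    ends once pause_chunks consecutive chunks stay at or below it. The end
    index is exclusive. chunks must be a sequence such as a list.
    """
    start = None
    quiet = 0
    for index, chunk in enumerate(chunks):
        loud = rms_energy(chunk) > energy_threshold
        if start is None:
            if loud:
                start = index  # energy rose above the threshold
        elif loud:
            quiet = 0  # still speaking, reset the pause counter
        else:
            quiet += 1
            if quiet >= pause_chunks:
                return start, index - pause_chunks + 1
    return None if start is None else (start, len(chunks))
```

Requiring several quiet chunks in a row before ending the phrase keeps short gaps between words from splitting one utterance in two.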

This energy-based processing is crucial for automatically determining when someone is speaking. For details on adjusting this behavior, see Energy Threshold Adjustment.

Sources: speech_recognition/__init__.py485-506 speech_recognition/__init__.py535-548


URL: https://deepwiki.com/Uberi/speech_recognition/5.2-audio-manipulation