Voice conversion (VC) can improve the performance of speech recognition systems in low-resource languages by augmenting limited training data. However, VC has not been widely used for this task because of practical issues such as compute speed and limitations when converting to and from unseen speakers. A VC model for this purpose therefore needs to handle source and target speakers that were not seen during training, while remaining computationally efficient and fast.

In light of the above, Matthew Baas and Herman Kamper examined whether a VC system trained on a high-resource language can be used to produce additional training data for unseen low-resource languages. This article explores their work.

Highlights

It was assessed whether a VC system could be used cross-lingually to improve low-resource speech recognition.
Several recent techniques were combined to design and train a practical VC system in English. Then this system was used to augment data for training speech recognition models in low-resource languages.
When a reasonable amount of VC-augmented data is used, speech recognition performance improves in all four low-resource languages studied.
VC-based augmentation is superior to SpecAugment (a widely used signal processing augmentation method) in the low-resource languages taken under consideration.

What is Voice Conversion?

Voice Conversion (VC) is a technique in which the utterance spoken by a speaker is processed in such a manner that it sounds as if a different target speaker spoke it while the content remains unchanged. Some recent solutions even provide fine-grained ways to control different aspects of the generated speech, such as naturalness, pitch, speed, etc. Furthermore, they can be used cross-lingually too.

Voice Conversion Method Overview

Since the objective is to use Voice Conversion (VC) for low-resource data augmentation, the VC model needs to convert to and from speakers that were not seen during training, in addition to being computationally efficient and fast.

Given the preceding, the method illustrated in Figures 1a and 1b has been proposed. In this, an input waveform is first transformed into a series of feature vectors, and the linguistic content (the words being spoken) is then separated from the speaker information. This is done with the aid of specially designed style and content encoders, which are indicated by dashed boxes. While the style encoder generates a single style vector (s), the content encoder generates a sequence of content embedding vectors.

Figure 1: Voice Conversion Approach. (a) During training, the same utterance is passed through a CPC encoder into the content and style encoders; the decoder attempts to reconstruct the spectrogram of the input utterance. (b) At inference, separate input and reference utterances are fed into the content and style encoders, respectively, before HiFi-GAN is used to vocode the spectrogram.

During training (Figure 1a), the style vector (s = s_src) and the content embeddings are derived from the same input utterance. The model is trained to reconstruct the input by summing s_src with each content embedding and feeding the resulting vectors into a decoder module.
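
The summation step described above can be sketched in a few lines. This is only an illustration with plain Python lists and made-up toy values; the function name `add_style` and the vector sizes are assumptions, and a real system would operate on learned embeddings held in tensors.

```python
from typing import List

Vector = List[float]


def add_style(content: List[Vector], style: Vector) -> List[Vector]:
    """Add one utterance-level style vector to every content embedding."""
    for frame in content:
        if len(frame) != len(style):
            raise ValueError("content and style dimensions must match")
    return [[c + s for c, s in zip(frame, style)] for frame in content]


# Toy values (hypothetical): three content frames of dimension 2.
content_embeddings = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
s_src = [10.0, 20.0]  # a single style vector for the whole utterance

# These summed vectors are what would be passed to the decoder.
decoder_input = add_style(content_embeddings, s_src)
print(decoder_input)  # [[10.0, 21.0], [12.0, 23.0], [14.0, 25.0]]
```

The same style vector is reused for every frame, so the sequence length is set by the content embeddings alone.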

At test time (Figure 1b), the content and style (speaker) encoders receive speech from different utterances. This means that the source utterance, the reference utterance, or both may come from languages and speakers not encountered during training. If the content and speaker information are properly separated, the output carries the words of the source utterance in the voice of the speaker who supplied the reference utterance.
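
To make the swap at inference concrete, here is a deliberately simplified toy. It is not the paper's model: the two "encoders" below are hand-written stand-ins (style is taken to be the per-dimension mean of the frames, content is whatever remains after removing that mean), chosen only so that the separation is exact and easy to check by hand. All names are hypothetical.

```python
from typing import List

Vector = List[float]


def style_encoder(frames: List[Vector]) -> Vector:
    """Toy style: the per-dimension mean over all frames."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def content_encoder(frames: List[Vector]) -> List[Vector]:
    """Toy content: each frame with the utterance mean removed."""
    mean = style_encoder(frames)
    return [[x - m for x, m in zip(frame, mean)] for frame in frames]


def convert(source: List[Vector], reference: List[Vector]) -> List[Vector]:
    """Combine content from the source with style from the reference."""
    content = content_encoder(source)
    s_ref = style_encoder(reference)
    return [[c + s for c, s in zip(frame, s_ref)] for frame in content]


source = [[1.0, 2.0], [3.0, 6.0]]                       # mean is [2, 4]
reference = [[10.0, 10.0], [30.0, 50.0], [20.0, 30.0]]  # mean is [20, 30]

converted = convert(source, reference)
print(converted)  # [[19.0, 28.0], [21.0, 32.0]]
```

In this toy, the converted frames keep the source's frame-to-frame variation but take on the reference's mean, which mirrors the intent of keeping the words and replacing the voice.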

Results

1. First, using more augmented data tends to improve performance, with the lowest WER attained with 500% additional data generated by chaining VC and SpecAugment (see Table 1). Second, VC augmentation offers performance gains comparable to SpecAugment when each is applied independently. Third, using VC and SpecAugment together improves performance more than either alone, which indicates that the two approaches are complementary in English.
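
For readers unfamiliar with SpecAugment, the core idea is to hide random bands of a spectrogram along the frequency and time axes. The sketch below shows only that masking idea on a list-of-lists spectrogram; it omits time warping, and the function name, parameter names, and mask widths are assumptions rather than the settings used in the study.

```python
import random
from typing import List

Spectrogram = List[List[float]]  # indexed as [frequency_bin][time_frame]


def mask_spectrogram(spec: Spectrogram, max_freq_width: int,
                     max_time_width: int, rng: random.Random,
                     fill: float = 0.0) -> Spectrogram:
    """Return a copy with one random frequency band and one time band masked."""
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # leave the input untouched

    # Frequency mask: blank out a band of consecutive frequency bins.
    f_width = rng.randint(0, min(max_freq_width, n_freq))
    f_start = rng.randint(0, n_freq - f_width)
    for f in range(f_start, f_start + f_width):
        out[f] = [fill] * n_time

    # Time mask: blank out a band of consecutive time frames.
    t_width = rng.randint(0, min(max_time_width, n_time))
    t_start = rng.randint(0, n_time - t_width)
    for row in out:
        for t in range(t_start, t_start + t_width):
            row[t] = fill
    return out


rng = random.Random(0)
spec = [[1.0] * 20 for _ in range(8)]  # toy spectrogram: 8 bins, 20 frames
masked = mask_spectrogram(spec, max_freq_width=3, max_time_width=5, rng=rng)
print(sum(value == 0.0 for row in masked for value in row), "cells masked")
```

Unlike VC, this kind of augmentation changes what the model sees without changing who appears to be speaking, which is one plausible reason the two can complement each other.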

Table 1: WERs (%) on LibriSpeech test data for ASR models trained with increasing amounts of VC- and SpecAug-augmented data, with and without 4-gram LM decoding
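
The tables report word error rate (WER). As a reminder of how that metric is usually computed, here is a minimal sketch: the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the recognizer output, divided by the number of reference words. The function name and the example sentences are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{100 * wer:.1f}%")  # 33.3%
```

Character error rate (CER) follows the same recipe with characters in place of words.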

2. Very low-resource settings: When an adequate amount of VC-augmented data is used, speech recognition performance improves in all four low-resource languages studied: Afrikaans, Setswana, isiXhosa and Sepedi. Table 2 indicates that when 10 minutes of additional data is generated using the English-trained VC system, ASR performance improves for all four unseen languages. For isiXhosa, adding the augmented data yields a 7% absolute improvement in WER. Chaining VC with SpecAugment, however, degraded performance: W/CER worsened in all cases compared to training with just the original data. SpecAugment was also applied in isolation on Afrikaans, where it performed worse than no augmentation.

Table 2: ASR results (%) on test data of 4 low-resource languages when trained on 10 minutes of real audio data plus different amounts of additional VC- and combined VC-SpecAug augmented data.
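
Because the improvement above is quoted as an absolute change in WER, it helps to keep absolute and relative reductions apart. The snippet below uses made-up numbers that are not taken from the paper; only the arithmetic is the point.

```python
def absolute_reduction(wer_before: float, wer_after: float) -> float:
    """Difference in percentage points."""
    return wer_before - wer_after


def relative_reduction(wer_before: float, wer_after: float) -> float:
    """Fraction of the original error that was removed."""
    if wer_before <= 0:
        raise ValueError("wer_before must be positive")
    return (wer_before - wer_after) / wer_before


# Illustrative values only (NOT results from the paper).
baseline_wer = 50.0   # WER in % without augmentation
augmented_wer = 43.0  # WER in % with augmentation

print(absolute_reduction(baseline_wer, augmented_wer))  # 7.0 points
print(relative_reduction(baseline_wer, augmented_wer))  # 0.14, i.e. 14%
```

The same 7-point absolute change corresponds to a larger relative reduction when the baseline WER is lower.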

For the low-resource languages evaluated, VC-based augmentation outperforms SpecAugment. Furthermore, a VC system trained on a single well-resourced language (English) can be used cross-lingually to generate additional training data for unseen low-resource languages. When labelled resources are scarce (approximately 10 minutes), ASR performance improves in all four low-resource languages studied.
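
The augmentation workflow summarized above can be sketched as a small loop: each labelled utterance is converted to several randomly chosen reference voices, and every converted copy keeps the original transcript. The `convert` argument is a hypothetical stand-in for a trained VC system, and the function and parameter names are assumptions made for this illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple

Audio = List[float]
Example = Tuple[Audio, str]  # (waveform samples, transcript)


def augment_with_vc(dataset: Sequence[Example],
                    references: Sequence[Audio],
                    convert: Callable[[Audio, Audio], Audio],
                    copies_per_utterance: int,
                    rng: random.Random) -> List[Example]:
    """Return the original examples plus VC-converted copies of each one."""
    augmented = list(dataset)
    for audio, transcript in dataset:
        for _ in range(copies_per_utterance):
            reference = rng.choice(references)  # random target voice
            # The words do not change, so the transcript is reused as-is.
            augmented.append((convert(audio, reference), transcript))
    return augmented


def placeholder_convert(audio: Audio, reference: Audio) -> Audio:
    """Hypothetical stand-in: a real VC model would change the voice."""
    return list(audio)


real_data = [([0.1, 0.2], "molo"), ([0.3, 0.4], "dumela")]
reference_pool = [[0.0, 0.5], [0.9, 0.1], [0.2, 0.2]]

train_set = augment_with_vc(real_data, reference_pool, placeholder_convert,
                            copies_per_utterance=5, rng=random.Random(0))
print(len(train_set))  # 12: 2 real utterances + 10 converted copies
```

If each converted copy has the same duration as its source, five copies per utterance corresponds to 500% additional data.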

Limitation

The interaction between VC and conventional augmentation was not thoroughly investigated, nor was it assessed how various design decisions within the VC model itself affect performance on the downstream ASR task.

Conclusion

To sum it up, in this article, we learned the following:

1. It was examined whether a VC system can be leveraged cross-lingually to improve low-resource speech recognition.

2. Several recent techniques were combined to design and train a practical VC system in English. Then this system was used to augment data for training speech recognition models in low-resource languages.

3. For the low-resource languages evaluated, VC-based augmentation outperformed SpecAugment. Furthermore, it was found that a VC system trained on a single high-resourced language (English) can be used cross-lingually to produce additional training data for unseen low-resource languages.

4. Using VC and SpecAugment in conjunction improves performance more than either alone; this indicates that the two approaches are complementary in English.

That concludes this article. Thanks for reading. If you have any questions or concerns, please post them in the comments section below. Happy learning!

Read the paper here: https://arxiv.org/pdf/2111.02674.pdf

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Drishti

I'm a Researcher who works primarily on various Acoustic DL, NLP, and RL tasks. Here, my writing predominantly revolves around topics related to Acoustic DL, NLP, and RL, as well as new emerging technologies. In addition to all of this, I also contribute to open-source projects @Hugging Face.

URL: https://www.analyticsvidhya.com/blog/2022/09/can-voice-conversion-improve-asr-in-low-resource-settings/