VOOZH about

URL: https://www.analyticsvidhya.com/blog/2022/09/can-voice-conversion-improve-asr-in-low-resource-settings/

⇱ Can Voice Conversion Improve ASR in Low-Resource Settings?


India's Most Futuristic AI Conference Is Back – Bigger, Sharper, Bolder

  • d
  • :
  • h
  • :
  • m
  • :
  • s

Reading list

Can Voice Conversion Improve ASR in Low-Resource Settings?

Drishti Last Updated : 12 Oct, 2022
5 min read

This article was published as a part of the Data Science Blogathon.

πŸ‘ Image
Source: Canva

Introduction

Voice conversion (VC) can be used to improve the performance of speech recognition systems in low-resource languages by augmenting limited training data. However, VC has not been extensively used for this task considering the practical issues like compute speed and limitations when converting to and from unseen speakers. Considering this, it becomes necessary to design a voice conversion model that can convert to and from unseen speakers (speech) during training and be computationally efficient and fast. 

In light of the above, Matthew Baas and Herman Kamper have examined whether the VC system trained on a high-resource language can be used to produce additional training data for unseen low-resource language, which we will be exploring in this article.

Highlights

  • It was assessed whether a VC system could be used cross-lingually to improve low-resource speech recognition.

  • Several recent techniques were combined to design and train a practical VC system in English. Then this system was used to augment data for training speech recognition models in low-resource languages.

  • When a reasonable amount of VC augmented data is used, the performance of speech recognition is improved in all four low-resource languages.

  • VC-based augmentation is superior to SpecAugment (a widely used signal processing augmentation method) in the low-resource languages taken under consideration.

What is Voice Conversion?

Voice Conversion (VC) is a technique in which the utterance spoken by a speaker is processed in such a manner that it sounds as if a different target speaker spoke it while the content remains unchanged. Some recent solutions even provide fine-grained ways to control different aspects of the generated speech, such as naturalness, pitch, speed, etc. Furthermore, they can be used cross-lingually too.

Voice Conversion Method Overview

Since the objective is to use Voice Conversion (VC) for low-resource data augmentation, it becomes necessary to design a voice conversion model that can convert to and from unseen speakers (speech) during training in addition to being computationally efficient and fast.

Given the preceding, the method illustrated in Figures 1a and 1b has been proposed. In this, an input waveform is first transformed into a series of feature vectors, and the linguistic content (the words being spoken) is then separated from the speaker information. This is done with the aid of specially designed style and content encoders, which are indicated by dashed boxes. While the style encoder generates a single style vector (s), the content encoder generates a sequence of content embedding vectors.

Figure 1: Voice Conversion Approach. (a) During training, the same utterance is passed through a CPC encoder into the content and style encoders; the decoder attempts to reconstruct the spectrogram of the input utterance. (b) At inference, separate input and reference utterances are fed into the style and content encoders, respectively, before using HiFi-GAN to vocoding the spectrogram.

During training (Figure 1a), the style vector (s = ssrc) and content embeddings are derived from the same input utterance. The model is trained to reconstruct the input by summing ssrc with each content embedding, feeding the resulting vectors into a decoder module.

At testing (Figure 1b), the content and speaker encoders receive speech from different utterances. This indicates that the source or reference utterance (or both) may originate from languages and speakers not encountered during training. If the content and speaker information are properly separated, the source utterance will be generated in the speaker’s voice, supplying the reference utterance.

Results

1. First, it was noted that using more augmented data tends to boost performance, with the lowest WER attained with 500% additional data generated by chaining VC and SpecAugment in conjunction (See Table1). Second, VC augmentation offers comparable performance gains to SpecAugment when applied independently. Third, using both VC and SpecAugment in conjunction enhances performance more than either alone. This indicates that these two approaches are complementary in English.

πŸ‘ Voice Conversion

Table 1: WERs (%) on LibriSpeech test data for ASR models trained with increasing amounts of VC- and SpecAug-augmented data, with and without 4-gram LM decoding

2. Very low-resource settings: When an adequate amount of VC augmented data is used, speech recognition performance is improved in all four low-resource languages studied. Afrikaans, Setswana, isiXhosa and Sepedi. Table 2 indicates that when 10 minutes of additional data is generated using the English-trained VC system, ASR performance improves for all four unknown languages. For isiXhosa, adding the augmented data results in a 7% absolute gain in WER. Chaining VC in conjunction with SpecAugment degrades the performance. However, W/CER worsened in all cases compared to training with just the original data. SpecAugment was also used on Afrikaans in isolation, resulting in poor performance compared to no augmentation.

πŸ‘ Voice Conversion

Table 2: ASR results (%) on test data of 4 low-resource languages when trained on 10 minutes of real audio data plus different amounts of additional VC- and combined VC-SpecAug augmented data.

For the low-resource languages evaluated, VC-based augmentation outperforms SpecAugment. Furthermore, a VC system trained on a single well-resourced language (English) can be used cross-lingually to generate additional training data for unseen low-resource languages. When labelled resources are scarce (approximately 10 minutes), ASR performance improves in all four low-resource languages studied.

Limitation

The interaction between the VC and the conventional augmentation was not thoroughly investigated, nor was it assessed how various design decisions within the VC model itself affect the performance of the downstream ASR task.

Conclusion

To sum it up, in this article, we learned the following:

1. It was examined whether a VC system can be leveraged cross-lingually to improve low-resource speech recognition.

2. Several recent techniques were combined to design and train a practical VC system in English. Then this system was used to augment data for training speech recognition models in low-resource languages.

3. For the low-resource languages evaluated, VC-based augmentation outperformed SpecAugment. Furthermore, it was found that a VC system trained on a single high-resourced language (English) can be used cross-lingually to produce additional training data for unseen low-resource languages.

4. Using both VC and SpecAugment in conjunction improves performance more than alone; this indicates that these two approaches are complementary in English.

That concludes this article. Thanks for reading. If you have any questions or concerns, please post them in the comments section below. Happy learning!

Read here| https://arxiv.org/pdf/2111.02674.pdf

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

I'm a Researcher who works primarily on various Acoustic DL, NLP, and RL tasks. Here, my writing predominantly revolves around topics related to Acoustic DL, NLP, and RL, as well as new emerging technologies. In addition to all of this, I also contribute to open-source projects @Hugging Face.
For work-related queries please contact: [email protected]

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

Responses From Readers

Flagship Programs

GenAI Pinnacle Program| GenAI Pinnacle Plus Program| AI/ML BlackBelt Program| Agentic AI Pioneer Program

Free Courses

Generative AI| DeepSeek| OpenAI Agent SDK| LLM Applications using Prompt Engineering| DeepSeek from Scratch| Stability.AI| SSM & MAMBA| RAG Systems using LlamaIndex| Building LLMs for Code| Python| Microsoft Excel| Machine Learning| Deep Learning| Mastering Multimodal RAG| Introduction to Transformer Model| Bagging & Boosting| Loan Prediction| Time Series Forecasting| Tableau| Business Analytics| Vibe Coding in Windsurf| Model Deployment using FastAPI| Building Data Analyst AI Agent| Getting started with OpenAI o3-mini| Introduction to Transformers and Attention Mechanisms

Popular Categories

AI Agents| Generative AI| Prompt Engineering| Generative AI Application| News| Technical Guides| AI Tools| Interview Preparation| Research Papers| Success Stories| Quiz| Use Cases| Listicles

Generative AI Tools and Techniques

GANs| VAEs| Transformers| StyleGAN| Pix2Pix| Autoencoders| GPT| BERT| Word2Vec| LSTM| Attention Mechanisms| Diffusion Models| LLMs| SLMs| Encoder Decoder Models| Prompt Engineering| LangChain| LlamaIndex| RAG| Fine-tuning| LangChain AI Agent| Multimodal Models| RNNs| DCGAN| ProGAN| Text-to-Image Models| DDPM| Document Question Answering| Imagen| T5 (Text-to-Text Transfer Transformer)| Seq2seq Models| WaveNet| Attention Is All You Need (Transformer Architecture) | WindSurf| Cursor

Popular GenAI Models

Llama 4| Llama 3.1| GPT 4.5| GPT 4.1| GPT 4o| o3-mini| Sora| DeepSeek R1| DeepSeek V3| Janus Pro| Veo 2| Gemini 2.5 Pro| Gemini 2.0| Gemma 3| Claude Sonnet 3.7| Claude 3.5 Sonnet| Phi 4| Phi 3.5| Mistral Small 3.1| Mistral NeMo| Mistral-7b| Bedrock| Vertex AI| Qwen QwQ 32B| Qwen 2| Qwen 2.5 VL| Qwen Chat| Grok 3

AI Development Frameworks

n8n| LangChain| Agent SDK| A2A by Google| SmolAgents| LangGraph| CrewAI| Agno| LangFlow| AutoGen| LlamaIndex| Swarm| AutoGPT

Data Science Tools and Techniques

Python| R| SQL| Jupyter Notebooks| TensorFlow| Scikit-learn| PyTorch| Tableau| Apache Spark| Matplotlib| Seaborn| Pandas| Hadoop| Docker| Git| Keras| Apache Kafka| AWS| NLP| Random Forest| Computer Vision| Data Visualization| Data Exploration| Big Data| Common Machine Learning Algorithms| Machine Learning| Google Data Science Agent
πŸ‘ Av Logo White

Continue your learning for FREE

Forgot your password?
πŸ‘ Av Logo White

Enter OTP sent to

Edit

Wrong OTP.

Enter the OTP

Resend OTP

Resend OTP in 45s

πŸ‘ Popup Banner
πŸ‘ AI Popup Banner