ArTST

SpeechT5 for Arabic (TTS task)

This model starts from the pretrained ArTST weights and is fine-tuned, using the Hugging Face implementation of SpeechT5, on the Classical Arabic ClArTTS dataset for speech synthesis (text-to-speech).

ArTST was first released in this repository, along with the pretrained weights.

Uses

🤗 Transformers Usage

You can run ArTST TTS locally with the 🤗 Transformers library.

First install the 🤗 Transformers library, sentencepiece, soundfile and datasets (optional):

pip install --upgrade pip
pip install --upgrade transformers sentencepiece soundfile datasets[audio]

Run inference via the Text-to-Speech (TTS) pipeline. You can access the Arabic SpeechT5 model via the TTS pipeline in just a few lines of code!

from transformers import pipeline
from datasets import load_dataset
import soundfile as sf
import torch  # needed for torch.tensor below

synthesiser = pipeline("text-to-speech", "MBZUAI/speecht5_tts_clartts_ar")

embeddings_dataset = load_dataset("herwoww/arabic_xvector_embeddings", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[105]["speaker_embeddings"]).unsqueeze(0)
# You can replace this embedding with your own as well.

speech = synthesiser("لأنه لا يرى أنه على السفه ثم من بعد ذلك حديث منتشر", forward_params={"speaker_embeddings": speaker_embedding})
# ArTST is trained without diacritics.

sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])
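
Because ArTST is trained without diacritics, input text that carries tashkeel may need to be cleaned before synthesis. The helper below is a minimal sketch of one way to do that, not part of the ArTST or Transformers API: `strip_diacritics` is a hypothetical name, and the character set (U+064B to U+0652 plus the superscript alef U+0670) is an assumption about which marks to remove.

```python
import re

# Assumption: the marks to drop are the common Arabic tashkeel
# (fathatan U+064B through sukun U+0652) and superscript alef (U+0670).
_ARABIC_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return `text` with Arabic diacritic marks removed (hypothetical helper)."""
    return _ARABIC_DIACRITICS.sub("", text)


if __name__ == "__main__":
    # A fully vowelled word is reduced to its bare letters.
    print(strip_diacritics("\u0643\u064E\u062A\u064E\u0628\u064E"))
```

The cleaned string can then be passed to the pipeline in place of the raw text. Text that is already undiacritized passes through unchanged.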

Run inference via the Transformers modelling code. You can use the processor + generate code to convert text into a mono 16 kHz speech waveform for more fine-grained control.

from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from datasets import load_dataset
import torch
import soundfile as sf

processor = SpeechT5Processor.from_pretrained("MBZUAI/speecht5_tts_clartts_ar")
model = SpeechT5ForTextToSpeech.from_pretrained("MBZUAI/speecht5_tts_clartts_ar")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="لأنه لا يرى أنه على السفه ثم من بعد ذلك حديث منتشر", return_tensors="pt")

# load xvector containing speaker's voice characteristics from a dataset
embeddings_dataset = load_dataset("herwoww/arabic_xvector_embeddings", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[105]["speaker_embeddings"]).unsqueeze(0)

speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)

sf.write("speech.wav", speech.numpy(), samplerate=16000)
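
To confirm that the saved file really is mono 16 kHz audio, its header can be inspected with Python's standard-library `wave` module. This is a rough sketch under the assumption that the file was written as integer PCM, which `wave` can read; `describe_wav` is a hypothetical helper name, not part of any library used above.

```python
import os
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        n_frames = wav_file.getnframes()
    # Duration is the number of frames divided by frames per second.
    return channels, sample_rate, n_frames / sample_rate


if __name__ == "__main__":
    # Only inspect the file if the synthesis step has already produced it.
    if os.path.exists("speech.wav"):
        print(describe_wav("speech.wav"))
```

For the output of the snippet above, one would expect a single channel and a sample rate of 16000; the duration depends on the input text.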

Citation

BibTeX:

@inproceedings{toyin-etal-2023-artst,
 title = "{A}r{TST}: {A}rabic Text and Speech Transformer",
 author = "Toyin, Hawau and
 Djanibekov, Amirbek and
 Kulkarni, Ajinkya and
 Aldarmaki, Hanan",
 editor = "Sawaf, Hassan and
 El-Beltagy, Samhaa and
 Zaghouani, Wajdi and
 Magdy, Walid and
 Abdelali, Ahmed and
 Tomeh, Nadi and
 Abu Farha, Ibrahim and
 Habash, Nizar and
 Khalifa, Salam and
 Keleg, Amr and
 Haddad, Hatem and
 Zitouni, Imed and
 Mrini, Khalil and
 Almatham, Rawan",
 booktitle = "Proceedings of ArabicNLP 2023",
 month = dec,
 year = "2023",
 address = "Singapore (Hybrid)",
 publisher = "Association for Computational Linguistics",
 url = "https://aclanthology.org/2023.arabicnlp-1.5",
 pages = "41--51"
}
@inproceedings{ao-etal-2022-speecht5,
 title = {{S}peech{T}5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing},
 author = {Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and Wei, Zhihua and Qian, Yao and Li, Jinyu and Wei, Furu},
 booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
 month = {May},
 year = {2022},
 pages={5723--5738},
}

URL: https://huggingface.co/MBZUAI/speecht5_tts_clartts_ar