A Massively Multilingual Speech-to-Speech Translation Corpus



Automatic translation of speech in one language to speech in another language, referred to as speech-to-speech translation (S2ST), is vital for breaking down the communication barriers between people speaking different languages. Conventionally, automatic S2ST systems are built with a cascade of automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech (TTS) synthesis sub-systems, so that the system overall is text-centric. Recently, work on S2ST that does not rely on an intermediate text representation is emerging, such as end-to-end direct S2ST (e.g., Translatotron) and cascade S2ST based on learned discrete representations of speech (e.g., Tjandra et al.). While early versions of such direct S2ST systems obtained lower translation quality compared to cascade S2ST models, they are gaining traction as they have the potential both to reduce translation latency and compounding errors, and to better preserve paralinguistic and non-linguistic information from the original speech, such as voice, emotion, and tone. However, such models usually have to be trained on datasets with paired S2ST data, and the public availability of such corpora is extremely limited.
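
The architectural contrast can be sketched in a few lines. The following is a toy illustration only, with dummy stand-in functions in place of real neural models (none of these names correspond to a real API): a cascade system composes ASR, MT, and TTS, so latency and errors compound across stages and the text bottleneck discards voice, emotion, and tone, while a direct system maps source speech to target speech in a single step.

```python
from typing import List

# Toy stand-ins for the three cascade stages; real systems use neural
# models operating on audio waveforms and text.
def asr(audio: List[float]) -> str:
    """Speech -> source-language text."""
    return "bonjour"

def mt(text: str) -> str:
    """Source-language text -> target-language text."""
    return {"bonjour": "hello"}.get(text, text)

def tts(text: str) -> List[float]:
    """Target-language text -> speech (here a dummy waveform)."""
    return [float(ord(c)) for c in text]

def cascade_s2st(audio: List[float]) -> List[float]:
    # Text-centric pipeline: errors and latency compound across stages,
    # and paralinguistic information is lost at the text bottleneck.
    return tts(mt(asr(audio)))

def direct_s2st(audio: List[float]) -> List[float]:
    # A direct model (e.g., Translatotron) maps source speech straight
    # to target speech; stubbed here with the same dummy output.
    return tts("hello")
```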

To foster research on such a new generation of S2ST, we introduce a Common Voice-based Speech-to-Speech translation corpus, or CVSS, which includes sentence-level speech-to-speech translation pairs from 21 languages into English. Unlike existing public corpora, CVSS can be directly used for training such direct S2ST models without any extra processing. In "CVSS Corpus and Massively Multilingual Speech-to-Speech Translation", we describe the dataset design and development, demonstrate the effectiveness of the corpus through training of baseline direct and cascade S2ST models, and show that the performance of a direct S2ST model approaches that of a cascade S2ST model.

Building CVSS
CVSS is directly derived from the CoVoST 2 speech-to-text (ST) translation corpus, which is in turn derived from the Common Voice speech corpus. Common Voice is a massively multilingual transcribed speech corpus designed for ASR, in which the speech is collected from contributors reading text content from Wikipedia and other text corpora. CoVoST 2 further provides professional text translations of the original transcripts from 21 languages into English and from English into 15 languages. CVSS builds on these efforts by providing sentence-level parallel speech-to-speech translation pairs from 21 languages into English (shown in the table below).
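
The derivation chain can be pictured concretely. The sketch below assumes a CoVoST 2-style TSV of metadata (Common Voice clip name, source transcript, English translation) and CVSS translation clips named after the same clip ids; the schema and file names here are illustrative assumptions, not the corpora's actual layout.

```python
import csv
import io

# Hypothetical CoVoST 2-style TSV metadata: Common Voice clip name,
# source transcript, and English translation.
COVOST_TSV = (
    "path\tsentence\ttranslation\n"
    "common_voice_fr_001.mp3\tLe genre musical de la chanson est "
    "entièrement le disco.\tThe musical genre of the song is 100% Disco.\n"
)

def s2st_pairs(tsv_text: str):
    """Pair each Common Voice source clip with an assumed CVSS
    translation clip named after the same clip id."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        clip_id = row["path"].rsplit(".", 1)[0]
        # (source audio, target audio, target text) — one S2ST example.
        yield row["path"], clip_id + ".wav", row["translation"]

pairs = list(s2st_pairs(COVOST_TSV))
```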

To facilitate research with different focuses, two versions of translation speech in English are provided in CVSS; both are synthesized using state-of-the-art TTS systems, and each version provides unique value that doesn't exist in other public S2ST corpora:

  • CVSS-C: All the translation speech is in a single canonical speaker's voice. Despite being synthetic, the speech is highly natural, clean, and consistent in speaking style. These properties ease the modeling of the target speech and enable trained models to produce high-quality translation speech suitable for general user-facing applications where speech quality is of higher importance than accurately reproducing the speakers' voices.
  • CVSS-T: The translation speech captures the voice from the corresponding source speech. Each S2ST pair has similar voices on the two sides, despite being in different languages. Because of this, the dataset is suitable for building models where accurate voice preservation is desired, such as for movie dubbing.

Together with the source speech, the two S2ST datasets contain 1,872 and 1,937 hours of speech, respectively.

Language      Code    Source speech (X), hours
French        fr      309.3
German        de      226.5
Catalan       ca      174.8
Spanish       es      157.6
Italian       it      73.9
Persian       fa      58.8
Russian       ru      38.7
Chinese       zh      26.5
Portuguese    pt      20.0
Dutch         nl      11.2
Estonian      et      9.0
Mongolian     mn      8.4
Turkish       tr      7.9
Arabic        ar      5.8
Latvian       lv      4.9
Swedish       sv      4.3
Welsh         cy      3.6
Tamil         ta      3.1
Indonesian    id      3.0
Japanese      ja      3.0
Slovenian     sl      2.9
Total                 1,153.2

Amount of source speech for each X-En pair in CVSS (hours).

In addition to translation speech, CVSS also provides normalized translation text matching the pronunciation in the translation speech (on numbers, currencies, acronyms, etc.; see the data samples below, where, for example, "100%" is normalized as "one hundred percent" and "King George II" is normalized as "king george the second"), which can benefit both model training and standardizing the evaluation.
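
The flavor of such normalization can be shown with a minimal rule-based sketch. This is not the normalizer used to build CVSS (a production TTS front-end covers far more cases, including currencies and acronyms); the lookup tables and rules below are illustrative assumptions covering only the two examples mentioned above.

```python
import re

# Tiny illustrative lookup tables; a real TTS text-normalization
# front-end covers far more numbers, symbols, and abbreviations.
_NUMBERS = {"100": "one hundred", "2": "two", "3": "three"}
_ORDINALS = {"ii": "the second", "iii": "the third", "iv": "the fourth"}

def normalize(text: str) -> str:
    """Rewrite digits, symbols, and regnal numerals as spoken words,
    then lowercase and strip punctuation, mimicking the style of the
    normalized translation text in CVSS."""
    # Expand percentages such as "100%" -> "one hundred percent".
    text = re.sub(
        r"(\d+)%",
        lambda m: _NUMBERS.get(m.group(1), m.group(1)) + " percent",
        text,
    )
    # Expand remaining bare numbers.
    text = re.sub(r"\b(\d+)\b", lambda m: _NUMBERS.get(m.group(1), m.group(1)), text)
    text = text.lower()
    # Regnal numerals: "george ii" -> "george the second".
    text = re.sub(r"\b(ii|iii|iv)\b", lambda m: _ORDINALS[m.group(1)], text)
    # Drop punctuation and collapse whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())

print(normalize("The musical genre of the song is 100% Disco."))
# -> the musical genre of the song is one hundred percent disco
print(normalize("King George II"))
# -> king george the second
```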

CVSS is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license and can be freely downloaded online.

Data Samples

Example 1:
Source audio (French): [audio]
Source transcript (French): Le genre musical de la chanson est entièrement le disco.
CVSS-C translation audio (English): [audio]
CVSS-T translation audio (English): [audio]
Translation text (English): The musical genre of the song is 100% Disco.
Normalized translation text (English): the musical genre of the song is one hundred percent disco
Example 2:
Source audio (Chinese): [audio]
Source transcript (Chinese): 弗雷德里克王子,英国王室成员,为乔治二世之孙,乔治三世之幼弟。
CVSS-C translation audio (English): [audio]
CVSS-T translation audio (English): [audio]
Translation text (English): Prince Frederick, member of British Royal Family, grandson of King George II, brother of King George III.
Normalized translation text (English): prince frederick member of british royal family grandson of king george the second brother of king george the third

Baseline Models
On each version of CVSS, we trained a baseline cascade S2ST model as well as two baseline direct S2ST models and compared their performance. These baselines can be used for comparison in future research.

Cascade S2ST: To build strong cascade S2ST baselines, we trained an ST model on CoVoST 2, which outperforms the previous state of the art by +5.8 average BLEU on all 21 language pairs (detailed in the paper) when trained on the corpus without using extra data. This ST model is connected to the same TTS models used for constructing CVSS to compose very strong cascade S2ST baselines (ST → TTS).

Direct S2ST: We built two baseline direct S2ST models using Translatotron and Translatotron 2. When trained from scratch on CVSS, the translation quality of Translatotron 2 (8.7 BLEU) approaches that of the strong cascade S2ST baseline (10.6 BLEU). Moreover, when both use pre-training, the gap decreases to only 0.7 BLEU on ASR-transcribed translation. These results verify the effectiveness of using CVSS to train direct S2ST models.
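
Note that these BLEU scores are not computed on audio directly: the translated speech is first transcribed by an ASR system, the transcript is normalized, and BLEU is then computed against the normalized reference text. The schematic below shows that protocol with a toy ASR stand-in and a unigram-precision placeholder where a real evaluation would call a full BLEU implementation such as SacreBLEU; every function here is an illustrative assumption, not the paper's actual evaluation code.

```python
from collections import Counter
from typing import List

def toy_asr(audio: List[float]) -> str:
    # Stand-in: a real evaluation transcribes the model's translated
    # speech with a trained ASR system.
    return "the musical genre of the song is one hundred percent disco"

def normalize(text: str) -> str:
    # Match the casing/whitespace conventions of the reference text.
    return " ".join(text.lower().split())

def unigram_precision(hyp: str, ref: str) -> float:
    """Placeholder metric; a real setup uses corpus-level BLEU."""
    hyp_counts, ref_counts = Counter(hyp.split()), Counter(ref.split())
    overlap = sum(min(c, ref_counts[w]) for w, c in hyp_counts.items())
    return overlap / max(1, sum(hyp_counts.values()))

def asr_score(translated_audio: List[float], reference: str) -> float:
    # ASR-transcribe -> normalize -> score against the reference.
    return unigram_precision(normalize(toy_asr(translated_audio)),
                             normalize(reference))

score = asr_score([0.0], "the musical genre of the song is one hundred percent disco")
# A perfect transcript match yields 1.0 under this toy metric.
```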

Translation quality of baseline direct and cascade S2ST models built on CVSS-C, measured by BLEU on ASR transcriptions of the translated speech. The pre-training was done on CoVoST 2 without other extra data sets.

Conclusion
We have released two versions of multilingual-to-English S2ST datasets, CVSS-C and CVSS-T, each with about 1.9K hours of sentence-level parallel S2ST pairs, covering 21 source languages. The translation speech in CVSS-C is in a single canonical speaker's voice, while that in CVSS-T is in voices transferred from the source speech. Each of these datasets provides unique value not existing in other public S2ST corpora.

We built baseline multilingual direct S2ST models and cascade S2ST models on both datasets, which can be used for comparison in future works. To build strong cascade S2ST baselines, we trained an ST model on CoVoST 2, which outperforms the previous state of the art by +5.8 average BLEU when trained on the corpus without extra data. Meanwhile, the performance of the direct S2ST models approaches the strong cascade baselines when trained from scratch, with only a 0.7 BLEU difference on ASR-transcribed translation when pre-training is applied. We hope this work helps accelerate research on direct S2ST.

Acknowledgments
We acknowledge the volunteer contributors and the organizers of the Common Voice and LibriVox projects for their contributions and collection of recordings, and the creators of the Common Voice, CoVoST, CoVoST 2, Librispeech and LibriTTS corpora for their previous work. The direct contributors to the CVSS corpus and the paper include Ye Jia, Michelle Tadmor Ramanovich, Quan Wang, and Heiga Zen. We also thank Ankur Bapna, Yiling Huang, Jason Pelecanos, Colin Cherry, Alexis Conneau, Yonghui Wu, Hadar Shemtov, and Françoise Beaufays for helpful discussions and support.