Google AI Blog: VDTTS: Visually-Driven Text-To-Speech



Recent years have seen a tremendous increase in the creation and serving of video content to users across the world, in a variety of languages and over numerous platforms. The process of creating high-quality content can include several stages, from video capturing and captioning to video and audio editing. In some cases dialogue is re-recorded in a studio (referred to as dialog replacement, post-sync or dubbing) in order to achieve high quality and replace original audio that might have been recorded in noisy conditions. However, the dialog replacement process can be difficult and tedious because the newly recorded audio needs to be well synced with the video, requiring several edits to match the exact timing of mouth movements.

In “More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech”, we present a proof-of-concept visually-driven text-to-speech model, called VDTTS, that automates the dialog replacement process. Given a text and the original video frames of the speaker, VDTTS is trained to generate the corresponding speech. As opposed to standard visual speech recognition models, which focus on the mouth region, we detect and crop full faces using MediaPipe to avoid potentially excluding information pertinent to the speaker’s delivery. This gives the VDTTS model enough information to generate speech that matches the video while also recovering aspects of prosody, such as timing and emotion. Despite not being explicitly trained to generate speech that is synchronized to the input video, the learned model still does so.

Given a text and video frames of a speaker, VDTTS generates speech with prosody that matches the video signal.

VDTTS Model

The VDTTS model resembles Tacotron at its core and has four main components: (1) text and video encoders that process the inputs; (2) a multi-source attention mechanism that connects the encoders to a decoder; (3) a spectrogram decoder that incorporates the speaker embedding (similarly to VoiceFilter) and produces mel-spectrograms (a form of compressed representation in the frequency domain); and (4) a frozen, pretrained neural vocoder that produces waveforms from the mel-spectrograms.

The overall architecture of VDTTS. Text and video encoders process the inputs, and then a multi-source attention mechanism connects these to a decoder that produces mel-spectrograms. A vocoder then produces waveforms from the mel-spectrograms to generate speech as an output.
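To make the multi-source attention step more concrete, here is a minimal NumPy sketch of a decoder state attending to both encoders at once. It is an illustration under assumptions, not the VDTTS implementation: the post does not say how the two sources are combined, so the sketch assumes one attention per source with the resulting context vectors summed, and it swaps in plain scaled dot-product attention for brevity. The function names, dimensions, and random inputs are all hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()


def attend(query, memory):
    """Scaled dot-product attention of one decoder query over one encoder's outputs.

    query:  (d,) decoder state
    memory: (T, d) encoder outputs, one row per token or frame
    """
    scores = memory @ query / np.sqrt(query.shape[0])
    weights = softmax(scores)      # (T,), sums to 1
    context = weights @ memory     # (d,), weighted average of encoder outputs
    return context, weights


def multi_source_context(query, text_memory, video_memory):
    """Attend to each source separately, then sum the contexts (an assumption)."""
    text_ctx, text_w = attend(query, text_memory)
    video_ctx, video_w = attend(query, video_memory)
    return text_ctx + video_ctx, text_w, video_w


rng = np.random.default_rng(0)
d = 8                                      # hypothetical feature size
text_memory = rng.normal(size=(12, d))     # stand-in for 12 encoded text tokens
video_memory = rng.normal(size=(30, d))    # stand-in for 30 encoded video frames
query = rng.normal(size=d)                 # stand-in for one decoder step

context, text_w, video_w = multi_source_context(query, text_memory, video_memory)
```

The point of the sketch is that the two input sequences can have different lengths (12 tokens versus 30 frames here) and still contribute a same-sized context vector to every decoder step.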

We train VDTTS using video and text pairs from LSVSR in which the text corresponds to the exact words spoken by a person in a video. Throughout our testing, we have determined that VDTTS cannot generate arbitrary text, which makes it less prone to misuse (e.g., the generation of fake content).

Quality

To showcase the unique strength of VDTTS in this post, we have selected two inference examples from the VoxCeleb2 test dataset and compare the performance of VDTTS to a standard text-to-speech (TTS) model. In both examples, the video frames provide prosody and word timing clues, visual information that is not available to the TTS model.

In the first example, the speaker talks at a particular pace that can be seen as periodic gaps in the ground-truth mel-spectrogram (shown below). VDTTS preserves this characteristic and generates audio that is much closer to the ground-truth than the audio generated by standard TTS without access to the video.

Similarly, in the second example, the speaker takes long pauses between some of the words. These pauses are captured by VDTTS and are reflected in the video below, whereas the TTS does not capture this aspect of the speaker’s rhythm.

We also plot fundamental frequency (F0) charts to compare the pitch generated by each model to the ground-truth pitch. In both examples, the F0 curve of VDTTS fits the ground-truth much better than the TTS curve, both in the alignment of speech and silence and in how the pitch changes over time. See more original videos and VDTTS generated videos.

We present two examples, (a) and (b), from the VoxCeleb2 test set. From top to bottom: input face images, ground-truth (GT) mel-spectrogram, mel-spectrogram output of VDTTS, mel-spectrogram output of a standard TTS model, and two plots showing the normalized F0 (normalized by mean non-zero pitch, i.e., the mean is taken only over voiced periods) of VDTTS and TTS compared to the ground-truth signal.
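The F0 normalization described in the caption can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the common convention that unvoiced frames are stored as 0 Hz (which the phrase "non-zero pitch" suggests); the function name and the sample values are hypothetical.

```python
import numpy as np


def normalize_f0(f0):
    """Divide an F0 track by its mean over voiced (non-zero) frames.

    Assumes unvoiced frames are encoded as 0, so they stay 0 after scaling.
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if not voiced.any():
        # No voiced frames: nothing to normalize by, return the track unchanged.
        return f0.copy()
    return f0 / f0[voiced].mean()


# Hypothetical track in Hz: frames 0 and 3 are unvoiced.
f0_hz = [0.0, 100.0, 200.0, 0.0, 300.0]
normalized = normalize_f0(f0_hz)
```

Here the voiced mean is 200 Hz, so the voiced frames become 0.5, 1.0 and 1.5 while the unvoiced frames stay at 0. Normalizing this way makes curves from speakers or models with different average pitch directly comparable on one plot.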

Video Samples

Columns, left to right: Original, VDTTS, VDTTS video-only, TTS.
Original displays the original video clip. VDTTS displays the audio predicted using both the video frames and the text as input. VDTTS video-only displays audio predictions using video frames only. TTS displays audio predictions using text only. Top transcript: “of space for people to make their own judgments and to come to their own”. Bottom transcript: “absolutely love dancing I have no dance experience whatsoever but as that”.

Model Performance

We’ve measured the VDTTS model’s performance using the VoxCeleb2 dataset and compared it to two baselines: TTS, and TTS with length hint (a TTS that receives the scene length). VDTTS outperforms both models by large margins in most of the aspects we measured: higher sync-to-video quality (measured by SyncNet distance), better speech quality (measured by mel cepstral distance, MCD), and lower Gross Pitch Error (GPE), which measures the percentage of frames where pitch differed by more than 20%, counted over frames for which voice was present in both the predicted and reference audio.

SyncNet distance comparison between VDTTS, TTS and the TTS with length hint (lower is better).
Mel cepstral distance comparison between VDTTS, TTS and the TTS with length hint (lower is better).
Gross Pitch Error comparison between VDTTS, TTS and the TTS with length hint (lower is better).
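As a rough illustration of the Gross Pitch Error definition above, the following NumPy sketch counts the fraction of frames whose pitch is off by more than 20%, restricted to frames that are voiced in both signals. It is not the evaluation code behind the charts. It assumes the two F0 tracks are already frame-aligned, that unvoiced frames are stored as 0, and that the deviation is measured relative to the reference pitch; the function name and sample values are hypothetical.

```python
import numpy as np


def gross_pitch_error(f0_pred, f0_ref, tolerance=0.2):
    """Fraction of jointly voiced frames whose pitch deviates by more than `tolerance`.

    Assumes frame-aligned tracks of equal length, with 0 marking unvoiced frames.
    The deviation is taken relative to the reference pitch (an assumption).
    """
    f0_pred = np.asarray(f0_pred, dtype=float)
    f0_ref = np.asarray(f0_ref, dtype=float)
    if f0_pred.shape != f0_ref.shape:
        raise ValueError("F0 tracks must have the same number of frames")

    # Only frames where voice is present in both predicted and reference audio.
    both_voiced = (f0_pred > 0) & (f0_ref > 0)
    if not both_voiced.any():
        return 0.0

    rel_err = np.abs(f0_pred[both_voiced] - f0_ref[both_voiced]) / f0_ref[both_voiced]
    return float(np.mean(rel_err > tolerance))


# Hypothetical tracks in Hz; frames 0 and 3 are excluded (one side is unvoiced).
ref = [0.0, 100.0, 100.0, 200.0, 200.0]
pred = [90.0, 110.0, 130.0, 0.0, 190.0]
gpe = gross_pitch_error(pred, ref)
```

In this example three frames are voiced in both tracks, with relative errors of 10%, 30% and 5%, so one frame in three exceeds the 20% threshold. Multiply the returned fraction by 100 to express it as a percentage, as in the post.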

Discussion and Future Work

One thing to note is that, intriguingly, VDTTS can produce video-synchronized speech without any explicit losses or constraints to promote this, suggesting that complexities such as synchronization losses or explicit modeling are unnecessary.

While this is a proof-of-concept demonstration, we believe that in the future, VDTTS can be upgraded to be used in scenarios where the input text differs from the original video signal. This kind of model would be a valuable tool for tasks such as translation dubbing.


We would like to thank the co-authors of this research: Michelle Tadmor Ramanovich, Ye Jia, Brendan Shillingford, and Miaosen Wang. We are also grateful for the valued contributions, discussions, and feedback from Nadav Bar, Jay Tenenbaum, Zach Gleicher, Paul McCartney, Marco Tagliasacchi, and Yoni Tzafir.