Lastly, GWM Avatars combines generative video and speech in a unified model to produce human-like avatars that emote and move ...
Abstract: In this paper, we take a step towards jointly modeling automatic speech recognition (speech-to-text, STT) and speech synthesis (text-to-speech, TTS) in a fully non-autoregressive way. We develop a novel multimodal ...