A groundbreaking AI application developed by a team of researchers at the Institute for Intelligent Computing, Alibaba Group, is changing the way still portraits are animated from audio. Their innovative AI app, dubbed Emote Portrait Alive (EMO), can transform a single photograph of a person's face and an accompanying soundtrack into a lifelike video of the person speaking or singing.
Published on the arXiv preprint server, the team's work represents a significant advancement beyond previous AI applications that could only generate semi-animated versions of faces from photographs. What sets EMO apart is its ability to incorporate sound without relying on 3D models or facial landmarks.
Using diffusion modeling and training on a large dataset of audio and video files (approximately 250 hours), the AI converts audio waveforms directly into video frames. This approach captures subtle facial gestures and speech nuances, producing animated images that closely resemble human expressions.
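The article does not describe EMO's actual architecture, so the following is only a minimal sketch of the general idea of audio-conditioned diffusion: a denoiser repeatedly refines a noisy frame representation while being conditioned on a window of audio features and an embedding of the reference portrait. The module names, tensor shapes, and the crude denoising update below are illustrative assumptions, not Alibaba's implementation.

```python
# Illustrative sketch only: module names, dimensions, and the simplified
# denoising loop are assumptions, not the EMO architecture from the paper.
import torch
import torch.nn as nn

class AudioConditionedDenoiser(nn.Module):
    """Toy denoiser: predicts the noise in a frame latent, conditioned on an
    audio feature window and a reference (identity) image embedding."""
    def __init__(self, latent_dim=64, audio_dim=32, ref_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + audio_dim + ref_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, noisy_latent, t, audio_feat, ref_embed):
        # Concatenate the noisy frame latent with its conditioning signals.
        x = torch.cat([noisy_latent, audio_feat, ref_embed, t], dim=-1)
        return self.net(x)  # predicted noise

def generate_frames(denoiser, audio_feats, ref_embed, steps=50):
    """Very simplified reverse-diffusion loop: one latent per audio window."""
    frames = []
    for audio_feat in audio_feats:           # one audio window per output frame
        latent = torch.randn(1, 64)          # start from pure noise
        for step in reversed(range(steps)):
            t = torch.full((1, 1), step / steps)
            pred_noise = denoiser(latent, t, audio_feat, ref_embed)
            latent = latent - pred_noise / steps  # crude denoising update
        frames.append(latent)                # a real system decodes this to pixels
    return torch.stack(frames)

# Toy usage: 5 audio windows -> 5 frame latents tied to one reference portrait.
denoiser = AudioConditionedDenoiser()
audio_feats = [torch.randn(1, 32) for _ in range(5)]
ref_embed = torch.randn(1, 64)
video_latents = generate_frames(denoiser, audio_feats, ref_embed)
print(video_latents.shape)  # torch.Size([5, 1, 64])
```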
The resulting videos faithfully replicate mouth shapes and facial expressions corresponding to the spoken words or sung lyrics, enhancing realism and expressiveness. The length of the generated video matches that of the original audio track, providing seamless synchronization between voice and visuals.
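One way to see why the generated video always matches the audio track in length is that the number of frames is simply the audio duration multiplied by the frame rate, so each audio window maps to exactly one frame. The 30 fps figure below is an assumption for illustration; the article does not state EMO's output frame rate.

```python
# Assumed 30 fps for illustration only; not stated in the article.
def frames_for_audio(duration_seconds: float, fps: int = 30) -> int:
    """Number of video frames needed to cover an audio clip of the given length."""
    return round(duration_seconds * fps)

print(frames_for_audio(12.5))       # 375 frames for a 12.5 s clip at 30 fps
print(frames_for_audio(12.5) / 30)  # 12.5 -> video duration equals audio duration
```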
The team has shared several demonstration videos showcasing the remarkable accuracy and fidelity of their AI-generated performances. They assert that EMO surpasses existing applications in terms of realism and expressiveness, marking a significant leap forward in AI-driven animation technology.
In these videos, viewers can see the original still image alongside the animated rendition, synchronized with the recorded audio. EMO's ability to turn static images into dynamic, lifelike videos holds vast potential for applications ranging from entertainment and storytelling to personalized communication and beyond.
More: https://techxplore.com/news/2024-03-ai-voice-track-video-person.html
