ByteDance, the owner of TikTok, has revealed a new AI that can make still photos speak.
ByteDance recently announced an AI system called INFP, which lets static avatars “speak” and react in response to voice input. Unlike earlier approaches, INFP doesn’t require speaking and listening roles to be assigned manually; the system assigns them automatically based on the flow of the conversation.
The INFP workflow consists of two main steps. In the first step, known as “motion-based head mimicry,” the system analyzes facial expressions and head movements in conversation videos and extracts the motion details. This motion data is then converted into a format that can drive subsequent animations, allowing a still image to reproduce the movements of the original person.
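As a rough illustration of this first step (a toy sketch, not ByteDance's implementation), head motion can be treated as a compact per-frame code that is kept separate from the person's appearance. The Python below assumes a hypothetical three-value pose (yaw, pitch, roll in degrees) per frame, records each frame's offset from the first frame, and replays those offsets on a still portrait's resting pose. The names `HeadPose`, `extract_motion` and `apply_motion` are invented for this example; the real system learns its motion representation from video.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class HeadPose:
    """Hypothetical head pose in degrees; INFP's real motion code is learned."""
    yaw: float
    pitch: float
    roll: float


def extract_motion(frames: list[HeadPose]) -> list[HeadPose]:
    """Return each frame's offset from the first frame (motion without identity)."""
    if not frames:
        return []
    ref = frames[0]
    return [HeadPose(f.yaw - ref.yaw, f.pitch - ref.pitch, f.roll - ref.roll)
            for f in frames]


def apply_motion(still: HeadPose, motion: list[HeadPose]) -> list[HeadPose]:
    """Replay the extracted offsets on top of a still portrait's resting pose."""
    return [HeadPose(still.yaw + m.yaw, still.pitch + m.pitch, still.roll + m.roll)
            for m in motion]


# Made-up driving video: the speaker turns to one side and nods slightly.
driving = [HeadPose(10.0, 0.0, 0.0), HeadPose(14.0, -2.0, 0.0), HeadPose(18.0, -4.0, 1.0)]
portrait = HeadPose(0.0, 5.0, 0.0)  # resting pose of the still image
animated = apply_motion(portrait, extract_motion(driving))
```

Because only offsets are transferred, the first animated frame equals the portrait's own resting pose, and later frames follow the driving video's movement rather than its absolute head position.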
The second step is “audio-guided motion generation,” where the system generates natural motion patterns based on audio input. The research team developed a component it calls a “motion guider,” which analyzes audio from both sides of a conversation to generate speaking and listening motion patterns. An AI component called a diffusion transformer then gradually refines these patterns to produce smooth, realistic motions that closely match the audio content.
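The data flow in this second step can be sketched in a few lines of Python. This is a toy under stated assumptions, not the published model: per-frame loudness values stand in for the two audio tracks, a hand-written rule stands in for the learned component that turns audio into speaking and listening patterns, and a simple step-by-step blend from random noise stands in for the diffusion transformer's gradual refinement. All function names and constants are hypothetical.

```python
import random


def guide_motion(own_audio, partner_audio):
    """Toy stand-in for the learned audio-to-motion mapping.

    Returns one (mouth_open, nod) pair per frame. No speaker/listener role is
    passed in: the behaviour follows from which audio track is louder.
    """
    targets = []
    for own, other in zip(own_audio, partner_audio):
        total = own + other
        speaking = own / total if total > 0 else 0.0  # 1.0 = avatar is talking
        mouth_open = speaking * own                   # lips follow own voice
        nod = (1.0 - speaking) * other                # nodding follows partner
        targets.append((mouth_open, nod))
    return targets


def refine(targets, steps=20, seed=0):
    """Toy stand-in for iterative denoising (a real diffusion model uses a
    learned denoiser): start from noise, move stepwise towards the target."""
    rng = random.Random(seed)
    motion = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in targets]
    for step in range(1, steps + 1):
        blend = step / steps  # small early corrections, full correction at the end
        motion = [
            (m + blend * (t_m - m), n + blend * (t_n - n))
            for (m, n), (t_m, t_n) in zip(motion, targets)
        ]
    return motion


# Made-up loudness per frame: the avatar talks first, then its partner does.
own_audio = [0.8, 0.9, 0.0, 0.0]
partner_audio = [0.0, 0.1, 0.7, 0.9]
targets = guide_motion(own_audio, partner_audio)
motion = refine(targets)
```

In the first two frames the avatar's own track dominates, so the mouth value is high; in the last two the partner's track dominates and the nod value takes over, without any role flag being supplied.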
To train the system effectively, the research team also created a conversational dataset called DyConv, which contains over 200 hours of real conversation videos. Compared to existing conversational datasets (such as ViCo and RealTalk), DyConv offers advantages in emotional expression and video quality.
Although INFP currently supports only voice input, the research team is exploring the possibility of expanding the system to include images and text, with the future goal of creating realistic animations of full-body characters. However, since such technology could be used to create fake videos and spread misinformation, the research team plans to restrict the underlying technology to research institutions, similar to how Microsoft handles its advanced voice cloning system.
The technology is part of ByteDance's broader AI strategy, which draws on its popular apps TikTok and CapCut as a broad platform for deploying AI applications.