Talking Face Generation by Conditional Recurrent Adversarial Network

Published in IJCAI, 2019

Given an arbitrary face image and an arbitrary speech clip, the proposed work attempts to generate the talking face video with accurate lip synchronization. Existing works either do not consider temporal dependency across video frames, thus yielding abrupt facial and lip movement, or are limited to the generation of talking face video for a specific person, thus lacking generalization capacity. We propose a novel conditional recurrent generation network that incorporates both image and audio features in the recurrent unit for temporal dependency. To achieve both image- and video-realism, a pair of spatial-temporal discriminators is included in the network for better image/video quality. Since accurate lip synchronization is essential to the success of talking face video generation, we also construct a lip-reading discriminator to boost the accuracy of lip synchronization. We also extend the network to model the natural pose and expression of talking face on the Obama Dataset. Extensive experimental results demonstrate the superiority of our framework over state-of-the-art methods in terms of visual quality, lip sync accuracy, and smooth transition pertaining to both lip and facial movement.
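To make the idea of conditioning a recurrent unit on both image and audio features concrete, the following is a minimal sketch and not the network from the paper. It assumes a plain tanh recurrence, randomly initialised weights, and tiny hand-written feature vectors; the function names (`conditional_recurrent_step`, `run_sequence`), the dimensions, and the update rule are all hypothetical choices made for illustration. The point it shows is only that each step sees the same identity image feature, the current audio feature, and the previous hidden state, so consecutive frames are not generated independently.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    """Hypothetical weight initialisation; real weights would be learned."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def conditional_recurrent_step(hidden, image_feat, audio_feat, weights):
    """One recurrent step (assumed tanh update, not the paper's exact unit).

    The new hidden state depends on the previous hidden state, the fixed
    image feature, and the audio feature for the current time step.
    """
    w_h, w_i, w_a = weights
    pre_activation = [
        h + i + a
        for h, i, a in zip(
            matvec(w_h, hidden), matvec(w_i, image_feat), matvec(w_a, audio_feat)
        )
    ]
    return [math.tanh(p) for p in pre_activation]


def run_sequence(image_feat, audio_feats, hidden_dim=4, seed=0):
    """Unroll the recurrent unit over a sequence of audio features.

    Returns one hidden state per audio step; in a full model each state
    would be decoded into one video frame.
    """
    rng = random.Random(seed)
    weights = (
        random_matrix(hidden_dim, hidden_dim, rng),
        random_matrix(hidden_dim, len(image_feat), rng),
        random_matrix(hidden_dim, len(audio_feats[0]), rng),
    )
    hidden = [0.0] * hidden_dim
    states = []
    for audio_feat in audio_feats:
        hidden = conditional_recurrent_step(hidden, image_feat, audio_feat, weights)
        states.append(hidden)
    return states


# Illustrative inputs: one feature vector for the face image,
# one feature vector per audio window.
image_feat = [0.5, -0.2, 0.1]
audio_feats = [[0.1, 0.3], [0.4, -0.1], [0.0, 0.2]]
states = run_sequence(image_feat, audio_feats)
print(len(states), len(states[0]))
```

Because the hidden state is carried from step to step, feeding the second audio window on its own gives a different state than feeding it after the first window. That carry-over is the temporal dependency the abstract refers to. In practice the features would come from learned image and audio encoders, and the recurrent unit would typically be a gated cell rather than this bare tanh update.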