Are You Speaking: Real-Time Speech Activity Detection via Landmark Pooling Network
Published in FG, 2019
In this paper, we propose a novel vision-based framework for real-time speech activity detection. Unlike conventional methods, which commonly use the audio signal as input, our approach feeds facial information into a deep neural network for feature learning. Rather than processing the whole input image, we develop a novel end-to-end landmark pooling network that acts as an attention-guidance scheme, steering the network toward only a small portion of the input image. This helps the network learn highly discriminative features for speech activity precisely and efficiently. Moreover, we employ a recurrent neural network with gated recurrent units (GRUs) to exploit the sequential information in the video and produce the final decision. For a comprehensive evaluation of the proposed method, we collect a large-scale dataset of unconstrained speech activities, consisting of a large number of speech/non-speech video sequences under various kinds of degradation. Experimental results and comparisons with other methods demonstrate that our approach achieves very promising results while remaining real-time and memory-efficient. [Paper coming soon]
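Since the full paper is not yet available, the pipeline can only be inferred from the abstract. The toy NumPy sketch below illustrates the two ideas named there, pooling features only around facial landmarks rather than the whole frame, and aggregating the per-frame features over time with a GRU. All shapes, landmark positions, weights, and the readout are illustrative assumptions, not the authors' actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

def landmark_pool(frame, landmarks, patch=8):
    """Hypothetical simplification of landmark pooling: average a small
    patch centred on each facial landmark instead of using the full image."""
    h, w = frame.shape
    feats = []
    for (x, y) in landmarks:
        x0, x1 = max(0, x - patch // 2), min(w, x + patch // 2)
        y0, y1 = max(0, y - patch // 2), min(h, y + patch // 2)
        feats.append(frame[y0:y1, x0:x1].mean())
    return np.array(feats)

class GRUCell:
    """Minimal GRU cell (standard formulation, not the paper's exact model)."""
    def __init__(self, in_dim, hid_dim):
        s = 1.0 / np.sqrt(hid_dim)
        self.Wz = rng.uniform(-s, s, (hid_dim, in_dim + hid_dim))
        self.Wr = rng.uniform(-s, s, (hid_dim, in_dim + hid_dim))
        self.Wh = rng.uniform(-s, s, (hid_dim, in_dim + hid_dim))

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = 1 / (1 + np.exp(-self.Wz @ xh))   # update gate
        r = 1 / (1 + np.exp(-self.Wr @ xh))   # reset gate
        h_new = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        return (1 - z) * h + z * h_new

# Toy "video": 10 frames of 64x64 grayscale, 5 assumed mouth-region landmarks.
frames = rng.random((10, 64, 64))
landmarks = [(30, 44), (34, 44), (32, 46), (28, 45), (36, 45)]

cell = GRUCell(in_dim=5, hid_dim=16)
h = np.zeros(16)
for f in frames:
    h = cell.step(landmark_pool(f, landmarks), h)

# A (random, untrained) linear readout on the final hidden state stands in
# for the speaking/non-speaking decision.
score = 1 / (1 + np.exp(-(rng.uniform(-1, 1, 16) @ h)))
print(round(float(score), 3))
```

Because only small patches around the landmarks are touched per frame, the per-frame cost is independent of image size, which is consistent with the real-time and memory-efficiency claims in the abstract.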