Egocentric Early Action Prediction via Adversarial Knowledge Distillation

Egocentric early action prediction aims to recognize an action from the first-person view after observing only a partial video segment, which is challenging because the partial video provides limited context. To tackle this problem, we propose a novel multi-modal adversarial knowledge distillation framework. Specifically, our approach involves a teacher network, which learns an enhanced representation of the partial video by also considering the future unobserved video segment, and a student network, which mimics the teacher network to produce a powerful representation of the partial video and predicts the action label from it. To promote knowledge distillation between the teacher and student networks, we integrate adversarial learning with latent and discriminative knowledge regularizations, which encourage the learned representations of the partial video to be more informative and discriminative for action prediction. Finally, we devise a multi-modal fusion module to predict the action label comprehensively. Extensive experiments on two public egocentric datasets validate the superiority of our method over state-of-the-art methods.
Framework

Illustration of the proposed egocentric early action prediction scheme with Adversarial Knowledge Distillation (ADK). For simplicity, we show only two modalities, visual content and audio signals, which correspond to two teacher sub-networks and two student sub-networks. The important knowledge from the teacher network is distilled to the student network with an adversarial learning strategy, in which latent knowledge regularization (LKR) and discriminative knowledge regularization (DKR) are incorporated to make the representation learned by the student network more informative and discriminative for egocentric early action prediction.
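
To make the roles of the terms in the caption concrete, the following is a minimal sketch of how a student-side training objective for one modality could combine them. It is not the paper's formulation: the text above does not define LKR, DKR, or the adversarial loss, so the sketch assumes mean squared error between teacher and student features for LKR, a KL divergence between teacher and student class distributions for DKR, and a standard non-saturating GAN loss in which a discriminator scores how "teacher-like" a student feature looks. All function names and the weights `w_adv`, `w_lkr`, and `w_dkr` are hypothetical.

```python
import math

EPS = 1e-12  # keeps log() finite when a probability reaches 0


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_loss(student_logits, label):
    """Cross-entropy of the student's prediction against the true action label."""
    return -math.log(max(softmax(student_logits)[label], EPS))


def lkr_loss(teacher_feat, student_feat):
    """ASSUMED form of LKR: mean squared error between latent features."""
    squared = [(t - s) ** 2 for t, s in zip(teacher_feat, student_feat)]
    return sum(squared) / len(squared)


def dkr_loss(teacher_logits, student_logits):
    """ASSUMED form of DKR: KL(teacher || student) over class probabilities."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(max(pi, EPS) / max(qi, EPS)) for pi, qi in zip(p, q))


def adversarial_loss(disc_score):
    """ASSUMED non-saturating GAN loss for the student.

    disc_score is the discriminator's probability, in (0, 1], that the
    student feature came from the teacher; the student wants it near 1.
    """
    return -math.log(max(disc_score, EPS))


def student_objective(teacher_feat, student_feat, teacher_logits,
                      student_logits, disc_score, label,
                      w_adv=1.0, w_lkr=1.0, w_dkr=1.0):
    """Weighted sum of the four terms; the weights are hypothetical."""
    return (classification_loss(student_logits, label)
            + w_adv * adversarial_loss(disc_score)
            + w_lkr * lkr_loss(teacher_feat, student_feat)
            + w_dkr * dkr_loss(teacher_logits, student_logits))
```

Under these assumptions, a student whose features and predictions match the teacher's, and which fools the discriminator, gets a lower objective than one that does not. In a multi-modal setting such as the visual and audio branches in the figure, one such objective would be computed per modality before fusion.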