Real-time semantics in video sequences

The study of semantics aims to give meaning to data. In the case of video sequences, semantics makes it possible to analyze the scene, an analysis on which many real-world applications can rely. To this end, we start by defining two levels of semantics that can be extracted from videos. First, we define as low-level semantics all information describing the natural content of the video, comprising the objects and the environment of the scene. Second, high-level semantics characterizes the interpretation of the events occurring in the scene, which relates to a deeper understanding of the role of the elements composing this scene.

In the first part, we explore several approaches to extract low-level semantics from video sequences in real time, since most current state-of-the-art methods are too slow for real-time use. In particular, we focus on three types of low-level semantics: motion detection, semantic segmentation, and object detection. As a first contribution, we develop an asynchronous combination method that leverages the output of a slow segmentation network to improve the performance of a real-time background subtraction algorithm while keeping inference real-time. As a second contribution, we present a novel method to train a fast segmentation network by leveraging the output of another, slower but more accurate, segmentation network while constantly adapting to the latest video conditions. Finally, we show that this method, called online knowledge distillation, is also effective for detecting players on a soccer field, even when the two networks process videos with different modalities and fields of view.
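
The idea behind online knowledge distillation can be illustrated with a minimal sketch. The code below is not the method developed in this work; it is a toy stand-in under stated assumptions: the "teacher" is a hypothetical thresholding function that plays the role of the slow, accurate network, the "student" is a hypothetical per-pixel logistic model that plays the role of the fast network, and the names `teacher_predict`, `Student`, `run_online_distillation`, `teacher_period`, and `buffer_size` are all introduced here for illustration only. The sketch shows the general pattern: the student runs on every frame, the teacher labels only an occasional frame, and the student is continually retrained on a small buffer of the most recent teacher-labelled frames so that it tracks changing conditions.

```python
import math
import random
from collections import deque


def teacher_predict(frame, brightness):
    """Hypothetical slow teacher: a pixel is foreground if it exceeds the scene brightness."""
    return [1.0 if x > brightness else 0.0 for x in frame]


class Student:
    """Hypothetical fast student: per-pixel logistic model p = sigmoid(w * x + b)."""

    def __init__(self):
        self.w, self.b = 0.0, 0.0

    def predict(self, frame):
        return [1.0 / (1.0 + math.exp(-(self.w * x + self.b))) for x in frame]

    def train_step(self, frame, labels, lr=0.2):
        # One gradient step on the mean binary cross-entropy against the teacher labels.
        probs = self.predict(frame)
        n = len(frame)
        grad_w = sum((p - y) * x for p, y, x in zip(probs, labels, frame)) / n
        grad_b = sum(p - y for p, y in zip(probs, labels)) / n
        self.w -= lr * grad_w
        self.b -= lr * grad_b


def run_online_distillation(frames, brightness_per_frame, teacher_period=5, buffer_size=4):
    student = Student()
    buffer = deque(maxlen=buffer_size)  # keeps only the latest teacher-labelled frames
    teacher_calls = 0
    outputs = []
    for t, frame in enumerate(frames):
        outputs.append(student.predict(frame))  # fast path: runs on every frame
        if t % teacher_period == 0:  # slow path: runs only on an occasional frame
            buffer.append((frame, teacher_predict(frame, brightness_per_frame[t])))
            teacher_calls += 1
        for sample, labels in buffer:  # adapt the student to the most recent conditions
            student.train_step(sample, labels)
    return student, outputs, teacher_calls


# Synthetic stream (an assumption for the demo): random pixels, slowly drifting brightness.
rng = random.Random(0)
frames = [[rng.random() for _ in range(64)] for _ in range(200)]
brightness = [0.3 + 0.2 * t / 199 for t in range(200)]
student, outputs, teacher_calls = run_online_distillation(frames, brightness)
```

In this toy setting the teacher is queried on only one frame in five, while the student produces a prediction for every frame. The bounded buffer is one simple way to bias training toward recent conditions; other replacement policies are possible.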

In the second part, we focus on high-level semantics describing the events taking place during a soccer game. First, we leverage low-level semantics to progressively build a higher-level understanding of the game, and we present a simple yet effective semantic-based decision tree to segment the following game phases: goal or goal opportunity, attack, middle, and defense. In a second approach, we develop a novel network architecture coupled with a context-aware loss function to spot game events such as goals, cards, and substitutions, and show that it achieves state-of-the-art performance on the SoccerNet dataset. As a final contribution, we publicly release a novel dataset containing high-level semantic annotations, comprising a complete set of game events and semantics related to the editing of the TV broadcast. This allows us to define four challenging tasks: action spotting, camera shot temporal segmentation, camera shot boundary detection, and replay grounding. We hope that this dataset will become the reference for high-level semantics in soccer videos.
