SoccerNet-Tracking: Multiple Object Tracking Dataset and Benchmark in Soccer Videos
A. Cioppa, S. Giancola, A. Deliège, L. Kang, X. Zhou, Z. Cheng, B. Ghanem, and M. Van Droogenbroeck. In IEEE International Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), CVsports, Virtual event, June 2022.
Tracking objects in soccer videos is extremely important to gather both player and team statistics, whether it is to estimate the total distance run, the ball possession, or the team formation. Video processing can help automate the extraction of this information without the need for any invasive sensor, making it applicable to any team in any stadium. Yet, the availability of datasets to train learnable models, and of benchmarks to evaluate methods on a common testbed, is very limited. In this work, we propose a novel dataset for multiple object tracking composed of 200 sequences of 30s each, representative of challenging soccer scenarios, and a complete 45-minute half-time for long-term tracking. The dataset is fully annotated with bounding boxes and tracklet IDs, enabling the training of MOT baselines in the soccer domain and a full benchmarking of those methods on our segregated challenge sets. Our analysis shows that multiple player, referee, and ball tracking in soccer videos is far from being solved, with several improvements required in cases of fast motion or severe occlusion.
Scaling up SoccerNet with multiview spatial localization and re-identification
A. Cioppa, A. Deliège, S. Giancola, B. Ghanem, and M. Van Droogenbroeck. Scientific Data, 9(1), June 2022.
Soccer videos are a rich playground for computer vision, involving many elements, such as players, lines, and specific objects. Hence, to capture the richness of this sport and allow for fine automated analyses, we release SoccerNet-v3, a major extension of the SoccerNet dataset, providing a wide variety of spatial annotations and cross-view correspondences. SoccerNet’s broadcast videos contain replays of important actions, allowing us to retrieve the same action from different viewpoints. We annotate those live and replay action frames showing the same moments with exhaustive local information. Specifically, we label lines, goal parts, players, referees, teams, salient objects, and jersey numbers, and we establish player correspondences between the views. This yields 1,324,732 annotations on 33,986 soccer images, making SoccerNet-v3 the largest dataset for multi-view soccer analysis. Derived tasks may benefit from these annotations, like camera calibration, player localization, team discrimination, and multi-view re-identification, which can further sustain practical applications in augmented reality and soccer analytics. Finally, we provide Python codes to easily download our data and access our annotations.
Semi-Supervised Training to Improve Player and Ball Detection in Soccer
R. Vandeghen, A. Cioppa, and M. Van Droogenbroeck.In IEEE International Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), CVsports, Virtual event, June 2022.
Accurate player and ball detection has become increasingly important in recent years for sport analytics. As most state-of-the-art methods rely on training deep learning networks in a supervised fashion, they require huge amounts of annotated data, which are rarely available. In this paper, we present a novel generic semi-supervised method to train a network on a labeled image dataset by leveraging a large unlabeled dataset of soccer broadcast videos. More precisely, we design a teacher-student approach in which the teacher produces surrogate annotations on the unlabeled data, later used to train a student that has the same architecture as the teacher. Furthermore, we introduce three training loss parametrizations that allow the student to doubt the predictions of the teacher during training, depending on the proposal confidence score. We show that including unlabeled data in the training process substantially improves the performance of the detection network compared to training only on the labeled data. Finally, we provide a thorough performance study including different proportions of labeled and unlabeled data, and establish the first benchmark on the new SoccerNet-v3 detection task, with an mAP of 52.3%.
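The core mechanism described above — a student that "doubts" teacher proposals according to their confidence — can be sketched as a confidence-dependent loss weighting. The function names and the three parametrizations below (hard threshold, linear, quadratic) are illustrative assumptions, not the paper's exact formulations:

```python
def pseudo_label_weight(confidence, scheme="linear", threshold=0.5):
    """Weight a teacher proposal by its confidence score.

    Hypothetical parametrizations of the 'doubt' idea: the exact forms
    used in the paper may differ.
    """
    if scheme == "hard":        # keep only confident proposals
        return 1.0 if confidence >= threshold else 0.0
    if scheme == "linear":      # weight grows linearly with confidence
        return confidence
    if scheme == "quadratic":   # penalize low-confidence proposals more
        return confidence ** 2
    raise ValueError(f"unknown scheme: {scheme}")


def student_detection_loss(per_proposal_losses, confidences, scheme="linear"):
    """Aggregate the student's per-proposal losses, each scaled by the
    teacher's confidence in the corresponding surrogate annotation."""
    weights = [pseudo_label_weight(c, scheme) for c in confidences]
    total = sum(w * l for w, l in zip(weights, per_proposal_losses))
    return total / max(sum(weights), 1e-8)  # normalize by total weight
```

Under a hard scheme the student simply ignores low-confidence pseudo-labels, while the soft schemes let every proposal contribute in proportion to the teacher's certainty.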
Camera Calibration and Player Localization in SoccerNet-v2 and Investigation of their Representations for Action Spotting
A. Cioppa, A. Deliège, F. Magera, S. Giancola, O. Barnich, B. Ghanem, and M. Van Droogenbroeck.In IEEE International Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), CVsports, Virtual event, June 2021.
In this work, we focus on the topic of camera calibration and on its current limitations for the scientific community. More precisely, we tackle the absence of a large-scale calibration dataset and of a public calibration network trained on such a dataset. Specifically, we distill a powerful commercial calibration tool into a recent neural network architecture on the large-scale SoccerNet dataset, composed of untrimmed broadcast videos of 500 soccer games. We further release our distilled network, and leverage it to provide three ways of representing the calibration results along with player localization. Finally, we exploit those representations within the current best architecture for the action spotting task of SoccerNet-v2, and achieve new state-of-the-art performance.
SoccerNet-v2: A Dataset and Benchmarks for Holistic Understanding of Broadcast Soccer Videos
A. Deliège, A. Cioppa, S. Giancola, M. Seikavandi, J. Dueholm, K. Nasrollahi, B. Ghanem, T. Moeslund, and M. Van Droogenbroeck. In IEEE International Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), CVsports, Virtual event, June 2021.
In this work, we propose SoccerNet-v2, a novel large-scale corpus of manual annotations for the SoccerNet video dataset, along with open challenges to encourage more research in soccer understanding and broadcast production. Specifically, we release around 300k annotations within SoccerNet’s 500 untrimmed broadcast soccer videos. We extend current tasks in the realm of soccer to include action spotting and camera shot segmentation with boundary detection, and we define a novel replay grounding task. For each task, we provide and discuss benchmark results, reproducible with our open-source adapted implementations of the most relevant works in the field.
A Context-Aware Loss Function for Action Spotting in Soccer Videos
A. Cioppa, A. Deliège, S. Giancola, B. Ghanem, M. Van Droogenbroeck, R. Gade, and T. Moeslund. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pages 13123–13133, Seattle, Washington, USA, June 2020.
In this paper, we propose a novel loss function that specifically considers the temporal context naturally present around each action, rather than focusing on the single annotated frame to spot. We benchmark our loss on a large dataset of soccer videos, SoccerNet, and achieve an improvement of 12.8% over the baseline. We show the generalization capability of our loss for generic activity proposals and detection on ActivityNet, by spotting the beginning and the end of each activity. Furthermore, we provide an extended ablation study and display challenging cases for action spotting in soccer videos. Finally, we qualitatively illustrate how our loss induces a precise temporal understanding of actions and show how such semantic knowledge can be used for automatic highlights generation.
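The idea of exploiting the temporal context around each annotated action frame can be sketched as a time-shift-dependent weighting: frames just before an action are treated differently from frames just after it, and distant frames are neutral. The segment boundaries and weight shapes below are purely illustrative assumptions, not the paper's actual loss:

```python
def context_weight(t, action_time, just_before=10.0, just_after=5.0):
    """Simplified temporal weighting around an annotated action at
    `action_time` (seconds). Frames shortly before the action ramp up
    toward it, frames shortly after ramp down, and frames far away get
    zero contextual weight. Values are illustrative only."""
    dt = t - action_time
    if -just_before <= dt < 0:
        # just before the action: weight grows as the action approaches
        return 1.0 - abs(dt) / just_before
    if 0 <= dt <= just_after:
        # just after the action: weight decays away from the spot
        return 1.0 - dt / just_after
    return 0.0  # far from the action: no contextual influence
```

A loss built on such a weighting penalizes a prediction according to where it falls relative to the annotated frame, rather than treating the single annotated frame as the only correct answer.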
Multimodal and multiview distillation for real-time player detection on a football field
A. Cioppa, A. Deliège, N. Ul Huda, R. Gade, M. Van Droogenbroeck, and T. Moeslund. In IEEE International Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), CVsports, pages 3846–3855, Seattle, Washington, USA, June 2020.
In this paper, we develop a system that detects players from a single cheap wide-angle fisheye camera assisted by a single narrow-angle thermal camera. We train a network in a knowledge distillation approach in which the student and the teacher have different modalities and different views of the same scene. In particular, we design a custom data augmentation combined with a motion detection algorithm to handle training in the region of the fisheye camera not covered by the thermal one. We show that our solution is effective in detecting players on the whole field filmed by the fisheye camera. We evaluate it quantitatively and qualitatively in the case of an online distillation, where the student detects players in real time while being continuously adapted to the latest video conditions.
ARTHuS: Adaptive real-time human segmentation in sports through online distillation
A. Cioppa, A. Deliège, M. Istasse, C. De Vleeschouwer, and M. Van Droogenbroeck. In IEEE International Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), CVsports, pages 2505–2514, Long Beach, California, USA, June 2019.
Semantic segmentation can be regarded as a useful tool for global scene understanding in many areas, including sports, but has inherent difficulties, such as the need for pixel-wise annotated training data and the absence of well-performing real-time universal algorithms. To alleviate these issues, we sacrifice universality by developing a general method, named ARTHuS, that produces adaptive real-time match-specific networks for human segmentation in sports videos, without requiring any manual annotation. This is done by an online knowledge distillation process, in which a fast student network is trained to mimic the output of an existing slow but effective universal teacher network, while being periodically updated to adjust to the latest play conditions. As a result, ARTHuS makes it possible to build highly effective real-time human segmentation networks that evolve through the match and sometimes outperform their teacher. The usefulness of producing adaptive match-specific networks and their excellent performance are demonstrated quantitatively and qualitatively for soccer and basketball matches.
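The online distillation loop described above — a slow teacher pseudo-labeling incoming frames while a fast student serves real-time predictions and is periodically fine-tuned — can be sketched as follows. The `predict`/`fine_tune` interfaces, buffer size, and update period are hypothetical; in practice the teacher would run asynchronously on a subset of frames:

```python
from collections import deque

def online_distillation(frames, teacher, student,
                        buffer_size=100, update_every=50):
    """Sketch of an online distillation loop.

    The fast student answers in real time on every frame, while the slow
    teacher fills a rolling buffer of (frame, pseudo-label) pairs used to
    periodically fine-tune the student toward the latest play conditions.
    """
    buffer = deque(maxlen=buffer_size)  # keep only the most recent pairs
    outputs = []
    for i, frame in enumerate(frames):
        outputs.append(student.predict(frame))          # real-time inference
        buffer.append((frame, teacher.predict(frame)))  # slow pseudo-labeling
        if (i + 1) % update_every == 0:
            student.fine_tune(list(buffer))             # periodic adaptation
    return outputs
```

The rolling buffer is what makes the student match-specific: older conditions are gradually evicted, so the student tracks lighting, camera, and play changes as the match progresses.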
A Bottom-Up Approach Based on Semantics for the Interpretation of the Main Camera Stream in Soccer Games
A. Cioppa, A. Deliège, and M. Van Droogenbroeck. In IEEE International Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), CVsports, pages 1846–1855, Salt Lake City, Utah, USA, June 2018.
This paper describes a bottom-up approach based on the extraction of semantic features from the video stream of the main camera in the particular case of soccer, using scene-specific techniques. In our approach, all the features, ranging from the pixel level to the game event level, have a semantic meaning. First, we design our own scene-specific deep learning semantic segmentation network and hue histogram analysis to extract pixel-level semantics for the field, players, and lines. These pixel-level semantics are then processed to compute interpretative semantic features, which represent characteristics of the game in the video stream that are exploited to interpret soccer. For example, they correspond to how players are distributed in the image or to the part of the field that is filmed. Finally, we show how these interpretative semantic features can be used to set up and train a semantic-based decision tree classifier for major game events with a restricted amount of training data. The main advantage of our semantic approach is that it only requires the video feed of the main camera to extract the semantic features, with no need for camera calibration, field homography, player tracking, or ball position estimation.