
Is end-to-end learning enough for fitness activity recognition?

2023, arXiv (Cornell University)

End-to-end learning has taken hold of many computer vision tasks, in particular those related to still images, with task-specific optimization yielding very strong performance. Nevertheless, human-centric action recognition is still largely dominated by hand-crafted pipelines, in which only individual components are replaced by neural networks that typically operate on individual frames. As a testbed to study the relevance of such pipelines, we present a new fully annotated video dataset of fitness activities. Any recognition capabilities in this domain are almost exclusively a function of human poses and their temporal dynamics, so pose-based solutions should perform well. We show that, with this labelled data, end-to-end learning on raw pixels can compete with state-of-the-art action recognition pipelines based on pose estimation. We also show that end-to-end learning can support temporally fine-grained tasks such as real-time repetition counting.

arXiv:2305.08191v1 [cs.CV] 14 May 2023

Is end-to-end learning enough for fitness activity recognition?

Antoine Mercier1,2, Guillaume Berger1,2, Sunny Panchal1,2, Florian Letsch1, Cornelius Boehm1, Nahua Kang1, Ingo Bax1,2, Roland Memisevic1,2
Twenty Billion Neurons GmbH1, Qualcomm AI Research2∗

∗ Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc. Preprint. Under review.

1 Introduction

Action recognition in videos has slowly been transitioning to real-world applications following extensive advancements in feature representation and deep learning-based architectures. In many applications, models need to extract detailed information about the underlying spatio-temporal dynamics. Towards this, end-to-end learning has recently had a lot of success on generic action recognition datasets comprising varied everyday activities [1, 2, 3]. However, pose-based pipelines seem to remain the preferred solution when the task is strongly related to analyzing body motions [4, 5, 6, 7, 8], such as in the rapidly growing application domain of virtual fitness, where an AI system can be used to deliver real-time form feedback and count exercise repetitions.

In this paper, we present a new fitness action recognition dataset with granular intra-exercise labels and compare the few-shot learning abilities of pose estimation-based pipelines with end-to-end learning from raw pixels. We also compare the influence of using different pre-training datasets on the chosen models and additionally train them for repetition counting.

Common approaches to generic video understanding based on end-to-end learning include combinations of 2D-CNNs for spatial feature extraction followed by an LSTM module for learning temporal dynamics [9, 10], directly learning spatio-temporal dynamics with a 3D-CNN [11], or combining a 3D-CNN with an LSTM [12]. Temporal understanding can be further improved with a two-stream approach, in which a second CNN-based stream is trained on optical flow [1, 13, 14]. The large parameter space of 3D-CNNs can be prohibitive, and efforts to reduce it include dual-pathway approaches that process low and high frame rates [15] or resolutions [16], temporally shifting frames in a 2D-CNN [17], and non-uniformly aggregating features over time [18]. Using a multi-task approach, an end-to-end model jointly trained for pose estimation and subsequent action classification was shown to improve the performance of the individual components [19] – but pose information is still needed for training.
Dataset               | Focus on body motions | Fine-grained labels | Controlled environment | "In the wild" | Large-scale
QAR-EVD (ours)        | X                     | X                   | X                      | X             | ×
NTU RGB+D             | X                     | X                   | X                      | ×             | X
FineGym               | X                     | X                   | ×                      | X             | X
Jester                | X                     | ×                   | X                      | X             | X
Something-Something   | ×                     | X                   | ×                      | X             | X
Charades              | ×                     | X                   | ×                      | X             | X
Kinetics              | ×                     | ×                   | ×                      | X             | X
Moments in Time       | ×                     | ×                   | ×                      | X             | X

Table 1: Side-by-side comparison of QAR-EVD (ours) versus common video datasets, including NTU RGB+D [6], FineGym [38], Jester [3], Something-Something [2], Charades [39], Kinetics [40] and Moments in Time [41], based on five criteria: a) focus on body motion, b) fine-grained label taxonomy (e.g. presence of intra-activity variations), c) controlled environment (e.g. fixed camera angle in a home environment), d) "in the wild" (as opposed to e.g. recorded in a lab), and e) dataset size sufficient for stand-alone pre-training.

Pose-based solutions for action recognition have two main stages: pose extraction and action classification. While bottom-up pose estimation approaches extract skeletons in one step [20, 21, 22, 23], top-down methods split pose estimation into localization first and pose extraction second [4, 24, 25, 26]. The classification stage is then optimized independently, with no end-to-end finetuning of the whole pipeline. Pose-based action classifiers typically use either hand-crafted features [27, 28, 29] or, increasingly, deep learning-based modules. Recent approaches have employed CNNs [30, 31, 19], LSTMs [32, 33, 5, 34], graph CNNs [8, 35, 36], or 3D-CNNs on top of pose heatmaps [37].

In addition to an appropriate model architecture, a dataset with a fine-grained action taxonomy is crucial to learning robust action representations. Existing RGB-based video datasets such as Kinetics [40], Moments in Time [41] and Sports-1M [42] are based on a high-level taxonomy and, further, possess correlated scene-action pairings, resulting in pronounced representation bias [43, 44]. These concerns can be mitigated through crowd-sourced collections of predefined labels where the same action can be collected from multiple workers, as in the Something-Something [2] and Charades [39] datasets. However, the "everyday general human actions" within these datasets are loosely specified and left to the worker's interpretation, resulting in high inter-worker action variance. On the other hand, FineGym [38] focuses on specific fine-grained body motions but includes variability in camera position, resulting in lower overall action salience. In contrast, gesture recognition datasets such as Jester [3] control camera and worker positioning and additionally constrain human motion to appropriately specified hand gestures. A similarly constrained dataset for exact human body movement, one that also controls camera motion, does not exist, and we believe home fitness is the perfect domain in which to create one, as workers can be instructed to move in very specific ways to perform exercises.

Pose-specific datasets contain an additional layer of annotated skeletal joints, obtained either through annotation of scraped video datasets (either manually or using a pose estimation model [7]) or through a sensor-derived approach in constrained lab settings [5, 6]. We present a new crowd-sourced benchmark dataset to fill a gap in the dataset landscape (see Table 1): videos of fitness exercises in a home setting are recorded in the wild, providing challenging scene variety, while also following a fine-grained label taxonomy.
We compare end-to-end action classification models with state-of-the-art pose estimation-based action classifiers and show that the end-to-end approaches can outperform the pose estimation-based alternatives if the end-to-end models are pre-trained on a large and granular labelled video corpus. We also show that the pose estimation models themselves can greatly benefit from pre-training on the large labelled dataset.

2 The Qualcomm AI Research Exercise Videos Dataset – a new benchmark dataset

Fitness activities are defined by a well-constrained set of body movements outside of which an individual risks injury or ineffectiveness. There is an opportunity for AI systems to detect mistakes and provide real-time form-correcting feedback. To this end, we present the Qualcomm AI Research Exercise Videos Dataset, referred to later in this paper as QAR-EVD, comprised of granular video-level activity classes capturing subtle variations, including common mistakes. The dataset spans four fitness exercises recorded in a home environment by crowd workers:

• Dead bug: The user lies on the back with arms and legs raised and moves them back and forth asynchronously.
• Inchworm: From a standing position, the user touches the floor with both hands, walks them forwards, and then back again.
• Alternating lateral lunges: The user performs a lunge step in a sideways direction, alternating between both directions.
• Spiderman pushups: A pushup variation where one leg moves to touch knee to elbow.

Figure 1: Samples from the QAR-EVD dataset for all four exercises. From top to bottom: Dead bug, inchworm, alternating lateral lunges, spiderman pushups. Best viewed on a screen.

Example frames from the dataset for each exercise can be seen in Figure 1. Each exercise was recorded with deliberate variations such as increased pace or incorrect execution of different aspects of each exercise, some of which are visible from a static frame (e.g. a foot touching the floor), and others of which are only apparent across multiple consecutive frames (e.g. being too fast or too slow). In total, a fine-grained taxonomy of 40 video-level classes is available to trigger direct feedback. Each of the 40 classes contains between 130 and 140 videos, with each video lasting between 5 and 8 seconds.

The dataset is split into train, validation and test sets with no worker overlap between them. All videos are provided in MP4 format at a frame rate of 30 fps. The dataset contains 5511 videos in total across all splits (see Table 2 for details on the data split).

                           Train   Validation   Test   Overall
Number of videos            4000          711    800      5511
Number of unique workers     129           20    165       314

Table 2: Dataset statistics: number of videos and unique crowd workers in each split.

For few-shot experiments, we prepared different versions of the train split containing fewer examples per class. We release splits that contain 5, 10, 20, 50 and 100 samples per class.
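To make the few-shot splits concrete, the sketch below shows one way to derive k-shot training subsets by per-class subsampling. It assumes a hypothetical `train.csv` index with `video_id` and `class` columns; it is an illustration, not the released tooling.

```python
# Illustrative sketch: derive k-shot training splits from a (hypothetical)
# video index with one row per video. Assumed columns: video_id, class.
import pandas as pd

def make_few_shot_split(index_csv: str, k: int, seed: int = 0) -> pd.DataFrame:
    """Sample k videos per video-level class from the full training index."""
    df = pd.read_csv(index_csv)
    return (
        df.groupby("class", group_keys=False)
          .apply(lambda g: g.sample(n=min(k, len(g)), random_state=seed))
          .reset_index(drop=True)
    )

if __name__ == "__main__":
    for k in (5, 10, 20, 50, 100):
        make_few_shot_split("train.csv", k).to_csv(f"train_{k}_per_class.csv", index=False)
```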
In addition to delivering form feedback in real time, another challenging task for fitness AI applications is repetition counting. It relies on precisely parsing the temporal extent of an action segment within an activity and, as such, benefits greatly from the availability of temporal annotations. To this end, in addition to providing video-level labels, we tagged a subset of each exercise within QAR-EVD with frame-level classes, thus making it possible to benchmark models on repetition counting. More details are provided in Section 3.6.

The data has been collected in the wild by individual crowd workers who performed the actions following instructions from an example video. To match the desired viewing angle of a phone placed on the floor (the fitness app scenario), the workers recorded themselves using a camera placed at a low position. All recorded videos were reviewed to confirm the execution was performed correctly. Because of the distributed nature of the data collection, the recorded samples show a large variety of scene settings, backgrounds and illumination (see Figure 2). Each worker recorded videos for multiple action classes, so that the performed action cannot be inferred from the visible video setting, but only by learning feature representations of the actual body motion.

Figure 2: The videos in the dataset provide a wide range of lighting and scene settings. From left to right: cluttered background, textured background; high contrast, low contrast. Best viewed on a screen.

QAR-EVD has been collected for the purpose of discerning fine variations in how a worker executes an exercise. In order to create the label taxonomy and recording instructions, several fitness experts were consulted to collect a list of common mistakes and frequent variations of the individual exercises. Some examples of subtle variations are:

• Dead bug: A foot touches the floor; arms are not moving; the wrong leg is moving; execution is too fast
• Inchworm: Feet are too narrow or too wide; hands are too far from the body in the initial position; hands are stepping too far forward with each step
• Alternating lateral lunges: Bending the wrong leg; low range of motion; execution is too fast
• Spiderman pushups: Execution is too fast or too slow; leg movement is not in sync with the pushup (three different error variations are labeled); pushup is too shallow

The full label taxonomy can be found in the supplementary materials. We plan to release the dataset under a non-commercial license, which permits non-profit research only.

3 Experiments

All models were trained on subsets of the QAR-EVD training split, with 5, 10, 20, 50, and 100 samples per class, to evaluate few-shot behavior. Different initialization approaches were tested for each model, including training from scratch and starting from a pre-trained model and fine-tuning the final classification layer, all layers, or a subset of the layers. The approaches are described in more detail in Section 3.4.

3.1 Architectures

Three end-to-end and two pose estimation-based architectures are compared in our experiments. End-to-end architectures include I3D [1], SI-EN (ours) and SI-BlazePose (ours). For the pose-based pipelines, we use BlazePose [4] to localize and extract human poses, followed by one of two state-of-the-art graph-based classifiers: ST-GCN [8] and MS-G3D [7]. We selected BlazePose for the pose extraction part because it is optimized for real-time fitness applications and comparable to the end-to-end architectures in terms of FLOPs and model size (see Section 3.1.3 for more details).

3.1.1 End-to-end

I3D. As an end-to-end baseline model for video action recognition, we used the 3D-CNN architecture I3D-RGB proposed in [1].

Strided-Inflated EfficientNet (SI-EN). We present SI-EN, which uses EfficientNet-Lite4 [45], a 2D-CNN, as a backbone, with a few modifications to some of the inverted residual blocks. Specifically, we inflate 8 of the blocks in the temporal dimension (blocks 3, 7, 11, 14, 17, 20, 23 and 25), using a temporal kernel of size 3, effectively turning them into 3D convolutional modules, taking inspiration from [1]. More precisely, only the first point-wise convolution in each inverted residual block is inflated. Two of the inflated convolutions (blocks 7 and 14) are implemented with a temporal stride of 2, enabling a lower-footprint output at 4 fps from the 16 fps input stream.
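To make the inflation step concrete, the following sketch shows one way to turn a pretrained point-wise (1×1) 2D convolution into a 3D convolution with a temporal kernel of 3 and an optional temporal stride. It is a minimal illustration of the general technique under assumed details (e.g. the center-slice initialization), not the exact SI-EN implementation.

```python
# Minimal sketch (assumed details, not the exact SI-EN code): inflate a pretrained
# point-wise Conv2d into a Conv3d with temporal kernel 3 and optional temporal stride.
import torch
import torch.nn as nn

def inflate_pointwise(conv2d: nn.Conv2d, temporal_kernel: int = 3,
                      temporal_stride: int = 1) -> nn.Conv3d:
    # Assumes conv2d has kernel_size=(1, 1), i.e. a point-wise convolution.
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(temporal_kernel, 1, 1),
        stride=(temporal_stride, 1, 1),
        padding=(temporal_kernel // 2, 0, 0),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        conv3d.weight.zero_()
        # Copy the 2D weights into the central temporal slice so that, at stride 1,
        # the inflated layer initially reproduces the frame-wise 2D convolution.
        conv3d.weight[:, :, temporal_kernel // 2] = conv2d.weight
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# Usage: a 1x1 conv over (N, C, H, W) becomes a temporal conv over (N, C, T, H, W).
pw = nn.Conv2d(64, 256, kernel_size=1)
pw3d = inflate_pointwise(pw, temporal_kernel=3, temporal_stride=2)
out = pw3d(torch.randn(1, 64, 16, 56, 56))  # temporal stride 2: 16 frames -> 8
```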
SI-BlazePose. As a method to back-propagate through a pose feature bottleneck during an end-to-end classification task, we propose the following architecture, which we call SI-BlazePose. It is based on the BlazePose model [4], using inflation to extend it in the temporal dimension. We inflate the last 8 point-wise convolutions with a temporal kernel of size 3, adding a temporal stride of 2 to the 2nd and 4th ones. We freeze all layers before the first inflated layer. We use the full image as input and resize it to 256 × 256, preserving the aspect ratio. We did not crop around the person as a first step, in contrast to what is done within MediaPipe (https://google.github.io/mediapipe/). Since QAR-EVD is a classification dataset, we replace BlazePose's body part regression head with a softmax layer.

QAR-EVD exercise           | Closest BigFitness classes                                                                                         | Closest Kinetics classes
spiderman pushups          | pushups - sloppy; burpee - no upright position; burpee - no jump                                                   | push up; crawling baby; headbanging
dead bug                   | bicycle crunches - small torso rotation; bicycle crunches - medium torso rotation; bicycle crunches - head down    | situp; knitting; unboxing
alternating lateral lunges | skaters - single jump (right to left); grabbing an off-screen towel; skaters - slow                                | lunge; side kick; squat
inchworm                   | burpee (no pushup) - stepping feet forward; burpee (no pushup) - stepping feet back; roll down                     | dribbling basketball; deadlifting; push up

Table 3: Comparing dataset similarity: for each QAR-EVD exercise (column 1), we compute a prototypical feature vector and show its 3 closest class centroids in feature space within BigFitness (column 2) and Kinetics (column 3).

3.1.2 Pose-based classifiers

ST-GCN. Spatial-temporal graph convolutional networks (ST-GCN) use graph convolutions across spatial joint connections and temporal connections from frame to frame [8]. Following the original authors' approach, we included their suggested edge importance weighting method with a spatial partitioning strategy. As our results did not benefit from dropout regularization, we disabled it.

MS-G3D. Multi-scale graph convolutional networks (MS-G3D) adjust the node weighting in the graph for improved multi-scale aggregation and introduce skip connections to the graph for better modeling of spatio-temporal dependencies across longer distances [7]. As these models can work on generic graph layouts, we added support for the BlazePose layout by providing the adjacency matrix of the 33 graph nodes.
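Supporting a new skeleton layout in a graph classifier mostly amounts to supplying its adjacency matrix. The sketch below builds a normalized 33-node adjacency from an edge list; the edges shown (shoulders, arms, hips, legs) are only a subset of the full BlazePose topology, and the D^-1/2 (A + I) D^-1/2 normalization is a common convention rather than the exact scheme used by any specific classifier.

```python
# Sketch: build a normalized adjacency matrix for a 33-joint BlazePose skeleton.
# Only a subset of edges is listed for illustration; the full edge list should
# follow the MediaPipe BlazePose topology.
import numpy as np

NUM_JOINTS = 33
EDGES = [
    (11, 12),            # shoulders
    (11, 13), (13, 15),  # left arm
    (12, 14), (14, 16),  # right arm
    (23, 24),            # hips
    (11, 23), (12, 24),  # torso
    (23, 25), (25, 27),  # left leg
    (24, 26), (26, 28),  # right leg
    # ... head, hand and foot keypoints omitted for brevity
]

def blazepose_adjacency(edges=EDGES, num_joints=NUM_JOINTS) -> np.ndarray:
    adj = np.eye(num_joints)                 # self-loops
    for i, j in edges:
        adj[i, j] = adj[j, i] = 1.0
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    return d_inv_sqrt @ adj @ d_inv_sqrt     # symmetric normalization

A = blazepose_adjacency()
print(A.shape)  # (33, 33)
```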
3.1.3 A note on computational efficiency

A pipeline based on pose estimation typically consists of three components: a detection network producing rough person positions, a pose estimation network producing skeletons for each person (BlazePose in our case), and a classifier mapping a sequence of skeletons to an activity label (ST-GCN or MS-G3D in our case). The first two components are image-based, while the action classifier is video-based in the sense that it needs a sequence of skeletons. While the detection network can run fairly infrequently (at least if the person is not changing position much), the framerate at which the pose estimation component needs to run is determined by the temporal granularity required by the action classifier to obtain high accuracy. An end-to-end neural network, on the other hand, provides a variety of flexible ways to reduce the computational footprint, e.g. by using temporally strided convolutions, which reduce the framerate of subsequent layers and outputs. SI-EN specifically exploits this by introducing two 3D convolutions with a temporal stride of 2 early in the architecture. As a result, most of the SI-EN layers only need to run at 4 fps rather than the 16 fps input framerate, greatly reducing the computational footprint of our end-to-end solution. At an input framerate of 16 fps, SI-EN only requires 4.0 GMACs/s, whereas running BlazePose alone (i.e. without counting localization and action classification) already amounts to 6.7 GMACs/s.

Figure 3: QAR-EVD top-1 accuracy of selected existing architectures, pretrained on various datasets. We report results using 5, 10, 20, 50, and 100 training samples per class. For each model, we use the following convention: {architecture}-{pretraining dataset}-{optional: number of finetuned layers}.

3.2 Datasets used for pre-training

In addition to the dataset we are releasing along with this paper, we use a larger internal video dataset, which we refer to as BigFitness, for pre-training in some experiments. This dataset consists of around 300,000 videos of fitness exercises with a fine-grained label taxonomy across 1,536 classes that are disjoint from the data in QAR-EVD. Similar to QAR-EVD, the videos were recorded and curated by crowd workers. In addition to this internal dataset, we also made use of Kinetics [40] and ImageNet [46] for pre-training, as described in the results section.

To elucidate the relationship between the pre-training datasets used in our experiments and QAR-EVD, we visualize the examples that are closest in feature space to each QAR-EVD exercise in Table 3. It shows that BigFitness has multiple labels that are conceptually similar, which is to be expected as it contains fitness actions with a disjoint, but also fine-grained, label taxonomy. More general action recognition datasets like Kinetics have some fitness actions, such as push up and lunge, which resemble the exercises from QAR-EVD. However, because of the coarser label taxonomy, the next nearest neighbors can be very different (such as the labels "headbanging" or "unboxing").

Figure 4: Effect of pre-training ST-GCN and MS-G3D on Kinetics and BigFitness.

3.3 Implementation details

3.3.1 End-to-end

End-to-end models were trained on raw pixels from the QAR-EVD videos. The native resolution was down-scaled to 256 × 256 pixels. To keep the original aspect ratio, frames were padded with black pixels into a square format before downscaling. Videos were subsampled to 16 fps, which showed improved performance over the native 30 fps in preliminary experiments. For training, we took random crops of 63 frames from each video, which corresponds to roughly 4-second-long video clips; 63 was chosen because of memory constraints. For evaluation, all frames of a video were passed to the model. As additional augmentation, we applied random color jittering to the 3 input channels. RGB values were scaled to the range from 0 to 1.
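As an illustration of this preprocessing, the sketch below pads frames to a square, resizes them to 256 × 256, subsamples 30 fps to 16 fps, and takes a random 63-frame crop with values scaled to [0, 1]. It is a simplified stand-in for the actual training pipeline; the exact padding, resampling and color-jitter details are assumptions (jitter is omitted here).

```python
# Illustrative preprocessing sketch (assumed details, not the exact training code).
# Input: uint8 video tensor of shape (T, H, W, C) at 30 fps.
import torch
import torch.nn.functional as F

def preprocess_clip(video: torch.Tensor, out_size: int = 256,
                    in_fps: int = 30, out_fps: int = 16,
                    clip_len: int = 63) -> torch.Tensor:
    t, h, w, c = video.shape
    x = video.permute(0, 3, 1, 2).float() / 255.0  # (T, C, H, W), values in [0, 1]

    # Pad to a square with black pixels, then resize to out_size x out_size.
    side = max(h, w)
    pad_h, pad_w = side - h, side - w
    x = F.pad(x, (pad_w // 2, pad_w - pad_w // 2, pad_h // 2, pad_h - pad_h // 2))
    x = F.interpolate(x, size=(out_size, out_size), mode="bilinear", align_corners=False)

    # Subsample 30 fps -> 16 fps by picking the nearest source frame for each target step.
    idx = torch.arange(0, t * out_fps // in_fps) * in_fps // out_fps
    x = x[idx]

    # Random temporal crop of clip_len frames (roughly 4 s at 16 fps).
    if x.shape[0] > clip_len:
        start = torch.randint(0, x.shape[0] - clip_len + 1, (1,)).item()
        x = x[start:start + clip_len]
    return x  # (T', C, out_size, out_size)

clip = preprocess_clip(torch.randint(0, 256, (180, 480, 640, 3), dtype=torch.uint8))
print(clip.shape)  # torch.Size([63, 3, 256, 256])
```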
3.3.2 Pose-based

To pre-train pose-based models in a way that is comparable to the end-to-end models, we extracted pose features from BigFitness using BlazePose [4] as provided by the MediaPipe library (https://mediapipe.dev/; we use the GHUM Full version of BlazePose in all our experiments). The same method was used to extract pose features to train on QAR-EVD. In our experiments, we used all 33 joints and 3 input channels per joint: x position, y position and confidence score. The resulting pose sequences were created at 16 fps, because preliminary experiments showed better results than using the raw 30 fps (just as in the end-to-end experiments). For training, we took random crops of 90 consecutive poses. For evaluation, we passed in the full pose sequence of each sample. Following [8], we used simulated camera movement on top of the keypoint coordinates as a data augmentation technique during training.

The Kinetics-Skeleton dataset [8], which we use for pre-training some of the models, uses the OpenPose [47] layout, which has fewer keypoints than the BlazePose layout (18 instead of 33). In our experiments, we mapped BlazePose keypoints to the OpenPose format, with the neck position defined as the center between the two shoulder joints.
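A minimal sketch of such a layout conversion is shown below. It maps a (33, 3) BlazePose pose (x, y, confidence) to the 18-keypoint COCO-style OpenPose layout, synthesizing the neck as the midpoint of the shoulders. The index correspondence follows the usual MediaPipe and COCO-18 orderings, but it is an assumption of this sketch rather than the exact conversion used in our experiments.

```python
# Sketch: map a BlazePose skeleton (33 joints, channels x/y/confidence) to the
# 18-keypoint OpenPose/COCO layout used by Kinetics-Skeleton. The index tables
# below follow the usual MediaPipe and COCO-18 orderings (an assumption).
import numpy as np

# OpenPose index -> BlazePose index (neck handled separately).
OPENPOSE_FROM_BLAZEPOSE = {
    0: 0,    # nose
    2: 12,   # right shoulder
    3: 14,   # right elbow
    4: 16,   # right wrist
    5: 11,   # left shoulder
    6: 13,   # left elbow
    7: 15,   # left wrist
    8: 24,   # right hip
    9: 26,   # right knee
    10: 28,  # right ankle
    11: 23,  # left hip
    12: 25,  # left knee
    13: 27,  # left ankle
    14: 5,   # right eye
    15: 2,   # left eye
    16: 8,   # right ear
    17: 7,   # left ear
}

def blazepose_to_openpose(pose: np.ndarray) -> np.ndarray:
    """pose: (33, 3) array of (x, y, confidence); returns an (18, 3) array."""
    out = np.zeros((18, 3), dtype=pose.dtype)
    for op_idx, bp_idx in OPENPOSE_FROM_BLAZEPOSE.items():
        out[op_idx] = pose[bp_idx]
    out[1] = 0.5 * (pose[11] + pose[12])  # neck = midpoint of the shoulders
    return out

converted = blazepose_to_openpose(np.random.rand(33, 3))
print(converted.shape)  # (18, 3)
```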
3.4 Results

The performance on QAR-EVD across architectures is reported in Figure 3. For each model, we tried multiple fine-tuning strategies (e.g. freezing all layers, fine-tuning a subset of the layers, fine-tuning the whole network); Figure 3 only reports the approach that worked best for each model. Results obtained using the other strategies can be found in Table 4. Regarding pose-based baselines, to the best of our knowledge, there are no versions of MS-G3D and ST-GCN pre-trained on the 33 joints returned by BlazePose, and we therefore train the two graph CNNs from scratch in this experiment. We investigate the effect of pre-training MS-G3D and ST-GCN in the next section.

Number of samples per class:      5     10     20     50    100

End-to-end
SI-EN-ImageNet                  8.9   15.5   23.9   33.5   38.1
SI-BlazePose                   25.3   31.1   39.9   47.1   52.9
I3D-Kinetics-1                 12.2   17.1   22.5   25.9   28.4
I3D-Kinetics-4                 18.9   28.6   39.8   51.5   56.1
I3D-Kinetics-all               19.4   28.7   43.3   53.9   60.9
SI-EN-BigFitness-1             38.1   44.4   49.5   56.0   58.9
SI-EN-BigFitness-10            45.2   50.7   56.0   63.5   66.8
SI-EN-BigFitness-all           36.2   43.6   51.5   60.8   63.6

Pose-based pipeline
ST-GCN-Scratch                 26.7   39.4   45.4   53.7   57.4
MS-G3D-Scratch                 25.5   32.3   44.7   57.6   62.1
ST-GCN-Kinetics                30.8   39.1   49.9   58.0   60.7
MS-G3D-Kinetics                38.9   47.2   53.5   62.2   65.6
ST-GCN-BigFitness              38.7   49.1   53.6   59.7   63.0
MS-G3D-BigFitness              41.9   51.6   56.3   62.2   65.5

Table 4: Results across all experiments. We report the test set accuracy in percent on QAR-EVD.

Interesting findings from Figure 3 can be summarized as follows:

Best performance is obtained by an end-to-end network. SI-EN-BigFitness-10 tops all other approaches by a significant margin, including pose-based solutions that use a graph CNN initialized from random weights. The gap with pose-based pipelines is larger when training data is scarce (45.2% vs 26.7% for ST-GCN-Scratch in the 5-shot case) but shrinks as more training samples become available (66.8% vs 62.1% for MS-G3D-Scratch when the full train set is used).

Pre-training on a large video dataset is key. Unsurprisingly, the type of data used to pre-train each baseline plays an important role in downstream performance. Best results are obtained by the model that was pre-trained on BigFitness, which is by far the most granular pre-training dataset considered in this experiment. The exact same SI-EN architecture pre-trained on ImageNet performs poorly. The Kinetics baseline, I3D, is roughly on par with pose-based pipelines. On the other hand, the inflated pose 2D-CNN, SI-BlazePose-9, obtains decent results when few samples are available but is significantly outperformed as more samples become available.

MS-G3D seems more prone to overfitting than ST-GCN. While MS-G3D outperforms ST-GCN when more than 50 training samples are available, ST-GCN gets better results in the 5, 10 and 20-shot cases.

3.5 Closing the gap between pose-based and end-to-end approaches

In this section, we investigate the effect of pre-training the graph CNN component of pose-based pipelines. Pre-training is performed with two datasets: Kinetics and BigFitness. Results can be found in Figure 4.

Figure 4 shows that, even for a pose-based pipeline, pre-training on a large video dataset can boost classification accuracy. While an accurate frame-level pose representation alone obtains decent results, the overall solution greatly benefits from pre-training on videos. This suggests that training data that provides some understanding of the temporal aspects of human body motions is highly beneficial, even for pose-based models. While pre-training on Kinetics produces good downstream performance, pre-training on a more granular dataset such as BigFitness works better overall. When it is pre-trained on BigFitness, the MS-G3D-based pipeline is on par with the end-to-end baseline, and the advantage that ST-GCN has over MS-G3D in the lower data regimes vanishes. Additional metrics (e.g. confusion matrices, f-measures) can be found in the supplementary material.

3.6 Learning to count

To explore a more temporally fine-grained recognition task, we also experiment with end-to-end repetition counting ("how many times has a given exercise been performed?"). This is a common task in many fitness applications, and it is an inherently temporal prediction task. To train the networks on this task, we temporally annotated a subset of videos with frame-by-frame labels describing which phase of the exercise the subject is in at any moment in time. We use the same frame rates as before (16 fps input, 4 fps output) and annotated 100 videos for each of the exercises in the training set. We use the same train/test split as above for evaluation.

We experiment with various temporal annotation schemes that can be turned into counts after training: (1) marking frames as within-repetition vs. end-of-repetition, (2) marking frames as within-repetition vs. end-of-repetition vs. middle-of-repetition, (3) using a different encoding of the 3-way annotations in (2), by marking frames as first-half of a repetition (between end-of-repetition and middle-of-repetition) vs. second-half of a repetition (between middle-of-repetition and end-of-repetition).

We train the models by treating these annotations as a simple temporal classification task. For training, we concatenate the videos within a mini-batch along the temporal axis rather than stacking videos along the batch axis. Since annotation schemes (1) and (2) are highly imbalanced, we weight the classification cost by 0.2 for the over-represented class "within-repetition" during training. For SI-EN, we only train the 10 final layers (as for the best model above). We turn temporal classifications into counts at inference time by incrementing the count when the end of a repetition is detected. For annotation schemes (1) and (2), we only increment the count if an end-of-repetition event is followed by at least one middle-of-repetition event, to avoid overcounting.
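The decoding step can be written as a small post-processing routine over the per-frame predictions. The sketch below is an illustration of that logic for the 3-way scheme (2): it increments the count on an end-of-repetition prediction only once a middle-of-repetition prediction has been seen since the last increment, and evaluates counts with the mean absolute percentage error. Class ids and function names are assumptions, not the released evaluation code.

```python
# Sketch of count decoding for annotation scheme (2); class ids are assumptions:
# 0 = within-repetition, 1 = middle-of-repetition, 2 = end-of-repetition.
from typing import Sequence

WITHIN, MIDDLE, END = 0, 1, 2

def decode_count(frame_preds: Sequence[int]) -> int:
    """Turn a sequence of per-frame class predictions (at 4 fps) into a repetition count."""
    count = 0
    seen_middle = False
    for pred in frame_preds:
        if pred == MIDDLE:
            seen_middle = True
        elif pred == END and seen_middle:
            count += 1           # count an end-of-repetition only after a middle-of-repetition
            seen_middle = False  # require a new middle before the next increment
    return count

def mape(pred_counts: Sequence[int], true_counts: Sequence[int]) -> float:
    """Mean absolute percentage error over a set of videos."""
    return 100.0 * sum(abs(p - t) / t for p, t in zip(pred_counts, true_counts)) / len(true_counts)

preds = [0, 0, 1, 0, 2, 0, 1, 1, 0, 2, 0]  # two repetitions detected
print(decode_count(preds))                   # 2
print(mape([9, 11], [10, 10]))               # 10.0
```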
Temporal annotation scheme                          pushup   dead bug   lateral lunges   inchworm

SI-EN
(1) within-repetition vs end-of-repetition            25.9       39.3             33.3      109.0
(2) within vs middle-of vs end-of-repetition          17.1       22.3             13.4       49.2
(3) first half vs second half                          4.6        7.2              2.2       21.5

MS-G3D
(1) within-repetition vs end-of-repetition            22.2       40.1             38.4      102.0
(2) within vs middle-of vs end-of-repetition          10.6       27.5              9.0       51.0
(3) first half vs second half                          4.9        8.5              4.2       17.2

ST-GCN
(1) within-repetition vs end-of-repetition            37.3       81.8             66.3      144.0
(2) within vs middle-of vs end-of-repetition          11.9       12.8              7.0       46.5
(3) first half vs second half                          6.0       13.7              3.6       22.0

Table 5: Repetition counting results across all experiments (mean absolute percentage error).

Table 5 shows the performance of the models in terms of mean absolute percentage error (MAPE) [48, 49]. It shows that accurate counting performance can be obtained from this relatively small number of annotated videos. While performance is comparable across models, interestingly, even in this setup the end-to-end approach SI-EN performs roughly on par with or better than the other approaches in most cases. In fact, it shows the best performance on all exercises except for "inchworm", which, unlike the other exercises, has a much smaller number of repetitions per video and yields overall lower accuracy. Note that only SI-EN can make predictions online. Overall, while a deeper analysis and comparison with other counting approaches is beyond the scope of this paper, we find that it is possible to obtain very accurate repetition counts entirely end-to-end. We also find that accuracy depends strongly on how temporal annotations are represented during training.

4 Conclusion

In conclusion, our experiments show that end-to-end training on large-scale labeled video datasets, without any form of frame-by-frame intermediate representation, can compete with pose-based approaches, even in the context of fitness activity recognition where one could assume that an accurate pose representation is all you need. More importantly, regardless of the selected approach, pre-training on a large and granular video dataset is a key ingredient for achieving good downstream performance. In fact, our experiments show that good performance in action recognition tasks is mostly a function of dataset size and label granularity and less of the choice of model.

Limitations and broader impact. The introduced dataset supports research on end-to-end reasoning about human activities using an RGB camera. It can be used to study and benchmark model architectures and to rethink workflows in the development of end-to-end neural networks. However, the dataset in its current size and form may contain biases. Training on this dataset alone may, for example, lead to models whose behavior depends on a subject's age, gender, ethnic background, etc. As such, the dataset as defined is suitable only for performing the research needs described above. In addition, model behavior will be a function of camera angle, lighting, and possibly other incidental aspects of the scene, the camera, and the subject interacting with the model.
As for positive impact, research towards enabling quantitative assessment of health and fitness-related activities with just a camera can democratize access to such activities, greatly improve individuals' understanding of them, and help unlock their benefits.

References

[1] João Carreira and Andrew Zisserman. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[2] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, pages 5842–5850, 2017.
[3] Joanna Materzynska, Guillaume Berger, Ingo Bax, and Roland Memisevic. The Jester dataset: A large-scale video dataset of human gestures. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019.
[4] Valentin Bazarevsky, Ivan Grishchenko, Karthik Raveendran, Tyler Zhu, Fan Zhang, and Matthias Grundmann. BlazePose: On-device Real-time Body Pose Tracking. arXiv, 2020.
[5] Amir Shahroudy, Jun Liu, Tian Tsong Ng, and Gang Wang. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. doi: 10.1109/CVPR.2016.115.
[6] Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling Yu Duan, and Alex C. Kot. NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10), 2020. doi: 10.1109/TPAMI.2019.2916873.
[7] Ziyu Liu, Hongwen Zhang, Zhenghao Chen, Zhiyong Wang, and Wanli Ouyang. Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 140–149, 2020. URL http://arxiv.org/abs/2003.14111.
[8] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, 2018.
[9] Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, and Trevor Darrell. Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 2017. doi: 10.1109/TPAMI.2016.2599174.
[10] Joe Yue Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015. doi: 10.1109/CVPR.2015.7299101.
[11] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 2013. doi: 10.1109/TPAMI.2012.59.
[12] Pavlo Molchanov, Xiaodong Yang, Shalini Gupta, Kihwan Kim, Stephen Tyree, and Jan Kautz. Online Detection and Classification of Dynamic Hand Gestures with Recurrent 3D Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. doi: 10.1109/CVPR.2016.456.
[13] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional Two-Stream Network Fusion for Video Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. doi: 10.1109/CVPR.2016.213.
[14] Karen Simonyan and Andrew Zisserman. Two-Stream Convolutional Networks for Action Recognition in Videos. In Proceedings of the 27th International Conference on Neural Information Processing Systems, pages 568–576, 2014. URL http://papers.nips.cc/paper/5353-two-stream-convolutional-networks-for-action-recognition-in-videos.pdf.
[15] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision, 2019. doi: 10.1109/ICCV.2019.00630.
[16] Quanfu Fan, Chun Fu Chen, Hilde Kuehne, Marco Pistoia, and David Cox. More is less: Learning efficient video representations by big-little network and depthwise temporal aggregation. In Advances in Neural Information Processing Systems, volume 32, 2019.
[17] Ji Lin, Chuang Gan, and Song Han. TSM: Temporal shift module for efficient video understanding. In Proceedings of the IEEE International Conference on Computer Vision, 2019. doi: 10.1109/ICCV.2019.00718.
[18] Xinyu Li, Chunhui Liu, Bing Shuai, Yi Zhu, Hao Chen, and Joseph Tighe. NUTA: Non-uniform temporal aggregation for action recognition, 2020.
[19] Chao Li, Qiaoyong Zhong, Di Xie, and Shiliang Pu. Skeleton-based action recognition with convolutional neural networks. In 2017 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), 2017. doi: 10.1109/ICMEW.2017.8026285.
[20] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2D pose estimation using part affinity fields, 2016. URL https://arxiv.org/abs/1611.08050.
[21] Bowen Cheng, Bin Xiao, Jingdong Wang, Honghui Shi, Thomas S. Huang, and Lei Zhang. Bottom-up higher-resolution networks for multi-person pose estimation. CoRR, abs/1908.10357, 2019. URL http://arxiv.org/abs/1908.10357.
[22] Alejandro Newell and Jia Deng. Associative embedding: End-to-end learning for joint detection and grouping. CoRR, abs/1611.05424, 2016. URL http://arxiv.org/abs/1611.05424.
[23] Zigang Geng, Ke Sun, Bin Xiao, Zhaoxiang Zhang, and Jingdong Wang. Bottom-up human pose estimation via disentangled keypoint regression. CoRR, abs/2104.02300, 2021. URL https://arxiv.org/abs/2104.02300.
[24] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. CoRR, abs/1603.06937, 2016. URL http://arxiv.org/abs/1603.06937.
[25] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. CoRR, abs/1902.09212, 2019. URL http://arxiv.org/abs/1902.09212.
[26] Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. CoRR, abs/1804.06208, 2018. URL http://arxiv.org/abs/1804.06208.
[27] Ferda Ofli, Rizwan Chaudhry, Gregorij Kurillo, René Vidal, and Ruzena Bajcsy. Sequence of the most informative joints (SMIJ): A new representation for human skeletal action recognition. Journal of Visual Communication and Image Representation, 25(1):24–38, 2014. doi: 10.1016/j.jvcir.2013.04.007.
[28] Jiang Wang, Zicheng Liu, Ying Wu, and Junsong Yuan. Mining actionlet ensemble for action recognition with depth cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012. doi: 10.1109/CVPR.2012.6247813.
[29] Raviteja Vemulapalli, Felipe Arrate, and Rama Chellappa. Human action recognition by representing 3D skeletons as points in a Lie group. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014. doi: 10.1109/CVPR.2014.82.
[30] Qiuhong Ke, Mohammed Bennamoun, Senjian An, Ferdous Sohel, and Farid Boussaid. A new representation of skeleton sequences for 3D action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. doi: 10.1109/CVPR.2017.486.
[31] Tae Soo Kim and Austin Reiter. Interpretable 3D Human Action Analysis with Temporal Convolutional Networks. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017. doi: 10.1109/CVPRW.2017.207.
[32] Jun Liu, Amir Shahroudy, Dong Xu, and Gang Wang. Spatio-temporal LSTM with trust gates for 3D human action recognition. In Lecture Notes in Computer Science, volume 9907, 2016. doi: 10.1007/978-3-319-46487-9_50.
[33] Wentao Zhu, Cuiling Lan, Junliang Xing, Wenjun Zeng, Yanghao Li, Li Shen, and Xiaohui Xie. Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In 30th AAAI Conference on Artificial Intelligence, AAAI 2016, 2016.
[34] Songyang Zhang, Xiaoming Liu, and Jun Xiao. On Geometric Features for Skeleton-Based Action Recognition using Multilayer LSTM Networks. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 148–157, 2017.
[35] Kalpit Thakkar and P. J. Narayanan. Part-based graph convolutional network for action recognition. In British Machine Vision Conference (BMVC), 2018.
[36] Chenyang Si, Wentao Chen, Wei Wang, Liang Wang, and Tieniu Tan. An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1227–1236, 2019.
[37] Haodong Duan, Yue Zhao, Kai Chen, Dian Shao, Dahua Lin, and Bo Dai. Revisiting skeleton-based action recognition. CoRR, abs/2104.13586, 2021. URL https://arxiv.org/abs/2104.13586.
[38] Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. FineGym: A hierarchical video dataset for fine-grained action understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[39] Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision, pages 510–526. Springer, 2016.
[40] Will Kay, João Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The Kinetics human action video dataset, 2017.
[41] Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ramakrishnan, Sarah Adel Bargal, Tom Yan, Lisa Brown, Quanfu Fan, Dan Gutfruend, Carl Vondrick, et al. Moments in Time dataset: one million videos for event understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–8, 2019. doi: 10.1109/TPAMI.2019.2901464.
[42] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[43] Jinwoo Choi, Chen Gao, Joseph C. E. Messou, and Jia-Bin Huang. Why Can't I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition. 2019. URL http://arxiv.org/abs/1912.05534.
[44] Yingwei Li, Yi Li, and Nuno Vasconcelos. RESOUND: Towards action recognition without representation bias. In Proceedings of the European Conference on Computer Vision (ECCV), pages 513–528, 2018. doi: 10.1007/978-3-030-01231-1_32.
[45] Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In International Conference on Machine Learning, pages 6105–6114, 2019.
[46] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3), 2015. doi: 10.1007/s11263-015-0816-y.
[47] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih En Wei, and Yaser Sheikh. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019. doi: 10.1109/TPAMI.2019.2929257.
[48] Tom F. H. Runia, Cees G. M. Snoek, and Arnold W. M. Smeulders. Real-world repetition estimation by div, grad and curl. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[49] Ofir Levy and Lior Wolf. Live repetition counting. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.

Supplementary material

Table 6: Label taxonomy of the QAR-EVD dataset.

Alternating lateral lunges
  Video-level classes: knee over toe; left leg bent; low range of motion; right leg bent; no obvious mistakes; no stepping; not alternating; stepping foot pointing away; too fast; torso bent forward; torso bent sideways; wrong knee bent
  Frame-level classes: end-of-repetition

Dead bug
  Video-level classes: foot touching the floor; moving opposite leg; moving same side; not moving arms; opposite knee too bent or too close to chest; too fast
  Frame-level classes: middle-of-repetition; end-of-repetition

Inchworm
  Video-level classes: arms too narrow; arms too wide; excessively short; feet too narrow; feet too wide; getting into position; getting into position - hands too far; good form; head up; hips too low; not far out enough; stepping too big; too fast
  Frame-level classes: plank pose; end-of-repetition

Spiderman pushups
  Video-level classes: arms too narrow; arms too wide; good form; no pushup; not alternating; not synced (down - leg in - up - leg out); not synced (down - leg - up); not synced (down - up - leg); shallow; too fast; too slow
  Frame-level classes: low pushup position; end-of-repetition

Additional results

Figure 5: Confusion matrix for the SI-EN-BigFitness-10 model on the QAR-EVD test set, averaged over five training runs.

Exercise                   | SI-EN-BigFitness-10 | MS-G3D-BigFitness
alternating lateral lunges |                0.71 |              0.75
dead bug                   |                0.73 |              0.79
inchworm                   |                0.55 |              0.48
spiderman pushups          |                0.62 |              0.60

Table 7: Aggregated f-measures per exercise and model as an indicator of intra-exercise performance, computed by averaging across all class-wise f-measures that belong to the same exercise. Each of the two examined models performs better on two out of the four exercises.
Figure 6: Confusion matrix for the MS-G3D-BigFitness model on the QAR-EVD test set, averaged over five training runs.

Figure 7: Confusion matrices from Figure 5 and Figure 6, aggregated into exercise-wide scores. Results have been obtained by summing up the scores for all classes belonging to the same exercise.

Figure 8: F-measures per class for SI-EN-BigFitness-10 and MS-G3D-BigFitness, obtained over 5 runs on the QAR-EVD test set and sorted according to the SI-EN-BigFitness-10 results.

Figure 9: Absolute differences of the f-measures for the two models from Figure 8, sorted by decreasing performance of SI-EN-BigFitness-10.