
Detect-and-Track: Efficient Pose Estimation in Videos

2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Rohit Girdhar¹*   Georgia Gkioxari²   Lorenzo Torresani²,³   Manohar Paluri²   Du Tran²
¹ The Robotics Institute, Carnegie Mellon University   ² Facebook   ³ Dartmouth College
arXiv:1712.09184v2 [cs.CV] 2 May 2018
https://rohitgirdhar.github.io/DetectAndTrack

Abstract

This paper addresses the problem of estimating and tracking human body keypoints in complex, multi-person video. We propose an extremely lightweight yet highly effective approach that builds upon the latest advancements in human detection [17] and video understanding [5]. Our method operates in two stages: keypoint estimation in frames or short clips, followed by lightweight tracking to generate keypoint predictions linked over the entire video. For frame-level pose estimation we experiment with Mask R-CNN, as well as our own proposed 3D extension of this model, which leverages temporal information over small clips to generate more robust frame predictions. We conduct extensive ablative experiments on the newly released multi-person video pose estimation benchmark, PoseTrack, to validate various design choices of our model. Our approach achieves an accuracy of 55.2% on the validation and 51.8% on the test set using the Multi-Object Tracking Accuracy (MOTA) metric, and achieves state of the art performance on the ICCV 2017 PoseTrack keypoint tracking challenge [1].

[Figure 1 diagram: Stage #1, (3D) Mask R-CNN: detect person tubes and keypoints in a clip; Stage #2, bipartite matching: link the predictions over time.]
Figure 1. We propose a two-stage approach to keypoint estimation and tracking in videos. For the first stage, we propose a novel video pose estimation formulation, 3D Mask R-CNN, that takes a short video clip as input and produces a tubelet per person and keypoints within those. In the second stage, we perform lightweight optimization to link the detections over time.

1. Introduction

In recent years, visual understanding, such as object and scene recognition [17, 40, 44, 55], has witnessed a significant bloom thanks to deep visual representations [18, 31, 47, 50]. Modeling and understanding human behaviour in images has been at the epicenter of a variety of visual tasks due to its importance for numerous downstream practical applications. In particular, person detection and pose estimation from a single image have emerged as prominent and challenging visual recognition problems [36]. While single-image understanding has advanced steadily through the introduction of tasks of increasing complexity, video understanding has made slower progress compared to the image domain. Here, the preeminent task involves labeling whole videos with a single activity type [5, 7, 10, 14, 29, 30, 32, 46, 49, 51, 52]. While still relevant and challenging, this task shifts the focus away from one of the more interesting aspects of video understanding, namely modeling the changes in appearance and semantics of scenes, objects and humans over time [6, 13, 15, 37].

In this work, we focus on the problem of human pose tracking in complex videos, which entails tracking and estimating the pose of each human instance over time. The challenges here are plenty, including pose changes, occlusions and the presence of multiple overlapping instances. The ideal tracker needs to accurately predict the pose of all human instances at each time step by reasoning about the appearance and pose transitions over time. Hence, the effort to materialize a pose tracker should closely follow the state of the art in pose prediction but also enhance it with the tools necessary to successfully integrate time information at an instance-specific level.

Most recent video pose estimation methods use hand-designed graphical models or integer program optimizations on top of frame-based keypoint predictions to compute the final predictions over time [21, 26, 48]. While such approaches have shown good performance, they require hand-coding of optimization constraints and may not be scalable beyond short video clips due to their computational complexity. Most importantly, the tracking optimization is only responsible for linking frame-level predictions, and the system has no mechanism to improve the estimation of keypoints by leveraging temporal information (except [48], though it is limited to the case of single-person video). This implies that if a keypoint is poorly localized in a given frame, e.g., due to partial occlusion or motion blur, the prediction cannot be improved despite correlated, possibly less ambiguous, information being at hand in adjacent frames.

To address this limitation, we propose a simple and effective approach which leverages the current state of the art method in pose prediction [17] and extends it by integrating temporal information from adjacent video frames by means of a novel 3D CNN architecture. It is worth noting that this architecture maintains the simplicity of our two-stage procedure: keypoint estimation is still performed at a frame level by deploying space-time operations on short clips in a sliding-window manner. This allows our 3D model to propagate useful information from the preceding and the subsequent frames in order to make the prediction in each frame more robust, while using a lightweight module for long-term tracking, making our method applicable to arbitrarily long videos. Fig. 1 illustrates our approach.

We train and evaluate our method on the challenging PoseTrack dataset [24], which contains real-world videos of people in various everyday scenes, and is annotated with locations of human joints along with their identity index across frames. First, and in order to demonstrate the efficacy of our method, we build a competitive baseline approach which links frame-level predictions, obtained from Mask R-CNN [17], in time with a simple heuristic. Our baseline approach achieves state of the art performance in the ICCV'17 PoseTrack Challenge [1], proving that it performs competitively on this new dataset. We then propose a 3D extension of Mask R-CNN, which leverages temporal information in short clips to produce more robust predictions in individual frames. For the same base architecture and image resolution, our proposed 3D model outperforms our very strong 2D baseline by 2% on keypoint mAP and 1% on the MOTA metric (details about the metrics in Sec. 4.1). In addition, our top-performing model runs in 2 minutes on a 100-frame video, with the tracking itself running in the order of seconds, showing strong potential for practical usage. As we show in Sec. 4.2, this is nearly two orders of magnitude faster than IP-based formulations [26] using state-of-the-art solvers [16].

* Work done as a part of RG's internship at Facebook.

2. Related Work

Multi-person pose estimation in images: The application of deep convolutional neural networks (CNNs) to keypoint prediction [4, 17, 22, 40] has led to significant improvements over the last few years.
Some of the most recent efforts in multi-person keypoint estimation from still images can be classified into bottom-up versus top-down techniques. Top-down approaches [17, 40] involve first locating instances by placing a tight box around them, followed by estimation of the body joint landmarks within that box. On the other hand, bottom-up methods [4, 22] involve detecting individual keypoints, and in some cases the affinities between those keypoints, and then grouping those predictions into instances. Our proposed approach builds upon these ideas by extending the top-down models to the video domain. We first predict spatio-temporal tubes over human instances in the video, followed by joint keypoint prediction within those tubes.

Multi-person pose estimation in video: Among the most dominant approaches to pose estimation from videos is a two-stage approach, which first deploys a frame-level keypoint estimator, and then connects these keypoints in space and time using optimization techniques. In [21, 26], it is shown that a state of the art pose model followed by an integer programming optimization problem can result in very competitive performance in complex videos. While these approaches can handle both space-time smoothing and identity assignment, they are not applicable to long videos due to the NP-hardness of the IP optimization. Song et al. [48] propose a CRF with space-time edges and jointly optimize for the pose predictions. Although they show an improvement over frame-level predictions, their method does not consider body identities and does not address the challenging task of pose tracking. In addition, their approach is hard to generalize to an unknown number of person instances, a number that might vary even between consecutive frames due to occlusions and disocclusions. Our approach also follows a two-stage pipeline, albeit with a much less computationally expensive tracking stage, and is able to handle any number of instances per frame in a video.

Multi-object tracking in video: There has been significant effort towards multi-object tracking from video [12, 43]. Prior to deep learning, the proposed solutions to tracking consisted of systems implementing a pipeline of several steps, using computationally expensive hand-crafted features and separate optimization objectives [54] for each of the proposed steps. With the advent of deep learning, end-to-end approaches for tracking have emerged. Examples include [39, 45], which use recurrent neural networks (RNNs) on potentially diverse visual cues, such as appearance and motion, in order to track objects. In [11], a tracker is built upon the state of the art object detection system by adding correlation features between pairs of consecutive frames in order to predict frame-level candidate boxes as well as their time deformations. More recent works have attempted to tackle detection and tracking in an end-to-end fashion [20, 27, 28], and some works have further used such architectures for downstream tasks such as action recognition [20]. Our work is inspired by these recent efforts but extends the task of object box tracking to address the finer task of tracking poses in time.

3. Technical Approach

In this section, we describe our method in detail. We propose a two-stage approach that efficiently and accurately tracks human instances and their poses across time. We build a 3D human pose predictor by extending Mask R-CNN [17] with spatiotemporal operations by inflating the 2D convolutions into 3D [5].
Our model takes as input short clips and predicts the poses of all people in the clips by integrating temporal information. We show that our 3D model outperforms its 2D frame-level baseline for the task of pose estimation. To track the instances in time, we perform a lightweight optimization that links the predictions. To address exponential complexities with respect to the number of frames in the video and the number of detections per frame, we employ a simple yet effective heuristic. This yields a model that achieves state of the art accuracy on the challenging PoseTrack benchmark [24] and runs orders of magnitude faster than most recent approaches [21, 26].

3.1. Two-Stage Approach for Pose Tracking

Stage 1: Spatiotemporal pose estimation over clips. The first stage in our two-stage approach for human keypoint tracking is pose estimation using a CNN-based model. Although our approach can build upon any frame-based pose estimation system, for this work we use Mask R-CNN [17] due to its simple formulation and robust performance. Mask R-CNN is a top-down keypoint estimation model that extends the Faster R-CNN object detection framework [44]. It consists of a standard base CNN, typically a ResNet [18], used to extract image features, which are then fed into task-specific small neural networks trained to propose object candidates (RPN [44]) and to classify them or predict their mask/pose through an accurate feature alignment operation called RoIAlign.

We take inspiration from the recent advancements in action recognition achieved by I3D [5], which introduces a video model obtained by converting a state of the art image recognition model [23] through inflation of its 2D convolutional kernels to 3D. Analogously, starting from the vanilla Mask R-CNN model, we transform the 2D convolutions to 3D. Note that the receptive field of these 3D kernels spans the space and time dimensions and integrates spatiotemporal cues in an end-to-end learnable fashion. Now, the input to our model is no longer a single frame, but a clip of length T composed of adjacent frames sourced from a video. We extend the region proposal network (RPN) [44] to predict object candidates that track each hypothesis across the frames of the input clip. These tube proposals are used to extract instance-specific features via a spatiotemporal RoIAlign operation. The features are then fed into the 3D CNN head responsible for pose estimation. This pose-estimation head outputs heatmap activations for all keypoints across all frames conditioned on the tube hypothesis. Thus, the output of our 3D Mask R-CNN is a set of tube hypotheses with keypoint estimates. Fig. 2 illustrates our proposed 3D Mask R-CNN model, which we describe in detail next.

Base network: We extend a standard ResNet [18] architecture to a 3D ResNet architecture by replacing all 2D convolutions with 3D convolutions. We set the temporal extent of our kernels (K_T) to match the spatial width, except for the first convolutional layer, which uses filters of size 3 × 7 × 7. We temporally pad the convolutions as for the spatial dimensions: padding of 1 for K_T = 3 and 0 for K_T = 1. We set temporal strides to 1, as we empirically found that larger stride values lead to lower performance. Inspired by [5, 8], we initialize the 3D ResNet using a pretrained 2D ResNet.
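As an illustration of the two inflation-based initialization schemes compared in Sec. 4.3 (the "mean" scheme of [5] and the "center" alternative described next), the following is a minimal PyTorch-style sketch. It is not the authors' implementation (the released code is in Caffe2); the function name and tensor layout are ours.

```python
import torch

def inflate_conv_weight(w2d: torch.Tensor, k_t: int, style: str = "center") -> torch.Tensor:
    """Inflate a pretrained 2D conv kernel (C_out, C_in, K_H, K_W) into a 3D one
    (C_out, C_in, K_T, K_H, K_W).

    "mean":   replicate the 2D filter K_T times along time and divide by K_T, so the
              response on a temporally constant input matches the 2D network.
    "center": copy the 2D filter into the central temporal slice and zero the rest.
    """
    c_out, c_in, k_h, k_w = w2d.shape
    w3d = torch.zeros(c_out, c_in, k_t, k_h, k_w, dtype=w2d.dtype)
    if style == "mean":
        w3d += w2d.unsqueeze(2) / k_t      # broadcast the 2D kernel over the temporal axis
    elif style == "center":
        w3d[:, :, k_t // 2] = w2d          # only the central temporal slice is non-zero
    else:
        raise ValueError(f"unknown init style: {style}")
    return w3d
```

With the "center" scheme and temporal stride 1, the inflated convolution initially computes the same per-frame responses as the pretrained 2D filter, so the 3D network starts from a state that reproduces the 2D model's per-frame computation.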
Apart from their proposed "mean" initialization, which replicates the 2D filter temporally and divides the coefficients by the number of repetitions, we also experiment with a "center" initialization method, which has earlier been used for action recognition tasks [9]. In this setup, we initialize the central 2D slice of the 3D kernel with the 2D filter weights and set all the other 2D slices (corresponding to temporal displacements) to zero. We empirically show in Sec. 4.3 that the center initialization scheme leads to better performance. The final feature map output of the base 3D network for a T × H × W input is T × H/8 × W/8, as we clip the network after the fourth residual block and perform no temporal striding.

[Figure 2 diagram: Base Network → 3D feature blob → Tube Proposal Network ("A" proposals at every pixel, trained with cls and reg losses) → SpatioTemporal RoIAlign → Classification Head (cls loss, reg loss) and Keypoints Head (kps loss).]
Figure 2. Proposed 3D Mask R-CNN network architecture: Our architecture, as described in Sec. 3.1, has three main parts. The base network is a standard ResNet, extended to 3D. It generates a 3D feature blob, which is then used to generate proposal tubes using the Tube Proposal Network (TPN). The tubes are used to extract region features from the 3D feature blob, using a spatiotemporal RoIAlign operation, and are fed into one head that classifies/regresses for a tight tube and another that predicts keypoint heatmaps.

Tube proposal network: We design a tube proposal network inspired by the region proposal network (RPN) in Faster R-CNN [44]. Given the feature map from the base network, we slide a small 3D-conv network connected to two sibling fully connected layers: tube classification (cls) and regression (reg). The cls and reg labels are defined with respect to tube anchors. We design the tube anchors to be similar to the bounding box anchors used in Faster R-CNN, but here we replicate them in time. We use A (typically 12) different anchors at every sliding position, differing in scale and/or aspect ratio. Thus, we have H/8 × W/8 × A anchors in total. For each of these anchors, cls predicts a binary value indicating whether a foreground tube originating at that spatial position has a high overlap with our proposal tube. Similarly, reg outputs for each anchor a 4T-dimensional vector encoding displacements with respect to the anchor coordinates for each box in the tube. We use the softmax classification loss for training the cls layer, and the smoothed L1 loss for the reg layer. We scale the reg loss by 1/T, in order to keep its values comparable to those of the loss for the 2D case. We define these losses as our tracking loss.

3D Mask R-CNN heads: Given the tube candidates produced by the tube proposal network, the next step classifies and regresses them into a tight tube around a person track. We compute the region features for this tube by designing a 3D region transform operator. In particular, we extend the RoIAlign operation [17] to extract a spatiotemporal feature map from the output of the base network. Since the temporal extent of the feature map and the tube candidates is the same (of dimension T), we split the tube into T 2D boxes, and use RoIAlign to extract a region from each of the T temporal slices in the feature map. The regions are then concatenated in time to produce a T × R × R feature map, where R is the output resolution of the RoIAlign operation, which is kept at 7 for the cls/reg head and 14 for the keypoint head.
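A minimal sketch of this slice-wise pooling is shown below, assuming a single clip in the batch and using torchvision's 2D roi_align as the per-slice operator. The function name, tensor layout, and the 1/8 spatial scale are illustrative assumptions, not the authors' Caffe2 operator.

```python
import torch
from torchvision.ops import roi_align

def spatiotemporal_roi_align(feat: torch.Tensor, tubes: torch.Tensor,
                             out_res: int = 7, spatial_scale: float = 1.0 / 8) -> torch.Tensor:
    """feat:  (C, T, Hf, Wf) feature map from the 3D base network, for one clip.
    tubes: (N, T, 4) per-frame boxes (x1, y1, x2, y2) of N tube candidates, in image coordinates.
    Returns an (N, C, T, out_res, out_res) blob of region features."""
    c, t, hf, wf = feat.shape
    per_frame = []
    for ti in range(t):
        # Treat temporal slice ti as an ordinary 2D feature map and pool the
        # ti-th box of every tube with standard RoIAlign.
        slice_feat = feat[:, ti].unsqueeze(0)                  # (1, C, Hf, Wf)
        boxes = [tubes[:, ti]]                                 # one (N, 4) tensor for batch item 0
        pooled = roi_align(slice_feat, boxes, out_res,
                           spatial_scale=spatial_scale, aligned=True)  # (N, C, R, R)
        per_frame.append(pooled)
    return torch.stack(per_frame, dim=2)                       # concatenate the T slices in time
```

The resulting N × C × T × R × R blob is what the classification and keypoint heads described next operate on.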
The classification head consists of a 3D ResNet block, similar to the design of the 3D ResNet blocks from the base network, and the keypoint head consists of 8 3D conv layers, followed by 2 deconvolution layers to generate the keypoint heatmap output for each input time frame. The classification head is trained with a softmax loss for the cls output and a smoothed L1 loss for the reg output, while the keypoint head is trained with a spatial softmax loss, similar to [17].

Stage 2: Linking keypoint predictions into tracks. Given these keypoint predictions grouped in space by person identity (i.e., pose estimation), we need to link them in time to obtain keypoint tracks. Tracking can be seen as a data association problem over these detections. Previous approaches, such as [41], have formulated this task as a bipartite matching problem, which can be solved using the Hungarian algorithm [33] or greedy approaches. More recent work has incorporated deep recurrent neural networks (RNNs), such as an LSTM [19], to model the temporal evolution of features along the tracks [39, 45]. We use a similar strategy, and represent these detections in a graph, where each detected bounding box (representing a person) in a frame becomes a node. We define edges to connect each box in a frame to every box in the next frame. The cost of each edge is defined as the negative likelihood of the two boxes linked by that edge belonging to the same person. We experimented with both hand-crafted and learned likelihood metrics, which we describe in the next paragraph. Given these likelihood values, we compute tracks by simplifying the problem to bipartite matching between each pair of adjacent frames. We initialize tracks on the first frame and propagate the labels forward using the matches, one frame at a time. Any boxes that do not get matched to an existing track instantiate a new track. As we show in Sec. 4.2, this simple approach is very effective at producing good tracks, is highly scalable, is able to deal with a varying number of person hypotheses, and can run on videos of arbitrary length.

Likelihood metrics: We experiment with a variety of hand-crafted and learned likelihood metrics for linking the tracks. In terms of hand-crafted features, we specifically experiment with: 1) visual similarity, defined as the cosine distance between CNN features extracted from the image patch represented by the detection; 2) location similarity, defined as the box intersection over union (IoU) of the two detection boxes; and 3) pose similarity, defined as the PCKh [53] distance between the poses in the two frames. We also experiment with a learned distance metric based on an LSTM model that incorporates track history in predicting whether a new detection is part of the track or not. At test time, the predicted confidence values are used in the matching algorithm, and the matched detection is used to update the LSTM hidden state. Similar ideas have also shown good performance for traditional tracking tasks [45].

In Sec. 4 we present an extensive ablative analysis of the various design choices in our two-stage architecture described above. While being extremely lightweight and simple to implement, our final model obtains state of the art performance on the benchmark, outperforming all submissions in the ICCV'17 PoseTrack challenge [1].

4. Experiments and Results

We introduce the PoseTrack challenge benchmark and experimentally evaluate the various design choices of our model.
We first build a strong baseline with our two-stage keypoint tracking approach that obtains state of the art performance on this challenging dataset. Then, we show how our 3D Mask R-CNN formulation can further improve upon that model by incorporating temporal context.

Threshold            | mAP (Head Shou Elbo Wri Hip Knee Ankl Total) | MOTA (Head Shou Elb Wri Hip Knee Ankl Total)  | MOTP | Prec | Rec
0.0, random tracks   | 72.8 75.6 65.3 54.3 63.5 60.9 51.8 64.1      | -11.6 -6.6 -8.5 -12.9 -11.1 -10.2 -9.7 -10.2  | 55.8 | 83.3 | 70.8
0.0                  | 72.8 75.6 65.3 54.3 63.5 60.9 51.8 64.1      | 60.3 65.3 55.8 43.5 52.5 50.7 43.9 53.6       | 55.7 | 83.3 | 70.8
0.5                  | 72.8 75.6 65.3 54.3 63.5 60.9 51.8 64.1      | 61.0 66.0 56.3 44.1 52.9 51.1 44.3 54.2       | 55.7 | 83.3 | 70.8
0.95                 | 67.5 70.2 62.0 51.7 60.7 58.7 49.8 60.6      | 61.7 65.5 57.3 45.7 54.3 53.1 45.7 55.2       | 61.5 | 88.1 | 66.5
Table 1. Effect of the detection cut-off threshold. We threshold the detections computed by Mask R-CNN before matching them to compute tracks. While keypoint mAP goes down, the tracking MOTA performance increases as there are fewer spurious detections to confuse the tracker. The first row also shows the random baseline, i.e., the performance of the model that randomly assigns a track ID between 0 and 1000 (the maximum allowed) to each detection.

4.1. Dataset and Evaluation

PoseTrack [24, 25] is a recently released large-scale challenge dataset for human body keypoint estimation and tracking in diverse, in-the-wild videos. It consists of a total of 514 video sequences with 66,374 frames, split into 300, 50 and 208 videos for training, validation and testing, respectively. The training videos come with the middle 30 frames densely labeled with human body keypoints. The validation and test videos are labeled at every fourth frame, apart from the middle 30 frames. This helps evaluate the long-term tracking performance of methods without requiring expensive annotations throughout the entire video. In total, the dataset contains 23,000 labeled frames and 153,615 poses. The test set annotations are held out, and evaluation is performed by submitting the predictions to an evaluation server. The annotations consist of human head bounding boxes and 15 body joint keypoint locations per labeled person. Since all our proposed approaches are top-down and depend on detecting the extent of the person before detecting keypoints, we compute a bounding box by taking the min and max extents of the labeled keypoints, and dilating that box by 20%. Also, to make the dataset compatible with COCO [35, 36], we permute the keypoint labels to match the closest equivalent labels in COCO. This allows us to pretrain our models on COCO, augmenting the PoseTrack dataset significantly and giving a large improvement in performance.

The dataset is designed to evaluate methods on three different tasks: 1) single-frame pose estimation; 2) pose estimation in video; 3) pose tracking in the wild. Tasks 1) and 2) are evaluated at a frame level, using the mean average precision (mAP) metric [42]. Task 3) is evaluated using a multi-object tracking metric (MOT) [3]. Both evaluations require first computing the distance of each prediction from each ground truth labeled pose. This is done using the PCKh metric [2], which computes the probability of correct keypoints normalized by the head size. The mAP is computed as in [42], and the MOT metrics are as described in [38]. Their MOT evaluation penalizes false positives equally regardless of their confidence.
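For reference, the two quantities can be written as follows; these are the standard definitions from [2, 3, 38], with notation that is ours rather than the benchmark's:

```latex
% PCKh: a predicted keypoint \hat{p} is counted as matching a ground-truth keypoint p
% if it falls within a fraction \alpha of the annotated head-segment length h
% (PCKh0.5 uses \alpha = 0.5):
\|\hat{p} - p\|_2 \le \alpha \, h

% MOTA, accumulated over all evaluated frames t (and computed per keypoint type),
% with FN = missed keypoints, FP = false positives, IDSW = identity switches,
% GT = ground-truth keypoints:
\mathrm{MOTA} = 1 - \frac{\sum_t \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t \right)}{\sum_t \mathrm{GT}_t}
```

Since every false positive enters the numerator with the same weight regardless of its score, low-confidence predictions can only hurt MOTA, which motivates the confidence thresholds discussed next.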
For this reason, we drop keypoint predictions with confidence below a threshold (1.95, chosen by grid search on the validation set). We use the PoseTrack evaluation toolkit for computing all results presented in this paper, and report final test numbers as obtained from the evaluation server.

4.2. Baseline

In an effort to build a very competitive baseline, we first evaluate the various design elements of our two-stage tracking pipeline with a vanilla Mask R-CNN base model. This model disregards time-sensitive cues when making pose predictions. Throughout this section, our models are initialized from ImageNet and are pretrained on the COCO keypoint detection task. We then finetune the Mask R-CNN model on PoseTrack, keeping most hyper-parameters fixed to the defaults used for COCO [36]. At test time, we run the model on each frame and store the bounding box and keypoint predictions, which are linked over time in the tracking stage. This model is competitive, as it achieves state of the art results on the PoseTrack dataset. In Sec. 4.3, we show that our approach can further improve the performance by incorporating temporal context from each video clip via a 3D Mask R-CNN model.

Thresholding initial detections: Before linking the detections in time, we drop the low-confidence and potentially incorrect detections. This helps prevent the tracks from drifting and reduces false positives. Table 1 shows the effect of thresholding the detections. As expected, the MOTA tracking metric [3] improves with higher thresholds, indicating better and cleaner tracks. The keypoint mAP performance, however, decreases by missing out on certain low-confidence detections, which tend to improve the mAP metric. Since we primarily focus on the tracking task, we threshold our detections at 0.95 for our final experiments.

Deeper base networks: As in most vision problems, we observed an improvement in frame-level pose estimation by using a deeper base model. The improvements also directly transferred to the tracking performance. Replacing ResNet-50 in Mask R-CNN with ResNet-101 gave us about 2% improvement in MOTA. We also observed a gain in performance from using feature pyramid networks (FPN) [34] in the base network. Ultimately, we obtained the best performance by using a ResNet-101 model with FPN as the body, a 2-layer MLP for the classification head, and a stack of eight conv and deconv layers as the keypoint head.

Method    | MOTA (Head Shou Elb Wri Hip Knee Ankl Total) | MOTP | Prec | Rec
Hungarian | 61.7 65.5 57.3 45.7 54.3 53.1 45.7 55.2      | 61.5 | 88.1 | 66.5
Greedy    | 61.7 65.4 57.3 45.6 54.2 53.0 45.6 55.1      | 61.5 | 88.1 | 66.5
Table 2. Comparison between the Hungarian and greedy algorithms for matching. Effect of the matching algorithm on tracking performance, computed over the bounding-box overlap cost criterion. The Hungarian algorithm obtains slightly higher performance than simple greedy matching.

Matching algorithm: We experimented with two bipartite matching algorithms: the Hungarian algorithm [33] and a greedy algorithm. While the Hungarian algorithm computes an optimal matching given an edge cost matrix, the greedy algorithm takes inspiration from the evaluation algorithms for object detection and tracking: we start from the highest-confidence match, select that edge and remove the two connected nodes from consideration. This process of connecting each predicted box in the current frame with the previous frame is applied repeatedly from the first to the last frame of the video.
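The following is a minimal sketch of this linking stage under the bounding-box-overlap cost described below, using scipy's Hungarian solver; the greedy variant simply keeps picking the highest-remaining-similarity pair instead. The function names, data layout, and the IoU > 0 gating are our illustrative assumptions, not the exact implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def box_iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda bb: (bb[2] - bb[0]) * (bb[3] - bb[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def link_tracks(frames, det_thresh=0.95):
    """frames: list over time; each entry is a list of dicts {"box": [x1, y1, x2, y2], "score": s}.
    Assigns a "track_id" to every kept detection by Hungarian matching on a (1 - IoU) cost."""
    next_id = 0
    prev = []
    for dets in frames:
        dets = [d for d in dets if d["score"] >= det_thresh]        # drop low-confidence detections
        if prev and dets:
            cost = np.array([[1.0 - box_iou(p["box"], d["box"]) for d in dets] for p in prev])
            rows, cols = linear_sum_assignment(cost)                # optimal bipartite matching
            for r, c in zip(rows, cols):
                if cost[r, c] < 1.0:                                # require some spatial overlap
                    dets[c]["track_id"] = prev[r]["track_id"]
        for d in dets:
            if "track_id" not in d:                                 # unmatched boxes start new tracks
                d["track_id"] = next_id
                next_id += 1
        prev = dets
    return frames
```

Because only adjacent frames are matched, the cost grows linearly with video length, which is what keeps the tracking stage in the order of seconds even for long videos.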
Table 2 compares the two algorithms, using bounding box overlap as the cost metric (details later in this section). We observe that the Hungarian method performs slightly better, thus we use it in our final model.

Tracking cost criterion: We experimented with three hand-defined cost criteria as well as the learned LSTM metric to compute the likelihoods for matching. First, we use bounding box intersection over union (IoU) as the similarity metric. This metric expects the person to move and deform little from one frame to the next, which implies that matching boxes in adjacent frames should mostly overlap. Second, we used pose PCKh [53] as the similarity metric, as the pose of the same person is expected to change little between consecutive frames. Third, we used the cosine similarity between CNN features as a similarity metric. In particular, we use the res3 layer of a ResNet-18 pretrained on ImageNet, extracted from the image cropped using the person bounding box. Finally, as a learned alternative, we use the LSTM model described in Sec. 3.1 to learn to match detections to the tracks. Table 3 shows that the performance is relatively stable across different cost criteria. We also experimented with different layers of the CNN, as well as combinations of these cost criteria, all of which performed similarly. Given the strong performance of bounding box overlap, we use the box coordinates (x_min, y_min, x_max, y_max) as the input feature for a detection. Despite extensive experimentation with different LSTM architectures (details in the supplementary material), we found that the learned metric did not perform as well as the simpler hand-crafted functions, presumably due to the small size of the training set. Hence, for simplicity and given its robust performance, we use box overlap for our final model.

Method        | MOTA (Head Shou Elb Wri Hip Knee Ankl Total) | MOTP | Prec | Rec
Bbox IoU      | 61.7 65.5 57.3 45.7 54.3 53.1 45.7 55.2      | 61.5 | 88.1 | 66.5
Pose PCK      | 60.7 64.5 56.5 44.8 53.3 52.0 44.6 54.2      | 61.5 | 88.1 | 66.5
CNN cos-dist  | 61.9 65.7 57.5 45.8 54.4 53.3 45.8 55.4      | 61.5 | 88.1 | 66.5
All combined  | 61.9 65.7 57.4 45.7 54.4 53.2 45.7 55.3      | 61.5 | 88.1 | 66.5
LSTM          | 54.2 58.0 50.4 39.4 47.4 46.6 39.8 48.4      | 61.4 | 88.1 | 66.5
Table 3. Comparison between different similarity cost criteria. We compare various hand-crafted and learned cost criteria for the matching stage used to generate tracks. Interestingly, simple hand-crafted approaches perform very well for the task. We choose the simple bounding box overlap due to its low computational cost and strong performance.

Setting               | MOTA
Ours                  | 55.2
Perfect association   | 57.7
Perfect keypoints     | 78.4
Both                  | 82.9
Table 4. Upper bounds: We compare our performance with our potential performance given the following perfect information. a) Perfect association: we modify the evaluation code to copy over the track IDs from ground truth (GT) to our predictions (PD), after assignment is done for evaluation. This shows what our model would achieve if we could track perfectly (i.e., incurring 0 ID switches). b) Perfect keypoints: we replace our PD keypoints with GT keypoints, where GT and PD are matched using box overlap. This shows what our model would achieve if we predicted keypoints perfectly. c) Finally, we combine both, and show the performance with perfect keypoints and tracking, given our detections.

Upper bounds: One concern about the linking stage is that it is relatively simple, and does not handle occlusions or missed detections. However, we find that even without explicit occlusion handling, our model is not significantly affected by it. To show this, we compute the upper bound performance given our detections and given perfect data association. Perfect data association indicates that tracks are preserved in time even when detections are missed at the frame level. As explained in Table 4, we obtain a small 2.5% improvement in MOTA (55.2 → 57.7) compared to our box-overlap based association, indicating that our simple heuristic is already close to optimal in terms of combinatorial matching. As shown in Table 4, the bigger challenge is the quality of the pose estimates: a very substantial boost is observed when perfect pose predictions are assumed, given our detections (55.2 → 78.4). This shows that the biggest challenge in PoseTrack is building better pose predictors. Note that our approach can be modified to handle jumps or holes in tracks by matching over the previous K frames as opposed to only the last frame, similar to [41]. This would allow for propagation of tracks even if a detection is missed in up to K − 1 frames, at a cost linear in K.

Method         | Dataset          | MOTA (Head Shou Elb Wri Hip Knee Ankl Total) | MOTP | Prec | Rec
Final model    | (Mini) Test v1.0 | 55.9 59.0 51.9 43.9 47.2 46.3 40.1 49.6      | 34.1 | 81.9 | 67.4
PoseTrack [26] | Test (subset)    | -    -    -    -    -    -    -    46.1      | 64.6 | 74.8 | 70.5
Table 5. Final performance on the test set. We compare our method with the previously reported method on a subset of this dataset [26]. Note that [26] reports performance at PCKh0.34; the comparable PCKh0.5 performance was provided via personal communication. Our performance was obtained by submitting our predictions to the evaluation server. Our model used a ResNet-101 base trained on the train+val sets, and tracking was performed with a 0.95 initial detection threshold, Hungarian matching and the bbox overlap cost criterion. This model also outperformed all competing approaches in the ICCV'17 PoseTrack challenge [1].

Figure 3. Sample results. Visualization of predictions from our two-stage model on the PoseTrack validation set. We show five frames per video, with each frame labeled with the detections and keypoints. The detections are color coded by predicted track ID. Note that our model is able to successfully track people (and hence their keypoints) in highly cluttered environments. One failure case of our model, as illustrated by the last video clip, is loss of track due to occlusion. As the skateboarder goes behind the pole, the model loses the track and assigns a new track ID after it recovers the person.

Comparison with state of the art: We now compare our baseline tracker to previously published work on this dataset. Since this data was released only recently, there are no published baselines on the complete dataset. However, previous work [26] from the authors of the challenge reports results on a subset of this data. We compare our performance on the test set (obtained from the evaluation server) to their performance in Table 5. Note that their reported performance in [26] was at PCKh0.34, and the PCKh0.5 performance was provided via personal communication. We note that while the numbers are not exactly comparable due to differences in the test sets used, the comparison helps put our approach in perspective relative to the state of the art IP-based approaches. We also submitted our final model to the ICCV'17 challenge [1].
Our final MOTA performance on the full test set was 51.8, which outperformed all competing approaches submitted to the challenge. Fig. 3 shows some sample results of our approach.

Run-time comparison: Finally, we compare our method in terms of run-time, and show that it is nearly two orders of magnitude faster than previous work. The IP-based method [26], using the provided code, takes 20 hours for a 256-frame video, in 3 stages: a) multiscale pose heatmaps: 15.4 min; b) dense matching: 16 hours; and c) IP optimization: 4 hours. Our method takes 5.2 minutes for the same video, in 2 stages: a) box/keypoint extraction: 5.1 min; and b) Hungarian matching: 2 s; leading to a 237× speedup. More importantly, the run time for [26] grows non-linearly, making it impractical for longer videos. Our run time, on the other hand, grows linearly with the number of frames, making it much more scalable to long videos.

Model | Init  | Style  | mAP (Head Shou Elbo Wri Hip Knee Ankl Total) | MOTA (Head Shou Elb Wri Hip Knee Ankl Total) | MOTP | Prec | Rec
2D    | ImNet | -      | 23.4 20.0 12.3 7.8 16.9 11.2 8.0 14.8        | 17.5 12.9 -2.4 -11.6 8.2 -1.6 -7.6 3.2       | 0.0  | 57.8 | 22.6
3D    | ImNet | center | 27.0 22.0 13.3 7.8 19.2 12.8 9.4 16.7        | 21.5 13.3 -2.1 -11.7 8.4 -2.0 -6.3 4.3       | 10.3 | 56.9 | 24.6
3D    | ImNet | mean   | 26.7 20.1 11.6 6.9 19.2 12.4 9.1 15.9        | 21.1 12.1 -4.0 -14.7 8.1 -2.4 -7.1 3.2       | 10.0 | 55.6 | 24.6
2D    | COCO  | -      | 28.7 25.4 17.5 10.8 24.4 17.1 11.3 19.9      | 24.6 20.8 11.9 4.7 17.9 11.1 2.6 14.1        | 5.4  | 73.6 | 24.6
3D    | COCO  | center | 32.5 30.4 19.9 12.0 26.6 18.7 13.5 22.6      | 27.7 24.5 12.1 4.5 18.7 11.2 3.3 15.4        | 31.3 | 70.7 | 28.0
3D    | COCO  | mean   | 29.3 26.4 18.2 10.4 24.9 16.8 12.1 20.4      | 25.0 21.5 10.8 1.1 18.0 9.6 2.3 13.4         | 15.0 | 69.7 | 25.7
Table 6. 3D Mask R-CNN performance. We compare our proposed 3D Mask R-CNN model with the baseline 2D model that achieves state of the art performance on the PoseTrack challenge. Due to GPU memory limitations, we use a ResNet-18 base architecture for both models with frames resized to 256 pixels (this leads to a drop in performance compared to ResNet-50 over 800-pixel images). Our 3D model outperforms the 2D frame-level model for the tracking task on both the MOTA and mAP metrics. We observe slightly higher performance with our proposed "center" initialization (as opposed to the "mean" initialization proposed in [5]).

4.3. Evaluating 3D Mask R-CNN

So far we have shown results with our baseline model, running frame by frame (stage 1) and constructing the tracks on top of those predictions (stage 2). Now we experiment with our proposed 3D Mask R-CNN model, which naturally encodes temporal context by taking a short clip as input and producing spatiotemporal tubes of humans with keypoint locations (described in Sec. 3.1). At test time, we run this model densely in a sliding-window fashion on the video, and perform tracking on the center-frame outputs. One practical limitation of 3D CNN models is GPU memory usage. Due to the limits of current hardware, we choose to experiment with a lightweight setup of our proposed baseline model. We use a ResNet-18 base architecture with an image resolution of 256 pixels. For simplicity, we experiment with clips of T = 3 frames without temporal striding, although our model can work with arbitrary-length clips. Our model predicts tubes of T frames, with keypoints corresponding to each box in the tube. We experiment with inflating the weights from both ImageNet and COCO, using either "mean" or "center" initialization as described in Sec. 3.1. Table 6 shows a comparison between our 3D Mask R-CNN and the 2D baseline.
We re-train the COCO model on ResNet-18 (without the 256-pixel resizing) to be able to initialize our 3D models. We obtain an mAP of 62.7% on COCO minival, which is comparable to the reported performance of ResNet-50 (64.2% mAP, Table 6 of [17]). While the initial performance of the 2D model drops due to the smaller input resolution and shallower model, we see clear gains from using our 3D model on the same resolution data with the same network depth. This suggests potentially similar gains over the deeper model as well, once GPU/systems limitations are resolved to allow us to efficiently train deeper 3D Mask R-CNN models.

5. Conclusion and Future Work

We have presented a simple, yet efficient approach to human keypoint tracking in videos. Our approach combines the state of the art in frame-level pose estimation with a fast and effective person-level tracking module to connect keypoints over time. Through extensive ablative experiments, we explore different design choices for our model, and present strong results on the PoseTrack challenge benchmark. This shows that a simple Hungarian matching algorithm on top of good keypoint predictions is sufficient for strong keypoint tracking performance, and should serve as a strong baseline for future research on this problem and dataset. For frame-level pose estimation we experiment with both Mask R-CNN and our own proposed 3D extension of this model, which leverages temporal information from small clips to generate more robust predictions. Given the same base architecture and input resolution, we found our 3D Mask R-CNN to yield superior results to the 2D baseline. However, our 2D baseline requires less GPU memory and as a result can be applied to high image resolutions (up to 800 pixels) with high-capacity models (ResNet-101), which elevates the performance of this simple 2D baseline to state of the art results on the challenging PoseTrack benchmark. We believe that as GPU capacity increases and systems become capable of splitting and training models across multiple GPUs, there is strong potential for 3D Mask R-CNN based approaches, especially when applied to high-resolution input and high-capacity base models. We plan to explore those directions as future work.

Acknowledgements: The authors would like to thank Deva Ramanan and Ishan Misra for many helpful discussions. This research is based upon work supported in part by NSF Grant 1618903, the Intel Science and Technology Center for Visual Cloud Systems (ISTC-VCS), Google, and the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. D17PC00345. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

References

[1] PoseTrack challenge: ICCV 2017. https://posetrack.net/iccv-challenge/.
[2] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
[3] K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP Journal on Image and Video Processing, 2008.
[4] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.
[5] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
[6] Y.-W. Chao, Z. Wang, Y. He, J. Wang, and J. Deng. HICO: A benchmark for recognizing human-object interactions in images. In ICCV, 2015.
[7] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
[8] C. Feichtenhofer, A. Pinz, and R. P. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, 2016.
[9] C. Feichtenhofer, A. Pinz, and R. P. Wildes. Spatiotemporal multiplier networks for video action recognition. In CVPR, 2017.
[10] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
[11] C. Feichtenhofer, A. Pinz, and A. Zisserman. Detect to track and track to detect. In ICCV, 2017.
[12] T. E. Fortmann, Y. Bar-Shalom, and M. Scheffe. Multi-target tracking using joint probabilistic data association. 1980.
[13] R. Girdhar and D. Ramanan. Attentional pooling for action recognition. In NIPS, 2017.
[14] R. Girdhar, D. Ramanan, A. Gupta, J. Sivic, and B. Russell. ActionVLAD: Learning spatio-temporal aggregation for action classification. In CVPR, 2017.
[15] G. Gkioxari, R. Girshick, and J. Malik. Contextual action recognition with R*CNN. In ICCV, 2015.
[16] Gurobi Optimization, Inc. Gurobi optimizer reference manual, 2016.
[17] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[19] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
[20] R. Hou, C. Chen, and M. Shah. Tube convolutional neural network (T-CNN) for action detection in videos. In ICCV, 2017.
[21] E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, B. Andres, and B. Schiele. Articulated multi-person tracking in the wild. In CVPR, 2017.
[22] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV, 2016.
[23] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[24] U. Iqbal, A. Milan, M. Andriluka, E. Insafutdinov, L. Pishchulin, J. Gall, and B. Schiele. PoseTrack: A benchmark for human pose estimation and tracking. arXiv:1710.10000 [cs], 2017.
[25] U. Iqbal, A. Milan, M. Andriluka, E. Insafutdinov, L. Pishchulin, J. Gall, and B. Schiele. PoseTrack dataset. PoseTrack / CC INT'L 4.0 / https://posetrack.net/about.php, 2017.
[26] U. Iqbal, A. Milan, and J. Gall. PoseTrack: Joint multi-person pose estimation and tracking. In CVPR, 2017.
[27] K. Kang, H. Li, J. Yan, X. Zeng, B. Yang, T. Xiao, C. Zhang, Z. Wang, R. Wang, X. Wang, et al. T-CNN: Tubelets with convolutional neural networks for object detection from videos. arXiv preprint arXiv:1604.02532, 2016.
[28] K. Kang, W. Ouyang, H. Li, and X. Wang. Object detection from video tubelets with convolutional neural networks. In CVPR, 2016.
[29] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[30] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
[31] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[32] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011.
[33] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics (NRL), 1955.
[34] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[35] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. COCO dataset. COCO / CC INT'L 4.0 / http://cocodataset.org/#termsofuse.
[36] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[37] A. Mallya and S. Lazebnik. Learning models for actions and person-object interactions with transfer to question answering. In ECCV, 2016.
[38] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler. MOT16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, 2016.
[39] A. Milan, S. H. Rezatofighi, A. R. Dick, I. D. Reid, and K. Schindler. Online multi-target tracking using recurrent neural networks. In AAAI, 2017.
[40] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy. Towards accurate multi-person pose estimation in the wild. In CVPR, 2017.
[41] H. Pirsiavash, D. Ramanan, and C. C. Fowlkes. Globally-optimal greedy algorithms for tracking a variable number of objects. In CVPR, 2011.
[42] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. V. Gehler, and B. Schiele. DeepCut: Joint subset partition and labeling for multi person pose estimation. In CVPR, 2016.
[43] D. B. Reid. An algorithm for tracking multiple targets. IEEE Transactions on Automatic Control, 1979.
[44] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[45] A. Sadeghian, A. Alahi, and S. Savarese. Tracking the untrackable: Learning to track multiple cues with long-term dependencies. In ICCV, 2017.
[46] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[47] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[48] J. Song, L. Wang, L. Van Gool, and O. Hilliges. Thin-slicing network: A deep structured model for pose estimation in videos. In CVPR, 2017.
[49] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CRCV-TR-12-01, 2012.
[50] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, 2017.
[51] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[52] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
[53] Y. Yang and D. Ramanan. Articulated human detection with flexible mixtures of parts. PAMI, 2013.
[54] S.-I. Yu, D. Meng, W. Zuo, and A. Hauptmann. The solution path algorithm for identity-aware multi-object tracking. In CVPR, 2016.
[55] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In NIPS, 2014.