Recurrent Tubelet Proposal and Recognition Networks For Action Detection
Dong Li1, Zhaofan Qiu1, Qi Dai2, Ting Yao3, and Tao Mei3
1 University of Science and Technology of China, Hefei, China
2 Microsoft Research, Beijing, China
3 JD AI Research, Beijing, China
{dongli1995.ustc,zhaofanqiu,tingyao.ustc}@gmail.com
[email protected], [email protected]
1 Introduction
Action detection with accurate spatio-temporal localization in videos is one of the most challenging tasks in video understanding. Compared to action recognition, this task is more difficult due to complex variations and the large spatio-temporal search space. The solutions to this problem have evolved from handcrafted feature-based methods [18, 34, 40] to deep learning-based approaches [7]. Promising progress has been made recently [22, 28, 36, 39] with the prevalence of deep Convolutional Neural Networks (CNN) [10, 16, 30].
Fig. 1. Comparison of action detection between a traditional method (first row) and ours (second row). The traditional method extracts per-frame proposals independently, which may result in failures. In the example of horse riding (left), it fails when the person is partially occluded by the fence. In the example of long jump (right), an unwanted person (red bounding box) is also detected. In contrast, our approach solves these problems by utilizing the temporal context across frames.
Inspired by the recent advances of CNN-based image object detection methods [4, 6, 26], previous action detection approaches first detect either frame-level [22, 39] or clip-level proposals [11] independently. These fragmentary proposals are then associated to generate a complete action proposal by linking or tracking based approaches. However, such methods rarely exploit the temporal relations across frames or clips, which may result in unstable proposals when a single detection is unreliable. Figure 1 illustrates two examples of such limitations. In the example of horse riding, detection fails in the second frame where the person is partially occluded by the fence. In the other example of long jump, an unwanted person (red bounding box) is also detected, introducing noise for subsequent proposal recognition. Such noise is long-standing and inevitable due to independent detection on each frame or clip. One possible way to solve the above problems is to model the action by leveraging temporally contextual relations. For example, when the person is occluded in the current frame, we could leverage the proposals in previous frames to infer the current ones. Motivated by this idea, we exploit the action proposals in previous frames plus the corresponding contextual information when determining the action regions in the current one, instead of detecting proposals from each frame or clip independently. By involving the temporal context, the inevitable failures of the per-frame or per-clip proposal generation scheme can be largely alleviated.
In this paper, we present Recurrent Tubelet Proposal and Recognition (RTPR) networks, a novel architecture for action detection, as shown in Figure 2. Our proposed RTPR networks consist of two components: Recurrent Tubelet Proposal (RTP) networks and Recurrent Tubelet Recognition (RTR) networks. The RTP produces action proposals in a recurrent manner. Specifically, it initializes the action proposals of the start frame through a Region Proposal Network (RPN) [26] on the feature map. Then the movement of each proposal into the next frame is estimated from three inputs: the feature maps of both the current and next frames, and the proposal in the current frame. Simultaneously, a sibling proposal classifier is utilized to infer the actionness of the proposal. To form the tubelet proposals, action proposals in two consecutive frames are linked by taking both their actionness and overlap ratio into account, followed by temporal trimming. The RTR capitalizes on a multi-channel architecture for tubelet proposal recognition. For each proposal, we extract three different semantic-level features, i.e., the features of the proposal-cropped image, the features with RoI pooling on the proposal, and the features of the whole frame. These features implicitly encode the spatial context and scene information, which enhances the recognition capability for specific categories. After that, each of them is fed into a Long Short-Term Memory (LSTM) network to model the temporal dynamics. With both RTP and RTR, our approach can generate better tubelets and boost recognition, leading to the promising detection results shown in Figure 1.
The main contribution of this work is the design of RTPR networks for action detection. The solution also provides elegant views on what kind of temporal context should be exploited and how to model temporal relations in a deep learning framework, particularly for the task of action detection, problems that are not yet fully understood in the literature.
2 Related Work
which 3D action volumes are generated [2]. For example, TrackLocalization [38] tracks current proposals to obtain anchor ones in the next frame by leveraging optical flow, and selects the best regions in the neighborhood of the anchors using a sliding window. However, distinguishing actions from a single frame can be ambiguous. To address this issue, ACT [14] takes a sequence of frames as input and outputs tube proposals instead of operating on single frames. T-CNN [11] further extends 2D Region-of-Interest pooling to 3D Tube-of-Interest (ToI) pooling with 3D convolution, and directly generates tube proposals for each fixed-length clip.
The aforementioned action detection methods treat each frame or clip independently, ignoring the temporally contextual relations. Instead, our approach generates the tubelet proposals in a recurrent manner, which fully leverages the temporal information in videos. The most closely related work is CPLA [39], which solely relies on the detected proposals of the current frame to estimate the proposal movements in the next frame. Ours is different from [39] in that we effectively model the temporal correlations of proposals between two consecutive frames to predict movements. Moreover, Faster R-CNN is required for each frame in [39], while our approach only runs RPN at initialization. In addition, our work also contributes by reliably capturing different semantic-level information and long-term temporal dynamics for recognition.
3 Recurrent Tubelet Proposal and Recognition Networks
In this section we present our proposed Recurrent Tubelet Proposal and Recognition (RTPR) networks for video action detection. An overview of our framework is shown in Figure 2. It consists of two main components: Recurrent Tubelet Proposal (RTP) networks and Recurrent Tubelet Recognition (RTR) networks. In RTP, we first utilize CNNs for feature extraction. The RPN is applied on the feature map of the start frame for proposal initialization. Our RTP then generates proposals of subsequent frames in a recurrent manner. In particular, given the action proposals in the current frame, RTP estimates their movements to produce proposals in the next frame. Furthermore, action proposals from consecutive frames are linked according to their actionness and overlap ratio to form video-level tubelet proposals, followed by temporal trimming on the tubelets. The obtained tubelets are finally fed into RTR for recognition. RTR employs a multi-channel architecture to capture different semantic-level information, where an individual LSTM is utilized to model temporal dynamics on each channel.
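To make the multi-channel recognition design concrete, below is a minimal PyTorch-style sketch written under our own assumptions (feature dimensions, the last-step classification, the average fusion of channel scores, and all names are hypothetical, not the authors' released code): one LSTM and one classifier per semantic channel, with the per-channel predictions fused by averaging.

```python
import torch
import torch.nn as nn

class MultiChannelTubeletRecognizer(nn.Module):
    """Sketch of an RTR-style recognizer: one LSTM per semantic channel
    (human, human-object interaction, scene), fused by score averaging."""

    def __init__(self, feat_dim=4096, hidden_dim=512, num_classes=24, num_channels=3):
        super().__init__()
        self.lstms = nn.ModuleList(
            [nn.LSTM(feat_dim, hidden_dim, batch_first=True) for _ in range(num_channels)]
        )
        self.classifiers = nn.ModuleList(
            [nn.Linear(hidden_dim, num_classes) for _ in range(num_channels)]
        )

    def forward(self, channel_feats):
        # channel_feats: list of tensors, each (batch, T, feat_dim), holding the
        # per-frame features of one channel along the tubelet.
        logits = []
        for feats, lstm, cls in zip(channel_feats, self.lstms, self.classifiers):
            out, _ = lstm(feats)              # (batch, T, hidden_dim)
            logits.append(cls(out[:, -1]))    # classify from the last time step
        return torch.stack(logits, dim=0).mean(dim=0)  # average-fuse channel scores
```

Whether the predictions are taken from the last hidden state or averaged over time steps, and how the channel scores are fused, are design choices not fixed by the overview above; the sketch simply picks one plausible option for each.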
The RTP aims to generate action proposals for all frames. Instead of producing proposals in each frame independently, we exploit the localized proposals in the previous frame when determining the action regions in the current one. Such a scheme helps avoid the failures caused by unreliable single detections. Note that RTP only indicates the locations where an action exists, irrespective of the category.
Fig. 2. The overview of RTPR networks. It consists of two components: RTP networks and RTR networks. CNNs are first utilized in RTP for feature extraction. Then RPN is applied on the start frame for proposal initialization. RTP estimates the movements of proposals in the current frame to produce proposals in the next frame. After proposal linking and temporal trimming, the obtained tubelet proposals are fed into RTR. The RTR employs a multi-channel architecture to capture different semantic-level information, where an individual LSTM is utilized to model temporal dynamics on each channel.
Architecture. The RTP begins with the initial anchor action proposals obtained by RPN on the start frame, and then produces the action proposals of subsequent frames in a recurrent manner. Given the video frame $I_t$ and its proposal set $B_t = \{b_t^i \mid i = 1, \ldots, N\}$ at time $t$, RTP aims to generate the proposal set $B_{t+1}$ for the next frame $I_{t+1}$. Let $b_t^i = (x_t^i, y_t^i, w_t^i, h_t^i)$ denote the $i$-th proposal at time $t$, where $x$, $y$, $w$ and $h$ represent the two coordinates of the proposal center and its width and height. As shown in Figure 3(a), two consecutive frames, $I_t$ and $I_{t+1}$, are first fed into a shared CNN to extract features. To predict the $i$-th proposal $b_{t+1}^i$ at time $t+1$, we need to estimate the movement $m_{t+1}^i = (\Delta x_{t+1}^i, \Delta y_{t+1}^i, \Delta w_{t+1}^i, \Delta h_{t+1}^i)$ between $b_{t+1}^i$ and $b_t^i$, which is defined as

Instead, to estimate this movement, the visual features of proposals $b_{t+1}^i$ and $b_t^i$ are required. This is a chicken-and-egg problem, as we do not have the exact location of $b_{t+1}^i$ in advance. Observing that the receptive fields of deep convolutional layers are generally large enough to capture possible movements, we solve the problem by simply performing RoI pooling at the same proposal location as $b_t^i$, as shown in Figure 3(a).
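As an illustration of one proposal-propagation step, here is a hedged PyTorch sketch under our own assumptions: the layer names and dimensions are hypothetical, the pairwise CBP fusion used in the paper is replaced by simple concatenation for brevity, and the box-offset decoding follows the standard Fast R-CNN parameterization rather than the paper's exact Eq. (1).

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class ProposalMovementHead(nn.Module):
    """Sketch of one RTP step: pool features of frames t and t+1 at the same
    proposal location, fuse them, and predict the movement and actionness."""

    def __init__(self, feat_channels=512, pool_size=7, hidden=1024):
        super().__init__()
        fused_dim = 2 * feat_channels * pool_size * pool_size  # concat fusion (CBP in the paper)
        self.fc = nn.Sequential(nn.Linear(fused_dim, hidden), nn.ReLU(inplace=True))
        self.movement = nn.Linear(hidden, 4)    # (dx, dy, dw, dh)
        self.actionness = nn.Linear(hidden, 1)  # sibling proposal classifier
        self.pool_size = pool_size

    def forward(self, feat_t, feat_tp1, boxes, spatial_scale=1.0 / 16):
        # feat_t, feat_tp1: (1, C, H, W) conv feature maps of frames t and t+1.
        # boxes: (N, 4) proposals b_t in (x1, y1, x2, y2) image coordinates.
        rois = [boxes]
        f_t = roi_align(feat_t, rois, self.pool_size, spatial_scale)
        f_tp1 = roi_align(feat_tp1, rois, self.pool_size, spatial_scale)  # same locations as b_t
        fused = torch.cat([f_t, f_tp1], dim=1).flatten(1)
        h = self.fc(fused)
        return self.movement(h), torch.sigmoid(self.actionness(h)).squeeze(1)

def apply_movement(boxes, deltas):
    # Standard box-delta decoding (an assumption; the paper's Eq. (1) defines its own form).
    x = (boxes[:, 0] + boxes[:, 2]) / 2
    y = (boxes[:, 1] + boxes[:, 3]) / 2
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    nx, ny = x + deltas[:, 0] * w, y + deltas[:, 1] * h
    nw, nh = w * torch.exp(deltas[:, 2]), h * torch.exp(deltas[:, 3])
    return torch.stack([nx - nw / 2, ny - nh / 2, nx + nw / 2, ny + nh / 2], dim=1)
```

Iterating this step frame by frame, with the RPN outputs of the start frame as initialization, yields the recurrent proposal generation described above.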
Formally, suppose $F_t^i$ and $F_{t+1}^i \in \mathbb{R}^{W \times H \times D}$ denote the RoI-pooled features of $I_t$ and $I_{t+1}$ w.r.t. the same location of proposal $b_t^i$, where $W$, $H$ and $D$ are the width, height and number of channels. The objective is to estimate the proposal movement $m_{t+1}^i$ based on $F_t^i$ and $F_{t+1}^i$. The movement between two consecutive
Fig. 3. (a) RTP networks. Two consecutive frames $I_t$ and $I_{t+1}$ are first fed into CNNs. Given the proposal $b_t$ in the current frame $I_t$, we perform RoI pooling on both $I_t$ and $I_{t+1}$ w.r.t. the same proposal $b_t$. The two pooled features are fed into a CBP layer to generate the correlation features, which are used to estimate the movement of proposal $b_t$ and the actionness score. (b) RTR networks. We capitalize on a multi-channel network for tubelet recognition. Three different semantic clues, i.e., human only (H), human-object interaction (I), and scene (S), are exploited, for which the features of the proposal-cropped image, the features with RoI pooling on the proposal, and the features of the whole frame are extracted, respectively. Each of them is fed into an LSTM to model the temporal dynamics.
where $S = W \times H$ is the size of the feature map, $\phi(\cdot)$ is a low-dimensional projection function, and $\langle \cdot, \cdot \rangle$ is the second-order polynomial kernel. Finally, the outputs of CBP are fed into two sibling fully connected layers. One is the regression layer that predicts the movement $m_{t+1}^i$, and the other is the classification layer that predicts the actionness confidence score of $b_{t+1}^i$.
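For reference, Compact Bilinear Pooling [3] is commonly implemented with the Tensor Sketch trick, i.e., a count-sketch projection of each feature followed by multiplication in the FFT domain. The following is a minimal sketch of fusing the two RoI-pooled features this way; the projection dimension, the fixed random hashes, and all names are our own assumptions rather than the paper's implementation.

```python
import torch

def count_sketch(x, h, s, d):
    # x: (S, D) feature vectors (one per spatial cell); h: (D,) hash buckets in [0, d);
    # s: (D,) random signs in {-1, +1}. Returns the (S, d) count-sketch projection.
    sketch = x.new_zeros(x.size(0), d)
    sketch.index_add_(1, h, x * s)
    return sketch

def compact_bilinear_pool(f_t, f_tp1, h1, s1, h2, s2, d=4096):
    # f_t, f_tp1: (S, D) RoI-pooled features of frames t and t+1, flattened over the W*H grid.
    # Circular convolution of the two count sketches, computed in the FFT domain,
    # approximates the pairwise (bilinear) correlation at each spatial location.
    p1 = torch.fft.rfft(count_sketch(f_t, h1, s1, d), dim=1)
    p2 = torch.fft.rfft(count_sketch(f_tp1, h2, s2, d), dim=1)
    fused = torch.fft.irfft(p1 * p2, n=d, dim=1)
    return fused.sum(dim=0)  # sum-pool over spatial locations -> (d,) correlation feature

# The random projections are drawn once and kept fixed, e.g.:
# D, d = 512, 4096
# h1, h2 = torch.randint(0, d, (D,)), torch.randint(0, d, (D,))
# s1, s2 = torch.randint(0, 2, (D,)) * 2 - 1, torch.randint(0, 2, (D,)) * 2 - 1
```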
Training Objective. When training RTP, the network inputs include two consecutive frames $I_t$ and $I_{t+1}$, the proposal set $B_t$ of $I_t$ obtained by RPN, and the ground-truth bounding boxes $\hat{B}_{t+1}$. Assuming that the action movement across two consecutive frames is small, correctly extracted proposals in $B_t$ will have large Intersection-over-Union (IoU) ratios with one ground-truth proposal in $\hat{B}_{t+1}$. Consequently, we assign a positive label to a proposal $b_t^i$ if: (a) $b_t^i$ has an IoU overlap higher than 0.7 with any ground-truth box in $\hat{B}_{t+1}$, or (b) $b_t^i$ has the highest IoU overlap with a ground-truth box in $\hat{B}_{t+1}$. Otherwise, we assign a negative label to $b_t^i$ if its IoU ratio is lower than 0.3 with all ground-truth boxes in $\hat{B}_{t+1}$. The network is jointly optimized with a classification loss $L_{cls}$ and a regression loss $L_{reg}$. For classification, the standard log loss $L_{cls}(b_t^i)$ is utilized. It outputs an actionness score in the range $[0, 1]$ for the output proposal. For regression, the smooth L1 loss [4] $L_{reg}(m_{t+1}^i)$ is exploited.
It forces the proposal $b_t^i$ to move towards a nearby ground-truth proposal in the next frame. The overall objective $L$ is formulated as

$$L = \frac{1}{N} \sum_i L_{cls}(b_t^i) + \lambda \frac{1}{N_{reg}} \sum_i y_t^i L_{reg}(m_{t+1}^i), \qquad (3)$$
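The label assignment and the joint objective of Eq. (3) could be sketched as follows. This is a simplified illustration: the normalization, the handling of ignored proposals, and the balancing weight follow common Faster R-CNN practice and are assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def assign_labels(iou_matrix, hi=0.7, lo=0.3):
    # iou_matrix: (N, G) IoUs between proposals in B_t and ground-truth boxes in B_{t+1}.
    max_iou, _ = iou_matrix.max(dim=1)
    labels = torch.full((iou_matrix.size(0),), -1)     # -1: ignored
    labels[max_iou < lo] = 0                           # negative: IoU < 0.3 with all GT
    labels[max_iou >= hi] = 1                          # positive: IoU > 0.7 with some GT
    labels[iou_matrix.argmax(dim=0)] = 1               # positive: highest IoU for each GT
    return labels

def rtp_loss(actionness_logits, pred_movements, target_movements, labels, lam=1.0):
    valid = labels >= 0
    pos = labels == 1
    cls_loss = F.binary_cross_entropy_with_logits(
        actionness_logits[valid], labels[valid].float())          # log loss on actionness
    reg_loss = (F.smooth_l1_loss(pred_movements[pos], target_movements[pos])
                if pos.any() else pred_movements.sum() * 0)       # only positives regress
    return cls_loss + lam * reg_loss
```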
where $a_t^i$ and $a_{t+1}^j$ are the actionness scores of $b_t^i$ and $b_{t+1}^j$, $iou(\cdot)$ is the IoU overlap ratio of two proposals, and $\gamma$ is a scalar parameter for balancing the actionness scores and overlaps. We define $\psi(iou)$ as the following threshold function:

$$\psi(iou) = \begin{cases} 1, & \text{if } iou(b_t^i, b_{t+1}^j) > \tau, \\ 0, & \text{otherwise}. \end{cases} \qquad (5)$$
According to Equation (4), two proposals $b_t^i$ and $b_{t+1}^j$ will be linked if their spatial regions significantly overlap and their actionness scores are both high.
To find the optimal path across the video, we maximize the linking scores over the duration $T$ of the video, which is calculated by

$$P^* = \arg\max_P \frac{1}{T-1} \sum_{t=1}^{T-1} S(b_t, b_{t+1}). \qquad (6)$$
We solve it with the Viterbi algorithm, whose complexity is $O(T \times N^2)$, where $N$ is the average number of proposals per frame. Once an optimal path is found, we remove all the proposals in it and seek the next one from the remaining proposals.
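A simple dynamic-programming pass implementing this linking is sketched below. Since the linking score of Eq. (4) is not reproduced above, the form used here (the sum of the two actionness scores plus a γ-weighted thresholded-overlap term, combining Eq. (5)) is our own assumption; the Viterbi recursion itself matches the described $O(T \times N^2)$ procedure.

```python
import numpy as np

def iou(box, boxes):
    # box: (4,), boxes: (M, 4), all in (x1, y1, x2, y2).
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-8)

def link_score(box_i, a_i, boxes_j, a_j, gamma=1.0, tau=0.5):
    # Assumed form of Eq. (4): both actionness scores plus a gamma-weighted
    # indicator that the spatial overlap exceeds tau (Eq. (5)).
    return a_i + a_j + gamma * (iou(box_i, boxes_j) > tau).astype(float)

def best_tubelet(boxes, scores, gamma=1.0, tau=0.5):
    # boxes[t]: (N_t, 4) proposals of frame t; scores[t]: (N_t,) actionness scores.
    # Viterbi over frames maximizing the summed link scores (Eq. (6)).
    T = len(boxes)
    dp = np.zeros(len(boxes[0]))
    back = []
    for t in range(1, T):
        trans = np.stack([dp[i] + link_score(boxes[t - 1][i], scores[t - 1][i],
                                             boxes[t], scores[t], gamma, tau)
                          for i in range(len(boxes[t - 1]))])   # (N_{t-1}, N_t)
        back.append(trans.argmax(axis=0))
        dp = trans.max(axis=0)
    path = [int(dp.argmax())]                                   # best proposal in the last frame
    for b in reversed(back):
        path.append(int(b[path[-1]]))
    return list(reversed(path))                                 # per-frame proposal indices
```

Calling this routine repeatedly, removing the proposals of each returned path before the next call, reproduces the iterative path extraction described above.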
Temporal trimming. The above tubelet linking produces tubelets spanning the whole video. In realistic videos, however, human actions typically occupy only a fraction of the duration. In order to determine the temporal extent of an action instance within a tubelet, we employ a temporal trimming approach similar to [28], which solves an energy maximization problem via dynamic programming. The idea behind it is to restrict consecutive proposals to have smooth actionness scores. Note that temporal trimming is only performed on untrimmed datasets in this work.
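As a rough illustration of such trimming, one can solve a two-label Viterbi problem that keeps the actionness of consecutive proposals smooth. This is a sketch in the spirit of [28], not the exact energy used there; the unary terms and the switching penalty α are assumptions.

```python
import numpy as np

def temporal_trim(actionness, alpha=0.3):
    # actionness: sequence of T scores along a linked tubelet. Assign each frame an
    # action (1) / background (0) label, maximizing per-frame scores while penalizing
    # label changes by alpha, which yields temporally smooth action segments.
    a = np.asarray(actionness, dtype=float)
    T = len(a)
    unary = np.stack([1.0 - a, a], axis=1)                 # (T, 2) background/action scores
    dp = unary[0].copy()
    back = np.zeros((T, 2), dtype=int)
    for t in range(1, T):
        new_dp = np.empty(2)
        for y in (0, 1):
            cand = dp - alpha * (np.arange(2) != y)        # pay alpha on a label switch
            back[t, y] = int(cand.argmax())
            new_dp[y] = cand[back[t, y]] + unary[t, y]
        dp = new_dp
    labels = [int(dp.argmax())]
    for t in range(T - 1, 0, -1):
        labels.append(int(back[t, labels[-1]]))
    labels.reverse()
    return labels          # frames labeled 1 form the trimmed action extent
```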
4 Implementations
Details of RTPR networks. For RTP networks, we adopt the pre-trained
VGG-16 [30] as our basic CNN model, and build RTP at the top of its last
5 Experiments
Fig. 4. The frame-level Recall vs. IoU curves of different action proposal methods on
UCF-Sports, J-HMDB (split 1), UCF-101, and AVA datasets.
are conducted on the first split. AVA [8] is a challenging dataset published very
recently. It densely annotates 80 atomic visual actions in 57.6K video segments
collected from 192 movies. The duration of each segment is 3 seconds. Different
from the above datasets where annotations are provided for all frames, only the
middle frame of each 3-second-long video segment is annotated in AVA. To take
full advantage of the temporal information around the annotated key-frame, 15
consecutive frames (i.e., the key-frame, 7 frames before it, and 7 frames after
it) are treated as a video clip to be processed in our framework. Note that each
bounding box may be associated with multiple action categories, making the
dataset more challenging. Experiments are performed on the standard splits [8].
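The clip construction around each annotated key frame described above amounts to the following index selection (a trivial sketch; dense, zero-based frame indexing is assumed):

```python
def ava_clip_indices(key_frame_idx, context=7):
    # 15 consecutive frames: 7 before the annotated key frame, the key frame itself,
    # and 7 after it, treated as one clip for AVA.
    return list(range(key_frame_idx - context, key_frame_idx + context + 1))

# Example: ava_clip_indices(100) -> [93, 94, ..., 107]
```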
Evaluation metrics. We adopt both frame-mAP and video-mAP as our evaluation metrics [7, 22]. A frame or tubelet detection is treated as positive if its IoU with the ground truth is greater than a threshold δ and the predicted label is correct. Specifically, for UCF-Sports, J-HMDB, and UCF-101, we follow [9, 11, 22] and use video-mAP as the evaluation metric; for AVA, we follow the standard evaluation scheme in [8] and measure frame-mAP. The reported metric is the mAP at IoU threshold δ = 0.5 for spatial localization (UCF-Sports, J-HMDB and AVA) and δ = 0.2 for spatio-temporal localization (UCF-101) by default.
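The frame-level criterion can be made explicit with a small matching routine. This is only a sketch: the greedy one-to-one matching of detections to ground truth is an assumption, and the mAP accumulation over ranked detections is omitted.

```python
def box_iou(a, b):
    # a, b: (4,) boxes in (x1, y1, x2, y2).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-8)

def is_true_positive(det_box, det_label, gt_boxes, gt_labels, matched, delta=0.5):
    # A detection counts as positive if its IoU with an unmatched ground truth of the
    # same class exceeds the threshold delta.
    for g, (gt_box, gt_label) in enumerate(zip(gt_boxes, gt_labels)):
        if det_label == gt_label and not matched[g] and box_iou(det_box, gt_box) > delta:
            matched[g] = True
            return True
    return False
```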
Fig. 5. An example of proposal generation results with three methods (RPN-RGB, FG-RGB, and RTP-RGB) on the action "Run"; the optical flow is also shown.
Table 1. Evaluation on RTR networks with different channels on the four datasets.
[Figure: per-category results (%) on categories including Jump, Sit, Stand, Shoot Bow, Shoot Gun, Swing Baseball, Kick Ball, Brush Hair, and Climb Stairs.]
Fig. 7. Four detection examples of our method from UCF-Sports, J-HMDB, UCF-101,
and AVA. The proposal score is given for each bounding box. Top predicted action
classes for each tubelet are on the right. Red labels indicate ground-truth.
to predict movements in RTP rather than relying on RPN for each individual
frame in CPLA. In addition, our ResNet-101 based model outperforms the best
competitors [9, 14] by 2.9%, 4.3% and 0.7% on the three datasets respectively.
Table 4 shows the comparisons on AVA. Since AVA is a very recent dataset, there are few studies on it, and we only compare with [8], which implements the Multi-Region Two-Stream CNN [22] with ResNet-101. We can observe that ours with both VGG-16 and ResNet-101 as the basic CNN model outperforms the baseline. Figure 7 showcases four detection examples from UCF-Sports, J-HMDB, UCF-101, and AVA. Even in complex cases, e.g., varying scales (third row) and multi-person plus multi-label (last row), our approach still works very well.
6 Conclusion
References
1. Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan,
S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual
recognition and description. In: CVPR (2015)
2. Feichtenhofer, C., Pinz, A., Zisserman, A.: Detect to track and track to detect. In:
CVPR (2017)
3. Gao, Y., Beijbom, O., Zhang, N., Darrell, T.: Compact bilinear pooling. In: CVPR
(2016)
4. Girshick, R.: Fast r-cnn. In: ICCV (2015)
5. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accu-
rate object detection and semantic segmentation. In: CVPR (2014)
6. Gkioxari, G., Girshick, R., Malik, J.: Contextual action recognition with R*CNN.
In: ICCV (2015)
7. Gkioxari, G., Malik, J.: Finding action tubes. In: CVPR (2015)
8. Gu, C., Sun, C., Vijayanarasimhan, S., Pantofaru, C., Ross, D.A., Toderici, G., Li,
Y., Ricco, S., Sukthankar, R., Schmid, C., et al.: Ava: A video dataset of spatio-
temporally localized atomic visual actions. arXiv preprint arXiv:1705.08421 (2017)
9. He, J., Ibrahim, M.S., Deng, Z., Mori, G.: Generic tubelet proposals for action
localization. In: WACV (2018)
10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: CVPR (2016)
11. Hou, R., Chen, C., Shah, M.: Tube convolutional neural network (t-cnn) for action
detection in videos. In: ICCV (2017)
12. Jhuang, H., Gall, J., Zuffi, S., Schmid, C., Black, M.J.: Towards understanding
action recognition. In: ICCV (2013)
13. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadar-
rama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding.
In: ACM MM (2014)
14. Kalogeiton, V., Weinzaepfel, P., Ferrari, V., Schmid, C.: Action tubelet detector
for spatio-temporal action localization. In: ICCV (2017)
15. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-
scale video classification with convolutional neural networks. In: CVPR (2014)
16. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. In: NIPS (2012)
17. Lan, T., Wang, Y., Mori, G.: Discriminative figure-centric models for joint action
localization and recognition. In: ICCV (2011)
18. Laptev, I., Pérez, P.: Retrieving actions in movies. In: ICCV (2007)
19. Li, Q., Qiu, Z., Yao, T., Mei, T., Rui, Y., Luo, J.: Action recognition by learning
deep multi-granular spatio-temporal video representation. In: ICMR (2016)
20. Li, Q., Qiu, Z., Yao, T., Mei, T., Rui, Y., Luo, J.: Learning hierarchical video
representation for action recognition. IJMIR 6(1), 85–98 (2017)
21. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: Ssd:
Single shot multibox detector. In: ECCV (2016)
22. Peng, X., Schmid, C.: Multi-region two-stream r-cnn for action detection. In: ECCV
(2016)
23. Qiu, Z., Yao, T., Mei, T.: Deep quantization: Encoding convolutional activations
with deep generative model. In: CVPR (2017)
24. Qiu, Z., Yao, T., Mei, T.: Learning spatio-temporal representation with pseudo-3d
residual networks. In: ICCV (2017)
25. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified,
real-time object detection. In: CVPR (2016)
26. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object de-
tection with region proposal networks. In: NIPS (2015)
27. Rodriguez, M.D., Ahmed, J., Shah, M.: Action MACH: a spatio-temporal maximum
average correlation height filter for action recognition. In: CVPR (2008)
28. Saha, S., Singh, G., Sapienza, M., Torr, P.H., Cuzzolin, F.: Deep learning for de-
tecting multiple space-time action tubes in videos. In: BMVC (2016)
29. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recog-
nition in videos. In: NIPS (2014)
30. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. In: ICLR (2015)
31. Singh, G., Saha, S., Cuzzolin, F.: Online real time multiple spatiotemporal action
localisation and prediction. In: ICCV (2017)
32. Soomro, K., Zamir, A.R., Shah, M.: Ucf101: A dataset of 101 human actions classes
from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
33. Srivastava, N., Mansimov, E., Salakhutdinov, R.: Unsupervised learning of video
representations using lstms. In: ICML (2015)
34. Tian, Y., Sukthankar, R., Shah, M.: Spatiotemporal deformable part models for
action detection. In: CVPR (2013)
35. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotem-
poral features with 3d convolutional networks. In: ICCV (2015)
36. Wang, L., Qiao, Y., Tang, X., Van Gool, L.: Actionness estimation using hybrid
fully convolutional networks. In: CVPR (2016)
37. Wang, Y., Song, J., Wang, L., Van Gool, L., Hilliges, O.: Two-stream sr-cnns for
action recognition in videos. In: BMVC (2016)
38. Weinzaepfel, P., Harchaoui, Z., Schmid, C.: Learning to track for spatio-temporal
action localization. In: ICCV (2015)
39. Yang, Z., Gao, J., Nevatia, R.: Spatio-temporal action detection with cascade pro-
posal and location anticipation. In: BMVC (2017)
40. Yuan, J., Liu, Z., Wu, Y.: Discriminative subvolume search for efficient action
detection. In: CVPR (2009)
41. Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R.,
Toderici, G.: Beyond short snippets: Deep networks for video classification. In:
CVPR (2015)