You Only Watch One Frame (YOWOF) for Online Spatio-Temporal Action Detection
The whole pipeline of YOWOF:
Online inference of YOWOF:
Paper: coming soon ... [under review]
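The name refers to the online setting: at each time step only the newest frame is passed through the 2D backbone, while the features of the previous frames are reused from a cache. Below is a minimal sketch of that caching idea in PyTorch; the class name and tensor shapes are illustrative assumptions, not the modules defined in this repository.

```python
# Minimal sketch of stream inference with a per-frame feature cache.
# Assumption: a YOWOF-style model fuses the 2D features of the last T frames;
# this class is illustrative and is NOT the API of this repository.
from collections import deque

import torch
import torch.nn as nn


class StreamFeatureCache:
    def __init__(self, backbone: nn.Module, clip_len: int = 16):
        self.backbone = backbone          # any 2D CNN mapping a frame to a feature map
        self.buffer = deque(maxlen=clip_len)

    @torch.no_grad()
    def push(self, frame: torch.Tensor) -> torch.Tensor:
        """frame: [1, 3, H, W]; returns cached features stacked as [1, T, C, H', W']."""
        self.buffer.append(self.backbone(frame))
        return torch.stack(list(self.buffer), dim=1)
```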
- We recommend using Anaconda to create a conda environment:
conda create -n yowof python=3.6
- Then, activate the environment:
conda activate yowof
- Requirements:
pip install -r requirements.txt
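An optional sanity check before passing `--cuda` to the scripts below: confirm that the installed PyTorch build actually sees your GPU.

```python
# Optional environment check: verify the PyTorch install and CUDA visibility.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```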
- Google drive
Link: https://drive.google.com/file/d/1Dwh90pRi7uGkH5qLRjQIFiEmMJrAog5J/view?usp=sharing
- BaiduYun Disk
Link: https://pan.baidu.com/s/11GZvbV0oAzBhNDVKXsVGKg
Password: hmu6
You can follow the instructions here to prepare the AVA dataset.
- Frame mAP@0.5 & Video mAP@0.5 on UCF24
| Model | Clip length | FPS | GFLOPs | Frame mAP | Video mAP | Weight |
|---|---|---|---|---|---|---|
| YOWOF-R18 | 16 | 225 | 4.9 | 83.3 | 49.1 | ckpt |
- mAP@0.5 IoU on AVA_v2.2
| Model | Clip length | FPS | GFLOPs | Params | mAP | Weight |
|---|---|---|---|---|---|---|
| YOWOF-R18 | 16 | 220 | 5.0 | 23.9 M | 18.1 | ckpt |
| YOWOF-R50 | 16 | 125 | 11.1 | 50.5 M | 20.7 | ckpt |
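The FPS numbers above are hardware dependent. If you want a rough throughput measurement of your own, the pattern below (warm-up, CUDA synchronization, then timing) is a reasonable sketch; a torchvision ResNet-18 stands in as a placeholder because the YOWOF model class itself is not shown in this README.

```python
# Rough GPU throughput measurement (illustrative; real numbers depend on the
# YOWOF model, input resolution, and hardware used for the tables above).
import time

import torch
import torchvision

model = torchvision.models.resnet18().eval().cuda()
x = torch.randn(1, 3, 224, 224, device="cuda")

with torch.no_grad():
    for _ in range(10):                 # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()

print("FPS: %.1f" % (100.0 / (time.time() - start)))
```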
- train on UCF24:
python train.py --cuda -d ucf24 -v yowof-r18 --num_workers 4 --eval_epoch 1 --eval
or you can just run the script:
sh train_ucf.sh
- train on AVA_v2.2:
python train.py --cuda -d ava_v2.2 -v yowof-r50 --num_workers 4 --eval_epoch 1 --eval
or you can just run the script:
sh train_ava.sh
- run YOWOF on UCF24 in clip inference mode
python test.py --cuda -d ucf24 -v yowof-r18 --weight path/to/weight --inf_mode clip --show
- run YOWOF on UCF24 in stream inference mode
python test.py --cuda -d ucf24 -v yowof-r18 --weight path/to/weight --inf_mode stream --show
- run YOWOF on AVA_v2.2 in clip inference mode
python test.py --cuda -d ava_v2.2 -v yowof-r50 --weight path/to/weight --inf_mode clip --show
- run YOWOF on AVA_v2.2 in stream inference mode
python test.py --cuda -d ava_v2.2 -v yowof-r50 --weight path/to/weight --inf_mode stream --show
- evaluate on UCF24
# frame mAP
python eval.py \
--cuda \
-d ucf24 \
-v yowof-r18 \
--weight path/to/weight \
--cal_frame_mAP
Our frame mAP@0.5 achieved by YOWOF-R18 on UCF24:
AP: 79.05% (1)
AP: 97.07% (10)
AP: 83.96% (11)
AP: 64.24% (12)
AP: 72.87% (13)
AP: 94.46% (14)
AP: 88.25% (15)
AP: 93.32% (16)
AP: 78.41% (17)
AP: 94.25% (18)
AP: 97.36% (19)
AP: 45.69% (2)
AP: 96.13% (20)
AP: 82.96% (21)
AP: 79.45% (22)
AP: 54.20% (23)
AP: 92.01% (24)
AP: 88.81% (3)
AP: 78.68% (4)
AP: 70.28% (5)
AP: 95.14% (6)
AP: 92.11% (7)
AP: 89.49% (8)
AP: 91.70% (9)
mAP: 83.33%
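For reference, each per-class frame AP above comes from sorting that class's detections by confidence, marking each as a true or false positive at 0.5 IoU against the ground-truth boxes of its frame, and integrating precision over recall; frame mAP is the mean over the 24 classes. Below is a generic sketch of that integration step, not the evaluator inside eval.py.

```python
# Generic average-precision computation from sorted TP/FP flags.
# This mirrors the usual frame-mAP recipe; it is not the code used by eval.py.
import numpy as np


def average_precision(tp, fp, num_gt):
    """tp/fp: 0/1 arrays for detections sorted by descending confidence."""
    if len(tp) == 0:
        return 0.0
    tp_cum = np.cumsum(tp).astype(float)
    fp_cum = np.cumsum(fp).astype(float)
    recall = tp_cum / max(num_gt, 1)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-6)
    # All-point interpolation: make the precision envelope monotonically decreasing.
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    # Area under the precision-recall curve.
    ap = recall[0] * precision[0]
    ap += float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))
    return ap
```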
# video mAP
python eval.py \
--cuda \
-d ucf24 \
-v yowof-r18 \
--weight path/to/weight \
--cal_video_mAP
Our video mAP achieved by YOWOF-R18 on UCF24:
-------------------------------
V-mAP @ 0.05 IoU:
--Per AP: [87.71, 93.6, 68.66, 96.26, 79.63, 100.0, 82.72, 100.0, 93.36, 96.08, 44.8, 91.01, 91.87, 99.76, 23.33, 98.87, 90.87, 96.55, 91.46, 65.01, 72.97, 49.67, 86.4, 87.96]
--mAP: 82.86
-------------------------------
V-mAP @ 0.1 IoU:
--Per AP: [87.71, 91.01, 68.66, 93.73, 79.63, 100.0, 82.72, 100.0, 93.36, 96.08, 44.8, 87.62, 91.87, 99.76, 23.33, 98.87, 90.87, 96.55, 91.46, 63.46, 70.97, 49.67, 57.29, 87.96]
--mAP: 81.14
-------------------------------
V-mAP @ 0.2 IoU:
--Per AP: [58.88, 84.02, 64.38, 78.9, 38.04, 100.0, 82.72, 100.0, 82.5, 96.08, 44.8, 84.75, 91.87, 99.76, 22.2, 98.87, 90.87, 96.55, 91.46, 61.48, 49.51, 48.3, 32.85, 87.96]
--mAP: 74.45
-------------------------------
V-mAP @ 0.3 IoU:
--Per AP: [8.66, 28.1, 64.38, 69.77, 12.2, 84.79, 77.96, 100.0, 82.5, 92.68, 44.8, 72.98, 75.96, 99.76, 16.23, 98.87, 90.87, 96.55, 91.46, 51.32, 30.98, 43.29, 3.24, 80.77]
--mAP: 63.25
-------------------------------
V-mAP @ 0.5 IoU:
--Per AP: [0.0, 1.75, 53.12, 35.56, 0.66, 40.99, 62.67, 90.6, 49.47, 89.94, 44.8, 57.89, 48.27, 99.76, 4.81, 98.87, 85.47, 89.53, 87.81, 44.64, 0.45, 20.97, 0.0, 69.87]
--mAP: 49.08
-------------------------------
V-mAP @ 0.75 IoU:
--Per AP: [0.0, 0.0, 22.6, 0.0, 0.0, 1.11, 27.9, 64.31, 11.68, 42.35, 21.51, 34.02, 0.8, 40.54, 0.92, 80.21, 44.81, 19.75, 80.34, 18.34, 0.0, 0.92, 0.0, 35.32]
--mAP: 22.81
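Video mAP scores whole action tubes (per-frame boxes linked through time) rather than individual frames, which is why the thresholds above range from 0.05 to 0.75. A commonly used spatio-temporal tube IoU, shown here only to illustrate what those thresholds apply to, multiplies the temporal overlap of two tubes by the mean box IoU over their shared frames.

```python
# Illustrative spatio-temporal tube IoU; not necessarily the exact metric
# implemented by the UCF24 evaluator in this repository.
import numpy as np


def box_iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / max(area_a + area_b - inter, 1e-6)


def tube_iou(tube_a: dict, tube_b: dict) -> float:
    """tube = {frame_index: [x1, y1, x2, y2]}"""
    frames_a, frames_b = set(tube_a), set(tube_b)
    shared = frames_a & frames_b
    if not shared:
        return 0.0
    temporal_iou = len(shared) / len(frames_a | frames_b)
    spatial_iou = np.mean([box_iou(tube_a[f], tube_b[f]) for f in shared])
    return float(temporal_iou * spatial_iou)
```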
- evaluate on AVA_v2.2
python eval.py --cuda -d ava_v2.2 -v yowof-r50 --weight path/to/weight
Our best result, achieved by YOWOF-R50 on AVA_v2.2:
'AP@0.5IOU/answer phone': 0.701,
'AP@0.5IOU/bend/bow (at the waist)': 0.358,
'AP@0.5IOU/carry/hold (an object)': 0.496,
'AP@0.5IOU/climb (e.g., a mountain)': 0.007,
'AP@0.5IOU/close (e.g., a door, a box)': 0.087,
'AP@0.5IOU/crouch/kneel': 0.178,
'AP@0.5IOU/cut': 0.026,
'AP@0.5IOU/dance': 0.295,
'AP@0.5IOU/dress/put on clothing': 0.006,
'AP@0.5IOU/drink': 0.223,
'AP@0.5IOU/drive (e.g., a car, a truck)': 0.531,
'AP@0.5IOU/eat': 0.209,
'AP@0.5IOU/enter': 0.0345,
'AP@0.5IOU/fall down': 0.098,
'AP@0.5IOU/fight/hit (a person)': 0.348,
'AP@0.5IOU/get up': 0.092,
'AP@0.5IOU/give/serve (an object) to (a person)': 0.057,
'AP@0.5IOU/grab (a person)': 0.046,
'AP@0.5IOU/hand clap': 0.019,
'AP@0.5IOU/hand shake': 0.014,
'AP@0.5IOU/hand wave': 0.004,
'AP@0.5IOU/hit (an object)': 0.006,
'AP@0.5IOU/hug (a person)': 0.217,
'AP@0.5IOU/jump/leap': 0.093,
'AP@0.5IOU/kiss (a person)': 0.398,
'AP@0.5IOU/lie/sleep': 0.581,
'AP@0.5IOU/lift (a person)': 0.027,
'AP@0.5IOU/lift/pick up': 0.021,
'AP@0.5IOU/listen (e.g., to music)': 0.014,
'AP@0.5IOU/listen to (a person)': 0.565,
'AP@0.5IOU/martial art': 0.329,
'AP@0.5IOU/open (e.g., a window, a car door)': 0.133,
'AP@0.5IOU/play musical instrument': 0.282,
'AP@0.5IOU/point to (an object)': 0.001,
'AP@0.5IOU/pull (an object)': 0.022,
'AP@0.5IOU/push (an object)': 0.016,
'AP@0.5IOU/push (another person)': 0.087,
'AP@0.5IOU/put down': 0.030,
'AP@0.5IOU/read': 0.270,
'AP@0.5IOU/ride (e.g., a bike, a car, a horse)': 0.251,
'AP@0.5IOU/run/jog': 0.283,
'AP@0.5IOU/sail boat': 0.250,
'AP@0.5IOU/shoot': 0.005,
'AP@0.5IOU/sing to (e.g., self, a person, a group)': 0.104,
'AP@0.5IOU/sit': 0.790,
'AP@0.5IOU/smoke': 0.040,
'AP@0.5IOU/stand': 0.778,
'AP@0.5IOU/swim': 0.227,
'AP@0.5IOU/take (an object) from (a person)': 0.047,
'AP@0.5IOU/take a photo': 0.060,
'AP@0.5IOU/talk to (e.g., self, a person, a group)': 0.688,
'AP@0.5IOU/text on/look at a cellphone': 0.061,
'AP@0.5IOU/throw': 0.007,
'AP@0.5IOU/touch (an object)': 0.388,
'AP@0.5IOU/turn (e.g., a screwdriver)': 0.014,
'AP@0.5IOU/walk': 0.542,
'AP@0.5IOU/watch (a person)': 0.663,
'AP@0.5IOU/watch (e.g., TV)': 0.177,
'AP@0.5IOU/work on a computer': 0.099,
'AP@0.5IOU/write': 0.041,
'mAP@0.5IOU': 0.207
- run YOWOF on an AVA video with test_video_ava.py:
python test_video_ava.py --cuda -d ava_v2.2 -v yowof-r50 --weight path/to/weight --video ava/video/name
- detect action instances with UCF24 labels
We provide some test videos of UCF24 in dataset/demo/ucf24_demo/.
python demo.py --cuda -d ucf24 -v yowof-r18 --weight path/to/weight --video ./dataset/demo/ucf24_demo/v_Basketball_g01_c02.mp4
- detect action instances with AVA labels
python demo.py --cuda -d ava_v2.2 -v yowof-r50 --weight path/to/weight --video path/to/video
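Conceptually, the demo scripts read a video frame by frame, run the detector on each frame, and draw the predicted boxes and labels. The loop below is only a schematic OpenCV skeleton with a placeholder where the YOWOF forward pass would go; demo.py in this repository already implements the real thing.

```python
# Schematic demo loop: read frames, detect, draw. The detector call is a
# placeholder comment, not the actual interface of this repository's demo.py.
import cv2

cap = cv2.VideoCapture("path/to/video.mp4")
while True:
    ret, frame = cap.read()
    if not ret:
        break
    # detections = model.infer(frame)  # placeholder for the YOWOF forward pass
    detections = []
    for (x1, y1, x2, y2, label, score) in detections:
        cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
        cv2.putText(frame, f"{label} {score:.2f}", (int(x1), int(y1) - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    cv2.imshow("YOWOF demo", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```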