README.md (1 addition, 1 deletion)
@@ -54,7 +54,7 @@ The following is a summary of commonly used Computer Vision scenarios that are c
|[Detection](scenarios/detection)| Base | Object Detection is a technique that allows you to detect the bounding box of an object within an image. |
|[Keypoints](scenarios/keypoints)| Base | Keypoint detection can be used to detect specific points on an object. A pre-trained model is provided to detect body joints for human pose estimation. |
|[Segmentation](scenarios/segmentation)| Base | Image Segmentation assigns a category to each pixel in an image. |
-|[Action recognition](scenarios/action_recognition)| Base | Action recognition to identify in video/webcam footage what actions are performed (e.g. "running", "opening a bottle") and at what respective start/end times.|
+|[Action recognition](scenarios/action_recognition)| Base | Action recognition to identify in video/webcam footage what actions are performed (e.g. "running", "opening a bottle") and at what respective start/end times. We also provide an I3D implementation of action recognition under [contrib](contrib).|
|[Crowd counting](contrib/crowd_counting)| Contrib | Counting the number of people in low-crowd-density (e.g. less than 10 people) and high-crowd-density (e.g. thousands of people) scenarios.|
We separate the supported CV scenarios into two locations: (i) **base**: code and notebooks within the "utils_cv" and "scenarios" folders which follow strict coding guidelines, are well tested and maintained; (ii) **contrib**: code and other assets within the "contrib" folder, mainly covering less common CV scenarios using bleeding edge state-of-the-art approaches. Code in "contrib" is not regularly tested or maintained.
contrib/README.md (1 addition, 0 deletions)
@@ -9,6 +9,7 @@ Each project should live in its own subdirectory ```/contrib/<project>``` and co
| Directory | Project description | Build status (optional) |
|---|---|---|
|[Crowd counting](crowd_counting)| Counting the number of people in low-crowd-density (e.g. less than 10 people) and high-crowd-density (e.g. thousands of people) scenarios. |[](https://dev.azure.com/team-sharat/crowd-counting/_build/latest?definitionId=49&branchName=lixzhang%2Fsubmodule-rev3)|
+|[Action Recognition with I3D](action_recognition)| Action recognition to identify in video/webcam footage what actions are performed (e.g. "running", "opening a bottle") and at what respective start/end times. Please note that we also have an R(2+1)D implementation of action recognition that you can find under [scenarios](../scenarios).||
## Tools
| Directory | Project description | Build status (optional) |
contrib/action_recognition/README.md (new file)

This directory contains resources for building video-based action recognition systems.
Action recognition (also known as activity recognition) consists of classifying various actions from a sequence of frames:

We implemented two state-of-the-art approaches: (i) [I3D](https://arxiv.org/pdf/1705.07750.pdf) and (ii) [R(2+1)D](https://arxiv.org/abs/1711.11248). This includes example notebooks, e.g. for scoring webcam footage or for fine-tuning on the [HMDB-51](http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/) dataset. The R(2+1)D implementation can be accessed under [scenarios](../scenarios) at the root level.
We recommend the **R(2+1)D** model for its competitive accuracy, fast inference speed, and fewer dependencies on other packages. For both approaches, using our implementations, we were able to reproduce the accuracies reported in the papers:
| Model | Reported in the paper | Our results |
| ------- | ------- | ------- |
| R(2+1)D-34 RGB | 79.6% | 79.8% |
| I3D RGB | 74.8% | 73.7% |
| I3D Optical flow | 77.1% | 77.5% |
| I3D Two-Stream | 80.7% | 81.2% |
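As a rough sketch of what scoring looks like with the recommended model family, the snippet below runs a clip through the pre-trained 18-layer R(2+1)D network that ships with `torchvision`. Note the assumptions: the R(2+1)D-34 model from the table above is distributed separately, and real input would be normalized RGB frames rather than the random data used here.

```
import torch
import torchvision

# 18-layer R(2+1)D with Kinetics-400 weights (torchvision >= 0.4)
model = torchvision.models.video.r2plus1d_18(pretrained=True)
model.eval()

# Video models take 5-D input: (batch, channels, frames, height, width).
# Random data stands in for real, normalized RGB frames here.
clip = torch.randn(1, 3, 16, 112, 112)

with torch.no_grad():
    logits = model(clip)  # (1, 400) Kinetics-400 class scores

print(logits.argmax(dim=1).item())  # index of the most likely action
```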
## Projects

| Directory | Description |
| -------- | ----------- |
| [i3d](i3d) | Scripts for fine-tuning a pre-trained I3D model on HMDB-51 |
contrib/action_recognition/i3d/README.md (new file)

In this section we provide code for training a Two-Stream Inflated 3D ConvNet (I3D), introduced in \[[1](https://arxiv.org/pdf/1705.07750.pdf)\]. Our implementation uses the PyTorch models (and code) provided in [https://github.com/piergiaj/pytorch-i3d](https://github.com/piergiaj/pytorch-i3d), which have been pre-trained on the Kinetics Human Action Video dataset, and fine-tunes the models on the HMDB-51 action recognition dataset. The I3D model consists of two "streams", which are independently trained models: one stream takes the RGB frames of a video as input, while the other takes pre-computed optical flow as input. At test time, the outputs of the two stream models are averaged to make the final prediction. The model results are as follows:
| Model | Paper top-1 accuracy (average over 3 splits) | Our model's top-1 accuracy (split 1 only) |
| ------- | ------- | ------- |
| RGB | 74.8% | 73.7% |
| Optical flow | 77.1% | 77.5% |
| Two-Stream | 80.7% | 81.2% |
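The averaging step described above is simple late fusion. A minimal sketch, with `rgb_model` and `flow_model` as hypothetical stand-ins for the two fine-tuned I3D streams:

```
import torch

def two_stream_predict(rgb_model, flow_model, rgb_clip, flow_clip):
    """Average RGB and optical-flow logits and return the predicted class."""
    with torch.no_grad():
        rgb_logits = rgb_model(rgb_clip)     # (batch, num_classes)
        flow_logits = flow_model(flow_clip)  # (batch, num_classes)
    fused = (rgb_logits + flow_logits) / 2   # equal-weight late fusion
    return fused.argmax(dim=1)
```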
## Download and pre-process HMDB-51 data
Download the HMDB-51 video database from [here](http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/). Extract the videos with:

```
# create working directories for the archives and the extracted videos
mkdir rars && mkdir videos
# the top-level archive contains one rar file per action class
unrar x hmdb51-org.rar rars/
# extract every per-class archive into videos/
for a in $(ls rars); do unrar x "rars/${a}" videos/; done
```

Use the code provided in [https://github.com/yjxiong/temporal-segment-networks](https://github.com/yjxiong/temporal-segment-networks) to preprocess the raw videos, i.e. to split them into RGB frames and to compute optical flow frames.
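Once the frames have been extracted, the optical-flow stream consumes clips with two channels (the x- and y-flow components). The sketch below shows one way to assemble such a clip; the `flow_x_*.jpg`/`flow_y_*.jpg` naming and the [0, 255] encoding are assumptions about the extraction output, not guaranteed by the tools above.

```
import numpy as np
import torch
from PIL import Image

def load_flow_clip(frame_dir, num_frames=16):
    """Stack per-frame flow images into a (1, 2, T, H, W) tensor for the flow stream."""
    frames = []
    for i in range(1, num_frames + 1):
        fx = np.array(Image.open(f"{frame_dir}/flow_x_{i:05d}.jpg"), dtype=np.float32)
        fy = np.array(Image.open(f"{frame_dir}/flow_y_{i:05d}.jpg"), dtype=np.float32)
        # map 8-bit flow images from [0, 255] back to [-1, 1]
        frames.append(np.stack([fx, fy]) / 127.5 - 1.0)
    clip = torch.from_numpy(np.stack(frames, axis=1))  # (2, T, H, W)
    return clip.unsqueeze(0)                           # add batch dimension
```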