Commit af63f87
restore i3d + integration test (#562)
* restore i3d + integration test * pipeline timeout update * fix * action rec integration * black
1 parent cb51b0f commit af63f87

25 files changed (+1503 / -3 lines)

README.md

Lines changed: 1 addition & 1 deletion
@@ -54,7 +54,7 @@ The following is a summary of commonly used Computer Vision scenarios that are c
| [Detection](scenarios/detection) | Base | Object Detection is a technique that allows you to detect the bounding box of an object within an image. |
| [Keypoints](scenarios/keypoints) | Base | Keypoint detection can be used to detect specific points on an object. A pre-trained model is provided to detect body joints for human pose estimation. |
| [Segmentation](scenarios/segmentation) | Base | Image Segmentation assigns a category to each pixel in an image. |
- | [Action recognition](scenarios/action_recognition) | Base | Action recognition to identify in video/webcam footage what actions are performed (e.g. "running", "opening a bottle") and at what respective start/end times.|
+ | [Action recognition](scenarios/action_recognition) | Base | Action recognition to identify in video/webcam footage what actions are performed (e.g. "running", "opening a bottle") and at what respective start/end times. We also provide an I3D implementation of action recognition, which can be found under [contrib](contrib). |
| [Crowd counting](contrib/crowd_counting) | Contrib | Counting the number of people in low-crowd-density (e.g. less than 10 people) and high-crowd-density (e.g. thousands of people) scenarios.|

We separate the supported CV scenarios into two locations: (i) **base**: code and notebooks within the "utils_cv" and "scenarios" folders which follow strict coding guidelines, are well tested and maintained; (ii) **contrib**: code and other assets within the "contrib" folder, mainly covering less common CV scenarios using bleeding edge state-of-the-art approaches. Code in "contrib" is not regularly tested or maintained.

contrib/README.md

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ Each project should live in its own subdirectory ```/contrib/<project>``` and co
| Directory | Project description | Build status (optional) |
|---|---|---|
| [Crowd counting](crowd_counting) | Counting the number of people in low-crowd-density (e.g. less than 10 people) and high-crowd-density (e.g. thousands of people) scenarios. | [![Build Status](https://dev.azure.com/team-sharat/crowd-counting/_apis/build/status/lixzhang.cnt?branchName=lixzhang%2Fsubmodule-rev3)](https://dev.azure.com/team-sharat/crowd-counting/_build/latest?definitionId=49&branchName=lixzhang%2Fsubmodule-rev3)|
+ | [Action Recognition with I3D](action_recognition) | Action recognition to identify in video/webcam footage what actions are performed (e.g. "running", "opening a bottle") and at what respective start/end times. Please note that we also have an R(2+1)D implementation of action recognition, which you can find under [scenarios](../scenarios). | |

## Tools
| Directory | Project description | Build status (optional) |
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
# Action Recognition

This directory contains resources for building video-based action recognition systems.

Action recognition (also known as activity recognition) consists of classifying various actions from a sequence of frames:

![](./media/action_recognition2.gif "Example of action recognition")

We implemented two state-of-the-art approaches: (i) [I3D](https://arxiv.org/pdf/1705.07750.pdf) and (ii) [R(2+1)D](https://arxiv.org/abs/1711.11248). This includes example notebooks, e.g. for scoring webcam footage or fine-tuning on the [HMDB-51](http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/) dataset. The R(2+1)D implementation can be accessed under [scenarios](../scenarios) at the root level.

We recommend using the **R(2+1)D** model for its competitive accuracy, fast inference speed, and fewer dependencies on other packages. For both approaches, our implementations were able to reproduce the reported accuracies:

| Model | Reported in the paper | Our results |
| ------- | ------- | ------- |
| R(2+1)D-34 RGB | 79.6% | 79.8% |
| I3D RGB | 74.8% | 73.7% |
| I3D Optical flow | 77.1% | 77.5% |
| I3D Two-Stream | 80.7% | 81.2% |

## Projects

| Directory | Description |
| -------- | ----------- |
| [i3d](i3d) | Scripts for fine-tuning a pre-trained I3D model on the HMDB-51 dataset. |
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
__pycache__/
models/__pycache__/
log/
.vscode/
checkpoints/
pretrained_models/
inference/.ipynb_checkpoints/
Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
## Fine-tuning I3D model on HMDB-51

In this section we provide code for training a Two-Stream Inflated 3D ConvNet (I3D), introduced in \[[1](https://arxiv.org/pdf/1705.07750.pdf)\]. Our implementation uses the PyTorch models (and code) provided in [https://github.com/piergiaj/pytorch-i3d](https://github.com/piergiaj/pytorch-i3d), which have been pre-trained on the Kinetics Human Action Video dataset, and fine-tunes them on the HMDB-51 action recognition dataset. The I3D model consists of two "streams", which are independently trained models: one stream takes the RGB frames of a video as input, while the other takes pre-computed optical flow as input. At test time, the outputs of the two stream models are averaged to make the final prediction. The model results are as follows:

| Model | Paper top 1 accuracy (average over 3 splits) | Our top 1 accuracy (split 1 only) |
| ------- | ------- | ------- |
| RGB | 74.8 | 73.7 |
| Optical flow | 77.1 | 77.5 |
| Two-Stream | 80.7 | 81.2 |

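As a rough illustration of the test-time fusion described above, the sketch below averages the per-class scores of the two streams. The actual evaluation lives in `test.py`; the model and input names here are placeholders, not names from the repo.

```
import torch

def two_stream_predict(rgb_model, flow_model, rgb_clip, flow_clip):
    """Sketch: average per-class scores of the RGB and optical-flow streams."""
    rgb_model.eval()
    flow_model.eval()
    with torch.no_grad():
        rgb_scores = torch.softmax(rgb_model(rgb_clip), dim=1)     # (batch, num_classes)
        flow_scores = torch.softmax(flow_model(flow_clip), dim=1)  # (batch, num_classes)
        fused = (rgb_scores + flow_scores) / 2                     # two-stream average
    return fused.argmax(dim=1)                                     # predicted class ids
```
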
## Download and pre-process HMDB-51 data

Download the HMDB-51 video database from [here](http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/). Extract the videos with
```
mkdir rars && mkdir videos
unrar x hmdb51-org.rar rars/
for a in $(ls rars); do unrar x "rars/${a}" videos/; done;
```

Use the code provided in [https://github.com/yjxiong/temporal-segment-networks](https://github.com/yjxiong/temporal-segment-networks) to pre-process the raw videos, i.e. split the videos into RGB frames and compute optical flow frames:
```
git clone https://github.com/yjxiong/temporal-segment-networks
cd temporal-segment-networks
bash scripts/extract_optical_flow.sh /path/to/hmdb51/videos /path/to/rawframes/output
```
Edit the _C.DATASET.DIR option in [default.py](default.py) to point to the rawframes input data directory.
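
For reference, the frame layout that the dataset loader added later in this commit expects under `_C.DATASET.DIR` looks roughly like the sketch below. The class and clip names are illustrative; the file-name patterns (`img_*`, `flow_x_*`, `flow_y_*`) come from `dataset.py`.

```
import os

data_root = "/path/to/rawframes/output/"  # value to set for _C.DATASET.DIR
clip_dir = os.path.join(data_root, "brush_hair", "some_clip")    # <class>/<video>
rgb_frame = os.path.join(clip_dir, "img_{:05}.jpg".format(1))    # RGB frames
flow_x = os.path.join(clip_dir, "flow_x_{:05}.jpg".format(1))    # optical flow, x
flow_y = os.path.join(clip_dir, "flow_y_{:05}.jpg".format(1))    # optical flow, y
print(rgb_frame, flow_x, flow_y)
```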

## Setup environment

```
conda env create -f environment.yaml
conda activate i3d
```

## Download pretrained models

```
bash download_models.sh
```

## Fine-tune pretrained models on HMDB-51

Train the RGB model:
```
python train.py --cfg config/train_rgb.yaml
```

Train the flow model:
```
python train.py --cfg config/train_flow.yaml
```

Evaluate the combined model:
```
python test.py
```

\[1\] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
MODEL:
  NAME: "i3d_flow"
TRAIN:
  MODALITY: "flow"
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
MODEL:
  NAME: "i3d_rgb"
TRAIN:
  MODALITY: "RGB"
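
These small YAML files override the defaults defined in `default.py` when passed to `train.py` via `--cfg`. Below is a minimal sketch of how such an override is typically merged, assuming the usual yacs-style pattern suggested by the `_C.` naming; the helper function and the default values are illustrative, not taken from the repo.

```
from yacs.config import CfgNode as CN

# Illustrative defaults; the real ones live in default.py.
_C = CN()
_C.MODEL = CN()
_C.MODEL.NAME = "i3d_rgb"
_C.TRAIN = CN()
_C.TRAIN.MODALITY = "RGB"
_C.DATASET = CN()
_C.DATASET.DIR = "/datadir/rawframes/"

def load_config(cfg_file):
    """Clone the defaults and merge a YAML override such as config/train_flow.yaml."""
    cfg = _C.clone()
    cfg.merge_from_file(cfg_file)
    cfg.freeze()
    return cfg
```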
Lines changed: 244 additions & 0 deletions
@@ -0,0 +1,244 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

# Adapted from https://github.com/feiyunzhang/i3d-non-local-pytorch/blob/master/dataset.py

import torch.utils.data as data
import torch

from PIL import Image
import os
import os.path
import numpy as np
from numpy.random import randint
from pathlib import Path

import torchvision
from torchvision import datasets, transforms
from videotransforms import (
    GroupRandomCrop, GroupRandomHorizontalFlip,
    GroupScale, GroupCenterCrop, GroupNormalize, Stack
)

from itertools import cycle

class VideoRecord(object):
    """Lightweight wrapper around a (video directory, label) pair."""

    def __init__(self, row):
        self._data = row

    @property
    def path(self):
        return self._data[0]

    @property
    def num_frames(self):
        # Number of extracted RGB frames in the video directory
        return int(
            len([x for x in Path(
                self._data[0]).glob('img_*')]) - 1)

    @property
    def label(self):
        return int(self._data[1])

class I3DDataSet(data.Dataset):
    def __init__(self, data_root, split=1, sample_frames=64,
                 modality='RGB', transform=lambda x: x,
                 train_mode=True, sample_frames_at_test=False):

        self.data_root = data_root
        self.split = split
        self.sample_frames = sample_frames
        self.modality = modality
        self.transform = transform
        self.train_mode = train_mode
        self.sample_frames_at_test = sample_frames_at_test

        self._parse_split_files()

    def _parse_split_files(self):
        # Class labels are assigned by sorting the file names in the
        # ./data/hmdb51_splits directory
        file_list = sorted(Path('./data/hmdb51_splits').glob('*' + str(self.split) + '.txt'))
        video_list = []
        for class_idx, f in enumerate(file_list):
            class_name = str(f).strip().split('/')[2][:-16]
            for line in open(f):
                tokens = line.strip().split(' ')
                video_path = self.data_root + class_name + '/' + tokens[0][:-4]
                record = (video_path, class_idx)
                # 1 indicates video should be in the training set
                if self.train_mode and (tokens[-1] == '1'):
                    video_list.append(VideoRecord(record))
                # 2 indicates video should be in the test set
                elif (not self.train_mode) and (tokens[-1] == '2'):
                    video_list.append(VideoRecord(record))

        self.video_list = video_list

    def _load_image(self, directory, idx):
        if self.modality == 'RGB':
            img_path = os.path.join(directory, 'img_{:05}.jpg'.format(idx))
            try:
                img = Image.open(img_path).convert('RGB')
            except Exception:
                print("Couldn't load image:{}".format(img_path))
                return None
            return img
        else:
            try:
                img_path = os.path.join(directory, 'flow_x_{:05}.jpg'.format(idx))
                x_img = Image.open(img_path).convert('L')
            except Exception:
                print("Couldn't load image:{}".format(img_path))
                return None
            try:
                img_path = os.path.join(directory, 'flow_y_{:05}.jpg'.format(idx))
                y_img = Image.open(img_path).convert('L')
            except Exception:
                print("Couldn't load image:{}".format(img_path))
                return None
            # Combine flow images into a single PIL image
            x_img = np.array(x_img, dtype=np.float32)
            y_img = np.array(y_img, dtype=np.float32)
            img = np.asarray([x_img, y_img]).transpose([1, 2, 0])
            img = Image.fromarray(img.astype('uint8'))
            return img

    def _sample_indices(self, record):
        if record.num_frames > self.sample_frames:
            start_pos = randint(record.num_frames - self.sample_frames + 1)
            indices = range(start_pos, start_pos + self.sample_frames, 1)
        else:
            indices = [x for x in range(record.num_frames)]
            if len(indices) < self.sample_frames:
                self._loop_indices(indices)
        return indices

    def _loop_indices(self, indices):
        # Repeat frame indices cyclically until sample_frames indices are available
        indices_cycle = cycle(indices)
        while len(indices) < self.sample_frames:
            indices.append(next(indices_cycle))

    def __getitem__(self, index):
        record = self.video_list[index]
        # Sample frames from the video for training, or if sampling is
        # turned on at test time
        if self.train_mode or self.sample_frames_at_test:
            segment_indices = self._sample_indices(record)
        else:
            segment_indices = [i for i in range(record.num_frames)]
        # Image files are 1-indexed
        segment_indices = [i + 1 for i in segment_indices]
        # Get video frame images
        images = []
        for i in segment_indices:
            seg_img = self._load_image(record.path, i)
            if seg_img is None:
                raise ValueError("Couldn't load", record.path, i)
            images.append(seg_img)
        # Apply transformations
        transformed_images = self.transform(images)

        return transformed_images, record.label

    def __len__(self):
        return len(self.video_list)

if __name__ == '__main__':

    input_size = 224
    resize_small_edge = 256

    train_rgb = I3DDataSet(
        data_root='/datadir/rawframes/',
        split=1,
        sample_frames=64,
        modality='RGB',
        train_mode=True,
        sample_frames_at_test=False,
        transform=torchvision.transforms.Compose([
            GroupScale(resize_small_edge),
            GroupRandomCrop(input_size),
            GroupRandomHorizontalFlip(),
            GroupNormalize(modality="RGB"),
            Stack(),
        ])
    )
    item = train_rgb.__getitem__(10)
    print("train_rgb:")
    print(item[0].size())
    print("max=", item[0].max())
    print("min=", item[0].min())
    print("label=", item[1])

    val_rgb = I3DDataSet(
        data_root='/datadir/rawframes/',
        split=1,
        sample_frames=64,
        modality='RGB',
        train_mode=False,
        sample_frames_at_test=False,
        transform=torchvision.transforms.Compose([
            GroupScale(resize_small_edge),
            GroupCenterCrop(input_size),
            GroupNormalize(modality="RGB"),
            Stack(),
        ])
    )
    item = val_rgb.__getitem__(10)
    print("val_rgb:")
    print(item[0].size())
    print("max=", item[0].max())
    print("min=", item[0].min())
    print("label=", item[1])

    train_flow = I3DDataSet(
        data_root='/datadir/rawframes/',
        split=1,
        sample_frames=64,
        modality='flow',
        train_mode=True,
        sample_frames_at_test=False,
        transform=torchvision.transforms.Compose([
            GroupScale(resize_small_edge),
            GroupRandomCrop(input_size),
            GroupRandomHorizontalFlip(),
            GroupNormalize(modality="flow"),
            Stack(),
        ])
    )
    item = train_flow.__getitem__(100)
    print("train_flow:")
    print(item[0].size())
    print("max=", item[0].max())
    print("min=", item[0].min())
    print("label=", item[1])

    val_flow = I3DDataSet(
        data_root='/datadir/rawframes/',
        split=1,
        sample_frames=64,
        modality='flow',
        train_mode=False,
        sample_frames_at_test=False,
        transform=torchvision.transforms.Compose([
            GroupScale(resize_small_edge),
            GroupCenterCrop(input_size),
            GroupNormalize(modality="flow"),
            Stack(),
        ])
    )
    item = val_flow.__getitem__(100)
    print("val_flow:")
    print(item[0].size())
    print("max=", item[0].max())
    print("min=", item[0].min())
    print("label=", item[1])

0 commit comments