This project focuses on detecting human activity anomalies using time series classification models trained on video frame sequences. The objective is to classify actions from video data in two settings:
- Binary Classification: Fall vs Normal
- Multi-Class Classification: Exercise, Falling, Lying, Running, Sitting, Standing, Walking, Walking Downstairs, Walking Upstairs
- Original data was in video format.
- Converted to image frames at:
  - 1 FPS for binary classification
  - 2 FPS for multi-class classification
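The frame-extraction step could look roughly like the sketch below. It assumes OpenCV (`cv2`) is available for decoding. The function names, the output file pattern, and the rule for choosing which frames to keep are illustrative assumptions, not the project's actual code.

```python
from pathlib import Path


def sampled_frame_indices(total_frames: int, native_fps: float, target_fps: float) -> list:
    """Return the indices of frames to keep so roughly `target_fps` frames per second remain."""
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    # Step between kept frames; at least 1 so we never ask for more frames than exist.
    step = max(native_fps / target_fps, 1.0)
    indices = []
    position = 0.0
    while int(position) < total_frames:
        indices.append(int(position))
        position += step
    return indices


def extract_frames(video_path: str, out_dir: str, target_fps: float) -> int:
    """Decode `video_path`, save the sampled frames as JPEGs, and return how many were saved."""
    import cv2  # imported lazily so the index helper above works without OpenCV

    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    keep = set(sampled_frame_indices(total, native_fps, target_fps))

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    saved = 0
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index in keep:
            cv2.imwrite(str(Path(out_dir) / f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved


# A 3-second clip recorded at 30 FPS (90 frames):
print(sampled_frame_indices(90, 30.0, 1.0))  # binary task, 1 FPS
print(sampled_frame_indices(90, 30.0, 2.0))  # multi-class task, 2 FPS
```

Under these assumptions, `extract_frames("clip.mp4", "frames/clip", target_fps=1)` would be used for the binary task and `target_fps=2` for the multi-class task; both paths are placeholders.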
- CNN + LSTM
- I3D (Inflated 3D ConvNet)
- YOLOv8 + ResNet50
- YOLOv8 + ResNet101
These models were chosen to capture both spatial and temporal features effectively.
Why This Model
CNN + LSTM is well suited to modeling spatiotemporal sequences in video-based tasks. For each sampled frame, the CNN extracts spatial features; the LSTM then captures temporal dependencies across the frame sequence.
How It Works
- CNN extracts spatial features (e.g., postures, objects).
- LSTM models sequence changes over time.
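Before the CNN and LSTM stages can run, the extracted frames have to be grouped into fixed-length sequences. The sketch below shows one way to do that with a sliding window. The window length, the stride, and the function name are assumptions for illustration, since the write-up does not state how sequences were built.

```python
def make_sequences(frames: list, window: int, stride: int) -> list:
    """Split an ordered list of frames (or per-frame CNN features) into
    fixed-length windows. A trailing remainder shorter than `window` is dropped."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    return [
        frames[start:start + window]
        for start in range(0, len(frames) - window + 1, stride)
    ]


# Example: 10 frames sampled at 2 FPS, windows of 4 frames (2 seconds), 50% overlap.
frames = [f"frame_{i:05d}.jpg" for i in range(10)]
sequences = make_sequences(frames, window=4, stride=2)

print(len(sequences))   # 4 windows, starting at frames 0, 2, 4 and 6
print(sequences[-1])    # the last window covers frames 6 to 9
```

Each window would then be passed through the CNN frame by frame, and the resulting feature vectors fed to the LSTM in order.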
Pros
- Good generalization on small–medium datasets.
- Captures both appearance and time dynamics.
Cons
- Slower training because the LSTM processes frames sequentially.
- Struggles with very long sequences.
Comparison
- Better than YOLO+ResNet when temporal structure is key.
- Not as optimized as I3D for spatiotemporal features.
Why This Model
I3D is built for video classification: it inflates 2D ConvNets to 3D so that spatial and temporal features are learned together.
How It Works
- Extends a 2D CNN (e.g., Inception) into 3D, with filters spanning height, width, and time.
- Processes stacked frames as a single video volume.
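The inflation idea can be illustrated with plain Python lists: a 2D kernel is repeated along a new time axis and divided by the temporal depth, so that a clip made of one repeated frame produces the same response as the original 2D filter. This is a toy sketch with hypothetical function names, not the I3D implementation used in the project.

```python
def inflate_kernel(kernel_2d: list, depth: int) -> list:
    """Repeat a 2D kernel `depth` times along the time axis and rescale by 1/depth."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return [[[w / depth for w in row] for row in kernel_2d] for _ in range(depth)]


def response_2d(kernel: list, patch: list) -> float:
    """Filter response of one 2D kernel on one same-sized image patch."""
    return sum(k * p for k_row, p_row in zip(kernel, patch) for k, p in zip(k_row, p_row))


def response_3d(kernel_3d: list, clip: list) -> float:
    """Filter response of a 3D kernel on a clip with one patch per time step."""
    return sum(response_2d(k, patch) for k, patch in zip(kernel_3d, clip))


# A Sobel-like 2D edge filter and a 3x3 image patch.
kernel = [[1.0, 0.0, -1.0],
          [2.0, 0.0, -2.0],
          [1.0, 0.0, -1.0]]
patch = [[3.0, 1.0, 0.0],
         [4.0, 1.0, 1.0],
         [5.0, 2.0, 2.0]]

# A "static video": the same patch repeated over 4 time steps.
static_clip = [patch] * 4
kernel_3d = inflate_kernel(kernel, depth=4)

print(response_2d(kernel, patch))            # 12.0
print(response_3d(kernel_3d, static_clip))   # 12.0, same as the 2D response
```

The rescaling is what lets pretrained 2D weights serve as a starting point for the 3D network: on a static clip the inflated filter behaves like the 2D one, and training then adapts it to real motion.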
Pros
- Strong temporal modeling.
- No separate LSTM needed.
Cons
- Requires high GPU memory and training time.
- Needs large datasets.
Comparison
- Outperforms CNN+LSTM for short, dense action clips.
- Not ideal for low compute setups.
Why This Model
YOLOv8 + ResNet50 combines fast object detection (YOLOv8) with feature extraction (ResNet50). It is good for identifying action-relevant regions before classification.
How It Works
- YOLOv8 detects human/action regions in each frame.
- Cropped regions are passed to ResNet50 for feature encoding.
- The encoded features are then classified.
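The detect-then-crop part of this pipeline might be sketched as follows. The `Detection` class, the `"person"` label string, and the confidence threshold are hypothetical stand-ins for whatever the detector actually returns. The frame is a nested list here only to keep the example dependency-free.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Detection:
    """Hypothetical detector output: pixel box corners, class label, confidence."""
    x1: float
    y1: float
    x2: float
    y2: float
    label: str
    confidence: float


def best_person_box(detections: list, min_conf: float = 0.5) -> Optional[Detection]:
    """Pick the most confident 'person' detection, or None if none passes the threshold."""
    people = [d for d in detections if d.label == "person" and d.confidence >= min_conf]
    return max(people, key=lambda d: d.confidence, default=None)


def crop(frame: list, det: Detection) -> list:
    """Crop a non-empty frame (rows of pixels) to the box, clamped to the frame bounds."""
    height, width = len(frame), len(frame[0])
    x1, y1 = max(0, int(det.x1)), max(0, int(det.y1))
    x2, y2 = min(width, int(det.x2)), min(height, int(det.y2))
    return [row[x1:x2] for row in frame[y1:y2]]


# Toy 6x8 "frame" whose pixel value encodes its position (row * 10 + column).
frame = [[y * 10 + x for x in range(8)] for y in range(6)]
detections = [
    Detection(2, 1, 5, 4, "person", 0.90),
    Detection(0, 0, 3, 3, "chair", 0.95),   # ignored: not a person
    Detection(4, 2, 8, 6, "person", 0.40),  # ignored: below the threshold
]

best = best_person_box(detections)
region = crop(frame, best)
print(region)  # [[12, 13, 14], [22, 23, 24], [32, 33, 34]]
```

In the full pipeline the cropped region would be resized and passed to the ResNet backbone for feature encoding. Frames with no qualifying detection (`best is None`) need a fallback, such as using the whole frame.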
Pros
- Fast, lightweight.
- Excellent for localized motion detection.
Cons
- ResNet50 may miss finer spatial cues that a deeper backbone would capture.
- No inherent temporal modeling.
Comparison
- Highly interpretable (detections visible).
- Less temporal awareness than I3D or CNN+LSTM.
Why This Model
YOLOv8 + ResNet101 uses the same detect-then-encode pipeline as YOLOv8 + ResNet50, but the deeper ResNet101 backbone improves learning of complex features.
How It Works
- YOLOv8 for detection.
- ResNet101 for deeper feature encoding.
Pros
- Higher accuracy than ResNet50.
- Retains YOLO's speed for detection.
Cons
- Heavier, slower than ResNet50.
- Still no temporal modeling.
Comparison
- More accurate than ResNet50.
- Less unified than I3D/CNN+LSTM for spatial-temporal learning.
| Model | Spatial | Temporal | Speed | Accuracy | Data Requirement | Best For |
|---|---|---|---|---|---|---|
| CNN + LSTM | ✅ | ✅ |  | ✅ Good | Low–Medium | Balanced Tasks |
| I3D | ✅✅ | ✅✅ | ❌ Slow | ✅✅ High | High | Dense Video Patterns |
| YOLOv8 + ResNet50 | ✅✅ | ❌ | ✅ Fast |  | Low–Medium | Real-time Detection |
| YOLOv8 + ResNet101 | ✅✅✅ | ❌ |  | ✅ Better | Medium–High | Complex Scenes |
| Model | Description | Pros | Cons |
|---|---|---|---|
| TSM (Temporal Shift Module) | Efficient temporal modeling in CNNs | Low compute, good accuracy | Less explored than I3D |
| SlowFast Networks | Two-stream CNN for slow and fast motions | Strong for activity recognition | High GPU need |
| Transformer-based | Attention across space & time (TimeSformer, ViViT) | State-of-the-art performance | Large data + compute |
| ST-GCN | Graph model using skeleton/keypoints | Lightweight, accurate | Needs pose estimation step |
- Binary Classification (Fall vs Normal): CNN+LSTM is sufficient and interpretable.
- Multi-Class: I3D or YOLOv8 + ResNet101 if you have enough data and compute.
- Real-Time: YOLOv8 + ResNet50 or fast temporal networks.
| Model | Confusion Matrix |
|---|---|
| CNN + LSTM | ![]() |
| I3D | ![]() |
| YOLOv8 + ResNet50 | ![]() |
| YOLOv8 + ResNet101 | ![]() |
| Model | Confusion Matrix |
|---|---|
| CNN + LSTM | ![]() |
| I3D | ![]() |
| YOLOv8 + ResNet50 | ![]() |
| YOLOv8 + ResNet101 | ![]() |
- Multi-Class Dataset: Google Drive
- Binary Dataset (Fall vs Normal): Google Drive
- End-to-end anomaly detection pipeline using video frame-based time series data.
- Comparison of multiple deep learning architectures for binary and multi-class classification.
- Evaluation through confusion matrices for each model and classification type.
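
A confusion matrix for either task can be computed with a few lines of Python. The label names below come from the binary task, but the predictions are made up purely to show the layout; they are not results from this project.

```python
def confusion_matrix(y_true: list, y_pred: list, labels: list) -> list:
    """Rows are true classes, columns are predicted classes, both in `labels` order."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


labels = ["Fall", "Normal"]
# Illustrative values only, not project results.
y_true = ["Fall", "Fall", "Normal", "Normal", "Normal", "Fall"]
y_pred = ["Fall", "Normal", "Normal", "Normal", "Fall", "Fall"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>6}: {row}")
#   Fall: [2, 1]
# Normal: [1, 2]

# Recall for the "Fall" class: correctly detected falls / all true falls.
fall_recall = matrix[0][0] / sum(matrix[0])
```

The same function works for the nine-class task by passing the full list of activity labels. For fall detection, the first row matters most, since a fall predicted as normal is the costly error.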







