This project demonstrates an AI-powered ETL pipeline that automatically detects anomalies and data quality issues in structured datasets. It integrates data ingestion, preprocessing, machine learning-based anomaly detection, and a FastAPI deployment to provide actionable insights.
The repository is designed to showcase skills in:
- Python-based ETL pipelines
- Machine Learning (anomaly detection using Scikit-Learn)
- Backend deployment with FastAPI
- Data quality monitoring and reporting
- Clean, professional project structure for enterprise-level applications
- Ingest data from CSV files or databases
- Perform data cleaning and preprocessing
- Feature engineering for anomaly detection
- Train and evaluate an ML model to detect anomalies
- Generate reports highlighting data quality issues
- Expose a FastAPI `/predict` endpoint for real-time anomaly scoring
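End to end, the flow looks roughly like the sketch below. It is illustrative only: the real pipeline lives in `src/`, and the file, column, and output names here are assumptions, not the project's actual API.

```python
# Illustrative end-to-end sketch; file, column, and output names are assumptions.
import pandas as pd
from sklearn.ensemble import IsolationForest

# 1. Ingest
df = pd.read_csv("data/raw/synthetic_transactions.csv")

# 2. Clean: drop rows missing the fields we need
df = df.dropna(subset=["timestamp", "Amount"])

# 3. Feature engineering: e.g. hour of day alongside the raw amount
df["timestamp"] = pd.to_datetime(df["timestamp"])
df["hour"] = df["timestamp"].dt.hour

# 4. Detect anomalies (-1 = anomaly, 1 = normal)
model = IsolationForest(contamination=0.01, random_state=42)
df["prediction"] = model.fit_predict(df[["Amount", "hour"]])

# 5. Report flagged rows
anomalies = df[df["prediction"] == -1]
anomalies.to_csv("data/results/anomaly_report.csv", index=False)
print(f"Flagged {len(anomalies)} of {len(df)} transactions")
```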
- Path: `data/raw/synthetic_transactions.csv`
- Includes:
  - Normal transactions
  - Injected anomalies: large amounts, negative/zero values, category deviations
- Fully included in the repo for exploration and modeling
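For a quick look at the injected issues, something like this works (a sketch; the `Amount` and `category` column names mirror the API schema shown later):

```python
import pandas as pd

df = pd.read_csv("data/raw/synthetic_transactions.csv")

# Data-quality checks mirroring the injected anomaly types
print(df.isna().sum())                                 # missing values per column
print((df["Amount"] <= 0).sum(), "non-positive amounts")
print(df[df["Amount"] > df["Amount"].quantile(0.99)])  # unusually large amounts
print(df["category"].value_counts())                   # spot deviant categories
```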
- Original dataset: Credit Card Fraud Detection (Kaggle)
- Note: `creditcard.csv` is too large for GitHub, so it is not included in this repo.
- To use the Kaggle dataset locally:
  1. Sign in to Kaggle and download `creditcard.csv`.
  2. Place it in the folder: `data/raw/creditcard.csv`
  3. The Day 3 notebook will automatically load it from this path.
```python
# Example: loading the Kaggle data once it is in place
import pandas as pd

df_kaggle = pd.read_csv('data/raw/creditcard.csv')
```

This project uses Pipenv for dependency management:

```bash
pipenv install --dev
pipenv shell
```

Alternatively, if you prefer pip:

```bash
pip install -r requirements.txt
```
This project includes a set of structured Jupyter notebooks that walk through the full lifecycle of the anomaly detection pipeline:
1. Initial EDA: anomaly visualization, data distributions, missing values, and exploratory insights.
2. ETL pipeline construction: cleaning, scaling, handling skewed features, and feature engineering.
3. Model training (Isolation Forest or others): tuning, evaluation metrics, ROC/AUC, and result interpretation.
A detailed explanation of each notebook is provided in `notebooks/README.md` to help reviewers understand the design decisions and methodology.
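As a flavor of the modeling step, here is a minimal Isolation Forest sketch with a ROC/AUC check. It assumes a 0/1 ground-truth column, called `label` here purely for illustration; adapt it to the actual column name in the data.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data/raw/synthetic_transactions.csv")

# Scale the numeric feature(s); "label" is a hypothetical ground-truth column
X = StandardScaler().fit_transform(df[["Amount"]])
y = df["label"]

model = IsolationForest(n_estimators=200, contamination="auto", random_state=42)
model.fit(X)

# score_samples returns higher = more normal, so negate for an anomaly score
anomaly_score = -model.score_samples(X)
print("ROC AUC:", roc_auc_score(y, anomaly_score))
```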
This project includes a production-ready FastAPI microservice that exposes the unified anomaly detection pipeline for real-time and batch inference.
Base URL: `http://localhost:8000`
GET /health
Simple heartbeat.

Response:

```json
{ "status": "ok", "message": "Anomaly Detection API is running." }
```

GET /metadata
Returns model, scaler, and preprocessor metadata.
POST /predict
Perform real-time anomaly detection on one transaction.

Request:

```json
{
  "timestamp": "2025-01-01T11:22:00",
  "customer_id": 101,
  "Amount": 129.55,
  "category": "grocery",
  "status": 0
}
```

Response:

```json
{
  "prediction": 0,
  "anomaly_score": -0.21
}
```
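A single transaction can be scored from Python like so (a sketch against the request schema above):

```python
import requests

payload = {
    "timestamp": "2025-01-01T11:22:00",
    "customer_id": 101,
    "Amount": 129.55,
    "category": "grocery",
    "status": 0,
}
resp = requests.post("http://localhost:8000/predict", json=payload)
print(resp.json())  # e.g. {"prediction": 0, "anomaly_score": -0.21}
```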
POST /predict_batch
Perform anomaly detection on multiple transactions.

Request:

```json
{
  "records": [
    { "timestamp": "...", "Amount": 129.55, "category": "grocery" },
    { "timestamp": "...", "Amount": 980.25, "category": "tech" }
  ]
}
```

Response:
```json
{
  "count": 2,
  "predictions": [0, 1],
  "anomaly_scores": [-0.21, 0.88]
}
```
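And the batch endpoint from Python (illustrative timestamps filled in):

```python
import requests

batch = {
    "records": [
        {"timestamp": "2025-01-01T11:22:00", "Amount": 129.55, "category": "grocery"},
        {"timestamp": "2025-01-01T12:05:00", "Amount": 980.25, "category": "tech"},
    ]
}
resp = requests.post("http://localhost:8000/predict_batch", json=batch)
print(resp.json())  # {"count": 2, "predictions": [...], "anomaly_scores": [...]}
```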
Run the API locally:

```bash
uvicorn app:app --host 0.0.0.0 --port 8000 --reload
```

Or build and run with Docker:

```bash
docker build -t anomaly-api .
docker run -p 8000:8000 anomaly-api
```
Project structure:

```
ai-etl-anomaly-detection/
  data/
    raw/
    processed/
    results/
  models/
  notebooks/
  src/
    BaseCLasses/
      base_preprocessor.py
    Preprocessors/
      kaggle_preprocessor.py
      synthetic_preprocessor.py
      unified_preprocessor.py
    data_loader.py
    preprocessing.py
    feature_engineering.py
    model.py
    evaluate.py
    api.py
  tests/
  Pipfile
  Pipfile.lock
  requirements.txt   (optional)
  README.md          (this file)
  .gitignore
```
This project is licensed under the MIT License.
- Add automated ETL orchestration with Airflow
- Implement real-time anomaly monitoring dashboards
- Include additional ML models (e.g., Autoencoders) for advanced anomaly detection
- Deploy API to cloud services (AWS, GCP, Azure)
Author: D Fashimpaur
LinkedIn