diff --git a/README.md b/README.md index bde835c..ed458d4 100644 --- a/README.md +++ b/README.md @@ -1,16 +1,198 @@ -# Machine Learning with Iris Dataset -![Python](https://img.shields.io/badge/python-2.x-orange.svg) -![Type](https://img.shields.io/badge/Machine-Learning-red.svg) ![Type](https://img.shields.io/badge/Type-Supervised-yellow.svg) -![Status](https://img.shields.io/badge/Status-Completed-yellowgreen.svg) +# 🌸 Machine Learning with Iris Dataset -## Introduction -The Iris dataset is a classic dataset for classification, machine learning, and data visualization. +[![Python](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/) +[![Machine Learning](https://img.shields.io/badge/Machine-Learning-red.svg)](https://scikit-learn.org/) +[![Type](https://img.shields.io/badge/Type-Supervised-yellow.svg)](https://en.wikipedia.org/wiki/Supervised_learning) +[![Status](https://img.shields.io/badge/Status-Enhanced-green.svg)](https://github.com/Hrushikeshsurabhi/Machine-Learning-with-Iris-Dataset) +[![Fork](https://img.shields.io/badge/Fork-venky14-orange.svg)](https://github.com/venky14/Machine-Learning-with-Iris-Dataset) -The dataset contains: 3 classes (different Iris species) with 50 samples each, and then four numeric properties about those classes: Sepal Length, Sepal Width, Petal Length, and Petal Width. +> **Enhanced Version**: This is a forked and improved version of the original Iris dataset project by [venky14](https://github.com/venky14/Machine-Learning-with-Iris-Dataset), featuring a modular structure, comprehensive analysis, and production-ready code. -One species, Iris Setosa, is "linearly separable" from the other two. This means that we can draw a line (or a hyperplane in higher-dimensional spaces) between Iris Setosa samples and samples corresponding to the other two species. +## šŸ“‹ Table of Contents -Predicted Attribute: Different Species of Iris plant. +- [Introduction](#introduction) +- [Project Structure](#project-structure) +- [Features](#features) +- [Quick Start](#quick-start) +- [Usage Examples](#usage-examples) +- [Documentation](#documentation) +- [Contributing](#contributing) -## Purpose -The purpose of this project was to gain introductory exposure to Machine Learning Classification concepts along with data visualization. The project makes heavy use of Scikit-Learn, Pandas and Data Visualization Libraries. +## 🌺 Introduction + +The Iris dataset is a classic dataset for classification, machine learning, and data visualization. This enhanced version provides a comprehensive analysis with a modular, production-ready structure. 
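+
+Since the dataset also ships with scikit-learn, a first look requires no cloning at all. A minimal sketch, independent of this project's modules:
+
+```python
+from sklearn.datasets import load_iris
+
+# as_frame=True (scikit-learn >= 0.23) bundles the data as a pandas DataFrame
+iris = load_iris(as_frame=True)
+print(iris.frame.shape)          # (150, 5): four features plus the target
+print(list(iris.target_names))   # ['setosa', 'versicolor', 'virginica']
+```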
+ +### Dataset Information +- **3 Classes**: Different Iris species (Setosa, Versicolor, Virginica) +- **4 Features**: Sepal Length, Sepal Width, Petal Length, Petal Width +- **150 Samples**: 50 samples per species +- **Linearly Separable**: Iris Setosa is linearly separable from the other two species + +### Enhanced Features +- šŸ—ļø **Modular Architecture**: Clean separation of concerns +- šŸ¤– **Multiple Algorithms**: 5 different classification models +- šŸ“Š **Comprehensive Visualizations**: 8+ different plot types +- šŸ”„ **Cross-Validation**: Robust model evaluation +- šŸ’¾ **Model Persistence**: Save and load trained models +- šŸ“š **Detailed Documentation**: Complete API documentation + +## šŸ“ Project Structure + +``` +Machine-Learning-with-Iris-Dataset/ +ā”œā”€ā”€ šŸ“Š data/ # Data files +│ └── Iris.csv # Original Iris dataset +ā”œā”€ā”€ šŸ““ notebooks/ # Jupyter notebooks +│ ā”œā”€ā”€ Machine Learning with Iris Dataset.ipynb +│ └── Iris Species Dataset Visualization.ipynb +ā”œā”€ā”€ šŸ”§ src/ # Source code modules +│ ā”œā”€ā”€ __init__.py +│ ā”œā”€ā”€ data_loader.py # Data loading and preprocessing +│ ā”œā”€ā”€ models.py # Machine learning models +│ └── visualization.py # Visualization functions +ā”œā”€ā”€ šŸ¤– models/ # Saved models (created during execution) +ā”œā”€ā”€ šŸ“– docs/ # Documentation +│ └── README.md # Detailed documentation +ā”œā”€ā”€ šŸš€ main.py # Main execution script +ā”œā”€ā”€ šŸ“‹ requirements.txt # Python dependencies +└── šŸ“„ README.md # This file +``` + +## ✨ Features + +### šŸŽÆ Machine Learning Models +- **Logistic Regression**: Linear classification +- **Support Vector Machine (SVM)**: Kernel-based classification +- **Random Forest**: Ensemble learning +- **K-Nearest Neighbors (KNN)**: Instance-based learning +- **Decision Tree**: Tree-based classification + +### šŸ“ˆ Analysis Capabilities +- **Data Exploration**: Comprehensive statistical analysis +- **Feature Importance**: Model interpretability +- **Model Comparison**: Performance benchmarking +- **Cross-Validation**: Robust evaluation +- **Confusion Matrix**: Detailed error analysis + +### šŸŽØ Visualization Suite +- **Distribution Plots**: Feature distributions by species +- **Correlation Matrix**: Feature relationships +- **Pair Plots**: Multi-dimensional relationships +- **Box Plots**: Statistical summaries +- **Model Performance**: Comparison charts +- **Confusion Matrix**: Error visualization +- **Feature Importance**: Model interpretability + +## šŸš€ Quick Start + +### 1. Clone the Repository +```bash +git clone https://github.com/Hrushikeshsurabhi/Machine-Learning-with-Iris-Dataset.git +cd Machine-Learning-with-Iris-Dataset +``` + +### 2. Install Dependencies +```bash +pip install -r requirements.txt +``` + +### 3. 
Run the Complete Pipeline +```bash +python main.py +``` + +This will: +- Load and preprocess the Iris dataset +- Train 5 different machine learning models +- Compare model performances +- Generate comprehensive visualizations +- Save the best performing model + +## šŸ’» Usage Examples + +### Basic Usage +```python +from src.data_loader import load_iris_data, preprocess_data +from src.models import IrisClassifier +from src.visualization import create_summary_plots + +# Load and preprocess data +df = load_iris_data() +X_train, X_test, y_train, y_test, scaler = preprocess_data(df) + +# Train models +classifier = IrisClassifier() +results = classifier.train_all_models(X_train, y_train) + +# Create visualizations +create_summary_plots(df, results, y_test, classifier.best_model.predict(X_test)) +``` + +### Advanced Usage +```python +# Custom preprocessing +X_train, X_test, y_train, y_test, scaler = preprocess_data(df, test_size=0.3, random_state=123) + +# Train with custom parameters +classifier = IrisClassifier() +results = classifier.train_all_models(X_train, y_train, cv=10) + +# Save and load models +classifier.save_model('my_iris_model.pkl') +classifier.load_model('my_iris_model.pkl') +``` + +## šŸ“š Documentation + +- **[Detailed Documentation](docs/README.md)**: Complete module descriptions and API reference +- **[Jupyter Notebooks](notebooks/)**: Interactive examples and tutorials +- **[Source Code](src/)**: Well-documented source modules + +## šŸ› ļø Dependencies + +### Core Libraries +- **pandas** (≄1.3.0): Data manipulation and analysis +- **numpy** (≄1.21.0): Numerical computing +- **scikit-learn** (≄1.0.0): Machine learning algorithms +- **matplotlib** (≄3.4.0): Basic plotting +- **seaborn** (≄0.11.0): Statistical data visualization + +### Additional Libraries +- **joblib** (≄1.1.0): Model persistence +- **jupyter** (≄1.0.0): Interactive notebooks + +## šŸ¤ Contributing + +This project is a fork of the original work by [venky14](https://github.com/venky14/Machine-Learning-with-Iris-Dataset). + +### Enhancements Made +- āœ… Modular code structure +- āœ… Comprehensive documentation +- āœ… Additional visualization capabilities +- āœ… Production-ready features +- āœ… Enhanced error handling +- āœ… Model persistence functionality + +### How to Contribute +1. Fork the repository +2. Create a feature branch (`git checkout -b feature/amazing-feature`) +3. Commit your changes (`git commit -m 'Add amazing feature'`) +4. Push to the branch (`git push origin feature/amazing-feature`) +5. Open a Pull Request + +## šŸ“„ License + +This project maintains the same license as the original repository by venky14. + +## šŸ™ Acknowledgments + +- **Original Author**: [venky14](https://github.com/venky14) for the foundational work +- **Dataset Source**: [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/iris) +- **Libraries**: scikit-learn, pandas, matplotlib, seaborn communities + +--- + +
+Made with ā¤ļø by Hrushikeshsurabhi
+
+Forked from venky14/Machine-Learning-with-Iris-Dataset
+
diff --git a/Iris.csv b/data/Iris.csv similarity index 100% rename from Iris.csv rename to data/Iris.csv diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..b7484c7 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,130 @@ +# Project Documentation + +## Overview +This is an enhanced version of the original Iris dataset machine learning project, featuring a modular structure and comprehensive analysis capabilities. + +## Project Structure + +``` +Machine-Learning-with-Iris-Dataset/ +ā”œā”€ā”€ data/ # Data files +│ └── Iris.csv # Original Iris dataset +ā”œā”€ā”€ notebooks/ # Jupyter notebooks +│ ā”œā”€ā”€ Machine Learning with Iris Dataset.ipynb +│ └── Iris Species Dataset Visualization.ipynb +ā”œā”€ā”€ src/ # Source code modules +│ ā”œā”€ā”€ __init__.py +│ ā”œā”€ā”€ data_loader.py # Data loading and preprocessing +│ ā”œā”€ā”€ models.py # Machine learning models +│ └── visualization.py # Visualization functions +ā”œā”€ā”€ models/ # Saved models (created during execution) +ā”œā”€ā”€ docs/ # Documentation +│ └── README.md # This file +ā”œā”€ā”€ main.py # Main execution script +ā”œā”€ā”€ requirements.txt # Python dependencies +ā”œā”€ā”€ README.md # Project README +ā”œā”€ā”€ .gitignore # Git ignore file +└── .gitattributes # Git attributes +``` + +## Module Descriptions + +### src/data_loader.py +- **Purpose**: Data loading and preprocessing functions +- **Key Functions**: + - `load_iris_data()`: Load dataset from CSV or sklearn + - `preprocess_data()`: Split data and apply scaling + - `get_dataset_info()`: Get dataset statistics + +### src/models.py +- **Purpose**: Machine learning models and evaluation +- **Key Classes**: + - `IrisClassifier`: Comprehensive classifier with multiple algorithms +- **Supported Models**: + - Logistic Regression + - Support Vector Machine (SVM) + - Random Forest + - K-Nearest Neighbors (KNN) + - Decision Tree + +### src/visualization.py +- **Purpose**: Data exploration and model evaluation visualizations +- **Key Functions**: + - `plot_data_distribution()`: Feature distributions by species + - `plot_correlation_matrix()`: Feature correlations + - `plot_model_comparison()`: Model performance comparison + - `create_summary_plots()`: Comprehensive visualization suite + +## Usage + +### Quick Start +1. Install dependencies: + ```bash + pip install -r requirements.txt + ``` + +2. 
Run the complete pipeline: + ```bash + python main.py + ``` + +### Using Individual Modules +```python +from src.data_loader import load_iris_data, preprocess_data +from src.models import IrisClassifier +from src.visualization import create_summary_plots + +# Load and preprocess data +df = load_iris_data() +X_train, X_test, y_train, y_test, scaler = preprocess_data(df) + +# Train models +classifier = IrisClassifier() +results = classifier.train_all_models(X_train, y_train) + +# Create visualizations +create_summary_plots(df, results, y_test, classifier.best_model.predict(X_test)) +``` + +## Features + +### Enhanced Structure +- **Modular Design**: Separated concerns into logical modules +- **Reusable Components**: Functions can be used independently +- **Comprehensive Documentation**: Detailed docstrings and examples + +### Advanced Analysis +- **Multiple Algorithms**: 5 different classification algorithms +- **Cross-Validation**: Robust model evaluation +- **Feature Importance**: Analysis for tree-based models +- **Comprehensive Visualizations**: 8+ different plot types + +### Production Ready +- **Model Persistence**: Save and load trained models +- **Error Handling**: Graceful handling of missing files +- **Scalable**: Easy to extend with new algorithms + +## Dependencies + +### Core Libraries +- **pandas**: Data manipulation and analysis +- **numpy**: Numerical computing +- **scikit-learn**: Machine learning algorithms +- **matplotlib**: Basic plotting +- **seaborn**: Statistical data visualization + +### Additional Libraries +- **joblib**: Model persistence +- **jupyter**: Interactive notebooks + +## Contributing + +This project is a fork of the original work by venky14. Enhancements include: +- Modular code structure +- Comprehensive documentation +- Additional visualization capabilities +- Production-ready features + +## License + +This project maintains the same license as the original repository. \ No newline at end of file diff --git a/example.py b/example.py new file mode 100644 index 0000000..9f3c7c3 --- /dev/null +++ b/example.py @@ -0,0 +1,61 @@ +#!/usr/bin/env python3 +""" +Simple Example Script for Iris Dataset Analysis +============================================== + +This script demonstrates basic usage of the modular components +for quick analysis and experimentation. + +Author: Hrushikeshsurabhi (Forked from venky14) +""" + +import sys +import os + +# Add src directory to path +sys.path.append(os.path.join(os.path.dirname(__file__), 'src')) + +from src.data_loader import load_iris_data, preprocess_data +from src.models import IrisClassifier +from src.visualization import plot_data_distribution, plot_correlation_matrix + + +def simple_example(): + """ + Simple example demonstrating basic functionality. + """ + print("🌸 Simple Iris Dataset Analysis Example") + print("=" * 50) + + # 1. Load data + print("\n1. Loading Iris dataset...") + df = load_iris_data() + print(f" Dataset loaded: {df.shape[0]} samples, {df.shape[1]} features") + + # 2. Quick visualization + print("\n2. Creating basic visualizations...") + plot_data_distribution(df) + plot_correlation_matrix(df) + + # 3. Train a simple model + print("\n3. Training models...") + X_train, X_test, y_train, y_test, scaler = preprocess_data(df) + + classifier = IrisClassifier() + results = classifier.train_all_models(X_train, y_train, cv=3) + + # 4. Show results + print("\n4. Results Summary:") + print(f" Best model: {classifier.best_model_name}") + print(f" Best CV score: {classifier.best_score:.4f}") + + # 5. 
Evaluate on test set + classifier.train_best_model(X_train, y_train) + evaluation = classifier.evaluate_model(X_test, y_test) + print(f" Test accuracy: {evaluation['accuracy']:.4f}") + + print("\nāœ… Example completed successfully!") + + +if __name__ == "__main__": + simple_example() \ No newline at end of file diff --git a/main.py b/main.py new file mode 100644 index 0000000..1aa9929 --- /dev/null +++ b/main.py @@ -0,0 +1,110 @@ +#!/usr/bin/env python3 +""" +Main Script for Iris Dataset Machine Learning Project +==================================================== + +This script demonstrates a complete machine learning pipeline for the Iris dataset, +including data loading, preprocessing, model training, evaluation, and visualization. + +Author: Hrushikeshsurabhi (Forked from venky14) +""" + +import sys +import os + +# Add src directory to path +sys.path.append(os.path.join(os.path.dirname(__file__), 'src')) + +from src.data_loader import load_iris_data, preprocess_data, get_dataset_info +from src.models import IrisClassifier, create_model_comparison_report +from src.visualization import create_summary_plots, plot_confusion_matrix +import pandas as pd + + +def main(): + """ + Main function to run the complete machine learning pipeline. + """ + print("=" * 60) + print("IRIS DATASET MACHINE LEARNING PROJECT") + print("=" * 60) + print("Forked from venky14/Machine-Learning-with-Iris-Dataset") + print("Enhanced with modular structure and comprehensive analysis") + print("=" * 60) + + # Step 1: Load Data + print("\n1. LOADING DATA") + print("-" * 30) + df = load_iris_data() + + # Display dataset information + info = get_dataset_info(df) + print(f"\nDataset Shape: {info['shape']}") + print(f"Features: {info['columns']}") + print(f"Species Distribution:") + for species, count in info['species_distribution'].items(): + print(f" {species}: {count} samples") + + # Step 2: Data Preprocessing + print("\n2. DATA PREPROCESSING") + print("-" * 30) + X_train, X_test, y_train, y_test, scaler = preprocess_data(df) + + # Step 3: Model Training + print("\n3. MODEL TRAINING") + print("-" * 30) + classifier = IrisClassifier() + results = classifier.train_all_models(X_train, y_train, cv=5) + + # Display model comparison + print("\nModel Comparison Summary:") + comparison_df = create_model_comparison_report(results) + print(comparison_df.to_string(index=False)) + + # Step 4: Train Best Model + print("\n4. TRAINING BEST MODEL") + print("-" * 30) + classifier.train_best_model(X_train, y_train) + + # Step 5: Model Evaluation + print("\n5. MODEL EVALUATION") + print("-" * 30) + evaluation_results = classifier.evaluate_model(X_test, y_test) + + print(f"Test Accuracy: {evaluation_results['accuracy']:.4f}") + print("\nClassification Report:") + print(evaluation_results['classification_report']) + + # Step 6: Visualization + print("\n6. CREATING VISUALIZATIONS") + print("-" * 30) + + # Get predictions for visualization + y_pred = classifier.best_model.predict(X_test) + + # Create comprehensive visualizations + create_summary_plots( + df=df, + results=results, + y_true=y_test, + y_pred=y_pred, + model=classifier.best_model, + feature_names=['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm'] + ) + + # Step 7: Save Model + print("\n7. 
SAVING MODEL") + print("-" * 30) + classifier.save_model() + + print("\n" + "=" * 60) + print("PIPELINE COMPLETED SUCCESSFULLY!") + print("=" * 60) + print(f"Best Model: {classifier.best_model_name}") + print(f"Best CV Score: {classifier.best_score:.4f}") + print(f"Test Accuracy: {evaluation_results['accuracy']:.4f}") + print("=" * 60) + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/Iris Species Dataset Visualization.ipynb b/notebooks/Iris Species Dataset Visualization.ipynb similarity index 100% rename from Iris Species Dataset Visualization.ipynb rename to notebooks/Iris Species Dataset Visualization.ipynb diff --git a/Machine Learning with Iris Dataset.ipynb b/notebooks/Machine Learning with Iris Dataset.ipynb similarity index 100% rename from Machine Learning with Iris Dataset.ipynb rename to notebooks/Machine Learning with Iris Dataset.ipynb diff --git a/plan.txt b/plan.txt new file mode 100644 index 0000000..e69de29 diff --git a/requirements.txt b/requirements.txt index 8fe9ebf..9ff5f58 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,28 +1,21 @@ -# this version or latest -ipykernel==4.2.2 -ipython==4.1.1 -ipython-genutils==0.1.0 -ipywidgets==4.1.1 -jsonschema==2.5.1 -jupyter==1.0.0 -jupyter-client==4.1.1 -jupyter-console==4.1.0 -jupyter-core==4.0.6 -MarkupSafe==0.23 -matplotlib==1.5.1 -mistune==0.8.1 -nbconvert==4.1.0 -nbformat==4.0.1 -notebook==5.7.8 -numpy==1.10.4 -pandas==0.17.1 -scikit-learn==0.20.3 -scipy==1.1.0 -seaborn==0.7.0 -simplegeneric==0.8.1 -singledispatch==3.4.0.3 -six==1.10.0 -terminado==0.6 -tornado==4.3 -traitlets==4.1.0 -wheel==0.24.0 +# Core Data Science Libraries +pandas +numpy +scikit-learn +scipy + +# Visualization Libraries +matplotlib +seaborn + +# Jupyter Environment +jupyter +notebook +ipykernel +ipython + +# Model Persistence +joblib + +# Additional Utilities +tqdm diff --git a/rishi.md b/rishi.md new file mode 100644 index 0000000..e69de29 diff --git a/src/__init__.py b/src/__init__.py new file mode 100644 index 0000000..75d38e8 --- /dev/null +++ b/src/__init__.py @@ -0,0 +1 @@ +# Machine Learning with Iris Dataset - Source Package \ No newline at end of file diff --git a/src/data_loader.py b/src/data_loader.py new file mode 100644 index 0000000..0a6490f --- /dev/null +++ b/src/data_loader.py @@ -0,0 +1,113 @@ +""" +Data Loader Module for Iris Dataset +=================================== + +This module provides functions to load and preprocess the Iris dataset. +""" + +import pandas as pd +import numpy as np +from sklearn.datasets import load_iris +from sklearn.model_selection import train_test_split +from sklearn.preprocessing import StandardScaler + + +def load_iris_data(file_path='../data/Iris.csv'): + """ + Load Iris dataset from CSV file. + + Parameters: + ----------- + file_path : str + Path to the Iris.csv file + + Returns: + -------- + pandas.DataFrame + Loaded Iris dataset + """ + try: + df = pd.read_csv(file_path) + print(f"Dataset loaded successfully with {len(df)} samples and {len(df.columns)} features") + return df + except FileNotFoundError: + print(f"File not found: {file_path}") + print("Loading from sklearn.datasets instead...") + return load_iris_sklearn() + + +def load_iris_sklearn(): + """ + Load Iris dataset from sklearn.datasets. 
+ + Returns: + -------- + pandas.DataFrame + Loaded Iris dataset + """ + iris = load_iris() + df = pd.DataFrame(iris.data, columns=iris.feature_names) + df['Species'] = iris.target_names[iris.target] + print(f"Dataset loaded from sklearn with {len(df)} samples and {len(df.columns)} features") + return df + + +def preprocess_data(df, test_size=0.2, random_state=42): + """ + Preprocess the Iris dataset for machine learning. + + Parameters: + ----------- + df : pandas.DataFrame + Input dataset + test_size : float + Proportion of dataset to include in the test split + random_state : int + Random state for reproducibility + + Returns: + -------- + tuple + X_train, X_test, y_train, y_test, scaler + """ + # Separate features and target + X = df.drop('Species', axis=1) + y = df['Species'] + + # Split the data + X_train, X_test, y_train, y_test = train_test_split( + X, y, test_size=test_size, random_state=random_state, stratify=y + ) + + # Scale the features + scaler = StandardScaler() + X_train_scaled = scaler.fit_transform(X_train) + X_test_scaled = scaler.transform(X_test) + + print(f"Data preprocessed: Train set {X_train.shape[0]} samples, Test set {X_test.shape[0]} samples") + + return X_train_scaled, X_test_scaled, y_train, y_test, scaler + + +def get_dataset_info(df): + """ + Get basic information about the dataset. + + Parameters: + ----------- + df : pandas.DataFrame + Input dataset + + Returns: + -------- + dict + Dictionary containing dataset information + """ + info = { + 'shape': df.shape, + 'columns': list(df.columns), + 'dtypes': df.dtypes.to_dict(), + 'missing_values': df.isnull().sum().to_dict(), + 'species_distribution': df['Species'].value_counts().to_dict() if 'Species' in df.columns else None + } + return info \ No newline at end of file diff --git a/src/models.py b/src/models.py new file mode 100644 index 0000000..486b3a5 --- /dev/null +++ b/src/models.py @@ -0,0 +1,192 @@ +""" +Machine Learning Models Module for Iris Dataset +=============================================== + +This module provides various machine learning models and evaluation functions +for the Iris classification problem. +""" + +import numpy as np +import pandas as pd +from sklearn.linear_model import LogisticRegression +from sklearn.svm import SVC +from sklearn.ensemble import RandomForestClassifier +from sklearn.neighbors import KNeighborsClassifier +from sklearn.tree import DecisionTreeClassifier +from sklearn.metrics import accuracy_score, classification_report, confusion_matrix +from sklearn.model_selection import cross_val_score +import joblib +import os + + +class IrisClassifier: + """ + A comprehensive classifier for the Iris dataset with multiple algorithms. + """ + + def __init__(self): + self.models = { + 'logistic_regression': LogisticRegression(random_state=42), + 'svm': SVC(random_state=42), + 'random_forest': RandomForestClassifier(random_state=42), + 'knn': KNeighborsClassifier(n_neighbors=3), + 'decision_tree': DecisionTreeClassifier(random_state=42) + } + self.best_model = None + self.best_score = 0 + self.best_model_name = None + + def train_all_models(self, X_train, y_train, cv=5): + """ + Train all models and find the best performing one. 
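+
+        Models are scored with cross_val_score, which fits clones, so the
+        winning model is not yet fitted when this returns; call
+        train_best_model afterwards before predicting. A typical sequence:
+
+            clf = IrisClassifier()
+            results = clf.train_all_models(X_train, y_train, cv=5)
+            clf.train_best_model(X_train, y_train)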
+ + Parameters: + ----------- + X_train : array-like + Training features + y_train : array-like + Training labels + cv : int + Number of cross-validation folds + + Returns: + -------- + dict + Dictionary with model names and their cross-validation scores + """ + results = {} + + for name, model in self.models.items(): + print(f"Training {name}...") + scores = cross_val_score(model, X_train, y_train, cv=cv) + mean_score = scores.mean() + std_score = scores.std() + + results[name] = { + 'mean_score': mean_score, + 'std_score': std_score, + 'scores': scores + } + + print(f"{name}: {mean_score:.4f} (+/- {std_score * 2:.4f})") + + # Update best model + if mean_score > self.best_score: + self.best_score = mean_score + self.best_model = model + self.best_model_name = name + + print(f"\nBest model: {self.best_model_name} with score: {self.best_score:.4f}") + return results + + def train_best_model(self, X_train, y_train): + """ + Train the best performing model on the full training set. + + Parameters: + ----------- + X_train : array-like + Training features + y_train : array-like + Training labels + """ + if self.best_model is None: + raise ValueError("No best model selected. Run train_all_models first.") + + print(f"Training best model ({self.best_model_name}) on full dataset...") + self.best_model.fit(X_train, y_train) + print("Training completed!") + + def evaluate_model(self, X_test, y_test, model=None): + """ + Evaluate a model on the test set. + + Parameters: + ----------- + X_test : array-like + Test features + y_test : array-like + Test labels + model : sklearn estimator, optional + Model to evaluate. If None, uses the best model. + + Returns: + -------- + dict + Dictionary with evaluation metrics + """ + if model is None: + model = self.best_model + + if model is None: + raise ValueError("No model available for evaluation.") + + y_pred = model.predict(X_test) + + results = { + 'accuracy': accuracy_score(y_test, y_pred), + 'classification_report': classification_report(y_test, y_pred, output_dict=True), + 'confusion_matrix': confusion_matrix(y_test, y_pred) + } + + return results + + def save_model(self, filepath='../models/best_iris_model.pkl'): + """ + Save the best model to disk. + + Parameters: + ----------- + filepath : str + Path where to save the model + """ + if self.best_model is None: + raise ValueError("No model to save. Train a model first.") + + # Create directory if it doesn't exist + os.makedirs(os.path.dirname(filepath), exist_ok=True) + + joblib.dump(self.best_model, filepath) + print(f"Model saved to {filepath}") + + def load_model(self, filepath='../models/best_iris_model.pkl'): + """ + Load a saved model from disk. + + Parameters: + ----------- + filepath : str + Path to the saved model + """ + if os.path.exists(filepath): + self.best_model = joblib.load(filepath) + print(f"Model loaded from {filepath}") + else: + print(f"Model file not found: {filepath}") + + +def create_model_comparison_report(results): + """ + Create a comparison report of all models. 
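+
+    Scores are formatted as strings for display. Intended to be fed the
+    dict returned by IrisClassifier.train_all_models, as in main.py:
+
+        results = classifier.train_all_models(X_train, y_train)
+        print(create_model_comparison_report(results).to_string(index=False))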
+ + Parameters: + ----------- + results : dict + Results from train_all_models + + Returns: + -------- + pandas.DataFrame + Comparison table + """ + comparison_data = [] + + for model_name, result in results.items(): + comparison_data.append({ + 'Model': model_name, + 'Mean CV Score': f"{result['mean_score']:.4f}", + 'Std CV Score': f"{result['std_score']:.4f}", + 'Score Range': f"{result['mean_score'] - result['std_score']:.4f} - {result['mean_score'] + result['std_score']:.4f}" + }) + + return pd.DataFrame(comparison_data) \ No newline at end of file diff --git a/src/visualization.py b/src/visualization.py new file mode 100644 index 0000000..31f0a1b --- /dev/null +++ b/src/visualization.py @@ -0,0 +1,272 @@ +""" +Visualization Module for Iris Dataset +===================================== + +This module provides comprehensive visualization functions for data exploration +and model results analysis. +""" + +import matplotlib.pyplot as plt +import seaborn as sns +import pandas as pd +import numpy as np +from sklearn.metrics import confusion_matrix +import warnings +warnings.filterwarnings('ignore') + +# Set style for better looking plots +plt.style.use('seaborn-v0_8') +sns.set_palette("husl") + + +def plot_data_distribution(df, figsize=(15, 10)): + """ + Plot distribution of features by species. + + Parameters: + ----------- + df : pandas.DataFrame + Iris dataset + figsize : tuple + Figure size + """ + features = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm'] + + fig, axes = plt.subplots(2, 2, figsize=figsize) + fig.suptitle('Feature Distributions by Species', fontsize=16, fontweight='bold') + + for i, feature in enumerate(features): + row = i // 2 + col = i % 2 + + for species in df['Species'].unique(): + species_data = df[df['Species'] == species][feature] + axes[row, col].hist(species_data, alpha=0.7, label=species, bins=15) + + axes[row, col].set_title(f'{feature} Distribution') + axes[row, col].set_xlabel(feature) + axes[row, col].set_ylabel('Frequency') + axes[row, col].legend() + axes[row, col].grid(True, alpha=0.3) + + plt.tight_layout() + plt.show() + + +def plot_correlation_matrix(df, figsize=(10, 8)): + """ + Plot correlation matrix heatmap. + + Parameters: + ----------- + df : pandas.DataFrame + Iris dataset + figsize : tuple + Figure size + """ + # Select numeric columns + numeric_cols = df.select_dtypes(include=[np.number]).columns + correlation_matrix = df[numeric_cols].corr() + + plt.figure(figsize=figsize) + sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, + square=True, linewidths=0.5, cbar_kws={"shrink": .8}) + plt.title('Feature Correlation Matrix', fontsize=16, fontweight='bold') + plt.tight_layout() + plt.show() + + +def plot_pairplot(df, hue='Species', figsize=(12, 10)): + """ + Create a pairplot for feature relationships. + + Parameters: + ----------- + df : pandas.DataFrame + Iris dataset + hue : str + Column to use for color coding + figsize : tuple + Figure size + """ + features = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm'] + + plt.figure(figsize=figsize) + sns.pairplot(df[features + [hue]], hue=hue, diag_kind='hist', + plot_kws={'alpha': 0.7}, diag_kws={'alpha': 0.7}) + plt.suptitle('Feature Relationships by Species', y=1.02, fontsize=16, fontweight='bold') + plt.show() + + +def plot_boxplots(df, figsize=(15, 8)): + """ + Create boxplots for feature distributions by species. 
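+
+    As with the other plotting helpers in this module, the CSV-style column
+    names ('SepalLengthCm', ..., 'Species') are assumed to be present.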
+ + Parameters: + ----------- + df : pandas.DataFrame + Iris dataset + figsize : tuple + Figure size + """ + features = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm'] + + fig, axes = plt.subplots(1, 4, figsize=figsize) + fig.suptitle('Feature Distributions by Species (Boxplots)', fontsize=16, fontweight='bold') + + for i, feature in enumerate(features): + sns.boxplot(data=df, x='Species', y=feature, ax=axes[i]) + axes[i].set_title(f'{feature} by Species') + axes[i].set_xlabel('Species') + axes[i].set_ylabel(feature) + axes[i].tick_params(axis='x', rotation=45) + + plt.tight_layout() + plt.show() + + +def plot_confusion_matrix(y_true, y_pred, class_names=None, figsize=(8, 6)): + """ + Plot confusion matrix. + + Parameters: + ----------- + y_true : array-like + True labels + y_pred : array-like + Predicted labels + class_names : list, optional + Names of the classes + figsize : tuple + Figure size + """ + cm = confusion_matrix(y_true, y_pred) + + plt.figure(figsize=figsize) + sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', + xticklabels=class_names, yticklabels=class_names) + plt.title('Confusion Matrix', fontsize=16, fontweight='bold') + plt.xlabel('Predicted Label') + plt.ylabel('True Label') + plt.tight_layout() + plt.show() + + +def plot_model_comparison(results, figsize=(12, 6)): + """ + Plot model comparison results. + + Parameters: + ----------- + results : dict + Results from model training + figsize : tuple + Figure size + """ + model_names = list(results.keys()) + mean_scores = [results[name]['mean_score'] for name in model_names] + std_scores = [results[name]['std_score'] for name in model_names] + + fig, (ax1, ax2) = plt.subplots(1, 2, figsize=figsize) + + # Bar plot of mean scores + bars = ax1.bar(model_names, mean_scores, yerr=std_scores, + capsize=5, alpha=0.7, color='skyblue', edgecolor='navy') + ax1.set_title('Model Performance Comparison', fontweight='bold') + ax1.set_ylabel('Cross-Validation Score') + ax1.set_xlabel('Models') + ax1.tick_params(axis='x', rotation=45) + ax1.grid(True, alpha=0.3) + + # Add value labels on bars + for bar, score in zip(bars, mean_scores): + height = bar.get_height() + ax1.text(bar.get_x() + bar.get_width()/2., height + 0.01, + f'{score:.3f}', ha='center', va='bottom', fontweight='bold') + + # Score distribution + for i, name in enumerate(model_names): + scores = results[name]['scores'] + ax2.hist(scores, alpha=0.7, label=name, bins=10) + + ax2.set_title('Score Distributions', fontweight='bold') + ax2.set_xlabel('Cross-Validation Score') + ax2.set_ylabel('Frequency') + ax2.legend() + ax2.grid(True, alpha=0.3) + + plt.tight_layout() + plt.show() + + +def plot_feature_importance(model, feature_names, figsize=(10, 6)): + """ + Plot feature importance for tree-based models. 
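+
+    Of the models in IrisClassifier, only RandomForest and DecisionTree
+    expose feature_importances_; for any other estimator this function
+    prints a message and plots nothing.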
+ + Parameters: + ----------- + model : sklearn estimator + Trained model with feature_importances_ attribute + feature_names : list + Names of the features + figsize : tuple + Figure size + """ + if hasattr(model, 'feature_importances_'): + importances = model.feature_importances_ + indices = np.argsort(importances)[::-1] + + plt.figure(figsize=figsize) + plt.title('Feature Importance', fontsize=16, fontweight='bold') + plt.bar(range(len(importances)), importances[indices], + color='lightcoral', alpha=0.7) + plt.xticks(range(len(importances)), [feature_names[i] for i in indices], + rotation=45, ha='right') + plt.xlabel('Features') + plt.ylabel('Importance') + plt.grid(True, alpha=0.3) + plt.tight_layout() + plt.show() + else: + print("This model doesn't have feature importance attribute.") + + +def create_summary_plots(df, results=None, y_true=None, y_pred=None, + model=None, feature_names=None): + """ + Create a comprehensive set of summary plots. + + Parameters: + ----------- + df : pandas.DataFrame + Iris dataset + results : dict, optional + Model training results + y_true : array-like, optional + True labels for confusion matrix + y_pred : array-like, optional + Predicted labels for confusion matrix + model : sklearn estimator, optional + Trained model for feature importance + feature_names : list, optional + Feature names for importance plot + """ + print("Creating comprehensive visualization summary...") + + # Data exploration plots + plot_data_distribution(df) + plot_correlation_matrix(df) + plot_pairplot(df) + plot_boxplots(df) + + # Model evaluation plots + if results is not None: + plot_model_comparison(results) + + if y_true is not None and y_pred is not None: + plot_confusion_matrix(y_true, y_pred) + + if model is not None and feature_names is not None: + plot_feature_importance(model, feature_names) + + print("Visualization summary completed!") \ No newline at end of file
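+
+
+if __name__ == "__main__":
+    # Minimal smoke test for this module on its own: build a frame with the
+    # CSV-style column names the helpers expect, then draw two of the plots.
+    # (A sketch; the full pipeline in main.py exercises everything.)
+    from sklearn.datasets import load_iris
+    iris = load_iris()
+    df = pd.DataFrame(iris.data, columns=[
+        'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm'])
+    df['Species'] = iris.target_names[iris.target]
+    plot_data_distribution(df)
+    plot_correlation_matrix(df)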