diff --git a/README.md b/README.md
index bde835c..ed458d4 100644
--- a/README.md
+++ b/README.md
@@ -1,16 +1,198 @@
-# Machine Learning with Iris Dataset
-
- 
-
+# 🌸 Machine Learning with Iris Dataset
-## Introduction
-The Iris dataset is a classic dataset for classification, machine learning, and data visualization.
+[Python](https://www.python.org/downloads/)
+[scikit-learn](https://scikit-learn.org/)
+[Supervised Learning](https://en.wikipedia.org/wiki/Supervised_learning)
+[This Repository](https://github.com/Hrushikeshsurabhi/Machine-Learning-with-Iris-Dataset)
+[Original Repository](https://github.com/venky14/Machine-Learning-with-Iris-Dataset)
-The dataset contains: 3 classes (different Iris species) with 50 samples each, and then four numeric properties about those classes: Sepal Length, Sepal Width, Petal Length, and Petal Width.
+> **Enhanced Version**: This is a forked and improved version of the original Iris dataset project by [venky14](https://github.com/venky14/Machine-Learning-with-Iris-Dataset), featuring a modular structure, comprehensive analysis, and production-ready code.
-One species, Iris Setosa, is "linearly separable" from the other two. This means that we can draw a line (or a hyperplane in higher-dimensional spaces) between Iris Setosa samples and samples corresponding to the other two species.
+## 📋 Table of Contents
-Predicted Attribute: Different Species of Iris plant.
+- [Introduction](#introduction)
+- [Project Structure](#project-structure)
+- [Features](#features)
+- [Quick Start](#quick-start)
+- [Usage Examples](#usage-examples)
+- [Documentation](#documentation)
+- [Dependencies](#dependencies)
+- [Contributing](#contributing)
+- [License](#license)
+- [Acknowledgments](#acknowledgments)
-## Purpose
-The purpose of this project was to gain introductory exposure to Machine Learning Classification concepts along with data visualization. The project makes heavy use of Scikit-Learn, Pandas and Data Visualization Libraries.
+## 🌺 Introduction
+
+The Iris dataset is a classic dataset for classification, machine learning, and data visualization. This enhanced version provides a comprehensive analysis with a modular, production-ready structure.
+
+### Dataset Information
+- **3 Classes**: Different Iris species (Setosa, Versicolor, Virginica)
+- **4 Features**: Sepal Length, Sepal Width, Petal Length, Petal Width
+- **150 Samples**: 50 samples per species
+- **Linearly Separable**: Iris Setosa is linearly separable from the other two species
+
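+That separability is easy to verify directly from the data. A quick check (a sketch, assuming the Kaggle-style column names used throughout this project):
+
+```python
+import pandas as pd
+
+df = pd.read_csv('data/Iris.csv')
+# Setosa's petal lengths (at most 1.9 cm) never overlap the other two
+# species' (at least 3.0 cm), so a single threshold such as
+# PetalLengthCm < 2.5 already separates Setosa perfectly.
+print(df.groupby('Species')['PetalLengthCm'].agg(['min', 'max']))
+```
+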
+### Enhanced Features
+- 🏗️ **Modular Architecture**: Clean separation of concerns
+- 🤖 **Multiple Algorithms**: 5 different classification models
+- 📊 **Comprehensive Visualizations**: 8+ different plot types
+- 🔄 **Cross-Validation**: Robust model evaluation
+- 💾 **Model Persistence**: Save and load trained models
+- 📚 **Detailed Documentation**: Complete API documentation
+
+## 📁 Project Structure
+
+```
+Machine-Learning-with-Iris-Dataset/
+├── 📁 data/                          # Data files
+│   └── Iris.csv                      # Original Iris dataset
+├── 📓 notebooks/                     # Jupyter notebooks
+│   ├── Machine Learning with Iris Dataset.ipynb
+│   └── Iris Species Dataset Visualization.ipynb
+├── 🔧 src/                           # Source code modules
+│   ├── __init__.py
+│   ├── data_loader.py                # Data loading and preprocessing
+│   ├── models.py                     # Machine learning models
+│   └── visualization.py              # Visualization functions
+├── 🤖 models/                        # Saved models (created during execution)
+├── 📚 docs/                          # Documentation
+│   └── README.md                     # Detailed documentation
+├── 🐍 main.py                        # Main execution script
+├── 🐍 example.py                     # Simple usage example
+├── 📄 requirements.txt               # Python dependencies
+└── 📖 README.md                      # This file
+```
+
+## ✨ Features
+
+### 🎯 Machine Learning Models
+- **Logistic Regression**: Linear classification
+- **Support Vector Machine (SVM)**: Kernel-based classification
+- **Random Forest**: Ensemble learning
+- **K-Nearest Neighbors (KNN)**: Instance-based learning
+- **Decision Tree**: Tree-based classification
+
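+These map to standard scikit-learn estimators. A minimal, self-contained sketch of the same comparison (illustrative parameters; the project's actual configuration lives in `src/models.py`):
+
+```python
+from sklearn.datasets import load_iris
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.linear_model import LogisticRegression
+from sklearn.model_selection import cross_val_score
+from sklearn.neighbors import KNeighborsClassifier
+from sklearn.svm import SVC
+from sklearn.tree import DecisionTreeClassifier
+
+X, y = load_iris(return_X_y=True)
+models = {
+    'logistic_regression': LogisticRegression(max_iter=200),
+    'svm': SVC(),
+    'random_forest': RandomForestClassifier(random_state=42),
+    'knn': KNeighborsClassifier(n_neighbors=3),
+    'decision_tree': DecisionTreeClassifier(random_state=42),
+}
+for name, model in models.items():
+    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
+    print(f"{name}: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")
+```
+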
+### 📊 Analysis Capabilities
+- **Data Exploration**: Comprehensive statistical analysis
+- **Feature Importance**: Model interpretability
+- **Model Comparison**: Performance benchmarking
+- **Cross-Validation**: Robust evaluation
+- **Confusion Matrix**: Detailed error analysis
+
+### 🎨 Visualization Suite
+- **Distribution Plots**: Feature distributions by species
+- **Correlation Matrix**: Feature relationships
+- **Pair Plots**: Multi-dimensional relationships
+- **Box Plots**: Statistical summaries
+- **Model Performance**: Comparison charts
+- **Confusion Matrix**: Error visualization
+- **Feature Importance**: Model interpretability
+
+## 🚀 Quick Start
+
+### 1. Clone the Repository
+```bash
+git clone https://github.com/Hrushikeshsurabhi/Machine-Learning-with-Iris-Dataset.git
+cd Machine-Learning-with-Iris-Dataset
+```
+
+### 2. Install Dependencies
+```bash
+pip install -r requirements.txt
+```
+
+### 3. Run the Complete Pipeline
+```bash
+python main.py
+```
+
+This will:
+- Load and preprocess the Iris dataset
+- Train 5 different machine learning models
+- Compare model performances
+- Generate comprehensive visualizations
+- Save the best performing model
+
+## 💻 Usage Examples
+
+### Basic Usage
+```python
+from src.data_loader import load_iris_data, preprocess_data
+from src.models import IrisClassifier
+from src.visualization import create_summary_plots
+
+# Load and preprocess data
+df = load_iris_data()
+X_train, X_test, y_train, y_test, scaler = preprocess_data(df)
+
+# Train models
+classifier = IrisClassifier()
+results = classifier.train_all_models(X_train, y_train)
+classifier.train_best_model(X_train, y_train)  # fit the winner before predicting
+
+# Create visualizations
+create_summary_plots(df, results, y_test, classifier.best_model.predict(X_test))
+```
+
+### Advanced Usage
+```python
+# Custom preprocessing
+X_train, X_test, y_train, y_test, scaler = preprocess_data(df, test_size=0.3, random_state=123)
+
+# Train with custom parameters
+classifier = IrisClassifier()
+results = classifier.train_all_models(X_train, y_train, cv=10)
+
+# Save and load models
+classifier.save_model('my_iris_model.pkl')
+classifier.load_model('my_iris_model.pkl')
+```
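+
+A saved model can be reloaded in a fresh session. A short sketch (the sample values are hypothetical, and inputs must be scaled with the same `scaler` returned by `preprocess_data`):
+
+```python
+import numpy as np
+
+classifier = IrisClassifier()
+classifier.load_model('my_iris_model.pkl')  # restores classifier.best_model
+
+# One hypothetical flower: sepal length/width, petal length/width (cm)
+sample = np.array([[5.1, 3.5, 1.4, 0.2]])
+print(classifier.best_model.predict(scaler.transform(sample)))
+```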
+
+## 📖 Documentation
+
+- **[Detailed Documentation](docs/README.md)**: Complete module descriptions and API reference
+- **[Jupyter Notebooks](notebooks/)**: Interactive examples and tutorials
+- **[Source Code](src/)**: Well-documented source modules
+
+## 🛠️ Dependencies
+
+### Core Libraries
+- **pandas** (≥1.3.0): Data manipulation and analysis
+- **numpy** (≥1.21.0): Numerical computing
+- **scikit-learn** (≥1.0.0): Machine learning algorithms
+- **matplotlib** (≥3.4.0): Basic plotting
+- **seaborn** (≥0.11.0): Statistical data visualization
+
+### Additional Libraries
+- **joblib** (≥1.1.0): Model persistence
+- **jupyter** (≥1.0.0): Interactive notebooks
+
+## 🤝 Contributing
+
+This project is a fork of the original work by [venky14](https://github.com/venky14/Machine-Learning-with-Iris-Dataset).
+
+### Enhancements Made
+- ✅ Modular code structure
+- ✅ Comprehensive documentation
+- ✅ Additional visualization capabilities
+- ✅ Production-ready features
+- ✅ Enhanced error handling
+- ✅ Model persistence functionality
+
+### How to Contribute
+1. Fork the repository
+2. Create a feature branch (`git checkout -b feature/amazing-feature`)
+3. Commit your changes (`git commit -m 'Add amazing feature'`)
+4. Push to the branch (`git push origin feature/amazing-feature`)
+5. Open a Pull Request
+
+## 📄 License
+
+This project maintains the same license as the original repository by venky14.
+
+## 🙏 Acknowledgments
+
+- **Original Author**: [venky14](https://github.com/venky14) for the foundational work
+- **Dataset Source**: [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/iris)
+- **Libraries**: scikit-learn, pandas, matplotlib, seaborn communities
+
+---
+
+
diff --git a/Iris.csv b/data/Iris.csv
similarity index 100%
rename from Iris.csv
rename to data/Iris.csv
diff --git a/docs/README.md b/docs/README.md
new file mode 100644
index 0000000..b7484c7
--- /dev/null
+++ b/docs/README.md
@@ -0,0 +1,130 @@
+# Project Documentation
+
+## Overview
+This is an enhanced version of the original Iris dataset machine learning project, featuring a modular structure and comprehensive analysis capabilities.
+
+## Project Structure
+
+```
+Machine-Learning-with-Iris-Dataset/
+├── data/                    # Data files
+│   └── Iris.csv             # Original Iris dataset
+├── notebooks/               # Jupyter notebooks
+│   ├── Machine Learning with Iris Dataset.ipynb
+│   └── Iris Species Dataset Visualization.ipynb
+├── src/                     # Source code modules
+│   ├── __init__.py
+│   ├── data_loader.py       # Data loading and preprocessing
+│   ├── models.py            # Machine learning models
+│   └── visualization.py     # Visualization functions
+├── models/                  # Saved models (created during execution)
+├── docs/                    # Documentation
+│   └── README.md            # This file
+├── main.py                  # Main execution script
+├── example.py               # Simple usage example
+├── requirements.txt         # Python dependencies
+├── README.md                # Project README
+├── .gitignore               # Git ignore file
+└── .gitattributes           # Git attributes
+```
+
+## Module Descriptions
+
+### src/data_loader.py
+- **Purpose**: Data loading and preprocessing functions
+- **Key Functions**:
+ - `load_iris_data()`: Load dataset from CSV or sklearn
+ - `preprocess_data()`: Split data and apply scaling
+ - `get_dataset_info()`: Get dataset statistics
+
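+For example, the loader utilities can be used on their own (a sketch; the exact shape depends on whether the CSV carries an `Id` column):
+
+```python
+from src.data_loader import load_iris_data, get_dataset_info
+
+df = load_iris_data('data/Iris.csv')
+info = get_dataset_info(df)
+print(info['shape'])                 # 150 rows
+print(info['species_distribution'])  # 50 samples per species
+```
+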
+### src/models.py
+- **Purpose**: Machine learning models and evaluation
+- **Key Classes**:
+ - `IrisClassifier`: Comprehensive classifier with multiple algorithms
+- **Supported Models**:
+ - Logistic Regression
+ - Support Vector Machine (SVM)
+ - Random Forest
+ - K-Nearest Neighbors (KNN)
+ - Decision Tree
+
+### src/visualization.py
+- **Purpose**: Data exploration and model evaluation visualizations
+- **Key Functions**:
+ - `plot_data_distribution()`: Feature distributions by species
+ - `plot_correlation_matrix()`: Feature correlations
+ - `plot_model_comparison()`: Model performance comparison
+ - `create_summary_plots()`: Comprehensive visualization suite
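+
+The exploration plots can be generated independently of model training (a sketch):
+
+```python
+from src.data_loader import load_iris_data
+from src.visualization import plot_data_distribution, plot_correlation_matrix
+
+df = load_iris_data('data/Iris.csv')
+plot_data_distribution(df)   # per-feature histograms, colored by species
+plot_correlation_matrix(df)  # heatmap of pairwise feature correlations
+```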
+
+## Usage
+
+### Quick Start
+1. Install dependencies:
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+2. Run the complete pipeline:
+ ```bash
+ python main.py
+ ```
+
+### Using Individual Modules
+```python
+from src.data_loader import load_iris_data, preprocess_data
+from src.models import IrisClassifier
+from src.visualization import create_summary_plots
+
+# Load and preprocess data
+df = load_iris_data()
+X_train, X_test, y_train, y_test, scaler = preprocess_data(df)
+
+# Train models
+classifier = IrisClassifier()
+results = classifier.train_all_models(X_train, y_train)
+classifier.train_best_model(X_train, y_train)  # fit the winner before predicting
+
+# Create visualizations
+create_summary_plots(df, results, y_test, classifier.best_model.predict(X_test))
+```
+
+## Features
+
+### Enhanced Structure
+- **Modular Design**: Separated concerns into logical modules
+- **Reusable Components**: Functions can be used independently
+- **Comprehensive Documentation**: Detailed docstrings and examples
+
+### Advanced Analysis
+- **Multiple Algorithms**: 5 different classification algorithms
+- **Cross-Validation**: Robust model evaluation
+- **Feature Importance**: Analysis for tree-based models
+- **Comprehensive Visualizations**: 8+ different plot types
+
+### Production Ready
+- **Model Persistence**: Save and load trained models
+- **Error Handling**: Graceful handling of missing files
+- **Scalable**: Easy to extend with new algorithms
+
+## Dependencies
+
+### Core Libraries
+- **pandas**: Data manipulation and analysis
+- **numpy**: Numerical computing
+- **scikit-learn**: Machine learning algorithms
+- **matplotlib**: Basic plotting
+- **seaborn**: Statistical data visualization
+
+### Additional Libraries
+- **joblib**: Model persistence
+- **jupyter**: Interactive notebooks
+
+## Contributing
+
+This project is a fork of the original work by venky14. Enhancements include:
+- Modular code structure
+- Comprehensive documentation
+- Additional visualization capabilities
+- Production-ready features
+
+## License
+
+This project maintains the same license as the original repository.
\ No newline at end of file
diff --git a/example.py b/example.py
new file mode 100644
index 0000000..9f3c7c3
--- /dev/null
+++ b/example.py
@@ -0,0 +1,61 @@
+#!/usr/bin/env python3
+"""
+Simple Example Script for Iris Dataset Analysis
+==============================================
+
+This script demonstrates basic usage of the modular components
+for quick analysis and experimentation.
+
+Author: Hrushikeshsurabhi (Forked from venky14)
+"""
+
+import sys
+import os
+
+# Ensure the project root is on sys.path so the `src` package resolves
+sys.path.append(os.path.dirname(os.path.abspath(__file__)))
+
+from src.data_loader import load_iris_data, preprocess_data
+from src.models import IrisClassifier
+from src.visualization import plot_data_distribution, plot_correlation_matrix
+
+
+def simple_example():
+ """
+ Simple example demonstrating basic functionality.
+ """
+ print("šø Simple Iris Dataset Analysis Example")
+ print("=" * 50)
+
+ # 1. Load data
+ print("\n1. Loading Iris dataset...")
+ df = load_iris_data()
+ print(f" Dataset loaded: {df.shape[0]} samples, {df.shape[1]} features")
+
+ # 2. Quick visualization
+ print("\n2. Creating basic visualizations...")
+ plot_data_distribution(df)
+ plot_correlation_matrix(df)
+
+ # 3. Train a simple model
+ print("\n3. Training models...")
+ X_train, X_test, y_train, y_test, scaler = preprocess_data(df)
+
+ classifier = IrisClassifier()
+ results = classifier.train_all_models(X_train, y_train, cv=3)
+
+ # 4. Show results
+ print("\n4. Results Summary:")
+ print(f" Best model: {classifier.best_model_name}")
+ print(f" Best CV score: {classifier.best_score:.4f}")
+
+ # 5. Evaluate on test set
+ classifier.train_best_model(X_train, y_train)
+ evaluation = classifier.evaluate_model(X_test, y_test)
+ print(f" Test accuracy: {evaluation['accuracy']:.4f}")
+
+ print("\nā
Example completed successfully!")
+
+
+if __name__ == "__main__":
+ simple_example()
\ No newline at end of file
diff --git a/main.py b/main.py
new file mode 100644
index 0000000..1aa9929
--- /dev/null
+++ b/main.py
@@ -0,0 +1,110 @@
+#!/usr/bin/env python3
+"""
+Main Script for Iris Dataset Machine Learning Project
+====================================================
+
+This script demonstrates a complete machine learning pipeline for the Iris dataset,
+including data loading, preprocessing, model training, evaluation, and visualization.
+
+Author: Hrushikeshsurabhi (Forked from venky14)
+"""
+
+import sys
+import os
+
+# Ensure the project root is on sys.path so the `src` package resolves
+sys.path.append(os.path.dirname(os.path.abspath(__file__)))
+
+from src.data_loader import load_iris_data, preprocess_data, get_dataset_info
+from src.models import IrisClassifier, create_model_comparison_report
+from src.visualization import create_summary_plots, plot_confusion_matrix
+import pandas as pd
+
+
+def main():
+ """
+ Main function to run the complete machine learning pipeline.
+ """
+ print("=" * 60)
+ print("IRIS DATASET MACHINE LEARNING PROJECT")
+ print("=" * 60)
+ print("Forked from venky14/Machine-Learning-with-Iris-Dataset")
+ print("Enhanced with modular structure and comprehensive analysis")
+ print("=" * 60)
+
+ # Step 1: Load Data
+ print("\n1. LOADING DATA")
+ print("-" * 30)
+ df = load_iris_data()
+
+ # Display dataset information
+ info = get_dataset_info(df)
+ print(f"\nDataset Shape: {info['shape']}")
+ print(f"Features: {info['columns']}")
+ print(f"Species Distribution:")
+ for species, count in info['species_distribution'].items():
+ print(f" {species}: {count} samples")
+
+ # Step 2: Data Preprocessing
+ print("\n2. DATA PREPROCESSING")
+ print("-" * 30)
+ X_train, X_test, y_train, y_test, scaler = preprocess_data(df)
+
+ # Step 3: Model Training
+ print("\n3. MODEL TRAINING")
+ print("-" * 30)
+ classifier = IrisClassifier()
+ results = classifier.train_all_models(X_train, y_train, cv=5)
+
+ # Display model comparison
+ print("\nModel Comparison Summary:")
+ comparison_df = create_model_comparison_report(results)
+ print(comparison_df.to_string(index=False))
+
+ # Step 4: Train Best Model
+ print("\n4. TRAINING BEST MODEL")
+ print("-" * 30)
+ classifier.train_best_model(X_train, y_train)
+
+ # Step 5: Model Evaluation
+ print("\n5. MODEL EVALUATION")
+ print("-" * 30)
+ evaluation_results = classifier.evaluate_model(X_test, y_test)
+
+ print(f"Test Accuracy: {evaluation_results['accuracy']:.4f}")
+ print("\nClassification Report:")
+    # The report is a dict (output_dict=True); render it as a table
+    print(pd.DataFrame(evaluation_results['classification_report']).transpose().round(3))
+
+ # Step 6: Visualization
+ print("\n6. CREATING VISUALIZATIONS")
+ print("-" * 30)
+
+ # Get predictions for visualization
+ y_pred = classifier.best_model.predict(X_test)
+
+ # Create comprehensive visualizations
+ create_summary_plots(
+ df=df,
+ results=results,
+ y_true=y_test,
+ y_pred=y_pred,
+ model=classifier.best_model,
+ feature_names=['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
+ )
+
+ # Step 7: Save Model
+ print("\n7. SAVING MODEL")
+ print("-" * 30)
+ classifier.save_model()
+
+ print("\n" + "=" * 60)
+ print("PIPELINE COMPLETED SUCCESSFULLY!")
+ print("=" * 60)
+ print(f"Best Model: {classifier.best_model_name}")
+ print(f"Best CV Score: {classifier.best_score:.4f}")
+ print(f"Test Accuracy: {evaluation_results['accuracy']:.4f}")
+ print("=" * 60)
+
+
+if __name__ == "__main__":
+ main()
\ No newline at end of file
diff --git a/Iris Species Dataset Visualization.ipynb b/notebooks/Iris Species Dataset Visualization.ipynb
similarity index 100%
rename from Iris Species Dataset Visualization.ipynb
rename to notebooks/Iris Species Dataset Visualization.ipynb
diff --git a/Machine Learning with Iris Dataset.ipynb b/notebooks/Machine Learning with Iris Dataset.ipynb
similarity index 100%
rename from Machine Learning with Iris Dataset.ipynb
rename to notebooks/Machine Learning with Iris Dataset.ipynb
diff --git a/plan.txt b/plan.txt
new file mode 100644
index 0000000..e69de29
diff --git a/requirements.txt b/requirements.txt
index 8fe9ebf..9ff5f58 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,28 +1,21 @@
-# this version or latest
-ipykernel==4.2.2
-ipython==4.1.1
-ipython-genutils==0.1.0
-ipywidgets==4.1.1
-jsonschema==2.5.1
-jupyter==1.0.0
-jupyter-client==4.1.1
-jupyter-console==4.1.0
-jupyter-core==4.0.6
-MarkupSafe==0.23
-matplotlib==1.5.1
-mistune==0.8.1
-nbconvert==4.1.0
-nbformat==4.0.1
-notebook==5.7.8
-numpy==1.10.4
-pandas==0.17.1
-scikit-learn==0.20.3
-scipy==1.1.0
-seaborn==0.7.0
-simplegeneric==0.8.1
-singledispatch==3.4.0.3
-six==1.10.0
-terminado==0.6
-tornado==4.3
-traitlets==4.1.0
-wheel==0.24.0
+# Core Data Science Libraries
+pandas>=1.3.0
+numpy>=1.21.0
+scikit-learn>=1.0.0
+scipy
+
+# Visualization Libraries
+matplotlib>=3.4.0
+seaborn>=0.11.0
+
+# Jupyter Environment
+jupyter>=1.0.0
+notebook
+ipykernel
+ipython
+
+# Model Persistence
+joblib>=1.1.0
+
+# Additional Utilities
+tqdm
diff --git a/rishi.md b/rishi.md
new file mode 100644
index 0000000..e69de29
diff --git a/src/__init__.py b/src/__init__.py
new file mode 100644
index 0000000..75d38e8
--- /dev/null
+++ b/src/__init__.py
@@ -0,0 +1 @@
+# Machine Learning with Iris Dataset - Source Package
\ No newline at end of file
diff --git a/src/data_loader.py b/src/data_loader.py
new file mode 100644
index 0000000..0a6490f
--- /dev/null
+++ b/src/data_loader.py
@@ -0,0 +1,113 @@
+"""
+Data Loader Module for Iris Dataset
+===================================
+
+This module provides functions to load and preprocess the Iris dataset.
+"""
+
+import pandas as pd
+import numpy as np
+from sklearn.datasets import load_iris
+from sklearn.model_selection import train_test_split
+from sklearn.preprocessing import StandardScaler
+
+
+def load_iris_data(file_path='data/Iris.csv'):
+    """
+    Load Iris dataset from CSV file.
+
+    Parameters:
+    -----------
+    file_path : str
+        Path to the Iris.csv file (default assumes running from the project root)
+
+ Returns:
+ --------
+ pandas.DataFrame
+ Loaded Iris dataset
+ """
+ try:
+ df = pd.read_csv(file_path)
+ print(f"Dataset loaded successfully with {len(df)} samples and {len(df.columns)} features")
+ return df
+ except FileNotFoundError:
+ print(f"File not found: {file_path}")
+ print("Loading from sklearn.datasets instead...")
+ return load_iris_sklearn()
+
+
+def load_iris_sklearn():
+ """
+ Load Iris dataset from sklearn.datasets.
+
+ Returns:
+ --------
+ pandas.DataFrame
+ Loaded Iris dataset
+ """
+    iris = load_iris()
+    df = pd.DataFrame(iris.data, columns=iris.feature_names)
+    df['Species'] = iris.target_names[iris.target]
+    # Align column names with the Kaggle CSV so downstream code works either way
+    df.columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'Species']
+    print(f"Dataset loaded from sklearn with {len(df)} samples and {len(df.columns)} columns")
+    return df
+
+
+def preprocess_data(df, test_size=0.2, random_state=42):
+ """
+ Preprocess the Iris dataset for machine learning.
+
+ Parameters:
+ -----------
+ df : pandas.DataFrame
+ Input dataset
+ test_size : float
+ Proportion of dataset to include in the test split
+ random_state : int
+ Random state for reproducibility
+
+ Returns:
+ --------
+ tuple
+ X_train, X_test, y_train, y_test, scaler
+ """
+    # Separate features and target (drop the Id column if the CSV includes one)
+    X = df.drop(columns=['Species', 'Id'], errors='ignore')
+    y = df['Species']
+
+ # Split the data
+ X_train, X_test, y_train, y_test = train_test_split(
+ X, y, test_size=test_size, random_state=random_state, stratify=y
+ )
+
+ # Scale the features
+ scaler = StandardScaler()
+ X_train_scaled = scaler.fit_transform(X_train)
+ X_test_scaled = scaler.transform(X_test)
+
+ print(f"Data preprocessed: Train set {X_train.shape[0]} samples, Test set {X_test.shape[0]} samples")
+
+ return X_train_scaled, X_test_scaled, y_train, y_test, scaler
+
+
+def get_dataset_info(df):
+ """
+ Get basic information about the dataset.
+
+ Parameters:
+ -----------
+ df : pandas.DataFrame
+ Input dataset
+
+ Returns:
+ --------
+ dict
+ Dictionary containing dataset information
+ """
+ info = {
+ 'shape': df.shape,
+ 'columns': list(df.columns),
+ 'dtypes': df.dtypes.to_dict(),
+ 'missing_values': df.isnull().sum().to_dict(),
+ 'species_distribution': df['Species'].value_counts().to_dict() if 'Species' in df.columns else None
+ }
+ return info
\ No newline at end of file
diff --git a/src/models.py b/src/models.py
new file mode 100644
index 0000000..486b3a5
--- /dev/null
+++ b/src/models.py
@@ -0,0 +1,192 @@
+"""
+Machine Learning Models Module for Iris Dataset
+===============================================
+
+This module provides various machine learning models and evaluation functions
+for the Iris classification problem.
+"""
+
+import numpy as np
+import pandas as pd
+from sklearn.linear_model import LogisticRegression
+from sklearn.svm import SVC
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.neighbors import KNeighborsClassifier
+from sklearn.tree import DecisionTreeClassifier
+from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
+from sklearn.model_selection import cross_val_score
+import joblib
+import os
+
+
+class IrisClassifier:
+ """
+ A comprehensive classifier for the Iris dataset with multiple algorithms.
+ """
+
+ def __init__(self):
+ self.models = {
+ 'logistic_regression': LogisticRegression(random_state=42),
+ 'svm': SVC(random_state=42),
+ 'random_forest': RandomForestClassifier(random_state=42),
+ 'knn': KNeighborsClassifier(n_neighbors=3),
+ 'decision_tree': DecisionTreeClassifier(random_state=42)
+ }
+ self.best_model = None
+ self.best_score = 0
+ self.best_model_name = None
+
+ def train_all_models(self, X_train, y_train, cv=5):
+ """
+ Train all models and find the best performing one.
+
+ Parameters:
+ -----------
+ X_train : array-like
+ Training features
+ y_train : array-like
+ Training labels
+ cv : int
+ Number of cross-validation folds
+
+ Returns:
+ --------
+ dict
+ Dictionary with model names and their cross-validation scores
+ """
+ results = {}
+
+ for name, model in self.models.items():
+ print(f"Training {name}...")
+ scores = cross_val_score(model, X_train, y_train, cv=cv)
+ mean_score = scores.mean()
+ std_score = scores.std()
+
+ results[name] = {
+ 'mean_score': mean_score,
+ 'std_score': std_score,
+ 'scores': scores
+ }
+
+ print(f"{name}: {mean_score:.4f} (+/- {std_score * 2:.4f})")
+
+ # Update best model
+ if mean_score > self.best_score:
+ self.best_score = mean_score
+ self.best_model = model
+ self.best_model_name = name
+
+ print(f"\nBest model: {self.best_model_name} with score: {self.best_score:.4f}")
+ return results
+
+ def train_best_model(self, X_train, y_train):
+ """
+ Train the best performing model on the full training set.
+
+ Parameters:
+ -----------
+ X_train : array-like
+ Training features
+ y_train : array-like
+ Training labels
+ """
+ if self.best_model is None:
+ raise ValueError("No best model selected. Run train_all_models first.")
+
+ print(f"Training best model ({self.best_model_name}) on full dataset...")
+ self.best_model.fit(X_train, y_train)
+ print("Training completed!")
+
+ def evaluate_model(self, X_test, y_test, model=None):
+ """
+ Evaluate a model on the test set.
+
+ Parameters:
+ -----------
+ X_test : array-like
+ Test features
+ y_test : array-like
+ Test labels
+ model : sklearn estimator, optional
+ Model to evaluate. If None, uses the best model.
+
+ Returns:
+ --------
+ dict
+ Dictionary with evaluation metrics
+ """
+ if model is None:
+ model = self.best_model
+
+ if model is None:
+ raise ValueError("No model available for evaluation.")
+
+ y_pred = model.predict(X_test)
+
+ results = {
+ 'accuracy': accuracy_score(y_test, y_pred),
+ 'classification_report': classification_report(y_test, y_pred, output_dict=True),
+ 'confusion_matrix': confusion_matrix(y_test, y_pred)
+ }
+
+ return results
+
+    def save_model(self, filepath='models/best_iris_model.pkl'):
+ """
+ Save the best model to disk.
+
+ Parameters:
+ -----------
+ filepath : str
+ Path where to save the model
+ """
+ if self.best_model is None:
+ raise ValueError("No model to save. Train a model first.")
+
+        # Create the target directory only if the path includes one
+        # (os.makedirs('') raises for bare filenames like 'my_iris_model.pkl')
+        directory = os.path.dirname(filepath)
+        if directory:
+            os.makedirs(directory, exist_ok=True)
+
+ joblib.dump(self.best_model, filepath)
+ print(f"Model saved to {filepath}")
+
+    def load_model(self, filepath='models/best_iris_model.pkl'):
+ """
+ Load a saved model from disk.
+
+ Parameters:
+ -----------
+ filepath : str
+ Path to the saved model
+ """
+ if os.path.exists(filepath):
+ self.best_model = joblib.load(filepath)
+ print(f"Model loaded from {filepath}")
+ else:
+ print(f"Model file not found: {filepath}")
+
+
+def create_model_comparison_report(results):
+ """
+ Create a comparison report of all models.
+
+ Parameters:
+ -----------
+ results : dict
+ Results from train_all_models
+
+ Returns:
+ --------
+ pandas.DataFrame
+ Comparison table
+ """
+ comparison_data = []
+
+ for model_name, result in results.items():
+ comparison_data.append({
+ 'Model': model_name,
+ 'Mean CV Score': f"{result['mean_score']:.4f}",
+ 'Std CV Score': f"{result['std_score']:.4f}",
+ 'Score Range': f"{result['mean_score'] - result['std_score']:.4f} - {result['mean_score'] + result['std_score']:.4f}"
+ })
+
+ return pd.DataFrame(comparison_data)
\ No newline at end of file
diff --git a/src/visualization.py b/src/visualization.py
new file mode 100644
index 0000000..31f0a1b
--- /dev/null
+++ b/src/visualization.py
@@ -0,0 +1,272 @@
+"""
+Visualization Module for Iris Dataset
+=====================================
+
+This module provides comprehensive visualization functions for data exploration
+and model results analysis.
+"""
+
+import matplotlib.pyplot as plt
+import seaborn as sns
+import pandas as pd
+import numpy as np
+from sklearn.metrics import confusion_matrix
+import warnings
+warnings.filterwarnings('ignore')
+
+# Set style for better looking plots
+plt.style.use('seaborn-v0_8')
+sns.set_palette("husl")
+
+
+def plot_data_distribution(df, figsize=(15, 10)):
+ """
+ Plot distribution of features by species.
+
+ Parameters:
+ -----------
+ df : pandas.DataFrame
+ Iris dataset
+ figsize : tuple
+ Figure size
+ """
+ features = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
+
+ fig, axes = plt.subplots(2, 2, figsize=figsize)
+ fig.suptitle('Feature Distributions by Species', fontsize=16, fontweight='bold')
+
+ for i, feature in enumerate(features):
+ row = i // 2
+ col = i % 2
+
+ for species in df['Species'].unique():
+ species_data = df[df['Species'] == species][feature]
+ axes[row, col].hist(species_data, alpha=0.7, label=species, bins=15)
+
+ axes[row, col].set_title(f'{feature} Distribution')
+ axes[row, col].set_xlabel(feature)
+ axes[row, col].set_ylabel('Frequency')
+ axes[row, col].legend()
+ axes[row, col].grid(True, alpha=0.3)
+
+ plt.tight_layout()
+ plt.show()
+
+
+def plot_correlation_matrix(df, figsize=(10, 8)):
+ """
+ Plot correlation matrix heatmap.
+
+ Parameters:
+ -----------
+ df : pandas.DataFrame
+ Iris dataset
+ figsize : tuple
+ Figure size
+ """
+ # Select numeric columns
+ numeric_cols = df.select_dtypes(include=[np.number]).columns
+ correlation_matrix = df[numeric_cols].corr()
+
+ plt.figure(figsize=figsize)
+ sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
+ square=True, linewidths=0.5, cbar_kws={"shrink": .8})
+ plt.title('Feature Correlation Matrix', fontsize=16, fontweight='bold')
+ plt.tight_layout()
+ plt.show()
+
+
+def plot_pairplot(df, hue='Species', figsize=(12, 10)):
+ """
+ Create a pairplot for feature relationships.
+
+ Parameters:
+ -----------
+ df : pandas.DataFrame
+ Iris dataset
+ hue : str
+ Column to use for color coding
+ figsize : tuple
+ Figure size
+ """
+ features = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
+
+    # pairplot creates its own figure; size it via `height` instead of plt.figure
+    g = sns.pairplot(df[features + [hue]], hue=hue, diag_kind='hist',
+                     plot_kws={'alpha': 0.7}, diag_kws={'alpha': 0.7},
+                     height=figsize[1] / 4)
+    g.fig.suptitle('Feature Relationships by Species', y=1.02, fontsize=16, fontweight='bold')
+    plt.show()
+
+
+def plot_boxplots(df, figsize=(15, 8)):
+ """
+ Create boxplots for feature distributions by species.
+
+ Parameters:
+ -----------
+ df : pandas.DataFrame
+ Iris dataset
+ figsize : tuple
+ Figure size
+ """
+ features = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
+
+ fig, axes = plt.subplots(1, 4, figsize=figsize)
+ fig.suptitle('Feature Distributions by Species (Boxplots)', fontsize=16, fontweight='bold')
+
+ for i, feature in enumerate(features):
+ sns.boxplot(data=df, x='Species', y=feature, ax=axes[i])
+ axes[i].set_title(f'{feature} by Species')
+ axes[i].set_xlabel('Species')
+ axes[i].set_ylabel(feature)
+ axes[i].tick_params(axis='x', rotation=45)
+
+ plt.tight_layout()
+ plt.show()
+
+
+def plot_confusion_matrix(y_true, y_pred, class_names=None, figsize=(8, 6)):
+ """
+ Plot confusion matrix.
+
+ Parameters:
+ -----------
+ y_true : array-like
+ True labels
+ y_pred : array-like
+ Predicted labels
+ class_names : list, optional
+ Names of the classes
+ figsize : tuple
+ Figure size
+ """
+ cm = confusion_matrix(y_true, y_pred)
+
+ plt.figure(figsize=figsize)
+ sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
+ xticklabels=class_names, yticklabels=class_names)
+ plt.title('Confusion Matrix', fontsize=16, fontweight='bold')
+ plt.xlabel('Predicted Label')
+ plt.ylabel('True Label')
+ plt.tight_layout()
+ plt.show()
+
+
+def plot_model_comparison(results, figsize=(12, 6)):
+ """
+ Plot model comparison results.
+
+ Parameters:
+ -----------
+ results : dict
+ Results from model training
+ figsize : tuple
+ Figure size
+ """
+ model_names = list(results.keys())
+ mean_scores = [results[name]['mean_score'] for name in model_names]
+ std_scores = [results[name]['std_score'] for name in model_names]
+
+ fig, (ax1, ax2) = plt.subplots(1, 2, figsize=figsize)
+
+ # Bar plot of mean scores
+ bars = ax1.bar(model_names, mean_scores, yerr=std_scores,
+ capsize=5, alpha=0.7, color='skyblue', edgecolor='navy')
+ ax1.set_title('Model Performance Comparison', fontweight='bold')
+ ax1.set_ylabel('Cross-Validation Score')
+ ax1.set_xlabel('Models')
+ ax1.tick_params(axis='x', rotation=45)
+ ax1.grid(True, alpha=0.3)
+
+ # Add value labels on bars
+ for bar, score in zip(bars, mean_scores):
+ height = bar.get_height()
+ ax1.text(bar.get_x() + bar.get_width()/2., height + 0.01,
+ f'{score:.3f}', ha='center', va='bottom', fontweight='bold')
+
+ # Score distribution
+ for i, name in enumerate(model_names):
+ scores = results[name]['scores']
+ ax2.hist(scores, alpha=0.7, label=name, bins=10)
+
+ ax2.set_title('Score Distributions', fontweight='bold')
+ ax2.set_xlabel('Cross-Validation Score')
+ ax2.set_ylabel('Frequency')
+ ax2.legend()
+ ax2.grid(True, alpha=0.3)
+
+ plt.tight_layout()
+ plt.show()
+
+
+def plot_feature_importance(model, feature_names, figsize=(10, 6)):
+ """
+ Plot feature importance for tree-based models.
+
+ Parameters:
+ -----------
+ model : sklearn estimator
+ Trained model with feature_importances_ attribute
+ feature_names : list
+ Names of the features
+ figsize : tuple
+ Figure size
+ """
+ if hasattr(model, 'feature_importances_'):
+ importances = model.feature_importances_
+ indices = np.argsort(importances)[::-1]
+
+ plt.figure(figsize=figsize)
+ plt.title('Feature Importance', fontsize=16, fontweight='bold')
+ plt.bar(range(len(importances)), importances[indices],
+ color='lightcoral', alpha=0.7)
+ plt.xticks(range(len(importances)), [feature_names[i] for i in indices],
+ rotation=45, ha='right')
+ plt.xlabel('Features')
+ plt.ylabel('Importance')
+ plt.grid(True, alpha=0.3)
+ plt.tight_layout()
+ plt.show()
+ else:
+ print("This model doesn't have feature importance attribute.")
+
+
+def create_summary_plots(df, results=None, y_true=None, y_pred=None,
+ model=None, feature_names=None):
+ """
+ Create a comprehensive set of summary plots.
+
+ Parameters:
+ -----------
+ df : pandas.DataFrame
+ Iris dataset
+ results : dict, optional
+ Model training results
+ y_true : array-like, optional
+ True labels for confusion matrix
+ y_pred : array-like, optional
+ Predicted labels for confusion matrix
+ model : sklearn estimator, optional
+ Trained model for feature importance
+ feature_names : list, optional
+ Feature names for importance plot
+ """
+ print("Creating comprehensive visualization summary...")
+
+ # Data exploration plots
+ plot_data_distribution(df)
+ plot_correlation_matrix(df)
+ plot_pairplot(df)
+ plot_boxplots(df)
+
+ # Model evaluation plots
+ if results is not None:
+ plot_model_comparison(results)
+
+ if y_true is not None and y_pred is not None:
+ plot_confusion_matrix(y_true, y_pred)
+
+ if model is not None and feature_names is not None:
+ plot_feature_importance(model, feature_names)
+
+ print("Visualization summary completed!")
\ No newline at end of file