A complete end-to-end machine learning project for text summarization using the HuggingFace Pegasus model. This project demonstrates a production-ready ML pipeline with proper modular architecture, configuration management, API deployment, and containerization.
- Overview
- Project Architecture
- Features
- Getting Started
- Project Structure
- Configuration
- Usage
- API Endpoints
- Model Training Pipeline
- Deployment
- External Tools and Dependencies
- Best Practices Implemented
- Troubleshooting
- Contributing
This project provides a complete solution for summarizing conversational text. It fine-tunes Google's Pegasus model on the SAMSum dataset to generate concise summaries from dialogue. The entire system follows MLOps best practices, featuring a modular architectural design, comprehensive logging, a repeatable training pipeline, configuration management, and a deployed API for easy inference.
The pipeline processes conversational data and generates concise summaries, making it ideal for chat summarization, meeting notes, and dialogue analysis applications. This repository is designed to be both a functional application and a learning guide for building robust, production-level ML systems.
The project follows a modular, pipeline-based architecture that separates concerns, making it scalable, maintainable, and easy to debug.
- Why this architecture? A modular design allows individual components (like data ingestion or model training) to be developed, tested, and updated independently without affecting the rest of the system. This is crucial for collaborative projects and long-term maintenance.
- Data Ingestion: Downloads and extracts the SAMSum dataset
- Data Transformation: Tokenizes and preprocesses text data for model training
- Model Training: Fine-tunes the Pegasus model on the prepared dataset
- Model Evaluation: Assesses model performance using ROUGE metrics
- Prediction Pipeline: Serves the trained model for real-time inference
- Modular Architecture: Each component can be developed, tested, and executed independently, allowing for iterative development and easier debugging
- Pipeline Pattern: Sequential processing stages with clear boundaries enable stage-by-stage development and validation
- Configuration Management: Centralized YAML-based configuration supports environment-specific deployments
- Entity-Component Pattern: Clear data structures and component interfaces promote code reusability and maintainability
- Dependency Injection: Configurable components provide flexibility for different execution environments
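To make these patterns concrete, here is a minimal sketch of the entity + configuration-manager combination, assuming plain PyYAML and an illustrative `root_dir` key (the project's actual implementation lives in `src/text_summarizer/entity/` and `src/text_summarizer/config/configuration.py`):

```python
from dataclasses import dataclass
from pathlib import Path

import yaml


@dataclass(frozen=True)
class DataIngestionConfig:
    """Entity: an immutable record of everything the ingestion stage needs."""
    root_dir: Path
    source_url: str


class ConfigurationManager:
    """Reads config.yaml once and hands each component its own typed config entity."""

    def __init__(self, config_path: Path = Path("config/config.yaml")) -> None:
        with open(config_path) as f:
            self.config = yaml.safe_load(f)

    def get_data_ingestion_config(self) -> DataIngestionConfig:
        section = self.config["data_ingestion"]
        return DataIngestionConfig(
            root_dir=Path(section["root_dir"]),  # illustrative key name
            source_url=section["source_URL"],
        )
```

Because each component receives only its own config entity, changing a path or URL in config.yaml never requires touching component code.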
- Production-Ready Pipeline: Complete ML workflow from data ingestion to deployment
- Flexible Configuration: YAML-based configuration for easy management of paths and hyperparameter tuning
- Multi-Device Support: Automatic detection and utilization of CPU, CUDA (NVIDIA GPU), and Apple Silicon (MPS)
- RESTful API: FastAPI-based web service for serving the summarization model
- Comprehensive Logging: Detailed logging throughout the pipeline for easy monitoring and debugging
- Model Evaluation: ROUGE score calculation for performance assessment
- Dockerized Deployment: Easy containerization and deployment using the provided Dockerfile
Follow these steps to set up and run the project locally.
Before starting this project, ensure you have the following installed:
- Python 3.8+
- Git
- Hardware Requirements:
- Minimum 8GB RAM (16GB recommended for training)
- GPU support (optional but recommended for faster training)
- 5GB free disk space for datasets and models
First, clone the repository and navigate to the project directory:
```bash
git clone https://github.com/GoJo-Rika/Text-Summarizer-Using-HuggingFace-Transformers.git
cd Text-Summarizer-Using-HuggingFace-Transformers
```

We recommend using uv, a fast, next-generation Python package manager.
- Install `uv` on your system if you haven't already:

  ```bash
  # On macOS and Linux
  curl -LsSf https://astral.sh/uv/install.sh | sh

  # On Windows
  powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
  ```

- Create a virtual environment and install dependencies with a single command:

  ```bash
  uv sync
  ```

  This command automatically creates a `.venv` folder and installs all required packages from `requirements.txt`.

Note: For a comprehensive guide on `uv`, check out this detailed tutorial: uv-tutorial-guide.
If you prefer to use the standard venv and pip:

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use: venv\Scripts\activate
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt  # Using uv: uv add -r requirements.txt
  ```
Create the necessary directory structure and empty files for the project using the following command:
```bash
python template.py  # Using uv: uv run template.py
```

```
Text-Summarizer/
├── artifacts/                      # Stores outputs: data, models, and metrics
├── config/
│   └── config.yaml                 # Static configuration (paths, model names)
├── logs/                           # Application logs
├── research/                       # Jupyter notebooks for experimentation
├── src/
│   └── text_summarizer/
│       ├── components/             # Core ML component logic for each pipeline stage
│       │   ├── data_ingestion.py
│       │   ├── data_transformation.py
│       │   ├── model_trainer.py
│       │   └── model_evaluation.py
│       ├── config/                 # Configuration manager logic
│       │   └── configuration.py
│       ├── entity/                 # Custom data structures (dataclasses) and entities
│       │   └── __init__.py
│       ├── pipeline/               # Orchestrates the ML workflow stages
│       │   ├── stage_1_data_ingestion_pipeline.py
│       │   ├── stage_2_data_transformation_pipeline.py
│       │   ├── stage_3_model_trainer_pipeline.py
│       │   ├── stage_4_model_evaluation_pipeline.py
│       │   └── prediction_pipeline.py
│       └── utils/                  # Helper functions (e.g., reading YAML)
│           └── common.py
├── app.py                          # FastAPI web application for prediction
├── main.py                         # Main script to run the training pipeline
├── params.yaml                     # Tunable hyperparameters for training
├── requirements.txt                # Python dependencies
└── Dockerfile                      # Docker containerization configuration for deployment
```
The project uses two separate YAML files for configuration, a common best practice.
`config/config.yaml` holds static configuration such as file paths, artifact directories, and pre-trained model names. These values rarely change.
```yaml
data_ingestion:
  source_URL: "https://github.com/GoJo-Rika/datasets/raw/refs/heads/main/summarizer-data.zip"

model_trainer:
  model_ckpt: "google/pegasus-cnn_dailymail"
```

`params.yaml` contains the hyperparameters for model training (e.g., learning rate, batch size, epochs). This allows for easy tuning and experimentation without modifying the core application code.
```yaml
TrainingArguments:
  num_train_epochs: 1
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 16
```

To run the complete training pipeline from scratch, execute `main.py`:
```bash
python main.py  # Using uv: uv run main.py
```

This command executes all four pipeline stages sequentially:
- Data Ingestion: Downloads and extracts the dataset.
- Data Transformation: Preprocesses and tokenizes the data for the model.
- Model Training: Fine-tunes the Pegasus model.
- Model Evaluation: Calculates ROUGE performance metrics.
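Conceptually, main.py runs each stage inside its own logged try/except block. A simplified sketch (only `DataIngestionTrainingPipeline` and `initiate_data_ingestion` are taken from the project; the logging format is illustrative):

```python
import logging

from src.text_summarizer.pipeline.stage_1_data_ingestion_pipeline import (
    DataIngestionTrainingPipeline,
)

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

STAGE_NAME = "Data Ingestion"
try:
    logger.info(">>> %s stage started <<<", STAGE_NAME)
    DataIngestionTrainingPipeline().initiate_data_ingestion()
    logger.info(">>> %s stage completed <<<", STAGE_NAME)
except Exception:
    logger.exception("%s stage failed", STAGE_NAME)
    raise

# The transformation, training, and evaluation stages follow the same
# try/log/except pattern, so any failure is reported with its stage name.
```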
Note on Model Training: The model training stage in main.py is commented out by default to prevent accidental, resource-intensive retraining. To run a full training session, uncomment the relevant lines in main.py.
You can also run individual components for testing or debugging:
```python
from src.text_summarizer.pipeline.stage_1_data_ingestion_pipeline import DataIngestionTrainingPipeline

# Run only data ingestion
pipeline = DataIngestionTrainingPipeline()
pipeline.initiate_data_ingestion()
```

To start the web service and serve the trained model for inference via a REST API, run `app.py`:
```bash
python app.py  # Using uv: uv run app.py
```

The server will start on http://localhost:8080, with interactive API documentation available at http://localhost:8080/docs.
`GET /`: Redirects to the interactive API documentation.

`GET /train`: Triggers the complete training pipeline. Useful for retraining the model via an API call.

```bash
curl -X GET "http://localhost:8080/train"
```

`POST /predict`: Generates a summary for the provided text.

```bash
curl -X POST "http://localhost:8080/predict?text=Your%20text%20to%20summarize%20here"
```

Example:

```bash
curl -X POST "http://localhost:8080/predict?text=Alice%3A%20Hey%2C%20I%20can't%20make%20it%20to%20the%20meeting%20this%20afternoon.%20Bob%3A%20No%20problem!%20I'll%20send%20you%20the%20notes."
```
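The same endpoint can be called from Python. A small client sketch using the requests library (not one of this project's dependencies, so install it separately; the response shape depends on app.py):

```python
import requests

dialogue = (
    "Alice: Hey, I can't make it to the meeting this afternoon. "
    "Bob: No problem! I'll send you the notes."
)

# The endpoint takes the input as a "text" query parameter, as in the curl examples above
response = requests.post("http://localhost:8080/predict", params={"text": dialogue})
response.raise_for_status()

# Print the raw payload; the exact response format is defined in app.py
print(response.text)
```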
The training pipeline consists of four distinct, reusable stages:

Stage 1: Data Ingestion
- Downloads the SAMSum dataset from the configured URL
- Extracts the ZIP file to the artifacts directory
- Validates data integrity and structure
- Code: `src/text_summarizer/pipeline/stage_1_data_ingestion_pipeline.py`
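Stripped of the project's config plumbing, the heart of this stage is a download-then-extract step. A self-contained sketch with illustrative paths:

```python
import zipfile
from pathlib import Path
from urllib import request

SOURCE_URL = (
    "https://github.com/GoJo-Rika/datasets/raw/refs/heads/main/summarizer-data.zip"
)
ROOT_DIR = Path("artifacts/data_ingestion")  # illustrative path
ZIP_PATH = ROOT_DIR / "summarizer-data.zip"

ROOT_DIR.mkdir(parents=True, exist_ok=True)

# Download the dataset archive only if it is not already cached locally
if not ZIP_PATH.exists():
    request.urlretrieve(SOURCE_URL, ZIP_PATH)

# Extract every file in the archive into the artifacts directory
with zipfile.ZipFile(ZIP_PATH) as zf:
    zf.extractall(ROOT_DIR)
```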
Stage 2: Data Transformation
- Loads the raw dataset using HuggingFace datasets
- Tokenizes dialogue and summary pairs using the Pegasus tokenizer
- Applies appropriate truncation and padding strategies
- Saves the processed dataset for training
- Code: `src/text_summarizer/pipeline/stage_2_data_transformation_pipeline.py`
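In essence, this stage maps each dialogue/summary pair through the Pegasus tokenizer. A condensed sketch using the HuggingFace datasets map API (paths and max lengths are illustrative; the `dialogue` and `summary` column names come from SAMSum):

```python
from datasets import load_from_disk
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/pegasus-cnn_dailymail")

def tokenize_pair(batch):
    # Encode the dialogue as model input and the summary as target labels
    model_inputs = tokenizer(batch["dialogue"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Illustrative artifact paths; the real ones come from config.yaml
dataset = load_from_disk("artifacts/data_ingestion/samsum_dataset")
tokenized = dataset.map(tokenize_pair, batched=True)
tokenized.save_to_disk("artifacts/data_transformation/samsum_dataset")
```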
Stage 3: Model Training
- Loads the pre-trained Pegasus model
- Configures training arguments from params.yaml
- Implements data collation for sequence-to-sequence tasks
- Fine-tunes the model on the SAMSum dataset
- Saves the trained model and tokenizer
- Code: `src/text_summarizer/pipeline/stage_3_model_trainer_pipeline.py`
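Put together, the stage looks roughly like the following sketch (hyperparameter values mirror the params.yaml excerpt above; paths are illustrative, and the real stage reads everything from configuration):

```python
from datasets import load_from_disk
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
)

checkpoint = "google/pegasus-cnn_dailymail"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Pads inputs and labels dynamically per batch for seq2seq training
collator = DataCollatorForSeq2Seq(tokenizer, model=model)

tokenized = load_from_disk("artifacts/data_transformation/samsum_dataset")

args = TrainingArguments(
    output_dir="artifacts/model_trainer",  # illustrative path
    num_train_epochs=1,                    # values mirror params.yaml
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
)
trainer.train()

model.save_pretrained("artifacts/model_trainer/pegasus-samsum-model")
tokenizer.save_pretrained("artifacts/model_trainer/tokenizer")
```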
Training Environment: The model training was performed on Google Colab's free tier using T4 GPU, achieving significant performance improvements over local CPU training. The complete training process took approximately 10 minutes per epoch, with the full pipeline validation taking around 40 minutes including model downloading and file transfers. The modular architecture proved particularly valuable during development, allowing individual pipeline stages to be tested locally before moving to GPU-accelerated training in the cloud environment.
Stage 4: Model Evaluation
- Evaluates the model using ROUGE metrics (ROUGE-1, ROUGE-2, ROUGE-L)
- Generates performance reports and saves metrics to CSV
- Provides quantitative assessment of summarization quality
- Code: `src/text_summarizer/pipeline/stage_4_model_evaluation_pipeline.py`
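At its core, the stage's metric computation uses the HuggingFace evaluate library. A stripped-down sketch with illustrative strings and output path:

```python
from pathlib import Path

import evaluate
import pandas as pd

rouge = evaluate.load("rouge")

predictions = ["bob will send alice the meeting notes ."]  # model outputs (illustrative)
references = ["Bob will send Alice the notes from the meeting."]

# Returns a dict of ROUGE-1 / ROUGE-2 / ROUGE-L (and ROUGE-Lsum) scores
scores = rouge.compute(predictions=predictions, references=references)

# Persist the metrics to CSV, as the evaluation stage does
out_dir = Path("artifacts/model_evaluation")  # illustrative path
out_dir.mkdir(parents=True, exist_ok=True)
pd.DataFrame([scores]).to_csv(out_dir / "metrics.csv", index=False)
```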
Build and run the application in a Docker container:
```bash
# Build the Docker image
docker build -t text-summarizer .

# Run the container
docker run -p 8080:8080 text-summarizer
```

For production deployment, consider:
- Environment Variables: Use environment variables for sensitive configurations
- Model Versioning: Implement model versioning and rollback capabilities
- Monitoring: Add application and model performance monitoring
- Scaling: Use container orchestration platforms like Kubernetes
- Security: Implement authentication and input validation
- Transformers: HuggingFace library for transformer models
- Datasets: HuggingFace datasets library for data loading
- Torch: PyTorch for deep learning operations
- Evaluate: HuggingFace evaluate library for metrics calculation
- Pandas: Data manipulation and analysis
- NLTK: Natural language processing utilities
- py7zr: Archive extraction support
- FastAPI: Modern web framework for building APIs
- Uvicorn: ASGI server for running FastAPI applications
- PyYAML: YAML parsing and configuration management
- python-box: Enhanced dictionary access for configurations
- ensure: Type checking and validation decorators
- ROUGE Score: Text summarization evaluation metrics
- sacrebleu: BLEU score calculation for text generation
- Weights & Biases (wandb): Experiment tracking and artifact storage for model training metrics and logs
- Cloud Integration: Google Colab integration for GPU-accelerated training on free tier resources
- Modular Architecture: Clear separation of concerns with dedicated components, pipelines, and utilities
- Configuration Management: Centralized YAML configuration files for easy parameter tuning
- Logging Strategy: Comprehensive logging throughout the pipeline for traceability and debugging
- Error Handling: Proper exception handling and error reporting
- Pipeline Pattern: Sequential processing stages with clear interfaces
- Data Validation: Input validation and data integrity checks
- Model Versioning: Organized model saving and loading procedures
- Evaluation Framework: Systematic model evaluation using standard metrics
- Type Hints: Enforced type hints for improved code quality and readability
- Documentation: Comprehensive docstrings and comments
- Dependency Management: Clear `requirements.txt` specification for reproducible environments
- Environment Isolation: Clear instructions for using virtual environments (`uv` or `venv`)
If you encounter out-of-memory errors:
- Reduce `per_device_train_batch_size` in params.yaml
- Increase `gradient_accumulation_steps` to maintain the effective batch size
- Consider using gradient checkpointing for memory optimization
If the model fails to load:
- Verify internet connectivity for downloading pre-trained models
- Check HuggingFace model hub availability
- Ensure sufficient disk space for model files
If the API server fails to start:
- Check that port 8080 is not already in use by another application
- Verify all dependencies are installed correctly
- Review application logs for specific error messages
For device-specific issues:
- Apple Silicon: Ensure MPS support is available in your PyTorch installation
- CUDA: Verify CUDA drivers and PyTorch GPU support
- CPU: The system falls back to CPU automatically if GPU is unavailable
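For reference, the automatic fallback described above typically comes down to a check like this generic PyTorch idiom (a sketch, not the project's exact code):

```python
import torch

def resolve_device() -> str:
    """Pick the best available backend: CUDA, then Apple MPS, then CPU."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

device = resolve_device()
print(f"Using device: {device}")
```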
To contribute to this project:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Note: This project is designed for educational and research purposes. For production use, consider additional security measures, monitoring, and scalability optimizations based on your specific requirements.
For questions or issues, please refer to the project's issue tracker or contact the maintainers.