# Claude Sonnet 4 vs OpenAI o4-mini on code generation using DeepEval

This application compares the code generation capabilities of Claude Sonnet 4 and OpenAI o4-mini using DeepEval metrics. The app lets users ingest code from a GitHub repository as context and generate new code based on that context. Both models run in parallel, side by side, for a fair comparison of their capabilities (see the sketch below). Finally, DeepEval evaluates both models on custom code metrics and provides a detailed performance comparison with clean visuals.

We use:
- LiteLLM for orchestration
- DeepEval for evaluation
- Gitingest for ingesting code
- Streamlit for the UI

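Here is a minimal sketch of how the two models can be queried in parallel through LiteLLM. The model IDs, function names, and thread-pool approach are assumptions for illustration, not necessarily the app's exact code:

```python
# Hypothetical sketch: querying both models concurrently via LiteLLM.
# The model IDs below are assumptions; check LiteLLM's docs for current names.
from concurrent.futures import ThreadPoolExecutor

from litellm import completion

MODELS = ["anthropic/claude-sonnet-4-20250514", "openai/o4-mini"]

def generate(model: str, prompt: str, context: str) -> str:
    """Ask one model to generate code, using the ingested repo as context."""
    response = completion(
        model=model,
        messages=[
            {"role": "system", "content": f"Repository context:\n{context}"},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

def generate_side_by_side(prompt: str, context: str) -> dict[str, str]:
    # Both calls run concurrently, so neither model gets a latency advantage.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {m: pool.submit(generate, m, prompt, context) for m in MODELS}
        return {m: f.result() for m, f in futures.items()}
```
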
---
## Setup and Installation

Ensure you have Python 3.12 or later installed on your system.

Install the dependencies using uv:
```bash
uv sync
```

Copy `.env.example` to `.env` and configure the following environment variables:
```
ANTHROPIC_API_KEY=your_anthropic_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
```

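LiteLLM picks both keys up from the environment. If you are adapting the code, a minimal sketch of loading them with python-dotenv (assuming that is how `app.py` reads the `.env` file):

```python
# Hypothetical sketch: load API keys from .env into the process environment.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

assert os.getenv("ANTHROPIC_API_KEY"), "ANTHROPIC_API_KEY is not set"
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"
```
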
Run the Streamlit app:
```bash
streamlit run app.py
```

## Usage

1. Enter a GitHub repository URL in the sidebar
2. Click "Ingest Repository" to load the repository context
3. Enter your code generation prompt in the chat
4. View the generated code from both models side by side (sketched below)
5. Click "Evaluate Code" to evaluate the generated code with DeepEval
6. View the evaluation metrics comparing both models' performance

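A rough sketch of how steps 1-4 can fit together with Gitingest and Streamlit, reusing `generate_side_by_side` from the earlier sketch. The widget labels and session-state keys are assumptions, not necessarily the app's exact code:

```python
# Hypothetical sketch of the ingest-and-compare flow.
import streamlit as st
from gitingest import ingest

repo_url = st.sidebar.text_input("GitHub repository URL")
if st.sidebar.button("Ingest Repository"):
    # gitingest returns a summary, a file tree, and the concatenated file contents
    summary, tree, content = ingest(repo_url)
    st.session_state["context"] = content

if prompt := st.chat_input("Enter your code generation prompt"):
    results = generate_side_by_side(prompt, st.session_state["context"])
    left, right = st.columns(2)  # one panel per model
    left.subheader("Claude Sonnet 4")
    left.code(results["anthropic/claude-sonnet-4-20250514"])
    right.subheader("OpenAI o4-mini")
    right.code(results["openai/o4-mini"])
```
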
## Evaluation Metrics

The app evaluates generated code using three metrics powered by DeepEval (a sketch of one metric follows the list):

- **Code Correctness**: Evaluates the functional correctness of the generated code

- **Code Readability**: Measures how easy the code is to understand and maintain

- **Best Practices**: Assesses adherence to coding standards and best practices

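A sketch of how such a metric can be defined with DeepEval's `GEval`. The criteria wording is illustrative, not the app's exact rubric; note that `GEval` natively scores 0-1, so the app presumably rescales to 0-10:

```python
# Hypothetical sketch: a custom code-correctness metric using DeepEval's GEval.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Code Correctness",
    criteria=(
        "Evaluate whether the generated code is functionally correct: "
        "it should satisfy the request in the input and be free of logic errors."
    ),
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)

test_case = LLMTestCase(
    input="Write a function that reverses a string.",
    actual_output="def reverse(s: str) -> str:\n    return s[::-1]",
)
correctness.measure(test_case)
print(correctness.score * 10, correctness.reason)  # rescale 0-1 to 0-10
```
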
Each metric is scored on a scale of 0-10, with the following general interpretation:
- 0-2: Major issues or non-functional code
- 3-5: Basic implementation with significant gaps
- 6-8: Good implementation with minor issues
- 9-10: Excellent implementation meeting all criteria

The overall score is calculated as the average of these three metrics.
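For example, scores of 8 for correctness, 7 for readability, and 9 for best practices give an overall score of (8 + 7 + 9) / 3 = 8.0.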

---

## 📬 Stay Updated with Our Newsletter!
**Get a FREE Data Science eBook** 📖 with 150+ essential lessons in Data Science when you subscribe to our newsletter! Stay in the loop with the latest tutorials, insights, and exclusive resources. [Subscribe now!](https://join.dailydoseofds.com)

[](https://join.dailydoseofds.com)

---

## Contribution

Contributions are welcome! Please fork the repository and submit a pull request with your improvements.