Commit eb73f0d: Add sonnet-4 vs openai o4
1 parent 32c36b2

10 files changed: +3116 -0 lines changed


sonnet4-vs-o4/.env.example

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
ANTHROPIC_API_KEY=your_anthropic_api_key
OPENAI_API_KEY=your_openai_api_key

sonnet4-vs-o4/.gitignore

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
# Python-generated files
__pycache__/
*.py[cod]
build/
dist/
wheels/
*.egg-info/
*.egg
.eggs/
.Python
develop-eggs/
downloads/
lib/
lib64/
parts/
sdist/
var/
.installed.cfg

# Virtual environments
.venv
venv/
ENV/
env/
.env

# IDE specific files
.idea/
.vscode/
*.swp
*.swo
.DS_Store
.project
.pydevproject
.settings/
*.sublime-workspace
*.sublime-project

# Testing and coverage
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/
htmlcov/

# Documentation
docs/_build/
site/

# Jupyter Notebook
.ipynb_checkpoints

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Logs and databases
*.log
*.sqlite
*.db

# Environment variables
.env
.env.local
.env.*.local

sonnet4-vs-o4/.python-version

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
3.12

sonnet4-vs-o4/README.md

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
# Claude Sonnet 4 vs OpenAI o4-mini on code generation using DeepEval

This application compares the code generation capabilities of Claude Sonnet 4 and OpenAI o4-mini using DeepEval metrics. The app lets users ingest code from a GitHub repository as context and generate new code based on that context. Both models run in parallel, side by side, giving a fair comparison of their capabilities. Finally, DeepEval evaluates both models on custom code metrics and provides a detailed performance comparison with clean visuals.

We use:
- LiteLLM for orchestration
- DeepEval for evaluation
- Gitingest for ingesting code
- Streamlit for the UI
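
A minimal sketch of the orchestration, assuming LiteLLM's async `acompletion` API; the model identifier strings and the `generate_both` helper are illustrative, not taken from this repository:

```python
import asyncio

from litellm import acompletion  # async variant of litellm.completion

# Model names are assumptions for illustration; check LiteLLM's provider
# docs for the exact identifiers available to your account.
MODELS = ["anthropic/claude-sonnet-4-20250514", "openai/o4-mini"]


async def generate_both(prompt: str, context: str) -> dict[str, str]:
    """Send the same prompt (plus repository context) to both models at once."""
    messages = [
        {"role": "system", "content": f"Repository context:\n{context}"},
        {"role": "user", "content": prompt},
    ]
    # asyncio.gather fires both API calls concurrently rather than one
    # after the other, so the side-by-side responses arrive together.
    responses = await asyncio.gather(
        *(acompletion(model=m, messages=messages) for m in MODELS)
    )
    return {m: r.choices[0].message.content for m, r in zip(MODELS, responses)}


# Usage: results = asyncio.run(generate_both("Add a retry decorator", repo_context))
```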

---
## Setup and Installation

Ensure you have Python 3.12 or later installed on your system.

Install dependencies:
```bash
uv sync
```

Copy `.env.example` to `.env` and configure the following environment variables:
```
ANTHROPIC_API_KEY=your_anthropic_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
```
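
LiteLLM reads these keys from the process environment. A minimal sketch, assuming the app loads the `.env` file at startup with `python-dotenv`:

```python
from dotenv import load_dotenv

load_dotenv()  # reads .env and exports ANTHROPIC_API_KEY / OPENAI_API_KEY
```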

Run the Streamlit app:
```bash
streamlit run app.py
```

## Usage

1. Enter a GitHub repository URL in the sidebar
2. Click "Ingest Repository" to load the repository context (see the sketch after this list)
3. Enter your code generation prompt in the chat
4. View the generated code from both models side by side
5. Click "Evaluate Code" to evaluate the code with DeepEval
6. View the evaluation metrics comparing both models' performance
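
Step 2 corresponds to a single Gitingest call; a minimal sketch, where the repository URL and variable names are placeholders:

```python
from gitingest import ingest

# ingest() fetches the repository and returns a summary, a directory tree,
# and the concatenated file contents as plain text.
summary, tree, content = ingest("https://github.com/owner/repo")

repo_context = f"{tree}\n\n{content}"  # supplied to both models as context
```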

## Evaluation Metrics

The app evaluates generated code using three metrics powered by DeepEval:

- **Code Correctness**: Evaluates the functional correctness of the generated code
- **Code Readability**: Measures how easy the code is to understand and maintain
- **Best Practices**: Assesses adherence to coding standards and best practices

Each metric is scored on a scale of 0-10, with the following general interpretation:
- 0-2: Major issues or non-functional code
- 3-5: Basic implementation with significant gaps
- 6-8: Good implementation with minor issues
- 9-10: Excellent implementation meeting all criteria

The overall score is calculated as the average of these three metrics.
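
A minimal sketch of what one such metric could look like with DeepEval's `GEval`; the criteria string and example test case are illustrative, and since `GEval` reports a normalized 0-1 score, a 0-10 display implies scaling by 10:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Illustrative custom metric; the app's actual criteria may differ.
code_correctness = GEval(
    name="Code Correctness",
    criteria="Evaluate whether the generated code is functionally correct "
             "and fulfils the user's request.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)

test_case = LLMTestCase(
    input="Write a function that reverses a string",
    actual_output="def reverse(s: str) -> str:\n    return s[::-1]",
)
code_correctness.measure(test_case)
print(code_correctness.score * 10, code_correctness.reason)  # 0-10 scale
```

Averaging the three metric scores then yields the overall score shown in the UI.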

---

## 📬 Stay Updated with Our Newsletter!
**Get a FREE Data Science eBook** 📖 with 150+ essential lessons in Data Science when you subscribe to our newsletter! Stay in the loop with the latest tutorials, insights, and exclusive resources. [Subscribe now!](https://join.dailydoseofds.com)

[![Daily Dose of Data Science Newsletter](https://github.com/patchy631/ai-engineering/blob/main/resources/join_ddods.png)](https://join.dailydoseofds.com)

---

## Contribution

Contributions are welcome! Please fork the repository and submit a pull request with your improvements.
