The LLM-powered OCR engine that turns any PDF into beautifully formatted Markdown. It reads your documents like a human, handles messy layouts, and outputs text your AI can actually understand.
Get Started • Key Features • Usage & Examples • Cost Breakdown • Configuration • Project Structure
Swift OCR is the document processor your AI assistant wishes it had. Stop feeding your LLM screenshots and praying it reads them correctly. This tool acts like a professional transcriber, reading every page of your PDF, intelligently handling tables, headers, and mixed layouts, then packaging everything into perfectly structured Markdown so your AI can actually work with it.
| GPT-4 Vision | Parallel Processing | Clean Markdown |
|---|---|---|
| Human-level reading accuracy | Multi-page PDFs in seconds | Tables, headers, and lists, all formatted |
How it slaps:
- You: `curl -X POST "http://localhost:8000/ocr" -F "file=@messy_document.pdf"`
- Swift OCR: Converts pages → Sends to GPT-4 Vision → Formats as Markdown
- You: Get perfectly structured text with tables, headers, and lists intact.
- Result: Your AI finally understands that 50-page contract. ✅
Demo video showcasing the conversion of NASA's Apollo 17 flight documents, complete with unorganized, horizontally and vertically oriented pages, into well-structured Markdown format without breaking a sweat.
Manually extracting text from PDFs is a vibe-killer. Swift OCR makes traditional OCR look ancient.
| ❌ The Old Way (Pain) | ✅ The Swift OCR Way (Glory) |
|---|---|
We're not just running basic OCR. We're using GPT-4 Vision to actually understand your documents: handling rotated pages, complex tables, mixed layouts, and even describing images for accessibility.
Our solution offers an optimal balance of affordability and accuracy that makes enterprise OCR solutions look like highway robbery.
| Metric | Value |
|---|---|
| Avg tokens/page | ~1,500 (including prompt) |
| GPT-4o input cost | $5 per million tokens |
| GPT-4o output cost | $15 per million tokens |
| Cost per 1,000 pages | ~$15 |
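
The ~$15 figure can be reproduced with a little arithmetic. The sketch below is illustrative only: the table gives ~1,500 tokens per page in total, so the even 750/750 split between input and output tokens is an assumption made for this example, and `estimate_cost` is a hypothetical helper, not part of Swift OCR.

```python
# Prices from the table above, in USD per million tokens.
INPUT_PRICE_PER_M = 5.0
OUTPUT_PRICE_PER_M = 15.0


def estimate_cost(pages: int, input_tokens_per_page: int, output_tokens_per_page: int) -> float:
    """Return the estimated cost in USD for OCR-ing `pages` pages."""
    input_cost = pages * input_tokens_per_page / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = pages * output_tokens_per_page / 1_000_000 * OUTPUT_PRICE_PER_M
    return input_cost + output_cost


# Assumption: ~1,500 tokens per page, split evenly into 750 input + 750 output.
cost_per_1000_pages = estimate_cost(1000, 750, 750)
print(f"${cost_per_1000_pages:.2f} per 1,000 pages")  # $15.00 per 1,000 pages
```

Output tokens cost three times as much as input tokens, so text-dense pages that produce long Markdown will land above this estimate and sparse pages below it.
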

| Optimization | Cost per 1,000 pages |
|---|---|
| GPT-4o (default) | ~$15 |
| GPT-4o mini | ~$8 |
| Batch API | ~$4 |

| Solution | Cost per 1,000 pages | Tables? | Markdown? |
|---|---|---|---|
| Swift OCR | ~$15 | ✅ Perfect | ✅ Native |
| CloudConvert (PDFTron) | ~$30 | ❌ No | |
| Adobe Acrobat API | ~$50+ | ✅ Good | ❌ No |
| Tesseract (free) | $0 | ❌ Broken | ❌ No |

Bottom line: Half the cost of competitors, 10x the quality. It's not just about being cheaper; it's about getting output you can actually use.
- Python 3.8+
- Azure OpenAI account (with GPT-4 Vision deployment)
# Clone the repo
git clone https://github.com/yigitkonur/swift-ocr-llm-powered-pdf-to-markdown.git
cd swift-ocr-llm-powered-pdf-to-markdown
# Create virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

Create a `.env` file in the root directory:

# Required
OPENAI_API_KEY=your_openai_api_key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
OPENAI_DEPLOYMENT_ID=your_gpt4_vision_deployment
# Optional (sensible defaults)
OPENAI_API_VERSION=gpt-4o
BATCH_SIZE=1 # Images per OCR request (1-10)
MAX_CONCURRENT_OCR_REQUESTS=5 # Parallel OCR calls
MAX_CONCURRENT_PDF_CONVERSION=4 # Parallel page rendering

# Option 1: Classic uvicorn (backward compatible)
uvicorn main:app --reload
# Option 2: Using the new package
uvicorn swift_ocr.app:app --reload
# Option 3: As a Python module
python -m swift_ocr
# Option 4: With CLI arguments
python -m swift_ocr --host 0.0.0.0 --port 8080 --workers 4

The API is now live at http://127.0.0.1:8000

✨ Pro tip: Check out the auto-generated docs at http://127.0.0.1:8000/docs

POST /ocr
Accept a PDF file upload OR a URL to a PDF. Returns beautifully formatted Markdown.
Upload a PDF file:
curl -X POST "http://127.0.0.1:8000/ocr" \
  -F "file=@/path/to/your/document.pdf"

Process a PDF from URL:

curl -X POST "http://127.0.0.1:8000/ocr" \
-H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf"}'

Response:

{
  "text": "# Document Title\n\n## Section 1\n\nExtracted text with **formatting** preserved...\n\n| Column 1 | Column 2 |\n|----------|----------|\n| Data | Data |"
}

The new response includes additional metadata:

{
"text": "# Document Title\n\n## Section 1\n\nExtracted text...",
"status": "success",
"pages_processed": 5,
"processing_time_ms": 1234
}

Health check:

curl http://127.0.0.1:8000/health

{
"status": "healthy",
"version": "2.0.0",
"timestamp": "2024-01-01T00:00:00Z",
"openai_configured": true
}

| Code | Meaning |
|---|---|
| `200` | Success: Markdown text returned |
| `400` | Bad request (no file/URL, or both provided) |
| `422` | Validation error |
| `429` | Rate limited: retry with backoff |
| `500` | Processing error |
| `504` | Timeout downloading PDF |

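
For callers who prefer Python to curl, the URL mode of `POST /ocr` can be sketched with the standard library alone. This is a minimal client-side illustration, not code from the project: the helper names are hypothetical, and treating `429` and `504` as the retryable codes is an assumption drawn from the table above.

```python
import json
import urllib.request

OCR_ENDPOINT = "http://127.0.0.1:8000/ocr"  # local address used in the examples above

# Assumption: only rate limits (429) and download timeouts (504) are worth retrying.
RETRYABLE_STATUS_CODES = {429, 504}


def build_url_request(pdf_url: str, endpoint: str = OCR_ENDPOINT) -> urllib.request.Request:
    """Build the JSON POST that asks the server to fetch and OCR a remote PDF."""
    body = json.dumps({"url": pdf_url}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def should_retry(status_code: int) -> bool:
    """Return True when the status code signals a transient failure."""
    return status_code in RETRYABLE_STATUS_CODES


def extract_markdown(response_body: bytes) -> str:
    """Pull the Markdown out of the JSON response's "text" field."""
    return json.loads(response_body)["text"]


# Sending the request needs a running server, so the call is left commented out.
# urlopen raises urllib.error.HTTPError for 4xx/5xx; its .code can go to should_retry.
# with urllib.request.urlopen(build_url_request("https://example.com/document.pdf")) as resp:
#     print(extract_markdown(resp.read()))
```
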
| Feature | What It Does | Why You Care |
|---|---|---|
| GPT-4 Vision (human-level OCR) | Uses OpenAI's most capable vision model to read documents | Actually understands context, not just character shapes |
| Parallel Processing (multiprocessing + async) | Converts PDF pages and calls OCR in parallel | 50-page PDF in seconds, not minutes |
| Table Preservation (Markdown tables) | Detects and formats tables as proper Markdown | Your data stays structured, not flattened to gibberish |
| Smart Batching (configurable batch size) | Groups pages to optimize API calls vs accuracy | Balance speed and cost for your use case |
| Retry with Backoff (exponential backoff) | Automatically retries on rate limits and timeouts | Handles API hiccups without crashing |
| Flexible Input (file upload or URL) | Accept PDFs directly or fetch from any URL | Works with your existing workflow |
| Image Descriptions (accessibility-friendly) | Describes non-text elements: `[Image: description]` | Context your AI can actually use |

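
The retry behaviour in the table can be illustrated with a generic exponential backoff loop. This is a sketch of the technique under stated assumptions, not the contents of the project's `core/retry.py`: `RateLimitError`, `retry_with_backoff`, and the default delays are all hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for a rate limit or timeout from the OCR backend."""


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimitError with exponentially growing waits."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Wait 1s, 2s, 4s, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little random jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("max_attempts must be at least 1")
```

The `sleep` parameter exists only so the waits can be observed or skipped in tests; in normal use the default `time.sleep` applies.
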
All settings are managed via environment variables. Tune these for your workload:
| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | (required) | Your Azure OpenAI API key |
| `AZURE_OPENAI_ENDPOINT` | (required) | Your Azure OpenAI endpoint URL |
| `OPENAI_DEPLOYMENT_ID` | (required) | Your GPT-4 Vision deployment ID |
| `OPENAI_API_VERSION` | `gpt-4o` | API version |
| `BATCH_SIZE` | `1` | Pages per OCR request (1-10). Higher = faster but less accurate |
| `MAX_CONCURRENT_OCR_REQUESTS` | `5` | Parallel OCR calls. Increase for throughput |
| `MAX_CONCURRENT_PDF_CONVERSION` | `4` | Parallel page renders. Match your CPU cores |

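
To make the table concrete, here is one way such settings could be read and validated. The project itself uses Pydantic Settings in `config/settings.py`; this sketch swaps in a plain standard-library dataclass so it runs without extra dependencies, and the `Settings` and `load_settings` names are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED_VARIABLES = ("OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT", "OPENAI_DEPLOYMENT_ID")


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    azure_openai_endpoint: str
    openai_deployment_id: str
    # Defaults mirror the table above.
    openai_api_version: str = "gpt-4o"
    batch_size: int = 1
    max_concurrent_ocr_requests: int = 5
    max_concurrent_pdf_conversion: int = 4


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping of environment variables, validating as we go."""
    missing = [name for name in REQUIRED_VARIABLES if not env.get(name)]
    if missing:
        raise ValueError("Missing required environment variables: " + ", ".join(missing))

    batch_size = int(env.get("BATCH_SIZE", "1"))
    if not 1 <= batch_size <= 10:
        raise ValueError("BATCH_SIZE must be between 1 and 10")

    return Settings(
        openai_api_key=env["OPENAI_API_KEY"],
        azure_openai_endpoint=env["AZURE_OPENAI_ENDPOINT"],
        openai_deployment_id=env["OPENAI_DEPLOYMENT_ID"],
        openai_api_version=env.get("OPENAI_API_VERSION", "gpt-4o"),
        batch_size=batch_size,
        max_concurrent_ocr_requests=int(env.get("MAX_CONCURRENT_OCR_REQUESTS", "5")),
        max_concurrent_pdf_conversion=int(env.get("MAX_CONCURRENT_PDF_CONVERSION", "4")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.
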
- High accuracy, slower: `BATCH_SIZE=1`
- Balanced: `BATCH_SIZE=5`, `MAX_CONCURRENT_OCR_REQUESTS=10`
- Maximum throughput: `BATCH_SIZE=10`, `MAX_CONCURRENT_OCR_REQUESTS=20` (watch rate limits!)

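
How the two knobs above interact can be sketched with `asyncio`: pages are grouped into batches of `BATCH_SIZE`, and a semaphore caps the number of in-flight calls at `MAX_CONCURRENT_OCR_REQUESTS`. This is an illustration of the general pattern, not the project's service code; `ocr_batch` is a hypothetical placeholder that fakes the model call.

```python
import asyncio
from typing import List, Sequence


def make_batches(pages: Sequence[int], batch_size: int) -> List[List[int]]:
    """Split page numbers into consecutive groups of at most batch_size."""
    return [list(pages[i:i + batch_size]) for i in range(0, len(pages), batch_size)]


async def ocr_batch(batch: List[int]) -> str:
    """Hypothetical placeholder for one OCR request covering several pages."""
    await asyncio.sleep(0)  # stands in for the network round trip
    return "\n".join(f"<page {page}>" for page in batch)


async def process_pages(pages: Sequence[int], batch_size: int = 1, max_concurrent: int = 5) -> str:
    """OCR all pages with bounded concurrency, keeping the original page order."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(batch: List[int]) -> str:
        async with semaphore:  # at most max_concurrent batches run at once
            return await ocr_batch(batch)

    # gather returns results in submission order, so pages stay in sequence.
    results = await asyncio.gather(*(guarded(b) for b in make_batches(pages, batch_size)))
    return "\n\n".join(results)


if __name__ == "__main__":
    print(asyncio.run(process_pages(range(1, 8), batch_size=3, max_concurrent=2)))
```

Larger batches mean fewer requests, while a higher concurrency limit means more of them run at the same time, which is why the maximum-throughput preset is the one most likely to hit rate limits.
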
World-class Python engineering with atomic modules and clean separation of concerns:
swift_ocr/
├── __init__.py          # Package init with version
├── __main__.py          # CLI entry point (python -m swift_ocr)
├── app.py               # FastAPI app factory
├── config/
│   ├── __init__.py
│   └── settings.py      # Pydantic Settings (type-safe config)
├── core/
│   ├── __init__.py
│   ├── exceptions.py    # Custom exception hierarchy
│   ├── logging.py       # Structured logging setup
│   └── retry.py         # Exponential backoff utilities
├── schemas/
│   ├── __init__.py
│   └── ocr.py           # Pydantic request/response models
├── services/
│   ├── __init__.py
│   ├── ocr.py           # OpenAI Vision OCR service
│   └── pdf.py           # PDF conversion service
└── api/
    ├── __init__.py
    ├── deps.py          # Dependency injection
    ├── exceptions.py    # FastAPI exception handlers
    ├── router.py        # Route aggregation
    └── routes/
        ├── __init__.py
        ├── health.py    # Health check endpoints
        └── ocr.py       # OCR endpoints

Key architectural decisions
| Pattern | Implementation | Benefit |
|---|---|---|
| Pydantic Settings | `config/settings.py` | Type-safe config with `.env` support and validation |
| Dependency Injection | `api/deps.py` | Testable, swappable services |
| Custom Exceptions | `core/exceptions.py` | Rich error context with proper HTTP status codes |
| Retry with Backoff | `core/retry.py` | Handles rate limits and transient failures |
| App Factory | `app.py` | Configurable app creation for testing |
| Typed Throughout | `py.typed` marker | Full mypy compatibility |

Troubleshooting tips
| Problem | Solution |
|---|---|
| "Missing required environment variables" | Check your .env file has all three required variables: OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT, OPENAI_DEPLOYMENT_ID |
| Rate limit errors (429) | Reduce MAX_CONCURRENT_OCR_REQUESTS or BATCH_SIZE. The retry logic will handle temporary limits automatically. |
| Timeout errors | Large PDFs take time. The system has exponential backoff built in, so give it a moment. |
| Garbled output | Make sure your PDF isn't password-protected or corrupted. Try opening it locally first. |
| Tables not formatting correctly | Some extremely complex tables may need BATCH_SIZE=1 for best accuracy. |
| "Failed to initialize OpenAI client" | Verify your Azure endpoint URL format: https://your-resource.openai.azure.com/ |

This project uses PyMuPDF for PDF processing. PyMuPDF is licensed under the GNU AGPL v3.0, so this project is distributed under the same license.
Want MIT instead? Fork this project and swap PyMuPDF for `pdf2image` + Poppler. The rest of the code is yours to use freely.
GNU AFFERO GENERAL PUBLIC LICENSE
Version 3, 19 November 2007
Copyright (C) 2024 Yiğit Konur
See LICENSE.md for the full license text.
Built with 🔥 because manually transcribing PDFs is a soul-crushing waste of time.