⚡ Swift OCR ⚡

Stop squinting at PDFs. Start extracting clean markdown.

The LLM-powered OCR engine that turns any PDF into beautifully formatted Markdown. It reads your documents like a human, handles messy layouts, and outputs text your AI can actually understand.

🧭 Quick Navigation

⚡ Get Started • ✨ Key Features • 🎮 Usage & Examples • 💰 Cost Breakdown • ⚙️ Configuration • 🏗️ Project Structure

Swift OCR is the document processor your AI assistant wishes it had. Stop feeding your LLM screenshots and praying it reads them correctly. This tool acts like a professional transcriber, reading every page of your PDF, intelligently handling tables, headers, and mixed layouts, then packaging everything into perfectly structured Markdown so your AI can actually work with it.

🧠 GPT-4 Vision: Human-level reading accuracy

⚡ Parallel Processing: Multi-page PDFs in seconds

📝 Clean Markdown: Tables, headers, and lists, all formatted

How it slaps:

You: curl -X POST "http://localhost:8000/ocr" -F "file=@messy_document.pdf"
Swift OCR: Converts pages → Sends to GPT-4 Vision → Formats as Markdown
You: Get perfectly structured text with tables, headers, and lists intact.
Result: Your AI finally understands that 50-page contract. ☕

📹 Demo

Demo video showcasing the conversion of NASA's Apollo 17 flight documents—complete with unorganized, horizontally and vertically oriented pages—into well-structured Markdown format without breaking a sweat.

💥 Why This Slaps Other Methods

Manually extracting text from PDFs is a vibe-killer. Swift OCR makes traditional OCR look ancient.

❌ The Old Way (Pain)	✅ The Swift OCR Way (Glory)
Run Tesseract. Get garbled text. Tables? What tables? Just random words now. Manually fix formatting for 2 hours. Feed broken context to your AI. Get a useless answer. Cry.	Upload PDF to Swift OCR. Get perfectly formatted Markdown. Tables intact. Headers preserved. Feed clean context to your AI. Get genius-level answers. Go grab a coffee. ☕

We're not just running basic OCR. We're using GPT-4 Vision to actually understand your documents—handling rotated pages, complex tables, mixed layouts, and even describing images for accessibility.

💰 Cost Breakdown: Stupidly Cheap

Our solution offers an optimal balance of affordability and accuracy that makes enterprise OCR solutions look like highway robbery.

Metric	Value
Avg tokens/page	~1,500 (including prompt)
GPT-4o input cost	$5 per million tokens
GPT-4o output cost	$15 per million tokens
Cost per 1,000 pages	~$15
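
The arithmetic behind the table can be sketched in a few lines of Python. The table only gives a combined ~1,500 tokens per page, so the even 750/750 split between input and output tokens below is an assumption made for illustration, and `cost_per_1000_pages` is a hypothetical helper, not part of the project.

```python
def cost_per_1000_pages(
    input_tokens_per_page: float,
    output_tokens_per_page: float,
    input_usd_per_million: float = 5.0,    # GPT-4o input price from the table
    output_usd_per_million: float = 15.0,  # GPT-4o output price from the table
) -> float:
    """Estimate the USD cost of processing 1,000 pages."""
    pages = 1000
    input_cost = pages * input_tokens_per_page / 1_000_000 * input_usd_per_million
    output_cost = pages * output_tokens_per_page / 1_000_000 * output_usd_per_million
    return input_cost + output_cost


# Assumption: the ~1,500 tokens/page are split evenly between input and output.
estimate = cost_per_1000_pages(input_tokens_per_page=750, output_tokens_per_page=750)
print(f"${estimate:.2f} per 1,000 pages")
```

Under that assumed split the estimate lands on the ~$15 figure; a page that is mostly prompt and image input would cost less, and a text-heavy page with long output would cost more.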

💡 Want It Even Cheaper?

Optimization	Cost per 1,000 pages
GPT-4o (default)	~$15
GPT-4o mini	~$8
Batch API	~$4

🆚 Market Comparison

Solution	Cost per 1,000 pages	Tables?	Markdown?
Swift OCR	$15	✅ Perfect	✅ Native
CloudConvert (PDFTron)	~$30	⚠️ Basic	❌ No
Adobe Acrobat API	~$50+	✅ Good	❌ No
Tesseract (free)	$0	❌ Broken	❌ No

Bottom line: about half the per-page cost of the cheapest paid alternative in the table above, with tables kept intact and native Markdown output. It's not just about being cheaper; it's about getting output you can actually use.

🚀 Get Started in 60 Seconds

Prerequisites

Python 3.8+
Azure OpenAI account (with GPT-4 Vision deployment)

Installation

# Clone the repo
git clone https://github.com/yigitkonur/swift-ocr-llm-powered-pdf-to-markdown.git
cd swift-ocr-llm-powered-pdf-to-markdown

# Create virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Configure Environment

Create a .env file in the root directory:

# Required
OPENAI_API_KEY=your_openai_api_key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
OPENAI_DEPLOYMENT_ID=your_gpt4_vision_deployment

# Optional (sensible defaults)
OPENAI_API_VERSION=gpt-4o
BATCH_SIZE=1                        # Images per OCR request (1-10)
MAX_CONCURRENT_OCR_REQUESTS=5       # Parallel OCR calls
MAX_CONCURRENT_PDF_CONVERSION=4     # Parallel page rendering

Run It

# Option 1: Classic uvicorn (backward compatible)
uvicorn main:app --reload

# Option 2: Using the new package
uvicorn swift_ocr.app:app --reload

# Option 3: As a Python module
python -m swift_ocr

# Option 4: With CLI arguments
python -m swift_ocr --host 0.0.0.0 --port 8080 --workers 4

🎉 API is now live at http://127.0.0.1:8000 (Option 4 serves on port 8080 instead)

✨ Pro tip: Check out the auto-generated docs at http://127.0.0.1:8000/docs

🎮 Usage: Fire and Forget

API Endpoint

POST /ocr

Accepts either a PDF file upload or a URL to a PDF (one or the other, not both). Returns formatted Markdown.

Examples

Upload a PDF file:

curl -X POST "http://127.0.0.1:8000/ocr" \
  -F "file=@/path/to/your/document.pdf"

Process a PDF from URL:

curl -X POST "http://127.0.0.1:8000/ocr" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf"}'
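
The URL variant of the request can also be sent from Python using only the standard library. This is a minimal sketch mirroring the curl call above; `build_url_ocr_request` and `ocr_from_url` are hypothetical helper names, and the 300-second timeout is an arbitrary assumption for large PDFs.

```python
import json
import urllib.request


def build_url_ocr_request(
    pdf_url: str, base_url: str = "http://127.0.0.1:8000"
) -> urllib.request.Request:
    """Build (but do not send) a POST /ocr request carrying a PDF URL as JSON."""
    body = json.dumps({"url": pdf_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/ocr",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ocr_from_url(pdf_url: str, timeout: float = 300.0) -> dict:
    """Send the request and return the decoded JSON response body."""
    request = build_url_ocr_request(pdf_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires a running server):
#   markdown = ocr_from_url("https://example.com/document.pdf")["text"]
```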

Response

{
  "text": "# Document Title\n\n## Section 1\n\nExtracted text with **formatting** preserved...\n\n| Column 1 | Column 2 |\n|----------|----------|\n| Data     | Data     |"
}

Response (v2.0+)

The new response includes additional metadata:

{
  "text": "# Document Title\n\n## Section 1\n\nExtracted text...",
  "status": "success",
  "pages_processed": 5,
  "processing_time_ms": 1234
}
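
A client that needs to work with both response shapes can treat the metadata fields as optional. The sketch below is one way to do that; `OcrResult` and `parse_ocr_response` are hypothetical names for illustration, not the project's own Pydantic models.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class OcrResult:
    text: str
    # The fields below only appear in the v2.0+ response shape.
    status: Optional[str] = None
    pages_processed: Optional[int] = None
    processing_time_ms: Optional[int] = None


def parse_ocr_response(raw: str) -> OcrResult:
    """Parse either response shape; missing metadata fields become None."""
    payload = json.loads(raw)
    return OcrResult(
        text=payload["text"],
        status=payload.get("status"),
        pages_processed=payload.get("pages_processed"),
        processing_time_ms=payload.get("processing_time_ms"),
    )
```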

Health Check

curl http://127.0.0.1:8000/health

{
  "status": "healthy",
  "version": "2.0.0",
  "timestamp": "2024-01-01T00:00:00Z",
  "openai_configured": true
}

Error Codes

Code	Meaning
`200`	Success—Markdown text returned
`400`	Bad request (no file/URL, or both provided)
`422`	Validation error
`429`	Rate limited—retry with backoff
`500`	Processing error
`504`	Timeout downloading PDF
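
On the client side, the "retry with backoff" advice for `429` can be sketched as below. This is an illustration under assumptions, not the server's own retry code: `call_with_backoff` is a hypothetical helper, `send` stands in for whatever function performs the HTTP call and returns `(status, body)`, and the delay parameters are arbitrary.

```python
import random
import time
from typing import Callable, Tuple

RETRYABLE_STATUS = {429}  # per the table, only rate limiting is marked as retryable


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-retryable status or attempts run out."""
    status, body = send()
    for attempt in range(max_attempts - 1):
        if status not in RETRYABLE_STATUS:
            break
        # Exponential backoff: 1s, 2s, 4s, ... capped at max_delay, plus jitter
        # so that many clients do not retry at the same instant.
        delay = min(max_delay, base_delay * 2 ** attempt)
        sleep(delay + random.uniform(0, delay / 2))
        status, body = send()
    return status, body
```

The `sleep` parameter is injectable so the logic can be exercised in tests without actually waiting.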

✨ Feature Breakdown: The Secret Sauce

Feature	What It Does	Why You Care
🧠 GPT-4 Vision `Human-level OCR`	Uses OpenAI's most capable vision model to read documents	Actually understands context, not just character shapes
⚡ Parallel Processing `Multiprocessing + async`	Converts PDF pages and calls OCR in parallel	50-page PDF in seconds, not minutes
📊 Table Preservation `Markdown tables`	Detects and formats tables as proper Markdown	Your data stays structured, not flattened to gibberish
🔄 Smart Batching `Configurable batch size`	Groups pages to optimize API calls vs accuracy	Balance speed and cost for your use case
🛡️ Retry with Backoff `Exponential backoff`	Automatically retries on rate limits and timeouts	Handles API hiccups without crashing
📄 Flexible Input `File upload or URL`	Accept PDFs directly or fetch from any URL	Works with your existing workflow
🖼️ Image Descriptions `Accessibility-friendly`	Describes non-text elements: `[Image: description]`	Context your AI can actually use

⚙️ Configuration

All settings are managed via environment variables. Tune these for your workload:

Variable	Default	Description
`OPENAI_API_KEY`	—	Your Azure OpenAI API key
`AZURE_OPENAI_ENDPOINT`	—	Your Azure OpenAI endpoint URL
`OPENAI_DEPLOYMENT_ID`	—	Your GPT-4 Vision deployment ID
`OPENAI_API_VERSION`	`gpt-4o`	API version
`BATCH_SIZE`	`1`	Pages per OCR request (1-10). Higher = faster but less accurate
`MAX_CONCURRENT_OCR_REQUESTS`	`5`	Parallel OCR calls. Increase for throughput
`MAX_CONCURRENT_PDF_CONVERSION`	`4`	Parallel page renders. Match your CPU cores
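
To make the table concrete, here is a standard-library sketch of reading and validating these variables. The project itself uses Pydantic Settings (see `config/settings.py` below), so this is a simplified stand-in rather than the actual implementation; the `Settings` dataclass and `load_settings` function are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED = ("OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT", "OPENAI_DEPLOYMENT_ID")


@dataclass(frozen=True)
class Settings:
    api_key: str
    endpoint: str
    deployment_id: str
    api_version: str = "gpt-4o"  # default as listed in the table
    batch_size: int = 1
    max_concurrent_ocr_requests: int = 5
    max_concurrent_pdf_conversion: int = 4


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping, applying the defaults from the table."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("Missing required environment variables: " + ", ".join(missing))
    batch_size = int(env.get("BATCH_SIZE", "1"))
    if not 1 <= batch_size <= 10:
        raise ValueError("BATCH_SIZE must be between 1 and 10")
    return Settings(
        api_key=env["OPENAI_API_KEY"],
        endpoint=env["AZURE_OPENAI_ENDPOINT"],
        deployment_id=env["OPENAI_DEPLOYMENT_ID"],
        api_version=env.get("OPENAI_API_VERSION", "gpt-4o"),
        batch_size=batch_size,
        max_concurrent_ocr_requests=int(env.get("MAX_CONCURRENT_OCR_REQUESTS", "5")),
        max_concurrent_pdf_conversion=int(env.get("MAX_CONCURRENT_PDF_CONVERSION", "4")),
    )
```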

Performance Tuning Tips

High accuracy, slower: BATCH_SIZE=1
Balanced: BATCH_SIZE=5, MAX_CONCURRENT_OCR_REQUESTS=10
Maximum throughput: BATCH_SIZE=10, MAX_CONCURRENT_OCR_REQUESTS=20 (watch rate limits!)
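
How the two knobs interact can be sketched with `asyncio`: `BATCH_SIZE` decides how many pages go into one OCR call, and `MAX_CONCURRENT_OCR_REQUESTS` caps how many of those calls are in flight at once. This is an assumption about the general technique, not a copy of the project's service code; `make_batches`, `ocr_all`, and the `ocr_batch` callback are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


def make_batches(pages: Sequence[bytes], batch_size: int) -> List[List[bytes]]:
    """Split rendered page images into consecutive groups of batch_size."""
    return [list(pages[i:i + batch_size]) for i in range(0, len(pages), batch_size)]


async def ocr_all(
    pages: Sequence[bytes],
    ocr_batch: Callable[[List[bytes]], Awaitable[str]],
    batch_size: int = 1,
    max_concurrent: int = 5,
) -> str:
    """Run ocr_batch over all batches with at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run(batch: List[bytes]) -> str:
        async with semaphore:  # blocks while max_concurrent calls are active
            return await ocr_batch(batch)

    # gather() returns results in submission order, so page order is preserved.
    results = await asyncio.gather(*(run(b) for b in make_batches(pages, batch_size)))
    return "\n\n".join(results)
```

With 50 pages, `batch_size=5` and `max_concurrent=10` mean ten requests cover the whole document in a single wave, which is why larger values raise throughput but also make rate limits easier to hit.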

🏗️ Project Structure

The package is split into small, single-purpose modules with a clean separation of concerns:

swift_ocr/
├── __init__.py              # Package init with version
├── __main__.py              # CLI entry point (python -m swift_ocr)
├── app.py                   # FastAPI app factory
├── config/
│   ├── __init__.py
│   └── settings.py          # Pydantic Settings (type-safe config)
├── core/
│   ├── __init__.py
│   ├── exceptions.py        # Custom exception hierarchy
│   ├── logging.py           # Structured logging setup
│   └── retry.py             # Exponential backoff utilities
├── schemas/
│   ├── __init__.py
│   └── ocr.py               # Pydantic request/response models
├── services/
│   ├── __init__.py
│   ├── ocr.py               # OpenAI Vision OCR service
│   └── pdf.py               # PDF conversion service
└── api/
    ├── __init__.py
    ├── deps.py              # Dependency injection
    ├── exceptions.py        # FastAPI exception handlers
    ├── router.py            # Route aggregation
    └── routes/
        ├── __init__.py
        ├── health.py        # Health check endpoints
        └── ocr.py           # OCR endpoints

Key architectural decisions

Pattern	Implementation	Benefit
Pydantic Settings	`config/settings.py`	Type-safe config with `.env` support and validation
Dependency Injection	`api/deps.py`	Testable, swappable services
Custom Exceptions	`core/exceptions.py`	Rich error context with proper HTTP status codes
Retry with Backoff	`core/retry.py`	Handles rate limits and transient failures
App Factory	`app.py`	Configurable app creation for testing
Typed Throughout	`py.typed` marker	Full mypy compatibility
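
As an illustration of the "Custom Exceptions" row, an exception hierarchy that carries its own HTTP status code might look like the sketch below. The class names and the `to_response` method are invented for this example and are not taken from `core/exceptions.py`; the status codes follow the Error Codes table.

```python
from typing import Any, Dict, Tuple


class SwiftOcrError(Exception):
    """Hypothetical base error; subclasses override status_code."""

    status_code = 500

    def __init__(self, message: str, **context: Any) -> None:
        super().__init__(message)
        self.message = message
        self.context = context  # extra details, e.g. page number or URL

    def to_response(self) -> Tuple[int, Dict[str, Any]]:
        """Return the HTTP status and a JSON-serializable error body."""
        body = {"error": type(self).__name__, "detail": self.message, "context": self.context}
        return self.status_code, body


class InvalidRequestError(SwiftOcrError):
    status_code = 400  # no file/URL, or both provided


class RateLimitedError(SwiftOcrError):
    status_code = 429


class PdfDownloadTimeoutError(SwiftOcrError):
    status_code = 504
```

A single handler that catches the base class can then turn any of these into the right response without a chain of `isinstance` checks.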

🔥 Common Issues & Quick Fixes

Problem	Solution
"Missing required environment variables"	Check your `.env` file has all three required variables: `OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`, `OPENAI_DEPLOYMENT_ID`
Rate limit errors (429)	Reduce `MAX_CONCURRENT_OCR_REQUESTS` or `BATCH_SIZE`. The retry logic will handle temporary limits automatically.
Timeout errors	Large PDFs take time. The system has exponential backoff built in—give it a moment.
Garbled output	Make sure your PDF isn't password-protected or corrupted. Try opening it locally first.
Tables not formatting correctly	Some extremely complex tables may need `BATCH_SIZE=1` for best accuracy.
"Failed to initialize OpenAI client"	Verify your Azure endpoint URL format: `https://your-resource.openai.azure.com/`

📜 License

This project uses PyMuPDF for PDF processing. PyMuPDF is licensed under the GNU AGPL v3.0, so this project is released under the same license.

Want MIT instead? Fork this project and swap PyMuPDF for pdf2image + Poppler. The rest of the code is yours to use freely.

GNU AFFERO GENERAL PUBLIC LICENSE
Version 3, 19 November 2007

Copyright (C) 2024 Yiğit Konur

See LICENSE.md for the full license text.

Built with 🔥 because manually transcribing PDFs is a soul-crushing waste of time.

Report Bug • Request Feature
