High-accuracy PDF-to-Markdown OCR API using LLMs with vision capabilities. Features parallel processing, batching, and auto-retry logic for scalable extraction.

yigitkonur/llm-based-ocr


⚡ Swift OCR ⚡

Stop squinting at PDFs. Start extracting clean markdown.

The LLM-powered OCR engine that turns any PDF into beautifully formatted Markdown. It reads your documents like a human, handles messy layouts, and outputs text your AI can actually understand.



Swift OCR is the document processor your AI assistant wishes it had. Stop feeding your LLM screenshots and praying it reads them correctly. This tool acts like a professional transcriber, reading every page of your PDF, intelligently handling tables, headers, and mixed layouts, then packaging everything into perfectly structured Markdown so your AI can actually work with it.

  • 🧠 GPT-4 Vision: Human-level reading accuracy
  • ⚡ Parallel Processing: Multi-page PDFs in seconds
  • 📝 Clean Markdown: Tables, headers, lists, all formatted

How it slaps:

  • You: curl -X POST "http://localhost:8000/ocr" -F "file=@messy_document.pdf"
  • Swift OCR: Converts pages → Sends to GPT-4 Vision → Formats as Markdown
  • You: Get perfectly structured text with tables, headers, and lists intact.
  • Result: Your AI finally understands that 50-page contract. ☕

📹 Demo

Demo video showcasing the conversion of NASA's Apollo 17 flight documents (complete with unorganized, horizontally and vertically oriented pages) into well-structured Markdown format without breaking a sweat.


💥 Why This Slaps Other Methods

Manually extracting text from PDFs is a vibe-killer. Swift OCR makes traditional OCR look ancient.

โŒ The Old Way (Pain) โœ… The Swift OCR Way (Glory)
  1. Run Tesseract. Get garbled text.
  2. Tables? What tables? Just random words now.
  3. Manually fix formatting for 2 hours.
  4. Feed broken context to your AI.
  5. Get a useless answer. Cry.
  1. Upload PDF to Swift OCR.
  2. Get perfectly formatted Markdown.
  3. Tables intact. Headers preserved.
  4. Feed clean context to your AI.
  5. Get genius-level answers. Go grab a coffee. โ˜•

We're not just running basic OCR. We're using GPT-4 Vision to actually understand your documents: handling rotated pages, complex tables, mixed layouts, and even describing images for accessibility.


💰 Cost Breakdown: Stupidly Cheap

Our solution offers an optimal balance of affordability and accuracy that makes enterprise OCR solutions look like highway robbery.

| Metric | Value |
|--------|-------|
| Avg tokens/page | ~1,500 (including prompt) |
| GPT-4o input cost | $5 per million tokens |
| GPT-4o output cost | $15 per million tokens |
| Cost per 1,000 pages | ~$15 |
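
The ~$15 figure can be sanity-checked from the prices in the table. The sketch below assumes an even split of the ~1,500 tokens per page between input and output; that split is not stated in the table, and `cost_per_1000_pages` is a hypothetical helper, so treat the result as a rough estimate.

```python
def cost_per_1000_pages(input_tokens_per_page, output_tokens_per_page,
                        input_usd_per_million=5.0, output_usd_per_million=15.0):
    """Estimate the USD cost of 1,000 pages from per-page token counts."""
    pages = 1000
    input_cost = pages * input_tokens_per_page / 1_000_000 * input_usd_per_million
    output_cost = pages * output_tokens_per_page / 1_000_000 * output_usd_per_million
    return input_cost + output_cost


# Assumed even split of ~1,500 tokens/page (hypothetical; real documents vary).
estimate = cost_per_1000_pages(750, 750)

# Bounds: every token billed at the input price vs. at the output price.
lower = cost_per_1000_pages(1500, 0)
upper = cost_per_1000_pages(0, 1500)

print(f"estimate ${estimate:.2f}, range ${lower:.2f} to ${upper:.2f}")
```

Text-heavy pages produce more output tokens and land nearer the upper bound; sparse pages land nearer the lower one.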

💡 Want It Even Cheaper?

| Optimization | Cost per 1,000 pages |
|--------------|----------------------|
| GPT-4o (default) | ~$15 |
| GPT-4o mini | ~$8 |
| Batch API | ~$4 |

🆚 Market Comparison

| Solution | Cost per 1,000 pages | Tables? | Markdown? |
|----------|----------------------|---------|-----------|
| Swift OCR | $15 | ✅ Perfect | ✅ Native |
| CloudConvert (PDFTron) | ~$30 | ⚠️ Basic | ❌ No |
| Adobe Acrobat API | ~$50+ | ✅ Good | ❌ No |
| Tesseract (free) | $0 | ❌ Broken | ❌ No |

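Bottom line: at ~$15 per 1,000 pages, Swift OCR costs about half as much as CloudConvert and less than a third as much as the Adobe Acrobat API in the comparison above, and it is the only option listed with native Markdown output. It's not just about being cheaper; it's about getting output you can actually use.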


🚀 Get Started in 60 Seconds

Prerequisites

  • Python 3.8+
  • Azure OpenAI account (with GPT-4 Vision deployment)

Installation

# Clone the repo
git clone https://github.com/yigitkonur/swift-ocr-llm-powered-pdf-to-markdown.git
cd swift-ocr-llm-powered-pdf-to-markdown

# Create virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Configure Environment

Create a .env file in the root directory:

# Required
OPENAI_API_KEY=your_openai_api_key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
OPENAI_DEPLOYMENT_ID=your_gpt4_vision_deployment

# Optional (sensible defaults)
OPENAI_API_VERSION=gpt-4o
BATCH_SIZE=1                        # Images per OCR request (1-10)
MAX_CONCURRENT_OCR_REQUESTS=5       # Parallel OCR calls
MAX_CONCURRENT_PDF_CONVERSION=4     # Parallel page rendering

Run It

# Option 1: Classic uvicorn (backward compatible)
uvicorn main:app --reload

# Option 2: Using the new package
uvicorn swift_ocr.app:app --reload

# Option 3: As a Python module
python -m swift_ocr

# Option 4: With CLI arguments
python -m swift_ocr --host 0.0.0.0 --port 8080 --workers 4

🎉 API is now live at http://127.0.0.1:8000

✨ Pro tip: Check out the auto-generated docs at http://127.0.0.1:8000/docs


🎮 Usage: Fire and Forget

API Endpoint

POST /ocr

Accepts either a PDF file upload or a URL pointing to a PDF (not both). Returns the extracted content as formatted Markdown.

Examples

Upload a PDF file:

curl -X POST "http://127.0.0.1:8000/ocr" \
  -F "file=@/path/to/your/document.pdf"

Process a PDF from URL:

curl -X POST "http://127.0.0.1:8000/ocr" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf"}'
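
The URL variant can also be called from Python. This is a minimal standard-library sketch, not code from the project: `build_url_ocr_request` and `ocr_from_url` are hypothetical helper names, the base URL is the default local address shown earlier, and the 300-second timeout is an arbitrary assumption.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # default local address from the "Run It" section


def build_url_ocr_request(pdf_url, base_url=BASE_URL):
    """Build a POST /ocr request asking the server to fetch a PDF by URL."""
    payload = json.dumps({"url": pdf_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/ocr",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ocr_from_url(pdf_url, base_url=BASE_URL, timeout=300):
    """Send the request and return the Markdown from the "text" field."""
    request = build_url_ocr_request(pdf_url, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["text"]


# Usage (needs a running server, so it is left commented out):
# markdown = ocr_from_url("https://example.com/document.pdf")
```

Building the request separately from sending it keeps the payload easy to inspect without a live server.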

Response

{
  "text": "# Document Title\n\n## Section 1\n\nExtracted text with **formatting** preserved...\n\n| Column 1 | Column 2 |\n|----------|----------|\n| Data     | Data     |"
}

Response (v2.0+)

The new response includes additional metadata:

{
  "text": "# Document Title\n\n## Section 1\n\nExtracted text...",
  "status": "success",
  "pages_processed": 5,
  "processing_time_ms": 1234
}
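
Since the v2.0+ response adds metadata on top of `text`, a client can treat the extra fields as optional and accept both shapes. A minimal sketch; `OcrResult` and `parse_ocr_response` are hypothetical names, not part of the project.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class OcrResult:
    text: str
    # Metadata fields exist only in the v2.0+ response, so they default to None.
    status: Optional[str] = None
    pages_processed: Optional[int] = None
    processing_time_ms: Optional[int] = None


def parse_ocr_response(body: str) -> OcrResult:
    """Parse either response shape; only "text" is required."""
    data = json.loads(body)
    return OcrResult(
        text=data["text"],
        status=data.get("status"),
        pages_processed=data.get("pages_processed"),
        processing_time_ms=data.get("processing_time_ms"),
    )


v1 = parse_ocr_response('{"text": "# Document Title"}')
v2 = parse_ocr_response(
    '{"text": "# Document Title", "status": "success", '
    '"pages_processed": 5, "processing_time_ms": 1234}'
)
print(v1.pages_processed, v2.pages_processed)
```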

Health Check

curl http://127.0.0.1:8000/health
{
  "status": "healthy",
  "version": "2.0.0",
  "timestamp": "2024-01-01T00:00:00Z",
  "openai_configured": true
}

Error Codes

| Code | Meaning |
|------|---------|
| 200 | Success: Markdown text returned |
| 400 | Bad request (no file/URL, or both provided) |
| 422 | Validation error |
| 429 | Rate limited; retry with backoff |
| 500 | Processing error |
| 504 | Timeout downloading PDF |
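
On the client side, the 429 row suggests a retry loop with exponential backoff. The sketch below is one way to do that under stated assumptions: `call_with_backoff` is a hypothetical helper, `send` is any zero-argument callable returning `(status_code, body)`, and only 429 is retried because it is the only code the table marks as retryable.

```python
import time

RETRYABLE = {429}  # the table only marks 429 as "retry with backoff"


def call_with_backoff(send, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or retries run out."""
    for attempt in range(max_retries + 1):
        status, body = send()
        if status not in RETRYABLE or attempt == max_retries:
            return status, body
        # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
        sleep(base_delay * 2 ** attempt)


# Demo with a fake sender: two rate-limit responses, then success.
responses = iter([(429, ""), (429, ""), (200, "# Document Title")])
delays = []
result = call_with_backoff(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the demo instant; in real use the default `time.sleep` applies.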

✨ Feature Breakdown: The Secret Sauce

| Feature | What It Does | Why You Care |
|---------|--------------|--------------|
| 🧠 GPT-4 Vision (Human-level OCR) | Uses OpenAI's most capable vision model to read documents | Actually understands context, not just character shapes |
| ⚡ Parallel Processing (Multiprocessing + async) | Converts PDF pages and calls OCR in parallel | 50-page PDF in seconds, not minutes |
| 📊 Table Preservation (Markdown tables) | Detects and formats tables as proper Markdown | Your data stays structured, not flattened to gibberish |
| 🔄 Smart Batching (Configurable batch size) | Groups pages to optimize API calls vs accuracy | Balance speed and cost for your use case |
| 🛡️ Retry with Backoff (Exponential backoff) | Automatically retries on rate limits and timeouts | Handles API hiccups without crashing |
| 📄 Flexible Input (File upload or URL) | Accept PDFs directly or fetch from any URL | Works with your existing workflow |
| 🖼️ Image Descriptions (Accessibility-friendly) | Describes non-text elements: [Image: description] | Context your AI can actually use |
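
Batching and parallel OCR calls can be pictured as grouping pages into chunks and running the chunks under a concurrency cap. The sketch below illustrates that idea with `asyncio`; it is not the project's implementation, and `chunk`, `ocr_pages`, and `fake_ocr` are hypothetical names, with `fake_ocr` standing in for the real vision-model call.

```python
import asyncio


def chunk(items, size):
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


async def ocr_pages(pages, ocr_batch, batch_size=1, max_concurrent=5):
    """Run ocr_batch over page batches with bounded concurrency, keeping page order."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run(batch):
        async with semaphore:  # at most max_concurrent batches in flight
            return await ocr_batch(batch)

    # gather() returns results in submission order, so pages stay in sequence.
    results = await asyncio.gather(*(run(b) for b in chunk(pages, batch_size)))
    return "\n\n".join(results)


async def fake_ocr(batch):
    """Stand-in for the vision-model call (hypothetical)."""
    await asyncio.sleep(0)
    return "\n\n".join(f"page {p}" for p in batch)


markdown = asyncio.run(
    ocr_pages(list(range(1, 8)), fake_ocr, batch_size=3, max_concurrent=2)
)
print(markdown)
```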

⚙️ Configuration

All settings are managed via environment variables. Tune these for your workload:

| Variable | Default | Description |
|----------|---------|-------------|
| OPENAI_API_KEY | (required) | Your Azure OpenAI API key |
| AZURE_OPENAI_ENDPOINT | (required) | Your Azure OpenAI endpoint URL |
| OPENAI_DEPLOYMENT_ID | (required) | Your GPT-4 Vision deployment ID |
| OPENAI_API_VERSION | gpt-4o | API version |
| BATCH_SIZE | 1 | Pages per OCR request (1-10). Higher = faster but less accurate |
| MAX_CONCURRENT_OCR_REQUESTS | 5 | Parallel OCR calls. Increase for throughput |
| MAX_CONCURRENT_PDF_CONVERSION | 4 | Parallel page renders. Match your CPU cores |
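
To make the required/optional split concrete, here is a standard-library sketch of loading and validating these variables. The project itself lists Pydantic Settings for this job, so the code below is only an illustration: `Settings` and `load_settings` are hypothetical names, `OPENAI_API_VERSION` is left out for brevity, and the 1-10 range check follows the table's description of `BATCH_SIZE`.

```python
import os
from dataclasses import dataclass

REQUIRED = ("OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT", "OPENAI_DEPLOYMENT_ID")


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    azure_openai_endpoint: str
    openai_deployment_id: str
    batch_size: int = 1
    max_concurrent_ocr_requests: int = 5
    max_concurrent_pdf_conversion: int = 4


def load_settings(env=None):
    """Read settings from a mapping (defaults to os.environ) and validate them."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("Missing required environment variables: " + ", ".join(missing))
    batch_size = int(env.get("BATCH_SIZE", 1))
    if not 1 <= batch_size <= 10:
        raise ValueError("BATCH_SIZE must be between 1 and 10")
    return Settings(
        openai_api_key=env["OPENAI_API_KEY"],
        azure_openai_endpoint=env["AZURE_OPENAI_ENDPOINT"],
        openai_deployment_id=env["OPENAI_DEPLOYMENT_ID"],
        batch_size=batch_size,
        max_concurrent_ocr_requests=int(env.get("MAX_CONCURRENT_OCR_REQUESTS", 5)),
        max_concurrent_pdf_conversion=int(env.get("MAX_CONCURRENT_PDF_CONVERSION", 4)),
    )


# Demo with an explicit mapping instead of the real environment.
demo = load_settings({
    "OPENAI_API_KEY": "your_openai_api_key",
    "AZURE_OPENAI_ENDPOINT": "https://your-resource.openai.azure.com/",
    "OPENAI_DEPLOYMENT_ID": "your_gpt4_vision_deployment",
    "BATCH_SIZE": "5",
})
print(demo.batch_size, demo.max_concurrent_ocr_requests)
```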

Performance Tuning Tips

  • High accuracy, slower: BATCH_SIZE=1
  • Balanced: BATCH_SIZE=5, MAX_CONCURRENT_OCR_REQUESTS=10
  • Maximum throughput: BATCH_SIZE=10, MAX_CONCURRENT_OCR_REQUESTS=20 (watch rate limits!)
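
To see what these presets mean for a concrete job, the arithmetic below counts OCR requests and "waves" of parallel requests for a 50-page PDF. It is a simplified model (it ignores rate limits and varying page latency), `plan` is a hypothetical helper, and the high-accuracy preset is paired with the default `MAX_CONCURRENT_OCR_REQUESTS=5` as an assumption.

```python
import math


def plan(pages, batch_size, max_concurrent):
    """Return (number of OCR requests, number of sequential waves)."""
    requests = math.ceil(pages / batch_size)
    waves = math.ceil(requests / max_concurrent)
    return requests, waves


presets = [
    ("high accuracy", 1, 5),      # concurrency 5 is the documented default
    ("balanced", 5, 10),
    ("max throughput", 10, 20),
]

for label, batch_size, concurrency in presets:
    requests, waves = plan(50, batch_size, concurrency)
    print(f"{label}: {requests} requests in {waves} wave(s)")
```

Under this model, the balanced and maximum-throughput presets both finish a 50-page PDF in a single wave, so the larger batch size mainly saves API calls rather than wall-clock time.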

๐Ÿ—๏ธ Project Structure

World-class Python engineering with atomic modules and clean separation of concerns:

swift_ocr/
├── __init__.py              # Package init with version
├── __main__.py              # CLI entry point (python -m swift_ocr)
├── app.py                   # FastAPI app factory
├── config/
│   ├── __init__.py
│   └── settings.py          # Pydantic Settings (type-safe config)
├── core/
│   ├── __init__.py
│   ├── exceptions.py        # Custom exception hierarchy
│   ├── logging.py           # Structured logging setup
│   └── retry.py             # Exponential backoff utilities
├── schemas/
│   ├── __init__.py
│   └── ocr.py               # Pydantic request/response models
├── services/
│   ├── __init__.py
│   ├── ocr.py               # OpenAI Vision OCR service
│   └── pdf.py               # PDF conversion service
└── api/
    ├── __init__.py
    ├── deps.py              # Dependency injection
    ├── exceptions.py        # FastAPI exception handlers
    ├── router.py            # Route aggregation
    └── routes/
        ├── __init__.py
        ├── health.py        # Health check endpoints
        └── ocr.py           # OCR endpoints

Key architectural decisions

| Pattern | Implementation | Benefit |
|---------|----------------|---------|
| Pydantic Settings | config/settings.py | Type-safe config with .env support and validation |
| Dependency Injection | api/deps.py | Testable, swappable services |
| Custom Exceptions | core/exceptions.py | Rich error context with proper HTTP status codes |
| Retry with Backoff | core/retry.py | Handles rate limits and transient failures |
| App Factory | app.py | Configurable app creation for testing |
| Typed Throughout | py.typed marker | Full mypy compatibility |
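
As an illustration of the "Custom Exceptions" row, an exception hierarchy can carry its own HTTP status code and extra context. The class names below are hypothetical and are not taken from `core/exceptions.py`; the status codes are the ones listed under Error Codes.

```python
class SwiftOcrError(Exception):
    """Base error; subclasses override status_code (all names here are hypothetical)."""

    status_code = 500

    def __init__(self, message, **context):
        super().__init__(message)
        self.message = message
        self.context = context  # extra details to include in the error response

    def to_response(self):
        """Return (HTTP status, JSON-serializable body) for an exception handler."""
        return self.status_code, {"detail": self.message, **self.context}


class InvalidRequestError(SwiftOcrError):
    status_code = 400


class RateLimitedError(SwiftOcrError):
    status_code = 429


class PdfDownloadTimeoutError(SwiftOcrError):
    status_code = 504


try:
    raise RateLimitedError("Rate limited by upstream API", retry_after_seconds=2)
except SwiftOcrError as error:
    status, body = error.to_response()
    print(status, body)
```

Catching the base class in one handler is what lets each route raise a specific error without repeating status-code logic.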

🔥 Common Issues & Quick Fixes

| Problem | Solution |
|---------|----------|
| "Missing required environment variables" | Check your .env file has all three required variables: OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT, OPENAI_DEPLOYMENT_ID |
| Rate limit errors (429) | Reduce MAX_CONCURRENT_OCR_REQUESTS or BATCH_SIZE. The retry logic will handle temporary limits automatically. |
| Timeout errors | Large PDFs take time. The system has exponential backoff built in, so give it a moment. |
| Garbled output | Make sure your PDF isn't password-protected or corrupted. Try opening it locally first. |
| Tables not formatting correctly | Some extremely complex tables may need BATCH_SIZE=1 for best accuracy. |
| "Failed to initialize OpenAI client" | Verify your Azure endpoint URL format: https://your-resource.openai.azure.com/ |

📜 License

This project uses PyMuPDF for PDF processing. PyMuPDF is licensed under the GNU AGPL v3.0, so this project is released under the same license.

Want MIT instead? Fork this project and swap PyMuPDF for pdf2image + Poppler. The rest of the code is yours to use freely.

GNU AFFERO GENERAL PUBLIC LICENSE
Version 3, 19 November 2007

Copyright (C) 2024 Yiğit Konur

See LICENSE.md for the full license text.


Built with 🔥 because manually transcribing PDFs is a soul-crushing waste of time.

Report Bug • Request Feature
