🇰🇪 Kenya Government Document Intelligence System

A modern semantic search application for Kenya government documents, powered by AI embeddings and vector search technology.

✨ Features

🔍 Semantic Search - AI-powered search across 1,374 document chunks from 50 Kenya government documents
📄 Document Viewing - Built-in PDF viewer with text extraction
⬇️ PDF Downloads - Direct download of source government documents
📊 Document Dashboard - Browse and explore the document corpus
🎯 Smart Suggestions - Example queries tailored to Kenya government content
⚡ Real-time Search - Fast vector similarity search with relevance scoring

🗄️ Document Corpus

The system includes pre-processed documents covering:

Kenya Mining Handbook and regulations
Tax procedures and investment incentives
Business registration and licensing
Mining development strategies
Environmental requirements
And 45+ other government documents

Total: 1,374 chunks from 50 documents, indexed with 384-dimensional embeddings

🚀 Technology Stack

Core Framework

⚡ Next.js 15 - React framework with App Router
📘 TypeScript 5 - Type-safe development
🎨 Tailwind CSS 4 - Modern UI styling

Search & AI

🔮 Vector Embeddings - all-MiniLM-L6-v2 model (384 dimensions)
🔍 Cosine Similarity - Semantic search ranking
📚 Pre-computed Corpus - Document embeddings are generated offline, so only the query is embedded at search time

UI Components

🧩 shadcn/ui - Accessible components built on Radix UI
🎯 Lucide React - Beautiful icons
🌈 Framer Motion - Smooth animations

Database & Backend

🗄️ Prisma - Type-safe ORM with SQLite
📁 File System - Direct PDF serving
🔄 Socket.IO - Real-time updates

📁 Project Structure

src/
├── app/
│   ├── api/
│   │   ├── search/              # Vector search endpoint
│   │   ├── documents/           # Document management
│   │   │   ├── list/           # Browse documents
│   │   │   ├── view/           # View document text
│   │   │   └── download/       # Download PDFs
│   │   └── corpus/             # Corpus statistics
│   ├── page.tsx                # Main search interface
│   └── layout.tsx              # App layout
├── components/
│   └── ui/                     # shadcn/ui components
├── lib/
│   ├── embeddings.ts           # Vector search service
│   ├── chunking.ts             # Document chunking
│   ├── db.ts                   # Database client
│   └── utils.ts                # Utilities
├── data/
│   └── kenya_gov_corpus.json   # Pre-computed embeddings
└── public/
    └── pdfs/                    # Government PDF files

🚀 Quick Start

Prerequisites

Node.js 18+
npm or yarn

Installation

# Install dependencies
npm install

# Setup database
npm run db:push

# Start development server
npm run dev

Open http://localhost:3000 to see the application.

Available Scripts

npm run dev          # Start development server
npm run build        # Build for production
npm start            # Start production server
npm run db:push      # Push Prisma schema to database
npm run db:generate  # Generate Prisma client

🔍 How It Works

1. Document Processing

Documents are processed offline using Python:

PDFs extracted with Chandra OCR (Qwen2.5-VL-7B-Instruct model)
Text chunked into 1000-character segments with 200-character overlap
Embeddings generated using sentence-transformers (all-MiniLM-L6-v2)
Stored in kenya_gov_corpus.json with pre-computed vectors
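
The chunking step above can be pictured as a sliding window over the extracted text. The following is a minimal TypeScript sketch of that idea, not the project's offline Python pipeline; the function name `chunkText` and its parameter names are assumptions, with defaults taken from the 1000/200 figures listed above.

```typescript
// Hypothetical helper: split text into fixed-size character windows that
// overlap, so text cut at one chunk boundary reappears in the next chunk.
function chunkText(text: string, chunkSize = 1000, chunkOverlap = 200): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError("require 0 <= chunkOverlap < chunkSize")
  }
  const stride = chunkSize - chunkOverlap // 800 characters with the defaults
  const chunks: string[] = []
  for (let start = 0; start < text.length; start += stride) {
    chunks.push(text.slice(start, start + chunkSize))
    // Stop once a window has reached the end of the text.
    if (start + chunkSize >= text.length) break
  }
  return chunks
}

// A 2,500-character input yields windows starting at 0, 800 and 1600.
const sampleText = "a".repeat(2500)
const pieces = chunkText(sampleText)
console.log(pieces.map((p) => p.length)) // [ 1000, 1000, 900 ]
```

A real pipeline may prefer to break on sentence or paragraph boundaries rather than raw character offsets; this sketch only shows the size and overlap arithmetic.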

2. Search Flow

User enters a search query
Query embedding generated using all-MiniLM-L6-v2 model via @xenova/transformers
Cosine similarity calculated against all 1,374 document chunks
Results sorted by relevance score (typically 45-80% for relevant results)
Top 5 results displayed with document context
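
The scoring and ranking steps above might look roughly like the following. This is a sketch under assumptions: the `Chunk` shape, the `rankChunks` name, and the toy 3-dimensional vectors and file names are illustrative and not taken from the project's `embeddings.ts`, and the query embedding is assumed to have been produced already.

```typescript
// Assumed shape of one corpus entry; the real JSON layout may differ.
interface Chunk {
  document: string
  text: string
  embedding: number[]
}

interface RankedChunk {
  chunk: Chunk
  score: number
}

// Cosine similarity between two equal-length vectors (0 if either is all zeros).
function cosine(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0
    const y = b[i] ?? 0
    dot += x * y
    normA += x * x
    normB += y * y
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB)
  return denom === 0 ? 0 : dot / denom
}

// Score every chunk against the query vector and keep the k best.
function rankChunks(query: number[], corpus: Chunk[], k = 5): RankedChunk[] {
  return corpus
    .map((chunk) => ({ chunk, score: cosine(query, chunk.embedding) }))
    .sort((p, q) => q.score - p.score)
    .slice(0, k)
}

// Toy data standing in for 384-dimensional embeddings.
const toyCorpus: Chunk[] = [
  { document: "mining.pdf", text: "Mining licence fees", embedding: [1, 0, 0] },
  { document: "tax.pdf", text: "Tax incentives", embedding: [0, 1, 0] },
  { document: "env.pdf", text: "Environmental permits", embedding: [0.6, 0.8, 0] },
]

for (const hit of rankChunks([1, 0, 0], toyCorpus, 2)) {
  console.log(hit.chunk.document, `${Math.round(hit.score * 100)}%`)
}
```

Scoring all 1,374 chunks on every query is a linear scan; at this corpus size that is usually cheap enough that no approximate index is needed.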

3. Vector Search

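// Cosine similarity for semantic matching
const cosineSimilarity = (vec1: number[], vec2: number[]): number => {
  if (vec1.length !== vec2.length) throw new Error('Vector dimensions must match')
  const dotProduct = vec1.reduce((sum, val, i) => sum + val * vec2[i], 0)
  const norm1 = Math.sqrt(vec1.reduce((sum, val) => sum + val * val, 0))
  const norm2 = Math.sqrt(vec2.reduce((sum, val) => sum + val * val, 0))
  // Guard against division by zero when either vector is all zeros
  return norm1 === 0 || norm2 === 0 ? 0 : dotProduct / (norm1 * norm2)
}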
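
One common variation, shown here only as a sketch: if every vector is scaled to unit length ahead of time, the cosine similarity reduces to a plain dot product, which avoids recomputing norms on each query. Whether the stored vectors in this project are already normalized is not stated, so that is an assumption, and the helper names `normalize` and `dot` are hypothetical.

```typescript
// Scale a vector to unit length; an all-zero vector is returned unchanged.
function normalize(vec: number[]): number[] {
  const norm = Math.sqrt(vec.reduce((sum, val) => sum + val * val, 0))
  return norm === 0 ? vec.slice() : vec.map((val) => val / norm)
}

// For unit-length inputs the dot product equals the cosine similarity.
function dot(unitA: number[], unitB: number[]): number {
  return unitA.reduce((sum, val, i) => sum + val * (unitB[i] ?? 0), 0)
}

const storedVec = normalize([3, 4]) // approximately [0.6, 0.8]
const queryVec = normalize([4, 3]) // approximately [0.8, 0.6]
console.log(dot(storedVec, queryVec).toFixed(2)) // 0.96
```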

📊 Data Statistics

{
  "totalDocuments": 50,
  "totalChunks": 1374,
  "embeddingDimension": 384,
  "embeddingModel": "all-MiniLM-L6-v2",
  "chunkSize": 1000,
  "chunkOverlap": 200
}
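
A few derived figures follow directly from these numbers. The sketch below types the statistics object and works them out; the `CorpusStats` interface name is an assumption, and the memory estimate assumes vectors held as 32-bit floats, which is not something the source specifies (the JSON file on disk will be larger, since numbers are stored as text).

```typescript
// Assumed type for the statistics object shown above.
interface CorpusStats {
  totalDocuments: number
  totalChunks: number
  embeddingDimension: number
  embeddingModel: string
  chunkSize: number
  chunkOverlap: number
}

const stats: CorpusStats = {
  totalDocuments: 50,
  totalChunks: 1374,
  embeddingDimension: 384,
  embeddingModel: "all-MiniLM-L6-v2",
  chunkSize: 1000,
  chunkOverlap: 200,
}

// Each chunk after the first starts this many characters further on.
const stride = stats.chunkSize - stats.chunkOverlap // 800

// Total numbers stored across all embedding vectors.
const vectorValues = stats.totalChunks * stats.embeddingDimension // 527616

// In-memory size if held as Float32 (4 bytes each): about 2.1 MB.
const float32Bytes = vectorValues * 4 // 2110464

const avgChunksPerDoc = stats.totalChunks / stats.totalDocuments // 27.48

console.log({ stride, vectorValues, float32Bytes, avgChunksPerDoc })
```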

🔧 Roadmap

Current Limitations

Limited to pre-indexed documents only
First search query takes longer while model loads (~5-10 seconds)

Planned Improvements

Add document upload and processing pipeline
Implement relevance score thresholds for better filtering
Add multi-language support
Enhance document metadata extraction
Add export functionality for search results
Implement caching for faster subsequent searches
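
As one possible shape for two of these items (score thresholds and caching), here is a hedged sketch. Nothing below exists in the project: `filterByScore`, `withCache`, the `ScoredResult` type, and the stand-in `fakeSearch` are all hypothetical, and the 0.45 default simply mirrors the lower end of the typical relevance range quoted earlier rather than a tuned value.

```typescript
interface ScoredResult {
  text: string
  score: number // cosine similarity in [-1, 1]
}

// Drop results below a minimum score (the default is an assumption).
function filterByScore(results: ScoredResult[], minScore = 0.45): ScoredResult[] {
  return results.filter((r) => r.score >= minScore)
}

// Wrap a search function so repeated queries are answered from memory.
function withCache(
  search: (query: string) => ScoredResult[]
): (query: string) => ScoredResult[] {
  const cache = new Map<string, ScoredResult[]>()
  return (query) => {
    // Normalize the key so trivial variations share one cache entry.
    const key = query.trim().toLowerCase()
    const cached = cache.get(key)
    if (cached !== undefined) return cached
    const fresh = search(query)
    cache.set(key, fresh)
    return fresh
  }
}

// Stand-in search that counts how often it is actually invoked.
let searchCalls = 0
const fakeSearch = (query: string): ScoredResult[] => {
  searchCalls++
  return [
    { text: `match for ${query}`, score: 0.72 },
    { text: "weak match", score: 0.3 },
  ]
}

const cachedSearch = withCache(fakeSearch)
const firstHits = filterByScore(cachedSearch("mining licence"))
cachedSearch("Mining Licence ") // served from the cache
console.log(firstHits.length, searchCalls) // 1 1
```

The cache here grows without bound; a production version would likely cap its size or expire entries.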

🤝 Contributing

This is a personal project showcasing semantic search for government documents. Feel free to fork and adapt for your own use cases.

📄 License

This project uses publicly available Kenya government documents. The codebase is available for educational and research purposes.

Built with modern web technologies for efficient government document search 🚀
