163 changes: 163 additions & 0 deletions RAG-with-MongoDB-Skill/README.md
@@ -0,0 +1,163 @@
# RAG with MongoDB Skill

Learn how to build a Retrieval-Augmented Generation (RAG) system using MongoDB Atlas Vector Search, LangChain, Voyage AI embeddings, and OpenAI. This example builds a question-answering system that grounds its responses in your document content.

## What This Demo Does

- 📚 **Answer questions from your documents**: Ask questions about your PDF content and get answers drawn from it
- 🔍 **Semantic document retrieval**: Finds relevant document chunks using embedding-based similarity search
- ⚡ **Fast vector search**: Uses MongoDB Atlas Vector Search for quick retrieval
- 🤖 **AI-powered responses**: Combines retrieved context with OpenAI's GPT-4 to generate answers

## What You'll Need

Before getting started, make sure you have:

- ✅ **MongoDB Atlas Cluster** with connection string
- ✅ **OpenAI API Key** for GPT-4 and metadata generation
- ✅ **Voyage AI API Key** (free tier available)
- ✅ **Python 3.9+** installed on your machine (the pinned LangChain 0.3 packages no longer support Python 3.8)

## Step-by-Step Setup

### Step 1: Set Up Your Python Environment

Create an isolated environment for this project:

**Windows:**
```bash
python -m venv venv
venv\Scripts\activate
```

**macOS/Linux:**
```bash
python3 -m venv venv
source venv/bin/activate
```

### Step 2: Install Required Packages

```bash
pip install -r requirements.txt
```

### Step 3: Configure Your API Keys

Open the `key_param.py` file and add your credentials:

```python
LLM_API_KEY="your_openai_api_key_here"
VOYAGE_API_KEY="your_voyage_api_key_here"
MONGODB_URI="your_mongodb_connection_string_here"
```

💡 **Getting your keys:**
- **MongoDB URI**: Copy from your Atlas cluster's "Connect" button
- **OpenAI API Key**: Get from [platform.openai.com](https://platform.openai.com)
- **Voyage API Key**: Sign up at [voyageai.com](https://voyageai.com) for a free API key
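
Before running the scripts, it can help to confirm that none of the three settings still holds its template value. The following is a minimal sketch of such a check. The helper `find_unset_settings` and the `REQUIRED_SETTINGS` tuple are hypothetical names introduced here for illustration and are not part of this repository.

```python
# Hypothetical helper, not part of the repository: flags settings that are
# missing, empty, or still hold the "your_..." placeholder from key_param.py.
REQUIRED_SETTINGS = ("LLM_API_KEY", "VOYAGE_API_KEY", "MONGODB_URI")


def find_unset_settings(settings):
    """Return the names of required settings that do not look configured."""
    unset = []
    for name in REQUIRED_SETTINGS:
        value = settings.get(name, "")
        if not value or value.startswith("your_"):
            unset.append(name)
    return unset


# Example: one filled-in value and two untouched placeholders.
example = {
    "LLM_API_KEY": "sk-example",
    "VOYAGE_API_KEY": "your_voyage_api_key_here",
    "MONGODB_URI": "your_mongodb_connection_string_here",
}
print(find_unset_settings(example))
```

Against the real module, the same check could be run as `find_unset_settings(vars(key_param))` after `import key_param`.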

## How to Use

### 1. Load Your Data

First, run the data loading script to process your PDF and store embeddings:

```bash
python load_data.py
```

⏱️ **Note**: This process may take a couple of minutes as it generates metadata for each page and embeddings for each document chunk.

This will:
- 📄 **Load and clean** your PDF document
- ✂️ **Split text** into manageable chunks (500 chars with 150 overlap)
- 🏷️ **Generate metadata** using OpenAI (title, keywords, hasCode)
- 🧠 **Create embeddings** using Voyage AI's voyage-3-large model
- 💾 **Store everything** in MongoDB Atlas with vector search capabilities
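
To make the "500 chars with 150 overlap" setting concrete, here is a simplified sketch of fixed-window chunking. It is an illustration only: `load_data.py` uses LangChain's `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraphs and sentences, so its real chunk boundaries will differ. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=150):
    """Split text into fixed-size windows that share chunk_overlap characters.

    Simplified illustration; not the algorithm used by
    RecursiveCharacterTextSplitter.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts 350 chars after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# 1000 characters of sample text yields windows of 500, 500, and 300 characters.
sample = "".join(chr(ord("a") + i % 26) for i in range(1000))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
```

The overlap means the last 150 characters of one chunk reappear at the start of the next, so a sentence cut at a boundary is still seen whole in at least one chunk.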

### 2. Create Vector Search Index

After your data is loaded, create a vector search index in your Atlas cluster's **Search & Vector Search** tab (in the left sidebar):

- **Database:** `book_mongodb_chunks`
- **Collection:** `chunked_data`
- **Index Name:** `vector_index`

**Index Definition:**

```json
{
  "fields": [
    {
      "numDimensions": 1024,
      "path": "embedding",
      "similarity": "dotProduct",
      "type": "vector"
    },
    {
      "path": "hasCode",
      "type": "filter"
    }
  ]
}
```

⚠️ **Important**: Wait for the index to finish building before proceeding. You can check the index status in the Atlas UI; it should show as "Ready" before you run queries.
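
As an alternative to the Atlas UI, the same index can likely be created from Python. The sketch below assumes PyMongo's `SearchIndexModel`, `Collection.create_search_index`, and `Collection.list_search_indexes`, and assumes the returned index document exposes a `queryable` flag. The function names `build_index_definition` and `create_and_wait` are hypothetical and not part of this repository.

```python
import time

INDEX_NAME = "vector_index"


def build_index_definition(num_dimensions=1024):
    """Return the same index definition shown in the JSON above."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": "embedding",
                "numDimensions": num_dimensions,
                "similarity": "dotProduct",
            },
            {"type": "filter", "path": "hasCode"},
        ]
    }


def create_and_wait(collection, timeout_seconds=300, poll_seconds=5):
    """Create the index on a PyMongo collection and poll until it is queryable.

    Returns True once the index reports queryable, False on timeout.
    """
    # Imported lazily so the definition can be inspected without pymongo installed.
    from pymongo.operations import SearchIndexModel

    model = SearchIndexModel(
        definition=build_index_definition(),
        name=INDEX_NAME,
        type="vectorSearch",
    )
    collection.create_search_index(model=model)

    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        indexes = list(collection.list_search_indexes(INDEX_NAME))
        if indexes and indexes[0].get("queryable"):
            return True
        time.sleep(poll_seconds)
    return False
```

With the client from `load_data.py`, this would be called as `create_and_wait(client["book_mongodb_chunks"]["chunked_data"])`.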

### 3. Ask Questions

Run the RAG system to start asking questions:

```bash
python rag.py
```

### 4. Customize Your Queries

Edit the query in `rag.py` to ask different questions:

```python
print(query_data("What is the difference between a collection and database in MongoDB?"))
# Try other questions like:
# "How do I create an index in MongoDB?"
# "What are the benefits of using MongoDB Atlas?"
# "Explain MongoDB's aggregation pipeline"
```
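
Because the index declares `hasCode` as a filter field, queries can probably be narrowed to chunks that do or do not contain code. The sketch below builds the `search_kwargs` dictionary for the retriever; it assumes that `langchain_mongodb` accepts a `pre_filter` entry in `search_kwargs`, which is worth verifying against the installed version. The helper `build_search_kwargs` is a hypothetical name.

```python
def build_search_kwargs(k=3, has_code=None):
    """Build retriever search_kwargs, optionally filtering on the hasCode field.

    Assumption: the vector store forwards "pre_filter" to Atlas Vector Search
    as an MQL filter on fields indexed with type "filter".
    """
    kwargs = {"k": k}
    if has_code is not None:
        kwargs["pre_filter"] = {"hasCode": {"$eq": has_code}}
    return kwargs


# Restrict retrieval to the three best-matching chunks tagged as containing code.
print(build_search_kwargs(has_code=True))
```

In `rag.py`, the result would replace the literal dictionary: `vectorStore.as_retriever(search_type="similarity", search_kwargs=build_search_kwargs(has_code=True))`.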

### 5. View Results

The system will output intelligent answers based on your document content with:
- 💭 **Contextual answers** generated from relevant document sections
- 🎯 **Source-grounded responses**: the prompt instructs the model to answer only from the retrieved context, which reduces (but cannot fully rule out) hallucination
- ⚡ **Fast retrieval** using vector similarity search

## Example Output

```
Query: "What is the difference between a collection and database in MongoDB?"

Answer: Based on the provided context, a database in MongoDB is a container that holds collections, while a collection is a grouping of MongoDB documents. Think of a database as a filing cabinet and collections as the folders within that cabinet that organize related documents together.
```

## How It Works

1. **Document Processing**: Your PDF gets chunked into smaller pieces with metadata extraction
2. **Vector Embedding**: Each chunk gets converted to a high-dimensional vector using Voyage AI
3. **Semantic Search**: When you ask a question, it finds the most relevant chunks using vector similarity
4. **Context Assembly**: Top matching chunks get combined into context for the AI
5. **Answer Generation**: OpenAI GPT-4 generates answers based only on the retrieved context
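
Steps 3 and 4 can be sketched with toy data. The example below ranks chunks by dot-product similarity (the metric configured in the index) and joins the best matches into a context string, mirroring the `"\n\n".join(...)` step in `rag.py`. The three-dimensional vectors and the function names are made up for illustration; the real system uses 1024-dimensional Voyage AI embeddings and runs the search inside Atlas.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_chunks(query_vector, chunks, k=3):
    """Return the k chunks whose embeddings score highest against the query."""
    ranked = sorted(
        chunks,
        key=lambda chunk: dot(query_vector, chunk["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def assemble_context(chunks):
    """Join chunk texts with blank lines, as the chain in rag.py does."""
    return "\n\n".join(chunk["text"] for chunk in chunks)


# Toy unit-length embeddings; real ones come from the embedding model.
toy_chunks = [
    {"text": "A database holds collections.", "embedding": [1.0, 0.0, 0.0]},
    {"text": "A collection groups documents.", "embedding": [0.8, 0.6, 0.0]},
    {"text": "Atlas is a managed service.", "embedding": [0.0, 0.0, 1.0]},
]
query_vector = [0.6, 0.8, 0.0]

best = top_k_chunks(query_vector, toy_chunks, k=2)
context = assemble_context(best)
print(context)
```

Dot product is a reasonable ranking metric only when the vectors are normalized to unit length, which is why the toy vectors above all have length 1.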

## Troubleshooting

- **🚫 "No vector index found"**: Make sure your Atlas vector search index is created and active
- **🔑 "Authentication failed"**: Verify your API keys in `key_param.py`
- **📦 "Module not found"**: Ensure you activated your virtual environment
- **📄 "File not found"**: Check that your PDF is in the `sample_files` directory

## Learn More

- 📚 [MongoDB Atlas Vector Search Documentation](https://docs.atlas.mongodb.com/atlas-vector-search/)
- 🎓 [Earn the Vector Search Fundamentals Badge](https://learn.mongodb.com/courses/vector-search-fundamentals)
- 🎓 [Earn the RAG with MongoDB Badge](https://learn.mongodb.com/courses/rag-with-mongodb)
- 🤖 [Voyage AI](https://voyageai.com/)
3 changes: 3 additions & 0 deletions RAG-with-MongoDB-Skill/key_param.py
@@ -0,0 +1,3 @@
LLM_API_KEY="your_openai_api_key_here"
VOYAGE_API_KEY="your_voyage_api_key_here"
MONGODB_URI="your_mongodb_connection_string_here"
54 changes: 54 additions & 0 deletions RAG-with-MongoDB-Skill/load_data.py
@@ -0,0 +1,54 @@
from pymongo import MongoClient
from langchain_openai import ChatOpenAI
from langchain_voyageai import VoyageAIEmbeddings
# Use the maintained langchain_mongodb integration (also used in rag.py)
# rather than the deprecated langchain_community vector store.
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_transformers.openai_functions import (
    create_metadata_tagger,
)

import key_param

# Set the MongoDB URI, DB, Collection Names

client = MongoClient(key_param.MONGODB_URI)
dbName = "book_mongodb_chunks"
collectionName = "chunked_data"
collection = client[dbName][collectionName]

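# Forward slashes work on Windows, macOS, and Linux, and avoid the invalid
# "\s" and "\m" escape sequences in the original backslash path.
loader = PyPDFLoader("./sample_files/mongodb.pdf")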
pages = loader.load()
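# Keep only pages with more than 20 space-separated words, which drops
# covers and near-empty pages.
cleaned_pages = []

for page in pages:
    if len(page.page_content.split(" ")) > 20:
        cleaned_pages.append(page)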

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=150)

schema = {
    "properties": {
        "title": {"type": "string"},
        "keywords": {"type": "array", "items": {"type": "string"}},
        "hasCode": {"type": "boolean"},
    },
    "required": ["title", "keywords", "hasCode"],
}

llm = ChatOpenAI(
    openai_api_key=key_param.LLM_API_KEY, temperature=0, model="gpt-4"
)

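# Tag each page with title, keywords, and hasCode metadata via the LLM.
document_transformer = create_metadata_tagger(metadata_schema=schema, llm=llm)

docs = document_transformer.transform_documents(cleaned_pages)

# Split the tagged pages into chunks; each chunk inherits its page's metadata.
split_docs = text_splitter.split_documents(docs)

embeddings = VoyageAIEmbeddings(voyage_api_key=key_param.VOYAGE_API_KEY, model="voyage-3-large")

# Embed the chunks and store them in the Atlas collection.
vectorStore = MongoDBAtlasVectorSearch.from_documents(
    split_docs, embeddings, collection=collection
)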
62 changes: 62 additions & 0 deletions RAG-with-MongoDB-Skill/rag.py
@@ -0,0 +1,62 @@
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
import key_param
from langchain_voyageai import VoyageAIEmbeddings

dbName = "book_mongodb_chunks"
collectionName = "chunked_data"
index = "vector_index"

vectorStore = MongoDBAtlasVectorSearch.from_connection_string(
    key_param.MONGODB_URI,
    dbName + "." + collectionName,
    VoyageAIEmbeddings(voyage_api_key=key_param.VOYAGE_API_KEY, model="voyage-3-large"),
    index_name=index,
)

def query_data(query):
    # Retrieve the 3 chunks most similar to the query.
    retriever = vectorStore.as_retriever(
        search_type="similarity",
        search_kwargs={
            "k": 3
        },
    )

    template = """
    Use the following pieces of context to answer the question at the end.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.
    Do not answer the question if there is no given context.
    Do not answer the question if it is not related to the context.
    Do not give recommendations to anything other than MongoDB.
    Context:
    {context}
    Question: {question}
    """

    custom_rag_prompt = PromptTemplate.from_template(template)

    # Build the prompt inputs: joined chunk text as context, the raw query as question.
    retrieve = {
        "context": retriever | (lambda docs: "\n\n".join([d.page_content for d in docs])),
        "question": RunnablePassthrough()
    }

    llm = ChatOpenAI(openai_api_key=key_param.LLM_API_KEY, temperature=0, model="gpt-4")

    response_parser = StrOutputParser()

    # Chain: retrieve context -> fill prompt -> call the LLM -> parse to a string.
    rag_chain = (
        retrieve
        | custom_rag_prompt
        | llm
        | response_parser
    )

    answer = rag_chain.invoke(query)

    return answer

print(query_data("What is the difference between a collection and database in MongoDB?"))
8 changes: 8 additions & 0 deletions RAG-with-MongoDB-Skill/requirements.txt
@@ -0,0 +1,8 @@
langchain==0.3.27
langchain_community==0.3.30
langchain_core==0.3.78
langchain_mongodb==0.7.0
pymongo==4.15.2
langchain-voyageai==0.1.3
langchain_openai==0.3.35
pypdf==6.1.1
Binary file added RAG-with-MongoDB-Skill/sample_files/mongodb.pdf
Binary file not shown.