| 1 | +# Knowledge Graph-based RAG System |
| 2 | + |
| 3 | +This repository contains the materials for the NVIDIA GTC 2025 Deep Learning Institute course: "Structure From Chaos: Accelerate GraphRAG With cuGraph and NVIDIA NIM" [DLIT71491]. |
| 4 | + |
| 5 | +You can access the online course video at: [NVIDIA On-Demand](https://www.nvidia.com/en-us/on-demand/session/gtc25-dlit71491/) |
| 6 | + |
| 7 | +## Overview |
| 8 | + |
| 9 | +This course demonstrates how to build a Knowledge Graph-based RAG system for financial documents, specifically SEC 10-K filings. The system extracts structured knowledge triplets, builds a dynamic knowledge graph, and provides enhanced query capabilities. |
| 10 | + |
| 11 | +You'll learn how to integrate large language models (LLMs) with NVIDIA Inference Microservices (NIM) and cuGraph to build graph-based AI solutions for complex, interconnected data. The course covers fine-tuning techniques, LangChain agents, GPU-accelerated graph analytics, and evaluation of retrieval-augmented generation (RAG) systems. |
| 12 | + |
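To make the triplet idea in the overview concrete, here is a minimal, standard-library-only sketch of a knowledge graph that accepts and drops (subject, relation, object) triplets over time. It is not the course implementation, which uses ArangoDB and cuGraph. The `Triplet` and `TripletGraph` names, the relation labels, and the "Acme" entities are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Triplet(NamedTuple):
    """One (subject, relation, object) fact; names here are illustrative."""
    subject: str
    relation: str
    obj: str


class TripletGraph:
    """Toy in-memory knowledge graph keyed by subject (hypothetical helper)."""

    def __init__(self):
        # Maps each subject to the set of (relation, object) edges leaving it.
        self._edges = defaultdict(set)

    def add(self, t):
        # Adding the same triplet twice has no effect because edges live in a set.
        self._edges[t.subject].add((t.relation, t.obj))

    def remove(self, t):
        # discard() does nothing when the edge is absent.
        self._edges[t.subject].discard((t.relation, t.obj))

    def facts_about(self, subject):
        # Return the current facts for a subject in a stable, sorted order.
        return sorted(Triplet(subject, r, o) for r, o in self._edges.get(subject, ()))


# Made-up example data, not taken from any real filing.
graph = TripletGraph()
graph.add(Triplet("Acme Corp", "FILED", "10-K 2021"))
graph.add(Triplet("Acme Corp", "HAS_SUBSIDIARY", "Acme Labs"))
graph.remove(Triplet("Acme Corp", "HAS_SUBSIDIARY", "Acme Labs"))
print(graph.facts_about("Acme Corp"))
```

A retrieval step could then look up `facts_about(...)` for the entities mentioned in a question and pass those facts to the LLM as context. A real deployment would replace the dictionary with a persistent graph database.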
| 13 | +## Prerequisites |
| 14 | + |
| 15 | +- Docker and docker-compose |
| 16 | +- 4x NVIDIA A100 GPUs required for the LLM fine-tuning workflow (Notebook 2) |
| 17 | +- NVIDIA API key (sign up at [NVIDIA AI portal](https://developer.nvidia.com/)) |
| 18 | + |
| 19 | +## Setup |
| 20 | + |
| 21 | +1. Clone this repository |
| 22 | +2. Set your NVIDIA API key in the `.env` file: |
| 23 | + ``` |
| 24 | + NVIDIA_API_KEY=your-nvapi-key |
| 25 | + NGC_API_KEY=your-ngc-key # For some containerized models |
| 26 | + ``` |
| 27 | +3. Start the containers: |
| 28 | + ```bash |
| 29 | + docker-compose up -d |
| 30 | + ``` |
| 31 | +4. Access the JupyterLab environment at: http://localhost:8888 |
| 32 | + |
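Before running the notebooks, it can help to confirm that the keys from step 2 are actually visible to the Python process. The sketch below assumes the values in `.env` are exported into the notebook's environment, which depends on how the compose file is configured. The `unset_keys` helper is hypothetical and not part of the repository.

```python
import os

# Variable names come from the .env example in the setup steps.
EXPECTED_KEYS = ("NVIDIA_API_KEY", "NGC_API_KEY")

# Placeholder values shown in the setup steps; treated as "not set".
PLACEHOLDERS = {"your-nvapi-key", "your-ngc-key"}


def unset_keys(env=None):
    """Return the expected key names that are missing, empty, or still placeholders."""
    env = os.environ if env is None else env
    missing = []
    for name in EXPECTED_KEYS:
        value = env.get(name, "").strip()
        if not value or value in PLACEHOLDERS:
            missing.append(name)
    return missing


# Prints an empty list when both keys are set to non-placeholder values.
print(unset_keys())
```

Since the setup steps note that `NGC_API_KEY` is needed only for some containerized models, a missing value there may be acceptable depending on which notebooks you run.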
| 33 | +## Notebooks |
| 34 | + |
| 35 | +The course consists of five notebooks that build upon each other: |
| 36 | + |
| 37 | +1. **01_Graph_Triplet_Extraction.ipynb** |
| 38 | + - Extract structured (subject, relation, object) triplets from SEC 10-K filings |
| 39 | + - Transform unstructured text into structured knowledge for a knowledge graph |
| 40 | + |
| 41 | +2. **02_LLM_Finetuning.ipynb** |
| 42 | + - Fine-tune a smaller LLM (Llama 3 8B) for more accurate triplet prediction |
| 43 | + - Leverage NVIDIA NeMo and NVIDIA Inference Microservices (NIM) |
| 44 | + |
| 45 | +3. **03_Dynamic_Database.ipynb** |
| 46 | + - Set up a persistent, dynamic backend (ArangoDB) for the knowledge graph |
| 47 | + - Handle triplets being added or deleted over time |
| 48 | + - Create a GraphRAG agent connected to a continuously updating database |
| 49 | + |
| 50 | +4. **04_Evaluation.ipynb** |
| 51 | + - Evaluate the RAG system using Ragas and NVIDIA's Nemotron-4-340B-Reward model |
| 52 | + - Assess metrics such as faithfulness, context precision, and answer relevancy |
| 53 | + |
| 54 | +5. **05_Link_Prediction.ipynb** |
| 55 | + - Improve knowledge graph completeness through link prediction |
| 56 | + - Use techniques such as the TransE model and non-negative matrix factorization |
| 57 | + - Predict missing relationships within the knowledge graph |
| 58 | + |
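As a rough illustration of the link prediction idea in notebook 5, the sketch below shows the scoring rule behind TransE: a triplet (head, relation, tail) is considered plausible when the head embedding plus the relation embedding lands close to the tail embedding. This is a pure-Python toy, not the notebook's code. The 2-D vectors are hand-picked rather than trained, and the entity and relation names are hypothetical.

```python
import math


def transe_score(head, relation, tail):
    """Negative L2 distance between (head + relation) and tail; higher is more plausible."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Hand-picked toy embeddings (assumed for illustration; a real model learns these).
entities = {
    "Acme Corp": [0.0, 0.0],
    "Acme Labs": [1.0, 0.0],
    "Globex": [0.0, 3.0],
}
relations = {"HAS_SUBSIDIARY": [1.0, 0.0]}


def rank_tails(head, relation):
    """Rank every other entity as a candidate tail, best score first."""
    scores = {
        name: transe_score(entities[head], relations[relation], vec)
        for name, vec in entities.items()
        if name != head
    }
    return sorted(scores, key=scores.get, reverse=True)


print(rank_tails("Acme Corp", "HAS_SUBSIDIARY"))
```

In practice the embeddings are learned from the existing triplets. High-scoring candidate tails that are absent from the graph are then proposed as missing relationships.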
| 59 | +## Data |
| 60 | + |
| 61 | +The notebooks use 2021 SEC documents (10-K filings) from the [Kaggle SEC EDGAR Annual Financial Filings 2021 dataset](https://www.kaggle.com/datasets/pranjalverma08/sec-edgar-annual-financial-filings-2021). The data preprocessing tools are included in the `data_prep` directory. |
| 62 | + |
| 63 | +## Requirements |
| 64 | + |
| 65 | +The key Python packages required for this project are listed in `requirements.txt` and include: |
| 66 | +- numpy, scipy, scikit-learn |
| 67 | +- torch, pykeen, fairscale |
| 68 | +- jupyterlab, networkx |
| 69 | +- ArangoDB libraries (nx_arangodb, arango) |
| 70 | +- NVIDIA GPU libraries (cugraph-cu12, nx_cugraph-cu12) |
| 71 | +- LLM frameworks (llama-index, langchain, transformers) |
| 72 | +- Evaluation tools (ragas, datasets) |
| 73 | + |
| 74 | +## Docker Environment |
| 75 | + |
| 76 | +The repository uses multiple containers: |
| 77 | +- A JupyterLab environment with CUDA support |
| 78 | +- NVIDIA NeMo container for model fine-tuning |
| 79 | +- NVIDIA NIM container for model serving |
| 80 | +- ArangoDB for graph database storage |