* Add models page
* Update config docs for new params
* Spelling
* Add comment on CoT with o-series
* Add notes about managed identity
* Update the viz guide
* Spruce up the getting started wording
* Capitalization
* Add BYOG page
* More BYOG edits
* Update dictionary
* Change example model name
docs/config/init.md (1 addition, 1 deletion)
@@ -29,4 +29,4 @@ The `init` command will create the following files in the specified directory:

## Next Steps

-After initializing your workspace, you can either run the [Prompt Tuning](../prompt_tuning/auto_prompt_tuning.md) command to adapt the prompts to your data or even start running the [Indexing Pipeline](../index/overview.md) to index your data. For more information on configuring GraphRAG, see the [Configuration](overview.md) documentation.
+After initializing your workspace, you can either run the [Prompt Tuning](../prompt_tuning/auto_prompt_tuning.md) command to adapt the prompts to your data or even start running the [Indexing Pipeline](../index/overview.md) to index your data. For more information on the configuration options available, see the [YAML details page](yaml.md).
New models page:

This page contains information on selecting a model to use and options to supply your own model for GraphRAG. Note that this is not a guide to finding the right model for your use case.

## Default Model Support
GraphRAG was built and tested using OpenAI models, so this is the default model set we support. This is not intended to be a limiter or statement of quality or fitness for your use case, only that it's the set we are most familiar with for prompting, tuning, and debugging.

GraphRAG also utilizes a language model wrapper library used by several projects within our team, called fnllm. fnllm provides two important functions for GraphRAG: rate limiting configuration to help us maximize throughput for large indexing jobs, and robust caching of API calls to minimize consumption on repeated indexes for testing, experimentation, or incremental ingest. fnllm uses the OpenAI Python SDK under the covers, so OpenAI-compliant endpoints are a base requirement out-of-the-box.
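For illustration, both functions surface in settings.yaml: throughput limits sit on each model entry and caching is configured at the top level. This is only a sketch; the numeric limits are placeholders to tune to your own quota, and the file-based cache block mirrors what `graphrag init` typically generates (an assumption worth checking against your own settings.yaml).

```yaml
models:
  default_chat_model:
    type: openai_chat
    model: gpt-4o
    api_key: ${GRAPHRAG_API_KEY}
    # fnllm rate limiting: tune these to your account's quota
    tokens_per_minute: 150000
    requests_per_minute: 1000
    concurrent_requests: 25

cache:
  # fnllm caches API responses here and reuses them on repeated runs
  type: file
  base_dir: "cache"
```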
## Model Selection Considerations
GraphRAG has been most thoroughly tested with the gpt-4 series of models from OpenAI, including gpt-4, gpt-4-turbo, gpt-4o, and gpt-4o-mini. Our [arXiv paper](https://arxiv.org/abs/2404.16130), for example, performed quality evaluation using gpt-4-turbo.
Versions of GraphRAG before 2.2.0 made extensive use of `max_tokens` and `logit_bias` to control generated response length or content. The introduction of the o-series of models added new, incompatible parameters because these models include a reasoning component that has different consumption patterns and response generation attributes than non-reasoning models. GraphRAG 2.2.0 now supports these models, but there are important differences that need to be understood before you switch.
- Previously, GraphRAG used `max_tokens` to limit responses in a few locations. This was done so that we could have predictable content sizes when building downstream context windows for summarization. We have now switched from `max_tokens` to a prompted approach, which is working well in our tests. We suggest using `max_tokens` in your language model config only for budgetary reasons if you want to limit consumption, and not for expected response length control. We now also support the o-series equivalent `max_completion_tokens`, but if you use this, keep in mind that there may be some unknown fixed reasoning consumption amount in addition to the response tokens, so it is not a good technique for response length control.
- Previously, GraphRAG used a combination of `max_tokens` and `logit_bias` to strictly control a binary yes/no question during gleanings. This is not possible with reasoning models, so again we have switched to a prompted approach. Our tests with gpt-4o, gpt-4o-mini, and o1 show that this works consistently, but could have issues if you have an older or smaller model.
- The o-series models are much slower and more expensive. It may be useful to use an asymmetric approach to model use in your config: you can define as many models as you like in the `models` block of your settings.yaml and reference them by key for every workflow that requires a language model. You could use gpt-4o for indexing and o1 for query, for example (see the sketch after this list). Experiment to find the right balance of cost, speed, and quality for your use case.
- The o-series models contain a form of native chain-of-thought reasoning that is absent in the non-o-series models. GraphRAG's prompts sometimes contain CoT because it was an effective technique with the gpt-4* series. It may be counterproductive with the o-series, so you may want to tune or even re-write large portions of the prompt templates (particularly for graph and claim extraction). Another option would be to avoid using a language model at all for graph extraction, instead using the `fast` [indexing method](../index/methods.md), which uses NLP for portions of the indexing phase in lieu of LLM APIs.
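A minimal sketch of that asymmetric setup, assuming OpenAI-hosted models, an API key supplied via environment variable, and that the query sections reference a model via `chat_model_id` (the `reasoning_chat_model` key name and the specific model choices are illustrative only):

```yaml
models:
  default_chat_model:
    type: openai_chat
    model: gpt-4o              # cheaper, faster model for the indexing workflows
    api_key: ${GRAPHRAG_API_KEY}
  reasoning_chat_model:
    type: openai_chat
    model: o1                  # reasoning model reserved for query time
    api_key: ${GRAPHRAG_API_KEY}

extract_graph:
  model_id: default_chat_model

global_search:
  chat_model_id: reasoning_chat_model
```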
## Using Non-OpenAI Models
As noted above, our primary experience and focus have been on OpenAI models, so this is what is supported out-of-the-box. Many users have requested support for additional model types, but it's beyond the scope of our research to handle the many models available today. There are two approaches you can use to connect to a non-OpenAI model:
### Proxy APIs
Many users have used platforms such as [ollama](https://ollama.com/) to proxy the underlying model HTTP calls to a different model provider. This seems to work reasonably well, but we frequently see issues with malformed responses (especially JSON), so if you do this please understand that your model needs to reliably return the specific response formats that GraphRAG expects. If you're having trouble with a model, you may need to try prompting to coax the format, or intercepting the response within your proxy to try to handle malformed responses.
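As a sketch of that setup, a model entry can point `api_base` at the proxy's OpenAI-compatible endpoint; the model name, port, and dummy key below are assumptions for a local ollama instance rather than tested values:

```yaml
models:
  default_chat_model:
    type: openai_chat
    model: llama3                          # whichever model the proxy serves
    api_base: http://localhost:11434/v1    # ollama's OpenAI-compatible endpoint
    api_key: unused                        # the OpenAI SDK requires a value even if the proxy ignores it
```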
### Model Protocol
As of GraphRAG 2.0.0, we support model injection through the use of a standard chat and embedding Protocol and an accompanying ModelFactory that you can use to register your model implementation. This is not supported with the CLI, so you'll need to use GraphRAG as a library.

- Our Protocol is [defined here](https://github.com/microsoft/graphrag/blob/main/graphrag/language_model/protocol/base.py)
- Our base implementation, which wraps fnllm, [is here](https://github.com/microsoft/graphrag/blob/main/graphrag/language_model/providers/fnllm/models.py)
- We have a simple mock implementation in our tests that you can [reference here](https://github.com/microsoft/graphrag/blob/main/tests/mock_provider.py)

Once you have a model implementation, you need to register it with our ModelFactory:
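The registration snippet itself is collapsed in this diff view. As a rough sketch only (assuming the factory is importable from `graphrag.language_model.factory`, that it exposes a `register_chat` helper, and that `MyCustomChatModel` is your own class implementing the chat Protocol linked above), registration might look like:

```python
from graphrag.language_model.factory import ModelFactory

from my_package.models import MyCustomChatModel  # hypothetical Protocol implementation

# Register the implementation under the type name that settings.yaml will reference.
ModelFactory.register_chat(
    "my-custom-chat-model",
    lambda **kwargs: MyCustomChatModel(**kwargs),
)
```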
Then in your config you can reference the type name you used:
```yaml
models:
  default_chat_model:
    type: my-custom-chat-model

extract_graph:
  model_id: default_chat_model
  prompt: "prompts/extract_graph.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1
```
Note that your custom model will be passed the same params for init and method calls that we use throughout GraphRAG. There is not currently any ability to define custom parameters, so you may need to use closure scope or a factory pattern within your implementation to get custom config values.
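For example (a sketch reusing the hypothetical names above), a value that GraphRAG's config cannot express can be captured in the factory callable's closure:

```python
# Hypothetical custom setting with no corresponding GraphRAG config field.
custom_endpoint = "https://llm.example.internal"

ModelFactory.register_chat(
    "my-custom-chat-model",
    lambda **kwargs: MyCustomChatModel(endpoint=custom_endpoint, **kwargs),
)
```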
docs/config/overview.md (3 additions, 3 deletions)
@@ -4,8 +4,8 @@ The GraphRAG system is highly configurable. This page provides an overview of th

## Default Configuration Mode

-The default configuration mode is the simplest way to get started with the GraphRAG system. It is designed to work out-of-the-box with minimal configuration. The primary configuration sections for the Indexing Engine pipelines are described below. The main ways to set up GraphRAG in Default Configuration mode are via:
+The default configuration mode is the simplest way to get started with the GraphRAG system. It is designed to work out-of-the-box with minimal configuration. The main ways to set up GraphRAG in Default Configuration mode are via:

-- [Init command](init.md) (recommended)
-- [Using YAML for deeper control](yaml.md)
+- [Init command](init.md) (recommended first step)
+- [Edit settings.yaml for deeper control](yaml.md)
- [Purely using environment variables](env_vars.md) (not recommended)
docs/config/yaml.md (18 additions, 37 deletions)
@@ -60,12 +60,14 @@ models:
- `concurrent_requests` **int** - The number of open requests to allow at once.
- `async_mode` **asyncio|threaded** - The async mode to use. Either `asyncio` or `threaded`.
- `responses` **list[str]** - If this model type is mock, this is a list of response strings to return.
-- `max_tokens` **int** - The maximum number of output tokens.
-- `temperature` **float** - The temperature to use.
-- `top_p` **float** - The top-p value to use.
- `n` **int** - The number of completions to generate.
-- `frequency_penalty` **float** - Frequency penalty for token generation.
-- `presence_penalty` **float** - Presence penalty for token generation.
+- `max_tokens` **int** - The maximum number of output tokens. Not valid for o-series models.
+- `temperature` **float** - The temperature to use. Not valid for o-series models.
+- `top_p` **float** - The top-p value to use. Not valid for o-series models.
+- `frequency_penalty` **float** - Frequency penalty for token generation. Not valid for o-series models.
+- `presence_penalty` **float** - Presence penalty for token generation. Not valid for o-series models.
+- `max_completion_tokens` **int** - Max number of tokens to consume for chat completion. Must be large enough to include an unknown amount for "reasoning" by the model. o-series models only.
+- `reasoning_effort` **low|medium|high** - Amount of "thought" for the model to expend reasoning about a response. o-series models only.
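To illustrate how these split by model family, an o-series chat model entry would set the two new fields and omit the sampling parameters; the model name and values below are placeholders, not recommendations:

```yaml
models:
  default_chat_model:
    type: openai_chat
    model: o1
    api_key: ${GRAPHRAG_API_KEY}
    max_completion_tokens: 8000   # must cover hidden reasoning tokens plus the visible response
    reasoning_effort: medium
    # leave max_tokens, temperature, top_p, frequency_penalty, presence_penalty unset
```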
## Input Files and Chunking
@@ -212,7 +214,6 @@ Tune the language model-based graph extraction process.
- `prompt` **str** - The prompt file to use.
- `entity_types` **list[str]** - The entity types to identify.
- `max_gleanings` **int** - The maximum number of gleaning cycles to use.
-- `encoding_model` **str** - The text encoding model to use. Default is to use the encoding model aligned with the language model (i.e., it is retrieved from tiktoken if unset). This is only used for gleanings during the logit_bias check.

### summarize_descriptions
@@ -221,6 +222,7 @@ Tune the language model-based graph extraction process.
- `model_id` **str** - Name of the model definition to use for API calls.
- `prompt` **str** - The prompt file to use.
- `max_length` **int** - The maximum number of output tokens per summarization.
+- `max_input_length` **int** - The maximum number of tokens to collect for summarization (this will limit how many descriptions you send to be summarized for a given entity or relationship).

### extract_graph_nlp
@@ -274,7 +276,6 @@ These are the settings used for Leiden hierarchical clustering of the graph to c
- `prompt` **str** - The prompt file to use.
- `description` **str** - Describes the types of claims we want to extract.
- `max_gleanings` **int** - The maximum number of gleaning cycles to use.
-- `encoding_model` **str** - The text encoding model to use. Default is to use the encoding model aligned with the language model (i.e., it is retrieved from tiktoken if unset). This is only used for gleanings during the logit_bias check.

### community_reports
@@ -329,11 +330,7 @@ Indicates whether we should run UMAP dimensionality reduction. This is used to p
- `conversation_history_max_turns` **int** - The conversation history maximum turns.
- `top_k_entities` **int** - The top k mapped entities.
- `top_k_relationships` **int** - The top k mapped relations.
-- `temperature` **float | None** - The temperature to use for token generation.
-- `top_p` **float | None** - The top-p value to use for token generation.
-- `n` **int | None** - The number of completions to generate.
-- `max_tokens` **int** - The maximum tokens.
-- `llm_max_tokens` **int** - The LLM maximum tokens.
+- `max_context_tokens` **int** - The maximum tokens to use building the request context.

### global_search
@@ -346,20 +343,14 @@ Indicates whether we should run UMAP dimensionality reduction. This is used to p
- `map_prompt` **str | None** - The global search mapper prompt to use.
- `reduce_prompt` **str | None** - The global search reducer prompt to use.
- `knowledge_prompt` **str | None** - The global search general prompt to use.
-- `temperature` **float | None** - The temperature to use for token generation.
-- `top_p` **float | None** - The top-p value to use for token generation.
-- `n` **int | None** - The number of completions to generate.
-- `max_tokens` **int** - The maximum context size in tokens.
-- `data_max_tokens` **int** - The data llm maximum tokens.
-- `map_max_tokens` **int** - The map llm maximum tokens.
-- `reduce_max_tokens` **int** - The reduce llm maximum tokens.
-- `concurrency` **int** - The number of concurrent requests.
-- `dynamic_search_llm` **str** - LLM model to use for dynamic community selection.
+- `max_context_tokens` **int** - The maximum context size to create, in tokens.
+- `data_max_tokens` **int** - The maximum tokens to use constructing the final response from the reduce responses.
+- `map_max_length` **int** - The maximum length to request for map responses, in words.
+- `reduce_max_length` **int** - The maximum length to request for reduce responses, in words.
- `dynamic_search_threshold` **int** - Rating threshold to include a community report.
- `dynamic_search_keep_parent` **bool** - Keep parent community if any of the child communities are relevant.
- `dynamic_search_num_repeats` **int** - Number of times to rate the same community report.
- `dynamic_search_use_summary` **bool** - Use community summary instead of full_context.
-- `dynamic_search_concurrent_coroutines` **int** - Number of concurrent coroutines to rate community reports.
- `dynamic_search_max_level` **int** - The maximum level of community hierarchy to consider if none of the processed communities are relevant.
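A possible shape for the updated global_search block, with placeholder values (note that the map/reduce limits are requested lengths in words, while the other two are token budgets):

```yaml
global_search:
  max_context_tokens: 12000   # maximum context size to create, in tokens
  data_max_tokens: 12000      # token budget when constructing the final response
  map_max_length: 1000        # requested map response length, in words
  reduce_max_length: 2000     # requested reduce (final) response length, in words
```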
### drift_search
@@ -370,11 +361,9 @@ Indicates whether we should run UMAP dimensionality reduction. This is used to p
- `embedding_model_id` **str** - Name of the model definition to use for Embedding calls.
- `prompt` **str** - The prompt file to use.
- `reduce_prompt` **str** - The reducer prompt file to use.
-- `temperature` **float** - The temperature to use for token generation.
-- `top_p` **float** - The top-p value to use for token generation.
-- `n` **int** - The number of completions to generate.
-- `max_tokens` **int** - The maximum context size in tokens.
- `data_max_tokens` **int** - The data llm maximum tokens.
+- `reduce_max_tokens` **int** - The maximum tokens for the reduce phase. Only use with non-o-series models.
+- `reduce_max_completion_tokens` **int** - The maximum tokens for the reduce phase. Only use with o-series models.
- `concurrency` **int** - The number of concurrent requests.
- `drift_k_followups` **int** - The number of top global results to retrieve.
- `primer_folds` **int** - The number of folds for search priming.
@@ -388,7 +377,8 @@ Indicates whether we should run UMAP dimensionality reduction. This is used to p
- `local_search_temperature` **float** - The temperature to use for token generation in local search.
- `local_search_top_p` **float** - The top-p value to use for token generation in local search.
- `local_search_n` **int** - The number of completions to generate in local search.
-- `local_search_llm_max_gen_tokens` **int** - The maximum number of generated tokens for the LLM in local search.
+- `local_search_llm_max_gen_tokens` **int** - The maximum number of generated tokens for the LLM in local search. Only use with non-o-series models.
+- `local_search_llm_max_gen_completion_tokens` **int** - The maximum number of generated tokens for the LLM in local search. Only use with o-series models.

### basic_search
@@ -397,13 +387,4 @@ Indicates whether we should run UMAP dimensionality reduction. This is used to p
- `chat_model_id` **str** - Name of the model definition to use for Chat Completion calls.
- `embedding_model_id` **str** - Name of the model definition to use for Embedding calls.
- `prompt` **str** - The prompt file to use.
-- `text_unit_prop` **float** - The text unit proportion.
-- `community_prop` **float** - The community proportion.
-- `conversation_history_max_turns` **int** - The conversation history maximum turns.
-- `top_k_entities` **int** - The top k mapped entities.
-- `top_k_relationships` **int** - The top k mapped relations.
-- `temperature` **float | None** - The temperature to use for token generation.
-- `top_p` **float | None** - The top-p value to use for token generation.
-- `n` **int | None** - The number of completions to generate.
-- `max_tokens` **int** - The maximum tokens.
-- `llm_max_tokens` **int** - The LLM maximum tokens.
+- `k` **int | None** - Number of text units to retrieve from the vector store for context building.
0 commit comments