Commit b634a5c
fix: streamline theme extraction from overlaps in MultiHopSpecificQue… (#2347)
## Issue Link / Problem Description
<!-- Link to related issue or describe the problem this PR solves -->
- Fixes #2275
- I also encountered this problem when testing the multi-hop specific query
synthesizer:
```
1 validation error for ThemesPersonasInput
themes.0
  Input should be a valid string [type=string_type,
  input_value=['Vedenjäähdytyskoneen', 'Vedenjäähdytyskone'], input_type=list]
    For further information visit https://errors.pydantic.dev/2.11/v/string_type
```
- @czhiming-maker has not updated it in the two weeks since, so I am giving
it a try.
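The failure above can be reproduced in isolation with a minimal pydantic model. Note that `ThemesInput` below is a hypothetical stand-in for ragas' `ThemesPersonasInput`; the only assumption is that its `themes` field is declared as a list of strings:

```python
from pydantic import BaseModel, ValidationError


# Hypothetical stand-in for ragas' ThemesPersonasInput (assumed shape: list[str])
class ThemesInput(BaseModel):
    themes: list[str]


try:
    # Overlap clusters arrive as nested lists instead of plain strings,
    # which is exactly what triggers the string_type error in the issue
    ThemesInput(themes=[["Vedenjäähdytyskoneen", "Vedenjäähdytyskone"]])
    error_type = None
except ValidationError as exc:
    error_type = exc.errors()[0]["type"]
    print(error_type)  # string_type
```

This confirms the bug is purely about the shape of the data handed to the model, not about the model itself.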
## Changes Made
<!-- Describe what you changed and why -->
- Add a helper function `_extract_themes_from_overlaps`, following the
discussion in #2275
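The idea behind the helper can be sketched as follows. This is a hypothetical illustration, not the merged implementation: the function name is borrowed from the PR, but the signature and flattening logic are assumptions based on the error message (overlap entries may be plain strings or nested lists of strings):

```python
def extract_themes_from_overlaps(overlaps):
    """Flatten overlap entries so every theme is a plain string.

    Hypothetical sketch: the real _extract_themes_from_overlaps in the PR
    may differ in signature and handling.
    """
    themes = []
    for entry in overlaps:
        if isinstance(entry, (list, tuple)):
            # Nested cluster of theme variants: take each variant separately
            themes.extend(str(item) for item in entry)
        else:
            themes.append(str(entry))
    return themes


flat = extract_themes_from_overlaps(
    ["cooling", ["Vedenjäähdytyskoneen", "Vedenjäähdytyskone"]]
)
print(flat)  # ['cooling', 'Vedenjäähdytyskoneen', 'Vedenjäähdytyskone']
```

With every theme guaranteed to be a string, the `ThemesPersonasInput` validation no longer fails.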
## Testing
<!-- Describe how this should be tested -->
### How to Test
```python
import os

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings

from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.testset import TestsetGenerator
from ragas.testset.synthesizers.multi_hop.specific import (
    MultiHopSpecificQuerySynthesizer,
)

# Load markdown documents from the local data directory
loader = DirectoryLoader("./data/", glob="**/*.md", loader_cls=TextLoader)
documents = loader.load()

# Set your Azure OpenAI credentials
AZURE_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT", "")
AZURE_API_KEY = os.environ.get("AZURE_OPENAI_API_KEY", "")
AZURE_API_VERSION = "2024-12-01-preview"
AZURE_DEPLOYMENT = os.environ.get("AZURE_OPENAI_DEPLOYMENT", "gpt-4o-mini")
EMBEDDING_DEPLOYMENT = "text-embedding-ada-002"

# Initialize embeddings
embeddings = AzureOpenAIEmbeddings(
    azure_endpoint=AZURE_ENDPOINT,
    api_key=AZURE_API_KEY,
    api_version=AZURE_API_VERSION,
    azure_deployment=EMBEDDING_DEPLOYMENT,
)

# Initialize LLM with JSON mode enabled
llm = AzureChatOpenAI(
    azure_endpoint=AZURE_ENDPOINT,
    api_key=AZURE_API_KEY,
    openai_api_version=AZURE_API_VERSION,
    azure_deployment=AZURE_DEPLOYMENT,
    temperature=0.3,
    model_kwargs={
        "response_format": {"type": "json_object"}  # force clean JSON output
    },
)

generator_llm = LangchainLLMWrapper(llm)
generator_embeddings = LangchainEmbeddingsWrapper(embeddings)

# Use only the multi-hop specific synthesizer so the fixed code path is exercised
distribution = [
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 1),
]

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

# generate_with_langchain_docs is synchronous, so no asyncio wrapper is needed
testset = generator.generate_with_langchain_docs(
    documents,
    testset_size=4,
    query_distribution=distribution,
)
testset.to_evaluation_dataset().to_jsonl("testset.jsonl")
```
## References
<!-- Link to related issues, discussions, forums, or external resources
-->
- Related issue: #2275
## Screenshots/Examples (if applicable)
<!-- Add screenshots or code examples showing the change -->
Sample generation during testing:
<img width="1701" height="111" alt="image"
src="https://github.com/user-attachments/assets/890d14bc-cd31-4940-8ce3-df8bc5ea459b"
/>
---
<!--
Thank you for contributing to Ragas!
Please fill out the sections above as completely as possible.
The more information you provide, the faster your PR can be reviewed and
merged.
-->
Co-authored-by: Kenzo Yan <kenzo.yan@granlund.fi>
1 file changed: +32 −9 lines changed