Commit b634a5c

kenzoyan and Kenzo Yan authored
fix: streamline theme extraction from overlaps in MultiHopSpecificQue… (#2347)
## Issue Link / Problem Description

- Fixes #2275
- I also hit this problem when testing the multi-hop specific query synthesizer:

      1 validation error for ThemesPersonasInput
      themes.0
        Input should be a valid string [type=string_type, input_value=['Vedenjäähdytyskoneen', 'Vedenjäähdytyskone'], input_type=list]
          For further information visit https://errors.pydantic.dev/2.11/v/string_type

- @czhiming-maker has not updated it in the past 2 weeks, so I am giving the update a try.

## Changes Made

- Add the helper function `_extract_themes_from_overlaps`, based on the discussion in #2275.

## Testing

### How to Test

```
import asyncio
import os

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_openai import AzureOpenAIEmbeddings, AzureChatOpenAI
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.testset import TestsetGenerator
from ragas.testset.synthesizers.multi_hop.specific import (
    MultiHopSpecificQuerySynthesizer,
)

# Load every Markdown file under ./data/ as a plain-text document
loader = DirectoryLoader("./data/", glob="**/*.md", loader_cls=TextLoader)
documents = loader.load()

# Set your Azure OpenAI credentials
AZURE_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT", "")
AZURE_API_KEY = os.environ.get("AZURE_OPENAI_API_KEY", "")
AZURE_API_VERSION = "2024-12-01-preview"
AZURE_DEPLOYMENT = os.environ.get("AZURE_OPENAI_DEPLOYMENT", "gpt-4o-mini")
EMBEDDING_DEPLOYMENT = "text-embedding-ada-002"

# Initialize embeddings
embeddings = AzureOpenAIEmbeddings(
    azure_endpoint=AZURE_ENDPOINT,
    api_key=AZURE_API_KEY,
    api_version=AZURE_API_VERSION,
    azure_deployment=EMBEDDING_DEPLOYMENT,
)

# Initialize LLM with JSON mode enabled
llm = AzureChatOpenAI(
    azure_endpoint=AZURE_ENDPOINT,
    api_key=AZURE_API_KEY,
    openai_api_version=AZURE_API_VERSION,
    azure_deployment=AZURE_DEPLOYMENT,
    temperature=0.3,
    model_kwargs={
        "response_format": {"type": "json_object"}  # Force clean JSON output
    },
)

# Wrap the LangChain objects so Ragas can use them
generator_llm = LangchainLLMWrapper(llm)
generator_embeddings = LangchainEmbeddingsWrapper(embeddings)

# Use only the multi-hop specific synthesizer, which this fix touches
distribution = [
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 1),
]

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)


async def generate():
    # generate testset
    testset = generator.generate_with_langchain_docs(
        documents,
        testset_size=4,
        query_distribution=distribution,
    )
    testset.to_evaluation_dataset().to_jsonl("testset.jsonl")


asyncio.run(generate())
```

## References

- Related issues: #2275

## Screenshots/Examples (if applicable)

Sample generation in test:

<img width="1701" height="111" alt="image" src="https://github.com/user-attachments/assets/890d14bc-cd31-4940-8ce3-df8bc5ea459b" />

---

Co-authored-by: Kenzo Yan <kenzo.yan@granlund.fi>
1 parent a54b1c9 commit b634a5c
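
The validation error quoted in the commit message can be traced with a small stdlib-only sketch. `old_themes` copies the expression this commit removes; `non_string_themes` is a hypothetical stand-in for the check a `List[str]` field performs, not part of Ragas or pydantic. The input shape is taken from the `input_value` in the error message.

```python
from typing import Any, List


def old_themes(overlapped_items: Any) -> Any:
    """Pre-fix logic, copied from the lines this commit removes."""
    return (
        list(overlapped_items.keys())
        if isinstance(overlapped_items, dict)
        else overlapped_items
    )


def non_string_themes(themes: List[Any]) -> List[Any]:
    """Hypothetical check: entries that a List[str] field would reject."""
    return [theme for theme in themes if not isinstance(theme, str)]


# A single overlap entry shaped like the one in the error message
overlaps = [["Vedenjäähdytyskoneen", "Vedenjäähdytyskone"]]

themes = old_themes(overlaps)
rejected = non_string_themes(themes)

# The pair is passed through unchanged, so themes[0] is a list, not a string
print(rejected)
```

Because the old expression only unwraps dicts, any list of pairs reaches `ThemesPersonasInput` as nested sequences, which matches the `themes.0 ... input_type=list` report.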

1 file changed: +32 -9 lines changed

src/ragas/testset/synthesizers/multi_hop/specific.py

Lines changed: 32 additions & 9 deletions
@@ -87,11 +87,8 @@ async def _generate_scenarios(
                     ):
                         logger.debug("Overlapped items are not strings or iterables.")
                         continue
-                    themes = (
-                        list(overlapped_items.keys())
-                        if isinstance(overlapped_items, dict)
-                        else overlapped_items
-                    )
+                    themes = self._extract_themes_from_overlaps(overlapped_items)
+
                     prompt_input = ThemesPersonasInput(
                         themes=themes, personas=persona_list
                     )
@@ -100,10 +97,9 @@ async def _generate_scenarios(
                             data=prompt_input, llm=self.llm, callbacks=callbacks
                         )
                     )
-                    combinations = [
-                        [item] if isinstance(item, str) else list(item)
-                        for item in themes
-                    ]
+
+                    combinations = [[theme] for theme in themes]
+
                     base_scenarios = self.prepare_combinations(
                         [node_a, node_b],
                         combinations,
@@ -117,3 +113,30 @@ async def _generate_scenarios(
                     scenarios.extend(base_scenarios)
 
         return scenarios
+
+    def _extract_themes_from_overlaps(self, overlapped_items: t.Any) -> t.List[str]:
+        """
+        Extract unique entity names from overlapped items.
+
+        Handles multiple formats:
+        - List[Tuple[str, str]]: Entity pairs from overlap detection
+        - List[str]: Direct entity names
+        - Dict[str, Any]: Keys as entity names
+        """
+        if isinstance(overlapped_items, dict):
+            return list(overlapped_items.keys())
+
+        if not isinstance(overlapped_items, list):
+            return []
+
+        unique_entities = set()
+        for item in overlapped_items:
+            if isinstance(item, tuple):
+                # Extract both entities from the pair
+                for entity in item:
+                    if isinstance(entity, str):
+                        unique_entities.add(entity)
+            elif isinstance(item, str):
+                unique_entities.add(item)
+
+        return list(unique_entities)

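The behavior of the new helper on each input format listed in its docstring can be checked in isolation. The sketch below copies the method body from the diff into a free function (the `self` parameter is dropped and the leading underscore removed for illustration); the sample entity names are made up. Results are sorted before printing because the helper builds its list from a `set`, so element order is not guaranteed.

```python
import typing as t


def extract_themes_from_overlaps(overlapped_items: t.Any) -> t.List[str]:
    """Standalone copy of the helper added in this commit (self removed)."""
    if isinstance(overlapped_items, dict):
        return list(overlapped_items.keys())

    if not isinstance(overlapped_items, list):
        return []

    unique_entities = set()
    for item in overlapped_items:
        if isinstance(item, tuple):
            # Extract both entities from the pair
            for entity in item:
                if isinstance(entity, str):
                    unique_entities.add(entity)
        elif isinstance(item, str):
            unique_entities.add(item)

    return list(unique_entities)


# List[Tuple[str, str]]: pairs are flattened and de-duplicated
pairs = [("Paris", "paris"), ("Paris", "France")]
print(sorted(extract_themes_from_overlaps(pairs)))

# List[str]: duplicates collapse
print(sorted(extract_themes_from_overlaps(["b", "a", "a"])))

# Dict[str, Any]: keys are returned in insertion order
print(extract_themes_from_overlaps({"x": 1, "y": 2}))

# Pairs stored as lists rather than tuples match neither branch and are skipped
print(extract_themes_from_overlaps([["a", "b"]]))

# Anything that is neither a dict nor a list yields no themes
print(extract_themes_from_overlaps(None))
```

Two properties of the helper as written are worth keeping in mind. First, pairs that arrive as lists instead of tuples are dropped, since only `tuple` and `str` items are inspected. Second, the output order of the list branch can differ between runs; if stable ordering matters for reproducible test sets, an order-preserving de-duplication such as `list(dict.fromkeys(...))` is one possible alternative.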