fix: Add List[List[str]] format for overlapped items in theme extraction (Continuation of #2347) (#2355)

kenzoyan · Kenzo Yan · web-flow · commit 7a0db3a509bd · 2025-10-13T01:36:02.000+05:30
## Issue Link / Problem Description

- I updated `_extract_themes_from_overlaps` in #2347 a few days ago, tested it on a small document set, and generated examples successfully. When I used a knowledge graph built from a large document set at work, generation got stuck in scenario generation because empty themes were fed to the LLM.

<img width="700" height="500" alt="image" src="https://github.com/user-attachments/assets/44a13723-3318-48bf-93bb-ff9633e59a6c" />

- This is a follow-up to #2347.
- Fixes #2347 - `MultiHopSpecificQuerySynthesizer` hangs indefinitely during scenario generation when overlapped items are returned as lists instead of tuples.
- The `_extract_themes_from_overlaps` method only handled tuples and strings, but overlapped entity pairs can also be returned as lists. In that case no themes were extracted and the generation process stalled.
- **`overlapped_items` is in `List[List]` format in my case.** I am not sure which code or feature decides the format, but once this format is taken into consideration, the method works for my case.

<img width="1072" height="96" alt="image" src="https://github.com/user-attachments/assets/95a81826-bd89-4aca-843a-e9fad9fafd44" />

## Changes Made

- Modified `_extract_themes_from_overlaps` in `src/ragas/testset/synthesizers/multi_hop/specific.py` to handle the `list` type in addition to `tuple` and `str`.
- Changed the condition from `isinstance(item, tuple)` to `isinstance(item, (tuple, list))` so that entities are extracted from list-formatted overlapped items.

## Testing

```
import asyncio
import os

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_openai import AzureOpenAIEmbeddings, AzureChatOpenAI
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.testset import TestsetGenerator
from ragas.testset.graph import KnowledgeGraph
from ragas.testset.synthesizers import (
    MultiHopSpecificQuerySynthesizer,
    SingleHopSpecificQuerySynthesizer,
)

# Set your Azure OpenAI credentials (values were left blank in the original)
AZURE_ENDPOINT = "..."
AZURE_API_KEY = "..."
AZURE_API_VERSION = "..."
AZURE_DEPLOYMENT = "..."
EMBEDDING_DEPLOYMENT = "..."

# Initialize embeddings
embeddings = AzureOpenAIEmbeddings(
    azure_endpoint=AZURE_ENDPOINT,
    api_key=AZURE_API_KEY,
    api_version=AZURE_API_VERSION,
    azure_deployment=EMBEDDING_DEPLOYMENT,
)

# Initialize LLM (JSON mode is left commented out)
llm = AzureChatOpenAI(
    azure_endpoint=AZURE_ENDPOINT,
    api_key=AZURE_API_KEY,
    openai_api_version=AZURE_API_VERSION,
    azure_deployment=AZURE_DEPLOYMENT,
    temperature=0.3,
    timeout=60,
    # model_kwargs={
    #     "response_format": {"type": "json_object"}  # Force clean JSON output
    # }
)

generator_llm = LangchainLLMWrapper(llm)
generator_embeddings = LangchainEmbeddingsWrapper(embeddings)

# Only the multi-hop synthesizer is exercised, since that is where the hang occurred
distribution = [
    # (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 1.0),
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 1),
]
# print(distribution)

# Load the previously built knowledge graph
kg = KnowledgeGraph.load("XXXX.json")
print(f"Knowledge graph loaded with {len(kg.nodes)} entities and {len(kg.relationships)} relations.")

generator = TestsetGenerator(
    llm=generator_llm,
    embedding_model=generator_embeddings,
    knowledge_graph=kg,
)


async def generate():
    # generate testset
    testset = generator.generate(
        testset_size=3,
        query_distribution=distribution,
        with_debugging_logs=True,
        raise_exceptions=True,
    )
    print(testset)
    testset.to_pandas().to_csv("test_theme.csv", index=False)


asyncio.run(generate())
```

## References

- Related issues:
- Documentation:
- External references:

## Screenshots/Examples (if applicable)

<img width="3127" height="320" alt="image" src="https://github.com/user-attachments/assets/fc4836d6-4254-4882-b7c7-2c8489ab54bc" />

<img width="1080" height="277" alt="image" src="https://github.com/user-attachments/assets/f2f557bb-90fa-4bfd-b2bf-5354bd4f478d" />

---

Co-authored-by: Kenzo Yan <kenzo.yan@granlund.fi>
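
For reviewers who want to check the behavior without an LLM, the patched branching can be sketched in isolation. This is a minimal sketch, not the ragas method itself: `extract_themes` is a hypothetical standalone function, and the sorted return order is an assumption made only to keep the output deterministic. The description does not say where the `List[List]` format comes from. One possible explanation, offered here only as an assumption, is that the knowledge graph was loaded from JSON, which has no tuple type, so Python tuples come back as lists after a round trip.

```python
import json
import typing as t


def extract_themes(overlapped_items: t.Any) -> t.List[str]:
    """Hypothetical standalone mirror of the patched check; not the ragas method."""
    if not overlapped_items:
        return []

    unique_entities: t.Set[str] = set()
    # Iterating a dict yields its keys, which covers the Dict[str, Any] case.
    for item in overlapped_items:
        if isinstance(item, (tuple, list)):
            # Entity pair, as a tuple or (after this fix) as a list
            for entity in item:
                if isinstance(entity, str):
                    unique_entities.add(entity)
        elif isinstance(item, str):
            unique_entities.add(item)

    # Sorting is an assumption of this sketch, used for a stable result.
    return sorted(unique_entities)


# Tuples serialized to JSON are read back as lists.
pairs = [("Helsinki", "Espoo"), ("Espoo", "Vantaa")]
round_tripped = json.loads(json.dumps(pairs))

print(round_tripped)                  # [['Helsinki', 'Espoo'], ['Espoo', 'Vantaa']]
print(extract_themes(round_tripped))  # ['Espoo', 'Helsinki', 'Vantaa']
```

With the old `isinstance(item, tuple)` check, each inner list in `round_tripped` would match neither the tuple branch nor the string branch, so the result would be empty. That is consistent with the empty themes reported above.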
diff --git a/src/ragas/testset/synthesizers/multi_hop/specific.py b/src/ragas/testset/synthesizers/multi_hop/specific.py
@@ -120,6 +120,7 @@ def _extract_themes_from_overlaps(self, overlapped_items: t.Any) -> t.List[str]:
 
         Handles multiple formats:
         - List[Tuple[str, str]]: Entity pairs from overlap detection
+        - List[List[str]]: Entity pairs as lists
         - List[str]: Direct entity names
         - Dict[str, Any]: Keys as entity names
         """
@@ -131,7 +132,7 @@ def _extract_themes_from_overlaps(self, overlapped_items: t.Any) -> t.List[str]:
 
         unique_entities = set()
         for item in overlapped_items:
-            if isinstance(item, tuple):
+            if isinstance(item, (tuple, list)):
                 # Extract both entities from the pair
                 for entity in item:
                     if isinstance(entity, str):