Commit 7a0db3a

kenzoyan and Kenzo Yan authored
fix: Add List[List[str]] formats for overlapped items in theme extraction (Continuation in #2347) (#2355)
## Issue Link / Problem Description

- I updated `_extract_themes_from_overlaps` in #2347 a few days ago, then tested it on a small document set and generated examples successfully. When I used a knowledge graph built from a large document set at work for generation, it got stuck in scenario generation because it fed empty themes to the LLM.

<img width="700" height="500" alt="image" src="https://github.com/user-attachments/assets/44a13723-3318-48bf-93bb-ff9633e59a6c" />

- So this is an update to #2347.
- `MultiHopSpecificQuerySynthesizer` hangs indefinitely during scenario generation when overlapped items are returned as lists instead of tuples.
- The `_extract_themes_from_overlaps` method only handled tuples and strings, but overlapped entity pairs can also be returned as lists. In that case no themes are extracted and the generation process stalls.
- **`overlapped_items` is in `List[List]` format in my case.** I am not sure which code or feature decides the format, but once this format is taken into consideration, my case works.

<img width="1072" height="96" alt="image" src="https://github.com/user-attachments/assets/95a81826-bd89-4aca-843a-e9fad9fafd44" />

## Changes Made

- Modified `_extract_themes_from_overlaps` in `src/ragas/testset/synthesizers/multi_hop/specific.py` to handle the `list` type in addition to `tuple` and `str`.
- Changed the condition from `isinstance(item, tuple)` to `isinstance(item, (tuple, list))` so that entities are extracted from list-formatted overlapped items.

## Testing

```
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset import TestsetGenerator
from ragas.testset.graph import KnowledgeGraph
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_openai import AzureOpenAIEmbeddings, AzureChatOpenAI
from ragas.testset.synthesizers import (
    MultiHopSpecificQuerySynthesizer,
    SingleHopSpecificQuerySynthesizer,
)
import asyncio
import os

# Set your Azure OpenAI credentials (values omitted here; fill in your own)
AZURE_ENDPOINT = ""
AZURE_API_KEY = ""
AZURE_API_VERSION = ""
AZURE_DEPLOYMENT = ""
EMBEDDING_DEPLOYMENT = ""

# Initialize embeddings
embeddings = AzureOpenAIEmbeddings(
    azure_endpoint=AZURE_ENDPOINT,
    api_key=AZURE_API_KEY,
    api_version=AZURE_API_VERSION,
    azure_deployment=EMBEDDING_DEPLOYMENT,
)

# Initialize LLM (JSON mode is left commented out)
llm = AzureChatOpenAI(
    azure_endpoint=AZURE_ENDPOINT,
    api_key=AZURE_API_KEY,
    openai_api_version=AZURE_API_VERSION,
    azure_deployment=AZURE_DEPLOYMENT,
    temperature=0.3,
    timeout=60,
    # model_kwargs={
    #     "response_format": {"type": "json_object"}  # Force clean JSON output
    # }
)

generator_llm = LangchainLLMWrapper(llm)
generator_embeddings = LangchainEmbeddingsWrapper(embeddings)

# Only the multi-hop synthesizer is exercised in this test
distribution = [
    # (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 1.0),
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 1),
]
# print(distribution)

# Load the previously built knowledge graph
kg = KnowledgeGraph.load("XXXX.json")
print(f"Knowledge graph loaded with {len(kg.nodes)} entities and {len(kg.relationships)} relations.")

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings, knowledge_graph=kg)


async def generate():
    # generate testset
    testset = generator.generate(
        testset_size=3,
        query_distribution=distribution,
        with_debugging_logs=True,
        raise_exceptions=True,
    )
    print(testset)
    testset.to_pandas().to_csv("test_theme.csv", index=False)


asyncio.run(generate())
```

## References

- Related issues:
- Documentation:
- External references:

## Screenshots/Examples (if applicable)

<img width="3127" height="320" alt="image" src="https://github.com/user-attachments/assets/fc4836d6-4254-4882-b7c7-2c8489ab54bc" />

<img width="1080" height="277" alt="image" src="https://github.com/user-attachments/assets/f2f557bb-90fa-4bfd-b2bf-5354bd4f478d" />

---

Co-authored-by: Kenzo Yan <kenzo.yan@granlund.fi>
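The commit message leaves open where the list format comes from. One possible explanation, offered only as a hypothesis, is a JSON round trip: JSON has no tuple type, so Python's `json` module writes tuples as arrays and reads them back as lists. Whether this is what happens when a Ragas knowledge graph is saved and reloaded is an assumption that this commit does not confirm. The entity names below are made up for illustration.

```python
import json

# Hypothetical entity pairs, stored as tuples before serialization.
pairs = [("Helsinki", "Finland"), ("Espoo", "Finland")]

# Serialize to JSON text and parse it back, as a save/load cycle would.
restored = json.loads(json.dumps(pairs))

# The values survive, but every pair is now a list, not a tuple.
print(restored)
print(type(restored[0]).__name__)
```

A check such as `isinstance(item, tuple)` would therefore pass for `pairs` but fail for every element of `restored`.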
1 parent cc59f6a commit 7a0db3a

1 file changed: +2 -1 lines changed


src/ragas/testset/synthesizers/multi_hop/specific.py

Lines changed: 2 additions & 1 deletion
@@ -120,6 +120,7 @@ def _extract_themes_from_overlaps(self, overlapped_items: t.Any) -> t.List[str]:
 
         Handles multiple formats:
         - List[Tuple[str, str]]: Entity pairs from overlap detection
+        - List[List[str]]: Entity pairs as lists
         - List[str]: Direct entity names
         - Dict[str, Any]: Keys as entity names
         """
@@ -131,7 +132,7 @@ def _extract_themes_from_overlaps(self, overlapped_items: t.Any) -> t.List[str]:
 
         unique_entities = set()
         for item in overlapped_items:
-            if isinstance(item, tuple):
+            if isinstance(item, (tuple, list)):
                 # Extract both entities from the pair
                 for entity in item:
                     if isinstance(entity, str):
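To make the effect of the one-line change concrete, here is a minimal standalone sketch of the extraction logic. It is not the Ragas implementation: the function name `extract_themes`, the `pair_types` parameter, and the sorted return value are all assumptions introduced for illustration, and only the four input formats listed in the docstring above are covered.

```python
from typing import Any, List, Tuple


def extract_themes(
    overlapped_items: Any,
    pair_types: Tuple[type, ...] = (tuple, list),
) -> List[str]:
    """Collect unique entity names from overlapped items (illustrative sketch).

    `pair_types` is a hypothetical knob used here only to compare the old
    check, (tuple,), with the new one, (tuple, list).
    """
    if not overlapped_items:
        return []

    # For a dict, the keys are treated as the entity names.
    if isinstance(overlapped_items, dict):
        overlapped_items = list(overlapped_items.keys())

    unique_entities = set()
    for item in overlapped_items:
        if isinstance(item, pair_types):
            # Extract every string entity from the pair.
            for entity in item:
                if isinstance(entity, str):
                    unique_entities.add(entity)
        elif isinstance(item, str):
            unique_entities.add(item)

    # Sorting is a choice made here for deterministic output.
    return sorted(unique_entities)


list_pairs = [["A", "B"], ["B", "C"]]
old_result = extract_themes(list_pairs, pair_types=(tuple,))
new_result = extract_themes(list_pairs)
print(old_result, new_result)
```

With the tuple-only check, a list pair is neither a tuple nor a string, so nothing is collected and the result is empty. That matches the empty themes described in the commit message. Accepting both `tuple` and `list` recovers the three entities.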
