Description
Checked other resources
- This is a feature request, not a bug report or usage question.
- I added a clear and descriptive title that summarizes the feature request.
- I used the GitHub search to find a similar feature request and didn't find it.
- I checked the LangChain documentation and API reference to see if this feature already exists.
- This is not related to the langchain-community package.
Feature Description
🚀 Feature Request: Support for Custom Metadata Hydrators in Text Splitters
Summary
Currently, LangChain’s text splitters (CharacterTextSplitter, RecursiveCharacterTextSplitter, etc.) split documents into chunks and copy the parent document’s metadata (such as source) onto each chunk.
However, there is no way to dynamically inject or enrich metadata during the splitting process, for example by adding custom fields such as chunk_number, section_title, or context-aware data derived from the splitting logic.
I propose adding support for a custom Metadata Hydrator that can modify or extend document metadata during the split operation.
Motivation
When building Retrieval-Augmented Generation (RAG) pipelines or complex document indexing workflows, it’s often necessary to enrich metadata with contextual information about each chunk.
Examples include:
- Tracking chunk_number per document
- Adding the name of the section or heading that the chunk originated from
- Including dynamic tokens like the splitter’s position or semantic markers
Right now, this requires post-processing the split results manually, which is inefficient and error-prone.
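For reference, the manual post-processing step looks roughly like the sketch below. It is a minimal illustration, not LangChain code: `Doc` is a stand-in for LangChain's `Document` (a `page_content` string plus a `metadata` dict), and `number_chunks` is a hypothetical helper name. It assumes every chunk carries a `source` key that uniquely identifies its parent document.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def number_chunks(chunks):
    """Add a 1-based chunk_number to each chunk, counted per source."""
    counters = defaultdict(int)
    for chunk in chunks:
        # Assumption: "source" uniquely identifies the parent document.
        source = chunk.metadata.get("source", "")
        counters[source] += 1
        chunk.metadata["chunk_number"] = counters[source]
    return chunks


# Pretend these came back from split_documents().
chunks = [
    Doc("first part", {"source": "a.txt"}),
    Doc("second part", {"source": "a.txt"}),
    Doc("only part", {"source": "b.txt"}),
]
numbered = number_chunks(chunks)
```

The fragile part is that the loop has to reconstruct which parent each chunk belongs to from the metadata alone. If two documents share a `source` value, or have none, the numbering silently runs together.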
Proposed Solution
Introduce an optional BaseMetadataHydrator interface and a method such as .set_metadata_hydrator() on all TextSplitter classes.
Example API:
# Proposed API: BaseMetadataHydrator and set_metadata_hydrator() do not exist yet.
from langchain_text_splitters import CharacterTextSplitter, BaseMetadataHydrator

class MetaDataCustomHydrator(BaseMetadataHydrator):
    def __init__(self, uploaded_by: str = "user123"):
        super().__init__()
        self.uploaded_by = uploaded_by

    def metadata_hydrate(self, doc_metadata: dict, splitter_context: dict, chunk_content: str) -> dict:
        # splitter_context is a mapping supplied by the splitter, e.g. {"chunk_number": 3}
        doc_metadata["chunk_number"] = splitter_context["chunk_number"]
        doc_metadata["uploaded_by"] = self.uploaded_by
        return doc_metadata

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
text_splitter.set_metadata_hydrator(MetaDataCustomHydrator())
texts = text_splitter.split_documents(documents)  # documents: Document objects loaded earlier
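To make the intended call flow concrete, here is a self-contained sketch of how a splitter might invoke the hook once per chunk. Everything in it is an assumption made for illustration: `Doc`, `FixedSizeSplitter`, `ChunkNumberHydrator`, and the `total_chunks` context key are hypothetical, and this `BaseMetadataHydrator` is a local mock of the proposed interface, not an existing LangChain class.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class BaseMetadataHydrator:
    """Hypothetical base class mirroring the proposed interface."""

    def metadata_hydrate(self, doc_metadata, splitter_context, chunk_content):
        return doc_metadata  # default: leave metadata unchanged


class ChunkNumberHydrator(BaseMetadataHydrator):
    def metadata_hydrate(self, doc_metadata, splitter_context, chunk_content):
        doc_metadata["chunk_number"] = splitter_context["chunk_number"]
        return doc_metadata


class FixedSizeSplitter:
    """Toy splitter that cuts text every chunk_size characters."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self._hydrator = None

    def set_metadata_hydrator(self, hydrator):
        self._hydrator = hydrator

    def split_documents(self, documents):
        chunks = []
        for doc in documents:
            text = doc.page_content
            pieces = [text[i:i + self.chunk_size]
                      for i in range(0, len(text), self.chunk_size)]
            for number, piece in enumerate(pieces, start=1):
                metadata = dict(doc.metadata)  # copy so chunks do not share state
                if self._hydrator is not None:
                    context = {"chunk_number": number, "total_chunks": len(pieces)}
                    metadata = self._hydrator.metadata_hydrate(metadata, context, piece)
                chunks.append(Doc(piece, metadata))
        return chunks


splitter = FixedSizeSplitter(chunk_size=4)
splitter.set_metadata_hydrator(ChunkNumberHydrator())
result = splitter.split_documents([Doc("abcdefghij", {"source": "a.txt"})])
```

Because the splitter already knows which parent document and which position each chunk came from, it can pass that along in the context, and the hydrator never has to infer it afterwards. Copying the parent metadata before calling the hook keeps one chunk's changes from leaking into its siblings.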
### Use Case
## 🧩 Use Case: Dynamic Metadata Enrichment During Text Splitting
Imagine building a **compliance knowledge base** where legal teams upload multi-page policies, reports, and contracts.
When splitting these long documents into smaller chunks for retrieval or indexing, each chunk needs detailed metadata for accurate search, traceability, and auditability.
### Example Desired Metadata per Chunk
```json
{
  "source": "ISO27001-AnnexA.pdf",
  "section": "A.9.2 User Access Management",
  "chunk_number": 3,
  "page_start": 12,
  "page_end": 13
}
```
### Proposed Solution
_No response_
### Alternatives Considered
_No response_
### Additional Context
_No response_