Support for Custom Metadata Hydrators in Text Splitters #33898

@adeptofvoltron


Checked other resources

  • This is a feature request, not a bug report or usage question.
  • I added a clear and descriptive title that summarizes the feature request.
  • I used the GitHub search to find a similar feature request and didn't find it.
  • I checked the LangChain documentation and API reference to see if this feature already exists.
  • This is not related to the langchain-community package.

Feature Description

🚀 Feature Request: Support for Custom Metadata Hydrators in Text Splitters

Summary

Currently, LangChain’s text splitters (CharacterTextSplitter, RecursiveCharacterTextSplitter, etc.) split documents into chunks and copy the source document's metadata (such as source) onto every chunk.
However, there is no way to dynamically inject or enrich metadata during the splitting process, for example by adding custom fields such as chunk_number, section_title, or context-aware data derived from the splitting logic.

I propose adding support for a custom Metadata Hydrator that can modify or extend document metadata during the split operation.


Motivation

When building Retrieval-Augmented Generation (RAG) pipelines or complex document indexing workflows, it’s often necessary to enrich metadata with contextual information about each chunk.
Examples include:

  • Tracking chunk_number per document
  • Adding the name of the section or heading that the chunk originated from
  • Including dynamic values such as the chunk's position in the source document or semantic markers

Right now, this requires post-processing the split results manually, which is inefficient and error-prone.
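For reference, the manual post-processing step looks roughly like the sketch below. It is only an illustration: `Chunk` is a hypothetical stand-in for a LangChain Document (an object with `page_content` and a `metadata` dict), and `number_chunks` is a helper name made up for this example, not an existing LangChain function.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def number_chunks(chunks):
    """Add a 0-based chunk_number to each chunk, counted separately per source."""
    counters = defaultdict(int)
    for chunk in chunks:
        source = chunk.metadata.get("source")
        chunk.metadata["chunk_number"] = counters[source]
        counters[source] += 1
    return chunks


# Output of a splitter, flattened across two source files.
chunks = [
    Chunk("first part", {"source": "x.pdf"}),
    Chunk("second part", {"source": "x.pdf"}),
    Chunk("other file", {"source": "y.pdf"}),
]
numbered = number_chunks(chunks)
```

This works, but the numbering has to be re-derived from the output order, and every pipeline ends up carrying its own copy of such a helper.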


Proposed Solution

Introduce an optional BaseMetadataHydrator interface and a method such as .set_metadata_hydrator() on all TextSplitter classes.

Example API:

# Proposed API: BaseMetadataHydrator and set_metadata_hydrator() do not exist yet.
from langchain_text_splitters import CharacterTextSplitter, BaseMetadataHydrator

class MetaDataCustomHydrator(BaseMetadataHydrator):
    def __init__(self, uploaded_by: str = "user123"):
        super().__init__()
        self.uploaded_by = uploaded_by

    def metadata_hydrate(self, doc_metadata: dict, splitter_context: dict, chunk_content: str) -> dict:
        # splitter_context is a dict filled in by the splitter (e.g. the current chunk's index).
        doc_metadata["chunk_number"] = splitter_context["chunk_number"]
        doc_metadata["uploaded_by"] = self.uploaded_by
        return doc_metadata

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
text_splitter.set_metadata_hydrator(MetaDataCustomHydrator())

# `documents` is a list of Document objects loaded elsewhere.
texts = text_splitter.split_documents(documents)
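To make the intended call flow concrete, here is a minimal, self-contained sketch of how a splitter could invoke such a hydrator. Everything in it is an assumption for illustration: `Doc`, `ToySplitter`, and the contents of `splitter_context` are hypothetical, and the real LangChain splitters may need a different integration point.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class BaseMetadataHydrator(ABC):
    """Proposed interface: receives per-chunk context, returns the chunk's metadata."""

    @abstractmethod
    def metadata_hydrate(self, doc_metadata: dict, splitter_context: dict, chunk_content: str) -> dict:
        ...


class ToySplitter:
    """Fixed-width splitter, used only to show where the hook would be called."""

    def __init__(self, chunk_size: int):
        self.chunk_size = chunk_size
        self._hydrator = None

    def set_metadata_hydrator(self, hydrator: BaseMetadataHydrator) -> None:
        self._hydrator = hydrator

    def split_documents(self, documents):
        result = []
        for doc in documents:
            text = doc.page_content
            for number, start in enumerate(range(0, len(text), self.chunk_size)):
                piece = text[start:start + self.chunk_size]
                metadata = dict(doc.metadata)  # copy so chunks do not share one dict
                if self._hydrator is not None:
                    context = {"chunk_number": number, "start_index": start}
                    metadata = self._hydrator.metadata_hydrate(metadata, context, piece)
                result.append(Doc(piece, metadata))
        return result


class ChunkNumberHydrator(BaseMetadataHydrator):
    def metadata_hydrate(self, doc_metadata, splitter_context, chunk_content):
        doc_metadata["chunk_number"] = splitter_context["chunk_number"]
        return doc_metadata


splitter = ToySplitter(chunk_size=4)
splitter.set_metadata_hydrator(ChunkNumberHydrator())
result = splitter.split_documents([Doc("abcdefghij", {"source": "demo.txt"})])
```

Passing a fresh copy of the parent metadata to the hydrator for each chunk keeps one chunk's changes from leaking into its siblings, which seems like a sensible default for the real implementation as well.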


### Use Case

#### 🧩 Use Case: Dynamic Metadata Enrichment During Text Splitting

Imagine building a **compliance knowledge base** where legal teams upload multi-page policies, reports, and contracts.  
When splitting these long documents into smaller chunks for retrieval or indexing, each chunk needs detailed metadata for accurate search, traceability, and auditability.
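As a rough sketch of the logic a hydrator could carry for this scenario, the snippet below labels each chunk with the most recent clause heading seen so far. The heading pattern and the `assign_sections` helper are assumptions made for this example; real policy documents would need their own heading detection.

```python
import re

# Assumed heading format: a clause number such as "A.9.2" followed by a title, alone on a line.
HEADING = re.compile(r"^(A\.\d+(?:\.\d+)*[ \t]+.+)$", re.MULTILINE)


def assign_sections(chunk_texts, source):
    """Build one metadata dict per chunk, carrying the latest heading forward."""
    metadata = []
    current_section = None
    for number, text in enumerate(chunk_texts):
        found = HEADING.findall(text)
        if found:
            # A heading inside this chunk applies to it and to the chunks that follow.
            current_section = found[-1].strip()
        metadata.append(
            {"source": source, "section": current_section, "chunk_number": number}
        )
    return metadata


chunk_texts = [
    "A.9.1 Business Requirements\nAccess shall be limited.",
    "A.9.2 User Access Management\nUsers are registered.",
    "Access rights are reviewed regularly.",
]
sections = assign_sections(chunk_texts, "ISO27001-AnnexA.pdf")
```

The third chunk contains no heading of its own, so it inherits the section of the chunk before it. That running state is awkward to reconstruct after splitting but easy to keep inside a hydrator.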

#### Example Desired Metadata per Chunk

```json
{
  "source": "ISO27001-AnnexA.pdf",
  "section": "A.9.2 User Access Management",
  "chunk_number": 3,
  "page_start": 12,
  "page_end": 13
}
```


    Labels

    feature request (request for an enhancement / additional functionality), text-splitters (related to the package `text-splitters`)