Support for Custom Metadata Hydrators in Text Splitters

### Checked other resources

- [x] This is a feature request, not a bug report or usage question.
- [x] I added a clear and descriptive title that summarizes the feature request.
- [x] I used the GitHub search to find a similar feature request and didn't find it.
- [x] I checked the LangChain documentation and API reference to see if this feature already exists.
- [x] This is not related to the langchain-community package.

### Feature Description

## 🚀 Feature Request: Support for Custom Metadata Hydrators in Text Splitters

### Summary

Currently, LangChain’s text splitters (`CharacterTextSplitter`, `RecursiveCharacterTextSplitter`, etc.) split documents into chunks and copy the parent document's metadata (such as `source`) onto each chunk.  
However, beyond the built-in `add_start_index` option, there is no way to dynamically **inject or enrich metadata** during the splitting process, for example by adding custom fields such as `chunk_number`, `section_title`, or context-aware data derived from the splitting logic.

I propose adding support for a **custom Metadata Hydrator** that can modify or extend document metadata during the split operation.

---

### Motivation

When building Retrieval-Augmented Generation (RAG) pipelines or complex document indexing workflows, it’s often necessary to enrich metadata with contextual information about each chunk.  
Examples include:

- Tracking `chunk_number` per document  
- Adding the name of the section or heading that the chunk originated from  
- Including positional information, such as the chunk's offset within the source text, or semantic markers  

Right now, this requires post-processing the split results manually, which is inefficient and error-prone.
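
For reference, the manual workaround looks roughly like the sketch below. It uses a stand-in `Document` dataclass so that it runs without LangChain installed; with the real library the same loop would run over the list returned by `split_documents`. The helper name `add_chunk_numbers` is hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for LangChain's Document (assumption: only these two fields matter here)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def add_chunk_numbers(chunks):
    """Number chunks per source document, in the order they appear."""
    counters = defaultdict(int)
    for chunk in chunks:
        source = chunk.metadata.get("source")
        counters[source] += 1
        chunk.metadata["chunk_number"] = counters[source]
    return chunks


# Pretend these came back from a splitter.
chunks = [
    Document("first part", {"source": "a.pdf"}),
    Document("second part", {"source": "a.pdf"}),
    Document("only part", {"source": "b.pdf"}),
]
add_chunk_numbers(chunks)
print([c.metadata["chunk_number"] for c in chunks])  # [1, 2, 1]
```

This only works when `source` uniquely identifies the parent document, and anything the splitter knew internally (offsets, the separator that triggered the split) is no longer available at this point.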

---

### Proposed Solution

Introduce an optional `BaseMetadataHydrator` interface and a method such as `.set_metadata_hydrator()` on all `TextSplitter` classes.
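
To make the idea concrete, here is a minimal sketch of where such a hook could run inside a splitter. Everything in it is hypothetical: `ToySplitter` is a fixed-width stand-in rather than a real LangChain class, `Document` is a simplified dataclass, and the keys in `splitter_context` (`chunk_number`, `total_chunks`) are only one possible shape for that argument.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Document:
    """Simplified stand-in for LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class BaseMetadataHydrator(ABC):
    """Proposed interface (hypothetical, not part of LangChain today)."""

    @abstractmethod
    def metadata_hydrate(self, doc_metadata: dict, splitter_context: dict, chunk_content: str) -> dict:
        ...


class ToySplitter:
    """Fixed-width splitter used only to show where the hook is called."""

    def __init__(self, chunk_size: int):
        self.chunk_size = chunk_size
        self._hydrator = None

    def set_metadata_hydrator(self, hydrator: BaseMetadataHydrator) -> None:
        self._hydrator = hydrator

    def split_documents(self, documents):
        chunks = []
        for doc in documents:
            text = doc.page_content
            pieces = [text[i:i + self.chunk_size] for i in range(0, len(text), self.chunk_size)]
            for number, piece in enumerate(pieces, start=1):
                metadata = dict(doc.metadata)  # copy so chunks do not share one dict
                if self._hydrator is not None:
                    context = {"chunk_number": number, "total_chunks": len(pieces)}
                    metadata = self._hydrator.metadata_hydrate(metadata, context, piece)
                chunks.append(Document(piece, metadata))
        return chunks


class ChunkNumberHydrator(BaseMetadataHydrator):
    def metadata_hydrate(self, doc_metadata, splitter_context, chunk_content):
        doc_metadata["chunk_number"] = splitter_context["chunk_number"]
        return doc_metadata


splitter = ToySplitter(chunk_size=4)
splitter.set_metadata_hydrator(ChunkNumberHydrator())
result = splitter.split_documents([Document("abcdefghij", {"source": "demo.txt"})])
print([(c.page_content, c.metadata["chunk_number"]) for c in result])
# [('abcd', 1), ('efgh', 2), ('ij', 3)]
```

Calling the hydrator once per chunk, with a copy of the parent metadata, keeps the hook optional and leaves the existing behavior unchanged when no hydrator is set.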

#### Example API:

```python
# Proposed API: BaseMetadataHydrator and set_metadata_hydrator() do not exist yet.
from langchain_text_splitters import CharacterTextSplitter, BaseMetadataHydrator

class MetaDataCustomHydrator(BaseMetadataHydrator):
    def __init__(self, uploaded_by: str = "user123"):
        super().__init__()
        self.uploaded_by = uploaded_by

    def metadata_hydrate(self, doc_metadata: dict, splitter_context: dict, chunk_content: str) -> dict:
        # splitter_context is a dict supplied by the splitter for the current chunk.
        doc_metadata["chunk_number"] = splitter_context["chunk_number"]
        doc_metadata["uploaded_by"] = self.uploaded_by
        return doc_metadata

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
text_splitter.set_metadata_hydrator(MetaDataCustomHydrator())

# `documents` is a list of Document objects loaded elsewhere.
texts = text_splitter.split_documents(documents)
```

### Use Case

## 🧩 Use Case: Dynamic Metadata Enrichment During Text Splitting

Imagine building a **compliance knowledge base** where legal teams upload multi-page policies, reports, and contracts.  
When splitting these long documents into smaller chunks for retrieval or indexing, each chunk needs detailed metadata for accurate search, traceability, and auditability.

### Example Desired Metadata per Chunk

```json
{
  "source": "ISO27001-AnnexA.pdf",
  "section": "A.9.2 User Access Management",
  "chunk_number": 3,
  "page_start": 12,
  "page_end": 13
}
```


### Proposed Solution

_No response_

### Alternatives Considered

_No response_

### Additional Context

_No response_

Support for Custom Metadata Hydrators in Text Splitters #33898
