“One library to split them all: Sentence, Code, Docs”
Warning
Heads Up! Version 2.0.0 introduces breaking changes. For a smooth transition and detailed information, please consult our Migration Guide.
You might be thinking, 'Can't I just split my text or code with a simple character count or by arbitrary lines?' Well, you certainly could, but let's be frank – that's a bit like trying to perform delicate surgery with a butter knife! Standard splitting methods often lead to:
- Literary Butchery: Sentences chopped mid-thought or code blocks broken mid-function, leading to a loss of crucial meaning.
- Monolingual Approach: A disregard for the unique rules of non-English languages or the specific structures of programming languages.
- A Goldfish's Memory: Forgetting the context of the previous chunk, resulting in disconnected ideas and a less coherent flow.
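To make the first problem concrete, here is a minimal sketch of fixed-width splitting. It is plain Python and not part of Chunklet-py; `naive_split` and the sample text are hypothetical, chosen only for illustration.

```python
def naive_split(text: str, size: int) -> list[str]:
    """Cut text into fixed-width pieces, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


text = "Chunking matters. Context is easily lost when text is cut blindly."
chunks = naive_split(text, 25)

for chunk in chunks:
    print(repr(chunk))
# The first piece is 'Chunking matters. Context': the second sentence
# has been cut off after its first word.
```

No characters are lost, yet the pieces no longer line up with sentences, so each one is harder to embed or retrieve on its own.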
Chunklet-py is a versatile and powerful library designed to intelligently segment various forms of content—from raw text to complex documents and source code—into perfectly sized, context-aware chunks. It goes beyond simple splitting, offering specialized tools:
- Sentence Splitter
- Plain Text Chunker
- Document Chunker
- Code Chunker
Each of these is tailored to preserve the original meaning and structure of your data.
Whether you're preparing data for Large Language Models (LLMs), developing Retrieval-Augmented Generation (RAG) pipelines, or enhancing AI-driven document search, Chunklet-py (version 2.0) provides the precision and flexibility needed for efficient indexing, embedding, and inference across multiple formats and languages.
| Feature | Why it’s great |
|---|---|
| 🚀 Blazingly Fast | Leverages efficient parallel processing to chunk large volumes of content with remarkable speed. |
| 🪶 Featherlight Footprint | Designed to be lightweight and memory-efficient, ensuring optimal performance without unnecessary overhead. |
| 🗂️ Rich Metadata for RAG | Enriches chunks with valuable, context-aware metadata (source, span, document properties, code AST details) crucial for advanced RAG and LLM applications. |
| 🔧 Infinitely Customizable | Offers extensive customization options, from pluggable token counters to custom sentence splitters and processors. |
| 🌐 Multilingual Mastery | Supports over 50 natural languages for text and document chunking with intelligent detection and language-specific algorithms. |
| 🧑‍💻 Code-Aware Intelligence | Language-agnostic code chunking that understands and preserves the structural integrity of your source code. |
| 🎯 Precision Chunking | Flexible constraint-based chunking allows you to combine limits based on sentences, tokens, sections, lines, and functions. |
| 📄 Document Format Mastery | Processes a wide array of document formats including .pdf, .docx, .epub, .txt, .tex, .html, .hml, .md, .rst, and .rtf. |
| 💻 Dual Interface: CLI & Library | Use it as a powerful command-line tool for fast, terminal-based chunking or import it as a library for deep integration into your Python applications. |
And there's even more to discover!
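As a rough illustration of what constraint-based chunking with a pluggable token counter can look like, here is a self-contained sketch. It does not use Chunklet-py's API; `split_sentences`, `chunk_sentences`, `count_words`, and every parameter name are assumptions made for this example, and the sentence splitter is deliberately simplistic.

```python
import re
from typing import Callable


def split_sentences(text: str) -> list[str]:
    """Very rough splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def count_words(text: str) -> int:
    """Stand-in token counter; any callable taking str and returning int works."""
    return len(text.split())


def chunk_sentences(
    sentences: list[str],
    max_sentences: int,
    max_tokens: int,
    token_counter: Callable[[str], int],
    overlap: int = 1,
) -> list[str]:
    """Group whole sentences into chunks limited by sentence and token counts.

    Each new chunk starts with the last `overlap` sentences of the previous
    chunk, unless that carried context would exceed the token limit. A single
    sentence longer than `max_tokens` is still emitted as its own chunk.
    """
    if not 0 <= overlap < max_sentences:
        raise ValueError("overlap must be >= 0 and smaller than max_sentences")

    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = current + [sentence]
        over_limit = (
            len(candidate) > max_sentences
            or token_counter(" ".join(candidate)) > max_tokens
        )
        if current and over_limit:
            chunks.append(" ".join(current))
            carried = current[-overlap:] if overlap > 0 else []
            # Drop the carried context if it would break the token limit.
            if token_counter(" ".join(carried + [sentence])) > max_tokens:
                carried = []
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


sentences = split_sentences("One. Two. Three. Four.")
chunks = chunk_sentences(
    sentences, max_sentences=2, max_tokens=50, token_counter=count_words
)
print(chunks)
```

Passing the counter in as a callable is what makes it pluggable: a model-specific tokenizer can replace `count_words` without touching the grouping logic.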
Note
For full documentation, visit our documentation site.
Ready to get Chunklet-py up and running? Fantastic! This guide will walk you through the installation process, making it as smooth as possible.
The most straightforward method to install Chunklet-py is by using pip:

    # Install and verify version
    pip install chunklet-py
    chunklet --version

And that's all there is to it! You're now ready to start using Chunklet-py.
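If you would rather confirm the installation from Python than from the shell, the standard library can report the installed version of a distribution. This is a small sketch using `importlib.metadata`; the helper name `installed_version` is hypothetical, and the only Chunklet-specific detail is the distribution name `chunklet-py` from the install command.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


print(installed_version("chunklet-py") or "chunklet-py is not installed")
```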
Chunklet-py offers optional dependencies to unlock additional functionalities, such as document processing or code chunking. You can install these extras using the following syntax:
- Document Processing: For handling .pdf, .docx, .epub, and other document formats: `pip install "chunklet-py[document]"`
- Code Chunking: For advanced code analysis and chunking features: `pip install "chunklet-py[code]"`
- All Extras: To install all optional dependencies: `pip install "chunklet-py[all]"`
For those who prefer to build from source, you can clone the repository and install it manually. This method allows for direct modification of the source code and installation of all optional features:

    git clone https://github.com/speedyk-005/chunklet-py.git
    cd chunklet-py
    pip install ".[all]"

But why would you want to do that? The easy way is so much easier.
Interested in helping make Chunklet-py even better? That's fantastic! Before you dive in, please take a moment to review our Contributing Guide. Here's how you can set up your development environment:

    git clone https://github.com/speedyk-005/chunklet-py.git
    cd chunklet-py
    # For basic development (testing, linting)
    pip install -e ".[dev]"
    # For documentation development
    pip install -e ".[docs]"
    # For comprehensive development (including all optional features like document and code chunking + docs dependencies)
    pip install -e ".[dev-all]"

These commands install Chunklet-py in "editable" mode, ensuring that any changes you make to the source code are immediately reflected. The [dev], [docs], and [dev-all] options include the necessary dependencies for specific development tasks.
Now, go forth and code! And remember, good developers always write tests. (Even in a Python project, we appreciate all forms of excellent code examples!)
- CLI interface
- Document chunking with metadata
- Code chunking based on interest points
- Visualization for chunks (e.g., highlighting spans in original documents)
- Extend the supported file formats:
  - Support for odt and eml files
  - Support for tabular formats: csv, excel, ...
While there are other chunking libraries available, Chunklet-py stands out for its unique combination of versatility, performance, and ease of use. Here's a quick look at how it compares to some of the alternatives:
| Library | Key Differentiator | Focus |
|---|---|---|
| chunklet-py | All-in-one, lightweight, and language-agnostic with specialized algorithms. | Text, Code, Docs |
| CintraAI Code Chunker | Relies on tree-sitter, which can add setup complexity. | Code |
| Chonkie | A feature-rich pipeline tool with cloud/vector integrations, but uses a more basic sentence splitter and tree-sitter for code. | Pipelines, Integrations |
| code_chunker (JimAiMoment) | Uses basic regex and rules with limited language support. | Code |
| Semchunk | Primarily for text, using a general-purpose sentence splitter. | Text |
Chunklet-py's rule-based, language-agnostic approach to code chunking avoids the need for heavy dependencies like tree-sitter, which can sometimes introduce compatibility issues. For sentence splitting, it uses specialized libraries and algorithms for higher accuracy, rather than a one-size-fits-all approach. This makes Chunklet-py a great choice for projects that require a balance of power, flexibility, and a small footprint.
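To give a feel for what "rule-based" can mean in practice, here is a toy sketch that splits Python source at top-level `def` and `class` lines using only a regular expression. It is not Chunklet-py's actual algorithm; `chunk_by_definition` and the pattern are assumptions for illustration, and the rule is far cruder than a real chunker would need (for example, decorators end up in the preceding chunk).

```python
import re

# Lines that open a top-level Python definition (no leading indentation).
TOP_LEVEL = re.compile(r"^(?:async\s+def|def|class)\s+\w+", re.MULTILINE)


def chunk_by_definition(source: str) -> list[str]:
    """Split source so each chunk begins at a top-level def or class.

    Anything before the first definition (imports, constants) forms its own
    leading chunk. Whitespace-only chunks are dropped.
    """
    starts = [match.start() for match in TOP_LEVEL.finditer(source)]
    if not starts:
        return [source] if source.strip() else []
    bounds = ([0] if starts[0] > 0 else []) + starts + [len(source)]
    pieces = [source[a:b] for a, b in zip(bounds, bounds[1:])]
    return [piece for piece in pieces if piece.strip()]


source = (
    "import os\n"
    "\n"
    "def a():\n"
    "    return 1\n"
    "\n"
    "class B:\n"
    "    def m(self):\n"
    "        return 2\n"
)
chunks = chunk_by_definition(source)
for chunk in chunks:
    print(repr(chunk))
```

The indented method `m` stays inside the `class B` chunk because the pattern only matches at the start of an unindented line, which is the kind of structural rule that keeps a function or class in one piece without a parser dependency.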
Big thanks to the people who helped shape Chunklet:
- @jmbernabotto — for helping mostly on the CLI part, suggesting fixes, features, and design improvements.
📜 License
See the LICENSE file for full details.
MIT License. Use freely, modify boldly, and credit the legend (me. Just kidding!)
