Description
Problem Statement
The current MongoDB Rust driver's find() method deserializes each document individually, which creates performance bottlenecks for high-throughput workloads.
This came up while optimizing bulk document processing at Rippling, where we're seeing overhead from per-document deserialization in our benchmarks.
Proposed Solution
Add a find_raw_batches() API that returns server response batches directly as RawDocumentBuf without per-document deserialization.
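To make "without per-document deserialization" concrete, here is a minimal, standard-library-only sketch of how a raw BSON array can be split into borrowed document slices. It relies only on the BSON wire layout (a little-endian int32 length prefix, elements of the form type byte + NUL-terminated key + value, and a trailing 0x00). The names DocSlices and read_len are hypothetical and are not the driver's or the bson crate's API.

```rust
/// Reads the little-endian i32 length prefix at the start of `bytes`.
fn read_len(bytes: &[u8]) -> Option<usize> {
    if bytes.len() < 4 {
        return None;
    }
    let n = i32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]);
    if n < 0 { None } else { Some(n as usize) }
}

/// Hypothetical borrowing iterator over the embedded documents of a raw BSON array.
struct DocSlices<'a> {
    // Remaining element bytes, excluding the array's trailing 0x00 terminator.
    rest: &'a [u8],
}

impl<'a> DocSlices<'a> {
    /// Validates the outer length prefix and terminator, then borrows the elements.
    fn new(array: &'a [u8]) -> Option<Self> {
        let len = read_len(array)?;
        // The smallest valid BSON document is 5 bytes: length prefix + terminator.
        if len < 5 || len != array.len() || array[len - 1] != 0 {
            return None;
        }
        Some(DocSlices { rest: &array[4..len - 1] })
    }

    /// Parses one element and returns the embedded document's bytes as a sub-slice.
    fn parse_one(&mut self) -> Result<&'a [u8], &'static str> {
        let rest = self.rest;
        let (&tag, after_tag) = rest.split_first().ok_or("no element bytes left")?;
        if tag != 0x03 {
            return Err("element is not an embedded document");
        }
        // Skip the array index key ("0", "1", ...), which is a NUL-terminated string.
        let key_end = after_tag
            .iter()
            .position(|&b| b == 0)
            .ok_or("unterminated element key")?;
        let body = &after_tag[key_end + 1..];
        let len = read_len(body).ok_or("missing document length prefix")?;
        if len < 5 || len > body.len() {
            return Err("truncated document");
        }
        let (doc, tail) = body.split_at(len);
        self.rest = tail;
        Ok(doc)
    }
}

impl<'a> Iterator for DocSlices<'a> {
    type Item = Result<&'a [u8], &'static str>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.rest.is_empty() {
            return None;
        }
        let result = self.parse_one();
        if result.is_err() {
            // Stop after the first malformed element instead of looping on it.
            self.rest = &[];
        }
        Some(result)
    }
}
```

Each yielded slice points into the original buffer, so walking a batch costs a length read and a key scan per document, with no allocation and no field decoding. A caller can forward the slices as-is or decode only the fields it needs.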
API Example
use futures::stream::StreamExt;
use mongodb::bson::{doc, RawArray};
// Returns a Stream of RawBatch items
let mut cursor: RawBatchCursor = db.find_raw_batches("coll", doc! {}).await?;
while let Some(batch_result) = cursor.next().await {
    let batch: RawBatch = batch_result?;
    // Zero-copy access to the batch array (firstBatch/nextBatch)
    let docs: &RawArray = batch.doc_slices()?;
    // Iterate over documents in the batch without allocation
    for doc_result in docs.into_iter() {
        // Process raw document - can extract fields, forward bytes, etc.
        // No per-document deserialization overhead
    }
}
Performance Impact
In my benchmarks against the existing find() with raw BSON (which already skips typed deserialization), the raw batch API shows meaningful performance improvements. I've included it in the driver's benchmark suite for comparison.
Implementation Notes
I have a working implementation on a branch that:
- Adds a Database::find_raw_batches() method
- Implements RawBatchCursor that yields RawBatch items
- Supports both implicit and explicit sessions
- Includes some tests and benchmarks
- Uses the existing RawDocumentBuf infrastructure; no unsafe code required
The implementation reuses existing cursor machinery and operation execution paths with a new handle_response_owned() hook that takes ownership of the response buffer rather than borrowing it.
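The difference between a borrowing hook and an owning hook can be sketched as follows. This is an illustration under stated assumptions, not the branch's actual code: only the name handle_response_owned() comes from the description above, while the Operation trait shape, the Vec<u8> response type, and the CountBytes, RawBatchOp, and OwnedBatch types are hypothetical stand-ins.

```rust
/// Hypothetical stand-in for an operation that interprets a server response.
trait Operation {
    type Output;

    /// Borrowing hook: the caller keeps the response buffer.
    fn handle_response(&self, response: &[u8]) -> Self::Output;

    /// Owning hook: receives the buffer by value.
    /// By default it falls back to the borrowing hook, so existing operations
    /// keep working unchanged.
    fn handle_response_owned(&self, response: Vec<u8>) -> Self::Output {
        self.handle_response(&response)
    }
}

/// An ordinary operation that only needs to look at the response.
struct CountBytes;

impl Operation for CountBytes {
    type Output = usize;

    fn handle_response(&self, response: &[u8]) -> usize {
        response.len()
    }
}

/// Hypothetical batch type that keeps the response bytes alive.
struct OwnedBatch {
    bytes: Vec<u8>,
}

/// An operation that wants to hand the raw bytes to the caller.
struct RawBatchOp;

impl Operation for RawBatchOp {
    type Output = OwnedBatch;

    fn handle_response(&self, response: &[u8]) -> OwnedBatch {
        // With only a borrow available, the bytes have to be copied.
        OwnedBatch { bytes: response.to_vec() }
    }

    fn handle_response_owned(&self, response: Vec<u8>) -> OwnedBatch {
        // With ownership, the buffer is moved into the batch without copying.
        OwnedBatch { bytes: response }
    }
}
```

In this sketch, overriding the owning hook lets the raw-batch operation return the same heap allocation it received, while operations that never override it behave exactly as before.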
Questions for Maintainers
- Is this feature aligned with the driver's roadmap?
- Are there API naming or design preferences? (e.g., find_raw_batches vs find_batches_raw)
- Any concerns about the implementation approach?
I'm happy to refine the implementation based on feedback and would be excited to contribute this upstream if there's interest.