Zero-copy raw batch cursor API #1529

@SilasMarvin

Problem Statement

The current MongoDB Rust driver's find() method deserializes each document individually, which creates performance bottlenecks for high-throughput workloads.

This came up while optimizing bulk document processing at Rippling, where we're seeing overhead from per-document deserialization in our benchmarks.

Proposed Solution

Add a find_raw_batches() API that returns server response batches directly as RawDocumentBuf without per-document deserialization.

API Example

use futures::stream::StreamExt;
use mongodb::bson::{doc, RawArray};

// Returns a Stream of RawBatch items
let mut cursor: RawBatchCursor = db.find_raw_batches("coll", doc! {}).await?;

while let Some(batch_result) = cursor.next().await {
    let batch: RawBatch = batch_result?;

    // Zero-copy access to the batch array (firstBatch/nextBatch)
    let docs: &RawArray = batch.doc_slices()?;

    // Iterate over documents in the batch without allocation
    for doc_result in docs.into_iter() {
        // Process raw document - can extract fields, forward bytes, etc.
        // No per-document deserialization overhead
    }
}
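
To make the "zero-copy" claim concrete, here is a minimal std-only sketch of walking a batch array as borrowed byte slices. It is not the driver's or the bson crate's code: `doc_slices`, `DocSlices`, `BatchError`, and `encode_array` are hypothetical names for illustration. It assumes only the standard BSON layout, where an array is a document whose elements are keyed "0", "1", and so on, and each embedded document starts with a little-endian int32 length.

```rust
/// Error type for this sketch (hypothetical, not a driver type).
#[derive(Debug, PartialEq)]
pub enum BatchError {
    Malformed,
    UnexpectedType(u8),
}

/// Borrowing iterator over the embedded documents of a BSON array.
pub struct DocSlices<'a> {
    rest: &'a [u8], // element bytes still to parse, without the trailing 0x00
}

/// Reads the little-endian int32 length prefix of a BSON document.
fn read_len(bytes: &[u8]) -> Result<usize, BatchError> {
    if bytes.len() < 4 {
        return Err(BatchError::Malformed);
    }
    let len = i32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]);
    if len < 5 {
        return Err(BatchError::Malformed); // the smallest valid document is 5 bytes
    }
    Ok(len as usize)
}

/// Validates the array framing and returns an iterator that borrows from `array`.
pub fn doc_slices<'a>(array: &'a [u8]) -> Result<DocSlices<'a>, BatchError> {
    let len = read_len(array)?;
    if len > array.len() || array[len - 1] != 0 {
        return Err(BatchError::Malformed);
    }
    Ok(DocSlices { rest: &array[4..len - 1] })
}

impl<'a> Iterator for DocSlices<'a> {
    type Item = Result<&'a [u8], BatchError>;

    fn next(&mut self) -> Option<Self::Item> {
        let (&tag, after_tag) = self.rest.split_first()?;
        if tag != 0x03 {
            // 0x03 is the BSON type tag for an embedded document.
            self.rest = &[];
            return Some(Err(BatchError::UnexpectedType(tag)));
        }
        // Skip the NUL-terminated key ("0", "1", ...).
        let key_end = match after_tag.iter().position(|&b| b == 0) {
            Some(i) => i,
            None => {
                self.rest = &[];
                return Some(Err(BatchError::Malformed));
            }
        };
        let body = &after_tag[key_end + 1..];
        let len = match read_len(body) {
            Ok(n) if n <= body.len() => n,
            _ => {
                self.rest = &[];
                return Some(Err(BatchError::Malformed));
            }
        };
        // The yielded slice points into the original buffer; nothing is copied.
        let (doc, rest) = body.split_at(len);
        self.rest = rest;
        Some(Ok(doc))
    }
}

/// Helper for building sample input: encodes documents as a BSON array.
pub fn encode_array(docs: &[&[u8]]) -> Vec<u8> {
    let mut body = Vec::new();
    for (i, doc) in docs.iter().enumerate() {
        body.push(0x03);
        body.extend_from_slice(i.to_string().as_bytes());
        body.push(0);
        body.extend_from_slice(doc);
    }
    let total = (4 + body.len() + 1) as i32;
    let mut out = total.to_le_bytes().to_vec();
    out.extend_from_slice(&body);
    out.push(0);
    out
}
```

Each item borrows from the batch buffer, so forwarding a document's bytes or probing a single field never requires building an owned per-document value. That is the property the proposed `batch.doc_slices()` is meant to expose.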

Performance Impact

In my benchmarks against the existing find() with raw BSON (which already skips typed deserialization), the raw batch API shows meaningful performance improvements. I've included it in the driver's benchmark suite for comparison.

Implementation Notes

I have a working implementation on a branch that:

  • Adds Database::find_raw_batches() method
  • Implements RawBatchCursor that yields RawBatch items
  • Supports both implicit and explicit sessions
  • Includes some tests and benchmarks
  • Uses the existing RawDocumentBuf infrastructure; no unsafe code required

The implementation reuses existing cursor machinery and operation execution paths with a new handle_response_owned() hook that takes ownership of the response buffer rather than borrowing it.
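
A rough sketch of how such a hook could fit next to a borrowing handler is shown below. Only the name `handle_response_owned()` comes from the description above. The `Operation` trait shape, `Reply`, `execute`, and the two example operations are assumptions for illustration and do not mirror the driver's internal types.

```rust
/// Stand-in for the raw bytes of a server reply (hypothetical type).
pub struct Reply(pub Vec<u8>);

/// Simplified operation trait; the real driver trait is assumed to differ.
pub trait Operation {
    type Output;

    /// Borrowing path: the handler can only read from the reply.
    fn handle_response(&self, reply: &Reply) -> Self::Output;

    /// Owned path: by default it falls back to the borrowing handler,
    /// so existing operations need no changes.
    fn handle_response_owned(&self, reply: Reply) -> Self::Output {
        self.handle_response(&reply)
    }
}

/// An existing-style operation that only needs to read the reply.
pub struct CountBytes;

impl Operation for CountBytes {
    type Output = usize;

    fn handle_response(&self, reply: &Reply) -> usize {
        reply.0.len()
    }
}

/// A raw-batch-style operation that wants to keep the reply buffer.
pub struct RawBatches;

impl Operation for RawBatches {
    type Output = Vec<u8>;

    fn handle_response(&self, reply: &Reply) -> Vec<u8> {
        // With only a borrow available, keeping the bytes forces a copy.
        reply.0.clone()
    }

    fn handle_response_owned(&self, reply: Reply) -> Vec<u8> {
        // With ownership, the buffer is moved out without copying.
        reply.0
    }
}

/// The execution path always hands the reply over by value.
pub fn execute<O: Operation>(op: &O, reply: Reply) -> O::Output {
    op.handle_response_owned(reply)
}
```

With a default method like this, only operations that benefit from owning the buffer need to override the hook, and every other operation keeps its current behavior.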

Questions for Maintainers

  1. Is this feature aligned with the driver's roadmap?
  2. Are there API naming or design preferences? (e.g., find_raw_batches vs find_batches_raw)
  3. Any concerns about the implementation approach?

I'm happy to refine the implementation based on feedback and would be excited to contribute this upstream if there's interest.
