Conversation

@alinaryan (Contributor) commented Nov 9, 2025

This PR builds on the file processing workflow demonstrated in a recent Llama Stack community meeting, where we showcased file upload and processing capabilities through the UI. It introduces the backend API foundation that enables those integrations: specifically, a file_processor API skeleton that establishes a framework for converting files into structured content suitable for vector store ingestion, with support for configurable chunking strategies and optional embedding generation.

A follow-up PR will add an inline PyPDF provider implementation that can be invoked either within the vector store or as a standalone processor.
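
For illustration, here is a rough sketch of how a caller might invoke the new API once a provider is wired up, based on the process_file signature reviewed below (the processor handle and the ingest helper are hypothetical, not part of this PR):

# Hypothetical usage sketch; assumes a resolved file_processor provider
# exposing the async process_file method with the signature in this PR.
async def ingest(processor, path: str) -> None:
    with open(path, "rb") as f:
        data = f.read()
    result = await processor.process_file(
        file_data=data,            # raw bytes to process
        filename=path,             # used for format detection
        options=None,              # provider-specific parameters
        chunking_strategy=None,    # fall back to the provider default
        include_embeddings=False,  # skip embedding generation
    )
    print(result.metadata)         # ProcessedContent carries chunks and metadata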

Related to:
#4114
#4003
#2484

cc: @franciscojavierarceo @alimaredia

This change adds a file_processor API skeleton that provides a foundation for converting files into structured content for vector store ingestion, with support for chunking strategies and optional embedding generation.

Signed-off-by: Alina Ryan <aliryan@redhat.com>
@alinaryan alinaryan force-pushed the add-file-processor-skeleton branch from b3ccdb2 to 2664aee on November 9, 2025 05:24
@alinaryan alinaryan marked this pull request as draft November 9, 2025 05:31
@cdoern (Collaborator) left a comment

A few comments to start out. Thanks for working on this!

@r-bit-rry (Contributor) left a comment

Please consider the following comments; if I've misread the intention anywhere, feel free to ignore them and leave a note saying so.

  1. The file-processor endpoints are missing from client-sdks/stainless/openapi.yml, do we need it there?
  2. Do we need CLI support for file_processor? src/llama_stack/cli
  3. Needs at least basic unit tests for the API contract and the reference provider.

I want to push this effort forward so we can integrate a proper RAG pipeline in the broader scope. Thanks!

Comment on lines +18 to +34
class ProcessFileRequest(BaseModel):
"""Request for processing a file into structured content."""

file_data: bytes
"""Raw file data to process."""

filename: str
"""Original filename for format detection and processing hints."""

options: dict[str, Any] | None = None
"""Optional processing options. Provider-specific parameters."""

chunking_strategy: VectorStoreChunkingStrategy | None = None
"""Optional chunking strategy for splitting content into chunks."""

include_embeddings: bool = False
"""Whether to generate embeddings for chunks."""
Contributor

I notice ProcessFileRequest is defined but never actually used; the process_file method takes individual parameters instead. Should we either remove this class or update the method signature to use it? Using the request model would be more consistent with how some other APIs handle complex requests.
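
For example, a request-model-based signature could look like this (a sketch only, not a concrete proposal):

# Hypothetical alternative: take the validated request model directly
# instead of five individual parameters.
async def process_file(self, request: ProcessFileRequest) -> ProcessedContent:
    ...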

embeddings: list[list[float]] | None = None
"""Optional embeddings for chunks if requested."""

metadata: dict[str, Any]
Contributor
nit: The metadata field is dict[str, Any] but there's no guidance on what keys providers should include. Could we add a docstring or comment listing expected keys like processor, filename, processing_time, etc.? This would help future provider implementations stay consistent.
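
Something like the following, for instance (the key names are suggestions from this comment, not an established convention):

metadata: dict[str, Any]
"""Processing metadata. Suggested keys for providers:
- processor: name of the provider that handled the file
- filename: the original filename as passed in
- processing_time: wall-clock seconds spent processing
"""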

Collaborator

What's the utility of the ProcessedContent metadata over the metadata in Files and ChunkMetadata?

filename: str,
options: dict[str, Any] | None = None,
chunking_strategy: VectorStoreChunkingStrategy | None = None,
include_embeddings: bool = False,
Contributor

When include_embeddings=True, which embedding model gets used? Should this be passed in the options dict, or should we add an explicit embedding_model parameter? It's not clear from the current signature.
Also, maybe rename it to generate_embeddings?
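
For instance, an explicit parameter might look like this (a sketch; embedding_model is hypothetical):

async def process_file(
    self,
    file_data: bytes,
    filename: str,
    options: dict[str, Any] | None = None,
    chunking_strategy: VectorStoreChunkingStrategy | None = None,
    generate_embeddings: bool = False,
    # Hypothetical: which model to embed with; required when
    # generate_embeddings=True.
    embedding_model: str | None = None,
) -> ProcessedContent: ...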


async def process_file(
self,
file_data: bytes,
Contributor

nit: Should the reference implementation at least attempt to decode the file_data as text? Even a simple content = file_data.decode('utf-8', errors='ignore') would make it slightly more realistic for testing purposes.
Even though this is a reference implementation, it might be worth adding basic validation to set a good example. Something like:

if not file_data:
    raise ValueError("file_data cannot be empty")
if not filename:
    raise ValueError("filename is required")

Contributor (Author)

Since I'm no longer adding the reference provider here, I will add this kind of check to future provider implementations.

async def initialize(self) -> None:
pass

async def process_file(
@r-bit-rry (Contributor) commented Nov 25, 2025
The method is async, but for large files, should we consider returning a job ID instead of blocking? Similar to how batch processing works? Or is that out of scope for this draft?
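
For example, a non-blocking variant could hand back a job handle (a purely hypothetical shape, not an existing Llama Stack API):

# Hypothetical job-style API for large files: submission returns
# immediately, and callers poll for the finished ProcessedContent.
class FileProcessingJob(BaseModel):
    job_id: str
    status: str  # e.g. "queued" | "running" | "completed" | "failed"

async def submit_file(self, file_data: bytes, filename: str) -> FileProcessingJob: ...

async def get_processing_result(self, job_id: str) -> ProcessedContent: ...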

Contributor (Author)

I'll consider this in the follow-up provider implementation PR.

self,
file_data: bytes,
filename: str,
options: dict[str, Any] | None = None,
Contributor
Is there an expected maximum file size? This could become a memory issue if someone tries to process a 1GB text file. Should we document recommended limits or add a max_file_size parameter (maybe part of the options with a default value)?
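
For example (a sketch; the max_file_size option and the 50 MiB default are hypothetical):

# Hypothetical guard inside process_file: pull a cap from options,
# falling back to a default.
DEFAULT_MAX_FILE_SIZE = 50 * 1024 * 1024  # 50 MiB, illustrative only

max_size = (options or {}).get("max_file_size", DEFAULT_MAX_FILE_SIZE)
if len(file_data) > max_size:
    raise ValueError(f"file_data exceeds max_file_size of {max_size} bytes")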

Contributor (Author)

I'll consider this in the follow-up provider implementation PR.

Contributor (Author)

Ideally, I would like users to be able to process documents at scale, but a default max might be a good start.

from pydantic import BaseModel

from llama_stack.apis.common.tracing import telemetry_traceable
from llama_stack.apis.vector_io.vector_io import Chunk, VectorStoreChunkingStrategy
Contributor

This introduces a direct dependency on the vector_io API by importing VectorStoreChunkingStrategy.

Collaborator

Yes and no, I think. The types, yes; the logic, probably not, since these are just two Pydantic models.

@alinaryan alinaryan marked this pull request as ready for review November 26, 2025 10:39
@alinaryan (Contributor, Author)

Replying to @r-bit-rry's review:

> Please consider the following comments; if I've misread the intention anywhere, feel free to ignore them and leave a note saying so.
> 1. The file-processor endpoints are missing from client-sdks/stainless/openapi.yml, do we need it there?
> 2. Do we need CLI support for file_processor? src/llama_stack/cli
> 3. Needs at least basic unit tests for the API contract and the reference provider.
> I want to push this effort forward so we can integrate a proper RAG pipeline in the broader scope. Thanks!

  1. Thank you! The endpoints are there now.
  2. Yes, I can add that to this PR or in a follow-up PR.
  3. I took out the reference provider in this PR based on some of the review comments. I am working on a follow-up PR to add PyPDF as the default provider, and will add tests there.

I'm working on addressing your other review comments.

"providers",
"models",
"files",
"file_processors",
Collaborator

Did we add any logs for file_processors? Or is this just a new section for when people add it later?

Collaborator

I don't see logs here since these are just stubs, which is fine by me.

class ProcessFileRequest(BaseModel):
"""Request for processing a file into structured content."""

file_data: bytes
Collaborator

It may be useful to optionally add file_text when the data is already pure text.
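
For example (a sketch; the file_text field is hypothetical):

class ProcessFileRequest(BaseModel):
    file_data: bytes | None = None
    """Raw file data; may be omitted when file_text is provided."""

    file_text: str | None = None
    """Hypothetical: pre-extracted text, skipping format detection and parsing."""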

chunking_strategy: VectorStoreChunkingStrategy | None = None
"""Optional chunking strategy for splitting content into chunks."""

include_embeddings: bool = False
Collaborator

Given the docstring, this feels like a strange name choice. Why not generate_embeddings?


@webmethod(route="/file-processors/process", method="POST", level=LLAMA_STACK_API_V1ALPHA)
async def process_file(
self,
Collaborator

Wouldn't we need file_id as well?

Collaborator

At the moment it's a requirement for us to have the file uploaded to file storage before processing. It could probably be useful to relax that requirement and behave like a calculator: bytes in, chunks out.
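
For example, the signature could accept either form (an illustrative sketch; the file_id handling is hypothetical):

async def process_file(
    self,
    file_data: bytes | None = None,  # "calculator" mode: raw bytes in
    file_id: str | None = None,      # or an already-uploaded file reference
    filename: str | None = None,
) -> ProcessedContent:
    # Hypothetical validation: exactly one input source must be given.
    if (file_data is None) == (file_id is None):
        raise ValueError("provide exactly one of file_data or file_id")
    ...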

@franciscojavierarceo (Collaborator) left a comment

This looks really great. Some small nits and questions, but otherwise I think we're close.

cc @cdoern
