Conversation

@alinaryan (Contributor) commented Nov 9, 2025

This PR builds on the file processing workflow demonstrated in a recent Llama Stack community meeting, where we showcased file upload and processing capabilities through the UI. It introduces the backend API foundation that enables those integrations: specifically, a file_processor API skeleton that establishes a framework for converting files into structured content suitable for vector store ingestion, with support for configurable chunking strategies and optional embedding generation.

A follow-up PR will add an inline PyPDF provider implementation that can be invoked either within the vector store or as a standalone processor.
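
For illustration, here is a rough sketch of how a caller might invoke the new API once a provider is wired up, based on the process_file signature reviewed below (the processor handle and the ingest helper are hypothetical, not part of this PR):

# Hypothetical usage sketch; assumes a resolved file_processor provider
# exposing the async process_file method with the signature in this PR.
async def ingest(processor, path: str) -> None:
    with open(path, "rb") as f:
        data = f.read()
    result = await processor.process_file(
        file_data=data,            # raw bytes to process
        filename=path,             # used for format detection
        options=None,              # provider-specific parameters
        chunking_strategy=None,    # fall back to the provider default
        include_embeddings=False,  # skip embedding generation
    )
    print(result.metadata)         # ProcessedContent carries chunks and metadata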

Related to:
#4114
#4003
#2484

cc: @franciscojavierarceo @alimaredia

This change adds a file_processor API skeleton that provides a foundation for converting files into structured content for vector store ingestion, with support for chunking strategies and optional embedding generation.

Signed-off-by: Alina Ryan <aliryan@redhat.com>
@alinaryan alinaryan force-pushed the add-file-processor-skeleton branch from b3ccdb2 to 2664aee on November 9, 2025 05:24
@alinaryan alinaryan marked this pull request as draft November 9, 2025 05:31
@cdoern (Collaborator) left a comment

A few comments to start out. Thanks for working on this!

@r-bit-rry (Contributor) left a comment

Please consider the following comments; if I've misread the intention anywhere, feel free to ignore them and leave a note saying so.

  1. The file-processor endpoints are missing from client-sdks/stainless/openapi.yml, do we need it there?
  2. Do we need CLI support for file_processor? src/llama_stack/cli
  3. Needs at least basic unit tests for the API contract and the reference provider.

I want to push this effort forward so we can integrate a proper RAG pipeline in the broader scope. Thanks!

Comment on lines +18 to +34
class ProcessFileRequest(BaseModel):
"""Request for processing a file into structured content."""

file_data: bytes
"""Raw file data to process."""

filename: str
"""Original filename for format detection and processing hints."""

options: dict[str, Any] | None = None
"""Optional processing options. Provider-specific parameters."""

chunking_strategy: VectorStoreChunkingStrategy | None = None
"""Optional chunking strategy for splitting content into chunks."""

include_embeddings: bool = False
"""Whether to generate embeddings for chunks."""
Contributor

I notice ProcessFileRequest is defined but never actually used; the process_file method takes individual parameters instead. Should we either remove this class or update the method signature to use it? Using the request model would be more consistent with how some other APIs handle complex requests.
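
For example, a request-model-based signature could look like this (a sketch only, not a concrete proposal):

# Hypothetical alternative: take the validated request model directly
# instead of five individual parameters.
async def process_file(self, request: ProcessFileRequest) -> ProcessedContent:
    ...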

embeddings: list[list[float]] | None = None
"""Optional embeddings for chunks if requested."""

metadata: dict[str, Any]
Contributor
nit: The metadata field is dict[str, Any] but there's no guidance on what keys providers should include. Could we add a docstring or comment listing expected keys like processor, filename, processing_time, etc.? This would help future provider implementations stay consistent.
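
Something like the following, for instance (the key names are suggestions from this comment, not an established convention):

metadata: dict[str, Any]
"""Processing metadata. Suggested keys for providers:
- processor: name of the provider that handled the file
- filename: the original filename as passed in
- processing_time: wall-clock seconds spent processing
"""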

Collaborator

What's the utility of the ProcessedContent metadata over the metadata in Files and ChunkMetadata?

filename: str,
options: dict[str, Any] | None = None,
chunking_strategy: VectorStoreChunkingStrategy | None = None,
include_embeddings: bool = False,
Contributor

When include_embeddings=True, which embedding model gets used? Should this be passed in the options dict, or should we add an explicit embedding_model parameter? It's not clear from the current signature.
Also, maybe rename it to generate_embeddings?
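
For instance, an explicit parameter might look like this (a sketch; embedding_model is hypothetical):

async def process_file(
    self,
    file_data: bytes,
    filename: str,
    options: dict[str, Any] | None = None,
    chunking_strategy: VectorStoreChunkingStrategy | None = None,
    generate_embeddings: bool = False,
    # Hypothetical: which model to embed with; required when
    # generate_embeddings=True.
    embedding_model: str | None = None,
) -> ProcessedContent: ...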


async def process_file(
self,
file_data: bytes,
Contributor

nit: Should the reference implementation at least attempt to decode the file_data as text? Even a simple content = file_data.decode('utf-8', errors='ignore') would make it slightly more realistic for testing purposes.
Even though this is a reference implementation, it might be worth adding basic validation to set a good example. Something like:

if not file_data:
    raise ValueError("file_data cannot be empty")
if not filename:
    raise ValueError("filename is required")

Contributor (Author)

Since I'm no longer adding the reference provider here, I will add this kind of check to future provider implementations.

async def initialize(self) -> None:
pass

async def process_file(
@r-bit-rry (Contributor) commented Nov 25, 2025
The method is async, but for large files, should we consider returning a job ID instead of blocking? Similar to how batch processing works? Or is that out of scope for this draft?
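
For example, a non-blocking variant could hand back a job handle (a purely hypothetical shape, not an existing Llama Stack API):

# Hypothetical job-style API for large files: submission returns
# immediately, and callers poll for the finished ProcessedContent.
class FileProcessingJob(BaseModel):
    job_id: str
    status: str  # e.g. "queued" | "running" | "completed" | "failed"

async def submit_file(self, file_data: bytes, filename: str) -> FileProcessingJob: ...

async def get_processing_result(self, job_id: str) -> ProcessedContent: ...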

Contributor (Author)

I'll consider this in the follow-up provider implementation PR.

self,
file_data: bytes,
filename: str,
options: dict[str, Any] | None = None,
Contributor
Is there an expected maximum file size? This could become a memory issue if someone tries to process a 1GB text file. Should we document recommended limits or add a max_file_size parameter (maybe part of the options with a default value)?
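
For example (a sketch; the max_file_size option and the 50 MiB default are hypothetical):

# Hypothetical guard inside process_file: pull a cap from options,
# falling back to a default.
DEFAULT_MAX_FILE_SIZE = 50 * 1024 * 1024  # 50 MiB, illustrative only

max_size = (options or {}).get("max_file_size", DEFAULT_MAX_FILE_SIZE)
if len(file_data) > max_size:
    raise ValueError(f"file_data exceeds max_file_size of {max_size} bytes")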

Contributor (Author)

I'll consider this in the follow-up provider implementation PR.

Contributor (Author)

Ideally, I would like users to be able to process documents at scale, but a default max might be a good start.

from pydantic import BaseModel

from llama_stack.apis.common.tracing import telemetry_traceable
from llama_stack.apis.vector_io.vector_io import Chunk, VectorStoreChunkingStrategy
Contributor

This introduces a direct dependency on the vector_io API by importing VectorStoreChunkingStrategy.

Collaborator

Yes and no, I think. The types, yes; the logic, probably not, since these are just two Pydantic models.

@alinaryan alinaryan marked this pull request as ready for review November 26, 2025 10:39
@alinaryan (Contributor, Author)

Replying to @r-bit-rry's review:

> Please consider the following comments; if I've misread the intention anywhere, feel free to ignore them and leave a note saying so.
> 1. The file-processor endpoints are missing from client-sdks/stainless/openapi.yml, do we need it there?
> 2. Do we need CLI support for file_processor? src/llama_stack/cli
> 3. Needs at least basic unit tests for the API contract and the reference provider.
> I want to push this effort forward so we can integrate a proper RAG pipeline in the broader scope. Thanks!

  1. Thank you! The endpoints are there now.
  2. Yes, I can add that to this PR or in a follow-up PR.
  3. I took out the reference provider in this PR based on some of the review comments. I am working on a follow-up PR to add PyPDF as the default provider, and will add tests there.

I'm working on addressing your other review comments.

"providers",
"models",
"files",
"file_processors",
Collaborator

Did we add any logs for file_processors? Or is this just a new section for when people add it later?

Collaborator

I don't see logs here since these are just stubs, which is fine by me.

class ProcessFileRequest(BaseModel):
"""Request for processing a file into structured content."""

file_data: bytes
Collaborator

It may be useful to optionally add file_text when the data is already pure text.
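
For example (a sketch; the file_text field is hypothetical):

class ProcessFileRequest(BaseModel):
    file_data: bytes | None = None
    """Raw file data; may be omitted when file_text is provided."""

    file_text: str | None = None
    """Hypothetical: pre-extracted text, skipping format detection and parsing."""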

chunking_strategy: VectorStoreChunkingStrategy | None = None
"""Optional chunking strategy for splitting content into chunks."""

include_embeddings: bool = False
Collaborator

Given the docstring, this feels like a strange name choice. Why not generate_embeddings?


@webmethod(route="/file-processors/process", method="POST", level=LLAMA_STACK_API_V1ALPHA)
async def process_file(
self,
Collaborator

Wouldn't we need file_id as well?

Collaborator

At the moment it's a requirement for us to have the file uploaded to file storage before processing. It could probably be useful to relax that requirement and behave like a calculator: bytes in, chunks out.
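
For example, the signature could accept either form (an illustrative sketch; the file_id handling is hypothetical):

async def process_file(
    self,
    file_data: bytes | None = None,  # "calculator" mode: raw bytes in
    file_id: str | None = None,      # or an already-uploaded file reference
    filename: str | None = None,
) -> ProcessedContent:
    # Hypothetical validation: exactly one input source must be given.
    if (file_data is None) == (file_id is None):
        raise ValueError("provide exactly one of file_data or file_id")
    ...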

@franciscojavierarceo (Collaborator) left a comment

This looks really great. Some small nits and questions, but otherwise I think we're close.

cc @cdoern
