feat(api): add file_processor API skeleton #4113
Conversation
This change adds a file_processor API skeleton that provides a foundation for converting files into structured content for vector store ingestion, with support for chunking strategies and optional embedding generation. Signed-off-by: Alina Ryan <aliryan@redhat.com>
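For orientation, a rough usage sketch; the `client` handle and attribute path are assumptions for illustration, while the parameter names follow the API definition quoted below:

```python
async def process_report(client) -> None:
    # Hypothetical client call against the proposed endpoint.
    with open("report.pdf", "rb") as f:
        processed = await client.file_processor.process_file(
            file_data=f.read(),
            filename="report.pdf",        # used for format detection
            chunking_strategy=None,       # fall back to the provider default
            include_embeddings=False,     # chunks only, no embeddings
        )
    print(processed.metadata)
```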
cdoern left a comment
a few comments to start out. Thanks for working on this!
src/llama_stack/providers/inline/file_processor/reference/reference.py (outdated review thread, resolved)
Please consider the following comments; if I'm mistaken or missed the intention, feel free to ignore them and say so.
- The file-processor endpoints are missing from client-sdks/stainless/openapi.yml; do we need them there?
- Do we need CLI support for file_processor (src/llama_stack/cli)?
- This needs at least basic unit tests for the API contract and the reference provider.

I want to push this effort so we can integrate a proper RAG pipeline in the broader scope, thanks!
```python
class ProcessFileRequest(BaseModel):
    """Request for processing a file into structured content."""

    file_data: bytes
    """Raw file data to process."""

    filename: str
    """Original filename for format detection and processing hints."""

    options: dict[str, Any] | None = None
    """Optional processing options. Provider-specific parameters."""

    chunking_strategy: VectorStoreChunkingStrategy | None = None
    """Optional chunking strategy for splitting content into chunks."""

    include_embeddings: bool = False
    """Whether to generate embeddings for chunks."""
```
I notice ProcessFileRequest is defined but never actually used; the process_file method takes individual parameters instead. Should we either remove this class or update the method signature to use it? Using the request model would be more consistent with how some other APIs handle complex requests.
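For comparison, the request-model flavor could look something like this sketch (the route and decorator are as defined in this PR):

```python
# Sketch only: process_file taking the request model instead of loose parameters.
@webmethod(route="/file-processors/process", method="POST", level=LLAMA_STACK_API_V1ALPHA)
async def process_file(self, request: ProcessFileRequest) -> ProcessedContent:
    ...
```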
```python
    embeddings: list[list[float]] | None = None
    """Optional embeddings for chunks if requested."""

    metadata: dict[str, Any]
```
nit: The metadata field is dict[str, Any] but there's no guidance on what keys providers should include. Could we add a docstring or comment listing expected keys like processor, filename, processing_time, etc.? This would help future provider implementations stay consistent.
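Something along these lines, perhaps (the key list is illustrative, not a settled contract):

```python
from typing import Any

from pydantic import BaseModel


class ProcessedContent(BaseModel):
    # Sketch: only the metadata field shown.
    metadata: dict[str, Any]
    """Provider-reported processing metadata.

    Suggested keys (illustrative): "processor" (provider name),
    "filename" (original filename), "processing_time" (seconds spent).
    """
```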
What's the utility of the ProcessedContent metadata over the metadata in Files and ChunkMetadata?
```python
        filename: str,
        options: dict[str, Any] | None = None,
        chunking_strategy: VectorStoreChunkingStrategy | None = None,
        include_embeddings: bool = False,
```
When include_embeddings=True, which embedding model gets used? Should this be passed in the options dict, or should we add an explicit embedding_model parameter? It's not clear from the current signature.
Also, maybe change name to generate_embeddings?
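A sketch of what that could look like; both the rename and the embedding_model parameter are suggestions from this review, not part of the PR:

```python
    async def process_file(
        self,
        file_data: bytes,
        filename: str,
        options: dict[str, Any] | None = None,
        chunking_strategy: VectorStoreChunkingStrategy | None = None,
        generate_embeddings: bool = False,   # renamed from include_embeddings
        embedding_model: str | None = None,  # required when generate_embeddings=True
    ) -> ProcessedContent: ...
```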
```python
    async def process_file(
        self,
        file_data: bytes,
```
nit: Should the reference implementation at least attempt to decode the file_data as text? Even a simple content = file_data.decode('utf-8', errors='ignore') would make it slightly more realistic for testing purposes.
Even though this is a reference implementation, might be worth adding basic validation to set a good example? Something like:
```python
if not file_data:
    raise ValueError("file_data cannot be empty")
if not filename:
    raise ValueError("filename is required")
```
Since I'm no longer adding the reference provider here, I will add this kind of check to future provider implementations.
```python
    async def initialize(self) -> None:
        pass

    async def process_file(
```
The method is async, but for large files, should we consider returning a job ID instead of blocking? Similar to how batch processing works? Or is that out of scope for this draft?
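For the record, a job-based variant might look roughly like this; ProcessFileJob and the method names are invented for illustration:

```python
from typing import Protocol

from pydantic import BaseModel


class ProcessFileJob(BaseModel):
    # Invented job handle, for illustration only.
    job_id: str
    status: str  # e.g. "queued" | "running" | "completed" | "failed"
    result: ProcessedContent | None = None


class FileProcessor(Protocol):
    async def process_file_async(self, file_data: bytes, filename: str) -> ProcessFileJob:
        """Enqueue processing and return a job handle immediately."""
        ...

    async def get_process_file_job(self, job_id: str) -> ProcessFileJob:
        """Poll job status; `result` is populated once processing completes."""
        ...
```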
Will consider this in the follow-up provider implementation PR.
```python
        self,
        file_data: bytes,
        filename: str,
        options: dict[str, Any] | None = None,
```
Is there an expected maximum file size? This could become a memory issue if someone tries to process a 1GB text file. Should we document recommended limits or add a max_file_size parameter (maybe part of the options with a default value)?
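A minimal sketch of an options-based limit (the 50 MB default and the option key are placeholders, not agreed values):

```python
from typing import Any

DEFAULT_MAX_FILE_SIZE = 50 * 1024 * 1024  # placeholder default: 50 MB


def check_file_size(file_data: bytes, options: dict[str, Any] | None) -> None:
    # Reject oversized payloads before any processing work happens.
    max_size = (options or {}).get("max_file_size", DEFAULT_MAX_FILE_SIZE)
    if len(file_data) > max_size:
        raise ValueError(f"file is {len(file_data)} bytes; max_file_size is {max_size}")
```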
Will consider this in the follow-up provider implementation PR.
Ideally, I would like users to be able to process documents at scale, but a default max might be a good start.
```python
from pydantic import BaseModel

from llama_stack.apis.common.tracing import telemetry_traceable
from llama_stack.apis.vector_io.vector_io import Chunk, VectorStoreChunkingStrategy
```
This creates a direct dependency on the vector_io API by importing VectorStoreChunkingStrategy.
Yes and no, I think. The types, yes; but the logic probably not, since these are just two Pydantic models.
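If we ever want to break that coupling, one option is lifting the shared models into a common module; a rough sketch, where the module path is hypothetical:

```python
# Hypothetical refactor: host the chunking types in a neutral module so both
# APIs import from common ground instead of file_processor -> vector_io.
# e.g. define VectorStoreChunkingStrategy and Chunk in
# src/llama_stack/apis/common/chunking.py, then in both APIs:
from llama_stack.apis.common.chunking import Chunk, VectorStoreChunkingStrategy
```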
I'm working on addressing your other review comments.
| "providers", | ||
| "models", | ||
| "files", | ||
| "file_processors", |
Did we add any logs for file_processors, or is this just a new section for when people add it later?
I don't see logs here since these are just the stubs, which is fine by me.
```python
class ProcessFileRequest(BaseModel):
    """Request for processing a file into structured content."""

    file_data: bytes
```
It may be useful to optionally add file_text when the data is already pure text.
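For example (a sketch; the field name and exactly-one-of semantics are suggestions):

```python
class ProcessFileRequest(BaseModel):
    # Sketch: callers would provide exactly one of the two fields.
    file_data: bytes | None = None
    """Raw file data to process."""

    file_text: str | None = None
    """Pre-decoded text, for callers that already hold plain text."""
```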
```python
    chunking_strategy: VectorStoreChunkingStrategy | None = None
    """Optional chunking strategy for splitting content into chunks."""

    include_embeddings: bool = False
```
Given the docstring, this feels like a strange name choice. Why not generate_embeddings?
```python
    @webmethod(route="/file-processors/process", method="POST", level=LLAMA_STACK_API_V1ALPHA)
    async def process_file(
        self,
```
Wouldn't we need file_id as well?
At the moment, it's a requirement for us to have the file uploaded to file storage before processing. It could be useful to relax that requirement and behave like a calculator: bytes in, chunks out.
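The two call styles under discussion, as a sketch (file_id support is hypothetical and would let the provider fetch bytes from Files storage):

```python
async def style_a(processor, raw_bytes: bytes):
    # Style A (this PR): pure function, bytes in / chunks out.
    return await processor.process_file(file_data=raw_bytes, filename="doc.pdf")


async def style_b(processor):
    # Style B (hypothetical): reference an already-uploaded file by id.
    return await processor.process_file(file_id="file-abc123")
```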
franciscojavierarceo left a comment
This looks really great; some small nits and questions, but otherwise I think we're close.
cc @cdoern
This PR builds on the file processing workflow demonstrated in a recent Llama Stack community meeting, where we showcased file upload and processing capabilities through the UI. It introduces the backend API foundation that enables those integrations: specifically, a file_processor API skeleton that establishes a framework for converting files into structured content suitable for vector store ingestion, with support for configurable chunking strategies and optional embedding generation.
A follow-up PR will add an inline PyPDF provider implementation that can be invoked either within the vector store or as a standalone processor.
Related to:
#4114
#4003
#2484
cc: @franciscojavierarceo @alimaredia