Skip to content

Conversation

@dilithjay
Copy link
Contributor

@dilithjay dilithjay commented Nov 23, 2025

Logic: If the document has an image, the usual routing is to do LLM_PARSE.

When router_priority="cost", get the character count using PaddleOCR and parse with LLM_PARSE only if the count is larger than some threshold (indicating a significant amount of text).

@dilithjay dilithjay self-assigned this Nov 23, 2025
@dilithjay dilithjay added the enhancement New feature or request label Nov 23, 2025
@pramitchoudhary
Copy link
Contributor

Q: How do the test results look like? - Execution cost
Can we add tests for the above?

@dilithjay
Copy link
Contributor Author

dilithjay commented Nov 29, 2025

Q: How do the test results look like? - Execution cost

We don't have many documents that trigger the switch to OCR, so there isn't an easy way to estimate this on average. But the cost is guaranteed to be <= that of the other AUTO routing methods or LLM_PARSE. However, the speed and accuracy can be lower.

Can we add tests for the above?

Added

Copy link
Contributor

@pramitchoudhary pramitchoudhary left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dilithjay 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Estimate text density in AUTO mode by checking for percentage of non-black pixels

3 participants