-
Notifications
You must be signed in to change notification settings - Fork 568
feat(ai): add single message truncation #5079
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| ) | ||
|
|
||
| available_content_bytes = max_bytes - overhead_size - 20 | ||
| message["content"] = content[:available_content_bytes] + "..." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: In-place Mutation of Message Content Causes Side Effects
The function mutates the original message dict instead of creating a copy before modification. This causes unintended side effects where the caller's input data is modified. Similar to normalize_message_roles, this function should create a copy of the message before modifying its content field.
| ) | ||
|
|
||
| available_content_bytes = max_bytes - overhead_size - 20 | ||
| message["content"] = content[:available_content_bytes] + "..." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: UTF-8 Truncation Mismatch Causes Encoding Errors
The truncation uses content[:available_content_bytes] which slices by character index rather than byte length. For multi-byte UTF-8 characters like emoji or non-ASCII text, this produces incorrect results since available_content_bytes is a byte count but string slicing operates on character positions. This can truncate too many or too few bytes, and may slice mid-character causing encoding errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good in terms of the general approach.
The message mutation also to be addressed; we should not edit the return value of another library and return the edited value to the user.
| return 0 | ||
|
|
||
|
|
||
| def truncate_messages_by_size(messages, max_bytes=MAX_GEN_AI_MESSAGE_BYTES): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we instead check if the returned array has one element, and if so, truncate the single message?
Let me know if I am missing some edge case that wouldn't be covered that way.
Contributes to TET-1208