-
Notifications
You must be signed in to change notification settings - Fork 0
Stub CLI command for updating existing docs with embeddings #370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stub CLI command for updating existing docs with embeddings #370
Conversation
Why these changes are being introduced: * TIM requires a CLI command to update existing docs with embeddings. Currently, TIM uses the OpenSearch 'index' action to create/update docs using *full* records, which performs a full replacement of existing docs. To update docs using *partial* records (e.g., only a subset of TIMDEX fields [i.e., embeddings]), the OpenSearch 'update' action must be used. How this addresses that need: * Add 'bulk_update_embeddings' CLI command * Create BulkOperationError exception * Update tim.helpers.generate_bulk_actions to format inputs for OpenSearch 'update' action Side effects of this change: * TIMDEX Pipeline Lambdas will require updates to generate the 'bulk-update-embeddings' CLI command. Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/USE-122
ghukill
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Despite my "Request changes" here, I think it looks fantastic. Nice work. Thanks for the discussions along the way which helped my understanding.
My only request is a docstring for the CLI command. Even though some of the functionality is stubbed, I think this stubbing work can include what the CLI command will do.
Otherwise, fully approved!
| return result | ||
|
|
||
|
|
||
| def bulk_update( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commenting here broadly: I think the addition of this helper was a great idea. As discussed off-PR, I think it's the right call to introduce code duplication now and get things working, and then could optionally take a second pass and consider combining bulk_index, bulk_delete, and bulk_update into some kind of bulk_operation with something like action=index|delete|update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ghukill Should I go add a ticket in the USE backlog or create a GH issue? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's try a GH issue! It doesn't feel at all required, just sort of optional code hygene and maintenance consideration. If we don't touch it, no problem, but can get a feel if GH issues are valuable + remembered.
Pull Request Test Coverage Report for Build 19438830847Details
💛 - Coveralls |
Purpose and background context
TIM requires a CLI command to update existing docs with embeddings. Currently, TIM uses the OpenSearch
'index' action to create/update docs using full records, which performs a full replacement of existing docs. To update docs using partial records (e.g., only a subset of TIMDEX fields [i.e., embeddings]), the OpenSearch 'update' action must be used.
How can a reviewer manually see the effects of these changes?
For now, I've decided not to add new unit tests -- partly due to not having created unit tests with
vcrand partly because, as discussed, there may be some refactoring with thetim.opensearch.bulk_*commands in the near future. Rest assured, I did run some testing using a local OpenSearch instance! The steps I took are summarized below:Follow the instructions in How to run and query OpenSearch locally | OpenSearch and OpenSearch Dashboards.
In another terminal, paste credentials for the
TIMDEXManagersrole from Prod and reindexlibguidesusing the following command:Run the following command:
Note: This record belongs to the run represented by the Run ID above.
Output
Includes new or updated dependencies?
NO
Changes expectations for external applications?
YES - TIMDEX Pipeline Lambdas will require updates to generate
the 'bulk-update-embeddings' CLI command.
What are the relevant tickets?