Skip to content

Conversation

@jonavellecuerdo
Copy link
Contributor

@jonavellecuerdo jonavellecuerdo commented Nov 12, 2025

Purpose and background context

TIM requires a CLI command to update existing docs with embeddings. Currently, TIM uses the OpenSearch
'index' action to create/update docs using full records, which performs a full replacement of existing docs. To update docs using partial records (e.g., only a subset of TIMDEX fields [i.e., embeddings]), the OpenSearch 'update' action must be used.

How can a reviewer manually see the effects of these changes?

For now, I've decided not to add new unit tests -- partly due to not having created unit tests with vcr and partly because, as discussed, there may be some refactoring with the tim.opensearch.bulk_* commands in the near future. Rest assured, I did run some testing using a local OpenSearch instance! The steps I took are summarized below:

  1. Follow the instructions in How to run and query OpenSearch locally | OpenSearch and OpenSearch Dashboards.

  2. In another terminal, paste credentials for the TIMDEXManagers role from Prod and reindex libguides using the following command:

    pipenv run tim reindex-source -s libguides s3://timdex-extract-prod-300442551476/dataset
  3. Run the following command:

pipenv run tim bulk-update-embeddings -s libguides -d 2025-02-28 -rid 9ec436e6-df82-44fb-803a-25134b700e94 s3://timdex-extract-prod-300442551476/dataset
  1. Review proof of update in OpenSearch Dashboards using the query:
   GET libguides/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "ids": {
            "values": [
              "libguides:guides-175860"
            ]
          }
        }
      ]
    }
  }
}

Note: This record belongs to the run represented by the Run ID above.

Output

{
  "took": 13,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0,
    "hits": [
      {
        "_index": "libguides-2025-11-12t16-43-05",
        "_id": "libguides:guides-175860",
        "_score": 0,
        "_source": {
          "source": "LibGuides",
          "source_link": "https://libguides.mit.edu/libinfo",
          "timdex_record_id": "libguides:guides-175860",
          "title": "Library and Information Science",
          "citation": "Deborah Lenares. Library and Information Science. MIT Libraries. libguides. https://libguides.mit.edu/libinfo",
          "content_type": [
            "libguides"
          ],
          "contributors": [
            {
              "value": "Deborah Lenares",
              "kind": "Creator"
            }
          ],
          "dates": [
            {
              "kind": "Created",
              "value": "2008-06-27T00:27:31"
            }
          ],
          "format": "electronic resource",
          "identifiers": [
            {
              "value": "oai:libguides.com:guides/175860",
              "kind": "OAI-PMH"
            }
          ],
          "links": [
            {
              "url": "https://libguides.mit.edu/libinfo",
              "kind": "LibGuide URL",
              "text": "LibGuide URL"
            }
          ],
          "publishers": [
            {
              "name": "MIT Libraries"
            }
          ],
          "subjects": [
            {
              "value": [
                "Humanities"
              ],
              "kind": "Subject scheme not provided"
            }
          ],
          "summary": [
            "This is the Library and Information Science subject page"
          ],
          "timdex_provenance": {
            "source": "libguides",
            "run_date": "2025-02-28",
            "run_id": "9ec436e6-df82-44fb-803a-25134b700e94",
            "run_record_offset": 9
          },
          "alternate_titles": [ # proof that new field was added!
            {
              "kind": "Test",
              "value": "Test Alternate Title"
            }
          ]
        }
      }
    ]
  }
}

Includes new or updated dependencies?

NO

Changes expectations for external applications?

YES - TIMDEX Pipeline Lambdas will require updates to generate
the 'bulk-update-embeddings' CLI command.

What are the relevant tickets?

Why these changes are being introduced:
* TIM requires a CLI command to update existing docs with
embeddings. Currently, TIM uses the OpenSearch
'index' action to create/update docs using *full* records,
which performs a full replacement of existing docs. To update docs
using *partial* records (e.g., only a subset of TIMDEX fields
[i.e., embeddings]), the OpenSearch 'update' action must be used.

How this addresses that need:
* Add 'bulk_update_embeddings' CLI command
* Create BulkOperationError exception
* Update tim.helpers.generate_bulk_actions to format inputs for
OpenSearch 'update' action

Side effects of this change:
* TIMDEX Pipeline Lambdas will require updates to generate
the 'bulk-update-embeddings' CLI command.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-122
@jonavellecuerdo jonavellecuerdo requested review from a team and ghukill November 12, 2025 17:46
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review November 12, 2025 18:01
Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Despite my "Request changes" here, I think it looks fantastic. Nice work. Thanks for the discussions along the way which helped my understanding.

My only request is a docstring for the CLI command. Even though some of the functionality is stubbed, I think this stubbing work can include what the CLI command will do.

Otherwise, fully approved!

return result


def bulk_update(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commenting here broadly: I think the addition of this helper was a great idea. As discussed off-PR, I think it's the right call to introduce code duplication now and get things working, and then could optionally take a second pass and consider combining bulk_index, bulk_delete, and bulk_update into some kind of bulk_operation with something like action=index|delete|update.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ghukill Should I go add a ticket in the USE backlog or create a GH issue? 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try a GH issue! It doesn't feel at all required, just sort of optional code hygene and maintenance consideration. If we don't touch it, no problem, but can get a feel if GH issues are valuable + remembered.

@ghukill ghukill self-requested a review November 17, 2025 18:35
@jonavellecuerdo jonavellecuerdo merged commit d0511c0 into main Nov 17, 2025
3 checks passed
@jonavellecuerdo jonavellecuerdo deleted the USE-122-scaffold-add-embeddings-to-existing-documents branch November 17, 2025 19:14
@coveralls
Copy link

Pull Request Test Coverage Report for Build 19438830847

Details

  • 17 of 57 (29.82%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-8.0%) to 89.397%

Changes Missing Coverage Covered Lines Changed/Added Lines %
tim/helpers.py 4 5 80.0%
tim/errors.py 2 8 25.0%
tim/cli.py 10 23 43.48%
tim/opensearch.py 1 21 4.76%
Totals Coverage Status
Change from base Build 18719983394: -8.0%
Covered Lines: 430
Relevant Lines: 481

💛 - Coveralls

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants