feat: URL store data model and data access #1050

HollywoodTonight · 2025-10-27T16:01:35Z

Please ensure your pull request adheres to the following guidelines:

make sure to link the related issues in this description
when merging / squashing, make sure the fixed issue references are visible in the commits, for easy compilation of release notes

Related Issues

Thanks for contributing!

packages/spacecat-shared-data-access/src/models/audit-url/audit-url.schema.js

- Add rank and traffic fields to AuditUrl schema (optional, nullable)
- Implement sortAuditUrls static method for sorting by multiple fields
- Add allBySiteIdSorted and allBySiteIdAndSourceSorted methods
- Update allBySiteIdAndAuditType to support sorting
- Add TypeScript definitions for new fields and methods
- Add 20 comprehensive unit tests for sorting functionality
- All tests passing (1147 total)
- Code coverage: 98.01% for audit-url module
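
As an illustration of the sorting described in this commit, here is a minimal sketch of how a nullable field such as rank could be sorted with missing values placed last. The options object, the helper name readField, and the nulls-last rule are assumptions made for this example; the PR's actual sortAuditUrls implementation is not shown in this thread and may differ.

```javascript
// Hypothetical helper: read a field through a getter such as getRank()
// when the item exposes one, otherwise read the plain property.
function readField(item, field) {
  const getter = `get${field[0].toUpperCase()}${field.slice(1)}`;
  return typeof item[getter] === 'function' ? item[getter]() : item[field];
}

// Sketch of a sort over one selectable field. Null and undefined values
// are placed last in both directions (an assumption, not the PR's rule).
function sortAuditUrls(items, { sortBy = 'rank', sortOrder = 'asc' } = {}) {
  const direction = sortOrder === 'desc' ? -1 : 1;
  // Copy first so the caller's array is not mutated.
  return [...items].sort((a, b) => {
    const x = readField(a, sortBy);
    const y = readField(b, sortBy);
    if (x == null && y == null) return 0;
    if (x == null) return 1;
    if (y == null) return -1;
    if (x < y) return -direction;
    if (x > y) return direction;
    return 0;
  });
}

const sorted = sortAuditUrls(
  [
    { url: 'https://example.com/a', rank: 3 },
    { url: 'https://example.com/b', rank: null },
    { url: 'https://example.com/c', rank: 1 },
  ],
  { sortBy: 'rank', sortOrder: 'asc' },
);
console.log(sorted.map((u) => u.url));
```

With the sample data above, the entry without a rank ends up last whether the order is ascending or descending.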

- Add NODE_OPTIONS with --max-old-space-size=4096 to prevent heap out of memory errors
- Fixes FATAL ERROR: JavaScript heap out of memory in CI pipeline

Add platformType field to AuditUrl schema to categorize URLs as primary-site or offsite platforms (Wikipedia, YouTube, social media, etc.).

Changes:
- Add platformType attribute with 11 supported platform types
- Add GSI for efficient querying by siteId and platformType
- Add collection methods: allBySiteIdAndPlatform(), allOffsiteUrls()
- Add model helper methods: isOffsitePlatform(), isPlatformType()
- Export PLATFORM_TYPES constant
- Update TypeScript definitions
- Add 33 comprehensive unit tests

Platform types supported: primary-site, wikipedia, youtube-channel, reddit-community, facebook-page, twitter-profile, linkedin-company, instagram-account, tiktok-account, github-org, medium-publication

All methods support sorting and pagination.
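
To make the platform categorization concrete, the following sketch shows one way the constant and the two helpers could fit together. The string values are the eleven platform types listed in the commit message; the key names, the standalone function signatures, and the rule "offsite means any known type other than primary-site" are assumptions for illustration, since the PR's code for this commit is not shown here.

```javascript
// Values come from the commit message; the key names are assumed.
const PLATFORM_TYPES = Object.freeze({
  PRIMARY_SITE: 'primary-site',
  WIKIPEDIA: 'wikipedia',
  YOUTUBE_CHANNEL: 'youtube-channel',
  REDDIT_COMMUNITY: 'reddit-community',
  FACEBOOK_PAGE: 'facebook-page',
  TWITTER_PROFILE: 'twitter-profile',
  LINKEDIN_COMPANY: 'linkedin-company',
  INSTAGRAM_ACCOUNT: 'instagram-account',
  TIKTOK_ACCOUNT: 'tiktok-account',
  GITHUB_ORG: 'github-org',
  MEDIUM_PUBLICATION: 'medium-publication',
});

const KNOWN_PLATFORM_TYPES = new Set(Object.values(PLATFORM_TYPES));

// Hypothetical standalone versions of the model helpers.
function isPlatformType(auditUrl, type) {
  return auditUrl.platformType === type;
}

function isOffsitePlatform(auditUrl) {
  // Assumed rule: any known type except primary-site counts as offsite.
  return KNOWN_PLATFORM_TYPES.has(auditUrl.platformType)
    && auditUrl.platformType !== PLATFORM_TYPES.PRIMARY_SITE;
}

const wikiUrl = { url: 'https://en.wikipedia.org/wiki/Example', platformType: 'wikipedia' };
console.log(isOffsitePlatform(wikiUrl)); // true
console.log(isPlatformType(wikiUrl, PLATFORM_TYPES.PRIMARY_SITE)); // false
```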

nitinja

LGTM, can't find any issues. Modern JS optional chaining can be used to shorten some of the functions, for example:

this.getPlatformType?.() ?? this.platformType
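
A small sketch comparing the two forms side by side may help here. The classes AuditUrlLike and WithGetter are hypothetical stand-ins, not the PR's model class; the ternary form mirrors the pattern used in isCustomerUrl in this PR.

```javascript
// Hypothetical stand-in for the model, used only to compare both forms.
class AuditUrlLike {
  constructor(platformType) {
    this.platformType = platformType;
  }

  // Longer form: call the getter if it exists, else read the property.
  platformTypeTernary() {
    return this.getPlatformType ? this.getPlatformType() : this.platformType;
  }

  // Shorter form using optional call and nullish coalescing.
  platformTypeOptional() {
    return this.getPlatformType?.() ?? this.platformType;
  }
}

// Variant that does provide the getter.
class WithGetter extends AuditUrlLike {
  getPlatformType() {
    return 'wikipedia';
  }
}

const plain = new AuditUrlLike('primary-site');
const withGetter = new WithGetter('primary-site');
console.log(plain.platformTypeTernary(), plain.platformTypeOptional());
console.log(withGetter.platformTypeTernary(), withGetter.platformTypeOptional());
```

The two forms are not strictly identical: if the getter exists but returns null or undefined, the optional-chaining form falls back to the property, while the ternary form returns the getter's result as is.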

nitinja · 2025-11-18T20:24:08Z

I briefly discussed this with @ravkiran.
We need a way to store one or more custom fields specific to a particular type of audit, just as the Opportunity model has a data field.
For high-value pages, that would be the AI rationale (and possibly a couple more such fields, like url-score), but it could be anything else.

This reverts commit 483b27a.

Replace the 'source' string attribute with 'byCustomer' boolean to simplify the data model and clearly distinguish between customer-added (true) and system-added (false) URLs.

Changes:
- Schema: Replace 'source' with 'byCustomer' (boolean, default: true)
- Model: Replace isManualSource() with isCustomerUrl()
- Collection: Replace allBySiteIdAndSource with allBySiteIdByCustomer
- Collection: Replace allBySiteIdAndSourceSorted with allBySiteIdByCustomerSorted
- Collection: Replace removeForSiteIdAndSource with removeForSiteIdByCustomer
- Update GSI index from siteId+source to siteId+byCustomer
- Update TypeScript definitions

Migration mapping:
- source='manual' → byCustomer=true
- source='sitemap'/'discovery'/other → byCustomer=false
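
The migration mapping above can be sketched as a pair of small functions. The names sourceToByCustomer and migrateRecord are hypothetical, and treating a missing source as "other" is an assumption; the commit message does not say how such records are handled.

```javascript
// Mapping from the commit message: only 'manual' maps to true;
// 'sitemap', 'discovery' and any other value map to false.
// Assumption: a missing source is treated like "other".
function sourceToByCustomer(source) {
  return source === 'manual';
}

// Hypothetical record migration: drop 'source', add 'byCustomer'.
function migrateRecord(record) {
  const { source, ...rest } = record;
  return { ...rest, byCustomer: sourceToByCustomer(source) };
}

const migrated = migrateRecord({
  siteId: 'site-1',
  url: 'https://example.com/page',
  source: 'sitemap',
});
console.log(migrated);
// { siteId: 'site-1', url: 'https://example.com/page', byCustomer: false }
```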

- Update test fixtures to use byCustomer instead of source
- Update model tests to use isCustomerUrl instead of isManualSource
- Update collection tests for byCustomer methods

- Update all unit tests for model and collection
- Update all integration tests
- Replace source with byCustomer throughout
- Replace isManualSource with isCustomerUrl
- Replace allBySiteIdAndSource with allBySiteIdByCustomer
- Replace removeForSiteIdAndSource with removeForSiteIdByCustomer

Resolve conflict in test/it/util/db.js:
- Use main branch's approach for disabling debug logging
- Keep console as default logger and disable debug inline
- Both approaches achieve same goal of preventing memory issues

iuliag

Some observations below about the areas to clean up for this iteration, and about modelling audits as a set so that filtering can happen at query time.

iuliag · 2025-12-02T16:41:14Z

packages/spacecat-shared-data-access/src/models/audit-url/audit-url.model.js

+   * @returns {boolean} True if the URL was added by a customer.
+   */
+  isCustomerUrl() {
+    const byCustomer = this.getByCustomer ? this.getByCustomer() : this.byCustomer;


Why do we need this kind of check here and above in the file, i.e. if function not falsy then call function, otherwise return as property?

iuliag · 2025-12-02T16:43:50Z

packages/spacecat-shared-data-access/src/models/audit-url/audit-url.schema.js

+    required: true,
+    default: [],
+  })
+  .addAttribute('rank', {


Please remove rank and traffic.
There are several "traffic" metrics that could be associated with a URL (organic, paid, agentic etc.), so these should end up as custom fields once custom fields are implemented.
There's no absolute rank for URLs, it all depends on use case.

iuliag · 2025-12-02T16:48:10Z

packages/spacecat-shared-data-access/src/models/audit-url/audit-url.collection.js

+
+      // Get values using getter methods if available
+      switch (sortBy) {
+        case 'rank':


I thought we agreed that in this first phase we don't need any sorting, rank, traffic or any other custom fields, just these fields https://wiki.corp.adobe.com/display/AEMSites/AEM+Sites+Optimizer+-+URL+Store#AEMSitesOptimizerURLStore-Datamodel, so that we can expedite the APIs through which customers can provide additional URLs:
https://wiki.corp.adobe.com/display/AEMSites/AEM+Sites+Optimizer+-+URL+Store#AEMSitesOptimizerURLStore-Part1covers3separateconcerns:~:text=Customer%2Dprovided%20URLs%20to%20be%20audited%20(for%20all%20opportunity%20types%20or%20per%20opportunity%20type)%20in%20addition%20to%20the%20currently%20audited%20pages

iuliag · 2025-12-02T17:01:02Z

packages/spacecat-shared-data-access/src/models/audit-url/audit-url.schema.js

+    default: true,
+  })
+  .addAttribute('audits', {
+    type: 'list',


If you turn this into a set and add a GSI by audits, you should be able to filter directly at query time.

iuliag · 2025-12-02T17:18:57Z

packages/spacecat-shared-data-access/src/models/audit-url/audit-url.collection.js

+
+  /**
+   * Gets all audit URLs for a site that have a specific audit type enabled.
+   * Note: This performs filtering after retrieval since audits is an array.


If you specify the audits as a set with a GSI then you should be able to filter directly when querying DynamoDB.
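
For context, the post-retrieval filtering that the doc comment in this hunk describes might look like the sketch below. The function name filterByAuditType, the getter name getAudits, and the audit type strings are assumptions for illustration; the sample data stands in for records already fetched for one site.

```javascript
// Sketch of filtering in code after retrieval: keep only the URLs whose
// audits array contains the requested audit type.
function filterByAuditType(auditUrls, auditType) {
  return auditUrls.filter((item) => {
    // Support both model instances (assumed getter) and plain objects.
    const audits = typeof item.getAudits === 'function' ? item.getAudits() : item.audits;
    return Array.isArray(audits) && audits.includes(auditType);
  });
}

// Hypothetical records, as if already loaded for a single siteId.
const fetched = [
  { url: 'https://example.com/a', audits: ['cwv', 'broken-backlinks'] },
  { url: 'https://example.com/b', audits: [] },
  { url: 'https://example.com/c', audits: ['cwv'] },
];

console.log(filterByAuditType(fetched, 'cwv').map((u) => u.url));
```

Every record for the site is read before the filter runs, which is the cost the review comment is pointing at.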

iuliag · 2025-12-02T17:24:10Z

packages/spacecat-shared-data-access/test/it/audit-url/audit-url.test.js

+  });
+
+  it('finds one audit URL by id', async () => {
+    const auditUrl = await AuditUrl.findById(sampleData.auditUrls[0].getId());


Isn't the siteId + URL the composite primary key for the entity? Do we need to have an additional "id" for it?

sandsinh · 2025-12-02T19:20:32Z

packages/spacecat-shared-data-access/src/models/audit-url/audit-url.schema.js

+4. Get URLs by audit type: allBySiteIdAndAuditType(siteId, auditType) - filtered in code
+
+Indexes:
+- Primary: siteId (PK) + url (SK) - for unique identification


How is this enforced by the schema?

sandsinh · 2025-12-02T19:21:41Z

packages/spacecat-shared-data-access/src/models/audit-url/audit-url.schema.js

+    required: false,
+    default: null,
+  })
+  .addAttribute('createdAt', {


Redundant: the schema builder already adds this attribute. The same goes for updatedAt and updatedBy.

Tud OnBoarding Url store

56b73a4

github-advanced-security bot found potential problems Oct 27, 2025

packages/spacecat-shared-data-access/src/models/audit-url/audit-url.schema.js (fixed)

HollywoodTonight requested a review from a team November 4, 2025 11:28

HollywoodTonight and others added 5 commits November 14, 2025 15:14

chore: trigger PR update

6ea2abb

fix(ci): increase Node.js memory limit to 4GB

1827a40

Merge branch 'main' into urlStore_Tud

425d8bf

nitinja approved these changes Nov 18, 2025

iuliag changed the title from "URL Store" to "feat: URL store data model and data access" Nov 18, 2025

HollywoodTonight added 5 commits November 27, 2025 17:41

Revert "feat(url-store): add platformType support for offsite URLs"

44df5e3

test: update tests for byCustomer refactor (partial)

ef3253c

Merge branch 'main' into urlStore_Tud

2f44422

iuliag requested changes Dec 2, 2025

sandsinh reviewed Dec 2, 2025
