-
Notifications
You must be signed in to change notification settings - Fork 15
feat: URL store data model and data access #1050
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
packages/spacecat-shared-data-access/src/models/audit-url/audit-url.schema.js
Fixed
Show fixed
Hide fixed
- Add rank and traffic fields to AuditUrl schema (optional, nullable) - Implement sortAuditUrls static method for sorting by multiple fields - Add allBySiteIdSorted and allBySiteIdAndSourceSorted methods - Update allBySiteIdAndAuditType to support sorting - Add TypeScript definitions for new fields and methods - Add 20 comprehensive unit tests for sorting functionality - All tests passing (1147 total) - Code coverage: 98.01% for audit-url module
- Add NODE_OPTIONS with --max-old-space-size=4096 to prevent heap out of memory errors - Fixes FATAL ERROR: JavaScript heap out of memory in CI pipeline
Add platformType field to AuditUrl schema to categorize URLs as primary-site or offsite platforms (Wikipedia, YouTube, social media, etc.). Changes: - Add platformType attribute with 11 supported platform types - Add GSI for efficient querying by siteId and platformType - Add collection methods: allBySiteIdAndPlatform(), allOffsiteUrls() - Add model helper methods: isOffsitePlatform(), isPlatformType() - Export PLATFORM_TYPES constant - Update TypeScript definitions - Add 33 comprehensive unit tests Platform types supported: primary-site, wikipedia, youtube-channel, reddit-community, facebook-page, twitter-profile, linkedin-company, instagram-account, tiktok-account, github-org, medium-publication All methods support sorting and pagination.
nitinja
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, cant find any issues. Modern JS optional chaining can be used to shorten some of the functions like
this.getPlatformType?.() ?? this.platformType
|
I briefly discussed this with @ravkiran |
This reverts commit 483b27a.
Replace the 'source' string attribute with 'byCustomer' boolean to simplify the data model and clearly distinguish between customer-added (true) and system-added (false) URLs. Changes: - Schema: Replace 'source' with 'byCustomer' (boolean, default: true) - Model: Replace isManualSource() with isCustomerUrl() - Collection: Replace allBySiteIdAndSource with allBySiteIdByCustomer - Collection: Replace allBySiteIdAndSourceSorted with allBySiteIdByCustomerSorted - Collection: Replace removeForSiteIdAndSource with removeForSiteIdByCustomer - Update GSI index from siteId+source to siteId+byCustomer - Update TypeScript definitions Migration mapping: - source='manual' → byCustomer=true - source='sitemap'/'discovery'/other → byCustomer=false
- Update test fixtures to use byCustomer instead of source - Update model tests to use isCustomerUrl instead of isManualSource - Update collection tests for byCustomer methods
- Update all unit tests for model and collection - Update all integration tests - Replace source with byCustomer throughout - Replace isManualSource with isCustomerUrl - Replace allBySiteIdAndSource with allBySiteIdByCustomer - Replace removeForSiteIdAndSource with removeForSiteIdByCustomer
Resolve conflict in test/it/util/db.js: - Use main branch's approach for disabling debug logging - Keep console as default logger and disable debug inline - Both approaches achieve same goal of preventing memory issues
iuliag
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some observations below about the areas to cleanup for this iteration and having audits as a set to be able to filter at query time.
| * @returns {boolean} True if the URL was added by a customer. | ||
| */ | ||
| isCustomerUrl() { | ||
| const byCustomer = this.getByCustomer ? this.getByCustomer() : this.byCustomer; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this kind of check here and above in the file, i.e. if function not falsy then call function, otherwise return as property?
| required: true, | ||
| default: [], | ||
| }) | ||
| .addAttribute('rank', { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove rank and traffic.
There are several "traffic" metrics that could be associated with a URL (organic, paid, agentic etc.), so these should end up as custom fields when these will be implemented.
There's no absolute rank for URLs, it all depends on use case.
|
|
||
| // Get values using getter methods if available | ||
| switch (sortBy) { | ||
| case 'rank': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we agreed that in this first phase we don't need any sorting, rank, traffic or any other custom fields, just these fields https://wiki.corp.adobe.com/display/AEMSites/AEM+Sites+Optimizer+-+URL+Store#AEMSitesOptimizerURLStore-Datamodel, so that we can expediate the APIs through which customers can provide additional URLs:
https://wiki.corp.adobe.com/display/AEMSites/AEM+Sites+Optimizer+-+URL+Store#AEMSitesOptimizerURLStore-Part1covers3separateconcerns:~:text=Customer%2Dprovided%20URLs%20to%20be%20audited%20(for%20all%20opportunity%20types%20or%20per%20opportunity%20type)%20in%20addition%20to%20the%20currently%20audited%20pages
| default: true, | ||
| }) | ||
| .addAttribute('audits', { | ||
| type: 'list', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you turn this into a set and add a GSI by audits, you should be able to filter directly at query time.
|
|
||
| /** | ||
| * Gets all audit URLs for a site that have a specific audit type enabled. | ||
| * Note: This performs filtering after retrieval since audits is an array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you specify the audits as a set with a GSI then you should be able to filter directly when querying DynamoDB.
| }); | ||
|
|
||
| it('finds one audit URL by id', async () => { | ||
| const auditUrl = await AuditUrl.findById(sampleData.auditUrls[0].getId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the siteId + URL the composite primary key for the entity? Do we need to have an additional "id" for it?
| 4. Get URLs by audit type: allBySiteIdAndAuditType(siteId, auditType) - filtered in code | ||
|
|
||
| Indexes: | ||
| - Primary: siteId (PK) + url (SK) - for unique identification |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this enforced from the schema ?
| required: false, | ||
| default: null, | ||
| }) | ||
| .addAttribute('createdAt', { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
redundant, schema builder already adds this attribute, same for updatedAt, updatedBy
Please ensure your pull request adheres to the following guidelines:
Related Issues
Thanks for contributing!