@tw4l tw4l commented Nov 18, 2025

Fixes #2957

Full backend and frontend implementation, with a new email notification to org admins when a crawl is paused because an org quota has been reached.

Backend changes

  • Modify operator to auto-pause crawls when quotas are reached or archiving is disabled rather than stopping the crawls
  • Add new crawl states: paused_storage_quota_reached, paused_time_quota_reached, paused_org_readonly
  • Add uploaded WACZs to org storage totals immediately after upload so that auto-paused crawls will actually put the org's bytesStored above the storage quota
  • Send an email from new template to all org admins when a crawl is auto-paused with information about what to do
  • Fix datetime deprecation in tests
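The quota checks described in the bullets above could look roughly like the following sketch. This is illustrative only: names like `Org` and `choose_paused_state` are hypothetical, not the actual operator API, though the three paused state strings match the ones introduced in this PR.

```python
# Hypothetical sketch of mapping org quota conditions to the new paused
# crawl states; the Org dataclass and function name are illustrative.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Org:
    bytes_stored: int
    storage_quota: int  # 0 means no storage quota set
    exec_minutes_used: int
    exec_minutes_quota: int  # 0 means no execution time quota set
    read_only: bool


def choose_paused_state(org: Org, pending_crawl_bytes: int) -> Optional[str]:
    """Return the auto-pause state for a running crawl, or None to keep running."""
    if org.read_only:
        return "paused_org_readonly"
    # Count bytes the crawl has written but not yet committed, so the crawl
    # pauses before the org actually exceeds its storage quota
    if org.storage_quota and org.bytes_stored + pending_crawl_bytes >= org.storage_quota:
        return "paused_storage_quota_reached"
    if org.exec_minutes_quota and org.exec_minutes_used >= org.exec_minutes_quota:
        return "paused_time_quota_reached"
    return None
```
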

Updated nightly tests all pass: https://github.com/webrecorder/browsertrix/actions/runs/19684324914

Frontend changes

  • Add new paused crawl states
  • Update checks throughout frontend for whether crawl is paused to compare against all paused states

Dependencies

Relies on crawler changes introduced in webrecorder/browsertrix-crawler#919

Out of scope

Crawl workflow counts are a bit off: all crawls that complete are counted as successful regardless of state, and workflow storage counts are sometimes incremented incorrectly. I started trying to address that in this branch, but it's a bit involved and may require a migration, so it's best handled separately, I think. Issue: #3011

@tw4l tw4l force-pushed the issue-2957-pause-crawl-on-quota-reached branch 6 times, most recently from 4e5d015 to 6730c7f on November 25, 2025 17:03
@tw4l tw4l marked this pull request as ready for review November 25, 2025 20:14
Comment on lines +1410 to +1413

# sizes = await redis.hkeys(f"{crawl.id}:size")
# for size in sizes:
# await redis.hmset(f"{crawl.id}:size", {size: 0 for size in sizes})
Member Author

Suggested change
# sizes = await redis.hkeys(f"{crawl.id}:size")
# for size in sizes:
# await redis.hmset(f"{crawl.id}:size", {size: 0 for size in sizes})

Remove before merging

Comment on lines +1543 to +1551
print(f"pending size: {pending_size}", flush=True)
print(f"status.filesAdded: {status.filesAdded}", flush=True)
print(f"status.filesAddedSize: {status.filesAddedSize}", flush=True)
print(f"total: {total_size}", flush=True)
print(
f"org quota: {crawl.org.bytesStored + stats.size} <= {crawl.org.quotas.storageQuota}",
flush=True,
)

Member Author

Suggested change
print(f"pending size: {pending_size}", flush=True)
print(f"status.filesAdded: {status.filesAdded}", flush=True)
print(f"status.filesAddedSize: {status.filesAddedSize}", flush=True)
print(f"total: {total_size}", flush=True)
print(
f"org quota: {crawl.org.bytesStored + stats.size} <= {crawl.org.quotas.storageQuota}",
flush=True,
)

Remove before merging, useful for testing

@tw4l tw4l requested review from SuaYoo, emma-sg and ikreymer November 25, 2025 20:15

tw4l commented Nov 25, 2025

Tagging @emma-sg @SuaYoo for review in addition to @ikreymer , with particular interest in getting your eyes on the frontend, email, and email copy parts of this. Thanks!

@SuaYoo SuaYoo left a comment


Nice! Still doing manual testing; my initial impression is that it's probably worth adding an isPaused helper to utils/crawler.

export function isPaused({ state }: { state: string | null }) {
  return state && (PAUSED_STATES as readonly string[]).includes(state);
}

@ikreymer

We want to send the e-mails multiple times if a crawl reaches quota, then is resumed, then reaches quota again, right?
If so, we should also clear autoPausedEmailsSent when the crawl is running again
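A minimal sketch of what clearing the flag on resume could look like (`autoPausedEmailsSent` is the field name from this discussion; the helper and the surrounding logic are hypothetical, not the actual implementation):

```python
# Hypothetical sketch: keep the emails-sent flag set while a crawl is
# auto-paused, and clear it once the crawl is running again so that a
# later quota pause triggers a fresh notification.
PAUSED_STATES = {
    "paused_storage_quota_reached",
    "paused_time_quota_reached",
    "paused_org_readonly",
}


def update_pause_email_flag(state: str, auto_paused_emails_sent: bool) -> bool:
    """Return the new value of the emails-sent flag for the given crawl state."""
    if state in PAUSED_STATES:
        # Still paused: keep the flag so admins get only one email per pause
        return auto_paused_emails_sent
    # Running (or otherwise not auto-paused): clear the flag so the next
    # auto-pause sends email again
    return False
```
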


tw4l commented Nov 26, 2025

Nice! Still doing manual testing; my initial impression is that it's probably worth adding an isPaused helper to utils/crawler.

export function isPaused({ state }: { state: string | null }) {
  return state && (PAUSED_STATES as readonly string[]).includes(state);
}

I added a helper but made it accept a string or null rather than an object with a state property, as none of the uses of this take an object with that key. Take a look and let me know what you think.


tw4l commented Nov 26, 2025

We want to send the e-mails multiple times if a crawl reaches quota, then is resumed, then reaches quota again, right? If so, we should also clear autoPausedEmailsSent when the crawl is running again

Done, and now storing this state in the db to be more reliable.

@SuaYoo SuaYoo self-requested a review November 26, 2025 19:19

@SuaYoo SuaYoo left a comment


Frontend portion looks good!

@tw4l tw4l force-pushed the issue-2957-pause-crawl-on-quota-reached branch from 7726a59 to 0ad1644 on November 26, 2025 20:34

@emma-sg emma-sg left a comment


Email language looks good! Left a few suggestions: one splitting a sentence into two, and a few just using curly quotes or removing unused code. Nice work!

I'll take another look for frontend & backend changes, just wanted to get you some feedback on the email template now.

@SuaYoo SuaYoo force-pushed the issue-2957-pause-crawl-on-quota-reached branch from d2cba1b to 97dd148 on November 27, 2025 00:03
@ikreymer ikreymer added this to the 1.21 Release milestone Dec 2, 2025
@tw4l tw4l force-pushed the issue-2957-pause-crawl-on-quota-reached branch 2 times, most recently from 1aa8519 to 93b2bfd on December 2, 2025 20:25
- Backend implementation with new crawl pause states:
paused_storage_quota_reached, paused_time_quota_reached,
paused_org_readonly
- Send an email to all org admins when crawl is auto-paused
- Frontend updates

Partially dependent on crawler changes introduced in
webrecorder/browsertrix-crawler#919
@ikreymer ikreymer force-pushed the issue-2957-pause-crawl-on-quota-reached branch from 93b2bfd to 6b9d101 on December 2, 2025 20:58

Development

Successfully merging this pull request may close these issues.

[Feature]: When a quota is reached, the crawl should be paused instead of stopped.

5 participants