
Conversation


@gfyrag gfyrag commented Dec 1, 2025

  • experiment: improve migration performance
  • fix: speedup another migration
  • chore: comment migration 27 superseded by migration 28
  • chore: modify some migrations

Summary by cubic

Speed up several database migrations by switching to row_number-based batching and adding temporary indexes, reducing runtime and lock contention during upgrades.

  • Refactors
    • Replace OFFSET/LIMIT with row_number range batching in migrations 17, 18, 28, 29, 31, 32, 33.
    • Use temporary tables and indexes to precompute join keys for faster updates; avoid adding heavy constraints in 17.
  • Update only necessary rows and derive inserted_at from NEW_TRANSACTION/REVERTED_TRANSACTION logs instead of the goose_db_version date.
    • Disable migration 27, as 28 supersedes it.

Written for commit f3611ec. Summary will update automatically on new commits.

@gfyrag gfyrag requested a review from a team as a code owner December 1, 2025 13:15

coderabbitai bot commented Dec 1, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Multiple SQL migrations switch from OFFSET/LIMIT pagination to row_number()-based batching, add row_number to views/CTEs, introduce batched/temp-table updates with per-batch commits for moves, adjust batch-bounds semantics in some migrations, and convert one migration into an early no-op notice.
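
In concrete terms, the batching pattern referred to here looks roughly like the following minimal sketch (assuming PostgreSQL 11+ so COMMIT is allowed inside a DO block; the table, column, and index names are illustrative placeholders rather than the migrations' actual ones, and the real migrations derive the updated values from log data rather than using now()):

	do $$
		declare
			_offset integer := 0;
			_batch_size integer := 1000;
		begin
			-- materialize the rows to fix once, keyed by a stable row_number
			create temp table rows_to_fix as
			select row_number() over (order by seq) as row_number, seq
			from transactions
			where inserted_at is null;

			-- index the batching key so each range scan stays cheap
			create index rows_to_fix_idx on rows_to_fix(row_number) include (seq);
			analyze rows_to_fix;

			loop
				-- exclusive lower bound, inclusive upper bound: rows _offset+1 .. _offset+_batch_size
				update transactions
				set inserted_at = now() -- placeholder value; the real migrations compute this from logs
				from rows_to_fix
				where rows_to_fix.row_number > _offset
					and rows_to_fix.row_number <= _offset + _batch_size
					and transactions.seq = rows_to_fix.seq;

				exit when not found; -- the range returned no rows, so all batches are done
				_offset := _offset + _batch_size;
				commit; -- per-batch commit keeps lock durations short
			end loop;
		end
	$$;

Each iteration touches a fixed window of rows by row_number instead of re-scanning and discarding _offset rows the way OFFSET/LIMIT does, which is where most of the speedup comes from.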

Changes

  • Row-number pagination and batching
    Files: internal/storage/bucket/migrations/18-transactions-fill-inserted-at/up.sql, internal/storage/bucket/migrations/28-fix-pcv-missing-asset/up.sql, internal/storage/bucket/migrations/29-fix-invalid-metadata-on-reverts/up.sql, internal/storage/bucket/migrations/31-fix-transaction-updated-at/up.sql, internal/storage/bucket/migrations/32-fix-log-data-for-reverted-transactions/up.sql, internal/storage/bucket/migrations/33-fix-invalid-date-format/up.sql
    Summary: Add row_number() columns to views/CTEs, replace ORDER BY + OFFSET/LIMIT with row_number range filters (using WHERE row_number > _offset AND row_number <= _offset + _batch_size or similar), adjust indices to include row_number, and update batch selection and join filters (e.g., add inserted_at IS NULL or type filters).
  • Batch bounds change
    Files: internal/storage/bucket/migrations/19-transactions-fill-pcv/up.sql, internal/storage/bucket/migrations/34-fix-memento-format/up.sql
    Summary: Change batch predicate semantics from inclusive-lower/exclusive-upper to exclusive-lower/inclusive-upper (e.g., from seq >= _offset AND seq < _offset + _batch_size to seq > _offset AND seq <= _offset + _batch_size), altering which rows fall into each batch.
  • Move update batching & temp-table index
    Files: internal/storage/bucket/migrations/17-moves-fill-transaction-id/up.sql
    Summary: Replace single-pass update with a temporary transactions_ids table (populated from moves JOIN transactions WHERE transactions_id IS NULL), create an index + ANALYZE, and perform batched updates using row_number ranges with per-batch commits; remove prior NOT NULL constraint alteration and legacy exit-when-not-found loop.
  • Disabled migration (no-op)
    Files: internal/storage/bucket/migrations/27-fix-invalid-pcv/up.sql
    Summary: Replace previous migration body with a single RAISE NOTICE 'Migration superseded by next migration' and exit immediately, removing prior data-migration and schema-change steps.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Verify determinism and correctness of each ORDER BY used inside row_number() window functions.
  • Ensure row_number ranges cover full result sets without gaps or overlap and that batch progression variables advance correctly.
  • Review added/changed indexes on row_number (and included columns) for correctness and performance characteristics.
  • Inspect migration 17's temp-table population, index, and batched update join conditions for correctness, concurrency safety, and the implications of removing the NOT NULL alteration.
  • Confirm migration 27 being turned into a no-op is intentional and won't disrupt migration ordering.

Possibly related PRs

Suggested reviewers

  • flemzord

Poem

🐰
I hopped through rows with numbers bright,
Swapped offsets for windowed light,
Temp tables hummed and batches ran,
Commits like carrots in my hand,
Migration fields now stitched up tight.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title check (✅ Passed): The title 'feat: speedup migration' directly and accurately describes the primary objective of the changeset—optimizing database migrations through row_number-based batching and temporary indexes.
  • Description check (✅ Passed): The description is clearly related to the changeset, detailing multiple migration refactors, performance improvements, and specific technical changes across multiple migration files.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/storage/bucket/migrations/27-fix-invalid-pcv/up.sql (1)

1-66: Consider adding a brief deprecation comment to clarify the intentional disabling.

Migration 28 (28-fix-pcv-missing-asset) is active and successfully replaces this migration with an improved row_number()-based pagination approach. The core logic is identical—both filter on the same dataset (insertion_date < version_id 12) and update post_commit_volumes using the same transformation. On fresh deployments, migration 27 runs as an inert no-op and migration 28 handles all processing. On existing databases, migration 27's previous work remains intact and migration 28 will not reprocess already-populated data unnecessarily.

However, adding a brief comment would improve clarity during maintenance:

  set search_path = '{{ .Schema }}';

+-- Superseded by migration 28 (uses row_number() for better pagination performance)
-- 		drop table if exists moves_view;
🧹 Nitpick comments (4)
internal/storage/bucket/migrations/33-fix-invalid-date-format/up.sql (1)

25-37: Index should be optimized for row_number filtering.

The WHERE clause at line 37 filters on row_number >= _offset and row_number < _offset + _batch_size, but the index at line 25 is created on (log_id, id), not row_number. This will cause PostgreSQL to perform a sequential scan of txs_view instead of an index-based range query.

Consider optimizing the index to support row_number filtering:

create index txs_view_idx on txs_view(row_number) include (log_id);

This allows the range query on row_number to use the index and avoids scanning the entire view for each batch.

internal/storage/bucket/migrations/29-fix-invalid-metadata-on-reverts/up.sql (1)

28-40: Index should support row_number range queries.

The WHERE clause at line 40 filters on row_number >= _offset and row_number < _offset + _batch_size, but the index at line 28 is on (reversedTransactionID). This forces a sequential scan of txs_view for each batch iteration.

Consider refactoring the index to:

create index txs_view_idx on txs_view(row_number) include (reversedTransactionID);

This enables efficient range queries on row_number and avoids scanning the entire temporary table per batch.

internal/storage/bucket/migrations/32-fix-log-data-for-reverted-transactions/up.sql (1)

26-38: Index should be optimized for row_number range filtering.

The WHERE clause at line 38 filters on row_number >= _offset and row_number < _offset + _batch_size, but the index at line 26 is on (log_id, id). This will cause sequential scans of txs_view during batch iteration.

Consider optimizing the index:

create index txs_view_idx on txs_view(row_number) include (log_id, id);

This allows PostgreSQL to use the index for row_number range queries, improving batch iteration performance.

internal/storage/bucket/migrations/31-fix-transaction-updated-at/up.sql (1)

20-29: Index should prioritize row_number for efficient batch filtering.

The index at line 20 on seq supports the UPDATE join but does not support the WHERE clause range query on row_number at line 29. PostgreSQL will perform a sequential scan of txs_view for each batch.

Consider creating an index on row_number as well:

create index txs_view_row_number_idx on txs_view(row_number);

This enables efficient range queries on row_number during batch iteration, while the existing seq index supports the UPDATE join condition.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 1e0828a and 3845306.

📒 Files selected for processing (8)
  • internal/storage/bucket/migrations/17-moves-fill-transaction-id/up.sql (1 hunks)
  • internal/storage/bucket/migrations/18-transactions-fill-inserted-at/up.sql (2 hunks)
  • internal/storage/bucket/migrations/27-fix-invalid-pcv/up.sql (1 hunks)
  • internal/storage/bucket/migrations/28-fix-pcv-missing-asset/up.sql (3 hunks)
  • internal/storage/bucket/migrations/29-fix-invalid-metadata-on-reverts/up.sql (2 hunks)
  • internal/storage/bucket/migrations/31-fix-transaction-updated-at/up.sql (2 hunks)
  • internal/storage/bucket/migrations/32-fix-log-data-for-reverted-transactions/up.sql (2 hunks)
  • internal/storage/bucket/migrations/33-fix-invalid-date-format/up.sql (2 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-05-20T13:48:07.455Z
Learnt from: gfyrag
Repo: formancehq/ledger PR: 935
File: internal/controller/system/state_tracker.go:0-0
Timestamp: 2025-05-20T13:48:07.455Z
Learning: In the Formance ledger codebase, sequence reset queries with `select setval` don't require COALESCE around max(id) even for brand new ledgers, as the system handles this case properly.

Applied to files:

  • internal/storage/bucket/migrations/18-transactions-fill-inserted-at/up.sql
  • internal/storage/bucket/migrations/17-moves-fill-transaction-id/up.sql
  • internal/storage/bucket/migrations/33-fix-invalid-date-format/up.sql
📚 Learning: 2025-11-28T10:06:01.381Z
Learnt from: gfyrag
Repo: formancehq/ledger PR: 1167
File: internal/storage/bucket/migrations/43-fix-missing-inserted-at-in-log-data/up.sql:23-23
Timestamp: 2025-11-28T10:06:01.381Z
Learning: In PostgreSQL migrations for the Formance ledger, to ensure ISO 8601/RFC3339 date formatting when setting JSONB fields, use the pattern `to_jsonb(date)#>>'{}'` to extract the timestamp as a properly formatted string. This leverages JSONB's built-in ISO 8601 encoding. Using `date::text` would depend on the server's DateStyle setting and doesn't guarantee ISO 8601 format.

Applied to files:

  • internal/storage/bucket/migrations/18-transactions-fill-inserted-at/up.sql
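
A quick illustration of the pattern mentioned in that learning (standalone SELECTs with an arbitrary example timestamp):

	-- ISO 8601 / RFC3339-style output regardless of the server's DateStyle setting
	select to_jsonb(timestamp '2025-12-01 13:15:00')#>>'{}';   -- 2025-12-01T13:15:00

	-- DateStyle-dependent text cast, not guaranteed to be ISO 8601
	select (timestamp '2025-12-01 13:15:00')::text;            -- 2025-12-01 13:15:00 under the default ISO style
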
📚 Learning: 2025-04-29T11:24:28.923Z
Learnt from: gfyrag
Repo: formancehq/ledger PR: 892
File: internal/controller/ledger/controller_default.go:196-196
Timestamp: 2025-04-29T11:24:28.923Z
Learning: In the ledger Import function, it's critical to maintain proper log ID tracking by updating lastLogID with the current log.ID after each processed log, rather than setting it to nil. This ensures the system can properly validate the ordering of logs and prevent duplicate or out-of-order processing, which is essential for maintaining data integrity in the ledger.

Applied to files:

  • internal/storage/bucket/migrations/18-transactions-fill-inserted-at/up.sql
📚 Learning: 2025-07-19T18:35:34.260Z
Learnt from: gfyrag
Repo: formancehq/ledger PR: 1017
File: internal/storage/bucket/migrations/27-fix-invalid-pcv/up.sql:30-31
Timestamp: 2025-07-19T18:35:34.260Z
Learning: In Formance ledger migrations, SHARE ROW EXCLUSIVE locks on migration-specific temporary tables like moves_view are acceptable since these tables are dedicated for the migration. ACCESS SHARE locks on transactions table don't block writes, only schema changes.

Applied to files:

  • internal/storage/bucket/migrations/17-moves-fill-transaction-id/up.sql
  • internal/storage/bucket/migrations/28-fix-pcv-missing-asset/up.sql
  • internal/storage/bucket/migrations/27-fix-invalid-pcv/up.sql
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: Tests
  • GitHub Check: Dirty
  • GitHub Check: GoReleaser
  • GitHub Check: Tests
  • GitHub Check: Dirty
🔇 Additional comments (3)
internal/storage/bucket/migrations/17-moves-fill-transaction-id/up.sql (1)

8-39: Batch iteration using row_number window function is correctly implemented.

The pattern of creating a temporary table with row_number(), indexing on row_number with included columns, and batching via row_number bounds is well-structured. Index optimization and per-batch commits should improve migration performance.

internal/storage/bucket/migrations/28-fix-pcv-missing-asset/up.sql (1)

10-62: Well-optimized batch processing with efficient indexing.

The index at line 30 on row_number with included columns provides optimal support for both the WHERE clause filtering and the UPDATE join. The batch iteration pattern and if not found exit condition are sound.

internal/storage/bucket/migrations/18-transactions-fill-inserted-at/up.sql (1)

21-46: Batch processing correctly refactored with optimal index design.

The migration cleanly replaces date-based temporal filtering with type-based filtering and implements row_number-based pagination with an optimally-designed index (line 26) that includes all columns needed for the UPDATE join. The added inserted_at is null condition (line 42) also ensures idempotency.

@gfyrag gfyrag force-pushed the feat/speedup-migration branch from 3845306 to 7770a87 on December 1, 2025 13:30

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
internal/storage/bucket/migrations/33-fix-invalid-date-format/up.sql (1)

39-53: Clarify variable shadowing between CTE name and column name.

The UPDATE statement uses data as both the CTE name (FROM data) and the table column being updated (logs.data). PostgreSQL resolves this correctly—CTE-qualified references like data.id resolve to the CTE—but the overlap is confusing for readers. Consider renaming the CTE to something like batch_data to distinguish it from the logs.data column.

  update logs
- set data = data || jsonb_build_object(
+ set data = batch_data || jsonb_build_object(
      'revertedTransaction', jsonb_build_object(
-         'id', data.id,
-         'postings', data.postings::jsonb,
-         'metadata', data.metadata,
+         'id', batch_data.id,
+         'postings', batch_data.postings::jsonb,
+         'metadata', batch_data.metadata,
          'reverted', true,
-         'revertedAt', to_json(data.reverted_at)#>>'{}' || 'Z',
-         'insertedAt', to_json(data.inserted_at)#>>'{}' || 'Z',
-         'timestamp', to_json(data.timestamp)#>>'{}' || 'Z',
-         'reference', case when data.reference is not null and data.reference <> '' then data.reference end,
-         'postCommitVolumes', data.post_commit_volumes
+         'revertedAt', to_json(batch_data.reverted_at)#>>'{}' || 'Z',
+         'insertedAt', to_json(batch_data.inserted_at)#>>'{}' || 'Z',
+         'timestamp', to_json(batch_data.timestamp)#>>'{}' || 'Z',
+         'reference', case when batch_data.reference is not null and batch_data.reference <> '' then batch_data.reference end,
+         'postCommitVolumes', batch_data.post_commit_volumes
      ))
- from data
+ from batch_data
- where logs.id = data.log_id and
-       logs.ledger = data.ledger;
+ where logs.id = batch_data.log_id and
+       logs.ledger = batch_data.ledger;
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3845306 and 7770a87.

📒 Files selected for processing (6)
  • internal/storage/bucket/migrations/17-moves-fill-transaction-id/up.sql (1 hunks)
  • internal/storage/bucket/migrations/28-fix-pcv-missing-asset/up.sql (3 hunks)
  • internal/storage/bucket/migrations/29-fix-invalid-metadata-on-reverts/up.sql (2 hunks)
  • internal/storage/bucket/migrations/31-fix-transaction-updated-at/up.sql (2 hunks)
  • internal/storage/bucket/migrations/32-fix-log-data-for-reverted-transactions/up.sql (2 hunks)
  • internal/storage/bucket/migrations/33-fix-invalid-date-format/up.sql (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • internal/storage/bucket/migrations/28-fix-pcv-missing-asset/up.sql
  • internal/storage/bucket/migrations/32-fix-log-data-for-reverted-transactions/up.sql
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: gfyrag
Repo: formancehq/ledger PR: 1017
File: internal/storage/bucket/migrations/27-fix-invalid-pcv/up.sql:30-31
Timestamp: 2025-07-19T18:35:34.260Z
Learning: In Formance ledger migrations, SHARE ROW EXCLUSIVE locks on migration-specific temporary tables like moves_view are acceptable since these tables are dedicated for the migration. ACCESS SHARE locks on transactions table don't block writes, only schema changes.
📚 Learning: 2025-05-20T13:48:07.455Z
Learnt from: gfyrag
Repo: formancehq/ledger PR: 935
File: internal/controller/system/state_tracker.go:0-0
Timestamp: 2025-05-20T13:48:07.455Z
Learning: In the Formance ledger codebase, sequence reset queries with `select setval` don't require COALESCE around max(id) even for brand new ledgers, as the system handles this case properly.

Applied to files:

  • internal/storage/bucket/migrations/17-moves-fill-transaction-id/up.sql
📚 Learning: 2025-07-19T18:35:34.260Z
Learnt from: gfyrag
Repo: formancehq/ledger PR: 1017
File: internal/storage/bucket/migrations/27-fix-invalid-pcv/up.sql:30-31
Timestamp: 2025-07-19T18:35:34.260Z
Learning: In Formance ledger migrations, SHARE ROW EXCLUSIVE locks on migration-specific temporary tables like moves_view are acceptable since these tables are dedicated for the migration. ACCESS SHARE locks on transactions table don't block writes, only schema changes.

Applied to files:

  • internal/storage/bucket/migrations/17-moves-fill-transaction-id/up.sql
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Tests
  • GitHub Check: Dirty
🔇 Additional comments (4)
internal/storage/bucket/migrations/17-moves-fill-transaction-id/up.sql (2)

8-23: Temporary table and index setup is sound.

The temporary table correctly captures rows needing updates, and the covering index on row_number with INCLUDE columns is well-designed for the batch range queries that follow. Per-table analysis after population is good practice.


25-39: Loop iteration now correctly starts at 1, resolving off-by-one concern.

The batched update logic is correct. The loop now properly aligns with PostgreSQL's row_number() semantics by starting at 1, which resolves the off-by-one issue flagged in the previous review. The range condition row_number >= i and row_number < i + _batch_size correctly partitions all rows without gaps or overlaps. Per-batch commits reduce lock contention on the moves table—a key optimization for migration performance.
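
For reference, the loop shape being described here is roughly the following (a sketch with a hard-coded _max; in the actual migration the bound comes from the temporary transactions_ids table, and the join key shown is an assumption):

	do $$
		declare
			_batch_size integer := 1000;
			_max integer := 10000; -- in the migration this is derived from the temp table, not hard-coded
		begin
			-- i takes the values 1, 1001, 2001, ..., so each batch covers row_number i .. i + _batch_size - 1
			for i in 1.._max by _batch_size loop
				update moves
				set transactions_id = transactions_ids.transactions_id
				from transactions_ids
				where transactions_ids.row_number >= i
					and transactions_ids.row_number < i + _batch_size
					and moves.seq = transactions_ids.moves_seq; -- assumed join key
				commit; -- per-batch commit, as in the migration
			end loop;
		end
	$$;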

internal/storage/bucket/migrations/29-fix-invalid-metadata-on-reverts/up.sql (1)

20-56: Row_number batching implementation is sound.

The refactor from OFFSET/LIMIT to row_number-based pagination is implemented cleanly. Deterministic ordering via order by transactions.seq ensures consistent batching, and the per-batch commits reduce lock contention during the migration. Index on reversedTransactionID appropriately supports the join condition.

internal/storage/bucket/migrations/31-fix-transaction-updated-at/up.sql (1)

10-21: Verify temporary table declaration and foreign key necessity.

Line 10 creates a table without the TEMP keyword, unlike migrations 29 and 33 which use create temp table. Confirm whether this should be create temp table for consistency and to ensure the table is session-scoped and automatically dropped.

Additionally, the foreign key constraint on line 21 is unusual for temporary tables. While indexes (line 20) are standard for optimizing joins, foreign keys are primarily for data integrity. Verify whether the foreign key serves a purpose beyond the index or if it can be removed. The comment "speed up hash join" suggests intent for query optimization, which indexes typically provide; the foreign key may be redundant.


codecov bot commented Dec 1, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.43%. Comparing base (1e0828a) to head (f3611ec).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1173   +/-   ##
=======================================
  Coverage   81.43%   81.43%           
=======================================
  Files         190      190           
  Lines        9334     9334           
=======================================
  Hits         7601     7601           
- Misses       1228     1229    +1     
+ Partials      505      504    -1     

☔ View full report in Codecov by Sentry.

@gfyrag gfyrag force-pushed the feat/speedup-migration branch from 7770a87 to 3f0c2c3 on December 1, 2025 13:36

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
internal/storage/bucket/migrations/27-fix-invalid-pcv/up.sql (2)

3-4: Remove unused variable declarations.

Since all code using _offset and _batch_size is commented out, these declarations are dead code and should be removed.

 do $$
 	declare
-		_offset integer := 0;
-		_batch_size integer := 1000;
 	begin
 		set search_path = '{{ .Schema }}';

8-64: Replace commented-out implementation with explicit supersession marker.

The entire migration logic is commented out because migration 28 supersedes this one. Rather than leaving 57 lines of commented code, replace it with a single-line comment clearly documenting the supersession intent. This improves maintainability and signals to future readers why the migration is inert.

 	begin
 		set search_path = '{{ .Schema }}';
 
-		-- [58 lines of commented-out implementation omitted...]
+		-- Migration 27 is superseded by migration 28 (fix-pcv-missing-asset).
+		-- All post-commit volume fixes are now handled by migration 28 using row_number-based batching.
 	end
 $$;

This keeps the file intact (in case the migration framework requires it) while making intent explicit and removing the cognitive burden of maintaining large commented-out blocks.

internal/storage/bucket/migrations/29-fix-invalid-metadata-on-reverts/up.sql (1)

28-28: Remove potentially unused index.

The index on reversedTransactionID appears redundant:

  • Not used in the WHERE clause (which filters by row_number)
  • Not used in the UPDATE join (which filters from an already-sliced temp table)

Temporary table indexes are cleaned up with the table, but removing this index reduces migration overhead.

-	create index txs_view_idx on txs_view(reversedTransactionID);
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 7770a87 and 3f0c2c3.

📒 Files selected for processing (8)
  • internal/storage/bucket/migrations/17-moves-fill-transaction-id/up.sql (1 hunks)
  • internal/storage/bucket/migrations/18-transactions-fill-inserted-at/up.sql (2 hunks)
  • internal/storage/bucket/migrations/27-fix-invalid-pcv/up.sql (1 hunks)
  • internal/storage/bucket/migrations/28-fix-pcv-missing-asset/up.sql (3 hunks)
  • internal/storage/bucket/migrations/29-fix-invalid-metadata-on-reverts/up.sql (2 hunks)
  • internal/storage/bucket/migrations/31-fix-transaction-updated-at/up.sql (2 hunks)
  • internal/storage/bucket/migrations/32-fix-log-data-for-reverted-transactions/up.sql (2 hunks)
  • internal/storage/bucket/migrations/33-fix-invalid-date-format/up.sql (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • internal/storage/bucket/migrations/31-fix-transaction-updated-at/up.sql
  • internal/storage/bucket/migrations/18-transactions-fill-inserted-at/up.sql
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: gfyrag
Repo: formancehq/ledger PR: 1017
File: internal/storage/bucket/migrations/27-fix-invalid-pcv/up.sql:30-31
Timestamp: 2025-07-19T18:35:34.260Z
Learning: In Formance ledger migrations, SHARE ROW EXCLUSIVE locks on migration-specific temporary tables like moves_view are acceptable since these tables are dedicated for the migration. ACCESS SHARE locks on transactions table don't block writes, only schema changes.
📚 Learning: 2025-07-19T18:35:34.260Z
Learnt from: gfyrag
Repo: formancehq/ledger PR: 1017
File: internal/storage/bucket/migrations/27-fix-invalid-pcv/up.sql:30-31
Timestamp: 2025-07-19T18:35:34.260Z
Learning: In Formance ledger migrations, SHARE ROW EXCLUSIVE locks on migration-specific temporary tables like moves_view are acceptable since these tables are dedicated for the migration. ACCESS SHARE locks on transactions table don't block writes, only schema changes.

Applied to files:

  • internal/storage/bucket/migrations/28-fix-pcv-missing-asset/up.sql
  • internal/storage/bucket/migrations/27-fix-invalid-pcv/up.sql
  • internal/storage/bucket/migrations/17-moves-fill-transaction-id/up.sql
📚 Learning: 2025-05-20T13:48:07.455Z
Learnt from: gfyrag
Repo: formancehq/ledger PR: 935
File: internal/controller/system/state_tracker.go:0-0
Timestamp: 2025-05-20T13:48:07.455Z
Learning: In the Formance ledger codebase, sequence reset queries with `select setval` don't require COALESCE around max(id) even for brand new ledgers, as the system handles this case properly.

Applied to files:

  • internal/storage/bucket/migrations/32-fix-log-data-for-reverted-transactions/up.sql
  • internal/storage/bucket/migrations/17-moves-fill-transaction-id/up.sql
🔇 Additional comments (7)
internal/storage/bucket/migrations/28-fix-pcv-missing-asset/up.sql (3)

11-12: Excellent row_number-based batching refactor.

The migration effectively addresses the performance goal by replacing OFFSET/LIMIT pagination with row_number()-based range filtering. The index design on row_number with included columns (transactions_seq, volumes) is optimal for covering the CTE query (lines 42–46), and per-batch commits (line 61) reduce lock contention during long-running updates.

Also applies to: 30-30, 45-45


32-32: Foreign key and early-exit optimizations are sound.

The foreign key constraint (line 32) supports hash-join acceleration during the UPDATE, and the empty-view check (lines 34–37) avoids unnecessary work. Both align with prior learnings on acceptable lock strategies for migration-scoped temporary tables.

Also applies to: 34-37


42-62: Verify termination logic handles all edge cases.

The loop termination condition (line 52: if not found then) relies on the UPDATE matching zero rows when the row_number range exhausts all data. This is correct, but ensure the logic is resilient:

  • First batch: row_number >= 0 and row_number < 1000 — processes rows 1–999; if rows exist, UPDATE succeeds, found = true, loop continues.
  • Final batch: row_number >= N and row_number < N + 1000 — range is empty or depleted; UPDATE matches 0 rows, found = false, loop exits.

Confirm through testing that the cursor position (_offset) advances correctly and no rows are skipped or double-processed across batch boundaries.
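
One lightweight way to confirm that is a post-migration check counting rows that still match the fix condition (the predicate below is a stand-in; the authoritative predicate lives in migration 28 itself):

	-- after the migration, no row should still be waiting for a batch
	select count(*) as remaining
	from transactions
	where post_commit_volumes is null; -- assumed condition; expect remaining = 0 if no range was skipped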

internal/storage/bucket/migrations/33-fix-invalid-date-format/up.sql (1)

19-61: Row_number-based batching implementation is correct and performant.

The migration correctly implements the row_number pagination pattern: stable ordering by transactions.seq ensures deterministic batching across iterations, the range filter (row_number >= _offset and row_number < _offset + _batch_size) avoids expensive OFFSET scanning, and per-batch commits reduce lock contention. The temporary index on txs_view(log_id, id) and early exit for empty results are both optimizations well-suited for a large migration. This aligns with the optimization pattern in the PR.

internal/storage/bucket/migrations/29-fix-invalid-metadata-on-reverts/up.sql (2)

20-20: ✓ Excellent row_number-based batching optimization.

The migration correctly replaces inefficient OFFSET/LIMIT pagination with deterministic row_number range filtering. The window function is ordered by transactions.seq, ensuring consistent batch boundaries and deterministic processing across restarts. The exit condition exit when not found properly detects when all rows have been processed.

Also applies to: 40-40


44-45: Verify intent: overwriting updated_at with revertedAt.

Line 45 sets updated_at = data.revertedAt, which overwrites the actual update timestamp with the revert date. Confirm this is intentional for this migration's semantics and won't cause issues with audit trails or other logic depending on actual modification times.

internal/storage/bucket/migrations/17-moves-fill-transaction-id/up.sql (1)

1-41: Loop logic is correct; past off-by-one concern does not apply.

The loop correctly covers all rows. With for i in 1.._max by _batch_size and row_number >= i AND row_number < i + _batch_size:

  • For _max=10,000 and _batch_size=1,000: iterations cover rows 1-1000, 1001-2000, ..., 9001-10000 (all rows).
  • For non-divisible totals (e.g., 10,001 rows): the loop extends to i=10001 and processes row 10001 via the range check.

The past Graphite comment suggesting a fix appears to reference an earlier version; the current code starting at i=1 is correct.

Strengths of this refactor:

  • Deterministic row_number()-based batching replaces OFFSET/LIMIT, improving performance on large result sets.
  • Index on row_number with included columns optimizes batch filtering.
  • Per-batch commits reduce lock contention during the long migration.
  • The WHERE transactions_id is null filter ensures idempotency (re-runs don't duplicate updates).


@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 8 files

Prompt for AI agents (all 1 issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="internal/storage/bucket/migrations/18-transactions-fill-inserted-at/up.sql">

<violation number="1" location="internal/storage/bucket/migrations/18-transactions-fill-inserted-at/up.sql:37">
P2: The batching predicate skips every `_batch_size`-th row (including the last row when total rows are a multiple of the batch size), leaving some transactions’ inserted_at unset.</violation>
</file>

Reply to cubic to teach it or ask questions. Re-run a review with @cubic-dev-ai review this PR

flemzord previously approved these changes Dec 3, 2025

@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 8 files

Prompt for AI agents (all 2 issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="internal/storage/bucket/migrations/27-fix-invalid-pcv/up.sql">

<violation number="1" location="internal/storage/bucket/migrations/27-fix-invalid-pcv/up.sql:3">
P3: Typo: 'superseeded' should be 'superseded'.</violation>
</file>

<file name="internal/storage/bucket/migrations/32-fix-log-data-for-reverted-transactions/up.sql">

<violation number="1" location="internal/storage/bucket/migrations/32-fix-log-data-for-reverted-transactions/up.sql:38">
P2: Index `txs_view_idx` is on `(log_id, id)`, but the batch selection query now filters by `row_number`. For efficient range scans, consider adding an index on `row_number`. Migration 28 in this same PR demonstrates the correct pattern: `create index ... on ...(row_number) include (...)`. The current index is still useful for the UPDATE join, but won't help with batch selection.</violation>
</file>

Reply to cubic to teach it or ask questions. Re-run a review with @cubic-dev-ai review this PR

@gfyrag gfyrag force-pushed the feat/speedup-migration branch from 4d1261e to 137d69c on December 3, 2025 12:47
@gfyrag gfyrag force-pushed the feat/speedup-migration branch from 137d69c to 77a64f5 on December 3, 2025 12:57

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
internal/storage/bucket/migrations/33-fix-invalid-date-format/up.sql (1)

25-25: Consider whether the temporary index is cost-effective.

The index on txs_view(log_id, id) (line 25) is created on a temporary table. However, the main WHERE filter in the loop (line 37) only conditions on row_number, not on log_id or id. This index may not be directly used by the query planner for row filtering. It could potentially help with the JOIN condition or as a covering index, but inspect whether it meaningfully accelerates the UPDATE or JOIN. For a temp table during migration, a minimal index strategy may suffice.

If profiling shows the index is unused or negligible, consider removing it or narrowing it to only the necessary columns.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 137d69c and 77a64f5.

📒 Files selected for processing (4)
  • internal/storage/bucket/migrations/18-transactions-fill-inserted-at/up.sql (2 hunks)
  • internal/storage/bucket/migrations/27-fix-invalid-pcv/up.sql (1 hunks)
  • internal/storage/bucket/migrations/32-fix-log-data-for-reverted-transactions/up.sql (2 hunks)
  • internal/storage/bucket/migrations/33-fix-invalid-date-format/up.sql (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/storage/bucket/migrations/32-fix-log-data-for-reverted-transactions/up.sql
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: gfyrag
Repo: formancehq/ledger PR: 1017
File: internal/storage/bucket/migrations/27-fix-invalid-pcv/up.sql:30-31
Timestamp: 2025-07-19T18:35:34.260Z
Learning: In Formance ledger migrations, SHARE ROW EXCLUSIVE locks on migration-specific temporary tables like moves_view are acceptable since these tables are dedicated for the migration. ACCESS SHARE locks on transactions table don't block writes, only schema changes.
📚 Learning: 2025-07-19T18:35:34.260Z
Learnt from: gfyrag
Repo: formancehq/ledger PR: 1017
File: internal/storage/bucket/migrations/27-fix-invalid-pcv/up.sql:30-31
Timestamp: 2025-07-19T18:35:34.260Z
Learning: In Formance ledger migrations, SHARE ROW EXCLUSIVE locks on migration-specific temporary tables like moves_view are acceptable since these tables are dedicated for the migration. ACCESS SHARE locks on transactions table don't block writes, only schema changes.

Applied to files:

  • internal/storage/bucket/migrations/18-transactions-fill-inserted-at/up.sql
📚 Learning: 2025-05-20T13:48:07.455Z
Learnt from: gfyrag
Repo: formancehq/ledger PR: 935
File: internal/controller/system/state_tracker.go:0-0
Timestamp: 2025-05-20T13:48:07.455Z
Learning: In the Formance ledger codebase, sequence reset queries with `select setval` don't require COALESCE around max(id) even for brand new ledgers, as the system handles this case properly.

Applied to files:

  • internal/storage/bucket/migrations/18-transactions-fill-inserted-at/up.sql
  • internal/storage/bucket/migrations/33-fix-invalid-date-format/up.sql
📚 Learning: 2025-05-20T13:07:54.504Z
Learnt from: gfyrag
Repo: formancehq/ledger PR: 935
File: internal/controller/system/state_tracker.go:50-55
Timestamp: 2025-05-20T13:07:54.504Z
Learning: In the ledger codebase's `handleState` method, when updating ledger state from `StateInitializing` to `StateInUse`, it's intentional to proceed silently when `rowsAffected == 0`. This indicates another parallel transaction has already updated the ledger state and configured the sequences, so no error needs to be returned and no sequence updating is required.

Applied to files:

  • internal/storage/bucket/migrations/18-transactions-fill-inserted-at/up.sql
📚 Learning: 2025-11-28T10:06:01.381Z
Learnt from: gfyrag
Repo: formancehq/ledger PR: 1167
File: internal/storage/bucket/migrations/43-fix-missing-inserted-at-in-log-data/up.sql:23-23
Timestamp: 2025-11-28T10:06:01.381Z
Learning: In PostgreSQL migrations for the Formance ledger, to ensure ISO 8601/RFC3339 date formatting when setting JSONB fields, use the pattern `to_jsonb(date)#>>'{}'` to extract the timestamp as a properly formatted string. This leverages JSONB's built-in ISO 8601 encoding. Using `date::text` would depend on the server's DateStyle setting and doesn't guarantee ISO 8601 format.

Applied to files:

  • internal/storage/bucket/migrations/18-transactions-fill-inserted-at/up.sql
📚 Learning: 2025-04-29T11:24:28.923Z
Learnt from: gfyrag
Repo: formancehq/ledger PR: 892
File: internal/controller/ledger/controller_default.go:196-196
Timestamp: 2025-04-29T11:24:28.923Z
Learning: In the ledger Import function, it's critical to maintain proper log ID tracking by updating lastLogID with the current log.ID after each processed log, rather than setting it to nil. This ensures the system can properly validate the ordering of logs and prevent duplicate or out-of-order processing, which is essential for maintaining data integrity in the ledger.

Applied to files:

  • internal/storage/bucket/migrations/18-transactions-fill-inserted-at/up.sql
🔇 Additional comments (5)
internal/storage/bucket/migrations/27-fix-invalid-pcv/up.sql (1)

3-3: Well-executed migration deprecation.

The typo ("superseeded" → "superseded") has been corrected, and converting this migration to a no-op with a notice is a clean approach to deprecate it. This maintains migration sequence integrity while allowing existing databases (which already executed migration 27) to skip it, and new databases to proceed directly to migration 28.

Please verify that migration 28 contains all necessary data-migration logic that migration 27 previously implemented, ensuring no functionality is lost during the transition.

internal/storage/bucket/migrations/33-fix-invalid-date-format/up.sql (2)

19-19: ✓ Row_number-based batching replaces OFFSET/LIMIT efficiently.

The pagination refactor from OFFSET/LIMIT to row_number-based filtering (line 37: where row_number > _offset and row_number <= _offset + _batch_size) is correct and aligns with the PR's performance improvement goals. The batching boundaries are accurate—each iteration processes exactly _batch_size rows without the overhead of OFFSET scanning.

The addition of log_id (line 19) as an alias for reversed.id is used correctly in the UPDATE join condition (line 52), so the projection is well-suited to the downstream logic.

Also applies to: 37-37


33-62: ✓ Batched loop with per-batch commits reduces lock contention.

The loop structure with batched UPDATE (lines 39–53), per-batch commit (line 61), and exit condition (line 55) is sound. Each iteration processes a fixed window of rows, avoiding the risk of prolonged table locks during large migrations. The exit when not found; correctly terminates when no rows are affected by the UPDATE.

internal/storage/bucket/migrations/18-transactions-fill-inserted-at/up.sql (2)

21-26: Row_number-based batching is well-optimized for performance.

The temporary table now uses row_number-based windowing (line 22) with a corresponding index (line 26) keyed on row_number. This replaces the earlier OFFSET/LIMIT approach and should significantly reduce lock contention during large batch updates.

  • Index key on row_number aligns with the WHERE filter on line 37
  • INCLUDE columns (ledger, date, transaction_id) support the CTE SELECT and downstream join
  • Type-based filtering (line 24) correctly identifies relevant transaction logs per PR objectives

33-47: Batch loop logic is correct; no rows are skipped.

The loop for i in 0.._count-1 by _batch_size combined with where row_number > i and row_number <= i + _batch_size correctly partitions all rows without gaps or overlaps:

  • First iteration (i=0): row_number > 0 AND row_number <= 1000 → rows 1–1000
  • Second iteration (i=1000): row_number > 1000 AND row_number <= 2000 → rows 1001–2000
  • And so on…

The defensive inserted_at is null check on line 42 ensures idempotency and prevents re-updating rows if the migration is retried.
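
Spelled out, a single batch of that pattern looks roughly like this (assuming the temp table is named txs_view as in the sibling migrations and that transactions is joined on id and ledger; only the first window, rows 1–1000, is shown):

	with data as (
		select transaction_id, ledger, date
		from txs_view
		where row_number > 0
			and row_number <= 1000
	)
	update transactions
	set inserted_at = data.date
	from data
	where transactions.id = data.transaction_id
		and transactions.ledger = data.ledger
		and transactions.inserted_at is null; -- idempotency guard: only rows not yet filled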


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
internal/storage/bucket/migrations/29-fix-invalid-metadata-on-reverts/up.sql (1)

28-28: Index on reversedTransactionID may be underutilized.

An index on txs_view(reversedTransactionID) is created, but the loop's WHERE clause (line 40) filters by row_number, not reversedTransactionID. The index is unlikely to be used for the SELECT in the data CTE or the subsequent UPDATE join.

Consider whether a row_number index would be more beneficial (if indexing is needed at all), or remove the index if it's unnecessary for temporary table scans.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 77a64f5 and f3611ec.

📒 Files selected for processing (5)
  • internal/storage/bucket/migrations/19-transactions-fill-pcv/up.sql (1 hunks)
  • internal/storage/bucket/migrations/28-fix-pcv-missing-asset/up.sql (3 hunks)
  • internal/storage/bucket/migrations/29-fix-invalid-metadata-on-reverts/up.sql (2 hunks)
  • internal/storage/bucket/migrations/31-fix-transaction-updated-at/up.sql (2 hunks)
  • internal/storage/bucket/migrations/34-fix-memento-format/up.sql (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/storage/bucket/migrations/28-fix-pcv-missing-asset/up.sql
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: gfyrag
Repo: formancehq/ledger PR: 1017
File: internal/storage/bucket/migrations/27-fix-invalid-pcv/up.sql:30-31
Timestamp: 2025-07-19T18:35:34.260Z
Learning: In Formance ledger migrations, SHARE ROW EXCLUSIVE locks on migration-specific temporary tables like moves_view are acceptable since these tables are dedicated for the migration. ACCESS SHARE locks on transactions table don't block writes, only schema changes.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: Dirty
  • GitHub Check: Tests
🔇 Additional comments (7)
internal/storage/bucket/migrations/31-fix-transaction-updated-at/up.sql (3)

11-11: ✓ Row_number-based pagination correctly replaces OFFSET/LIMIT.

The batch boundary logic is sound: with _offset = 0 and _batch_size = 1000, the filter row_number > _offset and row_number <= _offset+_batch_size correctly partitions rows (1–1000, 1001–2000, etc.) without gaps or overlaps. The loop exits cleanly when the data CTE is empty.

Also applies to: 29-29


20-21: Index and foreign key enhance both performance and data integrity.

The index on seq accelerates the hash join during the UPDATE statement (line 31–34), and the foreign key constraint ensures that all rows in the temporary table reference valid transactions. These are sound optimizations for a migration-specific temporary table.
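
For readers skimming the migration, the setup being described is roughly the following sketch (assumed names and an assumed fix condition; the real file may differ, and as noted earlier it uses create table rather than create temp table):

	-- precompute the rows to fix, keyed by row_number for batching
	create table txs_view as
	select row_number() over (order by seq) as row_number, seq, inserted_at
	from transactions
	where updated_at is null; -- assumed fix condition

	-- index on seq to speed up the join in the batched UPDATE
	create index txs_view_idx on txs_view(seq);

	-- foreign key back to transactions; guarantees every precomputed row still references a real transaction
	alter table txs_view
		add constraint txs_view_transactions_fk
		foreign key (seq) references transactions(seq);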


15-18: Batched processing with per-batch commits is well-structured.

Empty-table handling (early return), loop exit on exhaustion (if not found), and per-batch commits all contribute to efficient lock management and graceful completion. The strategy aligns with the broader migration speedup pattern in this PR.

Also applies to: 36-39, 45-45

internal/storage/bucket/migrations/34-fix-memento-format/up.sql (1)

18-18: No issue found. The seq column is a bigserial primary key that starts at 1 and cannot contain values ≤ 0.

The boundary change from seq >= _offset and seq < _offset + _batch_size to seq > _offset and seq <= _offset + _batch_size is valid and does not cause data loss. The seq column in the logs table is defined as bigserial primary key in migration 0-init-schema, which guarantees all values are >= 1. Therefore, the concern about rows with seq ≤ 0 being skipped is not applicable—no such rows can exist.

Likely an incorrect or invalid review comment.

internal/storage/bucket/migrations/29-fix-invalid-metadata-on-reverts/up.sql (2)

20-20: Good migration to row_number-based batching for performance.

The row_number() window function replaces offset/limit pagination, which is more efficient for large datasets—each iteration avoids the cost of scanning and discarding offset rows.


36-56: Verify row_number batch boundary logic in loop.

The batching follows the pattern: where row_number > _offset and row_number <= _offset + _batch_size. With _offset = 0:

  • Iteration 1: rows 1–1000
  • Iteration 2: rows 1001–2000
  • etc.

This is correct and non-overlapping. The exit when not found at line 50 triggers when the UPDATE affects 0 rows (i.e., the data CTE returns no rows), exiting after the final batch.

internal/storage/bucket/migrations/19-transactions-fill-pcv/up.sql (1)

3-3: The batch boundary logic is correct and poses no data completeness risk. The transactions.seq column is defined as bigserial primary key in the initial schema, which means it auto-increments starting from 1—never 0. Therefore, with _offset = 0, the WHERE clause transactions_seq > 0 AND transactions_seq <= 1000 correctly selects all rows in the first batch (1–1000), and no rows are skipped. Subsequent batches process correctly without overlap. The loop exits when no rows match the WHERE clause, which is the appropriate termination condition. No changes are needed.

Likely an incorrect or invalid review comment.

@gfyrag gfyrag requested a review from flemzord December 3, 2025 14:15