FIX: Ensure copy_data callbacks run even when all rows are skipped #33002

Merged

gschlager merged 2 commits into main from fix/mt/copy-data-callbacks on Jun 2, 2025

Conversation

s3lase (Contributor) commented May 30, 2025

Currently, if a "copy" batch of an import step results in all rows being skipped, the `after_commit_of_skipped_rows` callback is never triggered. This happens because the callback is nested inside a block that only runs when at least one row is inserted.

This change ensures the DB copy operation returns both inserted and skipped rows, allowing the caller to respond appropriately in either case.
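
A minimal sketch of the resulting control flow (`after_commit_of_skipped_rows` is the callback named in this PR; everything else below is made up for illustration):

```ruby
# Illustration only: copy_data's real signature and internals differ.
def copy_data(rows)
  rows.partition { |row| !row[:duplicate] } # => [inserted, skipped]
end

def after_commit_of_skipped_rows(rows)
  puts "handling #{rows.size} skipped rows"
end

# Even when every row in the batch is skipped, the callback now fires,
# because the copy operation reports skipped rows alongside inserted ones.
inserted_rows, skipped_rows = copy_data([{ duplicate: true }, { duplicate: true }])
after_commit_of_skipped_rows(skipped_rows) if skipped_rows.any?
# => handling 2 skipped rows
```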

Currently, if a "copy" batch of an import step results in all rows being skipped,
the `after_commit_of_skipped_rows` callback is never triggered.
This happens because the callback is nested inside a block that only runs
when at least one row is inserted.

This change ensures the DB copy operation returns both inserted and skipped rows,
allowing the caller to respond appropriately in either case.
s3lase requested a review from a team as a code owner on May 30, 2025 01:33
github-actions bot added the migrations-tooling label on May 30, 2025
s3lase requested a review from gschlager on May 30, 2025 01:41
gschlager pushed a commit:

Avoids creating a Hash object for each row.
This also optimizes the column mapping in `copy_data`: a while loop and some caching make it faster.
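
A hedged sketch of what that shape of optimization can look like (the data and variable names are invented; the actual `copy_data` internals differ):

```ruby
source_names   = [:id, :name, :email]            # columns in the source rows
target_names   = [:id, :email]                   # columns being copied
column_indexes = target_names.map { |n| source_names.index(n) } # cached once

rows   = [[1, "a", "a@example.com"], [2, "b", "b@example.com"]]
mapped = Array.new(column_indexes.size)          # reused buffer, no per-row Hash

i = 0
while i < rows.length                            # while avoids a block call per row
  row = rows[i]
  j = 0
  while j < column_indexes.length
    mapped[j] = row[column_indexes[j]]
    j += 1
  end
  p mapped                                       # => [1, "a@example.com"], then [2, "b@example.com"]
  i += 1
end
```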
gschlager (Member) commented

Oh, good catch. It looks like yielding skipped rows in the enumerator can't be avoided, so that's fine. But `fetch_rows` is a hot path and potentially enumerates millions of rows; creating a Hash for each row seems wasteful. I pushed a commit with a solution that I'm more comfortable with.
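
For context, a sketch of the allocation difference being discussed (both shapes are hypothetical, not the real `fetch_rows`):

```ruby
columns = [:id, :name]
result  = [[1, "a"], [2, "b"]] # stand-in for a DB driver result set

# One Hash allocated per row: wasteful when enumerating millions of rows.
result.each { |values| p columns.zip(values).to_h }

# No per-row Hash: yield the driver's array and index columns by position.
result.each { |values| p values }
```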

s3lase (Contributor Author) commented Jun 1, 2025

Yes, you’re right, instantiating a new hash for each row isn’t ideal for that path. I started off wanting to avoid enumerating the skipped rows entirely like you were doing, but most of the solutions I came up with ended up either breaking the existing batching implementation for skipped rows or taking things in a totally different direction.

I also considered marking skipped rows in the existing row hash with something like `row[:skip] = true`, but I was concerned about potentially clashing with a future `skip` column. Your use of `:"$skip"` to get around this is a pretty cool workaround.
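
A small sketch of that marker trick (`:"$skip"` is from this thread; the surrounding code is invented):

```ruby
# A plain :skip key could collide with a real "skip" column someday, while
# "$" never appears in a symbolized column name, so :"$skip" is safe.
rows = [
  { id: 1, name: "a" },
  { id: 2, name: "b", :"$skip" => true },
]

inserted, skipped = rows.partition { |row| !row[:"$skip"] }
p inserted # => [{:id=>1, :name=>"a"}]
p skipped  # => [{:id=>2, :name=>"b", :"$skip"=>true}]
```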

Comment on lines +18 to +19:

```ruby
inserted_rows = []
skipped_rows = []
```
s3lase (Contributor Author) commented

In the same spirit of micro-optimizing, what do you think about pre-allocating these too? We know they won't be larger than `COPY_BATCH_SIZE`.
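
For illustration, one way to pre-size these in CRuby (an assumption about technique, not necessarily what was committed):

```ruby
COPY_BATCH_SIZE = 1_000 # stand-in value; the real constant is defined elsewhere

# Array.new(n).clear yields an empty array whose backing buffer was already
# grown to n slots, so appends up to COPY_BATCH_SIZE avoid reallocation.
# (This relies on a CRuby implementation detail: Array#clear keeps capacity.)
inserted_rows = Array.new(COPY_BATCH_SIZE).clear
skipped_rows  = Array.new(COPY_BATCH_SIZE).clear
```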

gschlager (Member) commented

Good idea

gschlager merged commit a48f33f into main on Jun 2, 2025
5 checks passed
gschlager deleted the fix/mt/copy-data-callbacks branch on June 2, 2025 21:07
Labels

migrations-tooling: PR which includes changes to migrations tooling

2 participants