Skip to content

S3: implement new data integrity for MultipartUpload #12183

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Jan 28, 2025
Merged

Conversation

bentsku
Copy link
Contributor

@bentsku bentsku commented Jan 25, 2025

Motivation

This PR implements the new data integrity checks for Multipart Upload (shortened MPU), namely the ability to set the FULL_OBJECT checksum type for a MPU, allowing the user to not keep track of individual checksum parts to calculate the final one, but instead send the full object checksum to S3. This can be done thanks to CRC* type checksums, which can linearize, meaning you can get the full object checksum from the individual parts' checksums (via a somewhat complicated function).

The PR looks big (3k lines), but 70% of it is snapshots, because I had to cover a lot of different cases, operations' responses, and there was a lot of ground to cover that wasn't previously covered.

There is a lot of additional validation and default values to check (different relationship between checksum algorithm and checksum type), new validation of the MPU total size at completion time, I've also ported the fix with different error message depending if the checksum value is valid or not.

The biggest part and challenge of this PR is to implement to logic to combine CRC checksums together in order to create the FULL_OBJECT checksum. More can be read about this in the following Stack Overflow post: https://stackoverflow.com/a/23126768
I had to look everywhere online to try to find someway to do that, and it was pretty hard to find. I found an old repository of Alibaba, with the function available in Python2 to combine CRC64 checksums, I had to adapt it a little to run with Python3 and fit our use case better.

More resources to read about the new S3 data integrity push:

There's also a bit of cleanup done, removing some unused functions/code.

Changes

  • implement the new checksum types FULL_OBJECT and COMPOSITE for MPU
  • implement the helpers to combine CRC checksums together
  • clean up some dead code
  • implement a lot of new tests covering as wide as I could think

@bentsku bentsku added aws:s3 Amazon Simple Storage Service semver: patch Non-breaking changes which can be included in patch releases labels Jan 25, 2025
@bentsku bentsku added this to the 4.1 milestone Jan 25, 2025
@bentsku bentsku self-assigned this Jan 25, 2025
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file was testing botocore code, so it wasn't really useful

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for all the duplication in the tests, multipart uploads are super verbose and annoying to work with

Comment on lines -393 to -426
def get_object_checksum_for_algorithm(checksum_algorithm: str, data: bytes) -> str:
match checksum_algorithm:
case ChecksumAlgorithm.CRC32:
return checksum_crc32(data)

case ChecksumAlgorithm.CRC32C:
return checksum_crc32c(data)

case ChecksumAlgorithm.SHA1:
return hash_sha1(data)

case ChecksumAlgorithm.SHA256:
return hash_sha256(data)

case _:
# TODO: check proper error? for now validated client side, need to check server response
raise InvalidRequest("The value specified in the x-amz-trailer header is not supported")


def verify_checksum(checksum_algorithm: str, data: bytes, request: Dict):
# TODO: you don't have to specify the checksum algorithm
# you can use only the checksum-{algorithm-type} header
# https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html
key = f"Checksum{checksum_algorithm.upper()}"
# TODO: is there a message if the header is missing?
checksum = request.get(key)
calculated_checksum = get_object_checksum_for_algorithm(checksum_algorithm, data)

if calculated_checksum != checksum:
raise InvalidRequest(
f"Value for x-amz-checksum-{checksum_algorithm.lower()} header is invalid."
)


Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code wasn't used anymore anywhere, since we're properly calculating checksums on streams instead of the full string

@@ -454,40 +442,6 @@ def base_64_content_md5_to_etag(content_md5: ContentMD5) -> str | None:
return hex_digest


def decode_aws_chunked_object(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we did not use this code anymore, as localstack.services.s3.codec.AwsChunkedDecoder superseded it

Copy link

github-actions bot commented Jan 25, 2025

S3 Image Test Results (AMD64 / ARM64)

  2 files  ± 0    2 suites  ±0   8m 8s ⏱️ + 2m 58s
475 tests +29  423 ✅ +29   52 💤 ±0  0 ❌ ±0 
950 runs  +58  846 ✅ +58  104 💤 ±0  0 ❌ ±0 

Results for commit c3b2220. ± Comparison against base commit faef710.

This pull request removes 2 and adds 31 tests. Note that renamed tests count towards both.
tests.aws.services.s3.test_s3.TestS3MultipartUploadChecksum ‑ test_complete_multipart_parts_checksum
tests.aws.services.s3.test_s3.TestS3MultipartUploadChecksum ‑ test_multipart_parts_checksum_exceptions
tests.aws.services.s3.test_s3.TestS3MultipartUploadChecksum ‑ test_complete_multipart_parts_checksum_composite[CRC32C]
tests.aws.services.s3.test_s3.TestS3MultipartUploadChecksum ‑ test_complete_multipart_parts_checksum_composite[CRC32]
tests.aws.services.s3.test_s3.TestS3MultipartUploadChecksum ‑ test_complete_multipart_parts_checksum_composite[SHA1]
tests.aws.services.s3.test_s3.TestS3MultipartUploadChecksum ‑ test_complete_multipart_parts_checksum_composite[SHA256]
tests.aws.services.s3.test_s3.TestS3MultipartUploadChecksum ‑ test_complete_multipart_parts_checksum_default
tests.aws.services.s3.test_s3.TestS3MultipartUploadChecksum ‑ test_complete_multipart_parts_checksum_full_object[CRC32C]
tests.aws.services.s3.test_s3.TestS3MultipartUploadChecksum ‑ test_complete_multipart_parts_checksum_full_object[CRC32]
tests.aws.services.s3.test_s3.TestS3MultipartUploadChecksum ‑ test_complete_multipart_parts_checksum_full_object[CRC64NVME]
tests.aws.services.s3.test_s3.TestS3MultipartUploadChecksum ‑ test_multipart_checksum_type_compatibility[COMPOSITE-CRC32C]
tests.aws.services.s3.test_s3.TestS3MultipartUploadChecksum ‑ test_multipart_checksum_type_compatibility[COMPOSITE-CRC32]
…

♻️ This comment has been updated with latest results.

@bentsku bentsku added semver: minor Non-breaking changes which can be included in minor releases, but not in patch releases and removed semver: patch Non-breaking changes which can be included in patch releases labels Jan 25, 2025
@bentsku bentsku changed the title S3: implement S3 checksum MultipartUpload FULL_OBJECT S3: implement new data integrity for MultipartUpload Jan 25, 2025
@bentsku bentsku force-pushed the s3-feat-checksum branch 2 times, most recently from 1167d27 to 0c5c40e Compare January 27, 2025 10:38
Base automatically changed from s3-feat-checksum-1 to master January 27, 2025 17:35
@bentsku bentsku requested a review from k-a-il January 27, 2025 17:37
Copy link

github-actions bot commented Jan 27, 2025

LocalStack Community integration with Pro

    2 files  ± 0      2 suites  ±0   1h 53m 49s ⏱️ + 1m 6s
4 070 tests +46  3 755 ✅ +46  315 💤 ±0  0 ❌ ±0 
4 072 runs  +46  3 755 ✅ +46  317 💤 ±0  0 ❌ ±0 

Results for commit c3b2220. ± Comparison against base commit faef710.

This pull request removes 2 and adds 48 tests. Note that renamed tests count towards both.
tests.aws.services.s3.test_s3.TestS3MultipartUploadChecksum ‑ test_complete_multipart_parts_checksum
tests.aws.services.s3.test_s3.TestS3MultipartUploadChecksum ‑ test_multipart_parts_checksum_exceptions
tests.aws.services.s3.test_s3.TestS3MultipartUploadChecksum ‑ test_complete_multipart_parts_checksum_composite[CRC32C]
tests.aws.services.s3.test_s3.TestS3MultipartUploadChecksum ‑ test_complete_multipart_parts_checksum_composite[CRC32]
tests.aws.services.s3.test_s3.TestS3MultipartUploadChecksum ‑ test_complete_multipart_parts_checksum_composite[SHA1]
tests.aws.services.s3.test_s3.TestS3MultipartUploadChecksum ‑ test_complete_multipart_parts_checksum_composite[SHA256]
tests.aws.services.s3.test_s3.TestS3MultipartUploadChecksum ‑ test_complete_multipart_parts_checksum_default
tests.aws.services.s3.test_s3.TestS3MultipartUploadChecksum ‑ test_complete_multipart_parts_checksum_full_object[CRC32C]
tests.aws.services.s3.test_s3.TestS3MultipartUploadChecksum ‑ test_complete_multipart_parts_checksum_full_object[CRC32]
tests.aws.services.s3.test_s3.TestS3MultipartUploadChecksum ‑ test_complete_multipart_parts_checksum_full_object[CRC64NVME]
tests.aws.services.s3.test_s3.TestS3MultipartUploadChecksum ‑ test_multipart_checksum_type_compatibility[COMPOSITE-CRC32C]
tests.aws.services.s3.test_s3.TestS3MultipartUploadChecksum ‑ test_multipart_checksum_type_compatibility[COMPOSITE-CRC32]
…

♻️ This comment has been updated with latest results.

@bentsku bentsku marked this pull request as ready for review January 27, 2025 21:58
Copy link
Contributor

@k-a-il k-a-il left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Code and the whole solution is easy to understand thanks to comments :)

@bentsku bentsku merged commit 25b920c into master Jan 28, 2025
38 checks passed
@bentsku bentsku deleted the s3-feat-checksum branch January 28, 2025 15:05
@alexrashed alexrashed mentioned this pull request Feb 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
aws:s3 Amazon Simple Storage Service semver: minor Non-breaking changes which can be included in minor releases, but not in patch releases
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants