S3: implement new data integrity for MultipartUpload #12183
Conversation
tests/unit/test_s3.py
Outdated
This file was testing botocore code, so it wasn't really useful
tests/aws/services/s3/test_s3.py
Outdated
Sorry for all the duplication in the tests; multipart uploads are super verbose and annoying to work with.
```python
def get_object_checksum_for_algorithm(checksum_algorithm: str, data: bytes) -> str:
    match checksum_algorithm:
        case ChecksumAlgorithm.CRC32:
            return checksum_crc32(data)

        case ChecksumAlgorithm.CRC32C:
            return checksum_crc32c(data)

        case ChecksumAlgorithm.SHA1:
            return hash_sha1(data)

        case ChecksumAlgorithm.SHA256:
            return hash_sha256(data)

        case _:
            # TODO: check proper error? for now validated client side, need to check server response
            raise InvalidRequest("The value specified in the x-amz-trailer header is not supported")


def verify_checksum(checksum_algorithm: str, data: bytes, request: Dict):
    # TODO: you don't have to specify the checksum algorithm,
    # you can use only the checksum-{algorithm-type} header
    # https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html
    key = f"Checksum{checksum_algorithm.upper()}"
    # TODO: is there a message if the header is missing?
    checksum = request.get(key)
    calculated_checksum = get_object_checksum_for_algorithm(checksum_algorithm, data)

    if calculated_checksum != checksum:
        raise InvalidRequest(
            f"Value for x-amz-checksum-{checksum_algorithm.lower()} header is invalid."
        )
```
This code wasn't used anywhere anymore, since we're now properly calculating checksums on streams instead of on the full string.
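To illustrate the streaming approach mentioned above, here is a minimal sketch of computing checksums incrementally over chunks instead of buffering the whole payload. This is not LocalStack's actual implementation; `stream_checksums` and the chunk iterable are hypothetical names, and only stdlib hashing is used:

```python
import hashlib
import zlib
from typing import Iterable

def stream_checksums(chunks: Iterable[bytes]) -> tuple[str, int]:
    """Compute SHA-256 and CRC32 incrementally over a stream of chunks,
    without ever holding the full payload in memory."""
    sha256 = hashlib.sha256()
    crc = 0
    for chunk in chunks:
        sha256.update(chunk)           # hashlib objects accept incremental updates
        crc = zlib.crc32(chunk, crc)   # zlib.crc32 takes the running CRC as a seed
    return sha256.hexdigest(), crc & 0xFFFFFFFF
```

The result is identical to hashing the concatenated payload in one call, which is what makes chunk-by-chunk processing of `aws-chunked` bodies possible.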
```
@@ -454,40 +442,6 @@ def base_64_content_md5_to_etag(content_md5: ContentMD5) -> str | None:
     return hex_digest


 def decode_aws_chunked_object(
```
We did not use this code anymore, as `localstack.services.s3.codec.AwsChunkedDecoder` superseded it.
S3 Image Test Results (AMD64 / ARM64): 2 files ±0, 2 suites ±0, 8m 8s ⏱️ +2m 58s. Results for commit c3b2220, compared against base commit faef710. This pull request removes 2 and adds 31 tests; note that renamed tests count towards both.
♻️ This comment has been updated with latest results.
Force-pushed 1167d27 to 0c5c40e.
This reverts commit ad6e619.
Force-pushed 0c5c40e to 1d155af.
Force-pushed 1d155af to 7fc465e.
LocalStack Community integration with Pro: 2 files ±0, 2 suites ±0, 1h 53m 49s ⏱️ +1m 6s. Results for commit c3b2220, compared against base commit faef710. This pull request removes 2 and adds 48 tests; note that renamed tests count towards both.
♻️ This comment has been updated with latest results.
Nice work! The code and the whole solution are easy to understand thanks to the comments :)
Motivation
This PR implements the new data integrity checks for Multipart Upload (shortened MPU), namely the ability to set the `FULL_OBJECT` checksum type for an MPU. This allows the user to send the full object checksum to S3 instead of keeping track of the individual part checksums to calculate the final one. This is possible thanks to the `CRC*`-type checksums, which can be linearized, meaning you can derive the full object checksum from the individual parts' checksums (via a somewhat complicated function).

The PR looks big (3k lines), but 70% of it is snapshots, because I had to cover a lot of different cases and operations' responses, and there was a lot of ground that wasn't previously covered.

There is a lot of additional validation and default values to check (different relationships between checksum algorithm and checksum type), as well as new validation of the MPU total size at completion time. I've also ported the fix that returns a different error message depending on whether the checksum value is valid or not.

The biggest part and challenge of this PR was implementing the logic to combine CRC checksums together in order to create the `FULL_OBJECT` checksum. More can be read about this in the following Stack Overflow post: https://stackoverflow.com/a/23126768

I looked everywhere online to try to find some way to do that, and it was pretty hard to find. I found an old Alibaba repository with a Python 2 function to combine CRC64 checksums; I had to adapt it a little to run with Python 3 and fit our use case better.
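As a sketch of the combination trick described above, here is a Python 3 port of zlib's well-known `crc32_combine` routine (the classic Mark Adler algorithm from the linked Stack Overflow answer, shown for CRC32 rather than the PR's CRC64 variant, and not the exact code used in the PR). It computes the CRC of a concatenation from the parts' CRCs and the second part's length alone, which is what lets S3 derive a `FULL_OBJECT` checksum from part checksums:

```python
import zlib

# CRC-32 polynomial (reflected form), as used by zlib
_POLY = 0xEDB88320

def _gf2_matrix_times(mat: list[int], vec: int) -> int:
    # Multiply a 32x32 GF(2) matrix (list of 32 row ints) by a 32-bit vector.
    total = 0
    i = 0
    while vec:
        if vec & 1:
            total ^= mat[i]
        vec >>= 1
        i += 1
    return total

def _gf2_matrix_square(square: list[int], mat: list[int]) -> None:
    for n in range(32):
        square[n] = _gf2_matrix_times(mat, mat[n])

def crc32_combine(crc1: int, crc2: int, len2: int) -> int:
    """Return CRC32(A + B), given crc1 = CRC32(A), crc2 = CRC32(B), len2 = len(B)."""
    if len2 == 0:
        return crc1
    even = [0] * 32
    odd = [0] * 32
    # Operator for one zero bit in odd
    odd[0] = _POLY
    row = 1
    for n in range(1, 32):
        odd[n] = row
        row <<= 1
    _gf2_matrix_square(even, odd)  # operator for two zero bits
    _gf2_matrix_square(odd, even)  # operator for four zero bits
    # Append len2 zero bytes to crc1, walking the binary expansion of len2;
    # the first squaring below yields the one-zero-byte (eight bits) operator.
    while True:
        _gf2_matrix_square(even, odd)
        if len2 & 1:
            crc1 = _gf2_matrix_times(even, crc1)
        len2 >>= 1
        if len2 == 0:
            break
        _gf2_matrix_square(odd, even)
        if len2 & 1:
            crc1 = _gf2_matrix_times(odd, crc1)
        len2 >>= 1
        if len2 == 0:
            break
    return (crc1 ^ crc2) & 0xFFFFFFFF

# Combining part CRCs reproduces the CRC of the whole object:
part_a, part_b = b"part-1-data", b"part-2-data"
combined = crc32_combine(zlib.crc32(part_a), zlib.crc32(part_b), len(part_b))
assert combined == zlib.crc32(part_a + part_b)
```

The key point is that only `len(B)` and the two part CRCs are needed, so the server never has to re-read the part bodies to produce the full-object checksum.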
More resources to read about the new S3 data integrity push:
There's also a bit of cleanup done, removing some unused functions/code.
Changes
- Implemented support for the `FULL_OBJECT` and `COMPOSITE` checksum types for MPU
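For contrast with `FULL_OBJECT`, the `COMPOSITE` type is a checksum of checksums. The sketch below assumes the scheme described in the AWS object-integrity documentation (concatenate the parts' binary checksums, checksum that, and append `-<part count>`); `composite_crc32` is a hypothetical helper name, not code from this PR:

```python
import base64
import zlib

def composite_crc32(parts: list[bytes]) -> str:
    """COMPOSITE-style CRC32 for an MPU: the CRC32 of the concatenated
    binary part checksums, base64-encoded, suffixed with "-<part count>"."""
    concatenated = b"".join(
        zlib.crc32(part).to_bytes(4, "big") for part in parts
    )
    digest = zlib.crc32(concatenated).to_bytes(4, "big")
    return base64.b64encode(digest).decode() + f"-{len(parts)}"

print(composite_crc32([b"part one", b"part two"]))  # ends with "-2"
```

Unlike `FULL_OBJECT`, this value depends on how the object was split into parts, which is exactly why clients previously had to track per-part checksums.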