Skip to content

gh-117779: Fix reading duplicated entries in zipfile by name #129254

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

serhiy-storchaka
Copy link
Member

@serhiy-storchaka serhiy-storchaka commented Jan 24, 2025

@serhiy-storchaka
Copy link
Member Author

I wonder how far should we backport this change. #110016 was a security fix and was backported to 3.8.

@serhiy-storchaka serhiy-storchaka force-pushed the zipfile-duplicated-entries branch from c683cf7 to b8a9cea Compare January 24, 2025 11:21
@obfusk
Copy link
Contributor

obfusk commented Jan 25, 2025

I didn't expect such archives to be found in the real world.
I cannot imagine why anybody would create such archive.

I didn't expect it either. And I suspect the creation of those duplicate entries to be a bug, but they do clearly exist in the wild.

Now the last of the duplicated entries can be read, so you will not get an error when read by name. You will still get errors when reading other duplicated entries, this is necessary to prevent ZIP bombing.

I'm slightly unsure about this fix though. For one, it breaks the workaround we added to diffoscope (use the first entry in infolist() if there are duplicates). This fix is simple and handles opening the file by name, yes, but for other use cases the problem is simply moved around. I have plenty of code that iterates over infolist() and passes the ZipInfo to read(), which would still fail even though there is no actual overlap indicating a ZIP bomb.

IMO it would be admittedly more complex but also more consistent -- and avoid any errors processing these ZIP files -- to explicitly handle entries with duplicate offsets: letting all of them have the same ._end_offset, that being the start of the actual next (i.e. not having the same .header_offset) entry instead of still having the entries with duplicate offsets pointing to each other in a different order. This should still prevent actual ZIP bombs as it ensures there is no actual overlap whilst treating entries with the same .header_offset consistently instead of having some that can and others that cannot be opened depending on ordering.

(edited for clarity)

@obfusk
Copy link
Contributor

obfusk commented Jan 25, 2025

cc @h01ger @lamby

@serhiy-storchaka
Copy link
Member Author

Sorry for the delay in responding. I experimented with different solutions and wrote new tests.

To me, this is a ZIP bomb. Even if it does not allow to fill the disk space by decompressing, it still spend CPU time for uncompressing the same data. It may consume a lot of memory if the decompressed entries are store in a list instead of a dict.

But since ZIP archives with duplicated entries encountered in practice, created not by malicious, but just bad software, perhaps we could lower the severity level from error to warning. Warnings can be converted to exceptions, if you want to make your program safer.

@gpshead, what is your opinion on this?

@gpshead
Copy link
Member

gpshead commented Mar 28, 2025

I like the warning idea.

@gpshead
Copy link
Member

gpshead commented Mar 28, 2025

@jaraco - any thoughts on this one? in re-enables processing of one form of zip bomb, just with a warning instead of error. because there are also maybe legit situations where this zip structure occurs. It mostly undoes one of the protections added in #110016 for #109858.

@obfusk
Copy link
Contributor

obfusk commented Mar 28, 2025

A warning seems reasonable, yeah. But I dislike the "except the last one" edge case where it still breaks, surely that can be avoided?

@obfusk
Copy link
Contributor

obfusk commented Mar 29, 2025

IMO a warning indeed makes perfect sense but I think entries with the same .header_offset should all be treated consistently; AFAIK there is zero documentation for the fact that iterating over entries in different ways produces different results and that end user code needs to be aware, not just that e.g. duplicate file names may exist, but of how these entries with identical offsets are being handled internally (and I consider the fact that now matters a regression still).

@serhiy-storchaka
Copy link
Member Author

Warnings or exceptions are emitted for some abnormal ZIP files from ancient times, but it is not possible to detect all anomalies, and even if it is possible, it is not always practical. For example, in a "quote-overlap" zip bomb, the most nested file can be read without error.

It is not a large issue if one of entries with the same offset can be read without error or warning. For consistency with how non-overlapped files with the same name are handled, it should be the last entry.

@serhiy-storchaka serhiy-storchaka merged commit 0f04f24 into python:main Apr 8, 2025
39 checks passed
@miss-islington-app
Copy link

Thanks @serhiy-storchaka for the PR 🌮🎉.. I'm working now to backport this PR to: 3.12, 3.13.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Apr 8, 2025
…ythonGH-129254)

(cherry picked from commit 0f04f24)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Apr 8, 2025
…ythonGH-129254)

(cherry picked from commit 0f04f24)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
@bedevere-app
Copy link

bedevere-app bot commented Apr 8, 2025

GH-132263 is a backport of this pull request to the 3.13 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.13 bugs and security fixes label Apr 8, 2025
@bedevere-app
Copy link

bedevere-app bot commented Apr 8, 2025

GH-132264 is a backport of this pull request to the 3.12 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.12 only security fixes label Apr 8, 2025
serhiy-storchaka added a commit that referenced this pull request Apr 8, 2025
…H-129254) (GH-132264)

(cherry picked from commit 0f04f24)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
serhiy-storchaka added a commit that referenced this pull request Apr 8, 2025
…H-129254) (GH-132263)

(cherry picked from commit 0f04f24)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants