# Reading entries with data descriptors broken on many levels #295
Comments
I've recently opened PR #358, which falls back to the central directory size (if present) when the local header's general purpose bit 3 is set; not the cleanest fix, but the simplest without re-architecting rubyzip. It worked for what I needed it for.
Thanks for this @julik, and sorry it's taken so long to get round to looking at things around data descriptors. I think you are absolutely right about not trying to merge the central directory and local header data. I tried it once and it did not go well. Unfortunately, it's not quite that simple thanks to extra fields 😢 In a few cases the extra field stored in the central directory is a marker, or a cut-down version of the one in the local header, so we do have to read the local header version. I fixed this recently (#459) by ignoring everything in the local header except the extra fields. But the rule of thumb is: use … I have tried your … I will look at making sure we do something more sensible with … May I use …
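The extra-field subtlety above can be sketched as follows. This is a hypothetical helper, not rubyzip's actual code: it parses a raw extra-field blob taken from a *local* header and pulls out the Zip64 sizes, since the central directory's copy of that extra field may be a cut-down marker.

```ruby
# Hypothetical helper (not rubyzip's API): scan an extra-field blob for
# the Zip64 extended information record (header ID 0x0001) and resolve
# the real sizes. Per APPNOTE, a 64-bit field is only present in the
# record when the corresponding 32-bit header field is 0xFFFFFFFF.
ZIP64_EXTRA_ID = 0x0001

def zip64_sizes_from_extra(extra, uncompressed_marker, compressed_marker)
  pos = 0
  while pos + 4 <= extra.bytesize
    id, len = extra[pos, 4].unpack('vv')  # each record: ID, length, body
    body = extra[pos + 4, len]
    if id == ZIP64_EXTRA_ID
      fields = body.unpack('Q<*')         # little-endian 64-bit values
      usize = uncompressed_marker == 0xFFFFFFFF ? fields.shift : uncompressed_marker
      csize = compressed_marker == 0xFFFFFFFF ? fields.shift : compressed_marker
      return [usize, csize]
    end
    pos += 4 + len                        # skip to the next extra record
  end
  [uncompressed_marker, compressed_marker] # no Zip64 record found
end
```

The point of the sketch: the sizes cannot be resolved from the central directory record alone when its extra field is only a marker, which is why the local header's extra fields still have to be read.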
Update: I've removed …
Lovely! Feel free to close the issue if appropriate, and thanks for getting back on this.
Thanks! I will close this one, since it is referenced in #460.
I am currently working out the last details of our own ZIP writer implementation. While doing so, I wrote an implementation of data descriptors plus bit 3 of the GP flags. Unfortunately, I simply cannot manage to make Rubyzip read the entries that have data descriptors.
So far I saw the following spots where things go really wrong:
## `data_descriptor_size` is inaccurate

The data descriptor size is assumed to be 16 bytes (the data descriptor signature, the CRC and two sizes). However, for Zip64 entries the sizes will be 8 bytes each. Additionally, the signature is optional according to the APPNOTE. It is safe to assume it will be present, but blindly computing offsets based on it is an invitation to a party.
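To make the size variability concrete, here is a minimal sketch (a hypothetical helper, not rubyzip's implementation) that computes the descriptor size by peeking for the optional signature and accounting for Zip64 field widths:

```ruby
# Sketch: a data descriptor is [optional signature] + CRC-32 + two sizes.
# The sizes are 4 bytes each normally, 8 bytes each for Zip64 entries,
# so the record can be 12, 16, 20 or 24 bytes long.
DESCRIPTOR_SIGNATURE = 0x08074b50

def data_descriptor_size(io, zip64:)
  size_field = zip64 ? 8 : 4             # width of each size field
  start = io.pos
  maybe_sig = io.read(4).unpack1('V')    # little-endian uint32 peek
  io.seek(start)                         # rewind; we only peeked
  sig_bytes = maybe_sig == DESCRIPTOR_SIGNATURE ? 4 : 0
  sig_bytes + 4 + (2 * size_field)       # [sig] + CRC-32 + two sizes
end
```

Note the peek is itself a heuristic: a CRC value can collide with the signature, which is exactly why fixed-offset assumptions here are fragile.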
## `get_input_stream` is supposed to raise an error, but doesn't

When reading a file using `Zip::File.open()` and obtaining an input stream for the entry body, the GP flag apparently does not get interpreted correctly. In the code I see that if bit 3 is set, it is supposed to raise an error and tell me to use `Zip::File`. However, when I try to use `Zip::File` I get the same `Entry` returned, which cannot be read.

## `get_input_stream` relies on reading local headers (which is THE issue here)

In general, relying on local headers for reading the entry body when you already have the central directory is madness. The local header for an entry with a data descriptor is forward-declared, and is very likely not to contain Zip64 extras, because when the entry was being written the writing code probably did not yet know the size of the file, nor the size of the compressed file data. The fact that `get_input_stream` wants to read the local entry header, and potentially even clobbers the entry read from the central directory, makes it pretty much impossible to do anything with data descriptors reliably: if you use the forward-declared "dummy" local header, the only way to figure out where the file ends is to scan ahead until you stumble upon the signature of the data descriptor. Which will never happen to occur within a stored entry that you are reading, because people never put ZIPs within other ZIPs, right?
## `@local_header_size` computed incorrectly (well, not computed at all)

When I try to open a file with data descriptors and raise-inspect the entries that `Zip::File.foreach()` gives me, the entries have their local header size set to zero. No wonder that trying to read them yields odd results: the code probably thinks that the file data starts right at the offset of the local header.

## How to solve it
The only practical way to solve this, IMO, is to get rid of reading local headers altogether, because that way lies madness. Additionally, there are many situations where the local header can be written without the Zip64 extra fields, but since the file is located beyond the Zip64 span of the archive body you have to write out the Zip64 central directory and the extra field anyway; the extra field will then only be present in the central directory. So the only way I can see this being solvable is reading the entry metadata once, from the central directory, and using that as the authoritative data. It also provides the exact offsets where the actual entry data can be found.
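The central-directory-first approach described above can be sketched roughly as follows. This is a hypothetical reader, not rubyzip's API, and it deliberately omits Zip64 and archive-comment handling for brevity:

```ruby
require 'stringio'

# Sketch: the central directory is the authoritative source for names,
# CRCs and sizes. The local header is consulted ONLY for its name/extra
# lengths, to locate where the file data actually starts.
EOCD_SIG    = 0x06054b50
CENTRAL_SIG = 0x02014b50

CentralEntry = Struct.new(:name, :crc, :compressed_size,
                          :uncompressed_size, :data_offset)

def read_central_entries(io)
  io.seek(-22, IO::SEEK_END)            # fixed-size EOCD, no comment assumed
  sig, _, _, _, total, _cd_size, cd_offset, _ = io.read(22).unpack('VvvvvVVv')
  raise 'EOCD record not found' unless sig == EOCD_SIG

  io.seek(cd_offset)
  Array.new(total) do
    header = io.read(46)
    raise 'bad central header' unless header.unpack1('V') == CENTRAL_SIG
    crc, csize, usize = header[16, 12].unpack('VVV')
    name_len, extra_len, comment_len = header[28, 6].unpack('vvv')
    lho = header[42, 4].unpack1('V')    # local header offset
    name = io.read(name_len)
    io.read(extra_len + comment_len)    # skip central extra field + comment
    next_pos = io.pos

    # Peek at the local header ONLY for its name/extra lengths; with GP
    # bit 3 set, its CRC and sizes are zeroed and must not be trusted.
    io.seek(lho + 26)
    l_name_len, l_extra_len = io.read(4).unpack('vv')
    entry = CentralEntry.new(name, crc, csize, usize,
                             lho + 30 + l_name_len + l_extra_len)
    io.seek(next_pos)
    entry
  end
end
```

Note that even here the local header's fixed portion is read once per entry, but purely to find the data offset; none of its sizes ever overwrite the central directory values.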
This is also the approach taken by https://github.com/thejoshwolfe/yauzl - its author says that combining the local headers and the central directory information is insane and basically unsolvable.
However, obviously, fixing it that way will not be 100% backwards-compatible, because some APIs will have to be removed. So I can tackle this, but we'll have to break a few eggs in the process.
Curious about your opinions.
An example file that we write with our library can be found here: https://we.tl/IChJczPVYr
Opens perfectly fine with all the tools I tested it with, except... Rubyzip.