# Reading entries with data descriptors broken on many levels #295
Comments
I've recently opened PR #358, which falls back to the central directory size (if present) when the local header's general purpose bit 3 is set; not the cleanest fix, but the simplest without re-architecting rubyzip. It worked for what I needed it for.
Thanks for this @julik, and sorry it's taken so long to get round to looking at things around data descriptors. I think you are absolutely right about not trying to merge the central directory and local header data. I tried it once and it did not go well. Unfortunately, it's not quite that simple thanks to extra fields 😢 In a few cases the extra field stored in the central directory is a marker, or a cut-down version of the one in the local header, so we do have to read the local header version. I fixed this recently (#459) by ignoring everything in the local header except the extra fields. But the rule of thumb is: use … I have tried your … I will look at making sure we do something more sensible with … May I use …
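The extra-field subtlety above can be sketched as follows. This is a hypothetical helper, not rubyzip's actual code: it parses a raw extra-field blob taken from a *local* header and pulls out the Zip64 sizes, since the central directory's copy of that extra field may be a cut-down marker.

```ruby
# Hypothetical helper (not rubyzip's API): scan an extra-field blob for
# the Zip64 extended information record (header ID 0x0001) and resolve
# the real sizes. Per APPNOTE, a 64-bit field is only present in the
# record when the corresponding 32-bit header field is 0xFFFFFFFF.
ZIP64_EXTRA_ID = 0x0001

def zip64_sizes_from_extra(extra, uncompressed_marker, compressed_marker)
  pos = 0
  while pos + 4 <= extra.bytesize
    id, len = extra[pos, 4].unpack('vv')  # each record: ID, length, body
    body = extra[pos + 4, len]
    if id == ZIP64_EXTRA_ID
      fields = body.unpack('Q<*')         # little-endian 64-bit values
      usize = uncompressed_marker == 0xFFFFFFFF ? fields.shift : uncompressed_marker
      csize = compressed_marker == 0xFFFFFFFF ? fields.shift : compressed_marker
      return [usize, csize]
    end
    pos += 4 + len                        # skip to the next extra record
  end
  [uncompressed_marker, compressed_marker] # no Zip64 record found
end
```

The point of the sketch: the sizes cannot be resolved from the central directory record alone when its extra field is only a marker, which is why the local header's extra fields still have to be read.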
Update: I've removed …
Lovely! Feel free to close the issue if appropriate, and thanks for getting back on this.
Thanks! I will close this one, since it is referenced in #460.
I am currently working out the last details of our own ZIP writer implementation. While doing so, I wrote an implementation of data descriptors plus bit 3 of the GP flags. Unfortunately, I simply cannot manage to make Rubyzip read the entries that have data descriptors.
So far I saw the following spots where things go really wrong:
## `data_descriptor_size` is inaccurate

The data descriptor size is assumed to be 16 bytes (the data descriptor signature, the CRC and two sizes). However, for Zip64 entries the sizes will be 8 bytes each. Additionally, the signature is optional according to the APPNOTE. It is safe to assume it will be present, but blindly computing offsets based on it is an invitation to a party.
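To make the size variability concrete, here is a minimal sketch (a hypothetical helper, not rubyzip's implementation) that computes the descriptor size by peeking for the optional signature and accounting for Zip64 field widths:

```ruby
# Sketch: a data descriptor is [optional signature] + CRC-32 + two sizes.
# The sizes are 4 bytes each normally, 8 bytes each for Zip64 entries,
# so the record can be 12, 16, 20 or 24 bytes long.
DESCRIPTOR_SIGNATURE = 0x08074b50

def data_descriptor_size(io, zip64:)
  size_field = zip64 ? 8 : 4             # width of each size field
  start = io.pos
  maybe_sig = io.read(4).unpack1('V')    # little-endian uint32 peek
  io.seek(start)                         # rewind; we only peeked
  sig_bytes = maybe_sig == DESCRIPTOR_SIGNATURE ? 4 : 0
  sig_bytes + 4 + (2 * size_field)       # [sig] + CRC-32 + two sizes
end
```

Note the peek is itself a heuristic: a CRC value can collide with the signature, which is exactly why fixed-offset assumptions here are fragile.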
## `get_input_stream` is supposed to raise an error, but doesn't

When reading a file using `Zip::File.open()` and obtaining an input stream for the entry body, the GP flag apparently does not get interpreted correctly. In the code I see that if bit 3 is set, it is supposed to raise an error and tell me to use `Zip::File`. However, when I try to use `Zip::File` I get the same `Entry` returned, which cannot be read.

## `get_input_stream` relies on reading local headers (which is THE issue here)

In general, relying on local headers for reading the entry body when you already have the central directory is madness. The local header for an entry with a data descriptor is forward-declared, and is very likely not to contain Zip64 extras, because when the entry was being written the writing code probably did not yet know the size of the file, nor the size of the compressed file data. The fact that `get_input_stream` wants to read the local entry header, and potentially even clobbers the entry read from the central directory, makes it pretty much impossible to do anything with data descriptors reliably: if you use the forward-declared "dummy" local header, the only way to figure out where the file ends is to scan ahead until you stumble upon the signature of the data descriptor. Which will never happen to occur within a stored entry that you are reading, because people never put ZIPs within other ZIPs, right?
## `@local_header_size` computed incorrectly (well, not computed at all)

When I try to open a file with data descriptors and raise-inspect the entries that `Zip::File.foreach()` gives me, the entries have their local header size set to zero. No wonder that trying to read them yields odd results: the code probably thinks that the file data starts right at the offset of the local header.

## How to solve it
The only practical way to solve this, IMO, is to get rid of reading local headers altogether, because that way lies madness. Additionally, there are many situations where the local header can be written without the Zip64 extra fields, but since the file is located beyond the Zip64 span of the archive body you have to write out the Zip64 central directory and the extra field anyway; the extra field will then only be present in the central directory. So the only way I can see this being solvable is reading the entry metadata once, from the central directory, and using that as the authoritative data. It also provides the exact offsets where the actual entry data can be found.
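The central-directory-first approach described above can be sketched roughly as follows. This is a hypothetical reader, not rubyzip's API, and it deliberately omits Zip64 and archive-comment handling for brevity:

```ruby
require 'stringio'

# Sketch: the central directory is the authoritative source for names,
# CRCs and sizes. The local header is consulted ONLY for its name/extra
# lengths, to locate where the file data actually starts.
EOCD_SIG    = 0x06054b50
CENTRAL_SIG = 0x02014b50

CentralEntry = Struct.new(:name, :crc, :compressed_size,
                          :uncompressed_size, :data_offset)

def read_central_entries(io)
  io.seek(-22, IO::SEEK_END)            # fixed-size EOCD, no comment assumed
  sig, _, _, _, total, _cd_size, cd_offset, _ = io.read(22).unpack('VvvvvVVv')
  raise 'EOCD record not found' unless sig == EOCD_SIG

  io.seek(cd_offset)
  Array.new(total) do
    header = io.read(46)
    raise 'bad central header' unless header.unpack1('V') == CENTRAL_SIG
    crc, csize, usize = header[16, 12].unpack('VVV')
    name_len, extra_len, comment_len = header[28, 6].unpack('vvv')
    lho = header[42, 4].unpack1('V')    # local header offset
    name = io.read(name_len)
    io.read(extra_len + comment_len)    # skip central extra field + comment
    next_pos = io.pos

    # Peek at the local header ONLY for its name/extra lengths; with GP
    # bit 3 set, its CRC and sizes are zeroed and must not be trusted.
    io.seek(lho + 26)
    l_name_len, l_extra_len = io.read(4).unpack('vv')
    entry = CentralEntry.new(name, crc, csize, usize,
                             lho + 30 + l_name_len + l_extra_len)
    io.seek(next_pos)
    entry
  end
end
```

Note that even here the local header's fixed portion is read once per entry, but purely to find the data offset; none of its sizes ever overwrite the central directory values.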
This is also the approach taken by https://github.com/thejoshwolfe/yauzl - its author says that combining the local headers and the central directory information is insane and basically unsolvable.
However, obviously, fixing it that way will not be 100% backwards-compatible, because some APIs will have to be removed. So I can tackle this, but we'll have to break a few eggs in the process.
Curious about your opinions.
An example file that we write with our library can be found here: https://we.tl/IChJczPVYr
Opens perfectly fine with all the tools I tested it with, except... Rubyzip.