Don't apply any text encoding to binary input data #557

vkrause · 2023-04-25T15:14:56Z

This fixes barcode generation for arbitrary binary input failing entirely or producing incomplete/incorrect results.

Observed for example with Hungarian domestic railway tickets (zlib compressed data in PDF417).

axxel · 2023-04-26T09:27:56Z

I have not looked into why this CI test fails. Have you?

vkrause · 2023-04-26T15:49:16Z

I have not looked into why this CI test fails. Have you?

Should hopefully be fixed now, the decoder side and the test data were both also applying UTF-8 for BINARY data.

axxel · 2023-04-26T18:16:50Z

First I thought, there might very well be something off with the encoder because I did not spend too much time and thought in this area. Now that I see what you changed on the decoding side I believe I understand the issue. My current mental model says that the code is correct as it is. The problem is that you used the PDFEncoder incorrectly.

For the decoder, the API is defined to return UTF8 encoded bits in the Result::text() property, independent of the ECI that was used inside the symbol. If you need the raw bytes, use Result::bytes(). The encoding side does not have an equivalent of the raw bytes channel, the only API available takes a UTF8 encoded string as input. That means if you wish to store a number of random bytes inside the symbol, you'd need to convert them to UTF8, then pass that to encode which internally will recode them into whatever the characterset is specified (or a default). I have not checked if that actually works this way currently.

I understand that this is not very user friendly and also counter intuitive but your change would break the API as specified. We could add some kind of encodeBytes function that does what you'd like. It might also be worth checking out how @gitlost serves this use case in libzint. In that regard it is also related to the still open #332.

vkrause · 2023-04-26T19:46:39Z

For reference, that's the user code in question: https://invent.kde.org/frameworks/prison/-/blob/master/src/lib/pdf417barcode.cpp - this used to work for binary content pre-2.0 as-is.

I'll try if applying UTF-8 encoding to the binary data works, but that sounds very wrong and error-prone.

vkrause · 2023-04-26T20:36:17Z

I'll try if applying UTF-8 encoding to the binary data works, but that sounds very wrong and error-prone.

No luck so far, I either end up with an exception going back to an zueci_utf8_to_eci error, or I get an incomplete output.

axxel · 2023-04-26T21:55:54Z

I'll try if applying UTF-8 encoding to the binary data works, but that sounds very wrong and error-prone.

No luck so far, I either end up with an exception going back to an zueci_utf8_to_eci error, or I get an incomplete output.

I would believe you would need to use the encode(std::string) overload and then basically do what zueci_encode_utf8 does. If that does not work I'm either missing something or there is a bug.

I checked how zint handles this. The library treats every input as UTF-8, except when the --binary switch is present. I think that is basically equivalent to my encodeBytes() idea above.

I could personally also live with a 'hack' that is somewhat similar to your approach above with the difference, that it would only be triggered if there is an explicit use of CharacterSet::BINARY. The way the default is set right now, your code would always be triggered.

This fixes barcode generation for arbitrary binary input failing entirely or producing incomplete/incorrect results. Observed for example with Hungarian domestic railway tickets (zlib compressed data in PDF417).

vkrause · 2023-04-27T15:17:48Z

Using the std::string overload of encode() seems to truncate at the first null byte (in FromUtf8() in MultiFormatWriter::encode). Using Qt's UTF-8 codec instead gets through null bytes but also doesn't produce something that survives the roundtrip.

Bypassing this by using the std::wstring overload then results in ToUtf8() in TextEncoder::GetBytes() messing things up.

Updated the PR to a more minimal version that only applies to BINARY mode.

And yes, I agree that an explicit API for binary data would probably be best.

axxel · 2023-04-28T09:38:18Z

In my above mentioned mental model the TextDecoder must not be modified. To make sure this would actually work in the end I started digging into the current implementation trying to really understand where exactly your use case breaks. And now I believe the way you use the writer in your above linked user code should indeed work. Would it be possible for you to provide a (smallish) example of your binary input that actually fails your round trip test?

vkrause · 2023-04-28T15:45:08Z

Correct, my problem is also fixed without the TextDecoder changes, those are only needed to restore the symmetry expected by the unit test.

I don't have a minimized test case unfortunately, but I'll send you the full 202 byte sequence that fails here.

axxel · 2023-04-28T17:19:26Z

Thanks for the data. I think I have an idea what is going wrong. Your user code std::copy(b.begin(), b.end(), std::back_inserter(input)); casts a signed char to a signed int (on a Linux machine), so the value 0x8b becomes 0xffffff8b instead of simply 0x8b. This leads to invalid unicode codepoints and results in the error in the zueci code. If you do the following instead:

		for (uint8_t c : b)
			input.push_back(c);

it should work, at least the bailing out of the zueci code stops.

vkrause · 2023-04-28T18:36:28Z

No luck with that so far, still the same "Unexpected charcode" exception. However, this does make a difference in the input data, now each character is indeed between 0 and 255 as one would expect, checking the same on the previous code fails (contained chars < 0, as you pointed out). So that's definitely a step in the right direction.

I'll do some more checking tomorrow to be sure we don't mess up the input data anywhere else.

axxel · 2023-04-28T19:19:47Z

No luck with that so far, still the same "Unexpected charcode" exception.

I know what is missing (forgot to mention it): you need to explicitly specify the character set:

		writer.setEncoding(CharacterSet::BINARY);

vkrause · 2023-04-28T19:31:14Z

Ah, that works!

Thank you very much for the help, and sorry for (again) having bothered you with what turned out to be a bug on our side!

There's two issues here: - The QByteArray to std::wstring conversion has to use unsigned bytes, the previous code did an implicit conversion to int producing wrong results for value above 127. - We now have to explicitly specify we are inputting binary data. See also zxing-cpp/zxing-cpp#557. This fixes for example domestic Hungarian railway ticket barcodes not being shown in KDE Itinerary anymore.

There's two issues here: - The QByteArray to std::wstring conversion has to use unsigned bytes, the previous code did an implicit conversion to int producing wrong results for value above 127. - We now have to explicitly specify we are inputting binary data. See also zxing-cpp/zxing-cpp#557. This fixes for example domestic Hungarian railway ticket barcodes not being shown in KDE Itinerary anymore. (cherry picked from commit 6b234bc)

vkrause force-pushed the work/vkrause/dont-apply-text-encoding-to-binary-data branch from 2bc0afe to bd3dba0 Compare April 26, 2023 15:48

Don't apply any text encoding to binary input data

cdc8153

This fixes barcode generation for arbitrary binary input failing entirely or producing incomplete/incorrect results. Observed for example with Hungarian domestic railway tickets (zlib compressed data in PDF417).

vkrause force-pushed the work/vkrause/dont-apply-text-encoding-to-binary-data branch from bd3dba0 to cdc8153 Compare April 27, 2023 15:15

vkrause closed this Apr 28, 2023

This was referenced Aug 5, 2023

Discussion about the future of the barcode generation / writer part of this project #332

Open

MultiFormatWriter: fix encoding binary std::string #599

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Don't apply any text encoding to binary input data #557

Don't apply any text encoding to binary input data #557

Uh oh!

vkrause commented Apr 25, 2023

Uh oh!

axxel commented Apr 26, 2023

Uh oh!

vkrause commented Apr 26, 2023

Uh oh!

axxel commented Apr 26, 2023

Uh oh!

vkrause commented Apr 26, 2023

Uh oh!

vkrause commented Apr 26, 2023

Uh oh!

axxel commented Apr 26, 2023

Uh oh!

vkrause commented Apr 27, 2023

Uh oh!

axxel commented Apr 28, 2023

Uh oh!

vkrause commented Apr 28, 2023

Uh oh!

axxel commented Apr 28, 2023

Uh oh!

vkrause commented Apr 28, 2023

Uh oh!

axxel commented Apr 28, 2023

Uh oh!

vkrause commented Apr 28, 2023

Uh oh!

Uh oh!

Uh oh!

Don't apply any text encoding to binary input data #557

Don't apply any text encoding to binary input data #557

Uh oh!

Conversation

vkrause commented Apr 25, 2023

Uh oh!

axxel commented Apr 26, 2023

Uh oh!

vkrause commented Apr 26, 2023

Uh oh!

axxel commented Apr 26, 2023

Uh oh!

vkrause commented Apr 26, 2023

Uh oh!

vkrause commented Apr 26, 2023

Uh oh!

axxel commented Apr 26, 2023

Uh oh!

vkrause commented Apr 27, 2023

Uh oh!

axxel commented Apr 28, 2023

Uh oh!

vkrause commented Apr 28, 2023

Uh oh!

axxel commented Apr 28, 2023

Uh oh!

vkrause commented Apr 28, 2023

Uh oh!

axxel commented Apr 28, 2023

Uh oh!

vkrause commented Apr 28, 2023

Uh oh!

Uh oh!