Skip to content

Don't apply any text encoding to binary input data #557

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

vkrause
Copy link
Contributor

@vkrause vkrause commented Apr 25, 2023

This fixes barcode generation for arbitrary binary input failing entirely or producing incomplete/incorrect results.

Observed for example with Hungarian domestic railway tickets (zlib compressed data in PDF417).

@axxel
Copy link
Collaborator

axxel commented Apr 26, 2023

I have not looked into why this CI test fails. Have you?

@vkrause vkrause force-pushed the work/vkrause/dont-apply-text-encoding-to-binary-data branch from 2bc0afe to bd3dba0 Compare April 26, 2023 15:48
@vkrause
Copy link
Contributor Author

vkrause commented Apr 26, 2023

I have not looked into why this CI test fails. Have you?

Should hopefully be fixed now, the decoder side and the test data were both also applying UTF-8 for BINARY data.

@axxel
Copy link
Collaborator

axxel commented Apr 26, 2023

First I thought, there might very well be something off with the encoder because I did not spend too much time and thought in this area. Now that I see what you changed on the decoding side I believe I understand the issue. My current mental model says that the code is correct as it is. The problem is that you used the PDFEncoder incorrectly.

For the decoder, the API is defined to return UTF8 encoded bits in the Result::text() property, independent of the ECI that was used inside the symbol. If you need the raw bytes, use Result::bytes(). The encoding side does not have an equivalent of the raw bytes channel, the only API available takes a UTF8 encoded string as input. That means if you wish to store a number of random bytes inside the symbol, you'd need to convert them to UTF8, then pass that to encode which internally will recode them into whatever the characterset is specified (or a default). I have not checked if that actually works this way currently.

I understand that this is not very user friendly and also counter intuitive but your change would break the API as specified. We could add some kind of encodeBytes function that does what you'd like. It might also be worth checking out how @gitlost serves this use case in libzint. In that regard it is also related to the still open #332.

@vkrause
Copy link
Contributor Author

vkrause commented Apr 26, 2023

For reference, that's the user code in question: https://invent.kde.org/frameworks/prison/-/blob/master/src/lib/pdf417barcode.cpp - this used to work for binary content pre-2.0 as-is.

I'll try if applying UTF-8 encoding to the binary data works, but that sounds very wrong and error-prone.

@vkrause
Copy link
Contributor Author

vkrause commented Apr 26, 2023

I'll try if applying UTF-8 encoding to the binary data works, but that sounds very wrong and error-prone.

No luck so far, I either end up with an exception going back to an zueci_utf8_to_eci error, or I get an incomplete output.

@axxel
Copy link
Collaborator

axxel commented Apr 26, 2023

I'll try if applying UTF-8 encoding to the binary data works, but that sounds very wrong and error-prone.

No luck so far, I either end up with an exception going back to an zueci_utf8_to_eci error, or I get an incomplete output.

I would believe you would need to use the encode(std::string) overload and then basically do what zueci_encode_utf8 does. If that does not work I'm either missing something or there is a bug.

I checked how zint handles this. The library treats every input as UTF-8, except when the --binary switch is present. I think that is basically equivalent to my encodeBytes() idea above.

I could personally also live with a 'hack' that is somewhat similar to your approach above with the difference, that it would only be triggered if there is an explicit use of CharacterSet::BINARY. The way the default is set right now, your code would always be triggered.

This fixes barcode generation for arbitrary binary input failing entirely
or producing incomplete/incorrect results.

Observed for example with Hungarian domestic railway tickets (zlib
compressed data in PDF417).
@vkrause vkrause force-pushed the work/vkrause/dont-apply-text-encoding-to-binary-data branch from bd3dba0 to cdc8153 Compare April 27, 2023 15:15
@vkrause
Copy link
Contributor Author

vkrause commented Apr 27, 2023

Using the std::string overload of encode() seems to truncate at the first null byte (in FromUtf8() in MultiFormatWriter::encode). Using Qt's UTF-8 codec instead gets through null bytes but also doesn't produce something that survives the roundtrip.

Bypassing this by using the std::wstring overload then results in ToUtf8() in TextEncoder::GetBytes() messing things up.

Updated the PR to a more minimal version that only applies to BINARY mode.

And yes, I agree that an explicit API for binary data would probably be best.

@axxel
Copy link
Collaborator

axxel commented Apr 28, 2023

In my above mentioned mental model the TextDecoder must not be modified. To make sure this would actually work in the end I started digging into the current implementation trying to really understand where exactly your use case breaks. And now I believe the way you use the writer in your above linked user code should indeed work. Would it be possible for you to provide a (smallish) example of your binary input that actually fails your round trip test?

@vkrause
Copy link
Contributor Author

vkrause commented Apr 28, 2023

Correct, my problem is also fixed without the TextDecoder changes, those are only needed to restore the symmetry expected by the unit test.

I don't have a minimized test case unfortunately, but I'll send you the full 202 byte sequence that fails here.

@axxel
Copy link
Collaborator

axxel commented Apr 28, 2023

Thanks for the data. I think I have an idea what is going wrong. Your user code std::copy(b.begin(), b.end(), std::back_inserter(input)); casts a signed char to a signed int (on a Linux machine), so the value 0x8b becomes 0xffffff8b instead of simply 0x8b. This leads to invalid unicode codepoints and results in the error in the zueci code. If you do the following instead:

		for (uint8_t c : b)
			input.push_back(c);

it should work, at least the bailing out of the zueci code stops.

@vkrause
Copy link
Contributor Author

vkrause commented Apr 28, 2023

No luck with that so far, still the same "Unexpected charcode" exception. However, this does make a difference in the input data, now each character is indeed between 0 and 255 as one would expect, checking the same on the previous code fails (contained chars < 0, as you pointed out). So that's definitely a step in the right direction.

I'll do some more checking tomorrow to be sure we don't mess up the input data anywhere else.

@axxel
Copy link
Collaborator

axxel commented Apr 28, 2023

No luck with that so far, still the same "Unexpected charcode" exception.

I know what is missing (forgot to mention it): you need to explicitly specify the character set:

		writer.setEncoding(CharacterSet::BINARY);

@vkrause
Copy link
Contributor Author

vkrause commented Apr 28, 2023

Ah, that works!

Thank you very much for the help, and sorry for (again) having bothered you with what turned out to be a bug on our side!

@vkrause vkrause closed this Apr 28, 2023
kdesysadmin pushed a commit to KDE/prison that referenced this pull request Apr 28, 2023
There's two issues here:
- The QByteArray to std::wstring conversion has to use unsigned bytes, the
previous code did an implicit conversion to int producing wrong results for
value above 127.
- We now have to explicitly specify we are inputting binary data.

See also zxing-cpp/zxing-cpp#557.

This fixes for example domestic Hungarian railway ticket barcodes not
being shown in KDE Itinerary anymore.
kdesysadmin pushed a commit to KDE/prison that referenced this pull request May 1, 2023
There's two issues here:
- The QByteArray to std::wstring conversion has to use unsigned bytes, the
previous code did an implicit conversion to int producing wrong results for
value above 127.
- We now have to explicitly specify we are inputting binary data.

See also zxing-cpp/zxing-cpp#557.

This fixes for example domestic Hungarian railway ticket barcodes not
being shown in KDE Itinerary anymore.
kdesysadmin pushed a commit to KDE/prison that referenced this pull request May 2, 2023
There's two issues here:
- The QByteArray to std::wstring conversion has to use unsigned bytes, the
previous code did an implicit conversion to int producing wrong results for
value above 127.
- We now have to explicitly specify we are inputting binary data.

See also zxing-cpp/zxing-cpp#557.

This fixes for example domestic Hungarian railway ticket barcodes not
being shown in KDE Itinerary anymore.

(cherry picked from commit 6b234bc)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants