Skip to content

MultiFormatWriter: fix encoding binary std::string #599

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

markusfisch
Copy link
Contributor

While MultiFormatWriter::encode(std::wstring) can encode binary data just fine, MultiFormatWriter::encode(std::string) corrupts the data by trying to interpret it as UTF-8.

This may also break immediately because not all byte combinations are allowed in UTF-8.

So when _encoding is set to CharacterSet::BINARY, the std::string needs to be converted to a std::wstring of unsigned values, because TextEncoder::GetBytes(std::wstring) calls ToUtf8() which cannot handle negative values.

For example, without this commit, try:

auto writer = MultiFormatWriter(BarcodeFormat::QRCode)
	.setEncoding(CharacterSet::BINARY).
auto bitmap = writer.encode(std::string("\x7e\x7f\x80\x81"), 200, 200);

Which will result in: "ValueError: Unexpected charcode".

For a QRCode, MultiFormatWriter::encode(std::string("\x7e\x7f\x80\x81")) will run:

  1. MultiFormatWriter::encode(std::wstring)
  2. Writer::encode(std::wstring)
  3. QREncoder::Append8BitBytes(std::wstring)
  4. TextEncoder::FromUnicode(std::wstring)
  5. TextEncoder::GetBytes(std::wstring) <-- here it will call ToUtf8(str)
  6. TextEncoder::GetBytes(std::string) where str will now be 0x7e, 0x7f, 0xff, 0xbf, 0xbe, 0x80, 0xff, 0xbf, 0xbe, 0x81

Please note ZXingWriter is NOT affected by this!
This works just fine:

$ printf "\\x7e\\x7f\\x80\\x81" > file
$ example/ZXingWriter -binary QRCode file out.png

Because:

  1. it's calling MultiFormatWriter::encode(std::wstring) and
  2. builds the std::wstring from uint8_t

While `MultiFormatWriter::encode(std::wstring)` can encode binary data
just fine, `MultiFormatWriter::encode(std::string)` corrupts the data
by trying to interpret it as UTF-8.

This may also break immediately because not all byte combinations are
allowed in UTF-8.

So when `_encoding` is set to `CharacterSet::BINARY`, the `std::string`
needs to be converted to a `std::wstring` of unsigned values, because
`TextEncoder::GetBytes(std::wstring)` calls `ToUtf8()` which cannot
handle negative values.

For example, without this commit, try:

	auto writer = MultiFormatWriter(BarcodeFormat::QRCode)
		.setEncoding(CharacterSet::BINARY).
	auto bitmap = writer.encode(std::string("\x7e\x7f\x80\x81"), 200, 200);

Which will result in: "ValueError: Unexpected charcode".

Please note ZXingWriter is NOT affected by this!
This works just fine:
	$ printf "\\x7e\\x7f\\x80\\x81" > file
	$ example/ZXingWriter -binary QRCode file out.png
Because:
1) it's calling `MultiFormatWriter::encode(std::wstring)` and
2) builds the `std::wstring` from `uint8_t`
@axxel
Copy link
Collaborator

axxel commented Aug 5, 2023

Your fix/workaround is exactly what I did in ZXingReader and also suggested in #557 (comment). I see that there is a point to be made to make this more convenient but for one, your change is technically breaking the API (encode(std::string) takes UTF8 input) and it is also still not self-explanatory and a bit of a "hack" given the requirement to pass BINARY as the encoding character set. I'd like to make sure this use case is properly handled in the new API (see #332 (comment)) but am wondering if your problem could not also be solved by simply doing this on the client side (see the two examples above) until a proper solution is available?

@markusfisch
Copy link
Contributor Author

Okay, I look forward to the new API then 😉

@markusfisch markusfisch closed this Aug 7, 2023
@markusfisch
Copy link
Contributor Author

@axxel
About the new API for encoding binary data, we have just stumbled across this very problem again 😬 We need to generate a DataMatrix code that holds binary data (for a ICAO Visible Digital Seal).

DataMatrix seems to especially problematic here, probably because of this transformation.

What surprised me a little is that even UTF-8 sequences apparently cannot be encoded at the moment (at least with the stock ZXingWriter from example):

$ ./ZXingWriter DataMatrix "This is test 😁" test.png       
Unexpected charcode

Here's a little sample for trying ZXingWriter with binary data:

$ for ((I=0; I < 256; ++I)); do printf "%02x" "$I"; done | xxd -p -r > 0to255
$ ./ZXingWriter -binary DataMatrix 0to255 test.png
Unexpected charcode

On the other hand, QRCode works:

$ ./ZXingWriter -binary QRCode 0to255 test.png

So before taking action I thought it would be wise to check your opinion on that.

@axxel
Copy link
Collaborator

axxel commented Oct 6, 2023

DataMatrix seems to especially problematic here, probably because of this transformation.

Exactly. This has indeed come up already earlier this year (see here).

So before taking action I thought it would be wise to check your opinion on that.

Very wise ;). I have to admit that my progress on the new Writer API has been stalled recently. I have an early prototypish hack lying around but there was a lack of time and there are still quite a few open questions regarding the types used in the API (as mentioned in #332)

If you need this 'now' I see two options:

  1. you find a quick (and dirty?) hack to make the existing DataMatrix encoder work with binary data.
  2. I try to throw something out there behind the EXPERIMENTAL build option soonish and you can be the first to alpha-test it ;)

@markusfisch
Copy link
Contributor Author

Thanks for the quick reply!

I think I'll try to hack something together then as it is a bit urgent, but of course, I would always happily be the first to alpha-test your solution! 😉

If you maybe already have a concrete idea, please tell me so 😉

@markusfisch
Copy link
Contributor Author

@axxel
Here's what I did to make it work in case you want to merge it: #628

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants