Clarify base64.a85(en,de)code documentation for Adobe mode #134837

Open
dhdaines opened this issue May 28, 2025 · 7 comments
Labels
docs Documentation in the Doc dir

Comments

@dhdaines

dhdaines commented May 28, 2025

Bug report

Bug description:

It seems that whitespace is allowed everywhere by base64.a85decode, except after the end-of-data delimiter b'~>' in adobe mode:

>>> base64.a85decode(b"6#q'\\F`JTK<-N74;eT`QF!;`!@:O(oDf,~>", adobe=True)
b'Arthur "Two-Sheds" Jackson'
>>> base64.a85decode(b"  6  # q' \\     F`JTK<-N 7 4 ;eT`QF!;`!@:O(oDf,~>", adobe=True)
b'Arthur "Two-Sheds" Jackson'
>>> base64.a85decode(b"  6  # q' \\     F`JTK<-N 7 4 ;eT`QF!;`!@:O(oDf,  ")
b'Arthur "Two-Sheds" Jackson'
>>> base64.a85decode(b"  6  # q' \\     F`JTK<-N 7 4 ;eT`QF!;`!@:O(oDf,~>  ", adobe=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.11/base64.py", line 388, in a85decode
    raise ValueError(
ValueError: Ascii85 encoded byte sequences must end with b'~>'

While this behaviour is actually compliant with the very latest PDF standard, including errata, in practice it's quite surprising. It also causes problems because of the legacy of decades of ambiguous PDF standards, and of implementations that emit and accept extra whitespace due to these ambiguities.

A separate but related issue is that some very broken PDF implementations have even been known to insert whitespace between the ~ and > bytes. It may be useful for "Adobe" mode to be tolerant of this as well.

Obviously, also, PostScript doesn't care about extra whitespace after ~> in ASCII85 literal strings. (Note that the leading <~ is only accepted in PostScript and not in PDF).

Because > is a valid ASCII85 digit, an improved rule would be to only accept the regular expression ~\s*>\s* at the end of input in Adobe mode.
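
The proposed rule can be sketched outside the standard library as a small wrapper. This is only an illustration of the idea, not a patch to base64: the helper name a85decode_tolerant is hypothetical, and it assumes that normalizing the end-of-data marker before calling base64.a85decode is acceptable for the caller.

```python
import base64
import re

# Hypothetical helper, not part of the base64 module.
# Matches the end-of-data marker with optional whitespace between and after
# the two bytes, anchored at the very end of the input.
_EOD = re.compile(rb"~\s*>\s*\Z")


def a85decode_tolerant(data: bytes) -> bytes:
    """Decode Ascii85 whose b'~>' marker may contain or be followed by whitespace."""
    normalized, count = _EOD.subn(b"~>", data)
    if count == 0:
        raise ValueError("Ascii85 data must end with b'~>' (whitespace allowed)")
    # adobe=True strips the trailing b'~>' and an optional leading b'<~'.
    return base64.a85decode(normalized, adobe=True)


payload = b'Arthur "Two-Sheds" Jackson'
encoded = base64.a85encode(payload)
print(a85decode_tolerant(encoded + b"~>  \n"))  # trailing whitespace
print(a85decode_tolerant(encoded + b"~ \n>"))   # whitespace inside the marker
```

Since ~ is not itself an ASCII85 digit, anchoring the pattern at the end of input avoids mistaking a > inside the data for part of the delimiter.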

CPython versions tested on:

3.11

Operating systems tested on:

Linux

@dhdaines dhdaines added the type-bug An unexpected behavior, bug, or error label May 28, 2025
@emmatyping emmatyping added type-feature A feature request or enhancement stdlib Python modules in the Lib dir and removed type-bug An unexpected behavior, bug, or error labels May 28, 2025
@emmatyping
Member

emmatyping commented May 28, 2025

If you need to be more permissive about whitespace, could you call .rstrip() on the input to the decoder? Or otherwise replace whitespace in the input?

I changed this to a feature since the decoder is standard compliant, and you're asking for a behavior change, but even then I'm not sure if this is something we should make more flexible if there is an easy solution for users that want flexibility around whitespace.
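
A minimal sketch of the .rstrip() suggestion, for readers who want to see it spelled out. The wrapper name a85decode_rstrip is made up for this example. Note that it only handles whitespace after the marker, not whitespace between ~ and >.

```python
import base64


def a85decode_rstrip(data: bytes) -> bytes:
    """Decode Adobe-mode Ascii85, ignoring whitespace after the b'~>' marker."""
    # bytes.rstrip() with no argument removes trailing ASCII whitespace.
    return base64.a85decode(data.rstrip(), adobe=True)


payload = b'Arthur "Two-Sheds" Jackson'
# Unframed encoding plus the end-of-data marker and some stray whitespace.
encoded = base64.a85encode(payload) + b"~>  \n"
print(a85decode_rstrip(encoded))
```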

@dhdaines
Author

dhdaines commented May 29, 2025

> If you need to be more permissive about whitespace, could you call .rstrip() on the input to the decoder? Or otherwise replace whitespace in the input?

Yes, that works! In practice I do re.sub with ~\s*>\s*$: https://github.com/dhdaines/playa/blob/main/playa/ascii85.py#L8

> I changed this to a feature since the decoder is standard compliant, and you're asking for a behavior change, but even then I'm not sure if this is something we should make more flexible if there is an easy solution for users that want flexibility around whitespace.

In the end, I think it should simply be a documentation change, to make it explicit that adobe=True will throw a ValueError on trailing whitespace. As you say, it is standard compliant, and also changing the behaviour could create all sorts of confusion.

I can make a PR for this.

@dhdaines
Author

dhdaines commented May 29, 2025

Actually there are a few things to be improved in the documentation:

  • ASCII85 is formally defined in both the PostScript Language Reference and the PDF standard (ISO 32000-2).
  • As mentioned above, PDF and PostScript do not agree on delimiters, as the opening <~ is in PostScript but not PDF. This also means that the behaviour of a85encode in "Adobe mode" is not standards-compliant for PDF.
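
To illustrate the second point: with adobe=True, a85encode frames its output with both <~ and ~>. A sketch of producing PDF-style output instead, assuming (as stated above) that PDF wants only the trailing ~>; the helper name a85encode_pdf is hypothetical:

```python
import base64


def a85encode_pdf(data: bytes) -> bytes:
    """Encode with a trailing b'~>' only, omitting the PostScript-style b'<~'."""
    # Encode without Adobe framing, then append the end-of-data marker by hand.
    return base64.a85encode(data) + b"~>"


payload = b'Arthur "Two-Sheds" Jackson'
print(base64.a85encode(payload, adobe=True))  # framed with b'<~' and b'~>'
print(a85encode_pdf(payload))                 # trailing b'~>' only
```

The decoder in Adobe mode treats the leading <~ as optional, so both forms round-trip through base64.a85decode(..., adobe=True).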

@kevinveenbirkenbach

Here is a quick solution to repair the broken pdfs: https://github.com/kevinveenbirkenbach/pdf-healer

@emmatyping emmatyping added docs Documentation in the Doc dir and removed type-feature A feature request or enhancement labels May 31, 2025
@emmatyping emmatyping changed the title from "base64.a85decode throws exception on trailing whitespace in Adobe mode" to "Clarify base64.a85(en,de)code documentation for Adobe mode" May 31, 2025
@emmatyping
Member

emmatyping commented May 31, 2025

I changed this to a docs issue since it sounds like the work that needs to be done here is mostly around documenting the semantic differences when using adobe mode and expand on what limitations it enforces.

@dhdaines would you be interested in making a PR to expand the documentation? https://devguide.python.org/documentation/start-documenting/

@emmatyping emmatyping removed the stdlib Python modules in the Lib dir label Jun 1, 2025
@dhdaines
Author

dhdaines commented Jun 2, 2025

> I changed this to a docs issue since it sounds like the work that needs to be done here is mostly around documenting the semantic differences when using adobe mode and expand on what limitations it enforces.
>
> @dhdaines would you be interested in making a PR to expand the documentation? https://devguide.python.org/documentation/start-documenting/

Absolutely! I already started one, will submit it in the next few days.

@dhdaines
Author

dhdaines commented Jun 2, 2025

> Here is a quick solution to repair the broken pdfs: https://github.com/kevinveenbirkenbach/pdf-healer

Interesting, didn't realize it was such a widespread issue! The bug has been fixed for a while in (shameless plug) PLAYA-PDF and also more recently in pdfminer.six.
