QuLogic commented Sep 4, 2025

PR summary

I extracted this out of #30512 because it was causing issues with the pre-loading of test images. I may update this as/when I find more redundant tests.

  • test_backend_ps::test_type3_font is covered by test_backend_ps::test_multi_font_type3
  • test_text::test_pdf_chars_beyond_bmp is covered by test_backend_pdf::test_multi_font_type3 and test_backend_pdf::test_multi_font_type42
  • test_text::test_pdf_kerning is covered by test_backend_pdf::test_kerning
  • test_text::test_pdf_type42_kerning is covered by test_backend_pdf::test_kerning

PR checklist

With libraqm, string layout produces glyph indices, not character codes,
and font features may even produce different glyphs for the same
character code (e.g., by picking a different Stylistic Set). Thus we
cannot rely on character codes as unique items within a font, and must
move toward glyph indices everywhere.

Currently, we split text into single-byte chunks and multi-byte glyphs,
then iterate through the single-byte chunks for output, followed by the
multi-byte glyphs.

Instead, output the single-byte chunks as we finish them, then do the
multi-byte glyphs at the end.
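
The reordering can be pictured with a minimal sketch. `layout_output` and its `limit` parameter are hypothetical names used only for illustration, not the backend's actual code, and the sketch assumes a character counts as single-byte when its code is below `limit`.

```python
def layout_output(chars, limit=256):
    """Return output operations for *chars* in emission order.

    Hypothetical helper: characters with a code below *limit* are grouped
    into chunks and emitted as soon as each chunk ends; everything else is
    deferred and emitted after all chunks.
    """
    ops = []       # operations in the order they would be written
    deferred = []  # multi-byte glyphs, written at the very end
    chunk = []     # single-byte characters collected so far

    for index, char in enumerate(chars):
        if ord(char) < limit:
            chunk.append(char)
            continue
        # A multi-byte glyph ends the current chunk, so flush it now.
        if chunk:
            ops.append(("chunk", "".join(chunk)))
            chunk = []
        deferred.append(("glyph", index, char))

    # Flush any trailing chunk, then the deferred glyphs.
    if chunk:
        ops.append(("chunk", "".join(chunk)))
    ops.extend(deferred)
    return ops
```

Each chunk is appended the moment it is complete, so no second pass over the chunks is needed; only the deferred glyphs are handled after the loop.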

For a Type 3 font, its encoding is entirely defined by its `Encoding`
dictionary (which we create), so there's no reason to use a specific
encoding like `cp1252`. Instead, switch to Latin-1, which corresponds
exactly to the first 256 character codes in Unicode, and can be mapped
directly with `ord`.
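
The difference between the two encodings can be checked directly in Python. This is a small illustration, not code from the backend; `type3_char_code` is a hypothetical helper name.

```python
def type3_char_code(char):
    """Return the Latin-1 code for *char* (hypothetical helper)."""
    code = ord(char)
    if code > 0xFF:
        raise ValueError(f"{char!r} is outside Latin-1")
    return code


# Latin-1 bytes and Unicode code points agree for all 256 values.
latin1_matches = all(
    bytes([i]).decode("latin-1") == chr(i) for i in range(256)
)

# cp1252 does not: the euro sign is byte 0x80 there, but U+20AC in Unicode,
# so `ord` alone cannot produce the cp1252 byte.
euro_cp1252 = "\u20ac".encode("cp1252")[0]
euro_codepoint = ord("\u20ac")
```

With Latin-1, no lookup table is needed: the character code written to the file is just `ord(char)`.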

By tracking both character codes and glyph indices, we can handle
producing multiple font subsets if needed by a file format.

For character codes outside the embedded font limits (256 for type 3 and
65536 for type 42), we currently output the glyphs as XObjects instead
of using text commands. But nothing in the PDF spec requires any
specific encoding like this.

Since we now support subsetting all fonts before embedding, split each
font into groups based on the maximum character code (e.g., 256-entry
groups for type 3), then switch text strings to a different font subset
and re-map character codes to it when necessary.
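
One way to picture the split is the sketch below. It assumes the simplest possible scheme, in which a character's subset is its code divided by the subset size and its re-mapped code is the remainder; the backend's actual assignment may differ. `split_into_subset_runs` and `subset_size` are hypothetical names.

```python
from itertools import groupby


def split_into_subset_runs(text, subset_size=256):
    """Split *text* into runs that each use a single font subset.

    Hypothetical scheme: subset = code // subset_size and the re-mapped
    character code = code % subset_size. Returns a list of
    (subset_index, [remapped_codes]) tuples in text order.
    """
    mapped = [divmod(ord(char), subset_size) for char in text]
    return [
        (subset, [code for _, code in run])
        for subset, run in groupby(mapped, key=lambda pair: pair[0])
    ]


# "a", "é" and "b" fit in subset 0; U+4E2D lands in subset 0x4E (78)
# with re-mapped code 0x2D (45).
runs = split_into_subset_runs("a\u00e9\u4e2db")
```

Every run can then be written with ordinary text commands after selecting the matching font subset, which is why no XObjects are needed.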

This means all text is true text (albeit with some strange encoding),
and we no longer need any XObjects for glyphs. For users of non-English
text, this means it will become selectable and copyable again.

Fixes matplotlib#21797

For Type 3 fonts, add a `ToUnicode` mapping (which was added in PDF
1.2), and for Type 42 fonts, correct the Unicode encoding, which should
be UTF-16BE, not UCS-2.
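
The distinction only matters beyond the BMP, where UTF-16BE uses a surrogate pair and UCS-2 has no representation at all. A short illustration, with `to_unicode_hex` as a hypothetical helper name rather than the backend's function:

```python
def to_unicode_hex(char):
    """Return *char* as an uppercase UTF-16BE hex string (hypothetical helper)."""
    return char.encode("utf-16-be").hex().upper()


bmp_value = to_unicode_hex("A")              # one 16-bit unit
astral_value = to_unicode_hex("\U0001F600")  # surrogate pair, two units
```

Here `bmp_value` is `"0041"` and `astral_value` is `"D83DDE00"`; a fixed two-byte UCS-2 value could not encode the second character.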

These characters are outside the Basic Multilingual Plane (BMP), so
they should exercise subset splitting for type 42 output in PDF.