pdf: Improve text with characters outside embedded font limits #30512

QuLogic · 2025-09-04T05:50:08Z

PR summary

For character codes outside the embedded font limits (256 for type 3 and 65536 for type 42), we output them as XObjects instead of using text commands. But there is nothing in the PDF spec that requires any specific encoding like this.

Since we now support subsetting all fonts before embedding, split each font into groups based on the maximum character code (e.g., 256-entry groups for type 3), then switch text strings to a different font subset and re-map character codes to it when necessary.
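The grouping described above can be sketched roughly as follows. This is only an illustration of block-by-block splitting, not the code in `CharacterTracker`; the function name `split_into_subsets` and its return shape are assumptions made for this example.

```python
from collections import defaultdict


def split_into_subsets(charcodes, subset_size):
    """Group character codes into blocks of ``subset_size`` entries.

    Hypothetical helper: returns {block index: {charcode: code within block}}.
    """
    subsets = defaultdict(dict)
    for code in sorted(set(charcodes)):
        block, local = divmod(code, subset_size)
        subsets[block][code] = local
    return dict(subsets)


text = "A\u00e9\u0100\U0001F600"  # A, é, Ā, 😀
type3 = split_into_subsets(map(ord, text), 256)      # Type 3 limit
type42 = split_into_subsets(map(ord, text), 65536)   # Type 42 limit
print(type3)
print(type42)
```

With a 256-entry limit the four characters land in three subsets (blocks 0, 1, and 502), while a 65536-entry limit needs only two; in both cases the emoji is re-mapped to a small code inside its own subset.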

This means all text is true text (albeit with some strange encoding), and we no longer need any XObjects for glyphs. For users of non-English text, this means it will become selectable and copyable again.

There are 3 steps to achieve this change:

1. Track both character codes and glyphs in CharacterTracker. This class takes care of splitting characters into subsets that fit the desired PDF font type limits.
2. Output each used font block as a separate subsetted font. Also change the subset prefix to use the glyph indices, which are unique, unlike the character codes.
3. Generate a ToUnicode dictionary for the subset font. We already did this for type 42 fonts, but the implementation was incorrect as it didn't handle non-BMP characters. For type 3, ToUnicode support was added in PDF 1.2 and we produce 1.4, so it is available to us; the existing fallback to glyph names is inconsistent and probably depends on the original font having the right names.
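To illustrate why non-BMP characters matter in step 3, here is a small sketch of producing `bfchar`-style entries with UTF-16BE targets. The helper names `to_unicode_hex` and `bfchar_lines` are hypothetical and are not taken from the PDF backend; only the use of UTF-16BE comes from the PR.

```python
def to_unicode_hex(text):
    """Hex-encode text as UTF-16BE; non-BMP characters become surrogate pairs."""
    return text.encode("utf-16-be").hex().upper()


def bfchar_lines(subset, code_bytes):
    """Format hypothetical ToUnicode entries.

    ``subset`` maps a code within the font subset to the string it represents;
    ``code_bytes`` is the width of a character code (1 here, as an assumption
    for a 256-entry font).
    """
    width = 2 * code_bytes
    return [f"<{code:0{width}X}> <{to_unicode_hex(s)}>"
            for code, s in sorted(subset.items())]


lines = bfchar_lines({0x41: "A", 0x00: "\U0001F600"}, code_bytes=1)
print("\n".join(lines))
```

A UCS-2 style encoding has no way to represent U+1F600 in a single 16-bit unit, whereas UTF-16BE writes it as the surrogate pair `D83D DE00`.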

In the future, we may wish to extend the implementation in CharacterTracker to "compress" the character map it produces (i.e., if you use 255 characters, each from a different 256-sized block, with type 3, you get 255 fonts, but we could compress that to a single font). I tried to avoid hard-coding any assumptions that the mapping is block-by-block, but it is possible that something slipped through, so I do not want to spend too much time on that right now.
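One possible shape for such compression, purely as a sketch of the idea and not something this PR implements, is to assign codes sequentially in order of use rather than by block. The name `compress_charmap` is an assumption for this example.

```python
def compress_charmap(charcodes, subset_size=256):
    """Assign each distinct character code a (subset, local code) pair sequentially.

    Hypothetical alternative to block-by-block splitting: subsets are filled
    one after another, so sparse input does not create sparse fonts.
    """
    mapping = {}
    for i, code in enumerate(sorted(set(charcodes))):
        mapping[code] = divmod(i, subset_size)
    return mapping


# 255 characters, each taken from a different 256-sized block.
sparse = [block * 256 for block in range(255)]
mapping = compress_charmap(sparse)
subsets_used = {subset for subset, _ in mapping.values()}
print(len(subsets_used))
```

Under block-by-block splitting the same input would need 255 subsets; here all 255 characters fit in one.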

Formerly, with multi_font_type3.pdf (after adding the emoji to the test), copying the text in evince would produce:

There are basic characters
ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz
0123456789 !”#$%&’()*+,-./:;¡=¿?@[“]ˆ˙‘—–˝˜
and accented characters
ÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
àáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
in between!

and with multi_font_type42.pdf:

There are basic characters
ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz
0123456789 !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
and accented characters
ÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
àáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğ
ĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿ
ŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞş
ŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſ
ƀƁƂƃƄƅƆƇƈƉƊƋƌƍƎƏƐƑƒƓƔƕƖƗƘƙƚƛƜƝƞƟ
ƠơƢƣƤƥƦƧƨƩƪƫƬƭƮƯưƱƲƳƴƵƶƷƸƹƺƻƼƽƾƿ
ǀǁǂǃǄǅǆǇǈǉǊǋǌǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǝǞǟ
ǠǡǢǣǤǥǦǧǨǩǪǫǬǭǮǯǰǱǲǳǴǵǶǷǸǹǺǻǼǽǾǿ
ȀȁȂȃȄȅȆȇȈȉȊȋȌȍȎȏȐȑȒȓȔȕȖȗȘșȚțȜȝȞȟ
ȠȡȢȣȤȥȦȧȨȩȪȫȬȭȮȯȰȱȲȳȴȵȶȷȸȹȺȻȼȽȾȿ
ɀɁɂɃɄɅɆɇɈɉɊɋɌɍɎɏ
in between!

and now we get for both type 3 and 42:

There are basic characters
ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz
0123456789 !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
and accented characters
ÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
àáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğ
ĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿ
ŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞş
ŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſ
ƀƁƂƃƄƅƆƇƈƉƊƋƌƍƎƏƐƑƒƓƔƕƖƗƘƙƚƛƜƝƞƟ
ƠơƢƣƤƥƦƧƨƩƪƫƬƭƮƯưƱƲƳƴƵƶƷƸƹƺƻƼƽƾƿ
ǀǁǂǃǄǅǆǇǈǉǊǋǌǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǝǞǟ
ǠǡǢǣǤǥǦǧǨǩǪǫǬǭǮǯǰǱǲǳǴǵǶǷǸǹǺǻǼǽǾǿ
ȀȁȂȃȄȅȆȇȈȉȊȋȌȍȎȏȐȑȒȓȔȕȖȗȘșȚțȜȝȞȟ
ȠȡȢȣȤȥȦȧȨȩȪȫȬȭȮȯȰȱȲȳȴȵȶȷȸȹȺȻȼȽȾȿ
ɀɁɂɃɄɅɆɇɈɉɊɋɌɍɎɏ😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏
in between!

Note how in the third line for type 3:

- the quotes are 'curly' instead of straight quotes
- the chevrons <> are inverted exclamation/question marks ¡¿
- the backslash \ is a curly opening double quote “
- the caret ^, underscore _, and tilde ~ are (circumflex, dot, tilde) accents/smaller glyphs ˆ˙˜
- the braces {} are em-dash and curly quotes —˝
- the pipe | is en-dash –

Everything from the seventh to second-last line is missing in type 3 since it's outside of the 256 limit, and all the emoji are missing from type 42 since that's outside the 65536 limit.

This depends on #30335.

PR checklist

"closes #0000" is in the body of the PR description to link the related issue
new and changed code is tested
[n/a] Plotting related features are demonstrated in an example
New Features and API Changes are noted with a directive and release note
Documentation complies with general and docstring guidelines

With libraqm, string layout produces glyph indices, not character codes, and font features may even produce different glyphs for the same character code (e.g., by picking a different Stylistic Set). Thus we cannot rely on character codes as unique items within a font, and must move toward glyph indices everywhere.

Currently, we split text into single byte chunks and multi-byte glyphs, then iterate through single byte chunks for output and multi-byte glyphs for output. Instead, output the single byte chunks as we finish them, then do the multi-byte glyphs at the end.

For a Type 3 font, its encoding is entirely defined by its `Encoding` dictionary (which we create), so there's no reason to use a specific encoding like `cp1252`. Instead, switch to Latin-1, which corresponds exactly to the first 256 character codes in Unicode, and can be mapped directly with `ord`.
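The difference between the two encodings can be checked with a few lines of standard-library Python. The helper `type3_code` is a hypothetical name used only for this illustration.

```python
def type3_code(char):
    """Return the Latin-1 character code via ord(), or None if out of range."""
    code = ord(char)
    return code if code < 256 else None


# Latin-1 bytes decode to the code point with the same value, for all 256 codes.
latin1_is_identity = all(
    bytes([code]).decode("latin-1") == chr(code) for code in range(256)
)

# cp1252 does not have that property: byte 0x80 is the euro sign U+20AC.
cp1252_0x80 = bytes([0x80]).decode("cp1252")

print(latin1_is_identity, hex(ord(cp1252_0x80)), type3_code("\u00e9"))
```

Because the Latin-1 mapping is the identity on the first 256 code points, no encoding table is needed to go from a character to its code.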

For character codes outside the embedded font limits (256 for type 3 and 65536 for type 42), we output them as XObjects instead of using text commands. But there is nothing in the PDF spec that requires any specific encoding like this. Since we now support subsetting all fonts before embedding, split each font into groups based on the maximum character code (e.g., 256-entry groups for type 3), then switch text strings to a different font subset and re-map character codes to it when necessary. This means all text is true text (albeit with some strange encoding), and we no longer need any XObjects for glyphs. For users of non-English text, this means it will become selectable and copyable again. Fixes matplotlib#21797

anntzer · 2025-09-04T09:16:30Z

This is great and would also allow getting rid of _get_pdf_charprocs. I'll try to have a look at #30335 to start...

anntzer · 2025-09-05T09:01:13Z

The first two commits (the loop merge and the Type3 encoding change) seem independent from the rest (even from the switch to glyph index tracking) and could be merged first via a separate PR? (I can probably approve them right away.)
I still need to properly review the next one (charmap tracking) but that can also come next by itself?

QuLogic · 2025-09-06T00:04:07Z

I split the type3 encoding to #30520, but the loop merge has conflicts with the glyph index change.

QuLogic added 7 commits September 3, 2025 05:06

pdf/ps: Track full character map in CharacterTracker

dbd689f

By tracking both character codes and glyph indices, we can handle producing multiple font subsets if needed by a file format.

pdf: Correct Unicode mapping for out-of-range font chunks

ab8981f

For Type 3 fonts, add a `ToUnicode` mapping (which was added in PDF 1.2), and for Type 42 fonts, correct the Unicode encoding, which should be UTF-16BE, not UCS2.

Add emoji to multi-font text

72deb44

These characters are outside the BMP and should test subset splitting for type 42 output in PDF.

QuLogic added this to the v3.11.0 milestone Sep 4, 2025

QuLogic added this to Font and text overhaul Sep 4, 2025

QuLogic added the status: waiting for other PR label Sep 4, 2025

github-project-automation bot moved this to Waiting for other PR in Font and text overhaul Sep 4, 2025

github-actions bot added topic: text backend: ps backend: pdf backend: svg backend: cairo topic: text/mathtext labels Sep 4, 2025

Update test images for previous change

3fc92f4

QuLogic force-pushed the pdf-text-subsets branch from 7ffffb5 to 3fc92f4 Compare September 4, 2025 06:06

QuLogic mentioned this pull request Sep 4, 2025

TST: Remove redundant font tests #30513

Draft

1 task

QuLogic mentioned this pull request Sep 4, 2025

Use glyph indices for font tracking in vector formats #30335

Open

1 task

QuLogic mentioned this pull request Sep 6, 2025

pdf: Simplify Type 3 font character encoding #30520

Merged

1 task

github-actions bot added the status: needs rebase label Sep 8, 2025
