FIX: always decode byte strings from AFM files as utf8 #9198

tacaswell · 2017-09-18T19:01:44Z

closes #9196

This fixes a long-standing 'cannot import' bug.

PR Summary

PR Checklist

Has Pytest style unit tests
Code is PEP 8 compliant

dopplershift

LGTM

anntzer · 2017-09-18T20:19:43Z

Actually, can someone point to an actual example of a file that fails? http://www.adobe.com/content/dam/Adobe/en/devnet/font/pdfs/5004.AFM_Spec.pdf suggests that everything should be ASCII compatible (page 10). I'm just curious whether it's some afm file that doesn't follow the spec, or if I'm misreading the spec...

tacaswell · 2017-09-18T20:51:43Z

The OP in #9196 has such a font. From digging into this, in the headers it tends to be comments and names that are non-ASCII. I'm not sure what in the metrics could be non-ASCII, though...

tacaswell · 2017-09-18T20:53:50Z

It may also be the comment lines, which are not explicitly constrained to ASCII.
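To make the failure mode concrete, here is a minimal sketch of what happens when a header line contains a non-ASCII byte. The header line and the helper `decode_header_line` are hypothetical; the actual font from #9196 is not shown in this thread.

```python
# Hypothetical AFM comment line containing a copyright sign (U+00A9),
# encoded as UTF-8. This is an assumed example, not the font from #9196.
raw_line = 'Comment Copyright \u00a9 Example Foundry\n'.encode('utf-8')


def decode_header_line(raw, encoding):
    """Strip trailing whitespace and decode one raw header line."""
    return raw.rstrip().decode(encoding)


# Decoding as ASCII raises, which is the kind of error that can surface
# at import time if fonts are parsed eagerly.
try:
    decode_header_line(raw_line, 'ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding as UTF-8 accepts the same bytes.
utf8_text = decode_header_line(raw_line, 'utf8')
```

Because ASCII is a subset of UTF-8, spec-compliant files decode identically under either codec; only the non-compliant ones behave differently.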

anntzer · 2017-09-18T21:46:29Z

Would

diff --git a/lib/matplotlib/afm.py b/lib/matplotlib/afm.py
index 30d335879..3a06c4618 100644
--- a/lib/matplotlib/afm.py
+++ b/lib/matplotlib/afm.py
@@ -189,23 +189,22 @@ def _parse_char_metrics(fh):
     ascii_d = {}
     name_d = {}
     for line in fh:
-        line = line.rstrip().decode('ascii')  # Convert from byte-literal
-        if line.startswith('EndCharMetrics'):
+        line = line.rstrip()
+        if line.startswith(b'EndCharMetrics'):
             return ascii_d, name_d
         # Split the metric line into a dictionary, keyed by metric identifiers
-        vals = dict(s.strip().split(' ', 1) for s in line.split(';') if s)
+        vals = dict(s.strip().split(b' ', 1) for s in line.split(b';') if s)
         # There may be other metrics present, but only these are needed
-        if not {'C', 'WX', 'N', 'B'}.issubset(vals):
+        if not {b'C', b'WX', b'N', b'B'}.issubset(vals):
             raise RuntimeError('Bad char metrics line: %s' % line)
-        num = _to_int(vals['C'])
-        wx = _to_float(vals['WX'])
-        name = vals['N']
-        bbox = _to_list_of_floats(vals['B'])
-        bbox = list(map(int, bbox))
+        num = _to_int(vals[b'C'])
+        wx = _to_float(vals[b'WX'])
+        name = vals[b'N'].decode('ascii')
+        bbox = _to_list_of_ints(vals[b'B'])
         # Workaround: If the character name is 'Euro', give it the
         # corresponding character code, according to WinAnsiEncoding (see PDF
         # Reference).
         if name == 'Euro':
             num = 128
         if num != -1:
             ascii_d[num] = (wx, name, bbox)

(i.e. use bytes throughout, just decode at the very end) not be better?
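As a rough, standalone illustration of the "bytes throughout, decode at the very end" idea, the sketch below parses a single char-metrics line. The function name `parse_char_metric_line` and the sample line are assumptions made for this example; it does not use matplotlib's own `_to_int` / `_to_float` helpers and is not the code that was merged.

```python
def parse_char_metric_line(line):
    """Parse one raw (bytes) char-metrics line.

    Everything stays as bytes until the glyph name, which is the only
    field decoded to text.
    """
    line = line.rstrip()
    # Split the metric line into a dict keyed by metric identifiers (bytes).
    vals = dict(s.strip().split(b' ', 1) for s in line.split(b';') if s.strip())
    # There may be other metrics present, but only these are needed.
    if not {b'C', b'WX', b'N', b'B'}.issubset(vals):
        raise RuntimeError('Bad char metrics line: %r' % line)
    num = int(vals[b'C'])          # int() accepts ASCII digits as bytes
    wx = float(vals[b'WX'])        # so does float()
    name = vals[b'N'].decode('ascii')
    bbox = [int(x) for x in vals[b'B'].split()]
    # name is a str at this point, so compare against a str literal.
    if name == 'Euro':
        num = 128
    return num, wx, name, bbox


parsed = parse_char_metric_line(b'C 32 ; WX 278 ; N space ; B 0 0 0 0 ;\n')
```

With this approach a non-ASCII byte in a glyph name would still raise at the final `decode('ascii')`, but non-ASCII bytes in fields that are never decoded would pass through unnoticed.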

tacaswell · 2017-09-19T14:01:44Z

Until we know what the offending font looks like, I am inclined to be more defensive.

In either case, we should not fail to import on a non-compliant font.
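One generic way to express the "do not fail on a non-compliant font" principle is to isolate each font's parsing so that a bad file is skipped rather than propagating an exception. This is only a sketch under assumed names (`load_all`, `strict_ascii_parse`, and the file names are hypothetical); the PR itself addresses the problem by changing the decoding, not by adding a wrapper like this.

```python
import logging

_log = logging.getLogger(__name__)


def load_all(raw_fonts, parse):
    """Parse every font in raw_fonts, skipping those that fail.

    raw_fonts maps a (hypothetical) file name to the raw bytes of the file;
    parse is any callable that raises on malformed input.
    """
    loaded = {}
    for fname, data in raw_fonts.items():
        try:
            loaded[fname] = parse(data)
        except (UnicodeDecodeError, RuntimeError) as exc:
            # One bad font should cost us that font, not the whole import.
            _log.warning('Skipping font %s: %s', fname, exc)
    return loaded


def strict_ascii_parse(data):
    """Stand-in parser that rejects any non-ASCII byte."""
    return data.decode('ascii')


fonts = {
    'good.afm': b'FontName Good',
    'bad.afm': 'Comment \u00a9 Example Foundry'.encode('utf-8'),
}
result = load_all(fonts, strict_ascii_parse)
```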

anntzer · 2017-09-19T15:07:37Z

Please add a comment to the code explaining why we are seemingly not following the spec, then. Otherwise lgtm.

tacaswell · 2017-09-22T01:57:03Z

@anntzer done and force-pushed

tacaswell · 2017-09-24T16:46:39Z

Self-merging this as it has 2 approvals.

tacaswell added the "Release critical" label (for bugs that make the library unusable, e.g. segfaults or incorrect plots, and major regressions) Sep 18, 2017

tacaswell added this to the 2.1 (next point release) milestone Sep 18, 2017

dopplershift approved these changes Sep 18, 2017

FIX: always decode byte strings from AFM files as utf8

e3e78be

closes matplotlib#9196

tacaswell force-pushed the fix_afm_unicode branch from 7758b08 to e3e78be on September 22, 2017 01:57

anntzer approved these changes Sep 22, 2017

tacaswell merged commit cece2d4 into matplotlib:v2.1.x Sep 24, 2017

tacaswell deleted the fix_afm_unicode branch September 24, 2017 16:47
