|
5 | 5 | ----------
|
6 | 6 | Problem
|
7 | 7 | ----------
|
8 |
| -You are converting strings back and forth between C and Python, but the C encoding |
9 |
| -is of a dubious or unknown nature. For example, perhaps the C data is supposed to be |
10 |
| -UTF-8, but it’s not being strictly enforced. You would like to write code that can handle |
11 |
| -malformed data in a graceful way that doesn’t crash Python or destroy the string data |
12 |
| -in the process. |
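| 8 | +You need to convert strings back and forth between C and Python, but the encoding used on the C side is uncertain. |
| 9 | +For example, the C data may be expected to be UTF-8, but nothing strictly enforces that. |
| 10 | +You want to write code that handles such malformed data gracefully, without crashing Python or destroying the string data in the process. |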
13 | 11 |
|
14 | 12 | |
|
15 | 13 |
|
16 | 14 | ----------
|
17 | 15 | Solution
|
18 | 16 | ----------
|
19 |
| -Here is some C data and a function that illustrates the nature of this problem: |
20 |
| - |
21 |
| -/* Some dubious string data (malformed UTF-8) */ |
22 |
| -const char *sdata = "Spicy Jalape\xc3\xb1o\xae"; |
23 |
| -int slen = 16; |
24 |
| -
|
25 |
| -/* Output character data */ |
26 |
| -void print_chars(char *s, int len) { |
27 |
| - int n = 0; |
28 |
| - while (n < len) { |
29 |
| - printf("%2x ", (unsigned char) s[n]); |
30 |
| - n++; |
31 |
| - } |
32 |
| - printf("\n"); |
33 |
| -} |
34 |
| -
|
35 |
| -In this code, the string sdata contains a mix of UTF-8 and malformed data. Nevertheless, |
36 |
| -if a user calls print_chars(sdata, slen) in C, it works fine. |
37 |
| -Now suppose you want to convert the contents of sdata into a Python string. Further |
38 |
| -suppose you want to later pass that string to the print_chars() function through an |
39 |
| -extension. Here’s how to do it in a way that exactly preserves the original data even |
40 |
| -though there are encoding problems: |
41 |
| - |
42 |
| -/* Return the C string back to Python */ |
43 |
| -static PyObject *py_retstr(PyObject *self, PyObject *args) { |
44 |
| - if (!PyArg_ParseTuple(args, "")) { |
45 |
| - return NULL; |
46 |
| - } |
47 |
| - return PyUnicode_Decode(sdata, slen, "utf-8", "surrogateescape"); |
48 |
| -} |
49 |
| -
|
50 |
| -/* Wrapper for the print_chars() function */ |
51 |
| -static PyObject *py_print_chars(PyObject *self, PyObject *args) { |
52 |
| - PyObject *obj, *bytes; |
53 |
| - char *s = 0; |
54 |
| - Py_ssize_t len; |
55 |
| -
|
56 |
| - if (!PyArg_ParseTuple(args, "U", &obj)) { |
57 |
| - return NULL; |
58 |
| - } |
59 |
| - |
60 |
| - if ((bytes = PyUnicode_AsEncodedString(obj,"utf-8","surrogateescape")) |
61 |
| - == NULL) { |
62 |
| - return NULL; |
63 |
| - } |
64 |
| - PyBytes_AsStringAndSize(bytes, &s, &len); |
65 |
| - print_chars(s, len); |
66 |
| - Py_DECREF(bytes); |
67 |
| - Py_RETURN_NONE; |
68 |
| -} |
69 |
| - |
70 |
| -If you try these functions from Python, here’s what happens: |
71 |
| - |
72 |
| ->>> s = retstr() |
73 |
| ->>> s |
74 |
| -'Spicy Jalapeño\udcae' |
75 |
| ->>> print_chars(s) |
76 |
| -53 70 69 63 79 20 4a 61 6c 61 70 65 c3 b1 6f ae |
77 |
| ->>> |
78 |
| - |
79 |
| -Careful observation will reveal that the malformed string got encoded into a Python |
80 |
| -string without errors, and that when passed back into C, it turned back into a byte string |
81 |
| -that exactly encoded the same bytes as the original C string. |
| 17 | +Here is some C data and a function that illustrate the problem: |
| 18 | + |
| 19 | +:: |
| 20 | + |
| 21 | + /* Some dubious string data (malformed UTF-8) */ |
| 22 | + const char *sdata = "Spicy Jalape\xc3\xb1o\xae"; |
| 23 | + int slen = 16; |
| 24 | + |
| 25 | + /* Output character data */ |
| 26 | + void print_chars(char *s, int len) { |
| 27 | + int n = 0; |
| 28 | + while (n < len) { |
| 29 | + printf("%2x ", (unsigned char) s[n]); |
| 30 | + n++; |
| 31 | + } |
| 32 | + printf("\n"); |
| 33 | + } |
| 34 | + |
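| 35 | +In this code, the string ``sdata`` contains a mix of valid UTF-8 and malformed data. |
| 36 | +Nevertheless, if a user calls ``print_chars(sdata, slen)`` in C, it works fine. |
| 37 | +Now suppose you want to convert the contents of ``sdata`` into a Python string. |
| 38 | +Further suppose you later want to pass that string to the ``print_chars()`` function through an extension. |
| 39 | +Here is how to do it in a way that exactly preserves the original data even though there are encoding problems: |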
| 40 | + |
| 41 | +:: |
| 42 | + |
| 43 | + /* Return the C string back to Python */ |
| 44 | + static PyObject *py_retstr(PyObject *self, PyObject *args) { |
| 45 | + if (!PyArg_ParseTuple(args, "")) { |
| 46 | + return NULL; |
| 47 | + } |
| 48 | + return PyUnicode_Decode(sdata, slen, "utf-8", "surrogateescape"); |
| 49 | + } |
| 50 | + |
| 51 | + /* Wrapper for the print_chars() function */ |
| 52 | + static PyObject *py_print_chars(PyObject *self, PyObject *args) { |
| 53 | + PyObject *obj, *bytes; |
| 54 | + char *s = 0; |
| 55 | + Py_ssize_t len; |
| 56 | + |
| 57 | + if (!PyArg_ParseTuple(args, "U", &obj)) { |
| 58 | + return NULL; |
| 59 | + } |
| 60 | + |
| 61 | + if ((bytes = PyUnicode_AsEncodedString(obj,"utf-8","surrogateescape")) |
| 62 | + == NULL) { |
| 63 | + return NULL; |
| 64 | + } |
| 65 | + PyBytes_AsStringAndSize(bytes, &s, &len); |
| 66 | + print_chars(s, len); |
| 67 | + Py_DECREF(bytes); |
| 68 | + Py_RETURN_NONE; |
| 69 | + } |
| 70 | + |
| 71 | +If you try these functions from Python, here is what happens: |
| 72 | + |
| 73 | +:: |
| 74 | + |
| 75 | + >>> s = retstr() |
| 76 | + >>> s |
| 77 | + 'Spicy Jalapeño\udcae' |
| 78 | + >>> print_chars(s) |
| 79 | + 53 70 69 63 79 20 4a 61 6c 61 70 65 c3 b1 6f ae |
| 80 | + >>> |
| 81 | + |
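| 82 | +Careful observation reveals that the malformed string was decoded into a Python string without errors, |
| 83 | +and that when it was passed back into C, it was encoded into exactly the same bytes as the original C string. |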
82 | 84 |
|
83 | 85 | |
|
84 | 86 |
|
85 | 87 | ----------
|
86 | 88 | Discussion
|
87 | 89 | ----------
|
88 |
| -This recipe addresses a subtle, but potentially annoying problem with string handling |
89 |
| -in extension modules. Namely, the fact that C strings in extensions might not follow the |
90 |
| -strict Unicode encoding/decoding rules that Python normally expects. Thus, it’s possible |
91 |
| -that some malformed C data would pass to Python. A good example might be C strings |
92 |
| -associated with low-level system calls such as filenames. For instance, what happens if |
93 |
| -a system call returns a broken string back to the interpreter that can’t be properly |
94 |
| -decoded? |
95 |
| - |
96 |
| -Normally, Unicode errors are often handled by specifying some sort of error policy, such |
97 |
| -as strict, ignore, replace, or something similar. However, a downside of these policies |
98 |
| -is that they irreparably destroy the original string content. For example, if the malformed |
99 |
| -data in the example was decoded using one of these policies, you would get results such |
100 |
| -as this: |
101 |
| - |
102 |
| ->>> raw = b'Spicy Jalape\xc3\xb1o\xae' |
103 |
| ->>> raw.decode('utf-8','ignore') |
104 |
| -'Spicy Jalapeño' |
105 |
| ->>> raw.decode('utf-8','replace') |
106 |
| -'Spicy Jalapeño?' |
107 |
| ->>> |
108 |
| - |
109 |
| -The surrogateescape error handling policy takes all nondecodable bytes and turns |
110 |
| -them into the low-half of a surrogate pair (\udcXX where XX is the raw byte value). For |
111 |
| -example: |
112 |
| - |
113 |
| ->>> raw.decode('utf-8','surrogateescape') |
114 |
| -'Spicy Jalapeño\udcae' |
115 |
| ->>> |
116 |
| - |
117 |
| -Isolated low surrogate characters such as \udcae never appear in valid Unicode. Thus, |
118 |
| -this string is technically an illegal representation. In fact, if you ever try to pass it to |
119 |
| -functions that perform output, you’ll get encoding errors: |
120 |
| - |
121 |
| ->>> s = raw.decode('utf-8', 'surrogateescape') |
122 |
| ->>> print(s) |
123 |
| -Traceback (most recent call last): |
124 |
| - File "<stdin>", line 1, in <module> |
125 |
| -UnicodeEncodeError: 'utf-8' codec can't encode character '\udcae' |
126 |
| -in position 14: surrogates not allowed |
127 |
| ->>> |
128 |
| - |
129 |
| -However, the main point of allowing the surrogate escapes is to allow malformed strings |
130 |
| -to pass from C to Python and back into C without any data loss. When the string is |
131 |
| -encoded using surrogateescape again, the surrogate characters are turned back into |
132 |
| -their original bytes. For example: |
133 |
| - |
134 |
| ->>> s |
135 |
| -'Spicy Jalapeño\udcae' |
136 |
| ->>> s.encode('utf-8','surrogateescape') |
137 |
| -b'Spicy Jalape\xc3\xb1o\xae' |
138 |
| ->>> |
139 |
| - |
140 |
| -As a general rule, it’s probably best to avoid surrogate encoding whenever possible— |
141 |
| -your code will be much more reliable if it uses proper encodings. However, sometimes |
142 |
| -there are situations where you simply don’t have control over the data encoding and |
143 |
| -you aren’t free to ignore or replace the bad data because other functions may need to |
144 |
| -use it. This recipe shows how to do it. |
145 |
| - |
146 |
| -As a final note, many of Python’s system-oriented functions, especially those related to |
147 |
| -filenames, environment variables, and command-line options, use surrogate encoding. |
148 |
| -For example, if you use a function such as os.listdir() on a directory containing an |
149 |
| -undecodable filename, it will be returned as a string with surrogate escapes. See |
150 |
| -Recipe 5.15 for a related recipe. |
151 |
| -PEP 383 has more information about the problem addressed by this recipe and |
152 |
| -surrogateescape error handling. |
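| 90 | +This recipe addresses a subtle but potentially annoying problem with string handling in extension modules. |
| 91 | +Namely, C strings in extensions might not follow the strict Unicode encoding/decoding rules that Python normally expects. |
| 92 | +Thus, it is possible for malformed C data to be passed to Python. |
| 93 | +A good example is C strings associated with low-level system calls, such as filenames. |
| 94 | +For instance, what happens if a system call returns a broken string to the interpreter that cannot be properly decoded? |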
| 95 | + |
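| 96 | +Normally, Unicode errors are handled by specifying an error policy such as ``strict``, ``ignore``, ``replace``, or something similar. |
| 97 | +However, a downside of lossy policies such as ``ignore`` and ``replace`` is that they irreparably destroy the original string content. |
| 98 | +For example, if the malformed data in the example were decoded using one of these policies, you would get results such as this: |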
| 99 | + |
| 100 | +:: |
| 101 | + |
| 102 | + >>> raw = b'Spicy Jalape\xc3\xb1o\xae' |
| 103 | + >>> raw.decode('utf-8','ignore') |
| 104 | + 'Spicy Jalapeño' |
| 105 | + >>> raw.decode('utf-8','replace') |
| 106 | + 'Spicy Jalapeño?' |
| 107 | + >>> |
| 108 | + |
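| 109 | +The ``surrogateescape`` error handling policy takes every undecodable byte and turns it into the low half of a surrogate pair (``\udcXX``, where ``XX`` is the raw byte value). |
| 110 | +For example: |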
| 111 | + |
| 112 | +:: |
| 113 | + |
| 114 | + >>> raw.decode('utf-8','surrogateescape') |
| 115 | + 'Spicy Jalapeño\udcae' |
| 116 | + >>> |
| 117 | + |
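To make the ``\udcXX`` mapping concrete, here is a minimal Python sketch that locates surrogate-escaped characters in a decoded string and recovers the raw byte each one stands for. The helper name ``find_escaped_bytes`` is hypothetical and not part of the recipe; it only relies on the rule stated above, that an undecodable byte ``0xXX`` becomes the code point ``U+DCXX``.

```python
def find_escaped_bytes(text):
    """Return (index, original_byte) for each surrogate-escaped character."""
    found = []
    for index, ch in enumerate(text):
        code = ord(ch)
        # surrogateescape maps an undecodable byte 0xXX (0x80-0xFF) to U+DCXX
        if 0xDC80 <= code <= 0xDCFF:
            found.append((index, code - 0xDC00))
    return found


raw = b'Spicy Jalape\xc3\xb1o\xae'
s = raw.decode('utf-8', 'surrogateescape')
escaped = find_escaped_bytes(s)   # one entry: the stray 0xae byte at index 14
```

A check like this can be handy when logging or debugging, since it shows exactly which positions in a string came from bytes that were not valid UTF-8.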
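| 118 | +Isolated low surrogate characters such as ``\udcae`` never appear in valid Unicode. |
| 119 | +Thus, this string is technically an illegal representation. |
| 120 | +In fact, if you ever try to pass it to functions that perform output, you will get encoding errors: |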
| 121 | + |
| 122 | +:: |
| 123 | + |
| 124 | + >>> s = raw.decode('utf-8', 'surrogateescape') |
| 125 | + >>> print(s) |
| 126 | + Traceback (most recent call last): |
| 127 | + File "<stdin>", line 1, in <module> |
| 128 | + UnicodeEncodeError: 'utf-8' codec can't encode character '\udcae' |
| 129 | + in position 14: surrogates not allowed |
| 130 | + >>> |
| 131 | + |
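If such a string only needs to be shown to a person (in a log message, for instance), one possible workaround is to encode it with the ``backslashreplace`` error handler, which writes the surrogate as a visible escape sequence instead of raising. This is a sketch for display purposes only, and the function name ``safe_display`` is an assumption made for illustration; the result is not suitable for passing back to C, because the original byte is no longer recoverable from it by ``surrogateescape``.

```python
def safe_display(text):
    """Return a printable copy of text, showing lone surrogates as \\udcXX."""
    # Lone surrogates cannot be encoded as UTF-8, so the error handler
    # replaces each one with its backslash escape sequence.
    return text.encode('utf-8', 'backslashreplace').decode('utf-8')


raw = b'Spicy Jalape\xc3\xb1o\xae'
s = raw.decode('utf-8', 'surrogateescape')
shown = safe_display(s)
print(shown)
```

Keep the original string ``s`` around for any later round trip; ``shown`` is meant purely for human eyes.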
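| 132 | +However, the main point of allowing surrogate escapes is to let malformed strings pass from C to Python and back into C without any data loss. |
| 133 | +When the string is encoded using ``surrogateescape`` again, the surrogate characters are turned back into their original bytes. For example: |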
| 134 | + |
| 135 | +:: |
| 136 | + |
| 137 | + >>> s |
| 138 | + 'Spicy Jalapeño\udcae' |
| 139 | + >>> s.encode('utf-8','surrogateescape') |
| 140 | + b'Spicy Jalape\xc3\xb1o\xae' |
| 141 | + >>> |
| 142 | + |
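| 143 | +As a general rule, it is probably best to avoid surrogate encoding whenever possible; your code will be much more reliable if it uses proper encodings. |
| 144 | +However, there are situations where you simply do not control the data encoding and you are not free to ignore or replace the bad data because other functions may need to use it. |
| 145 | +This recipe shows how to handle such cases. |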
| 146 | + |
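| 147 | +As a final note, many of Python's system-oriented functions, especially those related to filenames, environment variables, and command-line options, |
| 148 | +use surrogate encoding. For example, if you use a function such as ``os.listdir()`` |
| 149 | +on a directory containing an undecodable filename, that name is returned as a string with surrogate escapes. |
| 150 | +See Recipe 5.15 for a related recipe. |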
| 151 | + |
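The filename case can be tried from pure Python with ``os.fsdecode()`` and ``os.fsencode()``. The sketch below assumes a POSIX system, where these functions use the filesystem encoding together with the ``surrogateescape`` handler; on Windows the error handler differs, so the same bytes may not decode. The function name ``roundtrip_filename`` and the sample name ``bad\xae.txt`` are made up for illustration.

```python
import os


def roundtrip_filename(raw_name):
    """Decode a raw filename to str and encode it back to bytes."""
    # Assumption: POSIX, where fsdecode/fsencode apply surrogateescape,
    # so undecodable bytes survive the trip through str unchanged.
    text = os.fsdecode(raw_name)
    return os.fsencode(text)


raw_name = b'bad\xae.txt'          # hypothetical name with a stray 0xae byte
restored = roundtrip_filename(raw_name)
```

Because the round trip is lossless, a name obtained from ``os.listdir()`` can be handed back to functions such as ``open()`` even when it cannot be printed.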
| 152 | +`PEP 383 <https://www.python.org/dev/peps/pep-0383/>`_ |
| 153 | +has more information about the problem addressed by this recipe and about ``surrogateescape`` error handling. |