
Commit 3159599

Perform conversion from Python unicode to string/bytes object via UTF-8.
We used to convert the unicode object directly to a string in the server encoding by calling Python's PyUnicode_AsEncodedString function. In other words, we used Python's routines to do the encoding. However, that has a few problems. First of all, it required keeping a mapping table of Python encoding names and PostgreSQL encodings. But the real killer was that Python doesn't support EUC_TW and MULE_INTERNAL encodings at all.

Instead, convert the Python unicode object to UTF-8, and use PostgreSQL's encoding conversion functions to convert from UTF-8 to server encoding. We were already doing the same in the other direction in PLyUnicode_FromString, so this is more consistent, too.

Note: This makes SQL_ASCII behave more leniently. We used to map SQL_ASCII to Python's 'ascii', which in Python means strict 7-bit ASCII only, so you got an error if the Python string contained anything but pure ASCII. You no longer get an error; you get the UTF-8 representation of the string instead.

Backpatch to 9.0, where these conversions were introduced.

Jan Urbański
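The behaviour change described in the note can be reproduced from the Python side. The following is a minimal sketch, not code from this commit: `old_style_encode` and `python_has_codec` are hypothetical helpers, and the sample string is an arbitrary choice. It assumes a stock CPython with no third-party codecs registered.

```python
import codecs


def old_style_encode(text, python_encoding):
    """Hypothetical stand-in for the old path: Python encodes directly, strictly."""
    return text.encode(python_encoding, "strict")


def python_has_codec(name):
    """Return True if Python can look up a codec registered under this name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


sample = "caf\u00e9"  # one non-ASCII character (U+00E9)

# Old behaviour: SQL_ASCII was mapped to Python's 'ascii', which is strict
# 7-bit ASCII, so any non-ASCII character raised an error.
try:
    old_style_encode(sample, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# New behaviour, first step: the string is always encoded to UTF-8. Per the
# commit message, under SQL_ASCII these UTF-8 bytes are what you get back.
utf8_bytes = sample.encode("utf-8")

print("strict ascii failed:", ascii_failed)
print("utf-8 bytes:", utf8_bytes)

# Encodings that PostgreSQL supports but Python cannot encode to directly.
for name in ("euc_tw", "mule_internal"):
    print(name, "known to Python:", python_has_codec(name))
```

Going through UTF-8 means Python only ever needs one codec that it always has, and the choice of target encoding is left entirely to PostgreSQL's conversion routines.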
1 parent c9c9520 commit 3159599

2 files changed: 44 additions and 108 deletions
src/pl/plpython/expected/plpython_unicode_3.out

Lines changed: 0 additions & 54 deletions
This file was deleted.

src/pl/plpython/plpython.c

Lines changed: 44 additions & 54 deletions
@@ -4869,66 +4869,56 @@ PLy_free(void *ptr)
 static PyObject *
 PLyUnicode_Bytes(PyObject *unicode)
 {
-    PyObject *rv;
-    const char *serverenc;
+    PyObject *bytes, *rv;
+    char *utf8string, *encoded;
+
+    /* First encode the Python unicode object with UTF-8. */
+    bytes = PyUnicode_AsUTF8String(unicode);
+    if (bytes == NULL)
+        PLy_elog(ERROR, "could not convert Python Unicode object to bytes");
+
+    utf8string = PyBytes_AsString(bytes);
+    if (utf8string == NULL) {
+        Py_DECREF(bytes);
+        PLy_elog(ERROR, "could not extract bytes from encoded string");
+    }

     /*
-     * Map PostgreSQL encoding to a Python encoding name.
+     * Then convert to server encoding if necessary.
+     *
+     * PyUnicode_AsEncodedString could be used to encode the object directly
+     * in the server encoding, but Python doesn't support all the encodings
+     * that PostgreSQL does (EUC_TW and MULE_INTERNAL). UTF-8 is used as an
+     * intermediary in PLyUnicode_FromString as well.
      */
-    switch (GetDatabaseEncoding())
+    if (GetDatabaseEncoding() != PG_UTF8)
     {
-        case PG_SQL_ASCII:
-            /*
-             * Mapping SQL_ASCII to Python's 'ascii' is a bit bogus. Python's
-             * 'ascii' means true 7-bit only ASCII, while PostgreSQL's
-             * SQL_ASCII means that anything is allowed, and the system doesn't
-             * try to interpret the bytes in any way. But not sure what else
-             * to do, and we haven't heard any complaints...
-             */
-            serverenc = "ascii";
-            break;
-        case PG_WIN1250:
-            serverenc = "cp1250";
-            break;
-        case PG_WIN1251:
-            serverenc = "cp1251";
-            break;
-        case PG_WIN1252:
-            serverenc = "cp1252";
-            break;
-        case PG_WIN1253:
-            serverenc = "cp1253";
-            break;
-        case PG_WIN1254:
-            serverenc = "cp1254";
-            break;
-        case PG_WIN1255:
-            serverenc = "cp1255";
-            break;
-        case PG_WIN1256:
-            serverenc = "cp1256";
-            break;
-        case PG_WIN1257:
-            serverenc = "cp1257";
-            break;
-        case PG_WIN1258:
-            serverenc = "cp1258";
-            break;
-        case PG_WIN866:
-            serverenc = "cp866";
-            break;
-        case PG_WIN874:
-            serverenc = "cp874";
-            break;
-        default:
-            /* Other encodings have the same name in Python. */
-            serverenc = GetDatabaseEncodingName();
-            break;
+        PG_TRY();
+        {
+            encoded = (char *) pg_do_encoding_conversion(
+                (unsigned char *) utf8string,
+                strlen(utf8string),
+                PG_UTF8,
+                GetDatabaseEncoding());
+        }
+        PG_CATCH();
+        {
+            Py_DECREF(bytes);
+            PG_RE_THROW();
+        }
+        PG_END_TRY();
     }
+    else
+        encoded = utf8string;

-    rv = PyUnicode_AsEncodedString(unicode, serverenc, "strict");
-    if (rv == NULL)
-        PLy_elog(ERROR, "could not convert Python Unicode object to PostgreSQL server encoding");
+    /* finally, build a bytes object in the server encoding */
+    rv = PyBytes_FromStringAndSize(encoded, strlen(encoded));
+
+    /* if pg_do_encoding_conversion allocated memory, free it now */
+    if (utf8string != encoded)
+        pfree(encoded);
+
+    Py_DECREF(bytes);
     return rv;
 }
