Commit 7fbe5aa

Perform conversion from Python unicode to string/bytes object via UTF-8.
We used to convert the unicode object directly to a string in the server encoding by calling Python's PyUnicode_AsEncodedString function. In other words, we used Python's routines to do the encoding. However, that has a few problems. First of all, it required keeping a mapping table of Python encoding names and PostgreSQL encodings. But the real killer was that Python doesn't support EUC_TW and MULE_INTERNAL encodings at all.

Instead, convert the Python unicode object to UTF-8, and use PostgreSQL's encoding conversion functions to convert from UTF-8 to server encoding. We were already doing the same in the other direction in PLyUnicode_FromString, so this is more consistent, too.

Note: This makes SQL_ASCII behave more leniently. We used to map SQL_ASCII to Python's 'ascii', which in Python means strict 7-bit ASCII only, so you got an error if the Python string contained anything but pure ASCII. You no longer get an error; you get the UTF-8 representation of the string instead.

Backpatch to 9.0, where these conversions were introduced.

Jan Urbański
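
As a rough illustration of the points in the commit message, the following Python sketch shows why relying on Python's codecs needed a name mapping and could not cover every server encoding, and how the SQL_ASCII behavior differs before and after the change. It is meant for a plain Python 3 interpreter, not PL/Python itself. The helper names `python_has_codec`, `old_sql_ascii_bytes`, and `new_sql_ascii_bytes` are hypothetical and only mimic the encoding step described above.

```python
import codecs


def python_has_codec(name):
    """Return True if this Python build can look up a codec by that name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


def old_sql_ascii_bytes(text):
    """Mimic the old behavior: SQL_ASCII was mapped to Python's strict 'ascii'."""
    return text.encode("ascii", "strict")


def new_sql_ascii_bytes(text):
    """Mimic the new behavior: the UTF-8 representation is passed through."""
    return text.encode("utf-8")


if __name__ == "__main__":
    # PostgreSQL's WIN1250 is spelled cp1250 in Python, hence the old mapping
    # table; EUC_TW and MULE_INTERNAL have no Python counterpart to map to.
    for name in ("cp1250", "euc_tw", "mule_internal"):
        print(name, python_has_codec(name))

    # Non-ASCII text: accepted by the new path, rejected by the old one.
    print(new_sql_ascii_bytes("caf\u00e9"))
    try:
        old_sql_ascii_bytes("caf\u00e9")
    except UnicodeEncodeError as exc:
        print("old behavior raised:", exc.reason)
```

Pure ASCII input produces identical bytes on both paths, so the change is visible only for strings that contain non-ASCII characters.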
1 parent bb49e35 commit 7fbe5aa

3 files changed, 44 additions and 158 deletions

src/pl/plpython/expected/plpython_unicode_2.out

Lines changed: 0 additions & 52 deletions
This file was deleted.

src/pl/plpython/expected/plpython_unicode_3.out

Lines changed: 0 additions & 52 deletions
This file was deleted.

src/pl/plpython/plpython.c

Lines changed: 44 additions & 54 deletions
@@ -3686,66 +3686,56 @@ PLy_free(void *ptr)
 static PyObject *
 PLyUnicode_Bytes(PyObject *unicode)
 {
-    PyObject *rv;
-    const char *serverenc;
+    PyObject *bytes, *rv;
+    char *utf8string, *encoded;
+
+    /* First encode the Python unicode object with UTF-8. */
+    bytes = PyUnicode_AsUTF8String(unicode);
+    if (bytes == NULL)
+        PLy_elog(ERROR, "could not convert Python Unicode object to bytes");
+
+    utf8string = PyBytes_AsString(bytes);
+    if (utf8string == NULL) {
+        Py_DECREF(bytes);
+        PLy_elog(ERROR, "could not extract bytes from encoded string");
+    }

     /*
-     * Map PostgreSQL encoding to a Python encoding name.
+     * Then convert to server encoding if necessary.
+     *
+     * PyUnicode_AsEncodedString could be used to encode the object directly
+     * in the server encoding, but Python doesn't support all the encodings
+     * that PostgreSQL does (EUC_TW and MULE_INTERNAL). UTF-8 is used as an
+     * intermediary in PLyUnicode_FromString as well.
      */
-    switch (GetDatabaseEncoding())
+    if (GetDatabaseEncoding() != PG_UTF8)
     {
-        case PG_SQL_ASCII:
-            /*
-             * Mapping SQL_ASCII to Python's 'ascii' is a bit bogus. Python's
-             * 'ascii' means true 7-bit only ASCII, while PostgreSQL's
-             * SQL_ASCII means that anything is allowed, and the system doesn't
-             * try to interpret the bytes in any way. But not sure what else
-             * to do, and we haven't heard any complaints...
-             */
-            serverenc = "ascii";
-            break;
-        case PG_WIN1250:
-            serverenc = "cp1250";
-            break;
-        case PG_WIN1251:
-            serverenc = "cp1251";
-            break;
-        case PG_WIN1252:
-            serverenc = "cp1252";
-            break;
-        case PG_WIN1253:
-            serverenc = "cp1253";
-            break;
-        case PG_WIN1254:
-            serverenc = "cp1254";
-            break;
-        case PG_WIN1255:
-            serverenc = "cp1255";
-            break;
-        case PG_WIN1256:
-            serverenc = "cp1256";
-            break;
-        case PG_WIN1257:
-            serverenc = "cp1257";
-            break;
-        case PG_WIN1258:
-            serverenc = "cp1258";
-            break;
-        case PG_WIN866:
-            serverenc = "cp866";
-            break;
-        case PG_WIN874:
-            serverenc = "cp874";
-            break;
-        default:
-            /* Other encodings have the same name in Python. */
-            serverenc = GetDatabaseEncodingName();
-            break;
+        PG_TRY();
+        {
+            encoded = (char *) pg_do_encoding_conversion(
+                (unsigned char *) utf8string,
+                strlen(utf8string),
+                PG_UTF8,
+                GetDatabaseEncoding());
+        }
+        PG_CATCH();
+        {
+            Py_DECREF(bytes);
+            PG_RE_THROW();
+        }
+        PG_END_TRY();
     }
+    else
+        encoded = utf8string;
+
+    /* finally, build a bytes object in the server encoding */
+    rv = PyBytes_FromStringAndSize(encoded, strlen(encoded));
+
+    /* if pg_do_encoding_conversion allocated memory, free it now */
+    if (utf8string != encoded)
+        pfree(encoded);

-    rv = PyUnicode_AsEncodedString(unicode, serverenc, "strict");
-    if (rv == NULL)
-        PLy_elog(ERROR, "could not convert Python Unicode object to PostgreSQL server encoding");
+    Py_DECREF(bytes);
     return rv;
 }
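
The control flow of the new PLyUnicode_Bytes can be modeled in a few lines of Python, offered here only as a hedged sketch of the two-step idea (UTF-8 first, then server encoding). The function name `unicode_to_server_bytes` and its `server_encoding` parameter are hypothetical. Python's own codecs stand in for `pg_do_encoding_conversion`, which the real patch uses precisely because Python's codecs do not cover every server encoding.

```python
def unicode_to_server_bytes(text, server_encoding):
    """Encode text via UTF-8 first, then convert to the server encoding."""
    # Step 1: always produce UTF-8 bytes (PyUnicode_AsUTF8String in the patch).
    utf8_bytes = text.encode("utf-8")

    # Step 2: skip the conversion when the server encoding is already UTF-8,
    # mirroring the GetDatabaseEncoding() != PG_UTF8 check.
    normalized = server_encoding.lower().replace("-", "").replace("_", "")
    if normalized == "utf8":
        return utf8_bytes

    # Step 3: stand-in for pg_do_encoding_conversion(..., PG_UTF8, server_enc).
    # A Python codec is used only to keep this sketch self-contained.
    return utf8_bytes.decode("utf-8").encode(server_encoding)


if __name__ == "__main__":
    print(unicode_to_server_bytes("\u00e9", "UTF8"))
    print(unicode_to_server_bytes("\u00e9", "latin-1"))
```

In the C code the UTF-8 buffer and the converted buffer can be the same pointer, which is why the patch compares `utf8string` with `encoded` before calling `pfree`. The Python model has no equivalent of that memory-management step.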
