UTF-8 Email parsing/serialising: Roundtrip exits with “surrogates not allowed” #113594

Closed
@bronger

Description

Bug report

Bug description:

In the attached minimal Python example, email_raw_1 survives a round trip from a UTF-8 byte string to an EmailMessage object and back to a string, while email_raw_2 does not:

Traceback (most recent call last):
  File "//surrogate_issue.py", line 29, in <module>
    print(message_2)

  File "/usr/local/lib/python3.12/email/_encoded_words.py", line 224, in encode
    bstring = string.encode(charset)
              ^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-2: surrogates not allowed
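
For context, here is a minimal sketch of how lone surrogates like these can arise and why a strict UTF-8 encode rejects them. It assumes (the report does not state this) that non-ASCII bytes in the input are carried through parsing via the `surrogateescape` error handler, which is how the `email` package is understood to treat bytes input. The variable names are illustrative only.

```python
# The three UTF-8 bytes of the non-ASCII character used in the example.
raw = "您".encode("utf-8")

# Decoding as ASCII with surrogateescape maps each non-ASCII byte to a lone
# surrogate code point (U+DC80..U+DCFF) instead of raising an error.
escaped = raw.decode("ascii", errors="surrogateescape")

# A strict UTF-8 encode of those surrogates fails with the same kind of
# error as in the traceback.
error = None
try:
    escaped.encode("utf-8")
except UnicodeEncodeError as exc:
    error = exc
    print(exc)

# Encoding with the same error handler restores the original bytes.
restored = escaped.encode("ascii", errors="surrogateescape")
print(restored == raw)
```

If that assumption holds, "position 0-2" in the traceback would correspond to the three escaped bytes of the single character 您.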

Oddly, the only difference between the two inputs is one additional digit: email_raw_2 has an extra trailing 0 on the line that starts with 您.

The email is malformed, but it is taken from an actual mail at https://wilson.bronger.org/5105.txt. Malformed or not, my other email machinery can deal with it, so I think Python should handle such real-world specimens on a best-effort basis instead of failing with an exception.

#!/bin/python

import email, email.policy


email_raw_1 = """Content-Type: multipart/mixed; boundary="==="

--===
Content-Type: message/plain
 
 您0123456789012.3456789

--===--
""".encode()

email_raw_2 = """Content-Type: multipart/mixed; boundary="==="

--===
Content-Type: message/plain
 
 您0123456789012.34567890

--===--
""".encode()

message_1 = email.message_from_bytes(email_raw_1, policy=email.policy.SMTPUTF8)
message_2 = email.message_from_bytes(email_raw_2, policy=email.policy.SMTPUTF8)
print(message_1)
print(message_2)
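
Until the library handles such input itself, a caller-side stopgap along the following lines may be an option. This is only a sketch: `render_best_effort` is a hypothetical helper, not part of the `email` package, and it assumes the caller still has the original bytes and that falling back to them is acceptable. It does not fix the underlying serialisation problem.

```python
import email
import email.policy


def render_best_effort(message, raw_bytes):
    """Return str(message), or the decoded raw input if serialising fails.

    Hypothetical helper for illustration; not part of the email package.
    """
    try:
        return str(message)
    except UnicodeEncodeError:
        # Fall back to the original bytes; undecodable bytes become U+FFFD.
        return raw_bytes.decode("utf-8", errors="replace")


# Same content as email_raw_2, written with explicit "\n" escapes so that
# the whitespace-only line survives copy and paste.
raw = (
    'Content-Type: multipart/mixed; boundary="==="\n'
    "\n"
    "--===\n"
    "Content-Type: message/plain\n"
    " \n"
    " 您0123456789012.34567890\n"
    "\n"
    "--===--\n"
).encode()

message = email.message_from_bytes(raw, policy=email.policy.SMTPUTF8)
text = render_best_effort(message, raw)

# ascii() escapes every non-ASCII character, so printing cannot fail
# regardless of the terminal encoding.
print(ascii(text))
```

Whether `str(message)` succeeds or the fallback is taken depends on the Python version in use; either way the helper returns a string rather than propagating the `UnicodeEncodeError`.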

CPython versions tested on:

3.12

Operating systems tested on:

Linux

Labels

3.11: only security fixes
3.12: only security fixes
3.13: bugs and security fixes
topic-email
type-bug: An unexpected behavior, bug, or error
