Description
Bug report
Bug description:
In the attached minimal Python example, email_raw_1 survives a round trip from a UTF-8 byte string to an EmailMessage object and back to a string, while email_raw_2 does not:
Traceback (most recent call last):
  File "//surrogate_issue.py", line 29, in <module>
    print(message_2)
  …
  File "/usr/local/lib/python3.12/email/_encoded_words.py", line 224, in encode
    bstring = string.encode(charset)
              ^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-2: surrogates not allowed
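For context, here is a minimal sketch of where such surrogates can come from. It assumes, as I understand the email package, that the bytes parser reads its input through an ASCII decoder with the surrogateescape error handler, so each non-ASCII byte becomes one lone surrogate code point. The three UTF-8 bytes of 您 would then turn into three surrogates, which would match "position 0-2" in the error. The variable names are only for illustration.

```python
# The three UTF-8 bytes of the first character in the failing body line.
raw = "您".encode("utf-8")

# Decoding as ASCII with surrogateescape maps each non-ASCII byte
# to one lone surrogate code point instead of raising an error.
decoded = raw.decode("ascii", errors="surrogateescape")
print(len(decoded))

# A strict UTF-8 encode refuses lone surrogates.
error = None
try:
    decoded.encode("utf-8")
except UnicodeEncodeError as exc:
    error = exc
    print(exc)

# Encoding with surrogateescape again restores the original bytes.
restored = decoded.encode("ascii", errors="surrogateescape")
print(restored == raw)
```

So the bytes themselves are not lost by parsing; the failure only appears when the surrogate-escaped text is encoded strictly.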
Oddly, the only difference between the two inputs is one additional digit in the body line.
The email is malformed; however, it is taken from an actual mail at https://wilson.bronger.org/5105.txt. Malformed or not, my other email machinery can deal with it, so I think Python should handle such real-world specimens on a best-effort basis instead of aborting with an exception.
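Until this is handled in the library, a caller-side workaround might look like the sketch below. The helper name render_best_effort is hypothetical and not part of the email package. It assumes that falling back to the raw input, decoded with replacement characters, is acceptable when serialization fails.

```python
import email
import email.policy


def render_best_effort(raw: bytes) -> str:
    """Hypothetical helper: serialize a parsed message, falling back to the raw text."""
    message = email.message_from_bytes(raw, policy=email.policy.SMTPUTF8)
    try:
        return message.as_string()
    except UnicodeEncodeError:
        # Serialization hit surrogate-escaped bytes; show the raw input instead,
        # replacing anything that is not valid UTF-8.
        return raw.decode("utf-8", errors="replace")


print(render_best_effort(b"Subject: hello\n\nbody\n"))
```

This only hides the symptom for display purposes; the parsed message object is still affected by the underlying problem.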
#!/bin/python
import email, email.policy
email_raw_1 = """Content-Type: multipart/mixed; boundary="==="
--===
Content-Type: message/plain
您0123456789012.3456789
--===--
""".encode()
email_raw_2 = """Content-Type: multipart/mixed; boundary="==="
--===
Content-Type: message/plain
您0123456789012.34567890
--===--
""".encode()
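# Parse both byte strings with the SMTPUTF8 policy.
message_1 = email.message_from_bytes(email_raw_1, policy=email.policy.SMTPUTF8)
message_2 = email.message_from_bytes(email_raw_2, policy=email.policy.SMTPUTF8)
# Serializing back to a string works for message_1 ...
print(message_1)
# ... but raises UnicodeEncodeError for message_2.
print(message_2)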
CPython versions tested on:
3.12
Operating systems tested on:
Linux