Description
Steps to reproduce
- Send a message like "👩👩👧👧http://google.com" (that is, a family emoji (or really any complicated unicode codepoint) followed by a link)
- The link should appear in a MessageEntity in update.message.entities with the type 'url'.
- Try to use the length or offset attribute for basically any purpose... For example:
entity = update.message.entities[0]
link = update.message.text[entity.offset:entity.offset + entity.length]
Expected behaviour
In the example above you'd expect to have "http://google.com" in the link variable.
Actual behaviour
link contains "://google.com".
Why
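This happens because Telegram servers calculate offsets and lengths in UTF-16 code units, while Python 3.5 indexes strings by Unicode code point (and really that's good, since UTF-16 is bad...). Every character outside the Basic Multilingual Plane takes two UTF-16 code units but counts as a single character in Python. This means that Python sees the family emoji as 8 characters while Telegram sees it as 12, so the slice starts 4 characters too late.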
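A minimal sketch of the mismatch, assuming four bare emoji written as escapes (no joiners), so the absolute counts are 4 and 8 rather than the 8 and 12 above; the gap of 4 is the same. The offset and length are computed by hand here to stand in for what Telegram would report; no bot API call is involved.

```python
# Four astral-plane emoji, written as escapes so the code point count is unambiguous.
emoji = "\U0001F469\U0001F469\U0001F467\U0001F467"
url = "http://google.com"
text = emoji + url

code_points = len(emoji)                           # what Python counts
utf16_units = len(emoji.encode("utf-16-le")) // 2  # what Telegram counts

# Stand-ins for entity.offset and entity.length, measured in UTF-16 code units.
offset, length = utf16_units, len(url)

# Slicing the Python string with UTF-16 offsets overshoots by the gap.
naive = text[offset:offset + length]

print(code_points, utf16_units, repr(naive))
```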
Solution
Either:
- Patch the length and offset inside MessageEntity objects so that they match what Python thinks...
- Add some sort of util function to convert offsets between Python string indices and UTF-16 code units
- Add some sort of custom slicing util function that does the conversion internally
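The second and third options could be sketched roughly like this. The names utf16_slice and utf16_offset_to_index are hypothetical helpers made up for illustration, not part of python-telegram-bot. Both go through the UTF-16-LE encoding, where every code unit is exactly two bytes, so they should not depend on how the interpreter indexes strings internally.

```python
def utf16_slice(text, offset, length):
    """Hypothetical helper: return the substring addressed by an
    offset and length given in UTF-16 code units."""
    encoded = text.encode("utf-16-le")  # two bytes per code unit, no BOM
    start = offset * 2
    end = (offset + length) * 2
    return encoded[start:end].decode("utf-16-le")


def utf16_offset_to_index(text, offset):
    """Hypothetical helper: convert an offset in UTF-16 code units
    to an index into the Python string."""
    encoded = text.encode("utf-16-le")
    return len(encoded[:offset * 2].decode("utf-16-le"))


# Same sample as in the report, with the emoji written as escapes.
sample = "\U0001F469\U0001F469\U0001F467\U0001F467http://google.com"
link = utf16_slice(sample, 8, 17)
index = utf16_offset_to_index(sample, 8)
print(link, index)
```

Slicing the encoded bytes avoids walking the string character by character, at the cost of re-encoding the whole message text on every call; for message-sized strings that is probably acceptable.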
Configuration
Operating System:
Windows 10 Education
Version of Python, python-telegram-bot & dependencies:
python-telegram-bot 5.0.0
urllib3 1.16
certifi 2016.08.08
future 0.15.2
Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul 5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]
Note
The behaviour of strings containing characters outside the Basic Multilingual Plane changed as of Python 3.3. Before that it depended on a build flag: "narrow" builds indexed strings by UTF-16 code unit and "wide" builds by code point. From 3.3 on, strings are always indexed by code point. So to support all Python versions, we'd probably have to do some sort of wizardry... (See the discussion at https://stackoverflow.com/questions/30775689/python-length-of-unicode-string-confusion for more info (kinda))
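One way to tell the two cases apart at runtime is sys.maxunicode, which is 0xFFFF on narrow builds and 0x10FFFF otherwise. A minimal sketch, where IS_NARROW_BUILD and needs_offset_conversion are hypothetical names chosen for illustration:

```python
import sys

# Narrow builds (only possible before Python 3.3) already index strings by
# UTF-16 code unit, so Telegram's offsets line up with string indices there.
IS_NARROW_BUILD = sys.maxunicode == 0xFFFF


def needs_offset_conversion():
    """Hypothetical check: True when string indices are code points and
    therefore differ from Telegram's UTF-16 offsets for astral characters."""
    return not IS_NARROW_BUILD


print(hex(sys.maxunicode), needs_offset_conversion())
```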