Handling of MessageEntity length and offset #400

@jsmnbom

Steps to reproduce

  1. Send a message like "👩‍👩‍👧‍👧http://google.com" (that is, a family emoji, or really any character outside the Basic Multilingual Plane, followed by a link).
  2. The link should appear in a MessageEntity in update.message.entities with the type 'url'.
  3. Try to use the length or offset attribute for basically any purpose, e.g.:
     entity = update.message.entities[0]
     link = update.message.text[entity.offset:entity.offset + entity.length]

Expected behaviour

In the example above you'd expect to have "http://google.com" in the link variable.

Actual behaviour

link contains "://google.com".

Why

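This happens because Telegram's servers calculate offsets and lengths in UTF-16 code units, while Python 3.3+ counts Unicode code points (and really that's good, since UTF-16 is bad...). The family emoji is four emoji joined by three zero-width joiners, so Python sees it as 7 characters while Telegram sees it as 11: each of the four emoji needs a surrogate pair in UTF-16. The offset Telegram reports is therefore 4 too large for Python, which is exactly the missing "http".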
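To make the mismatch concrete, here is a small sketch that needs no Telegram connection. It assumes Python 3.3+, writes the emoji with explicit escapes so the code points are unambiguous, and uses a made-up helper name (`utf16_len`) that is not part of python-telegram-bot.

```python
# Family emoji: woman, ZWJ, woman, ZWJ, girl, ZWJ, girl (written as escapes).
FAMILY = "\U0001F469\u200D\U0001F469\u200D\U0001F467\u200D\U0001F467"
URL = "http://google.com"
text = FAMILY + URL


def utf16_len(s):
    """Number of UTF-16 code units in s (hypothetical helper for illustration)."""
    # Every UTF-16 code unit is 2 bytes; "utf-16-le" avoids the leading BOM.
    return len(s.encode("utf-16-le")) // 2


python_offset = len(FAMILY)          # counted in code points
telegram_offset = utf16_len(FAMILY)  # counted in UTF-16 code units

# Slicing with the UTF-16 offset, as in the steps to reproduce, starts too late.
naive_link = text[telegram_offset:telegram_offset + utf16_len(URL)]

print(python_offset, telegram_offset, repr(naive_link))
```

The two offsets differ by 4 because each of the four emoji lies outside the Basic Multilingual Plane and takes two UTF-16 code units, while the zero-width joiners take one unit either way.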

Solution

Either:

  1. Patch the length and offset inside MessageEntity objects so that they match what Python expects...
  2. Add some sort of util function to convert offsets between Python's code-point indices and UTF-16 code units
  3. Add some sort of custom slicing util function that does the conversion internally
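Options 2 and 3 could look roughly like the sketch below. Both function names are hypothetical and are not existing python-telegram-bot functions; the idea is simply to encode the text as UTF-16, work on the bytes, and decode again. It assumes Python 3.3+ and that the offset and length never cut a surrogate pair in half, which should hold for entities reported by Telegram.

```python
def slice_utf16(text, offset, length):
    """Return the part of text addressed by a UTF-16 offset and length.

    Hypothetical helper (option 3), not an existing library function.
    """
    # Encode without a BOM so that every code unit is exactly 2 bytes.
    encoded = text.encode("utf-16-le")
    chunk = encoded[offset * 2:(offset + length) * 2]
    return chunk.decode("utf-16-le")


def utf16_to_str_index(text, utf16_offset):
    """Convert a UTF-16 offset into a Python str index (option 2, also hypothetical)."""
    prefix = text.encode("utf-16-le")[:utf16_offset * 2]
    return len(prefix.decode("utf-16-le"))


emoji = "\U0001F469\u200D\U0001F469\u200D\U0001F467\u200D\U0001F467"
url = "http://google.com"
text = emoji + url

# Offset and length as Telegram would report them (UTF-16 code units).
offset = len(emoji.encode("utf-16-le")) // 2
length = len(url)  # plain ASCII, so the count is the same either way

link = slice_utf16(text, offset, length)
index = utf16_to_str_index(text, offset)
```

Option 1 could be built on the same conversion: translate each entity's offset and end position once with something like `utf16_to_str_index` and store the results, at the cost of the stored values no longer matching what the Bot API sent.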

Configuration

Operating System:
Windows 10 Education

Version of Python, python-telegram-bot & dependencies:
python-telegram-bot 5.0.0
urllib3 1.16
certifi 2016.08.08
future 0.15.2
Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul 5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]

Note
The way Python handles characters outside the Basic Multilingual Plane changed as of Python 3.3: len() and indexing now always count code points, whereas before that the behaviour depended on whether the interpreter was compiled as a narrow or a wide build. So to support all Python versions, we'd probably have to do some sort of wizardry... (See the discussion at https://stackoverflow.com/questions/30775689/python-length-of-unicode-string-confusion for more info, kinda.)
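One possible shape for that wizardry, as a sketch only: `sys.maxunicode` tells narrow builds (0xFFFF) apart from wide builds and Python 3.3+ (0x10FFFF). On a narrow build, string indices already are UTF-16 code units, so plain slicing matches Telegram's numbers; everywhere else the text can be routed through a UTF-16 encoding. The function name `entity_text` is made up for illustration.

```python
import sys

# 0xFFFF on narrow builds (before Python 3.3); 0x10FFFF on wide builds and on 3.3+.
IS_NARROW_BUILD = sys.maxunicode == 0xFFFF


def entity_text(text, offset, length):
    """Extract an entity's text given a UTF-16 offset and length.

    Hypothetical helper; a sketch, not the library's implementation.
    """
    if IS_NARROW_BUILD:
        # Narrow builds index strings by UTF-16 code unit already.
        return text[offset:offset + length]
    # Otherwise slice on the UTF-16 bytes (2 bytes per code unit, no BOM).
    encoded = text.encode("utf-16-le")
    return encoded[offset * 2:(offset + length) * 2].decode("utf-16-le")


emoji = u"\U0001F469\u200D\U0001F469\u200D\U0001F467\u200D\U0001F467"
url = u"http://google.com"
utf16_offset = len(emoji.encode("utf-16-le")) // 2
link = entity_text(emoji + url, utf16_offset, len(url))
```

The encode-and-slice branch should also give the right answer on a narrow build, so the check is mostly an optimisation; it is kept here to show where the two build types differ.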
