Skip to content

[DISCUSSION] Dataclasses #2698

@Bibo-Joshi

Description

@Bibo-Joshi

The idea to use dataclasses has been discussed a number of times (see #2056), more recently in #2491 and there is also a connection to #2607. So here I am opening an issue to write down some thoughts.

This is rather for from being my final judgement on the matter and I'm aware that it will be a while until we properly discuss this. So more of a reminder for future-me.

Motivation

We have a lot of boilerplate code. especially the inits of classes like Message take up a lot of place, along with __slots__ and de/to_json.
There are ways to hide those away by using 3rd pary libraries (that inspired dataclasses) or the built-in dataclasses. Here is what such approaches can possibly do for us:

  • removing boilerplate inits
  • removing boilerplate slots
  • Solving big parts of [FEATURE/DISCUSSION] Representations of TG objects #2491 without any work at all
  • Making TelegramObjects immutable. I'm really not sure if we want that, but it's at least an interesting thought given that incoming updates represent a static information and changing them doesn't make sense at all.

Edits:

  • inspired by Remove redundant setters #2747 dataclasses could also be used for ext.Defaults and CallbackContext, especially making them immutable
  • dataclasses would (as side effect) solve large parts of problem with types in python-telegram-bot package microsoft/pylance-release#1983
  • with dataclasses, we could get rid of the __new__ workaround in TelegramObject: The reason for that was that we didn't want to have to call super().__init__ in all the subclasses. if TO is a dataclass as well, then the auto-generated inits of the subclasses would automatically do that for us

A word on 3rd party libs

A number of 3rd party libraries that offer similar functionality as dataclasses, most notably attrs and pydantic but also some others. I know that we try to keep our dependencies at a minimum and I'm fully backing that mentality. Still, let me list some pros & cons that I see of possible usage of attrs & pydantic - compared to using dataclasses

  • ➕ Automated handling of (nested!) to/de_json logic. pydantic has this built-in, attrs can do this together with the cattrs library
  • ➕/➖ Data validation. pydantic is heavily inspired by that. This means that e.g. self.text = text is replaced by self.text = str(text) in pydantic, but apparently also slows it down
  • attrs comes with support for slots, which dataclass only adds in py3.10. IISC pydantic does notautomatically
  • ➖ IISC neither pydantic nor attrs can do both extra arguments and slots
  • ➖IDE- & type-checker integration may be non-optimal (pydantice provides plugins for mypy and PyCharm)
  • ➖ They are dependencies. duh. If your own boilerplate is fasly, you can change it. If the dependency has a bug, then good night.
  • ➕/➖ Handling of extra arguments - this is actually really important, also for [Discussion] kwargs for input-only classes (like InputMedia*) #2607 and I'll go into detail on that later. TL;DR: pydantic can dismiss additional kwargs that are passed to init, e.g. Chat(…, new_argument=42) will work. attrs can't, IISC. dataclass can't either

But let's focus on the last bullet point. If we can't handle that properly, then there is no use for attrs or dataclass at all

Dismissal of additional kwargs

Currently there is no way to make dataclass add **kwargs to the auto-generated __init__.
But what if we approach this problem differently? Instead of passing the additional kwargs, we can filter them out before. Ofc this is more expensive, but

  • the chance of this happening is low (new arguments are not added too aften & we're usually lightning fast on API updates *cough*)
  • similar to the discussion Refactor kwargs handling in Bot methods #1924, by removing **kwargs we no longer just swallow typos in arguments etc. Admittedly, this benefit is mall, since classes like Message, Chat etc are usually not manually instantiated.

Such an approach should work with both attrs and dataclass

One thing to keep in mind is that we currently pass bot to all classes, so a lot of them always have one additional kwarg. This should be easily solvable by different means. E.g. one could

  • add bot to all signatures
  • add a class variable HAS_SHORTCUTS that tells us whether or not we need to pass bot
  • add a method TelegramObject.set_bot to remove bot from all signatures

So here's a small comparison of the two approaches:

import inspect
import timeit


class InspectDeJson:
    _init_params = None

    def __init__(self, arg1, arg2, arg3, arg4, arg5, arg6, arg7):
        self.arg1 = arg1
        self.arg2 = arg2
        self.arg3 = arg3
        self.arg4 = arg4
        self.arg5 = arg5
        self.arg6 = arg6
        self.arg7 = arg7

    @classmethod
    def de_json(cls, data):
        if cls._init_params is None:
            signature = inspect.signature(cls)
            cls._init_params = signature.parameters.keys()

        # try-except is significantly faster in case we already have a correct argument set
        try:
            return cls(**data)
        except TypeError as exc:
            if str(exc).startswith('__init__() got an unexpected keyword argument'):
                return cls(**{k: v for k, v in data.items() if k in cls._init_params})
            else:
                raise exc


class KwargsDeJson:

    def __init__(self, arg1, arg2, arg3, arg4, arg5, arg6, arg7, **kwargs):
        self.arg1 = arg1
        self.arg2 = arg2
        self.arg3 = arg3
        self.arg4 = arg4
        self.arg5 = arg5
        self.arg6 = arg6
        self.arg7 = arg7

    @classmethod
    def de_json(cls, data):
        return cls(**data)


correct_data = {
        'arg1': 'arg1',
        'arg2': 'arg2',
        'arg3': 'arg3',
        'arg4': 'arg4',
        'arg5': 'arg5',
        'arg6': 'arg6',
        'arg7': 'arg7',
}
overfull_data = correct_data.copy()
overfull_data.update({str(i): i for i in range(5)})

number = int(5*10e6)

print('Passing correct data')
a = timeit.timeit('InspectDeJson.de_json(correct_data)', globals=globals(), number=number)
b = timeit.timeit('KwargsDeJson.de_json(correct_data)', globals=globals(), number=number)
print(f'InspectDeJson took {a/number} seconds on average')
print(f'KwargsDeJson took {b/number} seconds on average')
print(f'InspectDeJson took {a*100/b} % of KwargsDeJsons runtime')

print('\n' + 20*"=" + '\n')


print('Passing too much data')
a = timeit.timeit('InspectDeJson.de_json(overfull_data)', globals=globals(), number=number)
b = timeit.timeit('KwargsDeJson.de_json(overfull_data)', globals=globals(), number=number)
print(f'InspectDeJson took {a/number} seconds on average')
print(f'KwargsDeJson took {b/number} seconds on average')
print(f'InspectDeJson took {a*100/b} % of KwargsDeJsons runtime')

Example output on my machine:

InspectDeJson took 9.67620756e-07 seconds on average
KwargsDeJson took 9.222072120000002e-07 seconds on average
InspectDeJson took 104.92444034367406 % of KwargsDeJsons runtime

====================

Passing too much data
InspectDeJson took 4.28629354e-06 seconds on average
KwargsDeJson took 1.8550962000000004e-06 seconds on average
InspectDeJson took 231.05505471899508 % of KwargsDeJsons runtime

So in the "usual" case the overhead is neglectible IMO and the usage of inspect-dark magic is not too extensive.
In the cases where we have unhandled data, the slow down is heavily notiable - but we're still talking 4 vs 2 micro seconds and as explained above, this shouldn't happen too often.

Summary

For today, much punchlines are:

  • Dismissal of additional kwargs can be handled in a different way, that works reasonably well IMO
  • Using 3rd party lib could reduce some more boilerplate code, specifically to/de_json, but neither pydantic nor attrs has all the features that I would like. Neither does dataclass, but at least that's included batteries

PS:

PyCharm apparently still has no proper support for dataclasses - not that this would stop us, just mentioning it …

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions