The idea to use dataclasses has been discussed a number of times (see #2056), more recently in #2491, and there is also a connection to #2607. So here I am opening an issue to write down some thoughts.
This is far from being my final judgement on the matter and I'm aware that it will be a while until we properly discuss this. So this is more of a reminder for future-me.
Motivation
We have a lot of boilerplate code. Especially the inits of classes like `Message` take up a lot of space, along with `__slots__` and `de/to_json`.
There are ways to hide those away by using 3rd party libraries (that inspired `dataclasses`) or the built-in `dataclasses` module. Here is what such approaches can possibly do for us:
- removing boilerplate inits
- removing boilerplate slots
- Solving big parts of [FEATURE/DISCUSSION] Representations of TG objects #2491 without any work at all
- Making TelegramObjects immutable. I'm really not sure if we want that, but it's at least an interesting thought, given that incoming updates represent static information and changing them doesn't make sense at all.
Edits:
- inspired by Remove redundant setters #2747, dataclasses could also be used for `ext.Defaults` and `CallbackContext`, especially making them immutable
- dataclasses would (as a side effect) solve large parts of "problem with types in python-telegram-bot package" microsoft/pylance-release#1983
- with dataclasses, we could get rid of the `__new__` workaround in `TelegramObject`: The reason for that was that we didn't want to have to call `super().__init__` in all the subclasses. If `TO` is a dataclass as well, then the auto-generated inits of the subclasses would automatically do that for us
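A quick sketch of that last point (names are illustrative, not the real library API): when the base class is a dataclass too, the generated `__init__` of a subclass already covers the parent's fields, so no manual `super().__init__` call is needed anywhere.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TelegramObject:
    api_kwargs: Optional[dict] = None  # hypothetical shared field


@dataclass
class Chat(TelegramObject):
    # No __init__ and no super().__init__ here - the generated init
    # accepts api_kwargs, id and type in one signature.
    id: int = 0
    type: str = "private"


chat = Chat(id=42, type="group")
```

Note one constraint this brings: once a parent field has a default, all subclass fields need defaults too (at least before `kw_only` landed in Python 3.10).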
A word on 3rd party libs
There are a number of 3rd party libraries that offer similar functionality to `dataclasses`, most notably `attrs` and `pydantic`, but also some others. I know that we try to keep our dependencies at a minimum and I'm fully backing that mentality. Still, let me list some pros & cons that I see of possibly using `attrs` & `pydantic` compared to using `dataclasses`:
- ➕ Automated handling of (nested!) to/de_json logic. `pydantic` has this built-in, `attrs` can do this together with the `cattrs` library
- ➕/➖ Data validation. `pydantic` is heavily built around that. This means that e.g. `self.text = text` is replaced by `self.text = str(text)` in `pydantic`, but apparently this also slows it down
- ➕ `attrs` comes with support for slots, which `dataclass` only adds in py3.10. IISC `pydantic` does not automatically
- ➖ IISC neither `pydantic` nor `attrs` can do both extra arguments and slots
- ➖ IDE- & type-checker integration may be non-optimal (`pydantic` provides plugins for `mypy` and PyCharm)
- ➖ They are dependencies. Duh. If your own boilerplate is faulty, you can change it. If the dependency has a bug, then good night.
- ➕/➖ Handling of extra arguments - this is actually really important, also for [Discussion] kwargs for input-only classes (like `InputMedia*`) #2607 and I'll go into detail on that later. TL;DR: `pydantic` can dismiss additional kwargs that are passed to the init, e.g. `Chat(…, new_argument=42)` will work. `attrs` can't, IISC. `dataclass` can't either.
But let's focus on the last bullet point. If we can't handle that properly, then there is no use for `attrs` or `dataclass` at all.
Dismissal of additional kwargs
Currently there is no way to make `dataclass` add `**kwargs` to the auto-generated `__init__`.
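A tiny demonstration of the problem (the class name is illustrative): a plain dataclass `__init__` rejects unknown keyword arguments outright, so `Chat(…, new_argument=42)` would break as soon as Telegram adds a field we don't know about yet.

```python
from dataclasses import dataclass


@dataclass
class Chat:
    id: int


# Passing an argument the generated __init__ doesn't know raises TypeError.
try:
    Chat(id=1, new_argument=42)
    rejected = False
except TypeError:
    rejected = True
```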
But what if we approach this problem differently? Instead of passing the additional kwargs, we can filter them out beforehand. Ofc this is more expensive, but
- the chance of this happening is low (new arguments are not added too often & we're usually lightning fast on API updates *cough*)
- similar to the discussion Refactor kwargs handling in Bot methods #1924, by removing `**kwargs` we no longer just swallow typos in arguments etc. Admittedly, this benefit is small, since classes like `Message`, `Chat` etc. are usually not manually instantiated.
Such an approach should work with both `attrs` and `dataclass`.
One thing to keep in mind is that we currently pass `bot` to all classes, so a lot of them always have one additional kwarg. This should be easily solvable by different means. E.g. one could
- add `bot` to all signatures
- add a class variable `HAS_SHORTCUTS` that tells us whether or not we need to pass `bot`
- add a method `TelegramObject.set_bot` to remove `bot` from all signatures
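The third option could look roughly like this (a hypothetical sketch, not the actual library API): `bot` stays out of every generated `__init__` and is attached after construction.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class TelegramObject:
    # init=False keeps `bot` out of the generated __init__ signature
    _bot: Optional[Any] = field(default=None, init=False, repr=False, compare=False)

    def set_bot(self, bot):
        self._bot = bot

    @classmethod
    def de_json(cls, data, bot):
        obj = cls(**data)  # no `bot` kwarg needed here
        obj.set_bot(bot)
        return obj


@dataclass
class Chat(TelegramObject):
    id: int = 0


chat = Chat.de_json({"id": 7}, bot="dummy-bot")
```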
So here's a small comparison of the two approaches:
```python
import inspect
import timeit


class InspectDeJson:
    _init_params = None

    def __init__(self, arg1, arg2, arg3, arg4, arg5, arg6, arg7):
        self.arg1 = arg1
        self.arg2 = arg2
        self.arg3 = arg3
        self.arg4 = arg4
        self.arg5 = arg5
        self.arg6 = arg6
        self.arg7 = arg7

    @classmethod
    def de_json(cls, data):
        if cls._init_params is None:
            signature = inspect.signature(cls)
            cls._init_params = signature.parameters.keys()
        # try-except is significantly faster in case we already have a correct argument set
        try:
            return cls(**data)
        except TypeError as exc:
            # Python 3.10+ prefixes the message with the class name, so check with
            # `in` rather than startswith('__init__() got an unexpected keyword argument')
            if 'got an unexpected keyword argument' in str(exc):
                return cls(**{k: v for k, v in data.items() if k in cls._init_params})
            raise exc


class KwargsDeJson:
    def __init__(self, arg1, arg2, arg3, arg4, arg5, arg6, arg7, **kwargs):
        self.arg1 = arg1
        self.arg2 = arg2
        self.arg3 = arg3
        self.arg4 = arg4
        self.arg5 = arg5
        self.arg6 = arg6
        self.arg7 = arg7

    @classmethod
    def de_json(cls, data):
        return cls(**data)


correct_data = {
    'arg1': 'arg1',
    'arg2': 'arg2',
    'arg3': 'arg3',
    'arg4': 'arg4',
    'arg5': 'arg5',
    'arg6': 'arg6',
    'arg7': 'arg7',
}
overfull_data = correct_data.copy()
overfull_data.update({str(i): i for i in range(5)})

number = int(5 * 10e6)

print('Passing correct data')
a = timeit.timeit('InspectDeJson.de_json(correct_data)', globals=globals(), number=number)
b = timeit.timeit('KwargsDeJson.de_json(correct_data)', globals=globals(), number=number)
print(f'InspectDeJson took {a/number} seconds on average')
print(f'KwargsDeJson took {b/number} seconds on average')
print(f'InspectDeJson took {a*100/b} % of KwargsDeJsons runtime')

print('\n' + 20*"=" + '\n')

print('Passing too much data')
a = timeit.timeit('InspectDeJson.de_json(overfull_data)', globals=globals(), number=number)
b = timeit.timeit('KwargsDeJson.de_json(overfull_data)', globals=globals(), number=number)
print(f'InspectDeJson took {a/number} seconds on average')
print(f'KwargsDeJson took {b/number} seconds on average')
print(f'InspectDeJson took {a*100/b} % of KwargsDeJsons runtime')
```
Example output on my machine:
```
Passing correct data
InspectDeJson took 9.67620756e-07 seconds on average
KwargsDeJson took 9.222072120000002e-07 seconds on average
InspectDeJson took 104.92444034367406 % of KwargsDeJsons runtime

====================

Passing too much data
InspectDeJson took 4.28629354e-06 seconds on average
KwargsDeJson took 1.8550962000000004e-06 seconds on average
InspectDeJson took 231.05505471899508 % of KwargsDeJsons runtime
```
So in the "usual" case the overhead is negligible IMO and the usage of inspect dark magic is not too extensive.
In the cases where we have unhandled data, the slowdown is heavily noticeable - but we're still talking 4 vs 2 microseconds and, as explained above, this shouldn't happen too often.
Summary
For today, my punchlines are:
- Dismissal of additional kwargs can be handled in a different way that works reasonably well IMO
- Using a 3rd party lib could reduce some more boilerplate code, specifically `to/de_json`, but neither `pydantic` nor `attrs` has all the features that I would like. Neither does `dataclass`, but at least that comes with batteries included
PS:
PyCharm apparently still has no proper support for `dataclasses` - not that this would stop us, just mentioning it …