Skip to content

[DISCUSSION] Dataclasses #2698

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Bibo-Joshi opened this issue Oct 4, 2021 · 14 comments
Closed

[DISCUSSION] Dataclasses #2698

Bibo-Joshi opened this issue Oct 4, 2021 · 14 comments
Milestone

Comments

@Bibo-Joshi
Copy link
Member

Bibo-Joshi commented Oct 4, 2021

The idea to use dataclasses has been discussed a number of times (see #2056), more recently in #2491 and there is also a connection to #2607. So here I am opening an issue to write down some thoughts.

This is rather for from being my final judgement on the matter and I'm aware that it will be a while until we properly discuss this. So more of a reminder for future-me.

Motivation

We have a lot of boilerplate code. especially the inits of classes like Message take up a lot of place, along with __slots__ and de/to_json.
There are ways to hide those away by using 3rd pary libraries (that inspired dataclasses) or the built-in dataclasses. Here is what such approaches can possibly do for us:

  • removing boilerplate inits
  • removing boilerplate slots
  • Solving big parts of [FEATURE/DISCUSSION] Representations of TG objects #2491 without any work at all
  • Making TelegramObjects immutable. I'm really not sure if we want that, but it's at least an interesting thought given that incoming updates represent a static information and changing them doesn't make sense at all.

Edits:

  • inspired by Remove redundant setters #2747 dataclasses could also be used for ext.Defaults and CallbackContext, especially making them immutable
  • dataclasses would (as side effect) solve large parts of problem with types in python-telegram-bot package microsoft/pylance-release#1983
  • with dataclasses, we could get rid of the __new__ workaround in TelegramObject: The reason for that was that we didn't want to have to call super().__init__ in all the subclasses. if TO is a dataclass as well, then the auto-generated inits of the subclasses would automatically do that for us

A word on 3rd party libs

A number of 3rd party libraries that offer similar functionality as dataclasses, most notably attrs and pydantic but also some others. I know that we try to keep our dependencies at a minimum and I'm fully backing that mentality. Still, let me list some pros & cons that I see of possible usage of attrs & pydantic - compared to using dataclasses

  • ➕ Automated handling of (nested!) to/de_json logic. pydantic has this built-in, attrs can do this together with the cattrs library
  • ➕/➖ Data validation. pydantic is heavily inspired by that. This means that e.g. self.text = text is replaced by self.text = str(text) in pydantic, but apparently also slows it down
  • attrs comes with support for slots, which dataclass only adds in py3.10. IISC pydantic does notautomatically
  • ➖ IISC neither pydantic nor attrs can do both extra arguments and slots
  • ➖IDE- & type-checker integration may be non-optimal (pydantice provides plugins for mypy and PyCharm)
  • ➖ They are dependencies. duh. If your own boilerplate is fasly, you can change it. If the dependency has a bug, then good night.
  • ➕/➖ Handling of extra arguments - this is actually really important, also for [Discussion] kwargs for input-only classes (like InputMedia*) #2607 and I'll go into detail on that later. TL;DR: pydantic can dismiss additional kwargs that are passed to init, e.g. Chat(…, new_argument=42) will work. attrs can't, IISC. dataclass can't either

But let's focus on the last bullet point. If we can't handle that properly, then there is no use for attrs or dataclass at all

Dismissal of additional kwargs

Currently there is no way to make dataclass add **kwargs to the auto-generated __init__.
But what if we approach this problem differently? Instead of passing the additional kwargs, we can filter them out before. Ofc this is more expensive, but

  • the chance of this happening is low (new arguments are not added too aften & we're usually lightning fast on API updates *cough*)
  • similar to the discussion Refactor kwargs handling in Bot methods #1924, by removing **kwargs we no longer just swallow typos in arguments etc. Admittedly, this benefit is mall, since classes like Message, Chat etc are usually not manually instantiated.

Such an approach should work with both attrs and dataclass

One thing to keep in mind is that we currently pass bot to all classes, so a lot of them always have one additional kwarg. This should be easily solvable by different means. E.g. one could

  • add bot to all signatures
  • add a class variable HAS_SHORTCUTS that tells us whether or not we need to pass bot
  • add a method TelegramObject.set_bot to remove bot from all signatures

So here's a small comparison of the two approaches:

import inspect
import timeit


class InspectDeJson:
    _init_params = None

    def __init__(self, arg1, arg2, arg3, arg4, arg5, arg6, arg7):
        self.arg1 = arg1
        self.arg2 = arg2
        self.arg3 = arg3
        self.arg4 = arg4
        self.arg5 = arg5
        self.arg6 = arg6
        self.arg7 = arg7

    @classmethod
    def de_json(cls, data):
        if cls._init_params is None:
            signature = inspect.signature(cls)
            cls._init_params = signature.parameters.keys()

        # try-except is significantly faster in case we already have a correct argument set
        try:
            return cls(**data)
        except TypeError as exc:
            if str(exc).startswith('__init__() got an unexpected keyword argument'):
                return cls(**{k: v for k, v in data.items() if k in cls._init_params})
            else:
                raise exc


class KwargsDeJson:

    def __init__(self, arg1, arg2, arg3, arg4, arg5, arg6, arg7, **kwargs):
        self.arg1 = arg1
        self.arg2 = arg2
        self.arg3 = arg3
        self.arg4 = arg4
        self.arg5 = arg5
        self.arg6 = arg6
        self.arg7 = arg7

    @classmethod
    def de_json(cls, data):
        return cls(**data)


correct_data = {
        'arg1': 'arg1',
        'arg2': 'arg2',
        'arg3': 'arg3',
        'arg4': 'arg4',
        'arg5': 'arg5',
        'arg6': 'arg6',
        'arg7': 'arg7',
}
overfull_data = correct_data.copy()
overfull_data.update({str(i): i for i in range(5)})

number = int(5*10e6)

print('Passing correct data')
a = timeit.timeit('InspectDeJson.de_json(correct_data)', globals=globals(), number=number)
b = timeit.timeit('KwargsDeJson.de_json(correct_data)', globals=globals(), number=number)
print(f'InspectDeJson took {a/number} seconds on average')
print(f'KwargsDeJson took {b/number} seconds on average')
print(f'InspectDeJson took {a*100/b} % of KwargsDeJsons runtime')

print('\n' + 20*"=" + '\n')


print('Passing too much data')
a = timeit.timeit('InspectDeJson.de_json(overfull_data)', globals=globals(), number=number)
b = timeit.timeit('KwargsDeJson.de_json(overfull_data)', globals=globals(), number=number)
print(f'InspectDeJson took {a/number} seconds on average')
print(f'KwargsDeJson took {b/number} seconds on average')
print(f'InspectDeJson took {a*100/b} % of KwargsDeJsons runtime')

Example output on my machine:

InspectDeJson took 9.67620756e-07 seconds on average
KwargsDeJson took 9.222072120000002e-07 seconds on average
InspectDeJson took 104.92444034367406 % of KwargsDeJsons runtime

====================

Passing too much data
InspectDeJson took 4.28629354e-06 seconds on average
KwargsDeJson took 1.8550962000000004e-06 seconds on average
InspectDeJson took 231.05505471899508 % of KwargsDeJsons runtime

So in the "usual" case the overhead is neglectible IMO and the usage of inspect-dark magic is not too extensive.
In the cases where we have unhandled data, the slow down is heavily notiable - but we're still talking 4 vs 2 micro seconds and as explained above, this shouldn't happen too often.

Summary

For today, much punchlines are:

  • Dismissal of additional kwargs can be handled in a different way, that works reasonably well IMO
  • Using 3rd party lib could reduce some more boilerplate code, specifically to/de_json, but neither pydantic nor attrs has all the features that I would like. Neither does dataclass, but at least that's included batteries

PS:

PyCharm apparently still has no proper support for dataclasses - not that this would stop us, just mentioning it …

@eldbud
Copy link
Contributor

eldbud commented Oct 5, 2021

i actually encountered similar issue in the branch i am currently working on: because media object share many common attributes, but without any real hierarchy, i created mixin classes for attributes, and incorporated them into the media classes. but then of course i cannot leave __slots__ empty if i ever want to ```to_dict``, but i can neither populate it due to mro conflicts that arise upon multiple inheritance.
So it kinda brings me back to creating boilerplate attributes in each of those media classes.

@Bibo-Joshi
Copy link
Member Author

TBH I can't recall discussing mixins in #2573. Is this something additional that you experimented with? If you want to discuss it further, we should probably do that on #2573 for now :)

@Poolitzer
Copy link
Member

Making TelegramObjects immutable. I'm really not sure if we want that, but it's at least an interesting thought given that incoming updates represent a static information and changing them doesn't make sense at all.

Hmm, I can see someone writing custom logic to add attributes to Updates. Ofc they could/maybe they should write it for context, but if they have an old project, making the objects immutable will break their flow. We could maybe make a channel post about that?

And btw, if someone uses our library just for the objects and does something with them, would we break their code as well? Differently speaking, can you change the objects easily by e.g. inheriting from them, or before you innit those? If not, I would very much vote for asking the community if making TelegramObjects breaks their flow.

@Bibo-Joshi
Copy link
Member Author

True, it would be a major breaking change. IMO we should at least discuss it for v14, because if we decide we would like this (as maintainers, or after asking the community - nice idea btw!), and we don't put it in v14, it will be a long while until it happens :)

@Poolitzer
Copy link
Member

Poolitzer commented Oct 19, 2021

sure. maybe put it after/before async if that makes a change for you?

@Bibo-Joshi
Copy link
Member Author

Don't think it will make a big difference …

@Bibo-Joshi
Copy link
Member Author

Bibo-Joshi commented Feb 2, 2022

Another thought came up in the user group: When TG adds new attributes to a class, those are currently not accessible in PTB until we make a new release. one could handle that similarly to the api_kwargs argument of the bot methods, i.e. gather all arguments that we don't handle individually in an attribute TelegramObject.api_kwargs . that should easily work with both the **kwargs approach or the dataclass + inspect approach

PS: One would need to think about whether or not the api_kwargs should be included into to_json. maybe one could even add a parameter for that to to_json

PPS: Updated MWE with api_kwargs, also comparing the runtime with pydantic - showing that the inspect approach isn't so bad after all :D

Click to expand
import inspect
import timeit

from pydantic import Extra, BaseModel


class InspectDeJson:
    _init_params = None

    def __init__(self, arg1, arg2, arg3, arg4, arg5, arg6, arg7, api_kwargs=None):
        self.arg1 = arg1
        self.arg2 = arg2
        self.arg3 = arg3
        self.arg4 = arg4
        self.arg5 = arg5
        self.arg6 = arg6
        self.arg7 = arg7
        self.api_kwargs = api_kwargs

    @classmethod
    def de_json(cls, data):
        # try-except is significantly faster in case we already have a correct argument set
        try:
            return cls(**data)
        except TypeError as exc:
            if str(exc).startswith('__init__() got an unexpected keyword argument'):
                if cls._init_params is None:
                    signature = inspect.signature(cls)
                    cls._init_params = signature.parameters.keys()

                allowed, api_kwargs = {}, {}
                for param, value in data.items():
                    if param in cls._init_params:
                        allowed[param] = value
                    else:
                        api_kwargs[param] = value
                return cls(**allowed, api_kwargs=api_kwargs)
            else:
                raise exc


class PydanticDeJson(BaseModel):
    arg1: str
    arg2: str
    arg3: str
    arg4: str
    arg5: str
    arg6: str
    arg7: str

    class Config:
        extra = Extra.allow

    @classmethod
    def de_json(cls, data):
        return cls(**data)


class KwargsDeJson:

    def __init__(self, arg1, arg2, arg3, arg4, arg5, arg6, arg7, **kwargs):
        self.arg1 = arg1
        self.arg2 = arg2
        self.arg3 = arg3
        self.arg4 = arg4
        self.arg5 = arg5
        self.arg6 = arg6
        self.arg7 = arg7

    @classmethod
    def de_json(cls, data):
        return cls(**data)


correct_data = {
        'arg1': 'arg1',
        'arg2': 'arg2',
        'arg3': 'arg3',
        'arg4': 'arg4',
        'arg5': 'arg5',
        'arg6': 'arg6',
        'arg7': 'arg7',
}
overfull_data = correct_data.copy()
overfull_data.update({str(i): i for i in range(5)})

number = int(5*10e3)

print('Passing correct data')
a = timeit.timeit('InspectDeJson.de_json(correct_data)', globals=globals(), number=number)
b = timeit.timeit('KwargsDeJson.de_json(correct_data)', globals=globals(), number=number)
c = timeit.timeit('PydanticDeJson.de_json(correct_data)', globals=globals(), number=number)
print(f'InspectDeJson took {a/number} seconds on average')
print(f'KwargsDeJson took {b/number} seconds on average')
print(f'PydanticDeJson took {c/number} seconds on average')
print(f'InspectDeJson took {a*100/b} % of KwargsDeJsons runtime')
print(f'PydanticDeJson took {c*100/b} % of KwargsDeJsons runtime')

print('\n' + 20*"=" + '\n')


print('Passing too much data')
a = timeit.timeit('InspectDeJson.de_json(overfull_data)', globals=globals(), number=number)
b = timeit.timeit('KwargsDeJson.de_json(overfull_data)', globals=globals(), number=number)
c = timeit.timeit('PydanticDeJson.de_json(overfull_data)', globals=globals(), number=number)
print(f'InspectDeJson took {a/number} seconds on average')
print(f'KwargsDeJson took {b/number} seconds on average')
print(f'PydanticDeJson took {c/number} seconds on average')
print(f'InspectDeJson took {a*100/b} % of KwargsDeJsons runtime')
print(f'PydanticDeJson took {c*100/b} % of KwargsDeJsons runtime')

Output on my machine:

Passing correct data
InspectDeJson took 1.635704e-06 seconds on average
KwargsDeJson took 1.9409600000000003e-06 seconds on average
PydanticDeJson took 3.8132918e-05 seconds on average
InspectDeJson took 84.27293710328908 % of KwargsDeJsons runtime
PydanticDeJson took 1964.6421358503005 % of KwargsDeJsons runtime

====================

Passing too much data
InspectDeJson took 9.215419999999996e-06 seconds on average
KwargsDeJson took 3.0822979999999946e-06 seconds on average
PydanticDeJson took 3.994753999999999e-05 seconds on average
InspectDeJson took 298.97887874566356 % of KwargsDeJsons runtime
PydanticDeJson took 1296.0310781112034 % of KwargsDeJsons runtime

Copy link
Member

tsnoam commented Feb 4, 2022

@Bibo-Joshi That's a nice idea but we should also mind the pitfall:
Let's say you persist an object with these api_kwargs.
Now let's say you upgrade ptb, restart the bot and try to load from persistence the persisted object.
How would you like loading from persistence to behave?

@Bibo-Joshi
Copy link
Member Author

How would you like loading from persistence to behave?

Some thoughts that we had on this in the dev chat:

  1. on __init__, we can do something like

    def __init__(self, arg, api_kwargs):
        if arg is None:
            arg = api_kwargs.get('arg', None)

    This assigns the additional api kwargs to the proper attributes while still having them in api_kwargs for compatibility

  2. We can make sure that the pickle behavior of TelegramObject does something similar (either by explicitly calling __init__ or by other means)

  3. We can't cover all persistence implementations, b/c not every serialization method uses pickle or calls __init__.

@Bibo-Joshi
Copy link
Member Author

Bibo-Joshi commented May 20, 2022

Some more thoughts about immutability:

  • if you inherit from a dataclass(frozen=True), there are two cases
    • if the child class is a dataclass as well, it must have frozen=True as well
    • if the child class is no dataclass, added attributes are not frozen. inherited attributes are frozen.
  • Nothing is really immutable in python. object.__setattr__(immutable_obj, 'attr', value) will set immutable_obj.attr to value.
  • if a class implements __hash__, two objects should have the same hash value if and only if they are equal in terms of __eq__. Moreover, the hash value of on object should not change. This also implies that mutable objects should not be hashable.
    • attributes of TelegramObject & subclasses are currently mutable, i.e. users can change them. These changes are reflected neither in __eq__ nor in __hash__.
    • We even break the "hash value should not change" rule with InlineKeyboardButton.update_callback_data. (Although this is probably only a minor violation as the button is exposed to the user only after the one-time change of the hash value)
    • We currently rely on hashability of TelegramObject e.g. in Message.parse_entities which returns a dict with MessageEntity objects as keys
    • We use lists as pythons default equivalent of json arrays. lists are mutable - in contrast to tuples. If we would like to achieve immutability, we should consider switching to tuples.
  • Immutability has two different aspects: For classes like Messages, which a user will never create, it makes sense, IMO. For classes like BotCommandScope*, which are only created by the user and then passed to TG, it's easier to think of use cases, where mutability may be a small advantage.

@Bibo-Joshi Bibo-Joshi modified the milestones: v20.0a1, v20.0 May 21, 2022
@harshil21
Copy link
Member

Also just wanted to say since py 3.11, there is zero cost exception handling, so another plus point for the inspect approach.

And note that there is apparently a small performance penalty when using frozen instances. Can we benchmark this and see how much it really is for our case?

@Bibo-Joshi
Copy link
Member Author

Yesterday I tried to get started on a proof of concept. And I almost immediately ran into two major problems, which frankly I'm embarrased to not have seen so far. Both are related to optional arguments.

Slots

Not only do dataclasses bring built-in support for slots only on python 3.10+, but it's actually not possible to directly use them manually on 3.9- with default arguments:

@dataclass
class Test:
    __slots__ = ("arg",)

    arg: str = "arg"

will raise an exception. Actually this is not due to dataclasses directly (removing @dataclass still raises) but due to __slots__ inserting stuff into the class namespace. See e.g. https://stackoverflow.com/questions/50180735 for more input. We actually face this very problem currently in TelegramObject where we work around this with a custom __new__ method. Unfortunately, this can't help us with dataclasses.

Inheritance

The second problem appears when we want to inherit from a dataclass with optional arguments:

@dataclass
class Parent:
    arg: str = "arg"


@dataclass
class Child(Parent):
    barg: str

now dataclass will build an __init__ for Child which looks like __init__(self, arg: str = "arg", barg: str) - and that's not allowed.
Most of our classes inherit directly from TelegramObject, which doesn't have any arguments at all, so this might sound not so important. However, this becomes relevant when adding an optional argument api_kwargs to TO as mentioned a few comments up. A straightforward workaround is to shift this from TO into all the inheriting classes. Note nice, but does the trick. We face more severe problem when considering InputMedia, which does have optional arguments and multiple child classes. Moving the default arguments would mean to rob InputMedia of these arguments, which strictly speaking is more in line with the API (where InputMedia doesn't have any attributes at all) but still feels wrong to me. Note that this also affects internal helpers like _BaseMedium.
In python 3.10, one can mark dataclass arguments as "kw only" which somehow works around this problem because __init__(self, *, arg: str = "arg", barg: str) is a valid signature. IMO it still doesn't really address the issue.
I'm not sure why dataclasses don't just create an __init__ of the form __init__(self, pos_args_of_parent, pos_args_of_child, kw_args_of_parent, kw_args_of_child) but knowing the reasons probably wouldn't change much.


So far these are my observations and I'm rather bummed about that. I haven't gotten far in terms of thinking about our options, yet …

@Bibo-Joshi
Copy link
Member Author

A few more insights:

pydantic

  • uses kword-only signatures, which alleviates the problem with optional arguments just like dataclasses do in py3.10+
  • neither __new__ nor __init__ have an explicit signature. inspact.signature still is able to derive a signutare for the class. IISC that's done by some rather deep python trickery.

attrs

  • __init__ has an explicit signature
  • this also means that one has the same problem with optional arguments again, which can only be resolved by making the arguments keyword-only

Big picture

This issue came to be because at least #2491 and #2607 looked like they could be resolved easily by using dataclasses and because they looked promising in terms of saving us some boilerplate code. But in the end, they apparently don't fit our requirements and using them just for the sake of it doesn't make sense. 3rd party libs also can't really help (see above) and we're no fans of extensive 3rd party use anyway.

So let's evaluate what our options are to implement the things that we were considering in the context of dataclasses, without dataclasses.

api_kwargs

This is basically independent of dataclasses. The ideas outlined in this thread can readily be applied in a boilerplate setting

__slots__

AFAIK slots need to be added at class creation. So if we'd want to automate this, we'd have to e.g. build a class decorator that returns a new, slotted class. I'll wager developing something like this adds nothing to the library except bugs and maintenance work.

type annotations

context: #3008, microsoft/pylance-release#1983, i.e. pylance expects us to do

def __init__(self, arg: str):
    self.arg: str = arg # not the ": str"

if we have to touch all attributes anyway for some reason, then IMO we could add this, otherwise I'm perfectly fine with keeping as is.

Immutability

Originally this was a "dataclasses comes with support for this, so we could consider doing this", but in the recent discussion, there was enough interest from the dev team for this direction to also consider it without dataclasses.
I guess there are a few options here. From a first glance I like the descriptor thingy best. But benchmarks and other comparisons are probably good and there are surely also other options.

@property

Deas simple, lots of boilerplate:

class Message:
    def __init__(self, message_id: int):
        self._message_id = message_id
    
    @property
    def message_id(self):
        return self._message_id

Descriptors

Properties are descriptors, so one should be able to cook up something like

Deas simple, lots of boilerplate:

class Message:
    def __init__(self, message_id: int):
        self.message_id= ReadOnlyDescriptor(message_id)

__setitem__

Something along the lines

class Message:
    def __init__(self, message_id: int):
        self.message_id = message_id
    
    def __setitem__(self, name, value):
        if not name.startswith('_'):
            raise RuntimeError
        # otherwise set attribute

This might have some side effects on the implementation that I don't see just yet … Would have to investigate more.

@Bibo-Joshi
Copy link
Member Author

Oh, I forgot:

__str__ and __repr__

Recreating the full pprint functionality (indent, line cut-off etc) is probably out of scope, but doing a simple Message(message_id=1, from=User(…), …) should be doable, IMO.

@github-actions github-actions bot locked and limited conversation to collaborators Dec 22, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants