Skip to content

Optimize calls to get_type_ref #10863

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Jan 15, 2025
Merged

Optimize calls to get_type_ref #10863

merged 12 commits into from
Jan 15, 2025

Conversation

Viicos
Copy link
Member

@Viicos Viicos commented Nov 17, 2024

Indirectly closes #9443.

What we're trying to achieve and how

get_type_ref is pretty expensive, and we'll try to reduce the amount of calls to this function.

Currently, for every call to GenerateSchema.generate_schema, we unconditionally end up with the following call chain: _generate_schema_from_property ->get_schema_or_ref -> get_type_ref.

The idea is thus to remove this part (L750-752):

def _generate_schema_from_property(self, obj: Any, source: Any) -> core_schema.CoreSchema | None:
"""Try to generate schema from either the `__get_pydantic_core_schema__` function or
`__pydantic_core_schema__` property.
Note: `__get_pydantic_core_schema__` takes priority so it can
decide whether to use a `__pydantic_core_schema__` attribute, or generate a fresh schema.
"""
# avoid calling `__get_pydantic_core_schema__` if we've already visited this object
if _typing_extra.is_self(obj):
obj = self._resolve_self_type(obj)
with self.defs.get_schema_or_ref(obj) as (_, maybe_schema):
if maybe_schema is not None:
return maybe_schema

So that any GenerateSchema method responsible for generating a core schema for a referenceable type handle it. For example, _model_schema, which already has this check:

def _model_schema(self, cls: type[BaseModel]) -> core_schema.CoreSchema:
"""Generate schema for a Pydantic model."""
with self.defs.get_schema_or_ref(cls) as (model_ref, maybe_schema):
if maybe_schema is not None:
return maybe_schema

Note that the if maybe_schema is not None branch is never reached (same for all the other referenceable types such as dataclasses, typeddict, etc) because the branch we're trying to remove handles it already.

So surely removing the branch from _generate_schema_from_property should work.

Challenges

However, this breaks for Pydantic models:

Let's say you have the following:

class Sub(BaseModel): pass

class Model(BaseModel):
    sub1: Sub
    sub2: Sub

When we generate a schema for sub2, we follow the call chain generate_schema -> _generate_schema_from_property. If we don't check for get_schema_or_ref, we would want it to be checked in _model_schema as shown above. However, the rest of the _generate_schema_from_property implementation will check for __get_pydantic_core_schema__(), and the __pydantic_core_schema__ property. In our example, Sub is already in the definitions thanks to sub1, so it would break by returning the complete core schema of Sub 1 instead of a definition reference schema (produced by get_schema_or_ref).

So let's try to push things a bit further. Why not move the __get_pydantic_core_schema__()/__pydantic_core_schema__ logic to the _model_schema method? (same for dataclasses).

The issue is that while it would be fine for the __pydantic_core_schema__ property (only relevant for Pydantic models and dataclasses), the __get_pydantic_core_schema__() method can be set on pretty much anything.

Solution

That's why I had to remove the __get_pydantic_core_schema__ method from BaseModel. The current implementation is really only returning the cached __pydantic_core_schema__ property or delegating the generation to the GenerateSchema class.

During model construction, we were doing something even more confusing:

handler = CallbackGetCoreSchemaHandler(
partial(gen_schema.generate_schema, from_dunder_get_core_schema=False),
gen_schema,
ref_mode='unpack',
)
try:
schema = cls.__get_pydantic_core_schema__(cls, handler)

The following call chain happens: cls.__get_pydantic_core_schema__(cls, handler) -> handler(cls) ~ GenerateSchema().generate_schema(cls, from_dunder_get_core_schema=False).

I think having from_dunder_get_core_schema to avoid an infinite loop is a sign that the current implementation is quite convoluted.

Instead, by:

  • removing the BaseModel.__get_pydantic_core_schema__ method, and moving the check for the __pydantic_core_schema__ attribute to GenerateSchema._model_schema
  • Only checking for a __get_pydantic_core_schema__ method (and not the property) when calling GenerateSchema.generate_schema (which isn't an issue for Pydantic models as we removed it).

We:

  • allow for a more consistent approach when it comes to schema generation of models. We can now call GenerateSchema().generate_schema(some_model_cls), and the call chain is way simpler: generate_schema -> ... -> _model_schema.
  • move the logic to potentially call get_schema_or_ref (and check for the __pydantic_core_schema__ property) to the responsible methods: _model_schema, _dataclass_schema, etc.

Breaking changes concerns

We had a couple reports from people overriding BaseModel.__get_pydantic_core_schema__:

Feels like doing so always led to unexpected behavior, and a hacky fix. That's why I'm in support of having this pattern for BaseModel removed. It might break existing code, but we are probably helping the users in avoiding unexpected issues in the future.


We could try to still support the existing BaseModel.__get_pydantic_core_schema__ method, but honestly it's been quite hard to implement this already, and it might be that it's impossible to have both this perf improvement and support for the existing behavior at the same time

Footnotes

  1. By calling Sub.__get_pydantic_core_schema__(), which would return it from the cached attribute. Confusing! Adds an extra level of indirection.

@github-actions github-actions bot added the relnotes-fix Used for bugfixes. label Nov 17, 2024
Copy link

cloudflare-workers-and-pages bot commented Nov 17, 2024

Deploying pydantic-docs with  Cloudflare Pages  Cloudflare Pages

Latest commit: 000e60e
Status: ✅  Deploy successful!
Preview URL: https://286f9f34.pydantic-docs.pages.dev
Branch Preview URL: https://type-ref-perf.pydantic-docs.pages.dev

View logs

Copy link

codspeed-hq bot commented Nov 17, 2024

CodSpeed Performance Report

Merging #10863 will improve performances by 30.11%

Comparing type-ref-perf (000e60e) with main (c371bd1)

Summary

⚡ 14 improvements
✅ 31 untouched benchmarks

Benchmarks breakdown

Benchmark main type-ref-perf Change
test_schema_build 3.3 ms 3 ms +9.69%
test_fastapi_startup_perf 243 ms 214.7 ms +13.21%
test_fastapi_startup_perf 29 ms 27.5 ms +5.29%
test_complex_model_schema_generation 2.5 ms 1.9 ms +30.11%
test_lots_of_models_with_lots_of_fields 3.1 s 2.8 s +10.95%
test_recursive_model_schema_generation 1.2 ms 1.1 ms +11.62%
test_simple_model_schema_generation 854.8 µs 804.5 µs +6.25%
test_simple_model_schema_lots_of_fields_generation 38.1 ms 29.7 ms +28.36%
test_tagged_union_with_callable_discriminator_schema_generation 1.8 ms 1.6 ms +9.27%
test_deeply_nested_recursive_model_schema_generation 1.5 ms 1.4 ms +6.48%
test_generic_recursive_model_schema_generation 1,049.3 µs 953.6 µs +10.04%
test_nested_recursive_generic_model_schema_generation 2.1 ms 1.8 ms +15.54%
test_nested_recursive_model_schema_generation 2.1 ms 1.9 ms +8.78%
test_recursive_discriminated_union_with_base_model 1.9 ms 1.8 ms +8.11%

@sydney-runkle
Copy link
Contributor

I don't think it is a viable option to remove support for __get_pydantic_core_schema__. That's pretty breaking. Happy to see what other folks think as well.

@@ -671,30 +671,6 @@ def model_validate_strings(
__tracebackhide__ = True
return cls.__pydantic_validator__.validate_strings(obj, strict=strict, context=context)

@classmethod
def __get_pydantic_core_schema__(cls, source: type[BaseModel], handler: GetCoreSchemaHandler, /) -> CoreSchema:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think this is a good idea. This is definitely being relied on. Its not internal and also documented for customizing the schema.

Copy link
Contributor

@MarkusSintonen MarkusSintonen Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the perf improvement could be maybe done so that the __get_pydantic_core_schema__ can not be overriden but instead coming from model_config (with similar args). Users then have option to migrate into that. Then if its defined in model config the perf improvements are not applied. (Didnt fully follow why its causing issues here but maybe possible)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, this doesn't remove support for __get_pydantic_core_schema__. This is still handled by the added _generate_schema_from_get_schema_method method.

Users are still free to implement __get_pydantic_core_schema__ on Pydantic models, however we provide no guarantees (and never did, see the breaking changes concerns section in my original post) it will work flawlessly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah interesting. Will give this another review then. Didn't realize users could still implement support. Nice!

@sydney-runkle
Copy link
Contributor

Ah, I jumped the gun here on my review. Reread your new description, excited to look again.

Copy link
Contributor

@sydney-runkle sydney-runkle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, I'm in favor of this change, after better understanding the scope.

Comment on lines -793 to -873
if is_function_with_inner_schema(schema):
ref = schema['schema'].pop('ref', None) # pyright: ignore[reportCallIssue, reportArgumentType]
if ref:
schema['ref'] = ref
else:
ref = get_ref(schema)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we lost this logic - is this needed anywhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is related to what I mentioned:

the first one I could find: #7102. The #7102 (comment) isn't really compelling, the user was doing things not the intended way. It also led to a scary and hacky fix that I would really like not having.

These removed lines had the effect of moving the ref from the inner schema of a function-* schema to the the function-* schema itself. With the following simplified test code added in the mentioned issue above:

class Numeric(BaseModel):
    value: float

    @classmethod
    def __get_pydantic_core_schema__(cls, source_type, handler):
        return core_schema.no_info_before_validator_function(cls.validate, handler(source_type))

    @classmethod
    def validate(cls, v):
        ...

class OuterModel(BaseModel):
    x: Numeric
    y: Numeric

On main, the schema of OuterModel would look like:

{
│   'type': 'definitions',
│   'schema': {
│   │   'type': 'model',
│   │   'cls': <class '__main__.OuterModel'>,
│   │   'schema': {
│   │   │   'type': 'model-fields',
│   │   │   'fields': {
│   │   │   │   'x': {'type': 'model-field', 'schema': {'type': 'definition-ref', 'schema_ref': '__main__.Numeric:109238506193520'}, 'metadata': {}},
│   │   │   │   'y': {'type': 'model-field', 'schema': {'type': 'definition-ref', 'schema_ref': '__main__.Numeric:109238506193520'}, 'metadata': {}}
│   │   │   },
│   │   │   'model_name': 'OuterModel',
│   │   │   'computed_fields': []
│   │   },
│   │   'config': {'title': 'OuterModel'},
│   │   'ref': '__main__.OuterModel:109238503123056',
│   │   'metadata': {'<stripped>'}
│   },
│   'definitions': [
│   │   {
│   │   │   'function': {'type': 'no-info', 'function': <bound method Numeric.validate of <class '__main__.Numeric'>>},
│   │   │   'schema': {
│   │   │   │   'type': 'model',
│   │   │   │   'cls': <class '__main__.Numeric'>,
│   │   │   │   'schema': {
│   │   │   │   │   'type': 'model-fields',
│   │   │   │   │   'fields': {'value': {'type': 'model-field', 'schema': {'type': 'float'}, 'metadata': {}}},
│   │   │   │   │   'model_name': 'Numeric',
│   │   │   │   │   'computed_fields': []
│   │   │   │   },
│   │   │   │   'config': {'title': 'Numeric'}
│   │   │   },
│   │   │   'ref': '__main__.Numeric:109238506193520',
│   │   │   'metadata': {'<stripped>'},
│   │   │   'type': 'function-before'
│   │   }
│   ]
}

On this PR, it looks like:

{
│   'type': 'definitions',
│   'schema': {
│   │   'type': 'model',
│   │   'cls': <class '__main__.OuterModel'>,
│   │   'schema': {
│   │   │   'type': 'model-fields',
│   │   │   'fields': {
│   │   │   │   'x': {
│   │   │   │   │   'type': 'model-field',
│   │   │   │   │   'schema': {
│   │   │   │   │   │   'function': {'type': 'no-info', 'function': <bound method Numeric.validate of <class '__main__.Numeric'>>},
│   │   │   │   │   │   'schema': {
│   │   │   │   │   │   │   'function': {'type': 'no-info', 'function': <bound method Numeric.validate of <class '__main__.Numeric'>>},
│   │   │   │   │   │   │   'schema': {'type': 'definition-ref', 'schema_ref': '__main__.Numeric:105945921898336'},
│   │   │   │   │   │   │   'metadata': {'<stripped>'},
│   │   │   │   │   │   │   'type': 'function-before'
│   │   │   │   │   │   },
│   │   │   │   │   │   'metadata': {'<stripped>'},
│   │   │   │   │   │   'type': 'function-before'
│   │   │   │   │   },
│   │   │   │   │   'metadata': {}
│   │   │   │   },
│   │   │   │   'y': {
│   │   │   │   │   'type': 'model-field',
│   │   │   │   │   'schema': {
│   │   │   │   │   │   'function': {'type': 'no-info', 'function': <bound method Numeric.validate of <class '__main__.Numeric'>>},
│   │   │   │   │   │   'schema': {
│   │   │   │   │   │   │   'function': {'type': 'no-info', 'function': <bound method Numeric.validate of <class '__main__.Numeric'>>},
│   │   │   │   │   │   │   'schema': {'type': 'definition-ref', 'schema_ref': '__main__.Numeric:105945921898336'},
│   │   │   │   │   │   │   'metadata': {'<stripped>'},
│   │   │   │   │   │   │   'type': 'function-before'
│   │   │   │   │   │   },
│   │   │   │   │   │   'metadata': {'<stripped>'},
│   │   │   │   │   │   'type': 'function-before'
│   │   │   │   │   },
│   │   │   │   │   'metadata': {}
│   │   │   │   }
│   │   │   },
│   │   │   'model_name': 'OuterModel',
│   │   │   'computed_fields': []
│   │   },
│   │   'config': {'title': 'OuterModel'},
│   │   'ref': '__main__.OuterModel:105945922827312',
│   │   'metadata': {'<stripped>'}
│   },
│   'definitions': [
│   │   {
│   │   │   'type': 'model',
│   │   │   'cls': <class '__main__.Numeric'>,
│   │   │   'schema': {
│   │   │   │   'type': 'model-fields',
│   │   │   │   'fields': {'value': {'type': 'model-field', 'schema': {'type': 'float'}, 'metadata': {}}},
│   │   │   │   'model_name': 'Numeric',
│   │   │   │   'computed_fields': []
│   │   │   },
│   │   │   'config': {'title': 'Numeric'},
│   │   │   'ref': '__main__.Numeric:105945921898336'
│   │   }
│   ]
}

Essentially, the difference in these two schemas is that we don't "move" the ref from the inner schema to the function-before schemas.

The changes in this PR + removing this reference moving coincidentally make it work still.

However, doing so was a dangerous game: on L793, schema directly comes from another model. The pop calls removes the reference to the schema, and mutating schemas from other models has been a known issue.

You may be wondering: why this doesn't break things in the example I gave? Surely the pop call should have mutated the core schema of Numeric. Turns out it doesn't, because Numeric.__get_pydantic_core_schema__ does not cache the schema, so calling it will generate a new one every time (and this is what happens during the schema gen of OuterModel). But on a similar issue, I mentioned that explicitly caching the core schema in the __get_pydantic_core_schema__ method would resolve the user issue (as the use case was slightly different)!

So to conclude, overriding BaseModel.__get_pydantic_core_schema__ is full of unexpected behaviors, but that's fine as officially supporting them would be a huge pain.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to break this change out into a different PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iirc (but I'm not sure), I was able to remove it only thanks to the other changes. This won't clutter the git diff though, because it's just a removal. Probably by having a proper commit description when merging, I can add a note about this?

@sydney-runkle
Copy link
Contributor

I've removed the example I added here to address #7833. It's no longer relevant, as there's no __get_pydantic_core_schema__ on super, so things go as planned without this workaround.

cc @chbndrhnns, you might find this interesting. This simplification now works:

from typing import Type

from pydantic_core import CoreSchema
from typing_extensions import Annotated

from pydantic import BaseModel, GetCoreSchemaHandler


class Metadata(BaseModel):
    foo: str = 'metadata!'
    bar: int = 100


class Model(BaseModel):
    state: Annotated[int, Metadata()]


m = Model.model_validate({'state': 2})
print(repr(m))
#> Model(state=2)
print(m.model_fields)
"""
{
    'state': FieldInfo(
        annotation=int,
        required=True,
        metadata=[Metadata(foo='metadata!', bar=100)],
    )
}
"""

Copy link
Contributor

github-actions bot commented Nov 20, 2024

Coverage report

Click to see where and how coverage changed

FileStatementsMissingCoverageCoverage
(new stmts)
Lines missing
  pydantic
  main.py 692
  pydantic/_internal
  _core_utils.py
  _dataclasses.py
  _generate_schema.py 856
  _model_construction.py
Project Total  

This report was generated by python-coverage-comment-action

@sydney-runkle sydney-runkle marked this pull request as ready for review November 20, 2024 21:13
Comment on lines -989 to -973
As seen above, annotating a field with a `BaseModel` type can be used to modify or override the generated json schema.
However, if you want to take advantage of storing metadata via `Annotated`, but you don't want to override the generated JSON
schema, you can use the following approach with a no-op version of `__get_pydantic_core_schema__` implemented on the
metadata class:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does seem like a terrible example, but is ther ean alternative way to do what it is trying to do?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, see #10863 (comment)

You just don't have to use the nasty no-op, so it's even better!

@sydney-runkle
Copy link
Contributor

@Viicos, I'd love to chat about a few more nitty gritty details here tomorrow, but other than that, I think we can move forward with this. Great find!

@chbndrhnns
Copy link
Contributor

cc @chbndrhnns, you might find this interesting. This simplification now works:

Thanks for the ping, @sydney-runkle. This sounds exciting! I am waiting for the release now ;)

@sydney-runkle
Copy link
Contributor

Sure thing. We just released v2.10, and this won't fit in until v2.11, which will be a bit 😅

Feel free to install from main once we've merged this, if you want the change soon!

@Viicos
Copy link
Member Author

Viicos commented Dec 20, 2024

I went and tested this branch on a couple libraries (using the following GH search), and couldn't find any issues. This should imo be good to merge once approved.

@Viicos Viicos added the third-party-tests Add this label on a PR to trigger 3rd party tests label Dec 20, 2024
@sydney-runkle
Copy link
Contributor

Nice, tests passing! Let's add a few other integrations before we move forward with a merge here?

@sydney-runkle
Copy link
Contributor

@Viicos, I can review this again early next week once we have some more third party tests added?

@Viicos
Copy link
Member Author

Viicos commented Dec 30, 2024

@Viicos, I can review this again early next week once we have some more third party tests added?

Two integration tests were added since #10863 (comment), but overall it is pretty hard to see how this affects existing code, as most 3rd-party libraries are not defining a lot of custom types/__get_pydantic_core_schema__ implementations. Only code possibly affected seems ok (#10863 (comment))

@sydney-runkle
Copy link
Contributor

Fair enough, I'll review today then 👍

@sydney-runkle sydney-runkle added relnotes-change Used for changes to existing functionality which don't have a better categorization. relnotes-performance Used for performance improvements. deferred Deferred until future release or until something else gets done and removed relnotes-fix Used for bugfixes. labels Jan 8, 2025
@sydney-runkle sydney-runkle removed the deferred Deferred until future release or until something else gets done label Jan 14, 2025
Copy link
Contributor

@sydney-runkle sydney-runkle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great, a few follow up questions 👍

Comment on lines -793 to -873
if is_function_with_inner_schema(schema):
ref = schema['schema'].pop('ref', None) # pyright: ignore[reportCallIssue, reportArgumentType]
if ref:
schema['ref'] = ref
else:
ref = get_ref(schema)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to break this change out into a different PR?

Copy link
Contributor

@sydney-runkle sydney-runkle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's wait on merging 3.8, but otherwise LGTM :)

@chbndrhnns
Copy link
Contributor

This simplification now works:

@sydney-runkle
The no-op override seems still required with 2.11.5.
What were you referring to when saying it's no longer required?

FAILED                                             [100%]
tests/test_.py:16 (test_)
def test_():
>       m = Model.model_validate({'state': 2})
E       pydantic_core._pydantic_core.ValidationError: 1 validation error for Model
E       state
E         Input should be a valid dictionary or instance of Metadata [type=model_type, input_value=2, input_type=int]
E           For further information visit https://errors.pydantic.dev/2.11/v/model_type

tests/test_.py:18: ValidationError

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
relnotes-change Used for changes to existing functionality which don't have a better categorization. relnotes-performance Used for performance improvements. third-party-tests Add this label on a PR to trigger 3rd party tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pydantic dataclasses cannot be used as Annotated Validators
5 participants