-
-
Notifications
You must be signed in to change notification settings - Fork 2.2k
Refactor and optimize schema cleaning logic #11244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Deploying pydantic-docs with
|
Latest commit: |
76b7109
|
Status: | ✅ Deploy successful! |
Preview URL: | https://2c6bd658.pydantic-docs.pages.dev |
Branch Preview URL: | https://ref-schema-walking.pydantic-docs.pages.dev |
CodSpeed Performance ReportMerging #11244 will improve performances by 33.86%Comparing Summary
Benchmarks breakdown
|
a0cae6c
to
1fad4cf
Compare
dacb6f7
to
8631f3e
Compare
48dc1a7
to
818269d
Compare
I think we can mark this as closing #10655 |
As in, in a new commit on this PR, you're just refraining for now? |
A few questions:
I think it would be really good to loop in @adriangb here - he worked a lot with refs/defs in core schema gen, or so the blame says 😉. |
Alright, a few things before I dive into an in depth review:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally, seems promising. I've already voiced some concerns in independent comments that we can chat about.
I think it'd be helpful to see the deletion of those other methods for comparison purposes on this PR 👍
'$defs': {'MySeq_int_': {'items': {'type': 'integer'}, 'type': 'array'}}, | ||
'properties': {'my_int_seq': {'$ref': '#/$defs/MySeq_int_'}}, | ||
'$defs': {'MyIntSeq': {'items': {'type': 'integer'}, 'type': 'array'}}, | ||
'properties': {'my_int_seq': {'$ref': '#/$defs/MyIntSeq'}}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A note about this change, discussed in detail previously: #10655 (comment)
It it the same
If you have more that 1 reference to a) It does affect JSON Schema gen
Only failing tests are because of the match statement that I need to change.
yes sorry the PR description isn't complete yet. |
818269d
to
afa73b0
Compare
The issue is not just verbose schemas but also duplicate SchemaValidator's. |
I'll note that this is only an issue with the following setup: class StandaloneModel(BaseModel): ...
class M1(BaseModel):
m: StandaloneModel
# If StandaloneModel is referenced another time,
# M1's core schema is transformed in a `'definitions'` schema,
# and we don't have any duplication of StandaloneModel's cs.
class M2(BaseModel):
m: StandaloneModel
# Same note as M1
...
class Outer(BaseModel):
m: StandaloneModel
# Same note as M1 and M2
m1: M1
# If M1 is referenced another time,
# Outer's core schema is transformed in a `'definitions'` schema,
# and we don't have any duplication of M1's cs
m2: M2
# Same
...
mx: MX Which is relatively uncommon. Boxy's work will help here as well in reusing validator instances. JSON Schema changes is still an issue though (still valid, but the structure changes and this generally create churn in user code), so I'm looking into a way to have the previous behavior). |
b258a3e
to
fe0db49
Compare
Thanks for the clarification @Viicos. I'm less concerned now about this defs/refs simplification now that I understand it's semi-limited. That being said, I do think think we might want to revisit an extra pass here in the future if/when we can get significant memory usage improvements with Boxy's work (cc @davidhewitt). Happy to take another pass over this - could you remove the existing walk / discriminator logic so I can compare? Thanks for all of your work here and the detailed explanations of some complex patterns. |
dbced49
to
0321015
Compare
Coverage reportClick to see where and how coverage changed
This report was generated by python-coverage-comment-action |
return schema | ||
|
||
|
||
def _can_be_inlined(def_ref: core_schema.DefinitionReferenceSchema) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Difference from MS PR: instead of checking for specific keys in the metadata schema, we just check that there are no metadata at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we inline if there's metadata, ex json schema related metadata?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from typing import Annotated
from pydantic import BaseModel, WithJsonSchema
type Test = int
class Model(BaseModel):
f: Annotated[Test, WithJsonSchema({})]
Model.__pydantic_core_schema__
{
│ 'type': 'definitions',
│ 'schema': {
│ │ 'type': 'model',
│ │ 'cls': <class '__main__.Model'>,
│ │ 'schema': {
│ │ │ 'type': 'model-fields',
│ │ │ 'fields': {
│ │ │ │ 't1': {'type': 'model-field', 'schema': {'type': 'int'}, 'metadata': {}},
│ │ │ │ 't2': {
│ │ │ │ │ 'type': 'model-field',
│ │ │ │ │ 'schema': {'type': 'definition-ref', 'schema_ref': '__main__.Test:124259184101792', 'metadata': {'<stripped>'}},
│ │ │ │ │ 'metadata': {}
│ │ │ │ }
│ │ │ },
│ │ │ 'model_name': 'Model',
│ │ │ 'computed_fields': []
│ │ },
│ │ 'config': {'title': 'Model'},
│ │ 'ref': '__main__.Model:107106664271472',
│ │ 'metadata': {'<stripped>'}
│ },
│ 'definitions': [{'type': 'int', 'ref': '__main__.Test:124259184101792'}]
}
If you inline the definition ref, you loose the JSON Schema metadata. It could still be inlined and the metadata moved to the referenced schema, but you'll need to make a copy of it and merge metadata somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a note about this to the code?
|
||
|
||
@pytest.mark.parametrize('deep_ref', [False, True]) | ||
@pytest.mark.xfail( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Difference from MS PR due to the type refs changes (specifically the GPCS method change). The generated schemas are still valid, but in this case the output is different:
On this branch, the core schema for M1 is:
{
│ 'type': 'definitions',
│ 'schema': {
│ │ 'type': 'model',
│ │ 'cls': <class '__main__.M1'>,
│ │ 'schema': {
│ │ │ 'type': 'model-fields',
│ │ │ 'fields': {'a': {'type': 'model-field', 'schema': {'type': 'definition-ref', 'schema_ref': '__main__.M2:101691608428272'}, 'metadata': {}}},
│ │ │ 'model_name': 'M1',
│ │ │ 'computed_fields': []
│ │ },
│ │ 'config': {'title': 'M1'},
│ │ 'ref': '__main__.M1:101691610240000',
│ │ 'metadata': {'<stripped>'}
│ },
│ 'definitions': [
│ │ {
│ │ │ 'type': 'model',
│ │ │ 'cls': <class '__main__.M2'>,
│ │ │ 'schema': {
│ │ │ │ 'type': 'model-fields',
│ │ │ │ 'fields': {
│ │ │ │ │ 'b': {
│ │ │ │ │ │ 'type': 'model-field',
│ │ │ │ │ │ 'schema': {
│ │ │ │ │ │ │ 'type': 'model',
│ │ │ │ │ │ │ 'cls': <class '__main__.M1'>,
│ │ │ │ │ │ │ 'schema': {
│ │ │ │ │ │ │ │ 'type': 'model-fields',
│ │ │ │ │ │ │ │ 'fields': {
│ │ │ │ │ │ │ │ │ 'a': {
│ │ │ │ │ │ │ │ │ │ 'type': 'model-field',
│ │ │ │ │ │ │ │ │ │ 'schema': {'type': 'definition-ref', 'schema_ref': '__main__.M2:101691608428272'},
│ │ │ │ │ │ │ │ │ │ 'metadata': {}
│ │ │ │ │ │ │ │ │ }
│ │ │ │ │ │ │ │ },
│ │ │ │ │ │ │ │ 'model_name': 'M1',
│ │ │ │ │ │ │ │ 'computed_fields': []
│ │ │ │ │ │ │ },
│ │ │ │ │ │ │ 'config': {'title': 'M1'},
│ │ │ │ │ │ │ 'ref': '__main__.M1:101691610240000',
│ │ │ │ │ │ │ 'metadata': {'<stripped>'}
│ │ │ │ │ │ },
│ │ │ │ │ │ 'metadata': {}
│ │ │ │ │ }
│ │ │ │ },
│ │ │ │ 'model_name': 'M2',
│ │ │ │ 'computed_fields': []
│ │ │ },
│ │ │ 'config': {'title': 'M2'},
│ │ │ 'ref': '__main__.M2:101691608428272',
│ │ │ 'metadata': {'<stripped>'}
│ │ }
│ ]
}
As we can see, the core schema for M1
appears twice and could be inlined.
c1fffce
to
e093879
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, I feel like _generate_schema.py
could use a lot more documentation, given that schema gen + ref/def simplification is complicated by nature.
This constitutes a really significant improvement though, great work.
I like the new typed nature of the schema gathering results, etc. Easier to follow that logic as well! I'd like to take one more pass over _schema_gather.py
, but generally, really happy with this improvement.
return schema | ||
|
||
|
||
def _can_be_inlined(def_ref: core_schema.DefinitionReferenceSchema) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we inline if there's metadata, ex json schema related metadata?
d74190f
to
7340799
Compare
7340799
to
c91eef1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking really good. Happy to approve pending consideration of my final nit picks :)
f1db547
to
76b7109
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work @Viicos, perf wins like this take a lot of deep thought and work.
@MarkusSintonen, thanks for the ideas and blueprint here! We're really excited about this perf boost.
Thanks @sydney-runkle / @Viicos! Why the generic Generic models construction is the most painful issue as that has been so slow. |
It's not entirely clear to me why this benchmark had better results than the other ones. I compared both flamegraphs on this benchmark, on this PR and yours, and couldn't find any significant time difference being spent in schema gathering (as that's the only significant difference between the two PRs -- this one is implemented in Python and so is a bit slower, however this isn't significant). Note that since then, we merged other performance improvements so the percentages aren't really comparable. |
Change Summary
Closes #10655, fixes #10285
This is slightly identical to #10655, with the difference that:
pydantic-core
. Out of 3-4s, the rust implementation takes 20ms, while the Python one takes 230ms. I don't know yet if we should keep it in Python (it has advantages: type checking, easier debugging).Concerns (1 so far):
Consequence of the new schema traversing logic: we don't traverse schemas that have been unpacked from another model. With the following:
The core schema of M1 is:
M1 core schema
As you can see,
Model
is inlined inside the core schema for fieldm
.Now if you do:
StandaloneModel
is used twice inM2
(inm
andM1.m
). However, the inlined core schema forM1.StandaloneModel
is not transformed back into a'definition-ref'
schema (and doing so would require a copy of the schema to avoid mutating the core schema ofM1
fromM2
!):M2 core schema
On
main
,StandaloneModel
is properly stored in definitions, and'definition-ref'
schemas are used:M2 core schema on main
Both approaches are valid, but it might be that this cause unexpected issues (JSON Schema differences, etc). Regarding memory consumption, it is slightly affected. Taking the
k8s_v2.py
file as an example (it is a good one because the "issue" described here happens several times in it), we don't see any increase/decrease in %RAM consumption, although profile over time differs:main