[JsonEncoder][Serializer] Introducing the component #51718


Merged (1 commit, Dec 10, 2024)

Conversation

mtarld
Contributor

@mtarld mtarld commented Sep 22, 2023

| Q             | A
| ------------- | ---
| Branch?       | 7.2
| Bug fix?      | no
| New feature?  | yes
| Deprecations? | no
| Tickets       |
| License       | MIT
| Doc PR        | TODO

This PR introduces a new component: JsonStreamer (initially named JsonEncoder, and renamed in #59863)

Note

This is the continuation of a Serializer revamp trial, the previous PR description is available here.

Why?

The Serializer component is a library designed to normalize PHP data structures into raw associative arrays, and then encode them into a wide variety of formats, which offers a high degree of flexibility.
However, that flexibility has some drawbacks:

  • Data shapes get computed every time (de)serialization happens, which is very expensive as it implies resource-intensive calls such as reflection.
  • Each time the serializer is called, normalizers and encoders are tried until one supporting the given data is found. While this pattern works well when looping through a relatively small amount of services (e.g. security authenticators), it rapidly becomes costly as the number of normalizers/encoders grows, even though the situation has been slightly improved in 6.3 with the addition of getSupportedTypes().
  • The whole normalized data is, at one point, stored in memory, which can cause memory issues when dealing with huge collections.

Plus, that degree of flexibility isn't needed that often. Indeed, there are many use cases where the Serializer component is used to serialize data without intensive modification (i.e., without custom normalization). In these cases, the flexibility degrades performance considerably for nothing.

That's why this PR introduces the JsonStreamer component, which focuses on performance to address the above use case for the specific JSON format. The DNA of this component is to be a fast, modern JSON parser and streaming encoder. It fixes many issues of the native json_encode and json_decode PHP functions: streaming, on-demand parsing, generics handling, the ability to create strongly typed objects instead of raw associative arrays in one pass, etc.

The difference between the Serializer component and the JsonStreamer component is like the difference between Doctrine ORM and Doctrine DBAL.
Indeed, the DBAL can be considered a sub-layer of the ORM, and when precise, performance-sensitive work is needed, developers skip the ORM layer and deal with the DBAL directly.
It is the very same with the Serializer and the JsonStreamer: when precise, performance-sensitive work is needed, developers skip the normalization layer by fine-tuning the data mapping in their userland, and deal with the encoding layer directly.

API

Contrary to the Symfony\Component\Serializer\SerializerInterface, which has two methods, serialize and deserialize, the new design instead introduces four new interfaces.

These compose the main part of the available API.

<?php

namespace Symfony\Component\JsonStreamer;

use Symfony\Component\TypeInfo\Type;

/**
 * Writes $data into a specific format according to $options.
 *
 * @template T of array<string, mixed>
 */
interface StreamWriterInterface
{
    /**
     * @param T $options
     *
     * @return \Traversable<int, string>&\Stringable
     */
    public function write(mixed $data, Type $type, array $options = []): \Traversable&\Stringable;
}
<?php

namespace Symfony\Component\JsonStreamer;

use Symfony\Component\TypeInfo\Type;

/**
 * Reads $input and converts it to the given $type according to $options.
 *
 * @template T of array<string, mixed>
 */
interface StreamReaderInterface
{
    /**
     * @param resource|string $input
     * @param T               $options
     */
    public function read($input, Type $type, array $options = []): mixed;
}

As you may notice, there is no $format parameter.
This is by design: a streamer knows how to deal with only one format.

Usage example

Install the component

composer require symfony/json-streamer

Configure PHP attributes:

<?php

use Symfony\Component\JsonStreamer\Attribute\JsonStreamable;
use Symfony\Component\JsonStreamer\Attribute\StreamedName;
use Symfony\Component\JsonStreamer\Attribute\ValueTransformer;

#[JsonStreamable]
class Dummy
{
    #[StreamedName('@id')]
    public int $id;

    public string $name;

    #[ValueTransformer(
        nativeToStream: DoubleIntAndCastToStringNormalizer::class,
        streamToNative: DivideStringAndCastToIntDenormalizer::class,
    )]
    public int $price;
}

Add the proper value transformers:

<?php

use Symfony\Component\JsonStreamer\ValueTransformer\ValueTransformerInterface;
use Symfony\Component\TypeInfo\Type;

final class DoubleIntAndCastToStringNormalizer implements ValueTransformerInterface
{
    public function transform(mixed $value, array $options = []): mixed
    {
        return (string) (2 * $options['scale'] * $value);
    }

    public static function getStreamValueType(): Type
    {
        return Type::string();
    }
}

// ---

use Symfony\Component\JsonStreamer\ValueTransformer\ValueTransformerInterface;
use Symfony\Component\TypeInfo\Type;

final class DivideStringAndCastToIntDenormalizer implements ValueTransformerInterface
{
    public function transform(mixed $value, array $options = []): mixed
    {
        return (int) (((int) $value) / (2 * $options['scale']));
    }

    public static function getStreamValueType(): Type
    {
        return Type::string();
    }
}
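The two transformers above are meant to be inverses of each other. As a quick sanity check, here is a minimal, standalone sketch of the same round trip, written with plain closures (not the component's ValueTransformerInterface) so it runs without the component installed; the 'scale' option key is taken from the example above:

```php
<?php

// Standalone sketch of the two value transformations above,
// written as plain closures so it runs without the component.
$nativeToStream = static fn (int $value, array $options): string => (string) (2 * $options['scale'] * $value);
$streamToNative = static fn (string $value, array $options): int => (int) (((int) $value) / (2 * $options['scale']));

$options = ['scale' => 1];

// native int 21 -> stream string "42" -> back to native int 21
var_dump($nativeToStream(21, $options));                            // string(2) "42"
var_dump($streamToNative($nativeToStream(21, $options), $options)); // int(21)
```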

Then use the stream reader/writer:

<?php

use Symfony\Component\JsonStreamer\JsonStreamReader;
use Symfony\Component\JsonStreamer\JsonStreamWriter;
use Symfony\Component\TypeInfo\Type;

final class MyService
{
    public function __invoke(): void
    {
        $streamReader = JsonStreamReader::create();
        $streamWriter = JsonStreamWriter::create();

        // convert dummy to JSON string
        echo (string) $streamWriter->write(new Dummy(), Type::object(Dummy::class));

        // convert dummy to JSON iterable string
        foreach ($streamWriter->write(new Dummy(), Type::object(Dummy::class)) as $chunk) {
            echo $chunk;
        }

        // convert a stringable dummy as a string
        echo (string) $streamWriter->write(new StringableDummy(), Type::string());

        // convert collection with generics
        $type = Type::generic(Type::object(Collection::class), Type::object(Dummy::class));
        echo (string) $streamWriter->write(new Collection([new Dummy(), new Dummy()]), $type);

        // convert JSON string to dummy
        $streamReader->read('...', Type::object(Dummy::class));

        // convert JSON resource to dummy lazy ghost
        $resource = fopen('php://temp', 'w');
        fwrite($resource, '...');
        rewind($resource);

        $streamReader->read($resource, Type::object(Dummy::class));

        // decode JSON string/resource to a collection with generics
        $json = '{...}';
        $type = Type::generic(Type::object(Collection::class), Type::object(Dummy::class));

        $streamReader->read($json, $type);
    }
}

Main ideas

Cache

The main trick to improve performance is caching.
During cache warm-up (or on the fly, once and for all), the data structure is computed and used to generate a PHP cache file that we might call a "template".
The generated template is then called with the actual data to handle encoding/decoding.

Template generation is the main cost. Because the template is computed and written only once, every subsequent call merely executes it, which is extremely fast.

Here is the workflow during runtime:

Cache miss:
  1. search for template
  2. build data model
     → scalar nodes
     → collection nodes
     → object nodes
       →→ load properties metadata (reflection, attributes, ...)
  3. build template PHP AST
  4. optimize template PHP AST
  5. compile template PHP AST
  6. write template file
  7. execute template file

Cache hit:
  1. search for template
  2. execute template file
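The cache-hit fast path can be sketched in a few lines of plain PHP. This is only an illustration of the idea, not the component's actual cache layer: the generated "template" is an includable file returning a closure, so a hit boils down to one require plus one call.

```php
<?php

// Sketch of the cache workflow: the "template" is a plain PHP file
// returning a closure, so a cache hit is just require + call.
$templateFile = sys_get_temp_dir().'/dummy_writer.template.php';

if (!is_file($templateFile)) {
    // Cache miss: in the real component the template is generated from
    // the data model via a PHP AST; here we just write it verbatim.
    file_put_contents($templateFile, <<<'PHP'
        <?php
        return static function (object $data): \Traversable {
            yield '{"@id":';
            yield json_encode($data->id);
            yield '}';
        };
        PHP);
}

// Cache hit: execute the template file directly.
$template = require $templateFile;

echo implode('', iterator_to_array($template((object) ['id' => 1]), false)); // {"@id":1}
```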

By the way, because the component is intended to work mostly with DTOs, it will pair well with an automapping tool.

Stream

To improve memory usage, encoding and decoding rely on generators.
This way, the whole JSON string never needs to be in memory at once.

Here is, for example, a simple encoding template PHP file:

<?php

return static function (mixed $data, array $config): \Traversable {
    yield '{"@id":';
    yield \json_encode($data->id);
    yield ',"name":';
    yield \json_encode($data->name);
    yield '}';
};
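To make the streaming aspect concrete, here is a minimal, self-contained sketch (not the component's API) consuming such a generator chunk by chunk: each chunk is flushed to an output stream as soon as it is produced, so the full JSON string is never concatenated in memory.

```php
<?php

// A template closure like the one above, yielding JSON chunks.
$template = static function (object $data): \Traversable {
    yield '{"@id":';
    yield json_encode($data->id);
    yield ',"name":';
    yield json_encode($data->name);
    yield '}';
};

// Write each chunk out as soon as it is produced; the complete
// JSON string never exists in memory at once.
$out = fopen('php://output', 'w');
foreach ($template((object) ['id' => 1, 'name' => 'foo']) as $chunk) {
    fwrite($out, $chunk);
}
// prints {"@id":1,"name":"foo"}
```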

Configuration and context

Contrary to the current Serializer implementation, a distinction has been made between "configuration" and "context".

  • The configuration is meant to be provided by the developer when calling the stream reader/writer. It is a basic hashmap, like the previous context, but it is documented with PHPStan types, so it can be autocompleted and validated during static analysis.
  • The context can be compared to runtime encoding/decoding information. It is internal and isn't meant to be manipulated by the developer. It is also a basic hashmap.
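As an illustration of the "documented hashmap" idea, the options array can be described with a PHPStan array shape so static analysis can validate and autocomplete the keys. The only option key used here, 'scale', comes from the value-transformer example above; the function itself is a hypothetical standalone sketch, not part of the component:

```php
<?php

/**
 * Sketch: an options hashmap documented with a PHPStan array shape,
 * so static analysis can validate and autocomplete its keys.
 *
 * @param array{scale?: int} $options
 */
function transformPrice(int $value, array $options = []): string
{
    // Unknown keys or wrongly typed values would be flagged by PHPStan.
    $scale = $options['scale'] ?? 1;

    return (string) (2 * $scale * $value);
}

var_dump(transformPrice(21, ['scale' => 1])); // string(2) "42"
```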

Performance showcase

With all these ideas, performance has been greatly improved.

When serializing 10k objects to JSON, it is about 10 times faster than the existing Serializer, and is even comparable to the native json_encode function.
(chart: serializer speed)

And it consumes about 2 times less memory.
(chart: serializer memory)

When deserializing a JSON file to a list of 50k objects, iterating over the first 9,999 and reading the 10,000th eagerly, it is more than 10 times faster than the legacy deserialization, and is even comparable to the native json_decode function!
(chart: deserializer speed)

In terms of memory consumption, the new implementation is comparable to the existing one when reading eagerly.

And when reading lazily, it consumes about 10 times less memory!
(chart: deserializer memory)

And it doesn't stop there: @dunglas is working on a PHP extension, compatible with this new version of the component, that leverages simdjson to make JSON serialization/deserialization even faster.

These improvements will benefit several big projects such as Drupal, Sylius, and API Platform (some integration tests have already been made for this).
It'll also benefit many smaller projects, as so many deal with serialization.

The code of the used benchmark can be found here.

Extension points

PropertyMetadataLoaderInterface

The Symfony\Component\JsonStreamer\{Read,Write}\Stream{Reader,Writer}Generator calls a Symfony\Component\JsonStreamer\Mapping\PropertyMetadataLoaderInterface to retrieve an object's properties, along with their names, types, and formatters.

Therefore, it is possible to decorate (or replace) the Symfony\Component\JsonStreamer\Mapping\PropertyMetadataLoaderInterface.
This makes it possible, for example, to read extra custom PHP attributes, ignore specific properties, rename all of them, and so on.

As an example, in the component, there are:

  • The PropertyMetadataLoader which reads basic properties information.
  • The AttributePropertyMetadataLoader which reads properties attributes such as EncodedName, EncodeFormatter, DecodeFormatter, and MaxDepth to ignore, rename or add formatters on the already retrieved properties.
  • The GenericTypePropertyMetadataLoader which updates properties' types according to generics.
  • The DateTimeTypePropertyMetadataLoader which updates properties' types to cast date-times to strings and vice-versa.

For example, you can hide sensitive data of sensitive classes and add a sensitive marker:

<?php

use Symfony\Component\JsonStreamer\Mapping\PropertyMetadata;
use Symfony\Component\JsonStreamer\Mapping\PropertyMetadataLoaderInterface;
use Symfony\Component\TypeInfo\Type;

final class CustomPropertyMetadataLoader implements PropertyMetadataLoaderInterface
{
    public function __construct(
        private readonly PropertyMetadataLoaderInterface $decorated,
    ) {
    }

    public function load(string $className, array $config, array $context): array
    {
        $result = $this->decorated->load($className, $config, $context);
        if (!is_a($className, SensitiveInterface::class, true)) {
            return $result;
        }

        foreach ($result as &$metadata) {
            if ('sensitive' === $metadata->name()) {
                $metadata = $metadata
                    ->withType(Type::string())
                    ->withFormatter(self::hideData(...));
            }
        }

        $result['is_sensitive'] = new PropertyMetadata(
            name: 'wontBeUsed',
            type: Type::bool(),
            formatters: [self::true()],
        );

        return $result;
    }

    public static function hideData(mixed $value): string
    {
        return hash('xxh128', json_encode($value));
    }

    public static function true(): bool
    {
        return true;
    }
}

@mtarld mtarld requested a review from dunglas as a code owner September 22, 2023 07:48
@joelwurtz
Contributor

What a job! Thanks for the work on this, the serializer really needs some love.

I looked through most of the implementation design; it's really well done, with nice interface layers.

I think there are too many extension points for a start, maybe some of them are not needed? But we should provide an integration/extension example with API Platform to get feedback on those extension points.

Aren't you afraid of the maintenance burden of the generated-code implementation? I think there are two ways to go here: use this implementation, or use the php-parser library, but I'm not sure which one is better in terms of maintenance.

I really hope this gets accepted into Symfony.

@Hanmac
Contributor

Hanmac commented Sep 22, 2023

Do the planned changes also affect the Normalizer/Denormalizer part of the Serializer?

When using the Symfony HttpClient, it already decodes the data into an array for me,
so the Serializer just needs to denormalize the data.

Or, depending on the HTTP client (like Guzzle?), could I use the data from a response stream?

@mtarld
Contributor Author

mtarld commented Sep 22, 2023

Many thanks @joelwurtz! I truly think as well that this component deserves more love!

I think there is too many extension point for a start, maybe some of them are not needed ? But we should provide an integration / extension example with api platform to have feedback on those extensions point.

Yes, I agree! I began by exposing as many extension points as I could, as it's easier to remove them later than to add them. But we need to define which extension points are relevant for a start.

Are you not afraid about the generated code implementation on maintenance burden, i think there a 2 ways on this, use this implementation or use the php-parser library but i'm not sure which one is better in terms of maintenance ?

Indeed, it would reduce the added code a lot (and the maintenance burden), but at the same time it would complicate the template generation and PHP AST optimization code, as the current nodes are designed for these purposes.
I don't know about the php-parser library's BC promise and release process; it seems acceptable, as the library is already used (opt-in) by the Translation component. But this time, it unfortunately won't be opt-in.

@mtarld
Contributor Author

mtarld commented Sep 22, 2023

@Hanmac, yes, it'll affect that part. Indeed, the performance improvement relies on moving the normalization/denormalization step to the cache, in a way (by computing the data shape only once).

For your specific use case, you can leverage the response's getContent method instead to retrieve a string that you can give to the deserializer.

@Hanmac
Contributor

Hanmac commented Sep 22, 2023

@mtarld my problem there:

I use the normalizer part to read from an API,
and my custom normalizer turns references into models.

Like for Products, getProduct might return this:

{
  "id": "Uuidv4",
  "Sector": {
    "id": "Uuidv4"
  }
}

My normalizer notices that the Sector part is incomplete, so it automatically calls getSector from my API client

To give the normalizer access to the current client call, I have it access the client via the context.

Would that still be possible?

@nikophil
Contributor

nikophil commented Sep 22, 2023

I don't know about php-parser library BC promise and release process

I think they have a very strong BC policy. Rector, PHPStan, and Psalm are built on top of it.

@mtarld
Contributor Author

mtarld commented Sep 23, 2023

@Hanmac, I think this is precisely how denormalizers should not be used.

Indeed, normalizers are meant to turn objects into arrays and vice versa, nothing more. Doing HTTP calls on the fly, or querying a database for example, implies a big lack of separation of concerns. The process of retrieving the actual data behind a URL must live in a custom service of yours.
Therefore, an idea instead could be to have a DTO representing the first HTTP response (with the IRI), and another one that you can fill manually with nested HTTP responses (sectors, for instance).
Anyway, this must not be hidden in the denormalization process IMHO 🙂

@Hanmac
Contributor

Hanmac commented Sep 23, 2023

@Hanmac, I think this is precisely how denormalizers should not be used.

Indeed, normalizers are meant to turn objects into arrays and vice versa, nothing more. Doing HTTP calls on the fly, or querying a database for example, implies a big lack of separation of concerns. The process of retrieving the actual data behind a URL must live in a custom service of yours. Therefore, an idea instead could be to have a DTO representing the first HTTP response (with the IRI), and another one that you can fill manually with nested HTTP responses (sectors, for instance). Anyway, this must not be hidden in the denormalization process IMHO 🙂

For work, this is one of the APIs i was going to map:
https://support.korona.de/korona-cloud-api/

When loading 100+ products with one query, I want them to load the "ModelReference"s as well, like for example their sectors or ticketDefinition.

I implemented logic in the denormalizer to load as few objects as possible, for example:

  • when multiple products reference the same Sector, the Sector is only loaded once (within the same Serializer call)
  • when multiple products reference each other via relatedProducts, and A->B->A occurs, it notices that A is already going to be loaded, and A is added to B's property later

@n-valverde
Contributor

Wow, that's quite a big work, congrats @mtarld. That being said, for what it's worth, I have to admit I'm on the fence with this proposal as of now 😅. The perf showcase definitely makes it look awesome, but here are a few thoughts from a user's POV coming to my mind:

Plus, this design makes debugging hard, especially using nested normalizers.
Core normalizers make use of inheritance which leads to maintenance headaches.
Some features should've rather been left to userland and community packages [...] These add unnecessary complexity to the codebase which increases the maintenance burden.

About the why, it seems there are 2 clear intentions: improving performance, and easing design/maintenance/debugging.
The first point is clearly demonstrated, but the second is not really. My feeling going through the PR is that the proposed implementation - while being well done - is extremely hard to understand (and you seem to agree with that by stating the extension points are hard to use), and so probably to debug.
On the other hand, the current implementation is quite easy to understand, but can become hard to debug when the stack grows.
I understand there might be simplicity tradeoffs when it comes to performance, but this should probably not be advertised as simplifying the design/maintenance/debugging here imho.

because it is intended to work mostly with DTOs

Why? Imho the serializer should not be intended to work with specific objects. Doc states that The Serializer component is meant to be used to turn objects into a specific format (XML, JSON, YAML, ...) and the other way around.

To be able to generate templates that stick as much as possible to the data model, the support of generics has been introduced.

Does that mean that if my app does not use generics, I might end up with bad generated templates? Or is it just better if I use them, i.e. no drawback in not using them compared to the current implementation?

BC & Upgrade Path

It is missing quite a big BC break here imo: dropping support for normalizers/denormalizers, while they are probably among the most used extension points of the current implementation. Furthermore, normalizers/denormalizers are also commonly injected on their own, when you don't need to go through the full serialization. What is the suggested replacement? Could you showcase a simple normalizer/denormalizer and what it would become? That would be great.
What about the other features of the current serializer, will they still work? E.g. is there a way to denormalize into an existing object? Name conversion? Etc...
Dropping some supported formats is obviously a BC break as well, but that can probably be implemented, or should be advertised.
The serializer has a straightforward and clear process, object -> normalize -> array -> encode -> string, and the other way around, which also makes it clear where the extension points are. Could you showcase the workflow of your implementation in the same way, for comparison purposes? The cache miss/hit workflow you wrote is not very clear about where the extension points are, and does not distinguish between serialization and deserialization.

Most of the time, it's a bad idea to alter the data structure depending on the
serialization configuration, it is rather recommended to use an adapted and dedicated DTO
during serialization.

That's what normalizers/denormalizers are advertised for 😅: "you may need to create your own normalizer to transform an unsupported data structure. Imagine you want to add, modify, or remove some properties during the serialization process. For that you'll have to create your own normalizer. But it's usually preferable to let Symfony normalize the object, then hook into the normalization to customize the normalized data."

Extension points (and concepts)

I agree on the general note that there are too many extension points, but on the other hand it is not super clear yet where they all kick in or could be useful. Maybe if you can showcase the workflow as simply as possible with all relevant extension points, could give a better clue.
Config classes: Do they really need to be immutable? I mean, I understand why you want them immutable, but if you expect them to be extended in userland, you can't really guarantee immutability.
PropertyMetadataLoaderInterface: It seems to be the main replacement for normalizers/denormalizers, but relies on a formatter concept which is not really advertised in the PR. It feels weird to implement something about metadata to actually deal with the data. But when you think about attaching a formatter then it makes sense, so I think the concept of formatters needs to be clearly advertised. I'm not sure this cover all cases of a normalizer/denormalizer though, does it?
SplitterInterface: I think the concept needs to be better explained 😄.

Hopefully these are constructive enough questions/thoughts 😇, and congrats again on the huge work! Looking forward to seeing what this PR becomes!
Cheers!

@jdreesen
Contributor

@mtarld just a quick heads-up that the last paragraph (Thoughts about Type and the PropertyInfo component) in your description is currently hidden in the last details tag (which describes the InstantiatorInterface for deserialization) which probably causes it to be easily overlooked.

@nicolas-grekas nicolas-grekas added this to the 7.0 milestone Sep 25, 2023
@carsonbot carsonbot changed the title [Serializer] Putting the serializer component on steroids Putting the serializer component on steroids Sep 25, 2023
@Nyholm
Member

Nyholm commented Sep 25, 2023

Great. Thank you for this. Really impressed with the work you have done.

Can you please give me a simple usage example without the framework? I would like to make sure I set things up properly before I rerun my own performance tests.

@dunglas
Member

dunglas commented Sep 25, 2023

To support @n-valverde's point, I wonder if this PR shouldn't be a new component of its own.

This is a better and more powerful alternative to json_encode/decode, as the Yaml component is a powerful alternative to the YAML extension. But it's not a full replacement for the current Serializer, and it has a very different design (which is better IMHO).

It could be the JsonEncoder or JsonMarshaller component (the latter is probably better to prevent confusion with encoders from the Serializer) to use the same naming as Go. Or just the Json component to be consistent with Yaml.

The current Serializer has a lot of features that will be hard to keep in your implementation (especially with a fully functional BC layer), and supporting as many features and formats as the Serializer will likely "bloat" the new code. Also, there are hundreds of projects that use the Serializer, and forcing all of them to migrate will be hard.

In most (JSON) cases, it will be possible to use the JsonEncoder component instead of the Serializer, and maybe at some point, we'll be able to feature-freeze it, and then deprecate it. But I doubt that we'll manage to do that anytime soon.

Anyway, I can't wait to get this PR merged into Symfony!

@carsonbot carsonbot changed the title Putting the serializer component on steroids [Serializer] Putting the serializer component on steroids Sep 26, 2023
@fabpot fabpot modified the milestones: 7.2, 7.3 Nov 20, 2024
Member

@fabpot fabpot left a comment


I like the fact that most classes are tagged as internal.
Great work here. I'm not a big fan of the ::create() methods, but I can see how they are useful; maybe something to revisit at some point.

@fabpot
Member

fabpot commented Dec 10, 2024

Thank you @mtarld.

@fabpot fabpot merged commit c198130 into symfony:7.3 Dec 10, 2024
8 of 10 checks passed
@mtarld mtarld deleted the redesign branch December 10, 2024 17:40
chalasr added a commit that referenced this pull request Dec 11, 2024
This PR was merged into the 7.3 branch.

Discussion
----------

[FrameworkBundle][JsonEncoder] Wire services

| Q             | A
| ------------- | ---
| Branch?       | 7.3
| Bug fix?      | no
| New feature?  | yes
| Deprecations? | no
| Issues        |
| License       | MIT

Follow-up of
* #51718

The FrameworkBundle part of the JsonEncoder component introduction.

The component related config is quite simple:
```yaml
framework:
    json_encoder:
        paths:
            App\EncodableDto\: '../src/EncodableDto/*'
```

Plus, the framework integration proposes the following bindings:
- `EncoderInterface $jsonEncoder` to the `json_encoder.encoder` service
- `DecoderInterface $jsonDecoder` to the `json_encoder.decoder` service

---

As this PR is based on top of #51718, only the last commit should be considered.

Commits
-------

e213884 [FrameworkBundle] [JsonEncoder] Wire services
mykiwi added a commit to SymfonyCon/2024-talks that referenced this pull request Dec 11, 2024
@chalasr chalasr added JsonStreamer and removed ❄️ Feature Freeze Important Pull Requests to finish before the next Symfony "feature freeze" labels Dec 11, 2024
@carsonbot carsonbot changed the title [Serializer] [JsonEncoder] Introducing the component [JsonEncoder][Serializer] Introducing the component Dec 11, 2024