Skip to content

[Serializer] Add a data: URI normalizer #16164

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from

Conversation

dunglas
Copy link
Member

@dunglas dunglas commented Oct 7, 2015

Q A
Bug fix? no
New feature? yes
BC breaks? no
Deprecations? no
Tests pass? yes
Fixed tickets n/a
License MIT
Doc PR not yet

A new normalizer to transform \SplFileInfo, \SplFileObject and \Symfony\Component\HttpFoundation\File\File to data: URI and the opposite.

It's convenient when using the FileReader.readAsDataURL() in JavaScript.

Usage:

$normalizer = new DataUriNormalizer();
echo $normalizer->normalize(new File(__DIR__.'/test.gif'));
// data:image/gif;base64,R0lGODdhAQABAIAAAP///////ywAAAAAAQABAAACAkQBADs=

echo $normalizer->normalize(new File(__DIR__.'/test.txt'));
// data:text/plain,K%C3%A9vin%20Dunglas%0A

var_dump($normalizer->denormalize('data:image/gif;base64,R0lGODdhAQABAIAAAP///////ywAAAAAAQABAAACAkQBADs=', 'SplFileObject');
// a SplFileObject (also supports HttpFoundation file if the composant is installed)

cc @clementtalleu

@dunglas dunglas changed the title [Serializer] Add a data: URI normalizer [WIP] [Serializer] Add a data: URI normalizer Oct 7, 2015
@dunglas dunglas changed the title [WIP] [Serializer] Add a data: URI normalizer [Serializer] Add a data: URI normalizer Oct 7, 2015
*/
public function denormalize($data, $class, $format = null, array $context = array())
{
if (!preg_match('/^data:(.*),(.*)/', $data)) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regex in the PR is very slow (https://regex101.com/r/rC9nW6/2). The first improvement would be to make the first match lazy: '/^data:(.*?),(.*)/'(https://regex101.com/r/rC9nW6/1)

The one from the gist linked by @dunglas is indeed not only the most exact, but also the fastest one: https://regex101.com/r/dL9cL4/1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making the last part of the regex possesive should also help '/^data:(.*?),(.*+)/'.

@dunglas
Copy link
Member Author

dunglas commented Oct 30, 2015

ping @symfony/deciders


public function __construct(MimeTypeGuesserInterface $mimeTypeGuesser = null)
{
if (null === $mimeTypeGuesser && class_exists('Symfony\Component\HttpFoundation\File\MimeType\MimeTypeGuesser')) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class_exists() call is useless (if the class is missing, the MimeTypeGuesserInterface is missing too).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not useless in the following case: new DataUriNormalizer() (without argument).

The constructor argument is optional, so the existence of the interface will not be checked and you cannot know if MimeTypeGuesser exists or no.

@dunglas
Copy link
Member Author

dunglas commented Dec 14, 2015

ping @symfony/deciders

use Symfony\Component\Serializer\Exception\UnexpectedValueException;

/**
* Normalizes a {@see \SplFileInfo} object to a data URI.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a SplFileInfo [...] no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an SplFileInfo afaik

@dunglas
Copy link
Member Author

dunglas commented Jan 12, 2016

Should be ok now.

if ($object instanceof File) {
$mimeType = $object->getMimeType();
} elseif ($this->mimeTypeGuesser) {
$mimeType = $this->mimeTypeGuesser->guess($object->getPathname());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would need a type-check on this to make sure it's an SplFileInfo. I know that you pointed to supportsNormalization as checking for this, but that's not good enough. Not being a normalizer expert, I just looked at this method at it confused me. It's not clear without checking for the type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pro as you pointed out is clarity.
The con is the (little) performance impact.

As dealing with data: URI is slow by definition, we can add it here.

@weaverryan
Copy link
Member

I added some comments, but this generally looks quite good and interesting

@dunglas
Copy link
Member Author

dunglas commented Jan 17, 2016

@weaverryan just introduced private methods as you suggested.

I also added a check for the type as suggested by you and @xabbuh

}

$mimeType = $this->getMimeType($object);
list($typeName) = explode('/', $mimeType, 2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minutiae detail, but I'd like this line moved right before the if statement that actually uses $typeName - it's not needed up here.


$mimeType = $this->getMimeType($object);
list($typeName) = explode('/', $mimeType, 2);
$object = $this->extractSplFileObject($object);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe call this $splFileObject for clarity and so we're not overriding the previous variable (even though we don't need it)

@weaverryan
Copy link
Member

I left some very very minor comments, but

👍

Status: Reviewed

@dunglas
Copy link
Member Author

dunglas commented Jan 17, 2016

@weaverryan Thanks for the review! Last comments fixed.

return new \SplFileObject($data);
}

throw new InvalidArgumentException(sprintf('The class parameter "%s" is not supported. It must be one of "SplFileInfo", "SplFileObject" or "Symfony\Component\HttpFoundation\File\File".', $class));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO cleaner would be to throw this exception after try/catch block.

@fabpot
Copy link
Member

fabpot commented Jan 25, 2016

Thank you @dunglas.

@fabpot fabpot closed this Jan 25, 2016
fabpot added a commit that referenced this pull request Jan 25, 2016
This PR was squashed before being merged into the 3.1-dev branch (closes #16164).

Discussion
----------

[Serializer] Add a data: URI normalizer

| Q             | A
| ------------- | ---
| Bug fix?      | no
| New feature?  | yes
| BC breaks?    | no
| Deprecations? | no
| Tests pass?   | yes
| Fixed tickets | n/a
| License       | MIT
| Doc PR        | not yet

A new normalizer to transform `\SplFileInfo`, `\SplFileObject` and `\Symfony\Component\HttpFoundation\File\File` to [`data:` URI](https://en.wikipedia.org/wiki/Data_URI_scheme) and the opposite.

It's convenient when dealing with the JavaScript `FileReader` API.

Usage:

```php
$normalizer = new DataUriNormalizer();
echo $normalizer->normalize(new File(__DIR__.'/test.gif'));
// data:image/gif;base64,R0lGODdhAQABAIAAAP///////ywAAAAAAQABAAACAkQBADs=

echo $normalizer->normalize(new File(__DIR__.'/test.txt'));
// data:text/plain,K%C3%A9vin%20Dunglas%0A

var_dump($normalizer->denormalize('data:image/gif;base64,R0lGODdhAQABAIAAAP///////ywAAAAAAQABAAACAkQBADs=', 'SplFileObject');
// a SplFileObject (also supports HttpFoundation file if the composant is installed)
```
cc @clementtalleu

Commits
-------

cc7b5af [Serializer] Add a data: URI normalizer
@teohhanhui
Copy link
Contributor

There is a practical problem with this: If the data URI is long (think few MiBs), it quickly takes up a lot of memory (easily tens or hundreds of MiBs) as it's passed by value through the Serializer...

In my case I've had to write to a temp file very early to avoid the problem. Not sure if there's a more elegant solution...

@dunglas dunglas deleted the normalizer_data_uri branch February 1, 2016 09:21
@dunglas
Copy link
Member Author

dunglas commented Feb 1, 2016

During the denormalization? It's "intended" as the current design is to handle everything in memory. Having support for on disk tmp file (or both, using the file://memory wrapper) would be nice but should be optional IMO (a constructor flag).

Anyway, I would recommend using a specific endpoint accepting multipart/form-data instead of data: URIs for large files. Base64 encoding comes with an overhead of 33% and is not the best suitable encoding for such use cases.

Are you interested on opening a PR for that @teohhanhui ?

@teohhanhui
Copy link
Contributor

@dunglas I don't think I have any solution to offer on this...

Having support for on disk tmp file (or both, using the file://memory wrapper) would be nice

Did you mean serializing (unprocessed) data to disk during denormalization (to avoid duplicating the large string when repeatedly passed by value)? That seems really convoluted... Otherwise, if you mean only during the denormalization step of the data URI, I don't see how that can avoid the problem I mentioned above...

@teohhanhui
Copy link
Contributor

Anyway, one is more likely to run into a problem with preg_match first with large strings:

thephpleague/uri#11

*/
public function denormalize($data, $class, $format = null, array $context = array())
{
if (!preg_match('/^data:([a-z0-9]+\/[a-z0-9]+(;[a-z0-9\-]+\=[a-z0-9\-]+)?)?(;base64)?,[a-z0-9\!\$\&\\\'\,\(\)\*\+\,\;\=\-\.\_\~\:\@\/\?\%\s]*\s*$/i', $data)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^data:([a-z0-9]+\/[a-z0-9]+(;[a-z0-9\-]+\=[a-z0-9\-]+)?)?(;base64)?,
[a-z0-9\!\$\&\\\'\,\(\)\*\+\,\;\=\-\.\_\~\:\@\/\?\%\s]*\s*$

...could be made posessive/greedy for better performance...

^data:([a-z0-9]++\/[a-z0-9]++(;[a-z0-9\-]++\=[a-z0-9\-]++)?)?(;base64)?,
[a-z0-9\!\$\&\\\'\,\(\)\*\+\,\;\=\-\.\_\~\:\@\/\?\%\s]*+\s*+$

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, is matching the string all the way to the end really necessary (with a large list of allowed characters)?

I mean we're not validating if the actual data corresponds with the encoding specification, the user could still (for example) use the base64 option but not send base64 encoded data.

We're taking the correctness of the data (more or less) for granted -- so why allow the regex engine to even bother scanning it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is two main reasons:

  • this is very security sensitive (in my former implementation, it was allowing to access to /etc/shadow for instance and, being not a security expert, I prefer to be conservative - what about specials chars...)
  • the PHP stream wrapper can generate an error if the data is invalid, the regex prevent it

However 👍 to make the regex as fast as possible

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh-oh, i can't imagine how the /etc/shadow file could get exposed since the
prefix is predetermined, can you please give me some documentation?

On 20:14, Thu, Feb 4, 2016 Kévin Dunglas notifications@github.com wrote:

In src/Symfony/Component/Serializer/Normalizer/DataUriNormalizer.php
#16164 (comment):

  •    return $data instanceof \SplFileInfo;
    
  • }
  • /**
  • \* {@inheritdoc}
    
  • *
    
  • \* Regex adapted from Brian Grinstead code.
    
  • *
    
  • \* @see https://gist.github.com/bgrins/6194623
    
  • *
    
  • \* @throws InvalidArgumentException
    
  • \* @throws UnexpectedValueException
    
  • */
    
  • public function denormalize($data, $class, $format = null, array $context = array())
  • {
  •    if (!preg_match('/^data:([a-z0-9]+\/[a-z0-9]+(;[a-z0-9-]+\=[a-z0-9-]+)?)?(;base64)?,[a-z0-9!\$&\\'\,()*+\,\;\=-._~:\@\/\?\%\s]_\s_$/i', $data)) {
    

There is two main reasons:

  • this is very security sensitive (in my former implementation, it was
    allowing to access to /etc/shadow for instance and, being not a security
    expert, I prefer to be conservative - what about specials chars...)
  • the PHP stream wrapper can generate an error if the data is invalid,
    the regex prevent it

However [image: 👍] to make the regex as fast as possible


Reply to this email directly or view it on GitHub
https://github.com/symfony/symfony/pull/16164/files#r51913006.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dunglas the original RegEx does not match application/octet-stream because the second part of the mime type is not allowed to contain a dash.

Which means the DataUriNormalizer can't match a data uri that it produced (when no mime type info was available).

Also there's an issue in the normalizer logic, the Symfony mime type guesser tries to fetch the path of the file and assert that the file exists. This does not work with php://memory / php://temp or data: type files.

I could submit a PR with the aforementioned speedups as well as some logic to make the mime type guessing part optional (ex.: supply a mime type via $context).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm checking out this list and apparently literal dots and underscores are also (sometimes) used in mime-type names (especially in the second part) and dashes are (sometimes) used in the first part.

So the RegEx needs to allow for those too.

xabbuh added a commit to symfony/symfony-docs that referenced this pull request Mar 13, 2016
This PR was squashed before being merged into the master branch (closes #6314).

Discussion
----------

New normalizers

| Q             | A
| ------------- | ---
| Doc fix?      | no
| New docs?     | yes symfony/symfony#17603 symfony/symfony#17411 symfony/symfony#16164
| Applies to    | 3.1
| Fixed tickets |

Commits
-------

4ff3a28 New normalizers
@fabpot fabpot mentioned this pull request May 13, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants