[DomCrawler] Optionally use html5-php to parse HTML #29306

Merged
merged 1 commit into from
Apr 3, 2019

Conversation

tgalopin
Contributor

@tgalopin tgalopin commented Nov 24, 2018

| Q             | A
| ------------- | ---
| Branch?       | master
| Bug fix?      | no
| New feature?  | yes
| BC breaks?    | no
| Deprecations? | no
| Tests pass?   | WIP
| Fixed tickets | #29280, #28596
| License       | MIT
| Doc PR        | symfony/symfony-docs#10700

This PR introduces the possibility to parse HTML content in the Crawler using the html5-php library (https://github.com/Masterminds/html5-php). This allows for better support of HTML5 and fixes many unexpected behaviors and inconsistencies of the native DOM extension.

@nicolas-grekas nicolas-grekas added this to the next milestone Nov 24, 2018
@stof
Member

stof commented Nov 24, 2018

As the native implementation uses validateOnParse, I think your alternative implementation needs to check $html5->hasErrors() and throw based on $html5->getErrors() too. Otherwise, parse errors might go unnoticed.
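
A minimal sketch of that check, assuming `$parser` is a `Masterminds\HTML5` instance (which exposes `loadHTML()`, `hasErrors()` and `getErrors()`). The helper name `parseStrict` and the choice of exception are illustrative assumptions, not code from this PR:

```php
<?php

/**
 * Parses HTML and fails loudly if the parser recorded any errors.
 *
 * $parser is duck-typed here so the sketch runs without the library;
 * in practice it would be a Masterminds\HTML5 instance.
 */
function parseStrict(object $parser, string $html): \DOMDocument
{
    $document = $parser->loadHTML($html);

    // html5-php collects parse errors instead of throwing,
    // so they have to be checked explicitly after parsing.
    if ($parser->hasErrors()) {
        throw new \InvalidArgumentException(implode("\n", $parser->getErrors()));
    }

    return $document;
}
```

Whether the Crawler should throw or merely report these errors is a design decision this sketch does not settle.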

@fabpot
Member

fabpot commented Feb 21, 2019

@tgalopin What's the status of this PR?

@tgalopin
Contributor Author

Waiting for Masterminds/html5-php#163 to be merged to pass tests here.

@fabpot
Member

fabpot commented Mar 4, 2019

@tgalopin Upstream PR merged :)

@stof
Member

stof commented Mar 28, 2019

Due to Masterminds/html5-php#139, shouldn't we use the saveHTML of the HTML5 library instead of the native one when we use the HTML5 parser to parse the DOM? That means we also need to remember whether the DOM was created by the HTML5 parser.

@fabpot
Member

fabpot commented Mar 31, 2019

@tgalopin friendly ping

@tgalopin tgalopin force-pushed the html5-parser branch 3 times, most recently from e21e17a to 14a454d Compare March 31, 2019 10:15
@tgalopin tgalopin changed the title from "[DomCrawler][WIP] Optionally use html5-php to parse HTML" to "[DomCrawler] Optionally use html5-php to parse HTML" Mar 31, 2019
@tgalopin
Contributor Author

tgalopin commented Apr 3, 2019

Tests are failing for an unrelated reason. I think this is ready to review.

@tgalopin
Contributor Author

tgalopin commented Apr 3, 2019

Updated

@tgalopin tgalopin force-pushed the html5-parser branch 2 times, most recently from 3e61e24 to e0ca69a Compare April 3, 2019 12:56
}

/**
* Convert charset to HTML-entities to ensure valid parsing.
Member

Converts

@fabpot
Member

fabpot commented Apr 3, 2019

Thank you @tgalopin.

@fabpot fabpot merged commit 4050ec4 into symfony:master Apr 3, 2019
fabpot added a commit that referenced this pull request Apr 3, 2019
…galopin)

This PR was squashed before being merged into the 4.3-dev branch (closes #29306).

Discussion
----------

[DomCrawler] Optionally use html5-php to parse HTML

| Q             | A
| ------------- | ---
| Branch?       | master
| Bug fix?      | no
| New feature?  | yes
| BC breaks?    | no
| Deprecations? | no
| Tests pass?   | WIP
| Fixed tickets | #29280, #28596
| License       | MIT
| Doc PR        | symfony/symfony-docs#10700

This PR introduces the possibility to parse HTML content in the Crawler using the html5-php library (https://github.com/Masterminds/html5-php). This allows for better support of HTML5 and fixes many unexpected behaviors and inconsistencies of the native DOM extension.

Commits
-------

4050ec4 [DomCrawler] Optionally use html5-php to parse HTML
@tgalopin tgalopin deleted the html5-parser branch April 3, 2019 13:23
"masterminds/html5": "^2.6"
},
"conflict": {
"masterminds/html5": "<2.6"
Member

We should also conflict with > 3 then

@@ -608,6 +601,15 @@ public function html(/* $default = null */)
throw new \InvalidArgumentException('The current node list is empty.');
}

if (null !== $this->html5Parser) {
Member

There is an issue here. You instantiate the HTML5 parser in the constructor even when the content added is not HTML5 but XML or existing DOM elements (coming from elsewhere than a parent crawler using HTML5). This means you might be saving with the HTML5 parser when it was not used for parsing.

Contributor Author

How do you propose to improve this?

Member

well, we need to distinguish 3 cases:

  • we are parsing some HTML5
  • we are parsing some older HTML
  • we are not parsing HTML at all

The boolean argument in the constructor allows us to decide between the first 2 cases at the time we instantiate. But whether the content is HTML or not is not something the constructor knows (as content can be added later).

The solution might be to store the boolean property. Then, based on that, we would decide which parsing strategy to use if we load HTML and instantiate the HTML5 parser if needed.
Then, here, we can keep saying "if I used an HTML5 parser, I also use it for saving".

And for subcrawlers, we copy the content of the private property.
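
The approach described above could be sketched roughly as follows. This is a hypothetical, standalone illustration: the class `ParserSelection`, its method names, and the factory callable are assumptions made for the example and are not the Crawler's actual code.

```php
<?php

// Hypothetical sketch; names are illustrative, not Symfony's.
final class ParserSelection
{
    /** @var bool Whether HTML5 parsing was requested at construction time. */
    private $preferHtml5;

    /** @var callable Builds the HTML5 parser (e.g. a Masterminds\HTML5 instance). */
    private $factory;

    /** @var object|null Set only once HTML has actually been parsed with HTML5. */
    private $html5Parser;

    public function __construct(bool $preferHtml5, callable $factory)
    {
        $this->preferHtml5 = $preferHtml5;
        $this->factory = $factory;
    }

    /** Called when HTML is loaded; returns the HTML5 parser, or null for the native one. */
    public function parserForHtml(): ?object
    {
        if ($this->preferHtml5 && null === $this->html5Parser) {
            $this->html5Parser = ($this->factory)();
        }

        return $this->html5Parser;
    }

    /** True only if the HTML5 parser was really used, so saving mirrors parsing. */
    public function shouldSaveWithHtml5(): bool
    {
        return null !== $this->html5Parser;
    }

    /** Sub-crawlers copy the parent's state instead of guessing again. */
    public function forSubCrawler(): self
    {
        $child = new self($this->preferHtml5, $this->factory);
        $child->html5Parser = $this->html5Parser;

        return $child;
    }
}
```

With this shape, loading XML or existing DOM nodes never touches `parserForHtml()`, so `shouldSaveWithHtml5()` stays false for content the HTML5 parser did not produce.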

}

if ($useHtml5Parser ?? class_exists(HTML5::class)) {
$this->html5Parser = new HTML5(['disable_html_ns' => true]);
Member

When creating a child crawler, you should not rely on guessing but pass the existing value used for the parsing (or even better, assign the actual parser instead of instantiating a new one).

Contributor Author

You mean in the createSubCrawler method?

Member

yes

@stof
Member

stof commented Apr 3, 2019

Using a constructor argument has a big drawback (but the previous implementation using a setter that must be called before loading the content has the same drawback): most people in Symfony don't instantiate a Crawler themselves. They use BrowserKit which manages this instantiation. This means they don't have direct access to anything happening before adding content.

javiereguiluz added a commit to symfony/symfony-docs that referenced this pull request Apr 5, 2019
…y (tgalopin)

This PR was merged into the master branch.

Discussion
----------

[DomCrawler][WIP] Add note about the HTML5 parser library

Documentation for the PR symfony/symfony#29306.

Commits
-------

6e2f04a [DomCrawler] Add note about the HTML5 parser library
fabpot added a commit that referenced this pull request Apr 6, 2019
…on (tgalopin)

This PR was merged into the 4.3-dev branch.

Discussion
----------

[DomCrawler] Improve Crawler HTML5 parser need detection

| Q             | A
| ------------- | ---
| Branch?       | master
| Bug fix?      | kind of
| New feature?  | no
| BC breaks?    | no
| Deprecations? | no
| Tests pass?   | yes
| Fixed tickets | -
| License       | MIT
| Doc PR        | -

Live from #eu-fossa

Follow up of #29306

This PR introduces a better detection mechanism to choose when to parse using the HTML5 parser or not, and fixes a subcrawler parsing issue as well.

@stof I'd be super interested in your review :) !

Commits
-------

9bbdab6 [DomCrawler] Improve Crawler HTML5 parser need detection
"masterminds/html5": "^2.6"
},
"conflict": {
"masterminds/html5": "<2.6"
},
"suggest": {
"symfony/css-selector": ""
Contributor

Shouldn't there be an entry here that says you can install masterminds/html5?

Member

indeed, that would make sense.

Member

I'm against adding things under suggest, nobody reads them anyway. I would even go as far as removing the existing entries :)

I always read them!

9 participants