[DomCrawler] Optionally use html5-php to parse HTML #29306
As the native implementation uses
5e439d7 to d0420c3
@tgalopin What's the status of this PR?
Waiting for Masterminds/html5-php#163 to be merged to pass tests here.
@tgalopin Upstream PR merged :)
Due to Masterminds/html5-php#139, shouldn't we use the
@tgalopin friendly ping
e21e17a to 14a454d
Tests are failing for an unrelated reason. I think this is ready to review.
Updated
3e61e24 to e0ca69a
    }

    /**
     * Convert charset to HTML-entities to ensure valid parsing.
Converts
Thank you @tgalopin.
…galopin)

This PR was squashed before being merged into the 4.3-dev branch (closes #29306).

Discussion
----------

[DomCrawler] Optionally use html5-php to parse HTML

| Q             | A
| ------------- | ---
| Branch?       | master
| Bug fix?      | no
| New feature?  | yes
| BC breaks?    | no
| Deprecations? | no
| Tests pass?   | WIP
| Fixed tickets | #29280, #28596
| License       | MIT
| Doc PR        | symfony/symfony-docs#10700

This PR introduces the possibility to parse HTML content in the Crawler using the html5-php library (https://github.com/Masterminds/html5-php). This allows for better support of HTML5 and fix many unexpected behaviors and inconsistencies of the native DOM extension.

Commits
-------

4050ec4 [DomCrawler] Optionally use html5-php to parse HTML

        "masterminds/html5": "^2.6"
    },
    "conflict": {
        "masterminds/html5": "<2.6"
We should also conflict with > 3 then
@@ -608,6 +601,15 @@ public function html(/* $default = null */)
            throw new \InvalidArgumentException('The current node list is empty.');
        }

        if (null !== $this->html5Parser) {
There is an issue here. You instantiate the HTML5 parser in the constructor even when the content added is not HTML5 but XML or existing DOM elements (coming from elsewhere than a parent crawler using HTML5). This means you might be saving with the HTML5 parser when it was not used for parsing.
How do you propose to improve this?
well, we need to distinguish 3 cases:
- we are parsing some HTML5
- we are parsing some older HTML
- we are not parsing HTML at all
The boolean argument in the constructor allows us to decide between the first 2 cases at the time we instantiate. But knowing whether this is HTML or not is not something the constructor knows (as the content can be added later).
The solution might be to store the boolean as a property. Then, based on that, we would decide which parsing strategy to use when we load HTML, and instantiate the HTML5 parser only if needed.
Then, here, we can keep saying "if I used an HTML5 parser, I also use it for saving".
And for subcrawlers, we copy the content of the private property.
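The proposal above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the class `SketchCrawler` and its methods `addHtmlContent()`, `addXmlContent()`, `usedHtml5Parser()` and the simplified `html()` (which serializes the whole document rather than the first node) are hypothetical names chosen for the example. The only html5-php calls assumed are the `HTML5` constructor with the `disable_html_ns` option, `loadHTML()` and `saveHTML()`, and they are only reached when the library is installed.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical sketch of the "store the flag, create the parser lazily" idea.
 * It is NOT Symfony's Crawler; all names here are illustrative.
 */
final class SketchCrawler
{
    /** null means "auto-detect from whether html5-php is installed". */
    private ?bool $useHtml5Parser;

    /** Set only when the HTML5 parser was really used to load content. */
    private ?object $html5Parser = null;

    private ?\DOMDocument $document = null;

    public function __construct(?bool $useHtml5Parser = null)
    {
        // Only remember the preference; no parser is instantiated here.
        $this->useHtml5Parser = $useHtml5Parser;
    }

    public function addHtmlContent(string $content): void
    {
        // Cases 1 and 2: HTML is being parsed, so the flag decides the strategy.
        if ($this->useHtml5Parser ?? class_exists('Masterminds\HTML5')) {
            $this->html5Parser = new \Masterminds\HTML5(['disable_html_ns' => true]);
            $this->document = $this->html5Parser->loadHTML($content);

            return;
        }

        // Native DOM extension; silence libxml warnings about unknown tags.
        $previous = libxml_use_internal_errors(true);
        $document = new \DOMDocument();
        $document->loadHTML($content);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
        $this->document = $document;
    }

    public function addXmlContent(string $content): void
    {
        // Case 3: not HTML at all, so the HTML5 parser is never created.
        $document = new \DOMDocument();
        $document->loadXML($content);
        $this->document = $document;
    }

    public function html(): string
    {
        if (null === $this->document) {
            throw new \LogicException('No content has been loaded.');
        }

        // "If I used an HTML5 parser, I also use it for saving."
        if (null !== $this->html5Parser) {
            return $this->html5Parser->saveHTML($this->document);
        }

        return $this->document->saveHTML();
    }

    public function createSubCrawler(): self
    {
        // Copy the parent's state instead of guessing again.
        $child = new self($this->useHtml5Parser);
        $child->html5Parser = $this->html5Parser;
        $child->document = $this->document;

        return $child;
    }

    public function usedHtml5Parser(): bool
    {
        return null !== $this->html5Parser;
    }
}
```

With this shape, a crawler that only ever receives XML never instantiates the HTML5 parser, and a sub-crawler serializes with whatever its parent actually parsed with.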
        }

        if ($useHtml5Parser ?? class_exists(HTML5::class)) {
            $this->html5Parser = new HTML5(['disable_html_ns' => true]);
When creating a child crawler, you should not rely on guessing but pass the existing value used for the parsing (or even better, assign the actual parser instead of instantiating a new one).
You mean in the createSubCrawler method?
yes
Using a constructor argument has a big drawback (but the previous implementation using a setter that must be called before loading the content has the same drawback): most people in Symfony don't instantiate a Crawler themselves. They use BrowserKit which manages this instantiation. This means they don't have direct access to anything happening before adding content.
…y (tgalopin)

This PR was merged into the master branch.

Discussion
----------

[DomCrawler][WIP] Add note about the HTML5 parser library

Documentation for the PR symfony/symfony#29306.

Commits
-------

6e2f04a [DomCrawler] Add note about the HTML5 parser library
…on (tgalopin)

This PR was merged into the 4.3-dev branch.

Discussion
----------

[DomCrawler] Improve Crawler HTML5 parser need detection

| Q             | A
| ------------- | ---
| Branch?       | master
| Bug fix?      | kind of
| New feature?  | no
| BC breaks?    | no
| Deprecations? | no
| Tests pass?   | yes
| Fixed tickets | -
| License       | MIT
| Doc PR        | -

Live from #eu-fossa

Follow up of #29306

This PR introduces a better detection mechanism to choose when to parse using the HTML5 parser or not, and fix a subcrawler parsing issue as well.

@stof I'd be super interested by your review :) !

Commits
-------

9bbdab6 [DomCrawler] Improve Crawler HTML5 parser need detection

        "masterminds/html5": "^2.6"
    },
    "conflict": {
        "masterminds/html5": "<2.6"
    },
    "suggest": {
        "symfony/css-selector": ""
Shouldn't there be an entry here that describes that you can load masterminds/html5?
indeed, that would make sense.
I'm against adding things under suggest, nobody reads them anyway. I would even go as far as removing the existing entries :)
I always read them!
This PR introduces the possibility to parse HTML content in the Crawler using the html5-php library (https://github.com/Masterminds/html5-php). This allows for better support of HTML5 and fixes many unexpected behaviors and inconsistencies of the native DOM extension.