gh-92088: Potential Performance Improvements #92084

be-thomas · 2022-04-30T14:26:31Z

Issue: ParserBase could be optimized #92088

ghost · 2022-04-30T14:26:33Z

The following commit authors need to sign the Contributor License Agreement:

thomasbenardo96@gmail.com

bedevere-bot · 2022-04-30T14:26:34Z

Every change to Python requires a NEWS entry.

Please, add it using the blurb_it Web app or the blurb command-line tool.

be-thomas · 2022-04-30T14:39:05Z

I'm porting html.parser, along with the ParserBase class, to Lua.
I would love to submit a pull request for every performance improvement opportunity that I find.

corona10

Please create an issue first and provide the proper benchmark for the optimization.

bedevere-bot · 2022-04-30T16:14:24Z

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

Akuli · 2022-04-30T16:18:58Z

Lib/_markupbase.py

                else:
                    return -1
-                while rawdata[j:j+1].isspace():
+                while rawdata[j].isspace():


If j becomes len(rawdata), this will now fail with IndexError. Previously the loop just stopped and it went into the if statement below.

Ok, that makes sense.
It would require two checks to make this behave properly (j < len(rawdata) and rawdata[j].isspace()).
The performance improvement is not worth the added complexity.
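
The difference discussed in this review thread can be reproduced outside the parser. The following is a minimal sketch; the three helper functions are hypothetical names written for illustration and are not part of Lib/_markupbase.py. It shows that slicing past the end of a string yields an empty string (for which isspace() is False), while indexing past the end raises IndexError.

```python
def skip_space_slice(rawdata: str, j: int) -> int:
    """Original form: rawdata[j:j+1] is "" at the end, and "".isspace() is False."""
    while rawdata[j:j+1].isspace():
        j += 1
    return j


def skip_space_unchecked(rawdata: str, j: int) -> int:
    """Form proposed in the diff: raises IndexError once j reaches len(rawdata)."""
    while rawdata[j].isspace():
        j += 1
    return j


def skip_space_checked(rawdata: str, j: int) -> int:
    """Index-based form with the explicit bounds check that indexing needs."""
    n = len(rawdata)
    while j < n and rawdata[j].isspace():
        j += 1
    return j


if __name__ == "__main__":
    data = "ab   "  # trailing whitespace runs to the end of the string
    print(skip_space_slice(data, 2))    # loop stops quietly at len(data)
    print(skip_space_checked(data, 2))  # same result, two checks per step
    try:
        skip_space_unchecked(data, 2)
    except IndexError as exc:
        print("IndexError:", exc)
```

The checked variant matches the slice-based behavior, at the cost of the second condition mentioned above.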

bedevere-bot · 2022-04-30T17:14:29Z

Every change to Python requires a NEWS entry.

Please, add it using the blurb_it Web app or the blurb command-line tool.

be-thomas · 2022-04-30T17:20:23Z

There is a lot of code that wastes CPU time in this style (it appears multiple times):

if ")" in rawdata[j:]:
    j = rawdata.find(")", j) + 1

The string is searched twice, even though a single search would be enough.
Because strings are immutable, every slice creates a new string on the fly.
Moreover, the slice rawdata[j:] can be quite expensive, depending on the size of the string.
We could eliminate the slicing altogether and use a single find operation, in this style:

RPAREN_pos = rawdata.find(")", j)
if RPAREN_pos != -1:
    j = RPAREN_pos + 1
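
To check that the two styles above agree, they can be wrapped in small standalone functions. This is only a sketch: the function names are hypothetical, and leaving j unchanged when no ")" is found is an assumption made for the comparison, not necessarily what the surrounding parser code does in that case.

```python
def advance_past_rparen_twice(rawdata: str, j: int) -> int:
    """Existing style: slice + membership test, then a second search with find()."""
    if ")" in rawdata[j:]:
        j = rawdata.find(")", j) + 1
    return j


def advance_past_rparen_once(rawdata: str, j: int) -> int:
    """Proposed style: a single find() starting at j, with no slice created."""
    rparen_pos = rawdata.find(")", j)
    if rparen_pos != -1:
        j = rparen_pos + 1
    return j


if __name__ == "__main__":
    samples = [("(a|b) x", 0), ("(a|b", 0), (")abc", 1), ("", 0)]
    for rawdata, start in samples:
        # Both helpers should return the same position for every sample.
        assert advance_past_rparen_twice(rawdata, start) == \
            advance_past_rparen_once(rawdata, start)
    print("both styles agree on all samples")
```

str.find accepts a start index, so the single-call form searches the same region as the slice without copying it.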

I'm new to open source contributions. I would love to learn from others' coding styles and hear other points of view.

ezio-melotti · 2022-04-30T20:05:27Z

I'm porting html.parser, along with the ParserBase class, to Lua. I would love to submit a pull request for every performance improvement opportunity that I find.

You should create an issue to discuss your changes, and possibly several PRs linked to the issue for the optimizations you find. The optimizations should also be confirmed by benchmarks whenever possible. I should have around some code I used to test/benchmark that I might publish if you think it might be useful.

While you are at it, it would also be useful to improve testing (this will be especially useful for you if you are validating your parser against the HTMLParser test suite) and possibly help review/fix related HTMLParser issues.

be-thomas · 2022-05-01T07:46:25Z

I have created an issue: #92088.
Also, I haven't done much automated testing.
I'm not sure how to write the benchmarks. Do I have to prepare a few HTML files and check the parsing order?
Any resources would be really helpful.
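
One possible starting point for a benchmark is a micro-benchmark with the standard library's timeit module, run on the isolated pattern rather than on whole HTML files. The sketch below is an illustration under stated assumptions: find_twice, find_once, and bench are hypothetical helpers, and the input size and repeat counts are arbitrary choices. It is not the benchmark code mentioned earlier in this thread.

```python
import timeit


def find_twice(rawdata: str, j: int) -> int:
    """Existing pattern: slice + `in`, then find()."""
    if ")" in rawdata[j:]:
        j = rawdata.find(")", j) + 1
    return j


def find_once(rawdata: str, j: int) -> int:
    """Proposed pattern: a single find() from position j."""
    pos = rawdata.find(")", j)
    if pos != -1:
        j = pos + 1
    return j


def bench(func, rawdata: str, j: int = 0, number: int = 2000, repeat: int = 5) -> float:
    """Return the best observed time per call, in seconds."""
    timer = timeit.Timer(lambda: func(rawdata, j))
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == "__main__":
    # Arbitrary input: a long run of text with the ")" at the very end.
    rawdata = "x" * 100_000 + ")"
    # Confirm both versions agree before timing them.
    assert find_twice(rawdata, 0) == find_once(rawdata, 0)
    for func in (find_twice, find_once):
        print(f"{func.__name__}: {bench(func, rawdata) * 1e6:.2f} us per call")
```

Taking the minimum over several repeats reduces noise from other processes. Results will vary by machine and Python version, so the numbers should be reported alongside the setup used.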

rhettinger · 2022-07-07T19:04:15Z

Lib/_markupbase.py

@@ -276,13 +276,14 @@ def _parse_doctype_attlist(self, i, declstartpos):
                return -1
            if c == "(":
                # an enumerated type; look for ')'
-                if ")" in rawdata[j:]:
-                    j = rawdata.find(")", j) + 1
+                _temp = rawdata.find(")", j)


Instead of _temp, consider closer_pos or something else similarly descriptive.

python-cla-bot · 2025-04-06T14:29:13Z

The following commit authors need to sign the Contributor License Agreement:

thomasbenardo96@gmail.com

Potential Performance Improvements

f5af6e3

be-thomas requested a review from ezio-melotti as a code owner April 30, 2022 14:26

bedevere-bot added the awaiting review label Apr 30, 2022

corona10 requested changes Apr 30, 2022

bedevere-bot removed the awaiting review label Apr 30, 2022

bedevere-bot added the awaiting changes label Apr 30, 2022

Akuli reviewed Apr 30, 2022

Update _markupbase.py

c45b527

be-thomas mentioned this pull request Apr 30, 2022

ParserBase could be optimized #92088

Open

ezio-melotti self-assigned this May 1, 2022

AlexWaygood added the performance Performance or resource usage label May 4, 2022

rhettinger reviewed Jul 7, 2022

rhettinger assigned serhiy-storchaka Jul 7, 2022

ezio-melotti changed the title from "Potential Performance Improvements" to "gh-92088: Potential Performance Improvements" Jul 15, 2022
