Skip to content

gh-92088: Potential Performance Improvements #92084

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

be-thomas
Copy link

@be-thomas be-thomas commented Apr 30, 2022

@be-thomas be-thomas requested a review from ezio-melotti as a code owner April 30, 2022 14:26
@ghost
Copy link

ghost commented Apr 30, 2022

The following commit authors need to sign the Contributor License Agreement:

Click the button to sign:
CLA not signed

@bedevere-bot
Copy link

Every change to Python requires a NEWS entry.

Please, add it using the blurb_it Web app or the blurb command-line tool.

@be-thomas
Copy link
Author

I'm porting the html.parser along with the ParserBase class to Lua.
I would love to submit pull request on every performance improvement opportunity that I find.

Copy link
Member

@corona10 corona10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please create an issue first and provide the proper benchmark for the optimization.

@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

else:
return -1
while rawdata[j:j+1].isspace():
while rawdata[j].isspace():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If j becomes len(rawdata), this will now fail with IndexError. Previously the loop just stopped and it went into the if statement below.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that makes sense.
It would require two checks to make this behave properly (j < len(rawdata) and rawdata[j].isspace()).
performance improvement for this is not worth the complexity

@bedevere-bot
Copy link

Every change to Python requires a NEWS entry.

Please, add it using the blurb_it Web app or the blurb command-line tool.

@be-thomas
Copy link
Author

be-thomas commented Apr 30, 2022

There is a lot of wasteful CPU code in this fashion (seen multiple times) :-

if ")" in rawdata[j:]:
    j = rawdata.find(")", j) + 1

search performed in string twice, even if we could just do it once.
Consider the strings immutable, then only every slice a new string is created on the fly.
moreover, the slicing done at rawdata[j:], is bound to be quite expensive depending on the size of the string.
We could eliminate slicing altogether and only have one find operation in this style.

RPAREN_pos = rawdata.find(")", j)
if find_RPAREN != -1:
    j = RPAREN_pos + 1

I'm new to Open Source Code Contributions. Would love to learn from other's coding style & know other's point of view.

@ezio-melotti
Copy link
Member

I'm porting the html.parser along with the ParserBase class to Lua. I would love to submit pull request on every performance improvement opportunity that I find.

You should create an issue to discuss your changes, and possibly several PRs linked to the issue for the optimizations you find. The optimizations should also be confirmed by benchmarks whenever possible. I should have around some code I used to test/benchmark that I might publish if you think it might be useful.

While you are at it, it would also be useful to improve testing (this will be especially useful for you if you are validating your parser against the HTMLParser test suite) and possibly helpreview/fix related HTMLParser issues.

@be-thomas
Copy link
Author

I have created an issue #92088
Also, I haven't done much automated testing.
Not sure how to make the benchmarks. Do I have to prepare a few html files and check it's parsing order?
Any resources would be really helpful.

@ezio-melotti ezio-melotti self-assigned this May 1, 2022
@AlexWaygood AlexWaygood added the performance Performance or resource usage label May 4, 2022
@@ -276,13 +276,14 @@ def _parse_doctype_attlist(self, i, declstartpos):
return -1
if c == "(":
# an enumerated type; look for ')'
if ")" in rawdata[j:]:
j = rawdata.find(")", j) + 1
_temp = rawdata.find(")", j)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of _temp, consider closer_pos or something else similarly descriptive.

@ezio-melotti ezio-melotti changed the title Potential Performance Improvements gh-92088: Potential Performance Improvements Jul 15, 2022
@python-cla-bot
Copy link

python-cla-bot bot commented Apr 6, 2025

The following commit authors need to sign the Contributor License Agreement:

CLA signed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting changes performance Performance or resource usage
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants