-
-
Notifications
You must be signed in to change notification settings - Fork 31.8k
gh-92088: Potential Performance Improvements #92084
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
be-thomas
commented
Apr 30, 2022
•
edited by bedevere-bot
Loading
edited by bedevere-bot
- Issue: ParserBase could be optimized #92088
The following commit authors need to sign the Contributor License Agreement: |
Every change to Python requires a NEWS entry. Please, add it using the blurb_it Web app or the blurb command-line tool. |
I'm porting the html.parser along with the ParserBase class to Lua. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please create an issue first and provide the proper benchmark for the optimization.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
Lib/_markupbase.py
Outdated
else: | ||
return -1 | ||
while rawdata[j:j+1].isspace(): | ||
while rawdata[j].isspace(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If j
becomes len(rawdata)
, this will now fail with IndexError
. Previously the loop just stopped and it went into the if statement below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, that makes sense.
It would require two checks to make this behave properly (j < len(rawdata) and rawdata[j].isspace()).
performance improvement for this is not worth the complexity
Every change to Python requires a NEWS entry. Please, add it using the blurb_it Web app or the blurb command-line tool. |
There is a lot of wasteful CPU code in this fashion (seen multiple times) :-
search performed in string twice, even if we could just do it once.
I'm new to Open Source Code Contributions. Would love to learn from other's coding style & know other's point of view. |
You should create an issue to discuss your changes, and possibly several PRs linked to the issue for the optimizations you find. The optimizations should also be confirmed by benchmarks whenever possible. I should have around some code I used to test/benchmark that I might publish if you think it might be useful. While you are at it, it would also be useful to improve testing (this will be especially useful for you if you are validating your parser against the HTMLParser test suite) and possibly helpreview/fix related HTMLParser issues. |
I have created an issue #92088 |
@@ -276,13 +276,14 @@ def _parse_doctype_attlist(self, i, declstartpos): | |||
return -1 | |||
if c == "(": | |||
# an enumerated type; look for ')' | |||
if ")" in rawdata[j:]: | |||
j = rawdata.find(")", j) + 1 | |||
_temp = rawdata.find(")", j) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of _temp
, consider closer_pos
or something else similarly descriptive.
The following commit authors need to sign the Contributor License Agreement: |