bpo-35922: Fix RobotFileParser when robots.txt is invalid #11791
Conversation
In addition to testing with empty robots.txt content, it would be good to also test with:
- one not referencing the given user-agent and without a default
- one with a default or with the given user-agent, but which doesn't define Crawl_delay and Request_rate
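A quick sketch of those two extra cases (the agent names here are hypothetical, and the behavior shown assumes this fix is applied):

```python
from urllib.robotparser import RobotFileParser

# Case 1: robots.txt that names some other user-agent and has no '*' default.
no_default = [
    'User-agent: somebot',
    'Disallow: /private/',
]

# Case 2: a '*' default entry that defines neither Crawl-delay nor Request-rate.
no_rates = [
    'User-agent: *',
    'Disallow: /private/',
]

for lines in (no_default, no_rates):
    parser = RobotFileParser()
    parser.parse(lines)
    # With this fix, both lookups return None instead of raising
    # AttributeError when no matching entry defines them.
    print(parser.crawl_delay('testbot'), parser.request_rate('testbot'))
```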
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers, that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase "I have made the requested changes; please review again".
Co-Authored-By: Tal Einat <taleinat+github@gmail.com>
Thanks for the review @taleinat! I made all the changes you requested; let me know if you want me to make other updates. I have made the requested changes; please review again.
Thanks for making the requested changes! @taleinat: please review the changes made to this pull request.
These changes are very well done, and they were done very quickly; nice!
Taking a deeper look at the structure of the existing tests, we would do better to build on the existing BaseRequestRateTest. Specifically, I think its test_request_rate() method should be changed to always check the crawl delay and request rate, so that setting them to None will check that RobotFileParser actually returns None.
Doing so should allow removing NoDefaultUserAgentTest, since that case will already be covered by the existing CrawlDelayAndRequestRateTest. EmptyFileTest should stay but not require any special logic, just defining robots_txt as an empty string. Similarly, NoRequestRateAndCrawlDelayTest could be written much more simply.
Let me know if you'd like to do this yourself or if you'd prefer me to make these changes myself.
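A minimal, self-contained sketch of the structure being proposed (a simplified stand-in for the real helper in Lib/test/test_robotparser.py; the attribute names follow the reviewer's suggestion and are not the exact upstream code):

```python
import unittest
from urllib.robotparser import RobotFileParser

class BaseRequestRateTest:
    # Always check crawl delay and request rate, so a subclass that sets
    # them to None verifies that RobotFileParser actually returns None.
    robots_txt = ''
    agent = 'test_robotparser'
    crawl_delay = None
    request_rate = None

    def setUp(self):
        self.parser = RobotFileParser()
        self.parser.parse(self.robots_txt.splitlines())

    def test_request_rate(self):
        self.assertEqual(self.parser.crawl_delay(self.agent), self.crawl_delay)
        rate = self.parser.request_rate(self.agent)
        if self.request_rate is None:
            self.assertIsNone(rate)
        else:
            self.assertEqual(rate.requests, self.request_rate.requests)
            self.assertEqual(rate.seconds, self.request_rate.seconds)

class EmptyFileTest(BaseRequestRateTest, unittest.TestCase):
    # No special logic needed: the base class already checks that both
    # accessors return None for empty robots.txt content.
    robots_txt = ''
```

With this shape, a test class that expects real values simply overrides `crawl_delay` and `request_rate` with non-None expectations.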
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers, that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase "I have made the requested changes; please review again". And if you don't make the requested changes, you will be put in the comfy chair!
Yeah, it will be better that way; I will push a new commit tomorrow.
Thanks @taleinat, I have made the requested changes; please review again.
Thanks for making the requested changes! @taleinat: please review the changes made to this pull request.
@remilapeyre: Status check is done, and it's a failure ❌.
No, please ignore those messages. They seem to be caused by a bug in Miss Islington.
@remilapeyre: Status check is done, and it's a failure ❌.
Or maybe Miss Islington was just made aware of the failure earlier... |
The commit message should use "robots.txt", not "robot.txt". |
Good catch! I just left very minor comments. Otherwise it looks good to me.
Lib/urllib/robotparser.py (Outdated)

    return self.default_entry.delay
    if self.default_entry:
        return self.default_entry.delay
    else:
The else branch isn't needed:

    if self.default_entry:
        return self.default_entry.delay
    return None
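For context, the fixed fall-through behavior can be observed directly (a quick sketch; `'*'` is just an arbitrary agent name, and the None results assume a Python version with this fix):

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([])  # no entries at all, so default_entry stays None

# Both accessors fall through to `return None` instead of raising
# AttributeError on the missing default entry.
print(parser.crawl_delay('*'))   # None
print(parser.request_rate('*'))  # None
```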
Lib/urllib/robotparser.py (Outdated)

    return self.default_entry.req_rate
    if self.default_entry:
        return self.default_entry.req_rate
    else:
Same as above.
@@ -0,0 +1,4 @@
Fix :meth:`RobotFileParser.crawl_delay()` and
Nit: the () can be removed; Sphinx will add it automatically.
@@ -0,0 +1,4 @@
Fix :meth:`RobotFileParser.crawl_delay()` and
:meth:`RobotFileParser.request_rate()` to return ``None`` rather than
Nit: the () can be removed; Sphinx will add it automatically.
Lib/test/test_robotparser.py (Outdated)

    self.request_rate.seconds
    )

    class EmptyFileTest(BaseRequestRateTest, unittest.TestCase):
        robots_txt = ""
Nit: Please use single quotes.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers, that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase "I have made the requested changes; please review again".
I have made the requested changes; please review again.
@berkerpeksag, just awaiting your final okay, and I'll go about merging and backporting this.
Good catch! Thank you for fixing this.
Thanks @remilapeyre for the PR, and @taleinat for merging it 🌮🎉. I'm working now to backport this PR to: 3.7, 3.8.
GH-14121 is a backport of this pull request to the 3.8 branch.
GH-14122 is a backport of this pull request to the 3.7 branch.
…delay or request rate (pythonGH-11791) Co-Authored-By: Tal Einat <taleinat+github@gmail.com> (cherry picked from commit 8047e0e) Co-authored-by: Rémi Lapeyre <remi.lapeyre@henki.fr>
…delay or request rate (pythonGH-11791) Co-Authored-By: Tal Einat <taleinat+github@gmail.com>
…delay or request rate (pythonGH-11791) Co-Authored-By: Tal Einat <taleinat+github@gmail.com>
I don't think this change requires a news entry.
https://bugs.python.org/issue35922