Skip to content

bpo-35922: Fix RobotFileParser when robots.txt is invalid #11791

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

remilapeyre
Copy link
Contributor

@remilapeyre remilapeyre commented Feb 8, 2019

I don't think this change requires a news entry.

https://bugs.python.org/issue35922

Copy link
Contributor

@taleinat taleinat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to testing with empty robots.txt content, it would be good to also test with:

  1. one not referencing the given user-agent and without a default
  2. one with a default or with the given user-agent, but which doesn't define Crawl_delay and Request_rate

@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

Co-Authored-By: Tal Einat <taleinat+github@gmail.com>
@remilapeyre remilapeyre requested a review from berkerpeksag as a code owner June 9, 2019 16:30
@remilapeyre
Copy link
Contributor Author

Thanks for the review @taleinat! I made all the changes you requested, let me know if you want me to make other updates.

I have made the requested changes; please review again

@bedevere-bot
Copy link

Thanks for making the requested changes!

@taleinat: please review the changes made to this pull request.

Copy link
Contributor

@taleinat taleinat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes are very well done, and they were done very quickly; nice!

Taking a deeper look at the structure of the existing tests, we would do better to build on the existing BaseRequestRateTest. Specifically, I think its test_request_rate() method should be changed to always check the crawl delay and request rate, so that setting them to None will check that RobotFileParser actually returns None.

Doing so should allow removing NoDefaultUserAgentTest, since that case will already be covered by the existing CrawlDelayAndRequestRateTest. EmptyFileTest should stay but not require any special logic, just define robots_txt to an empty stings. Similarly, NoRequestRateAndCrawlDelayTest could be written much more simply.

Let me know if you'd like to do this yourself or if you'd prefer me to make these changes myself.

@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

And if you don't make the requested changes, you will be put in the comfy chair!

@remilapeyre
Copy link
Contributor Author

Yeah, it will be better that way, I will push a new commit tomorrow.

@remilapeyre
Copy link
Contributor Author

Thanks @taleinat, I have made the requested changes; please review again

@bedevere-bot
Copy link

Thanks for making the requested changes!

@taleinat: please review the changes made to this pull request.

@miss-islington
Copy link
Contributor

@remilapeyre: Status check is done, and it's a failure ❌ .

@taleinat
Copy link
Contributor

Not sure what the status check is, @taleinat should I backport manually?

No, please ignore those messages. They seem to be caused by a bug in Miss Islington.

@miss-islington
Copy link
Contributor

@remilapeyre: Status check is done, and it's a failure ❌ .

@taleinat
Copy link
Contributor

Not sure what the status check is, @taleinat should I backport manually?

No, please ignore those messages. They seem to be caused by a bug in Miss Islington.

Or maybe Miss Islington was just made aware of the failure earlier...

@ZackerySpytz
Copy link
Contributor

The commit message should use "robots.txt", not "robot.txt".

@taleinat taleinat changed the title bpo-35922: Fix RobotFileParser when robot.txt is invalid bpo-35922: Fix RobotFileParser when robots.txt is invalid Jun 12, 2019
Copy link
Member

@berkerpeksag berkerpeksag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I just left very minor comments. Otherwise it looks good to me.

return self.default_entry.delay
if self.default_entry:
return self.default_entry.delay
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The else branch isn't needed:

if self.default_entry:
    return self.default_entry.delay
return None

return self.default_entry.req_rate
if self.default_entry:
return self.default_entry.req_rate
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

@@ -0,0 +1,4 @@
Fix :meth:`RobotFileParser.crawl_delay()` and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: () can be removed. Sphinx will it add it automatically.

@@ -0,0 +1,4 @@
Fix :meth:`RobotFileParser.crawl_delay()` and
:meth:`RobotFileParser.request_rate()` to return ``None`` rather than
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: () can be removed. Sphinx will it add it automatically.

self.request_rate.seconds
)


class EmptyFileTest(BaseRequestRateTest, unittest.TestCase):
robots_txt = ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Please use single quotes.

@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@taleinat
Copy link
Contributor

I have made the requested changes; please review again.

@taleinat
Copy link
Contributor

taleinat commented Jun 14, 2019

@berkerpeksag, just awaiting your final okay, and I'll go about merging and backporting this.

Copy link
Member

@berkerpeksag berkerpeksag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Thank you for fixing this.

@taleinat taleinat merged commit 8047e0e into python:master Jun 16, 2019
@miss-islington
Copy link
Contributor

Thanks @remilapeyre for the PR, and @taleinat for merging it 🌮🎉.. I'm working now to backport this PR to: 3.7, 3.8.
🐍🍒⛏🤖

@bedevere-bot
Copy link

GH-14121 is a backport of this pull request to the 3.8 branch.

@bedevere-bot
Copy link

GH-14122 is a backport of this pull request to the 3.7 branch.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Jun 16, 2019
…delay or request rate (pythonGH-11791)

Co-Authored-By: Tal Einat <taleinat+github@gmail.com>
(cherry picked from commit 8047e0e)

Co-authored-by: Rémi Lapeyre <remi.lapeyre@henki.fr>
miss-islington added a commit that referenced this pull request Jun 16, 2019
…delay or request rate (GH-11791)

Co-Authored-By: Tal Einat <taleinat+github@gmail.com>
(cherry picked from commit 8047e0e)

Co-authored-by: Rémi Lapeyre <remi.lapeyre@henki.fr>
miss-islington added a commit that referenced this pull request Jun 16, 2019
…delay or request rate (GH-11791)

Co-Authored-By: Tal Einat <taleinat+github@gmail.com>
(cherry picked from commit 8047e0e)

Co-authored-by: Rémi Lapeyre <remi.lapeyre@henki.fr>
lisroach pushed a commit to lisroach/cpython that referenced this pull request Sep 10, 2019
…delay or request rate (pythonGH-11791)

Co-Authored-By: Tal Einat <taleinat+github@gmail.com>
DinoV pushed a commit to DinoV/cpython that referenced this pull request Jan 14, 2020
…delay or request rate (pythonGH-11791)

Co-Authored-By: Tal Einat <taleinat+github@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants