
MNT Add robots.txt to avoid indexing of old version doc #30685


Merged — 1 commit merged into main, Jan 21, 2025

Conversation

@lesteve (Member) commented on Jan 21, 2025

Fixes #8958.

After rereading the issue, adding a robots.txt seems like the simplest thing to do. I think this is worth trying for a few weeks to see whether it helps. ReadTheDocs, for example, recommends a robots.txt for this purpose.

For now I made the choice of excluding everything from indexing except:

  • /stable
  • /dev/developers to allow indexing of the developer doc

This can definitely be tweaked if you have better suggestions.
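A robots.txt implementing the rules above might look like this (a sketch of the intent described in the PR; the exact file merged may differ). Note that modern crawlers apply the most specific (longest) matching rule, so the Allow lines take precedence over the blanket Disallow:

```text
User-agent: *
Allow: /stable
Allow: /dev/developers
Disallow: /
```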

I manually tested the robots.txt with https://robotstxt.com/tester and it seems to do what we want.
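Besides an online tester, such rules can also be checked locally with Python's standard-library urllib.robotparser (a sketch assuming the Allow/Disallow rules described above; the version paths like /1.3/... are illustrative):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules mirroring the PR's intent: allow /stable and
# /dev/developers, disallow everything else.
RULES = """\
User-agent: *
Allow: /stable
Allow: /dev/developers
Disallow: /
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

base = "https://scikit-learn.org"
for path in ("/stable/index.html",
             "/dev/developers/contributing.html",
             "/dev/index.html",
             "/1.3/modules/svm.html"):
    verdict = "allowed" if rp.can_fetch("*", base + path) else "blocked"
    print(path, "->", verdict)
```

Python's parser applies the first matching rule in file order, so listing the Allow lines before `Disallow: /` gives the intended behavior: only /stable and /dev/developers remain crawlable.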

@lesteve lesteve changed the title MNT Add robots.txt to avoid indexing on old version doc MNT Add robots.txt to avoid indexing of old version doc Jan 21, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 2857e8b.

@jeremiedbb (Member) commented:

Reading some docs about robots.txt, it is stated that it controls crawling of the website but not indexing. Controlling indexing requires an HTML tag on every page, which is a lot more complex. So let's first try this and see if it's good enough for us before considering more advanced solutions.
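For reference, the per-page mechanism alluded to here is the robots meta tag (or the equivalent X-Robots-Tag HTTP response header), which would have to be injected into every page of every old doc version; a sketch:

```html
<!-- In the <head> of each old-version page; tells crawlers not to index it -->
<meta name="robots" content="noindex">
```

The header form, `X-Robots-Tag: noindex`, achieves the same but requires control over the web server's responses, which GitHub Pages does not offer; hence the complexity concern.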

@jeremiedbb jeremiedbb merged commit 61077dc into scikit-learn:main Jan 21, 2025
40 checks passed
@lesteve lesteve deleted the robots-txt branch January 21, 2025 09:18
jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Jan 21, 2025
@jeremiedbb jeremiedbb mentioned this pull request Jan 21, 2025
@lesteve (Member, Author) commented on Jan 21, 2025

Hmmm thinking about it a bit more, I think robots.txt is going to end up in scikit-learn.org/dev rather than at the root. Not sure this will work then ...

Maybe I can do a PR to the https://github.com/scikit-learn/scikit-learn.github.io repo adding robots.txt at the root? We can also revert this PR ...


About the approach, I agree with you that it seems the simplest thing to try. Since ReadTheDocs recommends using robots.txt, I guess this should work; we will see.

About robots.txt not being the "right" way: indeed, this is what #8958 (comment) was pointing at, but it also sounds more complex. We are also using rel=canonical, which should help (#8958 (comment)), but I think we use it only in some places (not 100% sure). Also, rel=canonical does not work for documentation pages that have been renamed or have disappeared.
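For context, rel=canonical is a per-page hint telling search engines which copy of a duplicated page to index, e.g. pointing an old-version page at its /stable counterpart (illustrative path and URL):

```html
<!-- In the <head> of e.g. /1.3/modules/svm.html -->
<link rel="canonical" href="https://scikit-learn.org/stable/modules/svm.html">
```

This is why it cannot cover renamed or removed pages: the canonical link must target an existing /stable URL for the hint to be honored.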

@jeremiedbb (Member) commented:

Hmmm thinking about it a bit more, I think robots.txt is going to end up in scikit-learn.org/dev rather than at the root. Not sure this will work then ...

So backporting to 1.6.X (see #30686) will put it in scikit-learn.org/stable? Then we probably need to do it directly on the website repo. And actually we already have a robots.txt there :)

@lesteve (Member, Author) commented on Jan 21, 2025

Then we probably need to do it directly on the website repo. And actually we already have a robots.txt there :)

Yep, I opened scikit-learn/scikit-learn.github.io#22 on the website repo.

lesteve added a commit that referenced this pull request Jan 21, 2025
@jeremiedbb (Member) commented:

Let's revert this one

@lesteve (Member, Author) commented on Jan 21, 2025

I opened #30687 to revert the robots.txt addition to the scikit-learn/scikit-learn repo.

@GaelVaroquaux (Member) commented:

Thanks heaps!!

3 participants