Skip to content

MAINT Codespell configuration #21051

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 4, 2021

Conversation

DimitriPapadopoulos
Copy link
Contributor

@DimitriPapadopoulos DimitriPapadopoulos commented Sep 15, 2021

What does this implement/fix? Explain your changes.

Add codespell to CI and fix existing typos.

Any other comments?

Not sure where to put file codespell_ignore_words.txt . Can you help?

Finally, should this really go into the Changelog?

@DimitriPapadopoulos
Copy link
Contributor Author

DimitriPapadopoulos commented Sep 15, 2021

I believe (some of) the linter issues predate my own changes (probably an effect of dos2unix).

@glemaitre
Copy link
Member

We used black only on the code source and ignore the examples and tutorial up to now. In addition, we run flake8 only on the diff of each PR. If the spelling was inside of the tutorial, and this tutorial is not PEP8 compliant, then it would be logical that an error is raised. Looking at a couple of the failures, I could spot such errors. It might be worth making these files PEP8 compliant just for this PR (only the one modified).

Regarding the usage of codespell in the project, I think that it could be good. I personally use something similar locally in my vs code and it could be nice to have fewer typos in the project.

any thoughts @adrinjalali @thomasjpfan @NicolasHug?

@NicolasHug
Copy link
Member

I'm generally happy with fixing and preventing typos, my only concern is the potential risk of false positives. Did you observe a lot of these in your local setup @glemaitre ?

As a technical detail on this PR, I feel like we should fix the CRLF issues in a separate PR. 6K+ lines of change is scary for us reviewers :p !

@DimitriPapadopoulos
Copy link
Contributor Author

I'll move the dos2unix commit to a separate PR.

@DimitriPapadopoulos DimitriPapadopoulos force-pushed the codespell branch 2 times, most recently from 20d625c to 484fa7e Compare September 15, 2021 12:23
@DimitriPapadopoulos
Copy link
Contributor Author

DimitriPapadopoulos commented Sep 15, 2021

False positives I have seen myself:

  • a few variables names
  • a couple codespell bugs (complies => compiles or theses => thesis)
  • names of contributors or names from the bibliography
  • typos put there on purpose (see sklearn/feature_extraction/_stop_words.py)

Have a look at .github/codespell_ignore_words.txt for the false positives in scikit-learn.

@glemaitre
Copy link
Member

I'm generally happy with fixing and preventing typos, my only concern is the potential risk of false positives. Did you observe a lot of these in your local setup @glemaitre ?

Locally I don't recall any false positives (or so few of them that I don't notice them) but I don't know if this extension uses the same engine.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am happy with spell checking in the CI.

To complete the PR, I think we need to add something in the contributor guide to describe how to run this locally.

@@ -0,0 +1,38 @@
aline
amoungst
ba
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the words in codespell_ignore_words.txt do not look like words?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aline is a person's name from bibliography.
amoungst is a valid variant of amongst from sklearn/feature_extraction/_stop_words.py
ba is a variable name

Copy link
Member

@thomasjpfan thomasjpfan Sep 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

amoungst is a valid variant of amongst from sklearn/feature_extraction/_stop_words.py

Is this safe to remove from the ignore list now that we ignore _stop_words.py?

Some of these variable names are not the best. In a follow up PR, we can start looking at the ignore list and try to remove some words and create better variable names for them.

Copy link
Contributor Author

@DimitriPapadopoulos DimitriPapadopoulos Sep 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! I will remove amoungst from the list.

Copy link
Contributor Author

@DimitriPapadopoulos DimitriPapadopoulos Sep 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not certain ba is an actual variable name. It might be a PDF command used when generating docs, or something similar, you cannot change. I'd have to run codespell without ignoring ba to tell you. Most variable names are actually (very) well chosen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed amoungst. Note that the compliesfalse positive might be the result of an actual codespell bug or new feature (codespell-project/codespell#2062).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ba is indeed a variable name is some cases, but it is also part of an URL!
http://ru.wikipedia.org/wiki/%D0%92%D0%B8%D0%BA%D0%B8%D0%BF%D0%B5%D0%B4%D0%B8%D1%8F

While codespell does have an --uri-ignore-words-list option, it has not been propagated to the GitHub action. I have opened another ticket for that (codespell-project/actions-codespell#36).

Comment on lines 19 to 20
skip: js,v*.rst,_stop_words.py
ignore_words_file: .github/codespell_ignore_words.txt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be placed in a RC file so it is easier to run locally? (https://github.com/codespell-project/codespell#using-a-config-file)

Copy link
Contributor Author

@DimitriPapadopoulos DimitriPapadopoulos Sep 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good Excellent idea, in Python projects I can put these in setup.cfg.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"happy.pleased",
"relaxing.calm",
"quiet.still",
"sad.lonely",
"angry.aggresive",
"angry.aggressive",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a problem: those typos are part of the original dataset hosted at openml.org. We should not fix those and we should find a way to silence false positives for this specific file without adding typos in the global codespell_ignore_words.txt file. Would it be possible to tell codespell to ignore some words in a specific file?

Copy link
Contributor Author

@DimitriPapadopoulos DimitriPapadopoulos Sep 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right, I had noted that I should ask about these, then forgot. Thank you for raising this issue.

No, there is no way to do that. You can only:

  • ignore words
  • exclude files
  • ignore full lines

No worries though. I'll undo the changes in sklearn/datasets/tests/test_openml.py and add these words to the list of false positives. It's not perfect, but we cannot aim at perfect here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted to the openml.org typos, aggresive and suprised added to the list of false positives.

@ogrisel
Copy link
Member

ogrisel commented Sep 16, 2021

If the false positive management and maintenance overhead to have to manage an ignore list is judged too much trouble to be worth it, I would be in favor of opening a dedicated ,one-shot PR that just fixes the currently identified typos, without the extra CI harness. Merging that one should not be controversial ;)

@glemaitre
Copy link
Member

glemaitre commented Sep 16, 2021 via email

@NicolasHug
Copy link
Member

Maybe we can still keep the ignore_word list etc and make this a step in the release process?

@DimitriPapadopoulos
Copy link
Contributor Author

DimitriPapadopoulos commented Sep 16, 2021

I believe the management of false positives in .github/codespell_ignore_words.txt is a small overhead. If you have a look at the list, it's pretty short. It won't happen that often.

Fixing the actual typos might be more annoying, it will happen more often - compare the number of words of codespell_ignore_words.txt with the number of typos fixed in this PR. Will all developers accept to have to fix their typos before their patches can be merged?

An alternative would be to show the results of codespell but not block the merge. It would be up to reviewers to let typos through or decide they're false positives, and perhaps update the list of false positives in codespell_ignore_words.txt later on, in batches. I don't know how feasible this is with the current GitHub action.

@DimitriPapadopoulos
Copy link
Contributor Author

DimitriPapadopoulos commented Sep 16, 2021

How to make that a step in the release process? Are the steps just documented or actually programmed?

In any case, I can split this into two PRs:

  • this PR to just fix current typos
  • another PR to integrate codespell into CI or the release process

@DimitriPapadopoulos
Copy link
Contributor Author

I have moved the actual typo fixes to #21069. It would be great if someone could merge that PR soon.

Then I can rebase this PR and use it to continue discussing the machinery to prevent future typos in this PR, either as part of CI, or as a step in the release process.

@DimitriPapadopoulos DimitriPapadopoulos force-pushed the codespell branch 3 times, most recently from a69551f to cfdd64d Compare September 21, 2021 05:09
@adrinjalali
Copy link
Member

I like the other PR fixing the typos, but I'm also wary of this being a part of the CI, especially with the way we need to globally ignore things. I'd be in favor of having this done periodically, at every release or once a year or so, rather than being a part of CI.

@DimitriPapadopoulos
Copy link
Contributor Author

I'll remove the GitHub action .github/workflows/codespell.yml.

Is it OK to keep the codespell parameters in setup.cfg?

[codespell]
skip = js,v*.rst,_stop_words.py
ignore-words = .github/codespell_ignore_words.txt

Since this won't be a GitHub action, I should probably move .github/codespell_ignore_words.txt elsewhere. Where? Perhaps a hidden file in the root directory such as .codespell_ignore_words.txt?

@DimitriPapadopoulos DimitriPapadopoulos force-pushed the codespell branch 2 times, most recently from 15da471 to ea3857e Compare September 21, 2021 09:40
@DimitriPapadopoulos DimitriPapadopoulos changed the title Add codespell to CI and fix existing typos Codespell configuration Sep 22, 2021
@ogrisel
Copy link
Member

ogrisel commented Sep 23, 2021

Since this won't be a GitHub action, I should probably move .github/codespell_ignore_words.txt elsewhere. Where? Perhaps a hidden file in the root directory such as .codespell_ignore_words.txt?

I would put that under build_tools.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running codespell locally is great. Thank you for working on these spelling errors @DimitriPapadopoulos !

setup.cfg Outdated
@@ -71,3 +71,7 @@ ignore =
sklearn/utils/_seq_dataset.pxd
sklearn/utils/_weight_vector.pyx
sklearn/utils/_weight_vector.pxd

[codespell]
skip = js,v*.rst,_stop_words.py
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some auto generated files that would good to skip:

Suggested change
skip = js,v*.rst,_stop_words.py
skip = *.js,./sklearn/feature_extraction/_stop_words.py,./doc/_build,./doc/auto_examples,./doc/modules/generated

I think expanding _stop_words.py to the full path makes it more explicit. I see that v*.rst is skipped because of names, but I think having spell check enabled for the changelog is a net win.

Copy link
Contributor Author

@DimitriPapadopoulos DimitriPapadopoulos Sep 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree about the full paths, but I'd rather wait for codespell-project/codespell#2058 to be taken into account.

About v*.rst, I've left these out because many projects don't want to modify the changelog, they want typos in there to remain as they are forever. I believe this is inspired by the GNU documentation on Change Logs. On the other hand, not everyone agrees with the GNU documentation, after all the changelog is viewed as a detailed dump of the VCS, which may appear a bit too detailed and obsolete these days. If many of you want me to fix the typos in the changelog, I'm happy to do that. However, I suggest this happens in a distinct PR, like #21069 for the rest of the typos.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it's js and not *.js, to exclude vendored JS scripts under doc/themes/scikit-learn-modern/static/js, but not other JS scripts (if any in the future).

@thomasjpfan thomasjpfan changed the title Codespell configuration MAINT Codespell configuration Sep 26, 2021
@DimitriPapadopoulos DimitriPapadopoulos force-pushed the codespell branch 2 times, most recently from e5064fd to 178e692 Compare September 27, 2021 09:28
@thomasjpfan
Copy link
Member

When I run codespell locally, I see that codespell is looking at my .git/logs folder. @DimitriPapadopoulos Are you seeing the same behavior?

@DimitriPapadopoulos
Copy link
Contributor Author

Indeed, that's because I had set it up for automated scans running on a git archive extraction, so without a .git directory. Fixed now by adding ./.git to the list of files/directories to skip.

I have also added a new build_tools/codespell_exclude.txt file to exclude whole lines instead of words, specifically for names in v*.rst.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think codespell_exclude.txt is a little too fragile. If there is any change in those lines, we may need to update codespell_exclude.txt. I am happy with the original version where we had the names & words in codespell_ignore_words.txt.

setup.cfg Outdated
@@ -71,3 +71,8 @@ ignore =
sklearn/utils/_seq_dataset.pxd
sklearn/utils/_weight_vector.pyx
sklearn/utils/_weight_vector.pxd

[codespell]
skip = ./.git,./doc/themes/scikit-learn-modern/static/js,./sklearn/feature_extraction/_stop_words.py,./doc/_build,./doc/auto_examples,./doc/modules/generated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May we add ./.mypy_cache here too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added ./.mypy_cache.

Using two exclusion mechanisms is complicated and error-prone, I agree with that. However, I'm worried that some of the words that we ignore will shadow very common typos, the kind you see in any large project. The most striking example is teh.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping that at least files v*.rst wouldn't change much.

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thomasjpfan thomasjpfan merged commit 6edbffd into scikit-learn:main Oct 4, 2021
@DimitriPapadopoulos DimitriPapadopoulos deleted the codespell branch October 4, 2021 13:36
@glemaitre glemaitre mentioned this pull request Oct 23, 2021
10 tasks
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Oct 23, 2021
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
glemaitre pushed a commit that referenced this pull request Oct 25, 2021
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants