MAINT Codespell configuration #21051

DimitriPapadopoulos · 2021-09-15T10:37:51Z

What does this implement/fix? Explain your changes.

Add codespell to CI and fix existing typos.

Any other comments?

Not sure where to put the file codespell_ignore_words.txt. Can you help?

Finally, should this really go into the Changelog?

DimitriPapadopoulos · 2021-09-15T10:42:27Z

I believe (some of) the linter issues predate my own changes (probably an effect of dos2unix).

glemaitre · 2021-09-15T11:42:37Z

Up to now, we have run black only on the source code and ignored the examples and tutorials. In addition, we run flake8 only on the diff of each PR. If a spelling fix touches a tutorial that is not PEP8 compliant, it is logical that an error is raised. Looking at a couple of the failures, I could spot such errors. It might be worth making these files PEP8 compliant just for this PR (only the ones modified).

Regarding the usage of codespell in the project, I think that it could be good. I personally use something similar locally in VS Code, and it would be nice to have fewer typos in the project.

Any thoughts @adrinjalali @thomasjpfan @NicolasHug?

NicolasHug · 2021-09-15T12:02:14Z

I'm generally happy with fixing and preventing typos, my only concern is the potential risk of false positives. Did you observe a lot of these in your local setup @glemaitre ?

As a technical detail on this PR, I feel like we should fix the CRLF issues in a separate PR. 6K+ lines of change is scary for us reviewers :p !

DimitriPapadopoulos · 2021-09-15T12:03:04Z

I'll move the dos2unix commit to a separate PR.

DimitriPapadopoulos · 2021-09-15T12:25:57Z

False positives I have seen myself:

- a few variable names
- a couple of codespell bugs (complies => compiles or theses => thesis)
- names of contributors or names from the bibliography
- typos put there on purpose (see sklearn/feature_extraction/_stop_words.py)

Have a look at .github/codespell_ignore_words.txt for the false positives in scikit-learn.

glemaitre · 2021-09-15T12:45:30Z

> I'm generally happy with fixing and preventing typos, my only concern is the potential risk of false positives. Did you observe a lot of these in your local setup @glemaitre ?

Locally I don't recall any false positives (or so few of them that I don't notice them) but I don't know if this extension uses the same engine.

thomasjpfan

I am happy with spell checking in the CI.

To complete the PR, I think we need to add something in the contributor guide to describe how to run this locally.

thomasjpfan · 2021-09-15T14:17:25Z

.github/codespell_ignore_words.txt

@@ -0,0 +1,38 @@
+aline
+amoungst
+ba


Some of the words in codespell_ignore_words.txt do not look like words?

- Aline is a person's name from the bibliography.
- amoungst is a valid variant of amongst from sklearn/feature_extraction/_stop_words.py
- ba is a variable name

amoungst is a valid variant of amongst from sklearn/feature_extraction/_stop_words.py

Is this safe to remove from the ignore list now that we ignore _stop_words.py?

Some of these variable names are not the best. In a follow-up PR, we can start looking at the ignore list and try to remove some words and create better variable names for them.

Yes! I will remove amoungst from the list.

Not certain ba is an actual variable name. It might be a PDF command used when generating docs, or something similar that you cannot change. I'd have to run codespell without ignoring ba to tell you. Most variable names are actually (very) well chosen.

I have removed amoungst. Note that the complies false positive might be the result of an actual codespell bug or new feature (codespell-project/codespell#2062).

ba is indeed a variable name in some cases, but it is also part of a URL!
http://ru.wikipedia.org/wiki/%D0%92%D0%B8%D0%BA%D0%B8%D0%BF%D0%B5%D0%B4%D0%B8%D1%8F

While codespell does have an --uri-ignore-words-list option, it has not been propagated to the GitHub action. I have opened another ticket for that (codespell-project/actions-codespell#36).

thomasjpfan · 2021-09-15T14:19:52Z

.github/workflows/codespell.yml

+          skip: js,v*.rst,_stop_words.py
+          ignore_words_file: .github/codespell_ignore_words.txt


Can this be placed in a RC file so it is easier to run locally? (https://github.com/codespell-project/codespell#using-a-config-file)

~~Good~~ Excellent idea, in Python projects I can put these in setup.cfg.

ogrisel

On top of https://github.com/scikit-learn/scikit-learn/pull/21051/files#r709263298 there is also the following:

ogrisel · 2021-09-16T12:31:56Z

sklearn/datasets/tests/test_openml.py

        "happy.pleased",
        "relaxing.calm",
        "quiet.still",
        "sad.lonely",
-        "angry.aggresive",
+        "angry.aggressive",


This is a problem: those typos are part of the original dataset hosted at openml.org. We should not fix those and we should find a way to silence false positives for this specific file without adding typos in the global codespell_ignore_words.txt file. Would it be possible to tell codespell to ignore some words in a specific file?

Ah, right, I had noted that I should ask about these, then forgot. Thank you for raising this issue.

No, there is no way to do that. You can only:

- ignore words
- exclude files
- ignore full lines

No worries though. I'll undo the changes in sklearn/datasets/tests/test_openml.py and add these words to the list of false positives. It's not perfect, but we cannot aim at perfect here.

Reverted to the openml.org typos; aggresive and suprised are now in the list of false positives.
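To make the limitation concrete, here is a toy sketch of the three mechanisms listed above. It is not codespell's implementation: the `MISSPELLINGS` table, the `find_typos` helper, and the `user_guide.rst` sample line are hypothetical. The point is only that an ignored word is ignored in every file, not just in the one file that needs it.

```python
import re

# Hypothetical miniature misspelling table (codespell ships a much larger one).
MISSPELLINGS = {"aggresive": "aggressive", "suprised": "surprised"}


def find_typos(files, ignore_words=(), skip_files=(), exclude_lines=()):
    """Return (filename, line_number, word) for each flagged word.

    files: mapping of filename -> text.
    ignore_words: words never reported, in any file.
    skip_files: filenames that are not scanned at all.
    exclude_lines: exact line contents that are never scanned.
    """
    ignored = {w.lower() for w in ignore_words}
    excluded = set(exclude_lines)
    hits = []
    for name, text in files.items():
        if name in skip_files:
            continue
        for lineno, line in enumerate(text.splitlines(), start=1):
            if line in excluded:
                continue
            for word in re.findall(r"[A-Za-z]+", line):
                key = word.lower()
                if key in MISSPELLINGS and key not in ignored:
                    hits.append((name, lineno, word))
    return hits


files = {
    # Label copied from the upstream dataset: must not be "fixed".
    "test_openml.py": '"angry.aggresive",',
    # Hypothetical genuine typo elsewhere in the project.
    "user_guide.rst": "An aggresive learning rate.",
}

# Ignoring the word globally silences the dataset label and the real typo.
print(find_typos(files, ignore_words=["aggresive"]))
# Skipping only the test file keeps the real typo visible.
print(find_typos(files, skip_files=["test_openml.py"]))
```

In this model, skipping the file or excluding the exact line keeps the genuine typo visible, at the cost of no longer checking the rest of that file or of having to track the line's exact content.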

ogrisel · 2021-09-16T13:29:16Z

If managing false positives and maintaining an ignore list is judged too much trouble to be worth it, I would be in favor of opening a dedicated, one-shot PR that just fixes the currently identified typos, without the extra CI harness. Merging that one should not be controversial ;)

glemaitre · 2021-09-16T13:31:28Z

+1 for Olivier's proposal.


-- Guillaume Lemaitre Scikit-learn @ Inria Foundation https://glemaitre.github.io/

NicolasHug · 2021-09-16T13:40:35Z

Maybe we can still keep the ignore_word list etc and make this a step in the release process?

DimitriPapadopoulos · 2021-09-16T14:09:03Z

I believe the management of false positives in .github/codespell_ignore_words.txt is a small overhead. If you have a look at the list, it's pretty short. It won't happen that often.

Fixing the actual typos might be more annoying, since it will happen more often: compare the number of words in codespell_ignore_words.txt with the number of typos fixed in this PR. Will all developers accept having to fix their typos before their patches can be merged?

An alternative would be to show the results of codespell but not block the merge. It would be up to reviewers to let typos through or decide they're false positives, and perhaps update the list of false positives in codespell_ignore_words.txt later on, in batches. I don't know how feasible this is with the current GitHub action.

DimitriPapadopoulos · 2021-09-16T14:13:12Z

How would we make that a step in the release process? Are the steps just documented or actually programmed?

In any case, I can split this into two PRs:

- this PR to just fix current typos
- another PR to integrate codespell into CI or the release process

DimitriPapadopoulos · 2021-09-16T14:57:51Z

I have moved the actual typo fixes to #21069. It would be great if someone could merge that PR soon.

Then I can rebase this PR and use it to continue discussing the machinery to prevent future typos, either as part of CI or as a step in the release process.

adrinjalali · 2021-09-21T08:24:40Z

I like the other PR fixing the typos, but I'm also wary of this being a part of the CI, especially with the way we need to globally ignore things. I'd be in favor of having this done periodically, at every release or once a year or so, rather than being a part of CI.

DimitriPapadopoulos · 2021-09-21T09:01:12Z

I'll remove the GitHub action .github/workflows/codespell.yml.

Is it OK to keep the codespell parameters in setup.cfg?

[codespell]
skip = js,v*.rst,_stop_words.py
ignore-words = .github/codespell_ignore_words.txt

Since this won't be a GitHub action, I should probably move .github/codespell_ignore_words.txt elsewhere. Where? Perhaps a hidden file in the root directory such as .codespell_ignore_words.txt?
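As a quick sanity check of such a section, the sketch below reads the `[codespell]` block quoted above with Python's standard `configparser`. It only shows what the two options contain once parsed; it makes no claim about how codespell itself parses its configuration, and the inlined `SETUP_CFG` string stands in for the real file.

```python
import configparser

# The [codespell] section from this comment, inlined so the example is self-contained.
SETUP_CFG = """
[codespell]
skip = js,v*.rst,_stop_words.py
ignore-words = .github/codespell_ignore_words.txt
"""

parser = configparser.ConfigParser()
parser.read_string(SETUP_CFG)
section = parser["codespell"]

# "skip" is a comma-separated list of file or directory patterns.
skip_patterns = [p.strip() for p in section["skip"].split(",") if p.strip()]
# "ignore-words" points at a file of words to ignore.
ignore_words_file = section["ignore-words"]

print(skip_patterns)
print(ignore_words_file)
```

If the ignore-words file moves out of .github, only the `ignore-words` value needs to change.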

ogrisel · 2021-09-23T08:33:57Z

> Since this won't be a GitHub action, I should probably move .github/codespell_ignore_words.txt elsewhere. Where? Perhaps a hidden file in the root directory such as .codespell_ignore_words.txt?

I would put that under build_tools.

thomasjpfan

Running codespell locally is great. Thank you for working on these spelling errors @DimitriPapadopoulos !

thomasjpfan · 2021-09-26T15:33:28Z

setup.cfg

@@ -71,3 +71,7 @@ ignore =
    sklearn/utils/_seq_dataset.pxd
    sklearn/utils/_weight_vector.pyx
    sklearn/utils/_weight_vector.pxd
+
+[codespell]
+skip = js,v*.rst,_stop_words.py


There are some auto-generated files that would be good to skip:

Suggested change

-skip = js,v*.rst,_stop_words.py
+skip = *.js,./sklearn/feature_extraction/_stop_words.py,./doc/_build,./doc/auto_examples,./doc/modules/generated

I think expanding _stop_words.py to the full path makes it more explicit. I see that v*.rst is skipped because of names, but I think having spell check enabled for the changelog is a net win.

I agree about the full paths, but I'd rather wait for codespell-project/codespell#2058 to be taken into account.

About v*.rst: I've left these out because many projects don't want to modify the changelog; they want typos in there to remain as they are forever. I believe this is inspired by the GNU documentation on Change Logs. On the other hand, not everyone agrees with the GNU documentation; after all, it views the changelog as a detailed dump of the VCS, which may appear a bit too detailed and obsolete these days. If many of you want me to fix the typos in the changelog, I'm happy to do that. However, I suggest this happens in a distinct PR, like #21069 for the rest of the typos.

Also, it's js and not *.js, to exclude vendored JS scripts under doc/themes/scikit-learn-modern/static/js, but not other JS scripts (if any in the future).
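The difference between `js` and `*.js` can be illustrated with a rough model that tests each skip pattern against every component of a path using `fnmatch`. This is an assumption made for illustration, and codespell's real matching rules may differ in detail. The `is_skipped` helper and both file names are hypothetical; only the vendored directory comes from the comment above.

```python
import fnmatch
from pathlib import PurePosixPath


def is_skipped(path, patterns):
    """Return True if any component of `path` matches any skip pattern."""
    parts = PurePosixPath(path).parts
    return any(
        fnmatch.fnmatchcase(part, pattern) for part in parts for pattern in patterns
    )


# Hypothetical file inside the vendored directory named in the comment.
vendored = "doc/themes/scikit-learn-modern/static/js/vendor.js"
# Hypothetical JS script added elsewhere in the future.
other = "doc/custom/widget.js"

for patterns in (["js"], ["*.js"]):
    print(patterns, is_skipped(vendored, patterns), is_skipped(other, patterns))
```

Under this model, `js` matches only the directory named `js`, so a script outside it is still checked, whereas `*.js` skips every JavaScript file regardless of location.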

thomasjpfan · 2021-10-02T21:22:33Z

When I run codespell locally, I see that codespell is looking at my .git/logs folder. @DimitriPapadopoulos Are you seeing the same behavior?

DimitriPapadopoulos · 2021-10-03T06:17:34Z

Indeed, that's because I had set it up for automated scans running on a git archive extraction, so without a .git directory. Fixed now by adding ./.git to the list of files/directories to skip.

I have also added a new build_tools/codespell_exclude.txt file to exclude whole lines instead of words, specifically for names in v*.rst.

thomasjpfan

I think codespell_exclude.txt is a little too fragile. If there is any change in those lines, we may need to update codespell_exclude.txt. I am happy with the original version where we had the names & words in codespell_ignore_words.txt.

thomasjpfan · 2021-10-03T19:32:52Z

setup.cfg

@@ -71,3 +71,8 @@ ignore =
    sklearn/utils/_seq_dataset.pxd
    sklearn/utils/_weight_vector.pyx
    sklearn/utils/_weight_vector.pxd
+
+[codespell]
+skip = ./.git,./doc/themes/scikit-learn-modern/static/js,./sklearn/feature_extraction/_stop_words.py,./doc/_build,./doc/auto_examples,./doc/modules/generated


May we add ./.mypy_cache here too?

I have added ./.mypy_cache.

Using two exclusion mechanisms is complicated and error-prone, I agree with that. However, I'm worried that some of the words that we ignore will shadow very common typos, the kind you see in any large project. The most striking example is teh.

I was hoping that at least the v*.rst files wouldn't change much.
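The trade-off discussed in this thread can be sketched with a toy checker. Everything here is hypothetical (the one-entry `MISSPELLINGS` table, the `flagged` helper, and both sample lines); it is a minimal model of word-level ignoring versus exact whole-line exclusion, not a description of codespell's internals.

```python
import re

# Hypothetical one-entry misspelling table, for illustration only.
MISSPELLINGS = {"teh": "the"}


def flagged(lines, ignore_words=(), exclude_lines=()):
    """Return the words flagged in `lines` under the given exclusions."""
    ignored = set(ignore_words)
    excluded = set(exclude_lines)
    return [
        word
        for line in lines
        if line not in excluded  # whole-line exclusion requires an exact match
        for word in re.findall(r"[a-z]+", line.lower())
        if word in MISSPELLINGS and word not in ignored
    ]


name_line = "- Fix by :user:`Teh Contributor`."  # hypothetical changelog entry
real_typo = "Update teh documentation."  # hypothetical genuine typo

# Word-level ignore: the name passes, but the genuine typo is shadowed too.
print(flagged([name_line, real_typo], ignore_words=["teh"]))
# Line-level exclusion: the name passes and the genuine typo is still caught.
print(flagged([name_line, real_typo], exclude_lines=[name_line]))
# Any edit to the excluded line (here, dropping the final period) brings
# the false positive back until the exclusion file is updated.
edited = name_line.rstrip(".")
print(flagged([edited, real_typo], exclude_lines=[name_line]))
```

Word-level ignoring is robust to edits but can hide real typos; line-level exclusion is precise but has to be kept in sync with the excluded text.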


thomasjpfan

LGTM


github-actions bot added cython Build / CI labels Sep 15, 2021

DimitriPapadopoulos force-pushed the codespell branch from c3bd90f to 09005d8 Compare September 15, 2021 10:39

DimitriPapadopoulos force-pushed the codespell branch from 09005d8 to 023c2ee Compare September 15, 2021 11:51

DimitriPapadopoulos mentioned this pull request Sep 15, 2021

MNT Run dos2unix #21053

Merged

DimitriPapadopoulos force-pushed the codespell branch 2 times, most recently from 20d625c to 484fa7e Compare September 15, 2021 12:23

thomasjpfan reviewed Sep 15, 2021


DimitriPapadopoulos force-pushed the codespell branch from 484fa7e to 327625c Compare September 15, 2021 14:42

ogrisel reviewed Sep 16, 2021


DimitriPapadopoulos force-pushed the codespell branch 2 times, most recently from d98d3e9 to f6a5454 Compare September 16, 2021 14:44

DimitriPapadopoulos mentioned this pull request Sep 16, 2021

DOC Typos found by codespell #21069

Merged

DimitriPapadopoulos force-pushed the codespell branch 3 times, most recently from a69551f to cfdd64d Compare September 21, 2021 05:09

DimitriPapadopoulos force-pushed the codespell branch 2 times, most recently from 15da471 to ea3857e Compare September 21, 2021 09:40

DimitriPapadopoulos changed the title ~~Add codespell to CI and fix existing typos~~ Codespell configuration Sep 22, 2021

DimitriPapadopoulos force-pushed the codespell branch from ea3857e to 5126586 Compare September 23, 2021 08:47

ogrisel approved these changes Sep 23, 2021


thomasjpfan reviewed Sep 26, 2021


thomasjpfan changed the title ~~Codespell configuration~~ MAINT Codespell configuration Sep 26, 2021

DimitriPapadopoulos force-pushed the codespell branch 2 times, most recently from e5064fd to 178e692 Compare September 27, 2021 09:28

DimitriPapadopoulos force-pushed the codespell branch from 178e692 to 47f4ff6 Compare October 3, 2021 06:13

DimitriPapadopoulos force-pushed the codespell branch from 47f4ff6 to 9df331e Compare October 3, 2021 07:09

thomasjpfan reviewed Oct 3, 2021


Configure codespell optimally

bfa4438

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

DimitriPapadopoulos force-pushed the codespell branch from 9df331e to bfa4438 Compare October 4, 2021 06:08

thomasjpfan approved these changes Oct 4, 2021


thomasjpfan merged commit 6edbffd into scikit-learn:main Oct 4, 2021

DimitriPapadopoulos deleted the codespell branch October 4, 2021 13:36

glemaitre mentioned this pull request Oct 23, 2021

Release 1.0.1 #21404

Merged

10 tasks

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Oct 23, 2021

MAINT Codespell configuration (scikit-learn#21051)

b9fc2c7

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

glemaitre pushed a commit that referenced this pull request Oct 25, 2021

MAINT Codespell configuration (#21051)

396d462

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021

MAINT Codespell configuration (scikit-learn#21051)

8518563

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
