-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
MAINT Codespell configuration #21051
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c3bd90f
to
09005d8
Compare
I believe (some of) the linter issues predate my own changes (probably an effect of dos2unix). |
We used Regarding the usage of any thoughts @adrinjalali @thomasjpfan @NicolasHug? |
09005d8
to
023c2ee
Compare
I'm generally happy with fixing and preventing typos, my only concern is the potential risk of false positives. Did you observe a lot of these in your local setup @glemaitre ? As a technical detail on this PR, I feel like we should fix the CRLF issues in a separate PR. 6K+ lines of change is scary for us reviewers :p ! |
I'll move the |
20d625c
to
484fa7e
Compare
False positives I have seen myself:
Have a look at |
Locally I don't recall any false positives (or so few of them that I don't notice them) but I don't know if this extension uses the same engine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am happy with spell checking in the CI.
To complete the PR, I think we need to add something in the contributor guide to describe how to run this locally.
.github/codespell_ignore_words.txt
Outdated
@@ -0,0 +1,38 @@ | |||
aline | |||
amoungst | |||
ba |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the words in codespell_ignore_words.txt
do not look like words?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aline
is a person's name from bibliography.
amoungst
is a valid variant of amongst
from sklearn/feature_extraction/_stop_words.py
ba
is a variable name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
amoungst is a valid variant of amongst from sklearn/feature_extraction/_stop_words.py
Is this safe to remove from the ignore list now that we ignore _stop_words.py
?
Some of these variable names are not the best. In a follow up PR, we can start looking at the ignore list and try to remove some words and create better variable names for them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! I will remove amoungst
from the list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not certain ba
is an actual variable name. It might be a PDF command used when generating docs, or something similar, you cannot change. I'd have to run codespell without ignoring ba
to tell you. Most variable names are actually (very) well chosen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed amoungst
. Note that the complies
false positive might be the result of an actual codespell bug or new feature (codespell-project/codespell#2062).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ba
is indeed a variable name is some cases, but it is also part of an URL!
http://ru.wikipedia.org/wiki/%D0%92%D0%B8%D0%
BA
%D0%B8%D0%BF%D0%B5%D0%B4%D0%B8%D1%8F
While codespell does have an --uri-ignore-words-list
option, it has not been propagated to the GitHub action. I have opened another ticket for that (codespell-project/actions-codespell#36).
.github/workflows/codespell.yml
Outdated
skip: js,v*.rst,_stop_words.py | ||
ignore_words_file: .github/codespell_ignore_words.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be placed in a RC file so it is easier to run locally? (https://github.com/codespell-project/codespell#using-a-config-file)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good Excellent idea, in Python projects I can put these in setup.cfg
.
484fa7e
to
327625c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On top of https://github.com/scikit-learn/scikit-learn/pull/21051/files#r709263298 there is also the following:
"happy.pleased", | ||
"relaxing.calm", | ||
"quiet.still", | ||
"sad.lonely", | ||
"angry.aggresive", | ||
"angry.aggressive", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a problem: those typos are part of the original dataset hosted at openml.org. We should not fix those and we should find a way to silence false positives for this specific file without adding typos in the global codespell_ignore_words.txt
file. Would it be possible to tell codespell to ignore some words in a specific file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right, I had noted that I should ask about these, then forgot. Thank you for raising this issue.
No, there is no way to do that. You can only:
- ignore words
- exclude files
- ignore full lines
No worries though. I'll undo the changes in sklearn/datasets/tests/test_openml.py
and add these words to the list of false positives. It's not perfect, but we cannot aim at perfect here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted to the openml.org typos, aggresive
and suprised
added to the list of false positives.
If the false positive management and maintenance overhead to have to manage an ignore list is judged too much trouble to be worth it, I would be in favor of opening a dedicated ,one-shot PR that just fixes the currently identified typos, without the extra CI harness. Merging that one should not be controversial ;) |
+1 with Olivier proposal.
…On Thu, 16 Sept 2021 at 15:29, Olivier Grisel ***@***.***> wrote:
If the false positive management and maintenance overhead to have to
manage an ignore list is judged too much trouble to be worth it, I would be
in favor of opening a dedicated ,one-shot PR that just fixes the currently
identified typos, without the extra CI harness. Merging that one should not
be controversial ;)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#21051 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABY32P6Y7TFJSD2AZ5NQMADUCHWLPANCNFSM5ECEGMFA>
.
--
Guillaume Lemaitre
Scikit-learn @ Inria Foundation
https://glemaitre.github.io/
|
Maybe we can still keep the ignore_word list etc and make this a step in the release process? |
I believe the management of false positives in Fixing the actual typos might be more annoying, it will happen more often - compare the number of words of An alternative would be to show the results of codespell but not block the merge. It would be up to reviewers to let typos through or decide they're false positives, and perhaps update the list of false positives in |
How to make that a step in the release process? Are the steps just documented or actually programmed? In any case, I can split this into two PRs:
|
d98d3e9
to
f6a5454
Compare
I have moved the actual typo fixes to #21069. It would be great if someone could merge that PR soon. Then I can rebase this PR and use it to continue discussing the machinery to prevent future typos in this PR, either as part of CI, or as a step in the release process. |
a69551f
to
cfdd64d
Compare
I like the other PR fixing the typos, but I'm also wary of this being a part of the CI, especially with the way we need to globally ignore things. I'd be in favor of having this done periodically, at every release or once a year or so, rather than being a part of CI. |
I'll remove the GitHub action Is it OK to keep the codespell parameters in
Since this won't be a GitHub action, I should probably move |
15da471
to
ea3857e
Compare
I would put that under |
ea3857e
to
5126586
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running codespell
locally is great. Thank you for working on these spelling errors @DimitriPapadopoulos !
setup.cfg
Outdated
@@ -71,3 +71,7 @@ ignore = | |||
sklearn/utils/_seq_dataset.pxd | |||
sklearn/utils/_weight_vector.pyx | |||
sklearn/utils/_weight_vector.pxd | |||
|
|||
[codespell] | |||
skip = js,v*.rst,_stop_words.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some auto generated files that would good to skip:
skip = js,v*.rst,_stop_words.py | |
skip = *.js,./sklearn/feature_extraction/_stop_words.py,./doc/_build,./doc/auto_examples,./doc/modules/generated |
I think expanding _stop_words.py
to the full path makes it more explicit. I see that v*.rst
is skipped because of names, but I think having spell check enabled for the changelog is a net win.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree about the full paths, but I'd rather wait for codespell-project/codespell#2058 to be taken into account.
About v*.rst
, I've left these out because many projects don't want to modify the changelog, they want typos in there to remain as they are forever. I believe this is inspired by the GNU documentation on Change Logs. On the other hand, not everyone agrees with the GNU documentation, after all the changelog is viewed as a detailed dump of the VCS, which may appear a bit too detailed and obsolete these days. If many of you want me to fix the typos in the changelog, I'm happy to do that. However, I suggest this happens in a distinct PR, like #21069 for the rest of the typos.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it's js
and not *.js
, to exclude vendored JS scripts under doc/themes/scikit-learn-modern/static/js
, but not other JS scripts (if any in the future).
e5064fd
to
178e692
Compare
When I run |
178e692
to
47f4ff6
Compare
Indeed, that's because I had set it up for automated scans running on a git archive extraction, so without a .git directory. Fixed now by adding I have also added a new |
47f4ff6
to
9df331e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think codespell_exclude.txt
is a little too fragile. If there is any change in those lines, we may need to update codespell_exclude.txt
. I am happy with the original version where we had the names & words in codespell_ignore_words.txt
.
setup.cfg
Outdated
@@ -71,3 +71,8 @@ ignore = | |||
sklearn/utils/_seq_dataset.pxd | |||
sklearn/utils/_weight_vector.pyx | |||
sklearn/utils/_weight_vector.pxd | |||
|
|||
[codespell] | |||
skip = ./.git,./doc/themes/scikit-learn-modern/static/js,./sklearn/feature_extraction/_stop_words.py,./doc/_build,./doc/auto_examples,./doc/modules/generated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May we add ./.mypy_cache
here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added ./.mypy_cache
.
Using two exclusion mechanisms is complicated and error-prone, I agree with that. However, I'm worried that some of the words that we ignore will shadow very common typos, the kind you see in any large project. The most striking example is teh
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was hoping that at least files v*.rst
wouldn't change much.
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
9df331e
to
bfa4438
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
What does this implement/fix? Explain your changes.
Add codespell to CI and fix existing typos.
Any other comments?
Not sure where to put file
codespell_ignore_words.txt
. Can you help?Finally, should this really go into the Changelog?