MNT Remove encoding declarations: `# -*- coding: utf-8 -*-` #21260

DimitriPapadopoulos · 2021-10-06T18:24:34Z

What does this implement/fix? Explain your changes.

In Python 3, the default source file encoding is UTF-8.

Any other comments?

This is a follow-up of #21246 which was limited to examples.

rth · 2021-10-07T09:28:44Z

While it's indeed not necessary for Python itself, editors/IDEs also need an indication of what encoding the file uses (and I guess they could otherwise fall back to the system encoding). https://stackoverflow.com/a/14083123

For instance, on Windows the default encoding is not UTF-8:

PS C:\> [System.Text.Encoding]::Default

BodyName          : iso-8859-1
HeaderName        : Windows-1252

So I think it's probably safer to keep those headers.

DimitriPapadopoulos · 2021-10-07T10:13:04Z

The default encoding of Python source files and the default encoding associated with the system locale are two different things.
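
To make the distinction concrete, here is a minimal sketch using only the standard library. `tokenize.detect_encoding` follows the rules Python 3 applies to source files (a BOM or a PEP 263 declaration if present, otherwise UTF-8), while `locale.getpreferredencoding` reports the locale-dependent default. The helper name `source_encoding` is made up for this illustration.

```python
import io
import locale
import tokenize


def source_encoding(source: bytes) -> str:
    """Return the encoding Python 3 would use to decode this source code."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


# No declaration: Python 3 falls back to UTF-8, whatever the locale is.
print(source_encoding(b"x = 1\n"))

# An explicit declaration overrides that default.
print(source_encoding(b"# -*- coding: latin-1 -*-\nx = 1\n"))

# The locale's preferred encoding is a separate setting and varies by system.
print(locale.getpreferredencoding(False))
```

The last value is the one that may differ from UTF-8 on some Windows setups; the first two do not depend on the machine.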

If the IDE knows it's a Python file, it should use UTF-8 by default.
If the IDE doesn't know it's a Python file, will it be able to interpret the encoding declaration?

Do you know of a broken IDE that would understand the encoding declaration # -*- coding: utf-8 -*- but would not apply UTF-8 as the default encoding?

Currently, some pure ASCII files do have the encoding declaration and some UTF-8 files lack it. No one complains about it, which is a strong indication that the encoding declaration is probably NOT important.

Also the current situation is inconsistent. Either all source files should start with the encoding declaration, or all UTF-8 files (and only them) should start with the encoding declaration - whatever you prefer.

DimitriPapadopoulos · 2021-10-07T10:25:07Z

A majority of the files I have modified seem to be ASCII files:

$ file doc/conf.py 
doc/conf.py: Python script, ASCII text executable
$ 
$ file doc/sphinxext/sphinx_issues.py 
doc/sphinxext/sphinx_issues.py: Python script, ASCII text executable
$ 
$ file sklearn/cluster/_dbscan.py 
sklearn/cluster/_dbscan.py: Python script, ASCII text executable
$ 
$ file sklearn/cluster/_optics.py 
sklearn/cluster/_optics.py: Python script, UTF-8 Unicode text executable
$ 
$ file sklearn/cluster/_spectral.py 
sklearn/cluster/_spectral.py: Python script, ASCII text executable
$ 
$ file sklearn/feature_extraction/tests/test_text.py 
sklearn/feature_extraction/tests/test_text.py: Python script, UTF-8 Unicode text executable
$ 
$ file sklearn/feature_extraction/text.py 
sklearn/feature_extraction/text.py: Python script, UTF-8 Unicode text executable
$ 
$ file sklearn/feature_selection/_base.py 
sklearn/feature_selection/_base.py: Python script, ASCII text executable
$ 
$ file sklearn/gaussian_process/__init__.py 
sklearn/gaussian_process/__init__.py: Python script, ASCII text executable
$ 
[...]
$ file sklearn/random_projection.py 
sklearn/random_projection.py: Python script, ASCII text executable
$

DimitriPapadopoulos · 2021-10-07T10:34:24Z

Here are the UTF-8 *.py scikit-learn source files without the encoding declaration:

./sklearn/preprocessing/tests/test_data.py
./sklearn/preprocessing/_encoders.py
./sklearn/preprocessing/_data.py
./sklearn/metrics/_plot/roc_curve.py
./sklearn/decomposition/_lda.py
./sklearn/gaussian_process/kernels.py
./sklearn/linear_model/tests/test_coordinate_descent.py
./sklearn/linear_model/_perceptron.py

And here are the UTF-8 *.py scikit-learn source files with an encoding declaration:

./sklearn/preprocessing/_discretization.py
./sklearn/preprocessing/tests/test_encoders.py
./sklearn/metrics/pairwise.py
./sklearn/linear_model/_theil_sen.py
./sklearn/linear_model/_ransac.py
./sklearn/feature_extraction/tests/test_text.py
./sklearn/feature_extraction/text.py
./sklearn/naive_bayes.py
./sklearn/cluster/_optics.py
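
For anyone who wants to reproduce lists like the two above, here is a rough sketch in Python. It is not the command that was actually used; the function names are hypothetical, and "non-ASCII" is used as a stand-in for "UTF-8". The regular expression is the one given in PEP 263, which only looks at the first two lines of a file.

```python
import re
from pathlib import Path

# Declaration pattern from PEP 263; it only counts on line 1 or 2.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def has_encoding_declaration(data: bytes) -> bool:
    """True if one of the first two lines carries an encoding declaration."""
    return any(CODING_RE.match(line) for line in data.splitlines()[:2])


def report(root: str) -> dict:
    """Group the non-ASCII *.py files under root by presence of a declaration."""
    result = {"with": [], "without": []}
    for path in sorted(Path(root).rglob("*.py")):
        data = path.read_bytes()
        if not data.isascii():
            key = "with" if has_encoding_declaration(data) else "without"
            result[key].append(str(path))
    return result
```

Calling `report(".")` from the repository root returns both lists in one pass.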

rth · 2021-10-07T12:16:55Z

If the IDE knows it's a Python file, it should use UTF-8 by default.
[...]
Do you know of a broken IDE that would understand the encoding declaration # -*- coding: utf-8 -*- but would not apply UTF-8 as the default encoding?

Well, the question is more how confident you are that the IDE will rely on the file extension to determine the encoding :) For instance in https://www.jetbrains.com/help/pycharm/encoding.html I see nothing about it being a Python or a non-Python file. I guess it depends on the default settings.

One could hope they are reasonable, but personally I don't know. I would just rather be careful here. So a confirmation that UTF-8 is used by default, particularly on Windows, for some of the more popular editors would be useful. In VS Code the default is indeed UTF-8.

It's probably fine, but doesn't hurt to double check.

DimitriPapadopoulos · 2021-10-07T13:01:37Z

I don't have Windows to double-check.

Instead I can add # -*- coding: utf-8 -*- to UTF-8 files that lack it, and remove # -*- coding: utf-8 -*- from ASCII files.

Alternatively we can close this PR.

DimitriPapadopoulos · 2021-10-07T13:06:51Z

Note that VS saves as UTF-8 with BOM by default, but we probably want to avoid the BOM in a multi-platform context.

rth · 2021-10-07T13:07:07Z

remove # -*- coding: utf-8 -*- from ASCII files.

Well, the problem is that a file that is currently ASCII will become non-ASCII as soon as one adds a non-ASCII character.

Let's keep it open. Maybe someone else has other opinions or could provide feedback on their IDE configuration.

DimitriPapadopoulos · 2021-10-07T13:07:56Z

Well, the problem is that a file that is currently ASCII will become non-ASCII as soon as one adds a non-ASCII character.

Then we need to add # -*- coding: utf-8 -*- to all files. The probability that it happens to the ~10 ASCII files I have removed the encoding declaration from is very small compared to the probability that it happens to the rest of the ~850 ASCII source files.

DimitriPapadopoulos · 2021-10-07T13:11:35Z

Note that large projects like NumPy live without the encoding declarations. It seems to be working well for them.

ogrisel · 2021-10-07T13:52:58Z

Note that large projects like NumPy live without the encoding declarations. It seems to be working well for them.

Maybe they do not have authorship lines with "Gaël", "Müller" or "Loïc" or docstrings with "Schölkopf".

ogrisel · 2021-10-07T13:55:23Z

+1 for keeping UTF-8 markers to avoid problems with editors on Windows, and adding them when needed on a case-by-case basis.

DimitriPapadopoulos · 2021-10-07T14:09:41Z

Maybe they do not have authorship lines with "Gaël", "Müller" or "Loïc" or docstrings with "Schölkopf".

The authorship lines are not relevant as such. The number of UTF-8 files and total files is similar in both projects, but I agree Scikit-learn has twice the NumPy proportion of UTF-8 files. Still, it's similar.

In NumPy:

$ find . -name \*.py -exec file {} +  | grep ASCII | wc  -l
504
$ 
$ find . -name \*.py -exec file {} +  | grep UTF-8 | wc  -l
13
$

In Scikit-learn:

$ find . -name \*.py -exec file {} +  | grep ASCII | wc  -l
794
$ 
$ find . -name \*.py -exec file {} +  | grep UTF-8 | wc  -l
37
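
For those without the `file` utility (on Windows, for example), roughly the same tally can be sketched in Python. This is an approximation under the assumption that "ASCII" means every byte is below 128 and "UTF-8" means non-ASCII content that decodes as UTF-8; `file` uses its own heuristics, so the counts may not match exactly. The function names are hypothetical.

```python
from collections import Counter
from pathlib import Path


def classify(data: bytes) -> str:
    """Label file content as 'ASCII', 'UTF-8' or 'other'."""
    if data.isascii():
        return "ASCII"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "other"
    return "UTF-8"


def count_encodings(root: str) -> Counter:
    """Tally the labels of all *.py files found under root."""
    return Counter(classify(p.read_bytes()) for p in Path(root).rglob("*.py"))
```

Running `count_encodings(".")` in a checkout gives a `Counter` with one entry per label.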

DimitriPapadopoulos · 2021-10-07T14:15:35Z

+1 for keeping UTF-8 markers to avoid problems with editors on Windows, and adding them when needed on a case-by-case basis.

Just to be clear:

UTF-8 files: I understand you want an encoding declaration in UTF-8 files, so I'll keep it in UTF-8 files that already have one and add it to UTF-8 files that lack one.
ASCII files: Most ASCII files lack an encoding declaration. What about them? Keeping the encoding declaration in ~10 ASCII files while the remaining ~800 files lack one won't help much. Being consistent would be much more useful. So what do we do with ASCII files?

ogrisel · 2021-10-07T14:36:44Z

I don't know. Can someone with a Windows machine try to edit a .py file without the # -*- coding: utf-8 -*- marker using a couple of common editors such as VS Code, PyCharm, Spyder, Notepad++ to add "ö" in a docstring and see what encoding is used by the editor by default (for instance, you could check that reading the file with print(Path(filename).read_text(encoding="utf-8")) works)?

On Linux and macOS I am pretty sure that UTF-8 is always used by default nowadays.
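
A slightly more explicit version of that check could look like the sketch below. It reads the raw bytes, so it also reveals whether the editor prepended a BOM. The helper name `check_utf8` is hypothetical, and the demo only simulates what different editors might write by encoding the same text in three ways.

```python
import codecs
import tempfile
from pathlib import Path


def check_utf8(filename) -> dict:
    """Report whether a file decodes as UTF-8 and whether it starts with a BOM."""
    data = Path(filename).read_bytes()
    try:
        data.decode("utf-8")
        valid = True
    except UnicodeDecodeError:
        valid = False
    return {"valid_utf8": valid, "has_bom": data.startswith(codecs.BOM_UTF8)}


# Simulate three possible editor behaviours on the same docstring.
with tempfile.TemporaryDirectory() as tmp:
    for codec in ("utf-8", "utf-8-sig", "cp1252"):
        path = Path(tmp) / f"sample_{codec}.py"
        path.write_bytes('"""Schölkopf"""\n'.encode(codec))
        print(codec, check_utf8(path))
```

Only the `cp1252` sample should fail to decode as UTF-8; the `utf-8-sig` sample decodes but carries a BOM.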

DimitriPapadopoulos · 2021-10-07T15:50:56Z

If you really want to take Windows editors into account, I think pre-commit should remove BOMs added by Windows editors. See for example BOM Away, in Git Style.
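
As a rough idea of what such a hook could do, here is a minimal sketch that strips a leading UTF-8 BOM from a file in place. It is not the tool referenced above, and the function name `strip_bom` is hypothetical; a real hook would also need to receive the list of staged files and report a failure when it modifies one.

```python
import codecs
from pathlib import Path


def strip_bom(path) -> bool:
    """Remove a leading UTF-8 BOM in place; return True if the file changed."""
    p = Path(path)
    data = p.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    # Drop exactly the three BOM bytes and keep the rest untouched.
    p.write_bytes(data[len(codecs.BOM_UTF8):])
    return True
```

Working on raw bytes avoids re-encoding the file, so line endings and all other content are preserved as they were.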

jeremiedbb · 2022-03-13T01:15:26Z

Can someone with a Windows machine try to edit a .py file without the # -*- coding: utf-8 -*- marker using a couple of common editors such as VS Code, PyCharm, Spyder, Notepad++ to add "ö" in a docstring and see what encoding is used by the editor by default (for instance, you could check that reading the file with print(Path(filename).read_text(encoding="utf-8")) works)?

I checked with these 4 editors. VS Code, PyCharm and Notepad++ use UTF-8 by default. Spyder seems to use ASCII by default but automatically switches to UTF-8 as soon as you add a non-ASCII character. In all cases, reading the file shows the characters correctly.

I think we can safely remove the UTF-8 markers.

jeremiedbb · 2022-03-13T01:17:07Z

(please don't ask me for more experiments with these editors, I'm uninstalling 3 of them as I speak 😄)

ogrisel

I trust @jeremiedbb's test results. Thanks @jeremiedbb!

Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>

MNT Remove encoding declarations: # -*- coding: utf-8 -*-

86e5679

In Python 3, the default source file encoding is UTF-8.

DimitriPapadopoulos force-pushed the utf-8 branch from 571474c to 86e5679 Compare October 6, 2021 18:24

DimitriPapadopoulos changed the title ~~MNT Remove encoding declarations: # -*- coding: utf-8 -*-~~ MNT Remove encoding declarations: # -*- coding: utf-8 -*- Oct 6, 2021

TomDLT approved these changes Oct 6, 2021

TomDLT added the No Changelog Needed label Oct 6, 2021

jjerphan approved these changes Oct 7, 2021

DimitriPapadopoulos mentioned this pull request Oct 7, 2021

MAINT: Remove encoding declarations: # -*- coding: utf-8 -*- numpy/numpy#20060

Merged

DimitriPapadopoulos mentioned this pull request Oct 7, 2021

MNT Clean up examples #21246

Merged

glemaitre added the Needs Decision Requires decision label Jan 28, 2022

jeremiedbb approved these changes Mar 13, 2022

ogrisel approved these changes Mar 14, 2022

Merge branch 'main' into utf-8

f64e02e

jeremiedbb merged commit 3d4ee9e into scikit-learn:main Mar 14, 2022

DimitriPapadopoulos deleted the utf-8 branch March 15, 2022 13:52

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Apr 6, 2022

MNT Remove utf-8 encoding declarations (scikit-learn#21260)

630cfb5

Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>

DimitriPapadopoulos mentioned this pull request May 31, 2023

RFC Consistency for meta infos in files (license, encoding, authors, ..) #20813

Closed
