Skip to content

MNT Remove encoding declarations: # -*- coding: utf-8 -*- #21260

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Mar 14, 2022

Conversation

DimitriPapadopoulos
Copy link
Contributor

What does this implement/fix? Explain your changes.

In Python 3, the default source file encoding is UTF-8.

Any other comments?

This is a follow-up of #21246 which was limited to examples.

In Python 3, the default source file encoding is UTF-8.
@DimitriPapadopoulos DimitriPapadopoulos changed the title MNT Remove encoding declarations: # -*- coding: utf-8 -*- MNT Remove encoding declarations: # -*- coding: utf-8 -*- Oct 6, 2021
@rth
Copy link
Member

rth commented Oct 7, 2021

While for Python it's indeed not necessary, editors/IDE will also need an indication what encoding it is (and I guess otherwise could use the system encoding). https://stackoverflow.com/a/14083123

For instance on Windows the default encoding is not UTF-8,

PS C:\> [System.Text.Encoding]::Default

BodyName          : iso-8859-1
HeaderName        : Windows-1252

So I think it's probably safer to keep those headers.

@DimitriPapadopoulos
Copy link
Contributor Author

DimitriPapadopoulos commented Oct 7, 2021

The default encoding of Python source files and the default encoding associated to the system locale are two different things.

  • If the IDE knows it's a Python file, it should use UTF-8 by default.
  • If the IDE doesn't know it's a Python file, will it be able to interpret the encoding declaration?

Do you know of a broken IDE that would understand the encoding declaration # -*- coding: utf-8 -*- but would not apply UTF-8 as the default encoding?

Currently, some pure ASCII files do have the encoding declaration and some UTF-8 files lack it. No one complains about it, this is a strong indication the encoding declaration is probably NOT important.

Also the current situation is inconsistent. Either all source files should start with the encoding declaration, or all UTF-8 files (and only them) should start with the encoding declaration - whatever you prefer.

@DimitriPapadopoulos
Copy link
Contributor Author

A majority of the files I have modified seem to be ASCII files:

$ file doc/conf.py 
doc/conf.py: Python script, ASCII text executable
$ 
$ file doc/sphinxext/sphinx_issues.py 
doc/sphinxext/sphinx_issues.py: Python script, ASCII text executable
$ 
$ file sklearn/cluster/_dbscan.py 
sklearn/cluster/_dbscan.py: Python script, ASCII text executable
$ 
$ file sklearn/cluster/_optics.py 
sklearn/cluster/_optics.py: Python script, UTF-8 Unicode text executable
$ 
$ file sklearn/cluster/_spectral.py 
sklearn/cluster/_spectral.py: Python script, ASCII text executable
$ 
$ file sklearn/feature_extraction/tests/test_text.py 
sklearn/feature_extraction/tests/test_text.py: Python script, UTF-8 Unicode text executable
$ 
$ file sklearn/feature_extraction/text.py 
sklearn/feature_extraction/text.py: Python script, UTF-8 Unicode text executable
$ 
$ file sklearn/feature_selection/_base.py 
sklearn/feature_selection/_base.py: Python script, ASCII text executable
$ 
$ file sklearn/gaussian_process/__init__.py 
sklearn/gaussian_process/__init__.py: Python script, ASCII text executable
$ 
[...]
$ file sklearn/random_projection.py 
sklearn/random_projection.py: Python script, ASCII text executable
$ 

@DimitriPapadopoulos
Copy link
Contributor Author

DimitriPapadopoulos commented Oct 7, 2021

Here are the UTF-8 *.py NumPy source files without the encoding declaration:

./sklearn/preprocessing/tests/test_data.py
./sklearn/preprocessing/_encoders.py
./sklearn/preprocessing/_data.py
./sklearn/metrics/_plot/roc_curve.py
./sklearn/decomposition/_lda.py
./sklearn/gaussian_process/kernels.py
./sklearn/linear_model/tests/test_coordinate_descent.py
./sklearn/linear_model/_perceptron.py

And here are the UTF-8 *.py NumPy source files with an encoding declaration:

./sklearn/preprocessing/_discretization.py
./sklearn/preprocessing/tests/test_encoders.py
./sklearn/metrics/pairwise.py
./sklearn/linear_model/_theil_sen.py
./sklearn/linear_model/_ransac.py
./sklearn/feature_extraction/tests/test_text.py
./sklearn/feature_extraction/text.py
./sklearn/naive_bayes.py
./sklearn/cluster/_optics.py

@rth
Copy link
Member

rth commented Oct 7, 2021

If the IDE knows it's a Python file, it should use UTF-8 by default.
[...]
Do you know of a broken IDE that would understand the encoding declaration # -- coding: utf-8 -- but would not apply UTF-8 as the default encoding?

Well the question is more how confident are you that IDE will rely on the file extension to determine the encoding :) For instance in https://www.jetbrains.com/help/pycharm/encoding.html I see nothing about it being a Python or a non Python file. I guess it depends on the default settings.

One could hope they are reasonable, but personally I don't know. I just would rather be careful here. So a confirmation that UTF-8 is used by default, particularly on Windows for some of the more popular browsers would be useful. In VS Code the default is indeed UTF-8.

It's probably fine, but doesn't hurt to double check.

@DimitriPapadopoulos
Copy link
Contributor Author

I don't have Windows to double-check.

Instead I can add # -*- coding: utf-8 -*- to UTF-8 files that lack it, and remove # -*- coding: utf-8 -*- from ASCII files.

Alternatively we can close this PR.

@DimitriPapadopoulos
Copy link
Contributor Author

DimitriPapadopoulos commented Oct 7, 2021

Note that VS saves into UTF8-BOM by default, but we probably want to the avoid the BOM in a multi-platform context.

@rth
Copy link
Member

rth commented Oct 7, 2021

remove # -- coding: utf-8 -- from ASCII files.

Well the problem is that a file that is currently ASCII will become non ASCII as soon as one adds a non ASCII character.

Let's keep it open. Maybe someone else has other opinions or could provide feedback on their IDE configuration.

@DimitriPapadopoulos
Copy link
Contributor Author

DimitriPapadopoulos commented Oct 7, 2021

Well the problem is that a file that is currently ASCII will become non ASCII as soon as one add a non ASCII character.

Then we need to add # -_- coding: utf-8 -_- to all files. The probability it happens to the ~10 ASCII files I have removed the encoding declaration from is very small compared to the probability it happens to the rest of the ~850 ASCII source files.

@DimitriPapadopoulos
Copy link
Contributor Author

Note that large projects like NumPy live without the encoding declarations. It seems to be working well for them.

@ogrisel
Copy link
Member

ogrisel commented Oct 7, 2021

Note that large projects like NumPy live without the encoding declarations. It seems to be working well for them.

Maybe they not not have authorship lines with "Gaël", "Müller" or "Loïc" or docstrings with "Schölkopf".

@ogrisel
Copy link
Member

ogrisel commented Oct 7, 2021

+1 for keeping UTF-8 markers to avoid problems with editors on Windows and add them when needed on a case by case basis.

@DimitriPapadopoulos
Copy link
Contributor Author

DimitriPapadopoulos commented Oct 7, 2021

Maybe they not not have authorship lines with "Gaël", "Müller" or "Loïc" or docstrings with "Schölkopf".

The authorship lines are not relevant as such. The number of UTF-8 files and total files is similar in both projects, but I agree Scikit-learn has twice the NumPy proportion of UTF-8 files. Still, it's similar.

In NumPy:

$ find . -name \*.py -exec file {} +  | grep ASCII | wc  -l
504
$ 
$ find . -name \*.py -exec file {} +  | grep UTF-8 | wc  -l
13
$ 

In Scikit-learn:

$ find . -name \*.py -exec file {} +  | grep ASCII | wc  -l
794
$ 
$ find . -name \*.py -exec file {} +  | grep UTF-8 | wc  -l
37

@DimitriPapadopoulos
Copy link
Contributor Author

DimitriPapadopoulos commented Oct 7, 2021

+1 for keeping UTF-8 markers to avoid problems with editors on Windows and add them when needed on a case by case basis.

Just to be clear:

  • UTF-8 files: I understand you want an encoding declaration in UTF-8 files, so I'll keep it in UTF-8 files that already have one and add it to UTF-8 files that lack one.
  • ASCII files: Most ASCII files lack an encoding declaration. What about them? Keeping the encoding declaration in ~10 ASCII file while the rest of ~800 files lack one won't help much. Being consistent would be much more useful. So what do we do with ASCII files?

@ogrisel
Copy link
Member

ogrisel commented Oct 7, 2021

I don't know. Can someone with a windows machine try to edit a .py file without the # -*- coding: utf-8 -*- marker using a couple of common editors such as VSCode, PyCharm, spyder, notepad++ to add "ö" in a docstring and see what encoding is used by the editor by default (for instance, you could check that reading the file with print(Path(filaneme).readtext("utf-8")) works)?

On Linux and macOS I am pretty sure that UTF-8 is always use by default nowadays.

@DimitriPapadopoulos
Copy link
Contributor Author

DimitriPapadopoulos commented Oct 7, 2021

If you really want to take into account Windows editors, I think pre-commitshould remove BOMs added by Windows editors. See for example BOM Away, in Git Style.

@glemaitre glemaitre added the Needs Decision Requires decision label Jan 28, 2022
@jeremiedbb
Copy link
Member

Can someone with a windows machine try to edit a .py file without the # -- coding: utf-8 -- marker using a couple of common editors such as VSCode, PyCharm, spyder, notepad++ to add "ö" in a docstring and see what encoding is used by the editor by default (for instance, you could check that reading the file with print(Path(filaneme).readtext("utf-8")) works)?

I checked with these 4 editors. VSCode, PyCharm, notepad++ use UTF-8 by default. Spider seems to be using ASCII by default but automatically switches to UTF-8 as soon as you add a non-ASCII character. In all cases, reading the file shows the characters correctly.

I think we can safely remove the UTF-8 markers.

@jeremiedbb
Copy link
Member

jeremiedbb commented Mar 13, 2022

(please don't ask me more experiments with these editors, I'm uninstalling 3 of them as I speak 😄)

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I trust @jeremiedbb's test results. Thanks @jeremiedbb!

@jeremiedbb jeremiedbb merged commit 3d4ee9e into scikit-learn:main Mar 14, 2022
@DimitriPapadopoulos DimitriPapadopoulos deleted the utf-8 branch March 15, 2022 13:52
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Apr 6, 2022
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants