-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
RFC Consistency for meta infos in files (license, encoding, authors, ..) #20813
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I was also somewhat skeptical about the need for As to the license header, it looks like there is no clear consensus but at the same time if parts of scikit-learn get vendored somewhere, it become very difficult to realize what the license was without this header line. So I would be +1 to consistently add it everywhere. Though it looks like neither pandas not numpy do this. |
I like having a license header everywhere to be consistent. Since everything is licensed under https://github.com/scikit-learn/scikit-learn/blob/main/COPYING do we add the following everywhere? # Authors: The scikit-learn developers.
# License: BSD 3 clause |
+1 for placing the license everywhere. A little argument in favor of |
I was thinking to leave the current authors in place and add this new Now that I am looking at other python code bases: CPython, numpy, scipy, etc, they do not have a license everywhere. I think I can go either way with this as long as we are consistent. Would it be okay to remove the license on top of files and have it all other the "root" license? It is the same license and the authors are still "The scikit-learn developers". The generic "____ developers" term is used in numpy's license, scipy's license. |
I had already removed all (?) occurrences of As for editors that would rely on it to switch to UTF-8 when editing Python files, I would argue they are broken. I've never seen such an editor myself. |
To add a datapoint from JupyterHub: we use "The JupyterHub authors" in a lot of places, there is no explicit mention of individual authors in the files themselves and I don't think the use of the encoding line is widespread/used anywhere. The In general I like having the statement on the shared copyright model because it makes it clear that you don't have to transfer your copyright, sign a contributor license agreement, etc to contribute. It is also useful to point to if ever someone would want to change the license (as we'd have to contact every single author to ask them to relicense their contributions). |
I would add a mention of the BSD license, and of scikit-learn in the headers of the sources for the reason given by Roman in #20813 (comment) and of "scikit-learn developers" for the "Authors" mention. I find that having lists of the original authors is also useful, though public git histories make those lists irrelevant. |
See: scikit-learn#20813 Signed-off-by: Julien Jerphanion <git@jjerphan.xyz>
Neither numpy, scipy nor pandas have authors in source files nor a repeated license header. All are licensed under BSD-3-clause which is stated in the |
On the first hand, there's no obligation to have the license and copyright in every single file. Having it only in the root folder is acceptable. On the other hand, having it in each file can make it easier for third party projects that vendor parts of scikit-learn. They just have to take them. Note that I don't think the current formatting would be valid: I think it would require a This brings the question: do we want to be compliant to some standard regarding licensing ? There's for instance the SPDX project hosted by the linux foundation. SPDX is not just about licensing, but there's a Licensing part. The goal would be to have human and machine readable headers, and make scikit-learn files trivial to vendor. It could also help us move forward with #27559. Then there are tools to convert files of a project in order to be compliant and tools to certify that files are compliant. I tested reuse which is both a converter and a linter, compatible with SPDX. I ran
I produces this kind of header
The Then reuse can be used as a linter, even as a precommit hook. Side note; it doesn't prevent us to first remove all existing headers because reuse can't replace them automatically because they're not headers sometimes or not in a parsable format. What's your opinion on this ? |
My first take away is that we still have (kind of a) consensus to (first) remove authors and license in the files. Then to your proposal, I would not over-engineer this. Let's wait until a 3rd party project asks for it. Do we know of any project that vendors (parts of) scikit-learn? |
|
I don't see a consensus from this discussion to remove them. I see 1 or 2 in favor of removing, 1 or 2 in favor of keeping them but uniformized and 1 or 2 unclear 😄 I'd tend to remove them as well unless following the standard is important for some reason I'm not aware about, hence my question. But I agree that whatever the decision, we can start by removing them. |
I'm not a legal expert, but can we do this? Our root license is under "The scikit-learn developers.": But the individual files have the authors listed: scikit-learn/sklearn/tree/_tree.pyx Lines 1 to 11 in 57a7222
|
From what I read here and there: Since we're releasing under BSD license, we don't necessarily have to worry too much about copyright ownership. This license means the copyright owner is effectively giving away their code, so it's unlikely that it can cause us any trouble. However I'm not a legal expert either, it would be interesting to have a concrete legal advice. |
IANAL, but coming from the license compliance side, the usual recommendation I have met is to always include all copyright lines found in the original source code along with their corresponding license statements - usually regardless of the actual license. Applying this to the packaging side: I would not remove them unless being completely sure about this. (There have been recent rumors about this in the Redis community as well when they removed BSD license headers due to re-licensing, see PR 13157 in https://github.com/redis/redis for example.) |
We do not want to change the license at all. |
I'm in favour of removing the list of individual authors from files (and replacing it with "The scikit-learn authors"). Only caveat: if someone can point to some specific legal opinion that says "you definitely can not do this" (not just "you might not be able to do this"). Basically, the ideal end goal for me would be two lines in each file: one stating authors, one stating the license (e.g. what The main motivator for me is that the list of individual authors is probably out of date and hence not super useful for anything other than some form of "giving credit". The credit function was probably useful 5 years ago, but today I wonder if we still need it. I'm not sure I understand what you mean @thomasjpfan regarding the fact that we list individual authors (#20813 (comment)). If we ever wanted to change the license we'd have to contact all the individual authors to get them to agree. For that we'd have to use git's history though, because I doubt that people were diligent in updating the manual list (e.g. when making small tweaks). |
I hope. Are we allowed to modify or remove code contributed by some author without his explicit consent ? I can't imagine otherwise, it's done every single day. The author line being 1 line of code, I don't see how it would be different. It's not that line that can will help a contributor to prove what part of the code was written by him. It's other tools like git or github history. And let's keep in mind that it's not the author line that gives him rights on his code contribution. He holds this right no matter what we write or not write. |
Can we get legal advice via Open Collective ? Maybe overkill for that matter but we could settle once and for all :) |
From my understanding of 3-Clause BSD, the copyright holder must be preserved. Concretely, if we remove the list of authors in If the listed authors are part of the "The scikit-learn developers.", then it could be okay. But I am not sure. |
A minimal consensus seems to be @thomasjpfan's proposal in #20813 (comment) and put
in every file. |
My assumption is that "The scikit-learn developers" includes all people who have edited the source code (at some point in time). |
In the article linked in #20813 (comment), the recommandation is to not remove the existing authors:
|
I have the really strong opinion to remove all names. I also don’t see a larger risk in doing so. We do not remove the copyright holder, just a line of source code. |
I read a bit more on US copyright laws despite not being a lawyer and not knowing which (country’s) law is best cited here. I found what is a copyright notice
So I would conclude that we only have one single copyright notice in the COPYING file. Additionally, we have a lot of non-systematic mentions of (some of the) authors of some files scattered over files which are not copyright notices. An author keeps being an author even without such author statements. Finally, what do we fear? I am trying to be fair to the community and credit each and every contribution. And I consider those rather arbitrary and scattered author statements unfair. From Rich Bowen
|
I feel that the authors on top of the files are extremely inaccurate. They by far don't include people who have actually contributed to the file, and at times they have people who originally had something to do with the file, but in the meantime we've moved on in the implementation and their code is no longer there. So it's a very fuzzy inaccurate thing to have them up there. Maybe the best approximation is to look at the git blame, but that also only shows the last person touching the line, and that's not really mirroring the authorship. So I'm more in favor of removing names all together. Right now, hypothetically, even if we want to "ask for permission from authors of a file", we have pretty much no easy way to do so, and the names on top of the file are NOT the people whose permission we need. So it's pretty useless to have them up there in the first place. |
I am also in disfavor author tags in headers. I think those were useful at a time or in context where authors were only reachable by mail or where IP had to be protected. To me, those are often rather out-of-date pieces of information that we probably do not want to maintain. There are many public authoritative proofs of authorship, clean git history with verified commits (like scikit-learn's) being one of them. If one is curious about an implementation and wants to contact the author, or if one were to sue a project for copyright infringement, one probably would use them. In any case, this message is also an authorization for removing my name tag in headers. |
For the record, I tried to extract all authors and deduplicate them, and here's what I found:
It's clearly far from the actual authors we have on the files. You can try the git log for instance like this:
which gives:
|
If it is legal for us to remove authors like this, then I am in favor of removing and replacing it with: (As suggested in #20813 (comment))
I want this consistency as well, but never pushed for it because of the potential negative sentiment and gray legal area of removing authors from the source. I'll be more comfortable, if we prepared a public statement that explains that it is legal to remove the authors and replace it with "The scikit-learn developers". |
I would recommend to use
Reference: |
I'm writing a blog post for this then. EDIT: scikit-learn/blog#180 |
for what it's worth, if there is a legal question i support the change and give permission when it comes to my own name in any files. |
I read that recommendation in relation to taking code from another project and adding it to yours. In that case you shouldn't remove/modify the existing copyright notice. Which I think is quite different from what we are trying to do here. |
In #28799 we went with # Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause |
Should we adopt this scheme everywhere then? Worth making a PR that updates lots of files in one go or do it as we touch files? Somehow one big PR feels like the thing to do |
This issue got to a decision and was closed by several PRs, thanks to all contributors, in particular thanks to @adrinjalali. |
There are several pieces of information that some files contain, others do not, examples are:
# -*- coding: utf-8 -*-
at the beginning of a file# License: BSD 3 clause
# Authors:
For the following reasons, I'd like to remove the encoding info, and have a consistent policy for the other 2 points:
I haven't figured out a convincing policy for
authors
, maybe leave it as is.Note that changes of those lines of code shouldn't produce merge conflicts, cross referencing #11336.
The text was updated successfully, but these errors were encountered: