
Correctly document linked libraries #27559


Closed
stefan6419846 opened this issue Oct 10, 2023 · 15 comments · Fixed by #29861

@stefan6419846

stefan6419846 commented Oct 10, 2023

Describe the issue linked to the documentation

When downloading the current wheel for scikit-learn==1.3.1, the metadata tell me that the package is subject to the terms of BSD-3-Clause. Unfortunately, this only applies to the package itself. Skimming through the distributed files, there are at least two additional cases:

Suggest a potential alternative/fix

It would be great if a full list of the external modules shipped within scikit-learn wheels, together with their copyright information, were provided so that possible license conflicts can be detected early.

@stefan6419846 added the Documentation and Needs Triage labels Oct 10, 2023
@adrinjalali
Member

Hmm, this is interesting. This basically makes it impossible for people to develop private code using scikit-learn as long as libgomp is bundled inside I think. This seems like an oversight from our side.

cc @scikit-learn/core-devs

@adrinjalali removed the Needs Triage label Oct 10, 2023
@stefan6419846
Author

IANAL, but: GCC has the runtime exception which should reduce the general risk (see copyright header as well): https://www.gnu.org/licenses/gcc-exception-3.1.html Nevertheless, if this is clearly documented on the scikit-learn side, this should at least resolve basic confusion.

@glemaitre
Member

> External code snippets under licenses like MIT, Apache-2.0 and Python-2.0

@stefan6419846 Could you provide the way you found them?

> This basically makes it impossible for people to develop private code using scikit-learn as long as libgomp is bundled inside I think.

Actually, it means that you need to build scikit-learn from source using an OpenMP implementation that is not GPL-licensed, because we don't bundle libgomp in the source package, only in the wheels.

I assume that the only way we can work around this is to always use LLVM compilers with llvm-openmp, as we already do for the macOS wheels. The licence is Apache-2.0 in this case.

@stefan6419846
Author

> Could you provide the way you found them?

I used https://github.com/stefan6419846/license_tools, a custom wrapper around https://github.com/nexB/scancode-toolkit/
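For a rough first look before running a full scanner, the contents of a wheel can be listed directly, since a wheel is a zip archive. The following is only a minimal sketch and does not use the API of either tool linked above: the function name `inspect_wheel` and the file name heuristics are assumptions made for illustration, and the result says nothing about which license actually applies to a file.

```python
import zipfile
from pathlib import PurePosixPath

# Heuristics only (assumptions): suffixes that usually mark compiled
# libraries, and name prefixes that usually mark license texts.
SHARED_LIB_SUFFIXES = (".so", ".dylib", ".dll", ".pyd")
LICENSE_NAME_PREFIXES = ("license", "licence", "copying", "notice")


def inspect_wheel(wheel_path):
    """Return (shared_libs, license_files) found inside a wheel archive."""
    shared_libs, license_files = [], []
    with zipfile.ZipFile(wheel_path) as wheel:
        for name in wheel.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            base = PurePosixPath(name).name.lower()
            # Versioned libraries such as "libfoo.so.1" carry ".so." mid-name.
            if base.endswith(SHARED_LIB_SUFFIXES) or ".so." in base:
                shared_libs.append(name)
            elif base.startswith(LICENSE_NAME_PREFIXES):
                license_files.append(name)
    return sorted(shared_libs), sorted(license_files)
```

Comparing the two lists gives a quick hint whether bundled binaries come with any accompanying license text; a real review still needs a proper scanner and a manual check.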

@glemaitre
Member

@stefan6419846 Thanks. I assume that we should be running such tools and have a proper LICENCE file integrated into the wheels.

@stefan6419846
Author

In theory you shouldn't need to run these tools regularly, but perform an initial complete review of the current code base for all external stuff to document it appropriately (and in which cases it is shipped in the official distributions) - this can be assisted by corresponding scanning tools.

Future checks usually can be subject to a general pull request review process, backed by corresponding contribution docs (when and how to include new external code, including indirect dependencies, how to ensure license compatibility ...)

@GaelVaroquaux
Member

> This basically makes it impossible for people to develop private code using scikit-learn as long as libgomp is bundled inside I think.

I don't believe that this is true.

Still, it would be good from our side to document things better.

@stefan6419846
Author

> > This basically makes it impossible for people to develop private code using scikit-learn as long as libgomp is bundled inside I think.
>
> I don't believe that this is true.

License compliance does not really allow for generalization and IANAL, but yes, it depends on how you use/distribute your applications and what the law department considers appropriate. In general, the GPL being a strict copyleft license can be an issue and a more permissive license (like Apache-2.0) might be desired, but internal use without distribution or SaaS-based usage tends to be fine at least.

@lorentzenchr
Member

As already pointed out, the GCC RUNTIME LIBRARY EXCEPTION states

1. Grant of Additional Permission.

You have permission to propagate a work of Target Code formed by combining the Runtime Library with Independent Modules, even if such propagation would otherwise violate the terms of GPLv3, provided that all Target Code was generated by Eligible Compilation Processes. You may then convey such a combination under terms of your choice, consistent with the licensing of the Independent Modules.

libgomp has this exception and is only included in our binaries (wheels) when compiling via gcc, isn't it? IANAL, I don't see a problem here. And I also don't know if it is a good idea to add anything to the docs.

License scanning, on the other hand, is usually a good idea 😏

@stefan6419846
Author

> And I also don't know if it is a good idea to add anything to the docs.

This depends on the general perspective you want to take. Yes, in general FOSS and especially the liability/warranty clauses of most licenses do not require anyone to provide such information. Such documentation can rather serve as a basic indication of the current licensing situation and provide some short hints regarding possible issues, while indicating that someone is aware of the possible implications.

Given the liability clauses above, I will always have to check the correctness of the statements as well to avoid hidden risks (studies have shown that quite a few projects do not correctly document "hidden" licenses). During such a process, I stumbled upon the current documentation limitations and decided to file this issue to further evaluate what a suitable solution could look like.

As some examples, this is how scipy and opencv-python currently handle this: https://github.com/scipy/scipy/blob/main/LICENSES_bundled.txt and https://github.com/opencv/opencv-python/blob/4.x/LICENSE-3RD-PARTY.txt

@lorentzenchr
Member

Scipy really bundles/vendors several whole libraries, i.e., they are included in the scipy source code. The only things we vendor are liblinear and libsvm, and then a few smaller code snippets like in utils/_pprint.py.

If you think a LICENSES_bundled.txt as in numpy and scipy would help, then PR welcome. This, however, will not solve the (non) issue with libgomp in the wheel.

@markdryan

> If you think a LICENSES_bundled.txt as in numpy and scipy would help, then PR welcome.

I think something like this is required. Assuming I've identified the correct licenses for liblinear and pprint and their code is included in the binary wheels, their licenses, BSD 3-Clause and PSF, require that their copyright notices and license texts be supplied with the binaries that contain them. As far as I can tell, the scikit-learn wheels do not currently do this.

Regarding libgomp, although the Runtime Exception clause applies to the scikit-learn code, I believe libgomp itself is distributed under the terms of the GPL v3, i.e., the source code from which it was built should be provided or linked to in some way. See the second paragraph of the section entitled "I use a proprietary compiler toolchain without any parts of GCC to compile my program, and link it with libstdc++" in the gcc-exception-3.1-faq. (libstdc++ is also released under the GCC Runtime Library Exception.)

Numpy and scipy have had a similar issue in the past with libgfortran, which is bundled in their binary wheels and is released under the same license as libgomp. When the numpy wheels are built, an OS-specific text file containing the licenses for all the bundled dependencies (including libgfortran) is now appended to the LICENSE.txt file included in the wheel. The entry for libgfortran in the final LICENSE.txt file contains a link to the libgfortran source code, although not, I think, to the exact version from which it was built.
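The append step described above could look roughly like the following. This is a minimal sketch of the general idea, not numpy's actual build script: the function names, the per-OS file names (`LICENSE_linux.txt` and so on) and the separator line are all hypothetical.

```python
import sys
from pathlib import Path


def bundled_license_file(tools_dir, platform=sys.platform):
    """Pick the per-OS file that lists licenses of libraries bundled in the wheel.

    The file names below are hypothetical placeholders.
    """
    if platform.startswith("linux"):
        name = "LICENSE_linux.txt"
    elif platform == "darwin":
        name = "LICENSE_macos.txt"
    elif platform == "win32":
        name = "LICENSE_windows.txt"
    else:
        raise ValueError(f"no bundled-license file known for {platform!r}")
    return Path(tools_dir) / name


def append_bundled_licenses(license_path, tools_dir, platform=sys.platform):
    """Append the OS-specific bundled-license text to the main license file."""
    license_path = Path(license_path)
    extra = bundled_license_file(tools_dir, platform).read_text(encoding="utf-8")
    base = license_path.read_text(encoding="utf-8")
    # Separate the project license from the third-party section.
    license_path.write_text(
        base.rstrip("\n") + "\n\n----\n\n" + extra, encoding="utf-8"
    )
```

Such a step would run in the wheel build job before the package is built, so that the LICENSE.txt shipped inside each wheel matches the libraries bundled for that platform.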

@lorentzenchr
Member

@thomasjpfan Could you contribute something similar to numpy/numpy#20102 concerning the licenses?

@thomasjpfan
Member

Yea, I'll contribute something like numpy/numpy#20102 for scikit-learn.

@thomasjpfan self-assigned this Apr 11, 2024
@doshi-kevin
Contributor

Is there anything to contribute here?
