ENH Add common Array API tests and estimator tag #26372
we also need tests to test the newly added common checks
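A minimal sketch of what such a test could look like, assuming the new check is exposed as check_array_api_input in sklearn/utils/estimator_checks.py (the exact name and signature are assumptions here and may differ from the PR):

import pytest

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Assumed name and location of the new common check; adjust to the actual API.
from sklearn.utils.estimator_checks import check_array_api_input


def test_lda_array_api():
    # Skip when the reference Array API namespace is unavailable.
    pytest.importorskip("numpy.array_api")
    check_array_api_input(
        "LinearDiscriminantAnalysis",
        LinearDiscriminantAnalysis(),
        array_namespace="numpy.array_api",
    )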
sklearn/utils/estimator_checks.py
"""Check that the array_api Array gives the same results as ndarrays.""" | ||
xp = pytest.importorskip(array_namespace) | ||
|
||
X, y = make_classification(random_state=42) |
don't we use load_iris here instead?
I don't understand the question.
The rest of the common tests use the iris dataset; we don't seem to use make_classification anywhere else. Here we could use iris too.
Since we repeat that everywhere, I wanted to see the impact in terms of computational resources, and both options are cheap enough.
In [1]: from sklearn.datasets import load_iris
In [2]: %timeit load_iris()
215 µs ± 1.96 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [3]: from sklearn.datasets import make_classification
In [4]: %timeit make_classification(random_state=42)
164 µs ± 721 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
We could switch to load_iris for the sake of consistency, but no strong opinion.
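For reference, the swap itself would just be:

from sklearn.datasets import load_iris

# Same role as make_classification above, using the iris dataset instead.
X, y = load_iris(return_X_y=True)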
Would it be possible to consolidate the two?

It should be possible to call the standard some_xp_array.to_device(ns_specific_device_name_or_object) for namespaces that support multiple devices, or alternatively rely on asarray(data, device=ns_specific_device_name_or_object). EDIT: you might need https://github.com/data-apis/array-api-compat to expose torch as a regular array namespace, to make the code of the checker library-agnostic (beyond the namespace name used to create the arrays) and to get an Array API-compatible namespace:
>>> import array_api_compat.torch as xp
>>> xp.asarray(np.arange(10), dtype=xp.float32, device="cpu")
tensor([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
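A sketch of what a namespace-agnostic array constructor along those lines could look like; the helper name and the device handling are illustrative, not the code in this PR:

import importlib

import numpy as np


def make_xp_array(data, array_namespace, device=None, dtype=None):
    # Hypothetical helper: import the namespace by name, e.g.
    # "numpy.array_api" or "array_api_compat.torch".
    xp = importlib.import_module(array_namespace)
    if device is not None:
        # The standard asarray accepts a device argument for namespaces
        # that support multiple devices.
        return xp.asarray(data, dtype=dtype, device=device)
    return xp.asarray(data, dtype=dtype)


# e.g. make_xp_array(np.arange(10), "array_api_compat.torch", device="cpu")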
sklearn/utils/estimator_checks.py
est_xp_param_np = _convert_to_numpy(est_xp_param, xp=xp)
assert_allclose(
    attribute, est_xp_param_np, err_msg=f"{key} not the same", atol=1e-3
)
I'm actually a bit surprised to see atol=1e-3. Is there a particular reason why we have such a large tolerance? I know it is inherited from the existing tests, but I don't know the reasoning for it originally either. It just seems especially large compared to our tolerances on many other algorithms and implementations.
This is comparing results computed on a GPU and on a CPU, which can lead to bigger differences. For example, PyTorch sets atol=1e-3 and rtol=1e-3 when comparing CPU and GPU operations: https://github.com/pytorch/pytorch/blob/c618093681bfbb7fca6159de5b73ffceb5b4e0f5/test/test_ops.py#L287
Is this a good enough explanation @Micky774 or should we dig a bit more?
Worth adding Thomas' answer as a comment for people from the future who have the same question?
After seeing Olivier's change to 100 * finfo(dtype).eps, should we still add a comment?
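For reference, a dtype-relative tolerance along those lines works out to roughly 1.2e-5 for float32 and 2.2e-14 for float64; a small self-contained sketch (not the exact code of the PR):

import numpy
from numpy.testing import assert_allclose

a = numpy.linspace(0.0, 1.0, 5, dtype=numpy.float32)
# Simulate a small CPU-vs-accelerator discrepancy of a few ulps.
b = a + 10 * numpy.finfo(numpy.float32).eps
assert_allclose(a, b, atol=100 * numpy.finfo(a.dtype).eps)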
It wouldn't hurt to leave a comment about how these error bounds are tighter than PyTorch's, which is a strong precedent.
I went ahead and fixed the tests for cupy and removed the now redundant pytorch test. I also changed the
Assuming CI is green, LGTM.
CI is not green when
@betatim I also made sure that estimator checks can be executed without the dependency on
I missed some pytest imports in
Shouldn't we have a place where we explain how to create an environment in which these tests are not skipped? We're adding quite a bit of environment complexity here with a few dependencies; it'd be nice to have them documented. Otherwise this LGTM.
As soon as the optional dependencies (array-api-compat and pytorch or cupy) are installed and the matching hardware is available (e.g. CUDA for torch GPU and cupy), the relevant tests are not skipped anymore. The skip messages seem explicit enough to know what to install, no? I am worried that adding documentation on how to install such extra packages would be redundant with the installation instructions of those packages, and that our doc might not stay up to date. Or maybe we could add a list of links to the installation instructions of the aforementioned packages at the bottom of the Array API doc page in scikit-learn?
I added a short section to
When I write tests, there are hundreds of skipped tests and tons of xfailed ones; it's not reasonable to expect that one can easily find the steps to make all the tests run. But maybe it's also not super crucial. I just don't understand where we're testing the GPU code here in our CI. If we have a job doing that, I'd be happier with having the PR merged.
On the CI, we are only testing Array API support using PyTorch on CPU and NumPy's Array API implementation. For coverage, PyTorch on CPU will test the same code paths as a GPU array. The only thing we are missing from the CI is the correctness of the GPU result.
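For illustration, the parametrization pattern that makes this work looks roughly like the following; the exact (namespace, device) combinations and test names are defined in scikit-learn itself and are only assumed here:

import pytest

# Illustrative (namespace, device) pairs; a missing optional dependency makes
# pytest.importorskip skip the corresponding entry.
array_api_cases = [
    ("numpy.array_api", None),
    ("array_api_compat.torch", "cpu"),
    ("array_api_compat.cupy", None),  # needs CuPy and a CUDA device
]


@pytest.mark.parametrize("array_namespace, device", array_api_cases)
def test_estimator_array_api(array_namespace, device):
    xp = pytest.importorskip(array_namespace)
    ...  # run the common Array API check with this namespace and device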
Overall, I think this looks good.
Comments addressed. Testing on an actual GPU in CI would be nice, even if we only do it once a week or so. However, it is proving hard to find a CI with a GPU :-/
Hey @betatim, not anymore. You can try https://cirun.io: it supports GPU runners for GitHub Actions, spun up on a cloud provider like AWS. Here is a sample run: https://github.com/pystatgen/sgkit/actions/runs/5142554146/jobs/9256432281 And here is an example configuration:
My comment about it being hard might have factored in the difficulty of finding someone to pay for the costs of running CI on GPUs ;) But while I have an expert here: what alternatives/competitors exist to cirun.io?
Yeah, that's the tricky bit.
None that I'm aware of in the GPU space. I read somewhere that CircleCI supports GPUs as well on enterprise plans, but I have never tried it. There is another called BuildJet in the non-GPU space.
Cost isn't really the problem, unless you want to run it by default on PRs. Above you (@betatim) said "Even if we only do it once a week or so." - that'd be like $100/yr. It's admin'ing the integration and cloud account that's the painful part.
PyTorch CI used to run completely on CircleCI; it is very good, but also very expensive.
That's something I can help with; I already do that with some projects, if I can get a cloud account.
I tested the changes on a local machine with CuPy and left some suggestions to get the common test working with CuPy.
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
The namespace is not exactly the same as the module.
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Addressed the latest round of comments from Thomas
@thomasjpfan I think your comments were taken into account. Merge?
Thanks for the update! LGTM
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Reference Issues/PRs
#26348
What does this implement/fix? Explain your changes.
This adds a "Array API support" estimator tag that allows the common estimator checks to test if an estimator's Array API support full fills basic requirements.
The idea is to have these tests as common tests instead of having them duplicated in the tests for each estimator (#26243, #26315, #22554, #25956).
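For context, opting an estimator into these common checks goes through the usual tag mechanism; a minimal sketch, where the tag name array_api_support is assumed to follow the pattern of the existing boolean tags:

from sklearn.base import BaseEstimator


class MyArrayAPIEstimator(BaseEstimator):
    def fit(self, X, y=None):
        ...  # implementation written against the Array API rather than raw NumPy
        return self

    def _more_tags(self):
        # Opt in to the common Array API checks added in this PR.
        return {"array_api_support": True}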
Any other comments?
What would be a good way to have an additional dtype (float16) in the torch test? We can't just add it because some estimators (like LinearDiscriminantAnalysis) don't support it. But some others do. So we need a way to detect whether an estimator should work with float16 or not.
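One hypothetical way to express this would be an additional dtype-listing tag that the common test consults; nothing like this exists in the PR, it is just a sketch:

from sklearn.base import BaseEstimator


class Float16FriendlyEstimator(BaseEstimator):
    def _more_tags(self):
        return {
            "array_api_support": True,
            # Hypothetical tag: dtypes this estimator is expected to handle
            # with Array API inputs, so the common test could try float16
            # only where it is declared.
            "array_api_dtypes": ["float64", "float32", "float16"],
        }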