TST Extend tests for scipy.sparse.*array in sklearn/cluster/tests/test_optics.py #27104


Merged: 13 commits, Oct 30, 2023

Conversation

marenwestermann
Member

Reference Issues/PRs

#27090

What does this implement/fix? Explain your changes.

Makes use of the CSR_CONTAINERS fix to test sparse arrays.
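For context, here is a minimal sketch of a CSR_CONTAINERS-style parametrization (illustrative only; the real constant lives in scikit-learn's test utilities and may differ in detail):

```python
# Illustrative sketch of the CSR_CONTAINERS pattern; the real definition in
# scikit-learn may differ.
import numpy as np
import pytest
from scipy.sparse import csr_matrix

CSR_CONTAINERS = [csr_matrix]
try:
    # csr_array only exists in sufficiently recent scipy releases
    from scipy.sparse import csr_array

    CSR_CONTAINERS.append(csr_array)
except ImportError:
    pass


@pytest.mark.parametrize("csr_container", CSR_CONTAINERS)
def test_roundtrip(csr_container):
    # Each test runs once per container type, covering both the legacy
    # sparse matrix and the newer sparse array API.
    X = csr_container(np.eye(2))
    assert X.toarray().shape == (2, 2)
```

Parametrizing over the containers is what lets a single test body exercise both `csr_matrix` and `csr_array` inputs.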

Any other comments?

I added handling for a NotImplementedError because I got the following test failure:

________________________________________________________________________________ test_precomputed_dists[float64-csr_array-True] ________________________________________________________________________________

is_sparse = True, global_dtype = <class 'numpy.float64'>, csr_container = <class 'scipy.sparse._arrays.csr_array'>

    @pytest.mark.parametrize("is_sparse", [False, True])
    @pytest.mark.parametrize("csr_container", CSR_CONTAINERS)
    def test_precomputed_dists(is_sparse, global_dtype, csr_container):
        redX = X[::2].astype(global_dtype, copy=False)
        print("redX", redX)
        dists = pairwise_distances(redX, metric="euclidean")
        print("dists", dists)
        dists = csr_container(dists) if is_sparse else dists
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", EfficiencyWarning)
>           clust1 = OPTICS(min_samples=10, algorithm="brute", metric="precomputed").fit(
                dists
            )

sklearn/cluster/tests/test_optics.py:815:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
sklearn/base.py:1215: in wrapper
    return fit_method(estimator, *args, **kwargs)
sklearn/cluster/_optics.py:348: in fit
    ) = memory.cache(compute_optics_graph)(
../../mambaforge/envs/sklearn-dev/lib/python3.10/site-packages/joblib/memory.py:349: in __call__
    return self.func(*args, **kwargs)
sklearn/utils/_param_validation.py:211: in wrapper
    return func(*args, **kwargs)
sklearn/cluster/_optics.py:617: in compute_optics_graph
    _set_reach_dist(
sklearn/cluster/_optics.py:668: in _set_reach_dist
    dists = X[point_index, unproc]
../../mambaforge/envs/sklearn-dev/lib/python3.10/site-packages/scipy/sparse/_index.py:57: in __getitem__
    self._raise_on_1d_array_slice()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <30x30 sparse array of type '<class 'numpy.float64'>'
	with 900 stored elements in Compressed Sparse Row format>

    def _raise_on_1d_array_slice(self):
        """We do not currently support 1D sparse arrays.

        This function is called each time that a 1D array would
        result, raising an error instead.

        Once 1D sparse arrays are implemented, it should be removed.
        """
        if self._is_array:
>           raise NotImplementedError(
                'We have not yet implemented 1D sparse slices; '
                'please index using explicit indices, e.g. `x[:, [0]]`'
            )
E           NotImplementedError: We have not yet implemented 1D sparse slices; please index using explicit indices, e.g. `x[:, [0]]`

../../mambaforge/envs/sklearn-dev/lib/python3.10/site-packages/scipy/sparse/_index.py:41: NotImplementedError

Not sure if this is the best solution.
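The limitation can be reproduced outside the test suite. A minimal sketch (the behavior depends on the scipy version; newer releases have since implemented 1D sparse slicing, so the except branch may not trigger there):

```python
import numpy as np
from scipy import sparse

X = sparse.csr_array(np.arange(16, dtype=np.float64).reshape(4, 4))
try:
    # Selecting a single row with fancy column indices would yield a 1D
    # result, which older scipy versions did not implement for sparse arrays.
    print(X[0, [1, 2]])
except NotImplementedError as exc:
    print("not supported:", exc)
```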

@github-actions

github-actions bot commented Aug 18, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: c1d74d0. Link to the linter CI: here

@glemaitre glemaitre removed their request for review August 21, 2023 12:03
@jjerphan jjerphan changed the title TST Extend tests for scipy.sparse.*array in sklearn/cluster/tests/test_optics.py Aug 24, 2023
@ogrisel
Member

ogrisel commented Sep 11, 2023

@marenwestermann do you need help with this? Is the above information enough for you to investigate? Otherwise I can try to dig further and give more specific guidance.

@glemaitre
Member

One thing I noticed is that our check_array can actually fail on sparse matrix formats other than CSC and CSR. #27336 should solve the problem.

@marenwestermann
Member Author

Thanks @ogrisel for asking. I actually do need help. I applied the suggested change, i.e.

dists = X[[point_index], unproc]

and this resulted in the following error:

============================================================================================= test session starts ==============================================================================================
platform darwin -- Python 3.10.8, pytest-7.2.0, pluggy-1.0.0
rootdir: /Users/maren/open-source/scikit-learn, configfile: setup.cfg
plugins: xdist-3.1.0, cov-4.0.0
collected 153 items / 145 deselected / 8 selected

sklearn/cluster/tests/test_optics.py ssss.F..                                                                                                                                                            [100%]

=================================================================================================== FAILURES ===================================================================================================
_______________________________________________________________________________ test_precomputed_dists[float64-csr_matrix-True] ________________________________________________________________________________

is_sparse = True, global_dtype = <class 'numpy.float64'>, csr_container = <class 'scipy.sparse._csr.csr_matrix'>

    @pytest.mark.parametrize("is_sparse", [False, True])
    @pytest.mark.parametrize("csr_container", CSR_CONTAINERS)
    def test_precomputed_dists(is_sparse, global_dtype, csr_container):
        redX = X[::2].astype(global_dtype, copy=False)
        dists = pairwise_distances(redX, metric="euclidean")
        dists = csr_container(dists) if is_sparse else dists
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", EfficiencyWarning)
>           clust1 = OPTICS(min_samples=10, algorithm="brute", metric="precomputed").fit(
                dists
            )

sklearn/cluster/tests/test_optics.py:814:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
sklearn/base.py:1215: in wrapper
    return fit_method(estimator, *args, **kwargs)
sklearn/cluster/_optics.py:348: in fit
    ) = memory.cache(compute_optics_graph)(
../../mambaforge/envs/sklearn-dev/lib/python3.10/site-packages/joblib/memory.py:349: in __call__
    return self.func(*args, **kwargs)
sklearn/utils/_param_validation.py:211: in wrapper
    return func(*args, **kwargs)
sklearn/cluster/_optics.py:617: in compute_optics_graph
    _set_reach_dist(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

core_distances_ = array([ 4.60937947,  4.59708404,  5.52672977,  5.29246467,  4.82327335,
        2.6358904 ,  3.14641026,  3.03184371, ...006936,  2.53532733,  2.52139009,  2.37094714,
        6.59694387,  8.98309419, 10.91542663,  8.9213898 ,  7.13039835])
reachability_ = array([inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf,
       inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf,
       inf, inf, inf, inf])
predecessor_ = array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]), point_index = 0
processed = array([ True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False])
X = <30x30 sparse matrix of type '<class 'numpy.float64'>'
	with 900 stored elements in Compressed Sparse Row format>, nbrs = NearestNeighbors(algorithm='brute', metric='precomputed', n_neighbors=10)
metric = 'precomputed', metric_params = None, p = 2, max_eps = inf

    def _set_reach_dist(
        core_distances_,
        reachability_,
        predecessor_,
        point_index,
        processed,
        X,
        nbrs,
        metric,
        metric_params,
        p,
        max_eps,
    ):
        P = X[point_index : point_index + 1]
        # Assume that radius_neighbors is faster without distances
        # and we don't need all distances, nevertheless, this means
        # we may be doing some work twice.
        indices = nbrs.radius_neighbors(P, radius=max_eps, return_distance=False)[0]

        # Getting indices of neighbors that have not been processed
        unproc = np.compress(~np.take(processed, indices), indices)
        # Neighbors of current point are already processed.
        if not unproc.size:
            return

        # Only compute distances to unprocessed neighbors:
        if metric == "precomputed":
            dists = X[[point_index], unproc]
            if issparse(dists):
                dists.sort_indices()
                dists = dists.data
        else:
            _params = dict() if metric_params is None else metric_params.copy()
            if metric == "minkowski" and "p" not in _params:
                # the same logic as neighbors, p is ignored if explicitly set
                # in the dict params
                _params["p"] = p
            dists = pairwise_distances(P, X[unproc], metric, n_jobs=None, **_params).ravel()

        rdists = np.maximum(dists, core_distances_[point_index])
        np.around(rdists, decimals=np.finfo(rdists.dtype).precision, out=rdists)
        improved = np.where(rdists < np.take(reachability_, unproc))
>       reachability_[unproc[improved]] = rdists[improved]
E       IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

sklearn/cluster/_optics.py:683: IndexError
=========================================================================================== short test summary info ============================================================================================
FAILED sklearn/cluster/tests/test_optics.py::test_precomputed_dists[float64-csr_matrix-True] - IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
============================================================================ 1 failed, 3 passed, 4 skipped, 145 deselected in 0.46s ============================================================================

I then tried the following:

reachability_[unproc[[improved]]] = rdists[improved]

and equally

predecessor_[unproc[[improved]]] = point_index

and this fixed the index errors but resulted in the following error:

============================================================================================= test session starts ==============================================================================================
platform darwin -- Python 3.10.8, pytest-7.2.0, pluggy-1.0.0
rootdir: /Users/maren/open-source/scikit-learn, configfile: setup.cfg
plugins: xdist-3.1.0, cov-4.0.0
collected 153 items / 145 deselected / 8 selected

sklearn/cluster/tests/test_optics.py ssss.F..                                                                                                                                                            [100%]

=================================================================================================== FAILURES ===================================================================================================
_______________________________________________________________________________ test_precomputed_dists[float64-csr_matrix-True] ________________________________________________________________________________

is_sparse = True, global_dtype = <class 'numpy.float64'>, csr_container = <class 'scipy.sparse._csr.csr_matrix'>

    @pytest.mark.parametrize("is_sparse", [False, True])
    @pytest.mark.parametrize("csr_container", CSR_CONTAINERS)
    def test_precomputed_dists(is_sparse, global_dtype, csr_container):
        redX = X[::2].astype(global_dtype, copy=False)
        dists = pairwise_distances(redX, metric="euclidean")
        dists = csr_container(dists) if is_sparse else dists
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", EfficiencyWarning)
            clust1 = OPTICS(min_samples=10, algorithm="brute", metric="precomputed").fit(
                dists
            )
        clust2 = OPTICS(min_samples=10, algorithm="brute", metric="euclidean").fit(redX)

>       assert_allclose(clust1.reachability_, clust2.reachability_)

sklearn/cluster/tests/test_optics.py:819:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

actual = array([       inf, 4.60937947, 4.59708404, 4.59708404, 4.59708404,
       3.30445873, 2.6358904 , 2.6358904 , 2.635890...4 , 2.6358904 , 2.52139009, 2.6358904 , 2.52139009,
       4.31006936, 8.38294501, 6.59694387, 8.51939412, 8.51939412])
desired = array([       inf, 4.60937947, 4.59708404, 4.59708404, 4.59708404,
       3.30445873, 2.6358904 , 2.6358904 , 2.635890...4 , 2.6358904 , 2.6358904 , 2.53532733, 2.52139009,
       4.31006936, 7.13039835, 6.59694387, 7.13039835, 4.89628456])
rtol = 1e-07, atol = 0.0, equal_nan = True, err_msg = '', verbose = True

    def assert_allclose(
        actual, desired, rtol=None, atol=0.0, equal_nan=True, err_msg="", verbose=True
    ):
        """dtype-aware variant of numpy.testing.assert_allclose

        This variant introspects the least precise floating point dtype
        in the input argument and automatically sets the relative tolerance
        parameter to 1e-4 float32 and use 1e-7 otherwise (typically float64
        in scikit-learn).

        `atol` is always left to 0. by default. It should be adjusted manually
        to an assertion-specific value in case there are null values expected
        in `desired`.

        The aggregate tolerance is `atol + rtol * abs(desired)`.

        Parameters
        ----------
        actual : array_like
            Array obtained.
        desired : array_like
            Array desired.
        rtol : float, optional, default=None
            Relative tolerance.
            If None, it is set based on the provided arrays' dtypes.
        atol : float, optional, default=0.
            Absolute tolerance.
        equal_nan : bool, optional, default=True
            If True, NaNs will compare equal.
        err_msg : str, optional, default=''
            The error message to be printed in case of failure.
        verbose : bool, optional, default=True
            If True, the conflicting values are appended to the error message.

        Raises
        ------
        AssertionError
            If actual and desired are not equal up to specified precision.

        See Also
        --------
        numpy.testing.assert_allclose

        Examples
        --------
        >>> import numpy as np
        >>> from sklearn.utils._testing import assert_allclose
        >>> x = [1e-5, 1e-3, 1e-1]
        >>> y = np.arccos(np.cos(x))
        >>> assert_allclose(x, y, rtol=1e-5, atol=0)
        >>> a = np.full(shape=10, fill_value=1e-5, dtype=np.float32)
        >>> assert_allclose(a, 1e-5)
        """
        dtypes = []

        actual, desired = np.asanyarray(actual), np.asanyarray(desired)
        dtypes = [actual.dtype, desired.dtype]

        if rtol is None:
            rtols = [1e-4 if dtype == np.float32 else 1e-7 for dtype in dtypes]
            rtol = max(rtols)

>       np_assert_allclose(
            actual,
            desired,
            rtol=rtol,
            atol=atol,
            equal_nan=equal_nan,
            err_msg=err_msg,
            verbose=verbose,
        )
E       AssertionError:
E       Not equal to tolerance rtol=1e-07, atol=0
E
E       Mismatched elements: 9 / 30 (30%)
E       Max absolute difference: 5.74705461
E       Max relative difference: 2.18030864
E        x: array([     inf, 4.609379, 4.597084, 4.597084, 4.597084, 3.304459,
E              2.63589 , 2.63589 , 2.63589 , 2.52139 , 4.597084, 2.63589 ,
E              2.63589 , 8.382945, 2.63589 , 5.223714, 5.223714, 4.872887,...
E        y: array([     inf, 4.609379, 4.597084, 4.597084, 4.597084, 3.304459,
E              2.63589 , 2.63589 , 2.63589 , 2.63589 , 4.597084, 2.63589 ,
E              2.63589 , 2.63589 , 2.63589 , 4.737295, 4.543899, 4.872887,...

sklearn/utils/_testing.py:291: AssertionError
=========================================================================================== short test summary info ============================================================================================
FAILED sklearn/cluster/tests/test_optics.py::test_precomputed_dists[float64-csr_matrix-True] - AssertionError:
============================================================================ 1 failed, 3 passed, 4 skipped, 145 deselected in 0.37s ============================================================================

I noticed that the shape of reachability_[unproc[[improved]]] changed from e.g. (29,) previously to e.g. (1, 2, 29). I don't know if this is actually desired.

So this is where I currently don't know how to proceed. I had a look at @glemaitre's last comment, but from my understanding that bug doesn't cause the problem described here. I pulled the recent changes from upstream main, but the problem described here persists.
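For what it's worth, the extra dimensions come from how numpy interprets the result of np.where when it is wrapped in additional brackets. A small standalone sketch (toy data, not the scikit-learn code itself):

```python
import numpy as np

unproc = np.array([3, 7, 9])        # indices of unprocessed neighbors
rdists = np.array([0.5, 2.0, 0.1])  # candidate reachability distances

# np.where returns a *tuple* of index arrays: here (array([0, 2]),)
improved = np.where(rdists < 1.0)

# Indexing with the tuple directly keeps the result 1D: [3 9]
print(unproc[improved])

# Wrapping the tuple in extra brackets nests it inside a list, which numpy
# converts to a higher-dimensional index array, inflating the result's shape.
print(unproc[[improved]].shape)
```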

@marenwestermann
Member Author

PS: I finally have time again to be active on scikit-learn and I'm excited to be back. 🎉

@glemaitre glemaitre self-requested a review September 15, 2023 09:05
@glemaitre
Member

Looking closer at the code, we need to change the following:

        dists = X[point_index, unproc]
        if issparse(dists):
            dists.sort_indices()
            dists = dists.data

to

        dists = X[[point_index], unproc]
        if isinstance(dists, np.matrix):
            dists = dists.A1

So:

  • we need to index using [point_index], as mentioned previously.
  • the current if issparse(dists) cannot be reached: indexing a sparse matrix returns a numpy matrix and indexing a sparse array returns a numpy array. So we never get a sparse matrix back.
  • The idea here is to handle everything as if it were a numpy array (which is what the code does). We can recover a 1-D numpy array from the numpy matrix using the A1 attribute, which returns the flattened data.

I tried locally and the tests are passing.
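As a self-contained illustration of the suggested branch (toy data and variable names, not the scikit-learn code itself):

```python
import numpy as np
from scipy import sparse

X = sparse.csr_matrix(np.arange(9, dtype=np.float64).reshape(3, 3))
point_index, unproc = 0, np.array([1, 2])

# Fancy-indexing a sparse *matrix* with two index arrays yields an np.matrix
dists = X[[point_index], unproc]
if isinstance(dists, np.matrix):
    # .A1 flattens an np.matrix into a plain 1-D ndarray
    dists = dists.A1

print(dists)  # [1. 2.]
```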

@glemaitre glemaitre requested review from adrinjalali and removed request for glemaitre September 15, 2023 11:38
@marenwestermann
Member Author

I see, that makes sense. Thanks a lot for the detailed explanation @glemaitre. I applied the suggested changes.

@glemaitre glemaitre self-requested a review September 15, 2023 19:53
Member

@glemaitre glemaitre left a comment


A couple of comments to reduce the number of tests to the minimum.

@marenwestermann
Member Author

Good catch, I wouldn't have seen this myself. I had a careful look through the changes and tested them and everything looks good.

Member

@glemaitre glemaitre left a comment


Thanks @marenwestermann LGTM.

Member

@ogrisel ogrisel left a comment


This PR fixes the code of the scikit-learn library (not just the test suite) and therefore needs a dedicated entry in sklearn/doc/whats_new/v1.4.rst to document the fact that sklearn.cluster.OPTICS now accepts scipy sparse arrays as inputs.

Take the changelog entry for NMF as a reference when writing this entry.
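An entry along these lines might work (hypothetical wording; whether it is tagged |Fix| or |Enhancement|, and the exact phrasing, is a maintainer call):

```rst
- |Fix| :class:`cluster.OPTICS` with ``metric="precomputed"`` now accepts
  sparse data in the form of scipy sparse arrays in addition to sparse
  matrices. :pr:`27104` by :user:`Maren Westermann <marenwestermann>`.
```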

Other than that here is a suggestion for further consistency in the code (even if it does not change the behavior seen by the tests):

@marenwestermann
Member Author

I don't think the CI failures are due to my changes (unless it's a problem that this PR, which is mentioned in the changelog, isn't merged yet).

@glemaitre
Member

Hmm. Could you resolve the conflict in the changelog?

@marenwestermann
Member Author

Ah, I didn't see that it was a merge conflict; I fixed it. Please note that somewhere in the restructuring of the changelog, the entry for #26602 seems to have gotten lost.

@marenwestermann
Member Author

Looks like the test test_precomputed_dists needs to be updated. I'll have a proper look at it tonight or tomorrow.

        dists = np.asarray(dists)
    elif hasattr(dists, "toarray"):
        # X is a scipy sparse array.
        dists = dists.toarray()
Member


@marenwestermann do you understand why this branch of the condition is not already covered by test_precomputed_dists?

I am a bit confused.

Member Author


@ogrisel The issue is that previously, dists was a sparse matrix for the test case "is_sparse"=True. Now, when we test the csr_matrix case, dists becomes a numpy matrix (which is currently covered in the code), and when we test the csr_array case, dists becomes a numpy array. So we never hit the sparse-array branch.

Now the question is: should dists be sparse when X is sparse (as was previously the case), or is the current implementation ok? I don't have a deep enough understanding of the OPTICS estimator to make a judgement.

Member


Just a minimal example to understand what happens here:

In [5]: from scipy import sparse

In [6]: X = sparse.random(10, 10, format="csr")

In [7]: X[0, :]
Out[7]: 
<1x10 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in Compressed Sparse Row format>

In [8]: X[0, [1, 2]]
Out[8]: 
<1x2 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in Compressed Sparse Row format>

In [9]: X[[0], [1, 2]]
Out[9]: matrix([[0., 0.]])

Since we index both dimensions, we get an np.matrix, which is the difference from the previous code.

In my view, we no longer need to cover the case where we would get a sparse matrix, because we already get an np.matrix or an np.ndarray.
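For comparison, the csr_array counterpart of the same indexing (sparse arrays follow numpy indexing semantics, so no np.matrix is involved; this sketch assumes a scipy version that provides csr_array):

```python
import numpy as np
from scipy import sparse

X = sparse.csr_array(np.eye(10))
# Element-wise fancy indexing of a sparse *array* returns a plain ndarray,
# selecting elements (0, 1) and (0, 2).
out = X[[0], [1, 2]]

print(type(out))
```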

Member


@marenwestermann do you think you can reduce the condition to only handle np.matrix or np.ndarray (no more sparse matrices)? Once done, I think we are good to merge.

@glemaitre glemaitre merged commit 38d499b into scikit-learn:main Oct 30, 2023
@glemaitre
Member

This one was waiting for merge. Thanks @marenwestermann

RUrlus pushed a commit to RUrlus/scikit-learn that referenced this pull request Oct 30, 2023
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Oct 31, 2023
@marenwestermann marenwestermann deleted the test-optics branch October 31, 2023 17:17
REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023