Skip to content

Conversation

OmarManzoor
Copy link
Contributor

Reference Issues/PRs

Towards #24875

What does this implement/fix? Explain your changes.

  • Remove Wcpp warnings when compiling sklearn.cluster._hierarchical_fast. However the cnp.import_array() statement is kept.

Any other comments?

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you once again, @OmarManzoor!

Some more hints for this PR before approval.

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you, @OmarManzoor!

@glemaitre glemaitre self-requested a review November 17, 2022 10:01
Comment on lines 33 to 36
cnp.float64_t[::1] m_1,
cnp.float64_t[:, ::1] m_2,
cnp.npy_intp[::1] coord_row,
cnp.npy_intp[::1] coord_col,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those could be declared as constant, isn't it?

Suggested change
cnp.float64_t[::1] m_1,
cnp.float64_t[:, ::1] m_2,
cnp.npy_intp[::1] coord_row,
cnp.npy_intp[::1] coord_col,
const cnp.float64_t[::1] m_1,
const cnp.float64_t[:, ::1] m_2,
const cnp.npy_intp[::1] coord_row,
const cnp.npy_intp[::1] coord_col,

cnp.npy_intp[::1] coord_row,
cnp.npy_intp[::1] coord_col,
cnp.float64_t[::1] res
) nogil:
cdef INTP size_max = coord_row.shape[0]
cdef INTP n_features = m_2.shape[1]
cdef INTP i, j, row, col
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for those cdef, we could change the INTP and DOUBLE by the cnp.npy_intp and cnp.float64_t as well just for consistency in the function.

cpdef void _get_parents(
nodes,
heads,
cnp.npy_intp[:] parents,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cnp.npy_intp[:] parents,
const cnp.npy_intp[:] parents,

@@ -95,7 +96,7 @@ def _hc_get_descendent(INTP node, children, INTP n_leaves):
return descendent


def hc_get_heads(cnp.ndarray[INTP, ndim=1] parents, copy=True):
def hc_get_heads(cnp.npy_intp[:] parents, copy=True):
"""Returns the heads of the forest, as defined by parents.

Parameters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can change the INTP below

def average_merge(
IntFloatDict a,
IntFloatDict b,
cnp.intp_t[:] mask,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cnp.intp_t[:] mask,
const cnp.intp_t[:] mask,

def max_merge(
IntFloatDict a,
IntFloatDict b,
cnp.intp_t[:] mask,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cnp.intp_t[:] mask,
const cnp.intp_t[:] mask,

@@ -354,8 +366,7 @@ cdef class UnionFind(object):
return n


cpdef cnp.ndarray[DTYPE_t, ndim=2] _single_linkage_label(
cnp.ndarray[DTYPE_t, ndim=2] L):
def _single_linkage_label(cnp.float64_t[:, :] L):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for using def and not cpdef with output type?

Suggested change
def _single_linkage_label(cnp.float64_t[:, :] L):
cpdef DTYPE_t[:, :] _single_linkage_label(const cnp.float64_t[:, :] L):

L seems also to be constant.

Copy link
Contributor Author

@OmarManzoor OmarManzoor Nov 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this forces the output to be a memory view whereas we want to return the actual array and we get the error AttributeError: 'sklearn.cluster._hierarchical_fast._memoryviewslic' object has no attribute 'astype' from the place which is accessing it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I see. @jjerphan What is the Cython API to then acknowledge that you want to return a NumPy array and not a memory view?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One can use cnp.ndarray, but it is better not to depend on this interface if we do not need to.

For functions that are only called from Python code, there's little value of using return type because everything is seen as an object in Python. This is the case here.

@jjerphan
Copy link
Member

Is it possible to remove the aliases definition and usage for aliases defined in this file, i.e.:

ctypedef cnp.float64_t DOUBLE
ctypedef cnp.npy_intp INTP
ctypedef cnp.int8_t INT8

Thank you!

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit other wise LGTM


U.union(left_cluster, right_cluster)

return result_arr
return result_arr.base
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think using np.asarray is clearer here:

Suggested change
return result_arr.base
return np.asarray(result_arr)

Note that np.asarray does not make a copy because it uses the dtype and order from the memoryview.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this still be an extra method call as compared to directly accessing the base attribute?

Copy link
Member

@thomasjpfan thomasjpfan Nov 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future contributors, I think using base will be harder to reason about. If result_arr was a view into another array, then base can point to a bigger base array. For example:

%%cython
import numpy as np

def func_base(double[:, :] X):
    cdef:
        double[:] X_small = X[0]
    
    return np.asarray(X_small), X_small.base
import numpy as np

X = np.asarray([[1.0, 2.0, 3.0],
                [3.0, 4.0, 5.0]], dtype=np.float64)
X_asarray, X_base = func_base(X)

X_asarray
# array([1., 2., 3.])

X_base
# array([[1., 2., 3.],
#       [3., 4., 5.]])

For this PR, to be sure that np.asarray(result_arr) == result_arr.base one needs to look through the entire function to confirm that result_arr is not a view into another memoryview. Although the function is not long, using result_arr.base does add more cognitive overhead compared to returning np.asarray(result_arr).

Secondly, I think the function call overhead of asarray is small compared to the function's actual computation. In these situations, I favor code that is easier to digest by using np.asarray.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the nice explanation. Looking at the provided example, I agree that it seems more reasonable to use np.asarray here.

@jjerphan jjerphan merged commit f9453d5 into scikit-learn:main Nov 28, 2022
@jjerphan
Copy link
Member

Thank you, @OmarManzoor.

I also approve @thomasjpfan's comment (i.e. #24914 (comment)) regarding using np.asarray over the memoryviews's base attribute.

@OmarManzoor OmarManzoor deleted the cython_hierarchical_fast branch November 28, 2022 11:44
jjerphan added a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023
…cal_fast (scikit-learn#24914)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
jjerphan added a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023
…cal_fast (scikit-learn#24914)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants