-
-
Notifications
You must be signed in to change notification settings - Fork 26.2k
MAINT Remove -Wcpp warnings when compiling sklearn.cluster._hierarchical_fast #24914
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MAINT Remove -Wcpp warnings when compiling sklearn.cluster._hierarchical_fast #24914
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you once again, @OmarManzoor!
Some more hints for this PR before approval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thank you, @OmarManzoor!
cnp.float64_t[::1] m_1, | ||
cnp.float64_t[:, ::1] m_2, | ||
cnp.npy_intp[::1] coord_row, | ||
cnp.npy_intp[::1] coord_col, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those could be declared as constant, isn't it?
cnp.float64_t[::1] m_1, | |
cnp.float64_t[:, ::1] m_2, | |
cnp.npy_intp[::1] coord_row, | |
cnp.npy_intp[::1] coord_col, | |
const cnp.float64_t[::1] m_1, | |
const cnp.float64_t[:, ::1] m_2, | |
const cnp.npy_intp[::1] coord_row, | |
const cnp.npy_intp[::1] coord_col, |
cnp.npy_intp[::1] coord_row, | ||
cnp.npy_intp[::1] coord_col, | ||
cnp.float64_t[::1] res | ||
) nogil: | ||
cdef INTP size_max = coord_row.shape[0] | ||
cdef INTP n_features = m_2.shape[1] | ||
cdef INTP i, j, row, col |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for those cdef
, we could change the INTP
and DOUBLE
by the cnp.npy_intp
and cnp.float64_t
as well just for consistency in the function.
cpdef void _get_parents( | ||
nodes, | ||
heads, | ||
cnp.npy_intp[:] parents, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cnp.npy_intp[:] parents, | |
const cnp.npy_intp[:] parents, |
@@ -95,7 +96,7 @@ def _hc_get_descendent(INTP node, children, INTP n_leaves): | |||
return descendent | |||
|
|||
|
|||
def hc_get_heads(cnp.ndarray[INTP, ndim=1] parents, copy=True): | |||
def hc_get_heads(cnp.npy_intp[:] parents, copy=True): | |||
"""Returns the heads of the forest, as defined by parents. | |||
|
|||
Parameters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can change the INTP
below
def average_merge( | ||
IntFloatDict a, | ||
IntFloatDict b, | ||
cnp.intp_t[:] mask, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cnp.intp_t[:] mask, | |
const cnp.intp_t[:] mask, |
def max_merge( | ||
IntFloatDict a, | ||
IntFloatDict b, | ||
cnp.intp_t[:] mask, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cnp.intp_t[:] mask, | |
const cnp.intp_t[:] mask, |
@@ -354,8 +366,7 @@ cdef class UnionFind(object): | |||
return n | |||
|
|||
|
|||
cpdef cnp.ndarray[DTYPE_t, ndim=2] _single_linkage_label( | |||
cnp.ndarray[DTYPE_t, ndim=2] L): | |||
def _single_linkage_label(cnp.float64_t[:, :] L): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason for using def
and not cpdef
with output type?
def _single_linkage_label(cnp.float64_t[:, :] L): | |
cpdef DTYPE_t[:, :] _single_linkage_label(const cnp.float64_t[:, :] L): |
L
seems also to be constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this forces the output to be a memory view whereas we want to return the actual array and we get the error AttributeError: 'sklearn.cluster._hierarchical_fast._memoryviewslic' object has no attribute 'astype' from the place which is accessing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I see. @jjerphan What is the Cython API to then acknowledge that you want to return a NumPy array and not a memory view?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One can use cnp.ndarray
, but it is better not to depend on this interface if we do not need to.
For functions that are only called from Python code, there's little value of using return type because everything is seen as an object in Python. This is the case here.
…zoor/scikit-learn into cython_hierarchical_fast
Is it possible to remove the aliases definition and usage for aliases defined in this file, i.e.: scikit-learn/sklearn/cluster/_hierarchical_fast.pyx Lines 7 to 9 in 6dd1728
Thank you! |
…zoor/scikit-learn into cython_hierarchical_fast
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit other wise LGTM
|
||
U.union(left_cluster, right_cluster) | ||
|
||
return result_arr | ||
return result_arr.base |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think using np.asarray
is clearer here:
return result_arr.base | |
return np.asarray(result_arr) |
Note that np.asarray
does not make a copy because it uses the dtype and order from the memoryview.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't this still be an extra method call as compared to directly accessing the base attribute?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For future contributors, I think using base
will be harder to reason about. If result_arr
was a view into another array, then base
can point to a bigger base array. For example:
%%cython
import numpy as np
def func_base(double[:, :] X):
cdef:
double[:] X_small = X[0]
return np.asarray(X_small), X_small.base
import numpy as np
X = np.asarray([[1.0, 2.0, 3.0],
[3.0, 4.0, 5.0]], dtype=np.float64)
X_asarray, X_base = func_base(X)
X_asarray
# array([1., 2., 3.])
X_base
# array([[1., 2., 3.],
# [3., 4., 5.]])
For this PR, to be sure that np.asarray(result_arr) == result_arr.base
one needs to look through the entire function to confirm that result_arr
is not a view into another memoryview. Although the function is not long, using result_arr.base
does add more cognitive overhead compared to returning np.asarray(result_arr)
.
Secondly, I think the function call overhead of asarray
is small compared to the function's actual computation. In these situations, I favor code that is easier to digest by using np.asarray
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the nice explanation. Looking at the provided example, I agree that it seems more reasonable to use np.asarray here.
Thank you, @OmarManzoor. I also approve @thomasjpfan's comment (i.e. #24914 (comment)) regarding using |
…cal_fast (scikit-learn#24914) Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com> Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
…cal_fast (scikit-learn#24914) Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com> Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Reference Issues/PRs
Towards #24875
What does this implement/fix? Explain your changes.
Any other comments?