Skip to content

Conversation

rprkh
Copy link
Contributor

@rprkh rprkh commented Nov 15, 2022

Reference Issues/PRs

Towards #24875

What does this implement/fix? Explain your changes.

Using memory views in place of the deprecated cnp.ndarray.

Any other comments?

Comment on lines 15 to 16
def _binary_search_perplexity(
cnp.float32_t[:, :] sqdistances,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _binary_search_perplexity(
cnp.float32_t[:, :] sqdistances,
cpdef float[:, :] _binary_search_perplexity(
const float[:, :] sqdistances,

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this function is never used in cython but always in python, it's better to make it a pure def

@@ -63,7 +63,7 @@ cpdef cnp.ndarray[cnp.float32_t, ndim=2] _binary_search_perplexity(

# This array is later used as a 32bit array. It has multiple intermediate
# floating point additions that benefit from the extra precision
cdef cnp.ndarray[cnp.float64_t, ndim=2] P = np.zeros(
cdef cnp.float64_t[:, :] P = np.zeros(
(n_samples, n_neighbors), dtype=np.float64)

for i in range(n_samples):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to change the return line by:

return P.astype(np.float32)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the current behavior. Currently it returns a float64 array (this is because it's using the def part of the function which doesn't care about the return type)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather keep the current behavior

@rprkh
Copy link
Contributor Author

rprkh commented Nov 16, 2022

Included the suggested changes.

Edit: Some tests are still failing as the memoryviewslice has no ravel attribute. I tried looking for solutions on SO and going through some of the code, but I'm not entirely sure what the fix should be.

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @rprkh.

Apart from a few comments regarding dtype preservation, everything LGTM.

@@ -118,4 +116,4 @@ cpdef cnp.ndarray[cnp.float32_t, ndim=2] _binary_search_perplexity(
if verbose:
print("[t-SNE] Mean sigma: %f"
% np.mean(math.sqrt(n_samples / beta_sum)))
return P
return np.array(P, dtype=np.float32)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think preserving dtype might better be treated in another PR.

For now, I propose just returning the numpy array using the base attribute of the memoryview.

Suggested change
return np.array(P, dtype=np.float32)
return P.base

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was discussed somewhere that np.asarray should be preferred because .base can point to a larger array than expected (we can have a view on only a subset of the array)

@rprkh rprkh force-pushed the remove_wcpp_warnings branch from 70e5412 to 6505d3d Compare November 21, 2022 17:49
@rprkh
Copy link
Contributor Author

rprkh commented Nov 22, 2022

Included the suggestions.

Comment on lines 64 to 65
cdef double[:, :] P = np.zeros(
(n_samples, n_neighbors), dtype=np.float64)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the failing tests are related to the dtype of P over here.

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @rprkh is right in https://github.com/scikit-learn/scikit-learn/pull/24925/files#r1028804071: P is allocated with float64 but returned (and implicitly cast to float32).

That's the first time I stumble upon this behavior in Cython.

For now, I would simply cast it explicitly using what @glemaitre proposed earlier and add a TODO comment indicating that this function both support float32 and float64 and preserves inputs' dtype.

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM given that the CI passes.

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @rprkh

@jjerphan jjerphan merged commit f2f3b3c into scikit-learn:main Nov 30, 2022
@rprkh
Copy link
Contributor Author

rprkh commented Nov 30, 2022

This was my first time working with Cython. Thanks for the reviews and suggestions @glemaitre @jjerphan @jeremiedbb. I look forward to contributing more in the future.

@rprkh rprkh deleted the remove_wcpp_warnings branch November 30, 2022 19:07
@jjerphan
Copy link
Member

Glad to read that!

Feel free to pick other similar issues or to ask if you want to work on something more involved.

jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023
scikit-learn#24925)

Co-authored-by: jeremie du boisberranger <jeremiedbb@yahoo.fr>
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023
scikit-learn#24925)

Co-authored-by: jeremie du boisberranger <jeremiedbb@yahoo.fr>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants