Skip to content

MAINT Refactor vector sentinel into utils #22728

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Mar 29, 2022

Conversation

thomasjpfan
Copy link
Member

@thomasjpfan thomasjpfan commented Mar 8, 2022

Reference Issues/PRs

Follow up to #22320

What does this implement/fix? Explain your changes.

This PR refactors the StdVectorSentinel into it's own file and puts it into sklearn.utils. With this PR, adding new vectors entail.

  1. Add new type to _typedef
  2. Add new vector to vector_typed
  3. Implement a StdVectorSentinel*
  4. Add Sentinel to _create_sentinel

The caller of vector_to_nd_array does not need to know anything about sentinels. They can pass in a vector and an ndarray is returned.

Any other comments?

Running benchmarks from #22320 (review) I do not see any performance difference with this refactor.

I initially had a PR ready to resolve #11540 by using vector[int64_t] + StdVectorSentinelInt64. But I think the refactoring itself deserves it's own PR.

CC @jjerphan

Comment on lines +70 to +72
StdVectorSentinel sentinel = _create_sentinel(vect_ptr)
np.ndarray arr = np.PyArray_SimpleNewFromData(
1, &size, sentinel.get_typenum(), sentinel.get_data())
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is slightly more clear than the current implementation. The get_data makes it clear that arr points to the data owned by the sentinel.

The implementation on main defines arr with the buffer from vect_ptr and then the sentinel would set the internal pointer in sentinel.vec to vect_ptr. Because only the pointers were swapped, arr is pointing to the correct place in memory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the refactoring.

I am glad that this set of features helps solving issues and improving others implementations.

This LGTM modulo a minor discussion: should we move coerce_vectors_to_nd_arrays here potentially renaming it to mention ragged arrays?

Comment on lines +70 to +72
StdVectorSentinel sentinel = _create_sentinel(vect_ptr)
np.ndarray arr = np.PyArray_SimpleNewFromData(
1, &size, sentinel.get_typenum(), sentinel.get_data())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@thomasjpfan
Copy link
Member Author

Should we move coerce_vectors_to_nd_arrays here potentially renaming it to mention ragged arrays?

I think ragged arrays in coerce_vectors_to_nd_arrays and it's fused type vector_vector_DITYPE_t is specific to RadiusNeighborhood.

In the future, if there is a need in other parts of the codebase for ragged arrays, I can see placing it into _vector_sentinel. For now, I prefer _vector_sentinel to focus on one task: converting single vectors into ndarrays.

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@jjerphan jjerphan added the Quick Review For PRs that are quick to review label Mar 16, 2022
Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines 104 to 109
config.add_extension(
"_vector_sentinel",
sources=["_vector_sentinel.pyx"],
libraries=libraries,
language="c++",
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm always confused about when is include_dirs=[numpy.get_include()] necessary.. I thought it was whenever compiling against numpy c api.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, it's strange. I added the include_dirs here to be safe.

@jeremiedbb jeremiedbb merged commit b51fbb0 into scikit-learn:main Mar 29, 2022
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Apr 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FeatureHasher support in PyPy
3 participants