[MRG+1] Add sample weights support to kernel density estimation (fix #4394) #10803
Conversation
Thanks! Please add tests, for example to show the equivalence of weighting and repetition.

Thanks, I just added such tests.
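A minimal sketch of such a test (illustrative, not the PR's exact test): with positive integer weights, fitting with sample_weight should match fitting on the correspondingly repeated rows.

    import numpy as np
    from sklearn.neighbors import KernelDensity

    rng = np.random.RandomState(0)
    X = rng.rand(50, 2)
    weights = rng.randint(1, 5, size=50)
    X_repeated = np.repeat(X, weights, axis=0)  # weight w -> w copies of the row

    kde_weighted = KernelDensity(bandwidth=0.5).fit(X, sample_weight=weights)
    kde_repeated = KernelDensity(bandwidth=0.5).fit(X_repeated)

    test_points = rng.rand(10, 2)
    np.testing.assert_allclose(kde_weighted.score_samples(test_points),
                               kde_repeated.score_samples(test_points))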
I ended up raising a …

…tion of the index in the non-weighted case
I don't have enough time available to review at the moment, sorry. I do feel like tests along the lines suggested are sufficient from a correctness perspective. I think you've got the middle point covered.

@jnothman no rush, I'll add more tests along the lines you suggested, thanks!
Thanks for this work. I have some small comments, and I still need to go through the Cython code. Also don't forget to add the suggested tests.
sklearn/neighbors/tests/test_kde.py
    def test_kde_sample_weights():
        N = 10000
Please make sure the test is not too long, reducing N and T if necessary. You can check its duration with pytest --durations=10 sklearn/neighbors/tests/test_kde.py::test_kde_sample_weights.
sklearn/neighbors/tests/test_kde.py
    def test_kde_sample_weights():
        N = 10000
        T = 100
Please avoid using single-letter variables (except X); for example, change N to n_samples.
sklearn/neighbors/tests/test_kde.py
    for d in [1, 2, 10]:
        rng = np.random.RandomState(0)
        X = rng.rand(N, d)
        W = 1 + (10 * X.sum(axis=1)).astype(np.int8)
W = rng.randint(10, size=N) is probably clearer.
Indeed, however the goal is not to have weights uniformly distributed, since this would be (asymptotically) equivalent to having uniform weights. So I picked weights that are entirely determined by the (L1) norm of the vector, to have a simple pattern where weights are positive integers.
sklearn/neighbors/tests/test_kde.py
        for _ in range(w):
            repetitions.append(x.tolist())
    X_repetitions = np.array(repetitions)
    Y = rng.rand(T // d, d)
Why T // d?
We want the first argument (the number of test points) to be an integer, both in Python 2 and Python 3.
Yes, but why would you divide T by d? It is just another integer that you could name n_samples_2, isn't it?
Yes, you're right, I'll rename this to make it more expressive. I do this mainly because I want to keep the number of operations under control while increasing the dimension of the space, so I decrease the cardinality of the test sample.
sklearn/neighbors/kde.py
    if sample_weight is not None:
        if not hasattr(sample_weight, 'shape'):
            sample_weight = np.array(sample_weight)
        if len(sample_weight.shape) != 1:
sample_weight.ndim is more appropriate.
@TomDLT thanks for your comments! I just updated the branch accordingly.

@jnothman I added tests for "weighting has some effect" and "weight effect is scale invariant".

@TomDLT did you get any chance to look at the Cython?

Sorry for the slow review. I don't think the Cython needs comments.
sklearn/neighbors/binary_tree.pxi
    @@ -1524,6 +1536,11 @@ cdef class BinaryTree:
        cdef DTYPE_t dist_LB = 0, dist_UB = 0

        cdef ITYPE_t n_samples = self.data.shape[0]
        cdef DTYPE_t Z
        if self.sample_weight is not None:
            Z = self.sum_weight
Why do you need both Z and sum_weight?
Z is either sum_weight or n_samples, so in this context sum_weight is just as useful as n_samples although much costlier to produce (hence better to compute it just once at __init__ time).
Yes, but why not store sum_weight = n_samples in __init__ (when there are no weights) and use it in all cases here, rather than introduce a new variable?
That would make sense indeed, thanks for the suggestion!
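The idea behind the suggestion, as a small Python sketch (the real code lives in the Cython BinaryTree; class and attribute names here are illustrative):

    import numpy as np

    class TreeSketch:
        """Illustrates storing sum_weight once so queries need no branching."""
        def __init__(self, data, sample_weight=None):
            n_samples = data.shape[0]
            self.sample_weight = sample_weight
            # sum_weight degrades to n_samples in the unweighted case
            self.sum_weight = (np.sum(sample_weight) if sample_weight is not None
                               else float(n_samples))

        def normalization(self):
            # used as the normalization constant Z during density queries
            return self.sum_weight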
sklearn/neighbors/binary_tree.pxi
    log_density = compute_log_kernel(dist_pt, h, kernel)
    global_log_min_bound = logaddexp(global_log_min_bound,
                                     log_density)
    if with_sample_weight:
Does all this branching provide much benefit?
I did not compare it to

    for i in range(node_info.idx_start, node_info.idx_end):
        dist_pt = self.dist(pt, data + n_features * idx_array[i],
                            n_features)
        log_density = compute_log_kernel(dist_pt, h, kernel)
        if with_sample_weight:
            log_weight = np.log(sample_weight[idx_array[i]])
        else:
            log_weight = 0
        global_log_min_bound = logaddexp(global_log_min_bound,
                                         log_density + log_weight)

but that did not look much simpler to me. Do you have another suggestion that I am missing?
It looks more maintainable to me, which is my concern. Near-duplicate code is hard to see differences in
I see your point, I'll do some benchmarking to see if this branching is actually improving anything.
Removing the branching slows down the fit method (both with and without sample weights, unfortunately) by about 10-15%, for 10^5 samples with 20 columns on my local machine.
I committed the simplified code anyway. Happy to revert the commit if we want to avoid the resulting slowdown.
sklearn/neighbors/kde.py
" but was {1}".format(X.shape[0], | ||
sample_weight.shape)) | ||
if sample_weight.shape[0] != X.shape[0]: | ||
raise ValueError("X and sample_weight have incompatible " |
Usually we use check_consistent_length
Thanks!
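For reference, a minimal sketch of the suggested utility (check_consistent_length is an existing sklearn helper; the data here is illustrative):

    import numpy as np
    from sklearn.utils import check_consistent_length

    X = np.zeros((10, 2))
    sample_weight = np.ones(8)
    # raises ValueError: inconsistent numbers of samples [10, 8]
    check_consistent_length(X, sample_weight)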
…performance in the fit method (might want to revert)
I suppose 15% is quite substantial...
Apart from nitpicks, the tests look good, so I trust the code (which also looks good but I've only looked at it briefly so far)
sklearn/neighbors/tests/test_kde.py
    def test_kde_sample_weights():
        n_samples = 2500
How long does this test take to run? This seems larger than necessary to prove the point
5 seconds on my laptop. Indeed, I can decrease this to a few hundred samples and this would take ~1 sec.
sklearn/neighbors/tests/test_kde.py
    X = rng.rand(n_samples, d)
    weights = 1 + (10 * X.sum(axis=1)).astype(np.int8)
    repetitions = []
    for x, w in zip(X, weights):
np.repeat should work here
Thanks!
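A sketch of the suggested replacement, using the same weight construction as the test above:

    import numpy as np

    rng = np.random.RandomState(0)
    X = rng.rand(100, 2)
    weights = 1 + (10 * X.sum(axis=1)).astype(np.int8)

    # repeat each row of X according to its weight, replacing the explicit loop
    X_repetitions = np.repeat(X, weights, axis=0)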
sklearn/neighbors/tests/test_kde.py
    n_samples_test = size_test // d
    test_points = rng.rand(n_samples_test, d)
    for algorithm in ['auto', 'ball_tree', 'kd_tree']:
        for metric in ['euclidean', 'minkowski', 'manhattan',
Any reason to believe minkowski without parameter would work differently?
sklearn/neighbors/kde.py
""" | ||
algorithm = self._choose_algorithm(self.algorithm, self.metric) | ||
X = check_array(X, order='C', dtype=DTYPE) | ||
|
||
if sample_weight is not None: | ||
if not hasattr(sample_weight, 'shape'): |
Why not use check_array?
Done
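One way the validation might look with check_array (the keyword arguments shown are standard check_array options; the exact call in the PR may differ, and np.float64 stands in for the module's DTYPE constant):

    import numpy as np
    from sklearn.utils import check_array

    sample_weight = [1.0, 2.0, 3.0]  # e.g. a plain Python list
    # coerces to a C-ordered float array and accepts 1-D input
    sample_weight = check_array(sample_weight, order='C', dtype=np.float64,
                                ensure_2d=False)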
sklearn/neighbors/kde.py
    if not hasattr(sample_weight, 'shape'):
        sample_weight = np.array(sample_weight)
    if sample_weight.ndim != 1:
        raise ValueError("the shape of sample_weight must be ({0},),"
Are these covered by tests?
Added!
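A minimal sketch of such a test (hypothetical test name and data, not the PR's exact test):

    import numpy as np
    import pytest
    from sklearn.neighbors import KernelDensity

    def test_kde_sample_weight_invalid_shape():
        rng = np.random.RandomState(0)
        X = rng.rand(10, 2)
        bad_weights = rng.rand(10, 2)  # ndim != 1 should be rejected
        with pytest.raises(ValueError):
            KernelDensity().fit(X, sample_weight=bad_weights)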
sklearn/neighbors/binary_tree.pxi
    global_log_min_bound[0] = logaddexp(global_log_min_bound[0],
                                        log_dens_contribution +
                                        log_weight)
I think we'd usually not want this extra indentation. You could do:

    (log_dens_contribution +
     log_weight)
sklearn/neighbors/binary_tree.pxi
    N2 = node_data[i2].idx_end - node_data[i2].idx_start
    if with_sample_weight:
        N1 = 0.0
        for i in range(node_data[i1].idx_start, node_data[i1].idx_end):
Please create a cdef inline to avoid repetition of this idiom.
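A sketch of such a helper for binary_tree.pxi, assuming the types already used there (NodeData_t, DTYPE_t, ITYPE_t); the name _total_node_weight is illustrative:

    cdef inline DTYPE_t _total_node_weight(NodeData_t* node_data,
                                           DTYPE_t* sample_weight,
                                           ITYPE_t* idx_array,
                                           ITYPE_t i_node):
        # sum the sample weights of all points stored under node i_node
        cdef ITYPE_t i
        cdef DTYPE_t N = 0.0
        for i in range(node_data[i_node].idx_start, node_data[i_node].idx_end):
            N += sample_weight[idx_array[i]]
        return N

Call sites then reduce to N1 = _total_node_weight(node_data, sample_weight, idx_array, i1).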
So would it be better to revert commit 8ff81e5?

If the 15% was also more than a few seconds, then yes, I think so.
On a 10^5 x 20 sample, the difference is ~100ms.

So a few seconds on 1e6 might be no big deal. But others' opinions may differ.
Is there something else I can do? Should the title be [MRG+1] now?

It can be MRG+1 if you'd like... but we're sort of phasing that out in preference for GitHub's Approved thing.

Ok, I didn't know that, thanks!
LGTM
Please also add an entry in doc/whats_new/v0.20.rst
                             sample_weight.shape))
    check_consistent_length(X, sample_weight)
    if sample_weight.min() <= 0:
        raise ValueError("sample_weight must have positive values")
This is not covered by the tests
Well spotted! Added a test for that as well.
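A minimal sketch of such a test (hypothetical test name and data, not the PR's exact test):

    import numpy as np
    import pytest
    from sklearn.neighbors import KernelDensity

    def test_kde_sample_weight_must_be_positive():
        X = np.random.RandomState(0).rand(10, 2)
        for bad_weights in (np.zeros(10), -np.ones(10)):
            # zero or negative weights should raise ValueError
            with pytest.raises(ValueError):
                KernelDensity().fit(X, sample_weight=bad_weights)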
Thank you for a very nice contribution!

Hi! Is this feature already available? I don't see it in the documentation of sklearn.

It is not in the stable doc, but it is in the dev doc.
Fix #4394

This PR adds sample weights support to kernel density estimation by adding a sample_weight array to the BinaryTree object and updating the traversal of the tree whenever sample weights are detected, all in the binary_tree.pxi Cython code. It also updates the centroids of the BallTree to take the sample weights into account in the ball_tree.pyx code.
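A minimal usage sketch of the feature this PR adds (data is illustrative):

    import numpy as np
    from sklearn.neighbors import KernelDensity

    rng = np.random.RandomState(0)
    X = rng.rand(100, 2)
    weights = rng.randint(1, 10, size=100).astype(float)  # must be positive

    kde = KernelDensity(kernel='gaussian', bandwidth=0.5)
    kde.fit(X, sample_weight=weights)  # new parameter added by this PR
    log_density = kde.score_samples(X[:5])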