
[MRG+1] Add sample weights support to kernel density estimation (fix #4394) #10803


Merged

merged 19 commits into scikit-learn:master on Jun 26, 2018

Conversation

samronsin
Contributor

@samronsin samronsin commented Mar 13, 2018

Fix #4394

This PR adds sample weight support to kernel density estimation by adding a sample_weight array to the BinaryTree object and updating the tree traversal whenever sample weights are provided, all in the binary_tree.pxi Cython code.

This PR also updates the centroid computation of the BallTree in ball_tree.pyx to take the sample weights into account.
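
For context, a minimal usage sketch of the new keyword (sizes and bandwidth are arbitrary illustrations, not values from this PR):

    import numpy as np
    from sklearn.neighbors import KernelDensity

    rng = np.random.RandomState(42)
    X = rng.randn(100, 1)
    weights = rng.uniform(0.5, 2.0, size=100)   # one positive weight per sample

    kde = KernelDensity(kernel='gaussian', bandwidth=0.5)
    kde.fit(X, sample_weight=weights)           # keyword added by this PR
    log_density = kde.score_samples(np.linspace(-3, 3, 20)[:, None])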

@jnothman
Member

jnothman commented Mar 13, 2018 via email

@samronsin
Contributor Author

Thanks, I just added such tests.
I'm wondering what the result should be for zero or negative weights.
Leaves with negative or zero total weight should probably be ignored.

@samronsin
Contributor Author

I ended up raising a ValueError for non-positive weights, since such weights would create NaNs and infs in the low-level code.

@samronsin samronsin changed the title Add sample weights support to kernel density estimation (fix #4394) [MRG] Add sample weights support to kernel density estimation (fix #4394) Mar 26, 2018
@samronsin
Contributor Author

@jnothman Is there something I can do to help this PR move forward?
@jakevdp It looks like you are the author of the modified files, so your advice would be very valuable and appreciated!

@jnothman
Member

jnothman commented Apr 2, 2018

I don't have enough time available to review atm, sorry.

I do feel like tests along the lines of:

  • ensure weighting has some effect
  • ensure weighting is equivalent to repetition
  • ensure the weight effect is scale invariant for a positive scaling factor

are sufficient from a correctness perspective. I think you've got the middle point covered.
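
A minimal sketch of the middle point (weighting is equivalent to repetition), with arbitrary sizes and bandwidth rather than the values used in the PR's test:

    import numpy as np
    from sklearn.neighbors import KernelDensity

    rng = np.random.RandomState(0)
    X = rng.rand(50, 2)
    weights = rng.randint(1, 5, size=50)            # positive integer weights
    X_repeated = np.repeat(X, weights, axis=0)      # each row repeated weights[i] times
    test_points = rng.rand(10, 2)

    kde_weighted = KernelDensity(bandwidth=0.5).fit(X, sample_weight=weights)
    kde_repeated = KernelDensity(bandwidth=0.5).fit(X_repeated)

    np.testing.assert_allclose(kde_weighted.score_samples(test_points),
                               kde_repeated.score_samples(test_points))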

@samronsin
Contributor Author

@jnothman no rush, I'll add more tests along the lines you suggested, thanks!

Member

@TomDLT TomDLT left a comment

Thanks for this work.
I have some small comments, and I still need to go through the Cython code.

Also don't forget to add the suggested tests.



def test_kde_sample_weights():
    N = 10000
Member

Please make sure the test is not too long, reducing N and T if necessary.
You can check with pytest --durations=10 sklearn/neighbors/tests/test_kde.py::test_kde_sample_weights.


def test_kde_sample_weights():
    N = 10000
    T = 100
Member

Please avoid using single-letter variables (except X).
For example: change N to n_samples.

    for d in [1, 2, 10]:
        rng = np.random.RandomState(0)
        X = rng.rand(N, d)
        W = 1 + (10 * X.sum(axis=1)).astype(np.int8)
Member

W = rng.randint(10, size=N) is probably clearer.

Contributor Author

Indeed, however the goal is not to have weights uniformly distributed, since this would be (asymptotically) equivalent to having uniform weights.
So I picked weights that are entirely determined by the (L1) norm of the vector, to have a simple pattern where weights are positive integers.

        for _ in range(w):
            repetitions.append(x.tolist())
    X_repetitions = np.array(repetitions)
    Y = rng.rand(T // d, d)
Member

Why T // d?

Contributor Author

We want the first argument (the number of test points) to be an integer, both in Python 2 and Python 3.

Member

Yes but why would you divide T by d? It is just another integer that you could name n_samples_2, isn't it?

Contributor Author

Yes, you're right, I'll rename this to make it more expressive.
I do this mainly because I want to keep the number of operations under control while increasing the dimension of the space, so I decrease the size of the test sample.

        if sample_weight is not None:
            if not hasattr(sample_weight, 'shape'):
                sample_weight = np.array(sample_weight)
            if len(sample_weight.shape) != 1:
Member

sample_weight.ndim is more appropriate.

@samronsin
Contributor Author

@TomDLT thanks for your comments! I just updated the branch accordingly.

@samronsin
Contributor Author

samronsin commented Apr 16, 2018

@jnothman I added tests for "weighting has some effect" and "weight effect is scale invariant".

@samronsin
Contributor Author

@TomDLT did you get a chance to look at the Cython code?
I was wondering if I should add a few comments to it in order to explain the general idea, namely that I simply replace all counts by sums of weights.

@jnothman
Member

Sorry for the slow review. I don't think the Cython needs comments.

@@ -1524,6 +1536,11 @@ cdef class BinaryTree:
        cdef DTYPE_t dist_LB = 0, dist_UB = 0

        cdef ITYPE_t n_samples = self.data.shape[0]
        cdef DTYPE_t Z
        if self.sample_weight is not None:
            Z = self.sum_weight
Member

Why do you need both Z and sum_weight?

Contributor Author

Z is either sum_weight or n_samples, so in this context sum_weight is just as useful as n_samples, although much costlier to produce (hence better computed just once at __init__ time).

Member

Yes, but why not store sum_weight = n_samples in __init__ and use it in all cases here, rather than introduce a new variable?

Contributor Author

That would make sense indeed, thanks for the suggestion!
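
A small sketch of what the suggestion amounts to (hypothetical helper name, plain Python rather than the actual Cython):

    import numpy as np

    def resolve_sum_weight(n_samples, sample_weight=None):
        # Normalisation constant for the density: total weight when weights
        # are given, otherwise simply the number of samples.
        if sample_weight is None:
            return float(n_samples)
        return float(np.sum(sample_weight))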

                log_density = compute_log_kernel(dist_pt, h, kernel)
                global_log_min_bound = logaddexp(global_log_min_bound,
                                                 log_density)
                if with_sample_weight:
Member

Does all this branching provide much benefit?

Contributor Author

I did not compare it to

                    for i in range(node_info.idx_start, node_info.idx_end):
                        dist_pt = self.dist(pt, data + n_features * idx_array[i],
                                            n_features)
                        log_density = compute_log_kernel(dist_pt, h, kernel)
                        if with_sample_weight:
                            log_weight = np.log(sample_weight[idx_array[i]])
                        else:
                            log_weight = 0
                        global_log_min_bound = logaddexp(global_log_min_bound,
                                                         log_density + log_weight)

but that did not look much simpler to me. Do you have another suggestion that I am missing?

Member

It looks more maintainable to me, which is my concern. Near-duplicate code makes differences hard to spot.

Contributor Author

I see your point, I'll do some benchmarking to see if this branching is actually improving anything.

Contributor Author

Removing the branching slows down the fit method (both with and without sample weights, unfortunately) by about 10-15%, for a 10^5-row sample with 20 columns on my local machine.

Contributor Author

I committed the simplified code anyway. Happy to revert the commit if we want to avoid the resulting slowdown.

" but was {1}".format(X.shape[0],
sample_weight.shape))
if sample_weight.shape[0] != X.shape[0]:
raise ValueError("X and sample_weight have incompatible "
Member

Usually we use check_consistent_length.

Contributor Author

Thanks!
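
For reference, a minimal illustration of check_consistent_length (the array sizes are arbitrary):

    import numpy as np
    from sklearn.utils import check_consistent_length

    X = np.random.rand(20, 3)
    sample_weight = np.random.rand(19)          # deliberately the wrong length
    check_consistent_length(X, sample_weight)   # raises ValueError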

@jnothman
Member

jnothman commented Jun 4, 2018 via email

Member

@jnothman jnothman left a comment

Apart from nitpicks, the tests look good, so I trust the code (which also looks good but I've only looked at it briefly so far)



def test_kde_sample_weights():
    n_samples = 2500
Member

How long does this test take to run? This seems larger than necessary to prove the point

Contributor Author

5 seconds on my laptop. Indeed, I can decrease this to a few hundred samples, which would take ~1 sec.

        X = rng.rand(n_samples, d)
        weights = 1 + (10 * X.sum(axis=1)).astype(np.int8)
        repetitions = []
        for x, w in zip(X, weights):
Member

np.repeat should work here
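
A minimal sketch of the suggested replacement for the repetition loop (variable names follow the quoted test; not necessarily the exact merged code):

    import numpy as np

    X = np.arange(6).reshape(3, 2)
    weights = np.array([1, 3, 2])
    # Each row of X is repeated weights[i] times, replacing the explicit loop.
    X_repetitions = np.repeat(X, weights, axis=0)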

Contributor Author

Thanks!

        n_samples_test = size_test // d
        test_points = rng.rand(n_samples_test, d)
        for algorithm in ['auto', 'ball_tree', 'kd_tree']:
            for metric in ['euclidean', 'minkowski', 'manhattan',
Member

Any reason to believe minkowski without parameter would work differently?

"""
algorithm = self._choose_algorithm(self.algorithm, self.metric)
X = check_array(X, order='C', dtype=DTYPE)

if sample_weight is not None:
if not hasattr(sample_weight, 'shape'):
Member

Why not use check_array?

Contributor Author

Done
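
A minimal sketch of validating sample_weight with check_array; the exact dtype/order arguments here are an assumption, not necessarily what was merged:

    import numpy as np
    from sklearn.utils import check_array

    sample_weight = [1.0, 2.0, 3.5]
    # ensure_2d=False keeps the 1-d shape instead of requiring a 2-d array.
    sample_weight = check_array(sample_weight, dtype=np.float64,
                                order='C', ensure_2d=False)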

            if not hasattr(sample_weight, 'shape'):
                sample_weight = np.array(sample_weight)
            if sample_weight.ndim != 1:
                raise ValueError("the shape of sample_weight must be ({0},),"
Member

Are these covered by tests?

Contributor Author

Added!

                global_log_min_bound[0] = logaddexp(global_log_min_bound[0],
                                                    log_dens_contribution)
                                                    log_dens_contribution +
                                                    log_weight)
Member

I think we'd usually not want this extra indentation. You could do:

(log_dens_contribution +
 log_weight)

        N2 = node_data[i2].idx_end - node_data[i2].idx_start
        if with_sample_weight:
            N1 = 0.0
            for i in range(node_data[i1].idx_start, node_data[i1].idx_end):
Member

Please create a cdef inline to avoid repetition of this idiom.
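
A hypothetical sketch of such a helper (names and signature are illustrative, not the code that was actually merged):

    cdef inline DTYPE_t _total_node_weight(NodeData_t* node_data,
                                           DTYPE_t* sample_weight,
                                           ITYPE_t* idx_array,
                                           ITYPE_t i_node) nogil:
        # Sum the weights of the points owned by node i_node, falling back to
        # the plain point count when no sample weights were provided.
        cdef ITYPE_t i
        cdef DTYPE_t N = 0.0
        if sample_weight != NULL:
            for i in range(node_data[i_node].idx_start, node_data[i_node].idx_end):
                N += sample_weight[idx_array[i]]
        else:
            N = node_data[i_node].idx_end - node_data[i_node].idx_start
        return N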

@samronsin
Contributor Author

samronsin commented Jun 4, 2018

I suppose 15% is quite substantial...

So would it be better to revert commit 8ff81e5 ?

@jnothman
Member

jnothman commented Jun 4, 2018 via email

@samronsin
Contributor Author

On a 10^5 x 20 sample, the difference is ~100 ms.
On a 10^6 x 50 sample, the difference is less consistent, and amounts to a few seconds at most.

@jnothman
Member

jnothman commented Jun 5, 2018 via email

@samronsin
Contributor Author

Is there something else I can do? Should the title be [MRG+1] now?

@jnothman
Member

It can be MRG+1 if you'd like... but we're sort of phasing that out in favour of GitHub's Approved status.

@jnothman jnothman changed the title [MRG] Add sample weights support to kernel density estimation (fix #4394) [MRG+1] Add sample weights support to kernel density estimation (fix #4394) Jun 25, 2018
@samronsin
Contributor Author

OK, I didn't know that, thanks!

Member

@TomDLT TomDLT left a comment

LGTM
Please also add an entry in doc/whats_new/v0.20.rst

                                 sample_weight.shape))
            check_consistent_length(X, sample_weight)
            if sample_weight.min() <= 0:
                raise ValueError("sample_weight must have positive values")
Member

This is not covered by the tests

Contributor Author

Good catch! Added a test for that as well.
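
A minimal sketch of such a test (names are illustrative, not necessarily the merged test):

    import numpy as np
    import pytest
    from sklearn.neighbors import KernelDensity

    def test_kde_sample_weight_invalid():
        kde = KernelDensity()
        data = np.random.rand(10, 1)
        for weights in [np.zeros(10), -np.ones(10)]:
            with pytest.raises(ValueError):
                kde.fit(data, sample_weight=weights)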

@samronsin
Contributor Author

Thank you guys @jnothman and @TomDLT for reviewing this PR!

@jnothman
Member

Thank you for a very nice contribution!

@jnothman jnothman merged commit f89131b into scikit-learn:master Jun 26, 2018
@raimon-fa

Hi! Is this feature already available? I don't see it in the documentation of sklearn.
Thanks a lot!

@TomDLT
Member

TomDLT commented Jul 13, 2018

It is not in the stable doc, but it is in the dev doc.
It means that it is not included in the latest release (0.19), but it will be included in the next version (0.20).
Version 0.20 should be released in a few weeks.
If you can't wait to use it, you can install the dev version.

Development

Successfully merging this pull request may close these issues.

weighted KDE
4 participants