-
-
Notifications
You must be signed in to change notification settings - Fork 26k
[MRG+2] ENH: Patches Nearest Centroid for metric=manhattan for sparse and dense data #3772
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ping @mblondel @robertlayton @larsmans ? |
@jnothman Do you approve of this utility in |
X, y = check_X_y(X, y, ['csr', 'csc']) | ||
if sp.issparse(X) and self.shrink_threshold: | ||
X, y = check_X_y(X, y, ['csc']) | ||
X_sparse = sp.issparse(X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
X_sparse
should be renamed to something less ambigious. We typically use X_something
to denote the dataset (i.e. X_train
). Perhaps is_X_sparse
?
@robertlayton Thanks for your reviews. Do you think such a utility (median) would be useful and should be placed in |
Short (by amount of work) answer: yes -- and probably enough for this PR Long answer: I would suggest that we move the distance stuff out of the metrics folder and put functions like this in there, with a wrapper function that you can just call: |
Thanks, and also a general doubt. How does the "manhattan" metric which is the L1 distance between two points, translate into the median? I googled but I was unable to find any answer. (Similar question for the euclidean metric too.) |
You are right I think, metric doesn't matter. The case I was thinking of was with an even number of samples, choosing the "middle" to be the median, and that doesn't change from metric to metric that I'm aware of (although maybe with some metrics it does). My bad. |
I see, so do we keep the |
I've not looked at your code, but see also the imputer which has implemented this in pure python. |
@jnothman oops. btw this is also in pure python. Should we refactor the code from imputer? |
It may not be so simple to refactor, but at least the two implementations should be close to each other, if not reusing each other's code. |
@jnothman I should have just asked if if there was already an implementation around. I can just warp around the |
53565c3
to
9a804f4
Compare
@robertlayton @jnothman Now the question is if we should have a separate keyword for the centroid being either the mean or median, or somehow answer the question that how using the |
@@ -86,8 +87,9 @@ def fit(self, X, y): | |||
y : array, shape = [n_samples] | |||
Target values (integers) | |||
""" | |||
X, y = check_X_y(X, y, ['csr', 'csc']) | |||
if sp.issparse(X) and self.shrink_threshold: | |||
X, y = check_X_y(X, y, ['csc']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a regression?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All sparse matrices, would be better to get converted to csc
, so that the median can be easily calculated across all rows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if the metric is not manhattan, should we convert?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying this because the row slice would be slower case of slicing the rows to calculate the mean (for csc) ? In that case I would explicitly convert it to csc
for manhattan
because of the median and to csr
for the other metrics since it is easier to do the row slicing to calculate the mean, if its ok with you.
@arjoly I have addressed all your comments. This comment (#3772 (comment)) remains. |
8fa7d09
to
3364721
Compare
else: | ||
self.centroids_[cur_class] = csc_row_median(X[center_mask]) | ||
else: | ||
self.centroids_[cur_class] = X[center_mask].mean(axis=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it me or only euclidian and manhattan metrics are supported?
any good reason why no docstring is modified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the other metrics are supported in the predict
method. the question is that if the manhattan metric means that the average should be the median, or should we have a separate keyword for the centroid
. Basically this comment (#3772 (comment))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really understand why the centroid is correctly computed for metrics other than euclidean and manhattan distance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not being supported. This PR just adds the manhattan averaging. I have updated the docstring.
besides LGTM. +1 for merge. |
@arjoly Done. Let me know if you have anything else? |
Could you raise an error if it's not supported? |
Beside my last my comment, +1. |
Well, it was "supported" before this PR, by falling back on to the mean. I'm not sure it will be backward compatible. Maybe a warning, saying that the centroid is calculated using the mean? |
+1 for the warning. However I think it would be safer to raise an error. Nevertheless, I am not a user of this code. |
Thanks for the reviews. I am keeping those open for some time, just to see the reviews of others. |
@robertlayton @jnothman Any final comments? |
LGTM, aside from my comment on the fact that I have the impression that the conversion to csr is a be brutal in certain cases. |
I think you meant csc. Anyhow, I've checked the conversion. Please merge if you are happy. |
if self.metric == 'manhattan': | ||
X, y = check_X_y(X, y, ['csc']) | ||
else: | ||
X, y = check_X_y(X, y, ['csr']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put back ['csr', 'csc'] allowed if not manhattan. So you don't change the old behavior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
e84b2e0
to
2cfe952
Compare
@@ -1,4 +1,4 @@ | |||
# -*- coding: utf-8 -*- | |||
# -*- coding: utf-8 -*- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hum?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure how this came.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. Please update what's new, rebase and I'll merge.
2cfe952
to
49f225a
Compare
If metric is "manhattan", then store it as csc since calculating the median is easier.
49f225a
to
e871972
Compare
[MRG+2] ENH: Patches Nearest Centroid for metric=manhattan for sparse and dense data
thanks @MechCoder ! |
Good job! |
Fixes #743
I have also added a utility for calculating the median of csc sparse matrices, since NumPy does not handle it and it is not a trivial one-liner.