Skip to content

[MRG+2] ENH: Patches Nearest Centroid for metric=manhattan for sparse and dense data #3772

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Oct 21, 2014

Conversation

MechCoder
Copy link
Member

Fixes #743

I have also added a utility for calculating the median of csc sparse matrices, since NumPy does not handle it and it is not a trivial one-liner.

@MechCoder MechCoder changed the title ENH: Patches Nearest Centroid for metric=manhattan for sparse and dense data [MRG] ENH: Patches Nearest Centroid for metric=manhattan for sparse and dense data Oct 14, 2014
@MechCoder
Copy link
Member Author

ping @mblondel @robertlayton @larsmans ?

@MechCoder
Copy link
Member Author

@jnothman Do you approve of this utility in sparsefuncs?

X, y = check_X_y(X, y, ['csr', 'csc'])
if sp.issparse(X) and self.shrink_threshold:
X, y = check_X_y(X, y, ['csc'])
X_sparse = sp.issparse(X)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

X_sparse should be renamed to something less ambigious. We typically use X_something to denote the dataset (i.e. X_train). Perhaps is_X_sparse?

@MechCoder
Copy link
Member Author

@robertlayton Thanks for your reviews. Do you think such a utility (median) would be useful and should be placed in sparsefuncs, i.e where it is right now.?

@robertlayton
Copy link
Member

Short (by amount of work) answer: yes -- and probably enough for this PR

Long answer: I would suggest that we move the distance stuff out of the metrics folder and put functions like this in there, with a wrapper function that you can just call: median(X, metric) that does the right thing based on what X and metric are.

@MechCoder
Copy link
Member Author

Thanks, and also a general doubt. How does the "manhattan" metric which is the L1 distance between two points, translate into the median? I googled but I was unable to find any answer. (Similar question for the euclidean metric too.)

@robertlayton
Copy link
Member

You are right I think, metric doesn't matter. The case I was thinking of was with an even number of samples, choosing the "middle" to be the median, and that doesn't change from metric to metric that I'm aware of (although maybe with some metrics it does). My bad.

@MechCoder
Copy link
Member Author

I see, so do we keep the metric keyword as it is, and use it only for the predict and have a new keyword (say centroids ) that determines if the centroids should be computed using mean or median?

@jnothman
Copy link
Member

I have also added a utility for calculating the median of csc sparse matrices, since NumPy does not handle it and it is not a trivial one-liner

I've not looked at your code, but see also the imputer which has implemented this in pure python.

@MechCoder
Copy link
Member Author

@jnothman oops. btw this is also in pure python. Should we refactor the code from imputer?

@jnothman
Copy link
Member

It may not be so simple to refactor, but at least the two implementations should be close to each other, if not reusing each other's code.

@MechCoder
Copy link
Member Author

@jnothman I should have just asked if if there was already an implementation around. I can just warp around the _get_median function in imputer!

@MechCoder
Copy link
Member Author

@robertlayton @jnothman csc_row_median now calls the _get_median function.

Now the question is if we should have a separate keyword for the centroid being either the mean or median, or somehow answer the question that how using the manhattan metric, translates to the median being the centroid?

@@ -86,8 +87,9 @@ def fit(self, X, y):
y : array, shape = [n_samples]
Target values (integers)
"""
X, y = check_X_y(X, y, ['csr', 'csc'])
if sp.issparse(X) and self.shrink_threshold:
X, y = check_X_y(X, y, ['csc'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a regression?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All sparse matrices, would be better to get converted to csc, so that the median can be easily calculated across all rows.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if the metric is not manhattan, should we convert?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you saying this because the row slice would be slower case of slicing the rows to calculate the mean (for csc) ? In that case I would explicitly convert it to csc for manhattan because of the median and to csr for the other metrics since it is easier to do the row slicing to calculate the mean, if its ok with you.

@MechCoder
Copy link
Member Author

@arjoly I have addressed all your comments. This comment (#3772 (comment)) remains.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.01%) when pulling 3364721 on MechCoder:manhattan_metric into b27ee40 on scikit-learn:master.

else:
self.centroids_[cur_class] = csc_row_median(X[center_mask])
else:
self.centroids_[cur_class] = X[center_mask].mean(axis=0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it me or only euclidian and manhattan metrics are supported?

any good reason why no docstring is modified?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the other metrics are supported in the predict method. the question is that if the manhattan metric means that the average should be the median, or should we have a separate keyword for the centroid. Basically this comment (#3772 (comment))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand why the centroid is correctly computed for metrics other than euclidean and manhattan distance.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not being supported. This PR just adds the manhattan averaging. I have updated the docstring.

@agramfort
Copy link
Member

besides LGTM. +1 for merge.

@MechCoder
Copy link
Member Author

@arjoly Done. Let me know if you have anything else?

@arjoly
Copy link
Member

arjoly commented Oct 20, 2014

It is not being supported. This PR just adds the manhattan averaging. I have updated the docstring.

Could you raise an error if it's not supported?

@arjoly
Copy link
Member

arjoly commented Oct 20, 2014

Beside my last my comment, +1.

@MechCoder
Copy link
Member Author

Well, it was "supported" before this PR, by falling back on to the mean. I'm not sure it will be backward compatible. Maybe a warning, saying that the centroid is calculated using the mean?

@arjoly
Copy link
Member

arjoly commented Oct 20, 2014

+1 for the warning. However I think it would be safer to raise an error. Nevertheless, I am not a user of this code.

@MechCoder
Copy link
Member Author

Thanks for the reviews. I am keeping those open for some time, just to see the reviews of others.

@MechCoder MechCoder changed the title [MRG+1] ENH: Patches Nearest Centroid for metric=manhattan for sparse and dense data [MRG+2] ENH: Patches Nearest Centroid for metric=manhattan for sparse and dense data Oct 20, 2014
@coveralls
Copy link

Coverage Status

Coverage increased (+0.01%) when pulling 86fb1aa on MechCoder:manhattan_metric into b27ee40 on scikit-learn:master.

@MechCoder
Copy link
Member Author

@robertlayton @jnothman Any final comments?

@GaelVaroquaux
Copy link
Member

LGTM, aside from my comment on the fact that I have the impression that the conversion to csr is a be brutal in certain cases.

@MechCoder
Copy link
Member Author

the conversion to csr is

I think you meant csc. Anyhow, I've checked the conversion. Please merge if you are happy.

if self.metric == 'manhattan':
X, y = check_X_y(X, y, ['csc'])
else:
X, y = check_X_y(X, y, ['csr'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put back ['csr', 'csc'] allowed if not manhattan. So you don't change the old behavior

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -1,4 +1,4 @@
# -*- coding: utf-8 -*-
# -*- coding: utf-8 -*-
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hum?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure how this came.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. Please update what's new, rebase and I'll merge.

agramfort added a commit that referenced this pull request Oct 21, 2014
[MRG+2] ENH: Patches Nearest Centroid for metric=manhattan for sparse and dense data
@agramfort agramfort merged commit 30619ff into scikit-learn:master Oct 21, 2014
@agramfort
Copy link
Member

thanks @MechCoder !

@MechCoder MechCoder deleted the manhattan_metric branch October 21, 2014 08:44
@GaelVaroquaux
Copy link
Member

Good job!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

average utility in pairwise module
7 participants