[MRG+2] ENH: Patches Nearest Centroid for metric=manhattan for sparse and dense data #3772

MechCoder · 2014-10-14T14:37:30Z

Fixes #743

I have also added a utility for calculating the median of csc sparse matrices, since NumPy does not handle it and it is not a trivial one-liner.
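For readers unfamiliar with the problem, here is a minimal sketch of a column-wise median over CSC storage. It is not the code in this PR: the names `csc_column_medians` and `_median_with_implicit_zeros` are hypothetical, and the matrix is passed as its raw `data` and `indptr` sequences (the attributes of the same names on a `scipy.sparse.csc_matrix`) so that the example runs with the standard library only. It assumes the matrix has no duplicate entries. The point is that the implicit zeros have to be counted without being materialized.

```python
def _median_with_implicit_zeros(stored, n_zeros):
    """Median of sorted `stored` values plus `n_zeros` implicit zeros."""
    n = len(stored) + n_zeros
    if n == 0:
        return float("nan")
    n_negative = sum(1 for v in stored if v < 0)

    def kth(k):
        # k-th smallest value of the full column, zeros included.
        if k < n_negative:
            return stored[k]
        if k < n_negative + n_zeros:
            return 0.0
        return stored[k - n_zeros]

    if n % 2:
        return float(kth(n // 2))
    return (kth(n // 2 - 1) + kth(n // 2)) / 2.0


def csc_column_medians(data, indptr, n_rows):
    """Median of every column of a CSC matrix given its raw arrays."""
    medians = []
    for j in range(len(indptr) - 1):
        # Values stored for column j; everything else in the column is zero.
        stored = sorted(data[indptr[j]:indptr[j + 1]])
        medians.append(_median_with_implicit_zeros(stored, n_rows - len(stored)))
    return medians


# 4x3 matrix [[1, 0, 0], [0, 0, 3], [4, 0, 5], [0, -2, 6]] in CSC form.
data = [1, 4, -2, 3, 5, 6]
indptr = [0, 2, 3, 6]
print(csc_column_medians(data, indptr, n_rows=4))  # [0.5, 0.0, 4.0]
```

Only the stored values of each column are sorted, so the cost depends on the number of non-zeros rather than on the full matrix size.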

MechCoder · 2014-10-14T14:38:38Z

ping @mblondel @robertlayton @larsmans ?

MechCoder · 2014-10-14T16:36:00Z

@jnothman Do you approve of this utility in sparsefuncs?

robertlayton · 2014-10-14T19:28:25Z

sklearn/neighbors/nearest_centroid.py

-        X, y = check_X_y(X, y, ['csr', 'csc'])
-        if sp.issparse(X) and self.shrink_threshold:
+        X, y = check_X_y(X, y, ['csc'])
+        X_sparse = sp.issparse(X)


X_sparse should be renamed to something less ambiguous. We typically use X_something to denote a dataset (e.g. X_train). Perhaps is_X_sparse?

MechCoder · 2014-10-14T19:46:09Z

@robertlayton Thanks for your reviews. Do you think such a utility (median) would be useful, and should it stay in sparsefuncs, i.e. where it is right now?

robertlayton · 2014-10-14T19:48:59Z

Short (by amount of work) answer: yes -- and probably enough for this PR

Long answer: I would suggest that we move the distance stuff out of the metrics folder and put functions like this in there, with a wrapper function that you can just call: median(X, metric) that does the right thing based on what X and metric are.

MechCoder · 2014-10-14T19:52:10Z

Thanks, and also a general doubt. How does the "manhattan" metric which is the L1 distance between two points, translate into the median? I googled but I was unable to find any answer. (Similar question for the euclidean metric too.)
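On the question above: a standard result is that, in one dimension, the median minimizes the sum of absolute deviations, while the mean minimizes the sum of squared deviations. The Manhattan distance is a sum of per-feature absolute differences, so the point minimizing the total Manhattan distance to the samples of a class is the feature-wise median. A small numeric check follows, with made-up data and hypothetical helper names (`l1_cost`, `l2_cost`); it is an illustration, not a proof.

```python
from statistics import mean, median


def l1_cost(points, c):
    """Total absolute (Manhattan, 1-D) distance from c to the points."""
    return sum(abs(p - c) for p in points)


def l2_cost(points, c):
    """Total squared Euclidean (1-D) distance from c to the points."""
    return sum((p - c) ** 2 for p in points)


points = [1.0, 2.0, 3.0, 4.0, 100.0]
med = median(points)  # 3.0
avg = mean(points)    # 22.0

# Brute-force search over a grid of candidate centers from 0.0 to 100.0.
candidates = [i / 10 for i in range(0, 1001)]
best_l1 = min(candidates, key=lambda c: l1_cost(points, c))
best_l2 = min(candidates, key=lambda c: l2_cost(points, c))

print(best_l1 == med, best_l2 == avg)  # True True
```

The outlier at 100.0 drags the mean to 22.0 but leaves the median at 3.0, which is why the two choices of centroid can differ noticeably.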

robertlayton · 2014-10-14T20:00:35Z

You are right I think, metric doesn't matter. The case I was thinking of was with an even number of samples, choosing the "middle" to be the median, and that doesn't change from metric to metric that I'm aware of (although maybe with some metrics it does). My bad.

MechCoder · 2014-10-14T20:10:54Z

I see, so do we keep the metric keyword as it is, and use it only for the predict and have a new keyword (say centroids ) that determines if the centroids should be computed using mean or median?

jnothman · 2014-10-14T21:00:45Z

> I have also added a utility for calculating the median of csc sparse matrices, since NumPy does not handle it and it is not a trivial one-liner

I've not looked at your code, but see also the imputer which has implemented this in pure python.

MechCoder · 2014-10-14T21:06:19Z

@jnothman oops. btw this is also in pure python. Should we refactor the code from imputer?

jnothman · 2014-10-15T00:36:10Z

It may not be so simple to refactor, but at least the two implementations should be close to each other, if not reusing each other's code.

MechCoder · 2014-10-15T09:18:25Z

@jnothman I should have just asked if there was already an implementation around. I can just wrap around the _get_median function in imputer!

MechCoder · 2014-10-15T11:02:51Z

@robertlayton @jnothman csc_row_median now calls the _get_median function.

Now the question is whether we should have a separate keyword for choosing the mean or the median as the centroid, or whether we can explain why using the manhattan metric means the centroid should be the median.

arjoly · 2014-10-17T11:24:28Z

sklearn/neighbors/nearest_centroid.py

@@ -86,8 +87,9 @@ def fit(self, X, y):
        y : array, shape = [n_samples]
            Target values (integers)
        """
-        X, y = check_X_y(X, y, ['csr', 'csc'])
-        if sp.issparse(X) and self.shrink_threshold:
+        X, y = check_X_y(X, y, ['csc'])


Is this a regression?

It would be better to convert all sparse matrices to csc, so that the median can be easily calculated across all rows.

But if the metric is not manhattan, should we convert?

Are you saying this because slicing the rows to calculate the mean would be slower for csc? In that case I would explicitly convert to csc for manhattan (because of the median) and to csr for the other metrics (since row slicing to calculate the mean is easier there), if it's OK with you.

MechCoder · 2014-10-18T09:44:16Z

@arjoly I have addressed all your comments. This comment (#3772 (comment)) remains.

coveralls · 2014-10-18T09:55:17Z

Coverage increased (+0.01%) when pulling 3364721 on MechCoder:manhattan_metric into b27ee40 on scikit-learn:master.

agramfort · 2014-10-19T06:59:40Z

sklearn/neighbors/nearest_centroid.py

+                else:
+                    self.centroids_[cur_class] = csc_row_median(X[center_mask])
+            else:
+                self.centroids_[cur_class] = X[center_mask].mean(axis=0)


Is it me, or are only the euclidean and manhattan metrics supported?

any good reason why no docstring is modified?

The other metrics are supported in the predict method. The question is whether the manhattan metric means that the average should be the median, or whether we should have a separate keyword for the centroid. Basically this comment (#3772 (comment))

I don't really understand why the centroid is correctly computed for metrics other than euclidean and manhattan distance.

It is not being supported. This PR just adds the manhattan averaging. I have updated the docstring.
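To make the behavior discussed in this thread concrete, here is a rough sketch of the per-class centroid step for dense data. It is an illustration, not the code in `nearest_centroid.py`: `class_centroids` is a hypothetical helper, the data is a plain list of rows, and any metric other than "manhattan" falls back to the mean, as described above.

```python
from statistics import mean, median


def class_centroids(X, y, metric="euclidean"):
    """Return {label: centroid} using the median for manhattan, else the mean."""
    average = median if metric == "manhattan" else mean
    centroids = {}
    for label in sorted(set(y)):
        rows = [row for row, target in zip(X, y) if target == label]
        # zip(*rows) iterates over the features (columns) of this class.
        centroids[label] = [float(average(col)) for col in zip(*rows)]
    return centroids


X = [[0.0, 0.0], [1.0, 1.0], [8.0, 2.0], [5.0, 5.0], [7.0, 9.0]]
y = [0, 0, 0, 1, 1]

print(class_centroids(X, y, metric="manhattan"))  # {0: [1.0, 1.0], 1: [6.0, 7.0]}
print(class_centroids(X, y, metric="euclidean"))  # {0: [3.0, 1.0], 1: [6.0, 7.0]}
```

Class 0 shows the difference: the sample `[8.0, 2.0]` pulls the mean of the first feature to 3.0, while the median stays at 1.0.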

agramfort · 2014-10-19T16:07:37Z

besides LGTM. +1 for merge.

MechCoder · 2014-10-20T10:52:08Z

@arjoly Done. Let me know if you have anything else?

arjoly · 2014-10-20T10:54:33Z

> It is not being supported. This PR just adds the manhattan averaging. I have updated the docstring.

Could you raise an error if it's not supported?

arjoly · 2014-10-20T10:55:43Z

Besides my last comment, +1.

MechCoder · 2014-10-20T10:57:41Z

Well, it was "supported" before this PR, by falling back on to the mean. I'm not sure it will be backward compatible. Maybe a warning, saying that the centroid is calculated using the mean?

arjoly · 2014-10-20T11:03:05Z

+1 for the warning. However I think it would be safer to raise an error. Nevertheless, I am not a user of this code.
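One way the warning option could look, as a sketch only. `check_centroid_metric` is a hypothetical helper and the message text is made up; the wording actually used in the PR is not shown in this thread.

```python
import warnings


def check_centroid_metric(metric):
    """Return which average to use, warning when falling back to the mean."""
    if metric == "manhattan":
        return "median"
    if metric != "euclidean":
        # Backward compatible: keep computing the mean, but tell the user.
        warnings.warn(
            "Averaging for metrics other than 'euclidean' and 'manhattan' "
            "is not supported; the centroid is computed using the mean "
            "(got metric=%r)." % metric,
            UserWarning,
        )
    return "mean"


print(check_centroid_metric("manhattan"))  # median
print(check_centroid_metric("euclidean"))  # mean
```

Raising a `ValueError` in the same branch would be the stricter alternative suggested above, at the cost of breaking code that currently relies on the fallback.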

MechCoder · 2014-10-20T11:11:29Z

Thanks for the reviews. I am keeping those open for some time, just to see the reviews of others.

coveralls · 2014-10-20T11:20:25Z

Coverage increased (+0.01%) when pulling 86fb1aa on MechCoder:manhattan_metric into b27ee40 on scikit-learn:master.

MechCoder · 2014-10-20T16:30:10Z

@robertlayton @jnothman Any final comments?

GaelVaroquaux · 2014-10-20T21:28:56Z

LGTM, aside from my comment that I have the impression that the conversion to csr is a bit brutal in certain cases.

MechCoder · 2014-10-20T22:38:54Z

> the conversion to csr is

I think you meant csc. Anyhow, I've checked the conversion. Please merge if you are happy.

agramfort · 2014-10-21T06:58:46Z

sklearn/neighbors/nearest_centroid.py

+        if self.metric == 'manhattan':
+            X, y = check_X_y(X, y, ['csc'])
+        else:
+            X, y = check_X_y(X, y, ['csr'])


Put back ['csr', 'csc'] as the allowed formats when the metric is not manhattan, so you don't change the old behavior.
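A sketch of the format choice being requested here, assuming the validation helper takes a list of allowed sparse formats as in the diff above. `accepted_sparse_formats` is a hypothetical name, not part of scikit-learn.

```python
def accepted_sparse_formats(metric):
    """Sparse formats to accept in fit, depending on the metric."""
    if metric == "manhattan":
        # The median is taken per feature (per column), which CSC stores contiguously.
        return ["csc"]
    # Every other metric keeps the behavior from before this PR.
    return ["csr", "csc"]


print(accepted_sparse_formats("manhattan"))  # ['csc']
print(accepted_sparse_formats("euclidean"))  # ['csr', 'csc']
```

In `fit` this would be used as `check_X_y(X, y, accepted_sparse_formats(self.metric))`, so that input already in csr or csc is not converted needlessly for non-manhattan metrics.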

agramfort · 2014-10-21T07:12:30Z

sklearn/neighbors/nearest_centroid.py

@@ -1,4 +1,4 @@
-# -*- coding: utf-8 -*-
+    # -*- coding: utf-8 -*-


Not sure how this got in.

ok. Please update what's new, rebase and I'll merge.


agramfort · 2014-10-21T07:50:22Z

thanks @MechCoder !

GaelVaroquaux · 2014-10-21T11:10:34Z

Good job!

MechCoder changed the title ~~ENH: Patches Nearest Centroid for metric=manhattan for sparse and dense data~~ [MRG] ENH: Patches Nearest Centroid for metric=manhattan for sparse and dense data Oct 14, 2014

robertlayton reviewed Oct 14, 2014

MechCoder force-pushed the manhattan_metric branch from 53565c3 to 9a804f4 Compare October 15, 2014 10:54

MechCoder mentioned this pull request Oct 16, 2014

[MRG+1] Add sample_weight support to Dummy Regressor #3779

Merged

arjoly reviewed Oct 17, 2014

MechCoder force-pushed the manhattan_metric branch from 8fa7d09 to 3364721 Compare October 18, 2014 09:46

agramfort reviewed Oct 19, 2014

MechCoder changed the title ~~[MRG+1] ENH: Patches Nearest Centroid for metric=manhattan for sparse and dense data~~ [MRG+2] ENH: Patches Nearest Centroid for metric=manhattan for sparse and dense data Oct 20, 2014

agramfort reviewed Oct 21, 2014

MechCoder force-pushed the manhattan_metric branch from e84b2e0 to 2cfe952 Compare October 21, 2014 07:07

agramfort reviewed Oct 21, 2014

MechCoder force-pushed the manhattan_metric branch from 2cfe952 to 49f225a Compare October 21, 2014 07:14

MechCoder added 8 commits October 21, 2014 09:21

ENH: Patches Nearest Centroid for metric=manhattan for sparse and den…

6f41bd1

…se data

FIX: Wrap csc_row_median around the _get_median imputer function

91450c6

MAINT: Move _get_median into sparsefuncs to avoid circular imports

9622e85

DOC: Explain why the centroid of the manhattan metric is the median

3e7b816

Renamed csc_row_median to csc_median_axis_0

2daa028

Warning for non-euclidean and non-manhattan metrics

bda6706

Sparse matrix conversion depending on the type of metric

8fc9918

If metric is "manhattan", then store it as csc since calculating the median is easier.

Update what's new.rst

e871972

MechCoder force-pushed the manhattan_metric branch from 49f225a to e871972 Compare October 21, 2014 07:26

agramfort added a commit that referenced this pull request Oct 21, 2014

Merge pull request #3772 from MechCoder/manhattan_metric

30619ff

[MRG+2] ENH: Patches Nearest Centroid for metric=manhattan for sparse and dense data

agramfort merged commit 30619ff into scikit-learn:master Oct 21, 2014

MechCoder deleted the manhattan_metric branch October 21, 2014 08:44

ArturoAmorQ mentioned this pull request Aug 24, 2022

API Deprecate metrics other than euclidean and manhattan for NearestCentroid #24083

Merged

