
[MRG+1] Uncontroversial fixes from estimator tags branch #8086


Merged: 24 commits merged into scikit-learn:master on Jun 6, 2017

Conversation

@amueller amueller (Member) commented Dec 19, 2016

These are some of the simple fixes from #8022 which should be fairly uncontroversial and easy to review.
I hope that makes the review of #8022 easier and allows that PR to focus on the API.

- Fixes to the input validation in :class:`sklearn.covariance.EllipticEnvelope` by
`Andreas Müller`_.

- Fix the output shape of :class:`sklearn.decomposition.DictionaryLearning` transform
Member Author:

this still needs a test, I think...
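
A sketch of the kind of test this could use, assuming the fix makes transform keep a 2-D (n_samples, n_components) shape even when n_components == 1 (the commit list below does include "add test for n_components = 1 transform in dict learning"):

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

X = np.random.RandomState(0).randn(10, 5)
code = DictionaryLearning(n_components=1, random_state=0).fit(X).transform(X)
assert code.shape == (10, 1)  # 2-D even in the single-atom case
```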


- Gradient boosting base models are not longer estimators. By `Andreas Müller`_.

- :class:`feature_extraction.text.TfidfTransformer` now supports numpy
Member Author:

this needs tests, I guess

@amueller amueller changed the title Uncontroversial fixes from estimator tags branch [MRG] Uncontroversial fixes from estimator tags branch Dec 19, 2016
@@ -101,7 +102,7 @@ def predict(self, X):
return is_inlier


class EllipticEnvelope(ClassifierMixin, OutlierDetectionMixin, MinCovDet):
class EllipticEnvelope(OutlierDetectionMixin, MinCovDet):
Member Author:

This is not a classifier and should not inherit from ClassifierMixin.
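
For context, sklearn.base.is_classifier keys off the marker that ClassifierMixin installs (_estimator_type), so the inheritance made this outlier detector report itself as a classifier to meta-tools. A minimal check:

```python
from sklearn.base import is_classifier
from sklearn.covariance import EllipticEnvelope

# ClassifierMixin sets _estimator_type = "classifier"; dropping the
# mixin flips what is_classifier (and anything built on it) reports.
print(is_classifier(EllipticEnvelope()))  # False once the mixin is gone
```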

@@ -93,7 +93,7 @@ class GaussianRandomProjectionHash(ProjectionToHashMixin,
GaussianRandomProjection):
"""Use GaussianRandomProjection to produce a cosine LSH fingerprint"""
def __init__(self,
n_components=8,
n_components=32,
Member Author:

This class is never instantiated with default parameters, and the code doesn't run with n_components != 32

Member:

ha!

Member:

is the error captured if not 32? should it be a parameter if only one param value works?

Member Author:

I told @ogrisel to look into it ;)
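
A hedged illustration of the constraint, not the actual sklearn internals: the LSH fingerprint packs one sign bit per projected component into a fixed-width 32-bit word, which leaves no room for any other n_components.

```python
import numpy as np

def fingerprint32(projected):
    # pack the sign bit of each of 32 projected components into one
    # unsigned 32-bit word per sample (illustrative packing only)
    bits = (projected > 0).astype(np.uint64)
    if bits.shape[1] != 32:
        raise ValueError("fingerprint packing assumes exactly 32 components")
    weights = np.uint64(1) << np.arange(32, dtype=np.uint64)
    return (bits * weights).sum(axis=1).astype(np.uint32)

print(fingerprint32(np.random.RandomState(0).randn(3, 32)))
```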

else:
# convert counts or binary occurrences to floats
X = sp.csr_matrix(X, dtype=np.float64, copy=copy)
X = check_array(X, accept_sparse=["csr"], copy=copy,
Member Author:

I'm actually somewhat uncertain whether this is the right approach. In fit we always convert to sparse. Maybe we should be consistent? I'm not sure.

Member:

I think the principle is that for df to be meaningful, X needs to be fundamentally sparse (regardless of data structure). I can see why you consider this broken, but I'm tempted to say don't fix. You're creating a backwards-incompatibility.
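
The compatibility point in concrete form (a sketch of the behavior as described in this thread): transform has always converted dense count input to sparse, so downstream code may rely on getting a sparse matrix back.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfTransformer

counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 2]])
tfidf = TfidfTransformer().fit_transform(counts)
print(sp.issparse(tfidf))  # True under the pre-change behavior (kept, per the revert below)
```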

Member Author:

reverted


- :class:`feature_extraction.text.TfidfTransformer` now supports numpy
arrays as inputs, and produces numpy arrays for list inputs and numpy
array inputs. By `Andreas Müller_`.
Member:

swap _ and `

@jnothman jnothman (Member) left a comment:

otherwise LGTM, I think.

Thanks! But I can't say it's so easy to review a heterogeneous patch like this, and I wish you'd pulled the entirely cosmetic things out separately.

@@ -176,3 +177,29 @@ def fit(self, X, y=None):
self.threshold_ = sp.stats.scoreatpercentile(
self.dist_, 100. * (1. - self.contamination))
return self

def score(self, X, y, sample_weight=None):
Member:

Why is this not part of OutlierDetectionMixin?

Member Author:

moved it there.
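
The refactor in sketch form; names follow the diff above, but the body is illustrative rather than the exact scikit-learn code. With score on the shared mixin, EllipticEnvelope (and any other detector built on it) inherits the same accuracy-based score:

```python
from sklearn.metrics import accuracy_score

class OutlierDetectionMixin(object):
    def score(self, X, y, sample_weight=None):
        # mean accuracy of the +1 (inlier) / -1 (outlier) predictions
        return accuracy_score(y, self.predict(X),
                              sample_weight=sample_weight)
```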

sklearn/dummy.py Outdated
@@ -117,6 +117,9 @@ def fit(self, X, y, sample_weight=None):

self.sparse_output_ = sp.issparse(y)

X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
Member:

I think we need to be liberal wrt X: DummyClassifier will often substitute for a pipeline / grid search, where X may be a list of strings, a dataframe, a list of open files, or other mess. Don't demand too much of it. Here you ensure it is 2d and numeric and of a particular format, but why?
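
The concern is easy to demonstrate: DummyClassifier only looks at y (and at X's length), so strict validation would break legitimate stand-in uses like this sketch:

```python
from sklearn.dummy import DummyClassifier

# X as it might look at the head of a text pipeline: raw strings
X = ["first document", "second document", "third one"]
y = [0, 1, 0]

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
print(clf.predict(X))  # [0 0 0] -- the content of X never mattered
```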

Member Author:

removed.


- :class:`feature_selection.SelectFromModel` now validates the ``threshold``
parameter and sets the ``threshold_`` attribute during the call to
``fit``, and no longer during the call to ``transform``, by `Andreas
Member:

I think it's still being set in transform now, no?


@@ -480,13 +480,13 @@ def partial_fit(self, X, y, classes=None, sample_weight=None):
y : array-like, shape = [n_samples]
Target values.

classes : array-like, shape = [n_classes], optional (default=None)
classes : array-like, shape = [n_classes], (default=None)
Member:

no comma before parentheses?

List of all the classes that can possibly appear in the y vector.

Must be provided at the first call to partial_fit, can be omitted
in subsequent calls.

sample_weight : array-like, shape = [n_samples], optional (default=None)
sample_weight : array-like, shape = [n_samples], (default=None)
Member:

no comma before parentheses?

@codecov codecov bot commented Feb 25, 2017

Codecov Report

Merging #8086 into master will decrease coverage by 0.05%.
The diff coverage is 98.01%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8086      +/-   ##
==========================================
- Coverage   95.53%   95.48%   -0.06%     
==========================================
  Files         333      342       +9     
  Lines       61184    60958     -226     
==========================================
- Hits        58451    58204     -247     
- Misses       2733     2754      +21
Impacted Files Coverage Δ
sklearn/utils/multiclass.py 96.52% <ø> (+0.69%) ⬆️
sklearn/naive_bayes.py 100% <ø> (ø) ⬆️
sklearn/feature_selection/rfe.py 97.45% <ø> (ø) ⬆️
sklearn/neighbors/approximate.py 98.95% <ø> (ø) ⬆️
sklearn/multiclass.py 94.78% <100%> (ø)
sklearn/decomposition/truncated_svd.py 94.11% <100%> (-0.23%) ⬇️
sklearn/covariance/outlier_detection.py 97.22% <100%> (+0.25%) ⬆️
sklearn/decomposition/dict_learning.py 93.45% <100%> (ø) ⬆️
sklearn/tests/test_multiclass.py 100% <100%> (ø)
sklearn/ensemble/gradient_boosting.py 95.79% <100%> (-0.01%) ⬇️
... and 135 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0301e06...f4c9d60. Read the comment docs.

@@ -64,7 +64,7 @@
from ..exceptions import NotFittedError


class QuantileEstimator(BaseEstimator):
class QuantileEstimator(object):
Member:

Why this change? (there is probably a good reason, just asking)

Member Author:

These are not scikit-learn estimators, they don't fulfill the sklearn estimator API and the inheritance doesn't provide any functionality. So I think having them inherit is rather confusing.
The more direct reason is that I don't want these to be discovered by the common tests, as they are not sklearn estimators and will definitely fail the tests.

Member:

They are also not in a position for get_params/set_params to be used.
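
Roughly how that discovery works (a sketch of the mechanism, not the exact test harness): the common tests keep every class that subclasses BaseEstimator, so these internal helpers inherited their way into tests they can never pass. With them inheriting plain object, a sweep like this no longer picks them up.

```python
import inspect
from sklearn.base import BaseEstimator

def discover_estimators(module):
    # the gist of how the common tests find their targets: any class
    # in `module` that inherits from BaseEstimator is fair game
    return [cls for _, cls in inspect.getmembers(module, inspect.isclass)
            if issubclass(cls, BaseEstimator)]
```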

@GaelVaroquaux (Member):

Overall +1 for merge, but I added a small comment. It also seems that there are unanswered comments by @jnothman


- Gradient boosting base models are not longer estimators. By `Andreas Müller`_.

- :class:`feature_selection.SelectFromModel` now validates the ``threshold``
parameter and sets the ``threshold_`` attribute during the call to
Member Author:

Now actually doesn't set it any more during transform. This is a backward incompatible change, but could be considered a bug in the previous implementation? I tried to make it clearer what happens now.

@amueller amueller (Member Author) commented May 15, 2017

Ok so I reworked the SelectFromModel again. The point of self.threshold_ is to provide a threshold to the user; it's not used internally (before or after this PR). However, before the PR, if the user re-assigned threshold - a use-case that we explicitly support and test - then self.threshold_ was outdated until transform was called. Even calling fit would not update it!

The PR now makes threshold_ a property so it doesn't go stale. Computing it is quick so it's not a big deal. We can think about deprecating the property and making it a method, which it should have probably been in the first place.
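
A sketch of that pattern (Selector and _resolve_threshold are illustrative stand-ins, not the scikit-learn internals): as a property, threshold_ is derived from the current threshold parameter on every access, so re-assignment can never leave it stale.

```python
import numpy as np

def _resolve_threshold(threshold, importances):
    # stand-in: map a threshold spec ("mean", "median", or a number)
    # to a concrete value
    if threshold == "mean":
        return float(np.mean(importances))
    if threshold == "median":
        return float(np.median(importances))
    return float(threshold)

class Selector(object):
    def __init__(self, threshold="mean"):
        self.threshold = threshold

    def fit(self, importances):
        self.importances_ = np.asarray(importances)
        return self

    @property
    def threshold_(self):
        # recomputed on access: cheap, and never stale
        return _resolve_threshold(self.threshold, self.importances_)

s = Selector().fit([0.1, 0.5, 0.9])
s.threshold = 0.3   # the explicitly supported re-assignment use-case
print(s.threshold_)  # 0.3 immediately, with no transform/fit call
```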

@jnothman jnothman (Member) commented May 15, 2017 via email

@amueller amueller (Member Author) commented Jun 5, 2017

Is @ogrisel around? I think he did the GaussianRandomProjection. It's only used internally. I guess we should open an issue for that?

@vene vene (Member) commented Jun 5, 2017

issue #8029 also mentions that these classes are not the greatest. (btw this is GaussianRandomProjectionHash, sorry for the typo.)

I'll comment under #8029 with the weird behaviour we are finding.

X : array-like, shape = (n_samples, n_features)
Test samples.

y : array-like, shape = (n_samples) or (n_samples, n_outputs)
Member:

shape = (n_samples,) or (n_samples, n_outputs)

y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True labels for X.

sample_weight : array-like, shape = [n_samples], optional
Member:

array-like, shape = (n_samples,), optional

@@ -484,13 +484,13 @@ def partial_fit(self, X, y, classes=None, sample_weight=None):
y : array-like, shape = [n_samples]
Target values.

classes : array-like, shape = [n_classes], optional (default=None)
classes : array-like, shape = [n_classes] (default=None)
Member:

I see that we are not consistent here.

I think we should use tuples for shapes in doctrings so:

array-like, shape = (n_classes,) (default=None)

Member Author:

well I could change that but that would really increase the size of the pull request, even if I just do it for naive_bayes.py

``fit``, and no longer during the call to ``transform``, by `Andreas
Müller`_.

- :class:`features_selection.SelectFromModel` now has a ``partial_fit``
Member:

features -> feature

@agramfort (Member):

that's it for me.

:issue:`8086` by `Andreas Müller`_.

- Fix output shape and bugs with n_jobs > 1 in
:class:`sklearn.decomposition.SparseEncoder` transform and :func:`sklarn.decomposition.sparse_encode`
Member:

SparseEncoder -> SparseCoder

if dictionary.shape[1] != X.shape[1]:
raise ValueError("Dictionary and X have different numbers of features:"
"dictionary.shape: {} X.shape{}".format(
dictionary.shape, X.shape))
Member:

could use check_consistent_length(X.T, dictionary.T)

You could argue that's less clear though.

Member Author:

It'll say "different number of samples" in the error, right?
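
Quick check of that concern (a hedged demo of what the suggested shortcut reports):

```python
import numpy as np
from sklearn.utils import check_consistent_length

X = np.zeros((5, 4))
dictionary = np.zeros((3, 6))
try:
    check_consistent_length(X.T, dictionary.T)
except ValueError as e:
    # "Found input variables with inconsistent numbers of samples: [4, 6]"
    # -- it does complain about "samples", though the axis is n_features
    print(e)
```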

@@ -213,7 +213,8 @@ Bug fixes

- Fixed a bug where :class:`sklearn.linear_model.LassoLars` does not give
the same result as the LassoLars implementation available
in R (lars library). :issue:`7849` by :user:`Jair Montoya Martinez <jmontoyam>`
in R (lars library). :issue:`7849` by `Jair Montoya Martinez`_
Member:

There is no link for Jair Montoya Martinez.
Revert this one?

Member Author:

thanks, I guess that got lost in a merge.

Member:

use this URL: https://github.com/jmontoyam

Then revert it to

:user:`Jair Montoya Martinez <jmontoyam>`

being subtracted from the centroids. :issue:`7872` by `Josh Karnofsky <https://github.com/jkarno>`_.
- Fix a bug regarding fitting :class:`sklearn.cluster.KMeans` with a sparse
array X and initial centroids, where X's means were unnecessarily being
subtracted from the centroids. :issue:`7872` by `Josh Karnofsky
@TomDLT TomDLT (Member) commented Jun 6, 2017:

:user:`Josh Karnofsky <jkarno>`

@agramfort (Member):

good to go on my end

@agramfort agramfort changed the title [MRG] Uncontroversial fixes from estimator tags branch [MRG+1] Uncontroversial fixes from estimator tags branch Jun 6, 2017
@@ -336,6 +351,20 @@ API changes summary
:func:`sklearn.model_selection.cross_val_predict`.
:issue:`2879` by :user:`Stephen Hoover <stephen-hoover>`.


- Gradient boosting base models are not longer estimators. By `Andreas Müller`_.
Member:

not longer -> no longer

@vene vene (Member) commented Jun 6, 2017

Comment moved to new issue: #8998, this is not relevant to this PR.

@vene vene (Member) commented Jun 6, 2017

(Since we plan to deprecate the GaussianRandomProjectionHash class in this cycle, maybe we should simply skip it from the common tests rather than change its default n_components. I really doubt that class is seeing any external use, and internally it is only ever called with a fixed n_components.) scratch everything up to here. The code simply doesn't work with the old default so there can be no breakage.

Other than the very minor check_consistent_length question in dict learning this PR looks good to me.

@amueller amueller (Member Author) commented Jun 6, 2017

@GaelVaroquaux do you wanna have a final look or merge?

@amueller amueller (Member Author) commented Jun 6, 2017

@vene merge?

@vene vene merged commit 1c41368 into scikit-learn:master Jun 6, 2017
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
…n#8086)

* some bug fixes.

* minor fixes to whatsnew

* typo in whatsnew

* add test for n_components = 1 transform in dict learning

* feature extraction doc fix

* fix broken test

* revert aggressive input validation changes

* in SelectFromModel, don't store threshold_ in transform. If we called "fit", use estimates from last "fit".

* move score from EllipticEnvelope to OutlierDetectionMixin

* revert changes to Tfidf documentation

* remove dummy input validation from whatsnew

* fix text feature tests

* rewrite from_model threshold again...

* remove stray condition

* fix self.estimator -> estimator, slightly more interesting test

* typo in comment

* Fix issues in SparseEncoder, add tests.
more explicit explanation of SparseEncoder change, add issue numbers to whatsnew

* minor fixes in whats_new.rst

* slightly more consistency with tuples for shapes

* not longer typo
dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
…n#8086)

* some bug fixes.

* minor fixes to whatsnew

* typo in whatsnew

* add test for n_components = 1 transform in dict learning

* feature extraction doc fix

* fix broken test

* revert aggressive input validation changes

* in SelectFromModel, don't store threshold_ in transform. If we called "fit", use estimates from last "fit".

* move score from EllipticEnvelope to OutlierDetectionMixin

* revert changes to Tfidf documentation

* remove dummy input validation from whatsnew

* fix text feature tests

* rewrite from_model threshold again...

* remove stray condition

* fix self.estimator -> estimator, slightly more interesting test

* typo in comment

* Fix issues in SparseEncoder, add tests.
more explicit explanation of SparseEncoder change, add issue numbers to whatsnew

* minor fixes in whats_new.rst

* slightly more consistency with tuples for shapes

* not longer typo
dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
…n#8086)

* some bug fixes.

* minor fixes to whatsnew

* typo in whatsnew

* add test for n_components = 1 transform in dict learning

* feature extraction doc fix

* fix broken test

* revert aggressive input validation changes

* in SelectFromModel, don't store threshold_ in transform. If we called "fit", use estimates from last "fit".

* move score from EllipticEnvelope to OutlierDetectionMixin

* revert changes to Tfidf documentation

* remove dummy input validation from whatsnew

* fix text feature tests

* rewrite from_model threshold again...

* remove stray condition

* fix self.estimator -> estimator, slightly more interesting test

* typo in comment

* Fix issues in SparseEncoder, add tests.
more explicit explanation of SparseEncoder change, add issue numbers to whatsnew

* minor fixes in whats_new.rst

* slightly more consistency with tuples for shapes

* not longer typo
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
…n#8086)

* some bug fixes.

* minor fixes to whatsnew

* typo in whatsnew

* add test for n_components = 1 transform in dict learning

* feature extraction doc fix

* fix broken test

* revert aggressive input validation changes

* in SelectFromModel, don't store threshold_ in transform. If we called "fit", use estimates from last "fit".

* move score from EllipticEnvelope to OutlierDetectionMixin

* revert changes to Tfidf documentation

* remove dummy input validation from whatsnew

* fix text feature tests

* rewrite from_model threshold again...

* remove stray condition

* fix self.estimator -> estimator, slightly more interesting test

* typo in comment

* Fix issues in SparseEncoder, add tests.
more explicit explanation of SparseEncoder change, add issue numbers to whatsnew

* minor fixes in whats_new.rst

* slightly more consistency with tuples for shapes

* not longer typo
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
…n#8086)

* some bug fixes.

* minor fixes to whatsnew

* typo in whatsnew

* add test for n_components = 1 transform in dict learning

* feature extraction doc fix

* fix broken test

* revert aggressive input validation changes

* in SelectFromModel, don't store threshold_ in transform. If we called "fit", use estimates from last "fit".

* move score from EllipticEnvelope to OutlierDetectionMixin

* revert changes to Tfidf documentation

* remove dummy input validation from whatsnew

* fix text feature tests

* rewrite from_model threshold again...

* remove stray condition

* fix self.estimator -> estimator, slightly more interesting test

* typo in comment

* Fix issues in SparseEncoder, add tests.
more explicit explanation of SparseEncoder change, add issue numbers to whatsnew

* minor fixes in whats_new.rst

* slightly more consistency with tuples for shapes

* not longer typo
AishwaryaRK pushed a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017
…n#8086)

* some bug fixes.

* minor fixes to whatsnew

* typo in whatsnew

* add test for n_components = 1 transform in dict learning

* feature extraction doc fix

* fix broken test

* revert aggressive input validation changes

* in SelectFromModel, don't store threshold_ in transform. If we called "fit", use estimates from last "fit".

* move score from EllipticEnvelope to OutlierDetectionMixin

* revert changes to Tfidf documentation

* remove dummy input validation from whatsnew

* fix text feature tests

* rewrite from_model threshold again...

* remove stray condition

* fix self.estimator -> estimator, slightly more interesting test

* typo in comment

* Fix issues in SparseEncoder, add tests.
more explicit explanation of SparseEncoder change, add issue numbers to whatsnew

* minor fixes in whats_new.rst

* slightly more consistency with tuples for shapes

* not longer typo
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
…n#8086)

* some bug fixes.

* minor fixes to whatsnew

* typo in whatsnew

* add test for n_components = 1 transform in dict learning

* feature extraction doc fix

* fix broken test

* revert aggressive input validation changes

* in SelectFromModel, don't store threshold_ in transform. If we called "fit", use estimates from last "fit".

* move score from EllipticEnvelope to OutlierDetectionMixin

* revert changes to Tfidf documentation

* remove dummy input validation from whatsnew

* fix text feature tests

* rewrite from_model threshold again...

* remove stray condition

* fix self.estimator -> estimator, slightly more interesting test

* typo in comment

* Fix issues in SparseEncoder, add tests.
more explicit explanation of SparseEncoder change, add issue numbers to whatsnew

* minor fixes in whats_new.rst

* slightly more consistency with tuples for shapes

* not longer typo
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
…n#8086)

* some bug fixes.

* minor fixes to whatsnew

* typo in whatsnew

* add test for n_components = 1 transform in dict learning

* feature extraction doc fix

* fix broken test

* revert aggressive input validation changes

* in SelectFromModel, don't store threshold_ in transform. If we called "fit", use estimates from last "fit".

* move score from EllipticEnvelope to OutlierDetectionMixin

* revert changes to Tfidf documentation

* remove dummy input validation from whatsnew

* fix text feature tests

* rewrite from_model threshold again...

* remove stray condition

* fix self.estimator -> estimator, slightly more interesting test

* typo in comment

* Fix issues in SparseEncoder, add tests.
more explicit explanation of SparseEncoder change, add issue numbers to whatsnew

* minor fixes in whats_new.rst

* slightly more consistency with tuples for shapes

* not longer typo