[MRG+1] Uncontroversial fixes from estimator tags branch #8086
Conversation
Force-pushed from e1ee846 to f6c1758
doc/whats_new.rst (outdated)
- Fixes to the input validation in :class:`sklearn.covariance.EllipticEnvelope` by
  `Andreas Müller`_.

- Fix shape output shape of :class:`sklearn.decomposition.DictionaryLearning` transform
this still needs a test, I think...
doc/whats_new.rst (outdated)
- Gradient boosting base models are not longer estimators. By `Andreas Müller`_.

- :class:`feature_extraction.text.TfidfTransformer` now supports numpy
this needs tests, I guess
@@ -101,7 +102,7 @@ def predict(self, X):
         return is_inlier


-class EllipticEnvelope(ClassifierMixin, OutlierDetectionMixin, MinCovDet):
+class EllipticEnvelope(OutlierDetectionMixin, MinCovDet):
This is not a classifier and should not inherit from ClassifierMixin.
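To illustrate the point, a hypothetical sketch (not code from this PR): `ClassifierMixin` contributes a `score` method that computes accuracy against `predict`, but an outlier detector's `predict` returns +1/-1 inlier flags, so the inherited score would silently compare those flags against class labels.

```python
from sklearn.base import ClassifierMixin

class NotReallyAClassifier(ClassifierMixin):
    # Hypothetical outlier-detector-like class: predict returns
    # +1/-1 inlier flags, not class labels.
    def predict(self, X):
        return [1, -1, 1]

est = NotReallyAClassifier()
# The inherited accuracy score compares class labels to the ±1 flags,
# which is meaningless for outlier detection.
print(est.score([[0], [1], [2]], [0, 1, 0]))  # 0.0
```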
@@ -93,7 +93,7 @@ class GaussianRandomProjectionHash(ProjectionToHashMixin,
                                    GaussianRandomProjection):
     """Use GaussianRandomProjection to produce a cosine LSH fingerprint"""
     def __init__(self,
-                 n_components=8,
+                 n_components=32,
This class is never instantiated with default parameters and the code doesn't run with n_components != 32
ha!
is the error captured if not 32? should it be a parameter if only one param value works?
I told @ogrisel to look into it ;)
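For intuition, one plausible reason only 32 works — an assumption about the design, not verified against the sklearn source: if the binary projection is packed into a single 32-bit integer fingerprint, any other n_components breaks the packing.

```python
import numpy as np

# Assumed illustration: a cosine-LSH fingerprint packs exactly 32
# projection signs into one uint32; 8 components would not fill it.
rng = np.random.RandomState(0)
bits = (rng.randn(32) > 0).astype(np.uint8)
fingerprint = np.packbits(bits).view(np.uint32)[0]
print(hex(fingerprint))
```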
sklearn/feature_extraction/text.py (outdated)
     else:
         # convert counts or binary occurrences to floats
-        X = sp.csr_matrix(X, dtype=np.float64, copy=copy)
+        X = check_array(X, accept_sparse=["csr"], copy=copy,
I'm actually somewhat uncertain whether this is the right approach. In `fit` we always convert to sparse. Maybe we should be consistent? I'm not sure.
I think the principle is that for `df` to be meaningful, `X` needs to be fundamentally sparse (regardless of data structure). I can see why you consider this broken, but I'm tempted to say don't fix. You're creating a backwards-incompatibility.
reverted
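For context, document frequency is naturally a per-column nonzero count on a sparse matrix; a minimal sketch of that idea (illustrative, not the exact sklearn helper):

```python
import numpy as np
import scipy.sparse as sp

# df(t) = number of documents containing term t, i.e. the count of
# nonzero entries per column -- a structurally sparse notion.
X = sp.csr_matrix([[3, 0], [1, 2], [0, 4]])
df = np.bincount(X.indices, minlength=X.shape[1])
print(df)  # [2 3]
```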
doc/whats_new.rst (outdated)
- :class:`feature_extraction.text.TfidfTransformer` now supports numpy
  arrays as inputs, and produces numpy arrays for list inputs and numpy
  array inputs. By `Andreas Müller_`.
swap _ and `
otherwise LGTM, I think.
Thanks! But I can't say it's so easy to review a heterogeneous patch like this, and I wish you'd pulled the entirely cosmetic things out separately.
@@ -176,3 +177,29 @@ def fit(self, X, y=None):
         self.threshold_ = sp.stats.scoreatpercentile(
             self.dist_, 100. * (1. - self.contamination))
         return self
+
+    def score(self, X, y, sample_weight=None):
Why is this not part of `OutlierDetectionMixin`?
moved it there.
sklearn/dummy.py (outdated)
@@ -117,6 +117,9 @@ def fit(self, X, y, sample_weight=None):

         self.sparse_output_ = sp.issparse(y)
+
+        X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
I think we need to be liberal wrt `X`: `DummyClassifier` will often substitute for a pipeline / grid search, where `X` may be a list of strings, a dataframe, a list of open files, or other mess. Don't demand too much of it. Here you ensure it is 2d and numeric and of a particular format, but why?
removed.
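A short usage sketch of the point above (assuming `X` is left unvalidated, as the review requests): `DummyClassifier` never looks at the values in `X`, so even a plain list of strings should pass through.

```python
from sklearn.dummy import DummyClassifier

# X is arbitrary "mess" -- DummyClassifier only needs its length.
X = ["first document", "second document", "third document"]
y = [0, 1, 0]
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
print(clf.predict(X))  # [0 0 0]
```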
doc/whats_new.rst (outdated)
- :class:`feature_selection.SelectFromModel` now validates the ``threshold``
  parameter and sets the ``threshold_`` attribute during the call to
  ``fit``, and no longer during the call to ``transform```, by `Andreas
I think it's still being set in `transform` now, no?
sklearn/naive_bayes.py (outdated)
@@ -480,13 +480,13 @@ def partial_fit(self, X, y, classes=None, sample_weight=None):
         y : array-like, shape = [n_samples]
             Target values.

-        classes : array-like, shape = [n_classes], optional (default=None)
+        classes : array-like, shape = [n_classes], (default=None)
no comma before parentheses?
sklearn/naive_bayes.py (outdated)
             List of all the classes that can possibly appear in the y vector.

             Must be provided at the first call to partial_fit, can be omitted
             in subsequent calls.

-        sample_weight : array-like, shape = [n_samples], optional (default=None)
+        sample_weight : array-like, shape = [n_samples], (default=None)
no comma before parentheses?
Force-pushed from 00b8447 to 91559ce
Codecov Report
@@ Coverage Diff @@
## master #8086 +/- ##
==========================================
- Coverage 95.53% 95.48% -0.06%
==========================================
Files 333 342 +9
Lines 61184 60958 -226
==========================================
- Hits 58451 58204 -247
- Misses 2733 2754 +21
Continue to review full report at Codecov.
@@ -64,7 +64,7 @@
 from ..exceptions import NotFittedError


-class QuantileEstimator(BaseEstimator):
+class QuantileEstimator(object):
Why this change? (there is probably a good reason, just asking)
These are not scikit-learn estimators, they don't fulfill the sklearn estimator API and the inheritance doesn't provide any functionality. So I think having them inherit is rather confusing.
The more direct reason is that I don't want these to be discovered by the common tests, as they are not sklearn estimators and will definitely fail the tests.
They are also not in a position for get_params/set_params to be used.
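A minimal sketch of what these internal base models look like after the change — illustrative shape only, not the actual QuantileEstimator source: a plain object with fit/predict and none of the BaseEstimator machinery.

```python
import numpy as np

class MedianBaseline(object):
    # Plain object, deliberately not a sklearn estimator: no
    # get_params/set_params, so the common tests won't discover it.
    def fit(self, X, y):
        self.constant_ = np.median(y)
        return self

    def predict(self, X):
        return np.full(len(X), self.constant_)

baseline = MedianBaseline().fit(None, np.array([1.0, 2.0, 10.0]))
print(baseline.predict(np.zeros((4, 1))))  # [2. 2. 2. 2.]
```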
Overall +1 for merge, but I added a small comment. It also seems that there are unanswered comments by @jnothman
… "fit", use estimates from last "fit".
doc/whats_new.rst (outdated)
- Gradient boosting base models are not longer estimators. By `Andreas Müller`_.

- :class:`feature_selection.SelectFromModel` now validates the ``threshold``
  parameter and sets the ``threshold_`` attribute during the call to
Now actually doesn't set it any more during transform. This is a backward incompatible change, but could be considered a bug in the previous implementation? I tried to make it clearer what happens now.
I've not looked, but sounds good
On 16 May 2017 8:37 am, "Andreas Mueller" wrote:

Ok so I reworked the SelectFromModel again. The point of self.threshold_ is to provide a threshold to the user; it's not really used internally (before or after this PR). However, before the PR, if the user re-assigned threshold - a use-case that we explicitly support and test - then self.threshold_ is outdated until transform is called. Even calling fit would not update it!

The PR now makes threshold_ a property so it doesn't go stale. Computing it is quick so it's not a big deal. We can think about deprecating the property and making it a method.
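A minimal sketch of the pattern described above (names simplified; not the actual SelectFromModel code): making threshold_ a property means it is recomputed on access, so re-assigning threshold can never leave a stale value behind.

```python
import numpy as np

class SelectorSketch(object):
    # Illustrative: threshold_ as a property never goes stale, unlike
    # an attribute cached during fit or transform.
    def __init__(self, threshold="mean"):
        self.threshold = threshold

    def fit(self, importances):
        self.importances_ = np.asarray(importances)
        return self

    @property
    def threshold_(self):
        if self.threshold == "mean":
            return float(self.importances_.mean())
        return float(self.threshold)

sel = SelectorSketch().fit([0.1, 0.2, 0.6])
print(sel.threshold_)  # ~0.3 (mean of the importances)
sel.threshold = 0.5    # re-assign: the property reflects it immediately
print(sel.threshold_)  # 0.5
```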
Is @ogrisel around? I think he did the GaussianRandomProjection. It's only used internally. I guess we should open an issue for that?
        X : array-like, shape = (n_samples, n_features)
            Test samples.

        y : array-like, shape = (n_samples) or (n_samples, n_outputs)
shape = (n_samples,) or (n_samples, n_outputs)
        y : array-like, shape = (n_samples) or (n_samples, n_outputs)
            True labels for X.

        sample_weight : array-like, shape = [n_samples], optional
array-like, shape = (n_samples,), optional
@@ -484,13 +484,13 @@ def partial_fit(self, X, y, classes=None, sample_weight=None):
         y : array-like, shape = [n_samples]
             Target values.

-        classes : array-like, shape = [n_classes], optional (default=None)
+        classes : array-like, shape = [n_classes] (default=None)
I see that we are not consistent here.
I think we should use tuples for shapes in docstrings, so:
array-like, shape = (n_classes,) (default=None)
well I could change that but that would really increase the size of the pull request, even if I just do it for naive_bayes.py
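For reference, the two shape conventions side by side in a numpydoc fragment (illustrative, not a diff from the PR):

```python
def partial_fit(self, X, y, classes=None):
    """Illustrative docstring contrasting the two shape conventions.

    Parameters
    ----------
    y : array-like, shape = [n_samples]
        Bracket style, as naive_bayes.py currently reads.
    classes : array-like, shape = (n_classes,) (default=None)
        Tuple style, the convention suggested in this review.
    """
```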
doc/whats_new.rst (outdated)
  ``fit``, and no longer during the call to ``transform```, by `Andreas
  Müller`_.

- :class:`features_selection.SelectFromModel` now has a ``partial_fit``
features -> feature
that's it for me.
doc/whats_new.rst (outdated)
  :issue:`8086` by `Andreas Müller`_.

- Fix output shape and bugs with n_jobs > 1 in
  :class:`sklearn.decomposition.SparseEncoder` transform and :func:`sklarn.decomposition.sparse_encode`
SparseEncoder -> SparseCoder
    if dictionary.shape[1] != X.shape[1]:
        raise ValueError("Dictionary and X have different numbers of features:"
                         "dictionary.shape: {} X.shape{}".format(
                             dictionary.shape, X.shape))
could use `check_consistent_length(X.T, dictionary.T)`. You could argue that's less clear though.
It'll say "different number of samples" in the error, right?
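A quick sketch of that trade-off (the exact wording of the error depends on the sklearn version): transposing makes n_features play the role of n_samples, so the check fires, but the message talks about samples rather than features.

```python
import numpy as np
from sklearn.utils import check_consistent_length

X = np.zeros((10, 5))
dictionary = np.zeros((3, 4))  # 4 features vs. X's 5

try:
    # After transposing, the "length" compared is the feature count.
    check_consistent_length(X.T, dictionary.T)
except ValueError as exc:
    # e.g. "Found input variables with inconsistent numbers of samples: [5, 4]"
    print(exc)
```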
doc/whats_new.rst (outdated)
@@ -213,7 +213,8 @@ Bug fixes

 - Fixed a bug where :class:`sklearn.linear_model.LassoLars` does not give
   the same result as the LassoLars implementation available
-  in R (lars library). :issue:`7849` by :user:`Jair Montoya Martinez <jmontoyam>`
+  in R (lars library). :issue:`7849` by `Jair Montoya Martinez`_
There is no link for Jair Montoya Martinez.
Revert this one?
use this url : https://github.com/jmontoyam
thanks, I guess that got lost in a merge.
> use this url : https://github.com/jmontoyam

Then revert it to :user:`Jair Montoya Martinez <jmontoyam>`
doc/whats_new.rst (outdated)
  being subtracted from the centroids. :issue:`7872` by `Josh Karnofsky <https://github.com/jkarno>`_.
- Fix a bug regarding fitting :class:`sklearn.cluster.KMeans` with a sparse
  array X and initial centroids, where X's means were unnecessarily being
  subtracted from the centroids. :issue:`7872` by `Josh Karnofsky
:user:`Josh Karnofsky <jkarno>`
good to go on my end
doc/whats_new.rst (outdated)
@@ -336,6 +351,20 @@ API changes summary
     :func:`sklearn.model_selection.cross_val_predict`.
     :issue:`2879` by :user:`Stephen Hoover <stephen-hoover>`.


+- Gradient boosting base models are not longer estimators. By `Andreas Müller`_.
not longer -> no longer
Comment moved to new issue: #8998, this is not relevant to this PR.
Other than the very minor …
@GaelVaroquaux do you wanna have a final look or merge?
@vene merge?
[MRG+1] Uncontroversial fixes from estimator tags branch (scikit-learn#8086)

* some bug fixes.
* minor fixes to whatsnew
* typo in whatsnew
* add test for n_components = 1 transform in dict learning
* feature extraction doc fix
* fix broken test
* revert aggressive input validation changes
* in SelectFromModel, don't store threshold_ in transform. If we called "fit", use estimates from last "fit".
* move score from EllipticEnvelope to OutlierDetectionMixin
* revert changes to Tfidf documentation
* remove dummy input validation from whatsnew
* fix text feature tests
* rewrite from_model threshold again...
* remove stray condition
* fix self.estimator -> estimator, slightly more interesting test
* typo in comment
* Fix issues in SparseEncoder, add tests. more explicit explanation of SparseEncoder change, add issue numbers to whatsnew
* minor fixes in whats_new.rst
* slightly more consistency with tuples for shapes
* not longer typo
These are some of the simple fixes from #8022 which should be fairly uncontroversial and easy to review.
I hope that makes the review of #8022 easier and allows that PR to focus on the API.