
[MRG] FIX Modify the API of Pipeline and FeatureUnion to match common scikit-learn estimators conventions #8350


Closed
wants to merge 16 commits
2 changes: 1 addition & 1 deletion doc/modules/linear_model.rst
@@ -1222,7 +1222,7 @@ polynomial regression can be created and used as follows::
>>> x = np.arange(5)
>>> y = 3 - 2 * x + x ** 2 - x ** 3
>>> model = model.fit(x[:, np.newaxis], y)
->>> model.named_steps['linear'].coef_
+>>> model.named_steps_['linear'].coef_
array([ 3., -2., 1., -1.])

The linear model trained on polynomial features is able to exactly recover
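
For context on the rename: under this PR, ``named_steps`` keeps the estimators
as passed to the constructor, while the new trailing-underscore
``named_steps_`` holds the fitted clones. A minimal sketch, assuming the PR's
proposed API (``x``, ``y`` and ``np`` as in the hunk above; the step name is
the one ``make_pipeline`` generates)::

>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import PolynomialFeatures
>>> from sklearn.linear_model import LinearRegression
>>> model = make_pipeline(PolynomialFeatures(degree=3),
...                       LinearRegression(fit_intercept=False))
>>> model = model.fit(x[:, np.newaxis], y)
>>> model.named_steps_['linearregression'].coef_
array([ 3., -2., 1., -1.])
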
72 changes: 22 additions & 50 deletions doc/modules/pipeline.rst
@@ -50,16 +50,13 @@ it takes a variable number of estimators and returns a pipeline,
filling in the names automatically::

>>> from sklearn.pipeline import make_pipeline
->>> from sklearn.naive_bayes import MultinomialNB
->>> from sklearn.preprocessing import Binarizer
->>> make_pipeline(Binarizer(), MultinomialNB()) # doctest: +NORMALIZE_WHITESPACE
+>>> make_pipeline(PCA(), SVC()) # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
Pipeline(memory=None,
-steps=[('binarizer', Binarizer(copy=True, threshold=0.0)),
-('multinomialnb', MultinomialNB(alpha=1.0,
-class_prior=None,
-fit_prior=True))])
+steps=[('pca', PCA(copy=True,...)),
+('svc', SVC(C=1.0,...))])

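The generated names are simply the lowercased class names, which can be
checked directly (a quick sketch under the released API, imports as in the
surrounding doc; not part of the diff)::

>>> sorted(make_pipeline(PCA(), SVC()).named_steps)
['pca', 'svc']
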
-The estimators of a pipeline are stored as a list in the ``steps`` attribute::
+The original estimators of a pipeline are stored as a list in the ``steps``
+attribute::

>>> pipe.steps[0]
('reduce_dim', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
@@ -71,6 +68,23 @@ and as a ``dict`` in ``named_steps``::
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)

+Once the pipeline has been fitted, ``steps_`` and ``named_steps_`` contain
+the fitted models::

Review comment (Member): have to be used? Or contain the fitted models?

+>>> from sklearn.datasets import load_iris
+>>> iris = load_iris()
+>>> pipe.fit(iris.data, iris.target)
+... # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
+Pipeline(memory=None,
+steps=[('reduce_dim', PCA(copy=True,...)),
+('clf', SVC(C=1.0,...))])
+>>> pipe.named_steps_['reduce_dim'] # doctest: +NORMALIZE_WHITESPACE
+PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
+svd_solver='auto', tol=0.0, whiten=False)
+>>> pipe.steps_[0] # doctest: +NORMALIZE_WHITESPACE
+('reduce_dim', PCA(copy=True, iterated_power='auto', n_components=None,
+random_state=None, svd_solver='auto', tol=0.0, whiten=False))

Parameters of the estimators in the pipeline can be accessed using the
``<estimator>__<parameter>`` syntax::

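The hunk below elides the doc's own example; for illustration, a minimal
sketch of that syntax using the ``pipe`` defined earlier (step name ``clf``
assumed from the example above)::

>>> pipe.set_params(clf__C=10) # doctest: +ELLIPSIS
Pipeline(...)
>>> pipe.get_params()['clf__C']
10
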
@@ -152,48 +166,6 @@ object::
>>> # Clear the cache directory when you don't need it anymore
>>> rmtree(cachedir)

-.. warning:: **Side effect of caching transformers**

Review comment (Member): merge issue?

-Using a :class:`Pipeline` without caching enabled, it is possible to
-inspect the original instance as follows::
-
->>> from sklearn.datasets import load_digits
->>> digits = load_digits()
->>> pca1 = PCA()
->>> svm1 = SVC()
->>> pipe = Pipeline([('reduce_dim', pca1), ('clf', svm1)])
->>> pipe.fit(digits.data, digits.target)
-... # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
-Pipeline(memory=None,
-steps=[('reduce_dim', PCA(...)), ('clf', SVC(...))])
->>> # The pca instance can be inspected directly
->>> print(pca1.components_) # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
-[[ -1.77484909e-19 ... 4.07058917e-18]]
-
-Enabling caching triggers a clone of the transformers before fitting.
-Therefore, the transformer instance given to the pipeline cannot be
-inspected directly.
-In the following example, accessing the :class:`PCA` instance ``pca2``
-will raise an ``AttributeError`` since ``pca2`` will be an unfitted
-transformer.
-Instead, use the attribute ``named_steps`` to inspect estimators within
-the pipeline::
-
->>> cachedir = mkdtemp()
->>> pca2 = PCA()
->>> svm2 = SVC()
->>> cached_pipe = Pipeline([('reduce_dim', pca2), ('clf', svm2)],
-...                        memory=cachedir)
->>> cached_pipe.fit(digits.data, digits.target)
-... # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
-Pipeline(memory=...,
-steps=[('reduce_dim', PCA(...)), ('clf', SVC(...))])
->>> print(cached_pipe.named_steps['reduce_dim'].components_)
-... # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
-[[ -1.77484909e-19 ... 4.07058917e-18]]
->>> # Remove the cache directory
->>> rmtree(cachedir)
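
The side effect described above can be checked directly; a minimal sketch,
assuming the 0.19 caching semantics (``digits``, ``mkdtemp`` and ``rmtree``
as imported earlier on this page; not part of the diff)::

>>> cachedir = mkdtemp()
>>> pca2 = PCA()
>>> cached_pipe = Pipeline([('reduce_dim', pca2), ('clf', SVC())],
...                        memory=cachedir)
>>> cached_pipe.fit(digits.data, digits.target) # doctest: +ELLIPSIS
Pipeline(...)
>>> cached_pipe.named_steps['reduce_dim'] is pca2 # a clone was fitted instead
False
>>> rmtree(cachedir)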

.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_plot_compare_reduction.py`
8 changes: 8 additions & 0 deletions doc/whats_new.rst
@@ -56,6 +56,7 @@ Enhancements
- :class:`multioutput.MultiOutputRegressor` and :class:`multioutput.MultiOutputClassifier`
now support online learning using ``partial_fit``.
:issue:`8053` by :user:`Peng Yu <yupbank>`.

- :class:`pipeline.Pipeline` allows caching transformers
within a pipeline by using the ``memory`` constructor parameter.
:issue:`7990` by :user:`Guillaume Lemaitre <glemaitre>`.
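
The ``partial_fit`` support noted above can be exercised incrementally; a
minimal sketch with toy data (illustrative only, not taken from the PR)::

>>> import numpy as np
>>> from sklearn.multioutput import MultiOutputRegressor
>>> from sklearn.linear_model import SGDRegressor
>>> rng = np.random.RandomState(0)
>>> X, y = rng.rand(10, 3), rng.rand(10, 2)
>>> est = MultiOutputRegressor(SGDRegressor(random_state=0))
>>> _ = est.partial_fit(X, y) # first call fits one SGD model per output
>>> _ = est.partial_fit(X, y) # later calls update those models incrementally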
@@ -251,6 +252,13 @@ API changes summary
:func:`sklearn.model_selection.cross_val_predict`.
:issue:`2879` by :user:`Stephen Hoover <stephen-hoover>`.

+- In the future, the estimators in the ``steps`` and ``named_steps``
+attributes will no longer have their ``fit()`` methods called directly.
+Users will have to access fitted Pipeline steps in ``steps_``
+and ``named_steps_``. The warning was introduced in 0.19 and will take
+effect in 0.22.
+:issue:`8350` by :user:`Guillaume Lemaitre <glemaitre>`.

Review comment (Member): missing backtick after named_steps_

.. _changes_0_18_1:

Version 0.18.1
@@ -129,7 +129,7 @@ def nudge_dataset(X, Y):
# Plotting

plt.figure(figsize=(4.2, 4))
-for i, comp in enumerate(rbm.components_):
+for i, comp in enumerate(classifier.named_steps_['rbm'].components_):

Review comment (Member): We want to remove this use-case, right? (the way it was previously).

Review comment (Member): There is no way to warn a user doing this, right?

    plt.subplot(10, 10, i + 1)
    plt.imshow(comp.reshape((8, 8)), cmap=plt.cm.gray_r,
               interpolation='nearest')
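
Illustrating the review questions above: under the proposed cloning semantics
the user-held instance stays unfitted, so the old direct access fails with an
AttributeError. A hedged sketch (X, Y as in the example script; names here are
illustrative, not part of the diff):

from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

rbm = BernoulliRBM(random_state=0)
classifier = Pipeline([('rbm', rbm), ('logistic', LogisticRegression())])
classifier.fit(X, Y)
# `rbm` itself is never fitted (the pipeline fits a clone), so
# `rbm.components_` would now raise AttributeError; the fitted clone
# must be fetched from the pipeline instead (this PR's proposed API):
comp = classifier.named_steps_['rbm'].components_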
2 changes: 1 addition & 1 deletion examples/plot_digits_pipe.py
100644 → 100755
@@ -57,7 +57,7 @@
logistic__C=Cs))
estimator.fit(X_digits, y_digits)

-plt.axvline(estimator.best_estimator_.named_steps['pca'].n_components,
+plt.axvline(estimator.best_estimator_.named_steps_['pca'].n_components,
            linestyle=':', label='n_components chosen')
plt.legend(prop=dict(size=12))
plt.show()
6 changes: 3 additions & 3 deletions examples/preprocessing/plot_scaling_importance.py
100644 → 100755
@@ -86,15 +86,15 @@
print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred_test_std)))

# Extract PCA from pipeline
-pca = unscaled_clf.named_steps['pca']
-pca_std = std_clf.named_steps['pca']
+pca = unscaled_clf.named_steps_['pca']
+pca_std = std_clf.named_steps_['pca']

# Show first principal components
print('\nPC 1 without scaling:\n', pca.components_[0])
print('\nPC 1 with scaling:\n', pca_std.components_[0])

# Scale and use PCA on X_train data for visualization.
-scaler = std_clf.named_steps['standardscaler']
+scaler = std_clf.named_steps_['standardscaler']
X_train_std = pca_std.transform(scaler.transform(X_train))

# visualize standardized vs. untouched dataset with PCA performed
6 changes: 3 additions & 3 deletions sklearn/feature_extraction/tests/test_text.py
@@ -246,7 +246,7 @@ def test_countvectorizer_custom_vocabulary_pipeline():
('count', CountVectorizer(vocabulary=what_we_like)),
('tfidf', TfidfTransformer())])
X = pipe.fit_transform(ALL_FOOD_DOCS)
-assert_equal(set(pipe.named_steps['count'].vocabulary_),
+assert_equal(set(pipe.named_steps_['count'].vocabulary_),
set(what_we_like))
assert_equal(X.shape[1], len(what_we_like))

@@ -728,7 +728,7 @@ def test_count_vectorizer_pipeline_grid_selection():
# the grid_search is considered the best estimator since they all converge
# to 100% accuracy models
assert_equal(grid_search.best_score_, 1.0)
-best_vectorizer = grid_search.best_estimator_.named_steps['vect']
+best_vectorizer = grid_search.best_estimator_.named_steps_['vect']
assert_equal(best_vectorizer.ngram_range, (1, 1))


@@ -765,7 +765,7 @@ def test_vectorizer_pipeline_grid_selection():
# the grid_search is considered the best estimator since they all converge
# to 100% accuracy models
assert_equal(grid_search.best_score_, 1.0)
-best_vectorizer = grid_search.best_estimator_.named_steps['vect']
+best_vectorizer = grid_search.best_estimator_.named_steps_['vect']
assert_equal(best_vectorizer.ngram_range, (1, 1))
assert_equal(best_vectorizer.norm, 'l2')
assert_false(best_vectorizer.fixed_vocabulary_)