DOC clearer definition of estimator to be used in last step of a pipeline #26952

StefanieSenger · 2023-07-31T12:04:04Z

This PR aims to provide a more precise distinction of estimator, transformer and predictor in the documentation for Pipeline.

The current docstring defines Pipeline as Pipeline of transforms with a final estimator , stating The last transform must be an estimator, suggesting that estimator and transformer are distinct concepts or that the last step of a pipeline is somehow more restricted than the others.

However, this wording is misleading and doesn't adequately convey the fact that the last step in the Pipeline can be any estimator, not necessarily a transformer.

github-actions · 2023-07-31T12:05:32Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

_{Generated for commit: 0f149d6. Link to the linter CI: here}

thomasjpfan

Thank you for the PR @StefanieSenger !

thomasjpfan · 2023-07-31T13:48:54Z

sklearn/pipeline.py

+    A sequence of data transformers, which can be optionally complemented with
+    a final predictor.


The CI is failing because we are following NumPyDoc standards which require the "summary line" to fit one line.

Ah, I see. Thank you. :)

thomasjpfan · 2023-07-31T13:55:13Z

sklearn/pipeline.py

+        that are chained in sequential order. The last step can only implement
+        `fit`.


There isn't many requirements on the last step. For example, one can have a stateless transformers in a pipeline:

from sklearn.pipeline import make_pipeline import numpy as np class DoubleIt: def transform(self, X, y=None): return 2*X X = np.array([[1, 2, 3], [4, 5, 6]]) p = make_pipeline(DoubleIt(), DoubleIt()) p.transform(X) # array([[ 4, 8, 12], # [16, 20, 24]])

As long as the methods you call on the pipeline exist in the final step, then everything still works.

Thanks @thomasjpfan, I didn't think of that... I will change the wording to also allow for only stateless transformers. :)

But look at this:

from sklearn.pipeline import Pipeline from sklearn.linear_model import LinearRegression import numpy as np class DoubleIt: def transform(self, X, y=None): return 2*X X = np.array([[1, 2, 3], [4, 5, 6]]) p = Pipeline([ ('double1', DoubleIt()), ('double2', DoubleIt()), ('linreg', LinearRegression()) # same results with ('linreg', None), but not with "passthrough" ]) print(p.fit(X))

Error:
TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' '<__main__.DoubleIt object at 0x7fbed9d1f310>' (type <class '__main__.DoubleIt'>) doesn't

Whenever the one of the steps implements a fit, the previous steps are expected to be stateful transformers, too. This is because def _validate_steps() only checks the steps if .fit() is called.

Should _validate_steps() be modified to be less strict about stateless transformers? And should it then also run on transform(), and all the other possible methods?

PS: I just opened an issue for this (#27014). :)

Given where the discussion in #27014 is headed, I think we need to enforce that pipeline.fit works. According to our estimator docs, an estimator must define fit. For pipeline.fit to work:

All steps must define fit.

All non-last steps must also define transform.

betatim · 2023-08-03T12:20:42Z

doc/modules/compose.rst

-<transformed_target_regressor>` deals with transforming the :term:`target`
-(i.e. log-transform :term:`y`). In contrast, Pipelines only transform the
-observed data (:term:`X`).
+:term:`predictors` to build a composite estimator. The most common tool is a


I like the new paragraph. It is easier to read and is still correct. Getting this right is not easy, at least I had to go and read the definitions of estimator, predictor, etc and then think to figure out if it was correct or not!

I find the start of the first sentence hard to read. What do you think of starting with the end "To build a composite estimator ...."?

Another question: would it be correct to say "... combined with classifiers, regressors, other predictors or estimators to build a composite estimator"? The reason I'm asking is that if I want to make a composite estimator that is a transformer, then I don't need a predictor as the last step no? The model I'm starting to form in my head is that the type of the composite estimator is equal to the type of the final step in the pipeline. Is that useful as a mental model?

@betatim Thanks for your review. Your thoughts helped me to clarify how to communicate this. I've changed how to start the first sentence and found some words to express, that Pipeline would also combine two transformers, and what the type of the pipeline then would be.

This is the result:

To build a composite estimator, transformers are usually combined with other transformers or with :term:`predictors` (such as classifiers or regressors). The most common tool used for this is a :ref:`Pipeline <pipeline>`. Pipelines without a final predictor will fit and transform the observed data (:term:`X`) and return it. In contrast, pipelines with a final predictor will fit the predictor with the transformed data from the previous steps and can then also be used for predictions. The methods to be called on the pipeline are the same methods that the last contained estimator provides.

What do you think?

I like it! (sorry, I have nothing more to say :D)

betatim · 2023-08-03T12:24:47Z

The current docstring defines Pipeline as Pipeline of transforms with a final estimator , stating The last transform must be an estimator, suggesting that estimator and transformer are distinct concepts or that the last step of a pipeline is somehow more regulated than the others.

However, this wording is misleading and doesn't adequately convey the fact that the last step in the Pipeline can be any estimator, not necessarily a transformer.

I think all transformers are estimators, but not all estimators are transformers. If that is true, then saying that you can combine transformers (more restrictive) followed by an estimator (less restrictive) is right. It includes (maybe a bit obscurely) the case where all steps are transformers, because a transformer as the last step is an estimator.

StefanieSenger · 2023-08-04T14:28:59Z

@betatim wrote

I think all transformers are estimators, but not all estimators are transformers. If that is true, then saying that you can combine transformers (more restrictive) followed by an estimator (less restrictive) is right. It includes (maybe a bit obscurely) the case where all steps are transformers, because a transformer as the last step is an estimator.

I also understood it this way, but only after researching what an estimator is. Before, I thought estimator=predictor. Also, this docstring seemed to stress that, the final estimator is somehow more restricted than the other steps of the pipeline. This is why I was substituting the word estimator with predictor and stress on that it's optional to add one, to have some more clarity here.

…ipeline

doc/modules/compose.rst

Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>

adrinjalali · 2023-09-14T13:57:10Z

sklearn/pipeline.py

+        List of (name of step, transformer implementing `transform`) tuples
+        that are to be chained in sequential order. To be compatible with the
+        scikit-learn API, all steps must define `fit`. All non-last steps must
+        also define `transform`.


this can also be rephrased. Maybe like:

Suggested change

List of (name of step, transformer implementing `transform`) tuples

that are to be chained in sequential order. To be compatible with the

scikit-learn API, all steps must define `fit`. All non-last steps must

also define `transform`.

List of (name, estimator) tuples that are to be chained in sequential

order. All steps except the last must implement `fit` and `transform`,

and the last step doesn't have to implement `transform`. See

:ref:`Combining Estimators <combining_estimators>` for more details.

I will suggest is mixture of both ;)

Edit: Because (name, estimator) is unclear. It doesn't show, that users can pick name themselves.

sklearn/pipeline.py

adrinjalali · 2023-09-14T13:59:00Z

sklearn/pipeline.py

+        Fit all the transformers sequentially to the input data and transform
+        it. Only valid if the final estimator implements `fit_transform`.


Same here, and note that the last step doesn't have to implement fit_transform.

Yes, this is why you can only call fit_transform on the pipeline, if the last step implements fit_transform.
The phrase "Only valid if ..." is used on all the other 8 methods that are exposed after the pipeline is fitted. I copied it here, because to me it makes sense.

Here fit and transform of all the estimators (including the last one) are both used on the data.

"transformers sequentially to the input data" could be misunderstood though. I will suggest an updated sentence. :)

that's not true, you can call fit_transform, and it will call .fit.transform if fit_transform doesn't exist.

Ah, now I see. Should we put something like:
Only valid if the final estimator either implements fit_transform or fit and transform.
?

adrinjalali

Otherwise LGTM.

doc/modules/compose.rst

Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>

ogrisel

LGTM as well once the CI is fixed.

doc/modules/compose.rst

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

…line (scikit-learn#26952) Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

specified type of estimator for last step in pipeline docs

d888574

github-actions bot added module:pipeline Documentation labels Jul 31, 2023

thomasjpfan reviewed Jul 31, 2023

View reviewed changes

betatim reviewed Aug 3, 2023

View reviewed changes

StefanieSenger mentioned this pull request Aug 4, 2023

Pipeline throws TypeError on stateless transformers #27014

Closed

changes after review

1abce29

StefanieSenger changed the title ~~DOC specifying type of estimator to be used in last step of a pipeline~~ DOC clearer definition of estimator to be used in last step of a pipeline Aug 4, 2023

StefanieSenger and others added 2 commits August 23, 2023 11:33

Merge branch 'main' into pipeline_estimator

e4a617a

revised for clarity regarding fit and transform methods in steps of p…

9d3725a

…ipeline

adrinjalali reviewed Aug 24, 2023

View reviewed changes

doc/modules/compose.rst Outdated Show resolved Hide resolved

StefanieSenger and others added 2 commits September 14, 2023 12:48

Update doc/modules/compose.rst

43e9ed6

Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>

Merge branch 'main' into pipeline_estimator

a0db834

adrinjalali reviewed Sep 14, 2023

View reviewed changes

little changes

5f27130

adrinjalali approved these changes Sep 17, 2023

View reviewed changes

doc/modules/compose.rst Outdated Show resolved Hide resolved

Update doc/modules/compose.rst

f025ae8

Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>

ogrisel approved these changes Sep 19, 2023

View reviewed changes

doc/modules/compose.rst Outdated Show resolved Hide resolved

StefanieSenger and others added 2 commits September 21, 2023 09:34

Update doc/modules/compose.rst

732ac76

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

little wording changes

0f149d6

adrinjalali merged commit 7b6b657 into scikit-learn:main Sep 21, 2023

StefanieSenger deleted the pipeline_estimator branch September 21, 2023 16:17

		A sequence of data transformers, which can be optionally complemented with
		a final predictor.

		that are chained in sequential order. The last step can only implement
		`fit`.

		Fit all the transformers sequentially to the input data and transform
		it. Only valid if the final estimator implements `fit_transform`.

Uh oh!

DOC clearer definition of estimator to be used in last step of a pipeline #26952

DOC clearer definition of estimator to be used in last step of a pipeline #26952

Uh oh!

Conversation

StefanieSenger commented Jul 31, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Jul 31, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✔️ Linting Passed

Uh oh!

thomasjpfan left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

StefanieSenger Aug 4, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

betatim Aug 4, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

betatim commented Aug 3, 2023

Uh oh!

StefanieSenger commented Aug 4, 2023

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

StefanieSenger Sep 15, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

StefanieSenger Sep 15, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

StefanieSenger Sep 15, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adrinjalali left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ogrisel left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

StefanieSenger commented Jul 31, 2023 •

edited

Loading

github-actions bot commented Jul 31, 2023 •

edited

Loading

StefanieSenger Aug 4, 2023 •

edited

Loading

betatim Aug 4, 2023 •

edited

Loading

StefanieSenger Sep 15, 2023 •

edited

Loading

StefanieSenger Sep 15, 2023 •

edited

Loading

StefanieSenger Sep 15, 2023 •

edited

Loading