Skip to content

Conversation

StefanieSenger
Copy link
Member

@StefanieSenger StefanieSenger commented Jul 31, 2023

This PR aims to provide a more precise distinction of estimator, transformer and predictor in the documentation for Pipeline.

The current docstring defines Pipeline as Pipeline of transforms with a final estimator , stating The last transform must be an estimator, suggesting that estimator and transformer are distinct concepts or that the last step of a pipeline is somehow more restricted than the others.

However, this wording is misleading and doesn't adequately convey the fact that the last step in the Pipeline can be any estimator, not necessarily a transformer.

@github-actions
Copy link

github-actions bot commented Jul 31, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 0f149d6. Link to the linter CI: here

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @StefanieSenger !

Comment on lines 56 to 57
A sequence of data transformers, which can be optionally complemented with
a final predictor.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CI is failing because we are following NumPyDoc standards which require the "summary line" to fit one line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Thank you. :)

Comment on lines 84 to 85
that are chained in sequential order. The last step can only implement
`fit`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't many requirements on the last step. For example, one can have a stateless transformers in a pipeline:

from sklearn.pipeline import make_pipeline
import numpy as np

class DoubleIt:
    def transform(self, X, y=None):
        return 2*X

X = np.array([[1, 2, 3], [4, 5, 6]])
p = make_pipeline(DoubleIt(), DoubleIt())

p.transform(X)
# array([[ 4,  8, 12],
#       [16, 20, 24]])

As long as the methods you call on the pipeline exist in the final step, then everything still works.

Copy link
Member Author

@StefanieSenger StefanieSenger Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @thomasjpfan, I didn't think of that... I will change the wording to also allow for only stateless transformers. :)

But look at this:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
import numpy as np

class DoubleIt:

    def transform(self, X, y=None):
        return 2*X

X = np.array([[1, 2, 3], [4, 5, 6]])
p = Pipeline([
            ('double1', DoubleIt()),
            ('double2', DoubleIt()),
            ('linreg', LinearRegression())     # same results with ('linreg', None), but not with "passthrough"
            ])

print(p.fit(X))

Error:
TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' '<__main__.DoubleIt object at 0x7fbed9d1f310>' (type <class '__main__.DoubleIt'>) doesn't

Whenever the one of the steps implements a fit, the previous steps are expected to be stateful transformers, too. This is because def _validate_steps() only checks the steps if .fit() is called.

Should _validate_steps() be modified to be less strict about stateless transformers? And should it then also run on transform(), and all the other possible methods?

PS: I just opened an issue for this (#27014). :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given where the discussion in #27014 is headed, I think we need to enforce that pipeline.fit works. According to our estimator docs, an estimator must define fit. For pipeline.fit to work:

  • All steps must define fit.
  • All non-last steps must also define transform.

<transformed_target_regressor>` deals with transforming the :term:`target`
(i.e. log-transform :term:`y`). In contrast, Pipelines only transform the
observed data (:term:`X`).
:term:`predictors` to build a composite estimator. The most common tool is a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the new paragraph. It is easier to read and is still correct. Getting this right is not easy, at least I had to go and read the definitions of estimator, predictor, etc and then think to figure out if it was correct or not!

I find the start of the first sentence hard to read. What do you think of starting with the end "To build a composite estimator ...."?

Another question: would it be correct to say "... combined with classifiers, regressors, other predictors or estimators to build a composite estimator"? The reason I'm asking is that if I want to make a composite estimator that is a transformer, then I don't need a predictor as the last step no? The model I'm starting to form in my head is that the type of the composite estimator is equal to the type of the final step in the pipeline. Is that useful as a mental model?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@betatim Thanks for your review. Your thoughts helped me to clarify how to communicate this. I've changed how to start the first sentence and found some words to express, that Pipeline would also combine two transformers, and what the type of the pipeline then would be.

This is the result:

To build a composite estimator, transformers are usually combined with other
transformers or with :term:`predictors` (such as classifiers or regressors).
The most common tool used for this is a :ref:`Pipeline <pipeline>`. Pipelines
without a final predictor will fit and transform the observed data (:term:`X`)
and return it. In contrast, pipelines with a final predictor will fit the
predictor with the transformed data from the previous steps and can then also
be used for predictions. The methods to be called on the pipeline are the same
methods that the last contained estimator provides.

What do you think?

Copy link
Member

@betatim betatim Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it! (sorry, I have nothing more to say :D)

@betatim
Copy link
Member

betatim commented Aug 3, 2023

The current docstring defines Pipeline as Pipeline of transforms with a final estimator , stating The last transform must be an estimator, suggesting that estimator and transformer are distinct concepts or that the last step of a pipeline is somehow more regulated than the others.

However, this wording is misleading and doesn't adequately convey the fact that the last step in the Pipeline can be any estimator, not necessarily a transformer.

I think all transformers are estimators, but not all estimators are transformers. If that is true, then saying that you can combine transformers (more restrictive) followed by an estimator (less restrictive) is right. It includes (maybe a bit obscurely) the case where all steps are transformers, because a transformer as the last step is an estimator.

@StefanieSenger
Copy link
Member Author

@betatim wrote

I think all transformers are estimators, but not all estimators are transformers. If that is true, then saying that you can combine transformers (more restrictive) followed by an estimator (less restrictive) is right. It includes (maybe a bit obscurely) the case where all steps are transformers, because a transformer as the last step is an estimator.

I also understood it this way, but only after researching what an estimator is. Before, I thought estimator=predictor. Also, this docstring seemed to stress that, the final estimator is somehow more restricted than the other steps of the pipeline. This is why I was substituting the word estimator with predictor and stress on that it's optional to add one, to have some more clarity here.

@StefanieSenger StefanieSenger changed the title DOC specifying type of estimator to be used in last step of a pipeline DOC clearer definition of estimator to be used in last step of a pipeline Aug 4, 2023
StefanieSenger and others added 2 commits September 14, 2023 12:48
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Comment on lines 88 to 91
List of (name of step, transformer implementing `transform`) tuples
that are to be chained in sequential order. To be compatible with the
scikit-learn API, all steps must define `fit`. All non-last steps must
also define `transform`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can also be rephrased. Maybe like:

Suggested change
List of (name of step, transformer implementing `transform`) tuples
that are to be chained in sequential order. To be compatible with the
scikit-learn API, all steps must define `fit`. All non-last steps must
also define `transform`.
List of (name, estimator) tuples that are to be chained in sequential
order. All steps except the last must implement `fit` and `transform`,
and the last step doesn't have to implement `transform`. See
:ref:`Combining Estimators <combining_estimators>` for more details.

Copy link
Member Author

@StefanieSenger StefanieSenger Sep 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will suggest is mixture of both ;)

Edit: Because (name, estimator) is unclear. It doesn't show, that users can pick name themselves.

Comment on lines 486 to 487
Fit all the transformers sequentially to the input data and transform
it. Only valid if the final estimator implements `fit_transform`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, and note that the last step doesn't have to implement fit_transform.

Copy link
Member Author

@StefanieSenger StefanieSenger Sep 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is why you can only call fit_transform on the pipeline, if the last step implements fit_transform.
The phrase "Only valid if ..." is used on all the other 8 methods that are exposed after the pipeline is fitted. I copied it here, because to me it makes sense.

Here fit and transform of all the estimators (including the last one) are both used on the data.

"transformers sequentially to the input data" could be misunderstood though. I will suggest an updated sentence. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's not true, you can call fit_transform, and it will call .fit.transform if fit_transform doesn't exist.

Copy link
Member Author

@StefanieSenger StefanieSenger Sep 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, now I see. Should we put something like:
Only valid if the final estimator either implements fit_transform or fit and transform.
?

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM.

Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as well once the CI is fixed.

StefanieSenger and others added 2 commits September 21, 2023 09:34
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@adrinjalali adrinjalali merged commit 7b6b657 into scikit-learn:main Sep 21, 2023
@StefanieSenger StefanieSenger deleted the pipeline_estimator branch September 21, 2023 16:17
lesteve pushed a commit to lesteve/scikit-learn that referenced this pull request Sep 28, 2023
…line (scikit-learn#26952)

Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
…line (scikit-learn#26952)

Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants