-
-
Notifications
You must be signed in to change notification settings - Fork 26.2k
DOC clearer definition of estimator to be used in last step of a pipeline #26952
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOC clearer definition of estimator to be used in last step of a pipeline #26952
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR @StefanieSenger !
sklearn/pipeline.py
Outdated
A sequence of data transformers, which can be optionally complemented with | ||
a final predictor. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CI is failing because we are following NumPyDoc standards which require the "summary line" to fit one line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. Thank you. :)
sklearn/pipeline.py
Outdated
that are chained in sequential order. The last step can only implement | ||
`fit`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There isn't many requirements on the last step. For example, one can have a stateless transformers in a pipeline:
from sklearn.pipeline import make_pipeline
import numpy as np
class DoubleIt:
def transform(self, X, y=None):
return 2*X
X = np.array([[1, 2, 3], [4, 5, 6]])
p = make_pipeline(DoubleIt(), DoubleIt())
p.transform(X)
# array([[ 4, 8, 12],
# [16, 20, 24]])
As long as the methods you call on the pipeline exist in the final step, then everything still works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @thomasjpfan, I didn't think of that... I will change the wording to also allow for only stateless transformers. :)
But look at this:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
import numpy as np
class DoubleIt:
def transform(self, X, y=None):
return 2*X
X = np.array([[1, 2, 3], [4, 5, 6]])
p = Pipeline([
('double1', DoubleIt()),
('double2', DoubleIt()),
('linreg', LinearRegression()) # same results with ('linreg', None), but not with "passthrough"
])
print(p.fit(X))
Error:
TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' '<__main__.DoubleIt object at 0x7fbed9d1f310>' (type <class '__main__.DoubleIt'>) doesn't
Whenever the one of the steps implements a fit
, the previous steps are expected to be stateful transformers, too. This is because def _validate_steps()
only checks the steps if .fit()
is called.
Should _validate_steps
() be modified to be less strict about stateless transformers? And should it then also run on transform()
, and all the other possible methods?
PS: I just opened an issue for this (#27014). :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given where the discussion in #27014 is headed, I think we need to enforce that pipeline.fit
works. According to our estimator docs, an estimator must define fit
. For pipeline.fit
to work:
- All steps must define
fit
. - All non-last steps must also define
transform
.
doc/modules/compose.rst
Outdated
<transformed_target_regressor>` deals with transforming the :term:`target` | ||
(i.e. log-transform :term:`y`). In contrast, Pipelines only transform the | ||
observed data (:term:`X`). | ||
:term:`predictors` to build a composite estimator. The most common tool is a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the new paragraph. It is easier to read and is still correct. Getting this right is not easy, at least I had to go and read the definitions of estimator, predictor, etc and then think to figure out if it was correct or not!
I find the start of the first sentence hard to read. What do you think of starting with the end "To build a composite estimator ...."?
Another question: would it be correct to say "... combined with classifiers, regressors, other predictors or estimators to build a composite estimator"? The reason I'm asking is that if I want to make a composite estimator that is a transformer, then I don't need a predictor as the last step no? The model I'm starting to form in my head is that the type of the composite estimator is equal to the type of the final step in the pipeline. Is that useful as a mental model?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@betatim Thanks for your review. Your thoughts helped me to clarify how to communicate this. I've changed how to start the first sentence and found some words to express, that Pipeline would also combine two transformers, and what the type of the pipeline then would be.
This is the result:
To build a composite estimator, transformers are usually combined with other
transformers or with :term:`predictors` (such as classifiers or regressors).
The most common tool used for this is a :ref:`Pipeline <pipeline>`. Pipelines
without a final predictor will fit and transform the observed data (:term:`X`)
and return it. In contrast, pipelines with a final predictor will fit the
predictor with the transformed data from the previous steps and can then also
be used for predictions. The methods to be called on the pipeline are the same
methods that the last contained estimator provides.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it! (sorry, I have nothing more to say :D)
I think all transformers are estimators, but not all estimators are transformers. If that is true, then saying that you can combine transformers (more restrictive) followed by an estimator (less restrictive) is right. It includes (maybe a bit obscurely) the case where all steps are transformers, because a transformer as the last step is an estimator. |
@betatim wrote
I also understood it this way, but only after researching what an estimator is. Before, I thought estimator=predictor. Also, this docstring seemed to stress that, the final estimator is somehow more restricted than the other steps of the pipeline. This is why I was substituting the word |
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
sklearn/pipeline.py
Outdated
List of (name of step, transformer implementing `transform`) tuples | ||
that are to be chained in sequential order. To be compatible with the | ||
scikit-learn API, all steps must define `fit`. All non-last steps must | ||
also define `transform`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can also be rephrased. Maybe like:
List of (name of step, transformer implementing `transform`) tuples | |
that are to be chained in sequential order. To be compatible with the | |
scikit-learn API, all steps must define `fit`. All non-last steps must | |
also define `transform`. | |
List of (name, estimator) tuples that are to be chained in sequential | |
order. All steps except the last must implement `fit` and `transform`, | |
and the last step doesn't have to implement `transform`. See | |
:ref:`Combining Estimators <combining_estimators>` for more details. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will suggest is mixture of both ;)
Edit: Because (name, estimator)
is unclear. It doesn't show, that users can pick name
themselves.
sklearn/pipeline.py
Outdated
Fit all the transformers sequentially to the input data and transform | ||
it. Only valid if the final estimator implements `fit_transform`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, and note that the last step doesn't have to implement fit_transform
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is why you can only call fit_transform
on the pipeline, if the last step implements fit_transform
.
The phrase "Only valid if ..." is used on all the other 8 methods that are exposed after the pipeline is fitted. I copied it here, because to me it makes sense.
Here fit
and transform
of all the estimators (including the last one) are both used on the data.
"transformers sequentially to the input data" could be misunderstood though. I will suggest an updated sentence. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's not true, you can call fit_transform
, and it will call .fit.transform
if fit_transform
doesn't exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, now I see. Should we put something like:
Only valid if the final estimator either implements fit_transform
or fit
and transform
.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise LGTM.
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM as well once the CI is fixed.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
…line (scikit-learn#26952) Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
…line (scikit-learn#26952) Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
This PR aims to provide a more precise distinction of
estimator
,transformer
andpredictor
in the documentation for Pipeline.The current docstring defines Pipeline as
Pipeline of transforms with a final estimator
, statingThe last transform must be an estimator
, suggesting thatestimator
andtransformer
are distinct concepts or that the last step of a pipeline is somehow more restricted than the others.However, this wording is misleading and doesn't adequately convey the fact that the last step in the Pipeline can be any estimator, not necessarily a transformer.