[MRG] DOC New Getting Started guide #14920
Mostly small nitpicks, otherwise it looks really good to me.
Scikit-learn provides dozens of built-in machine learning algorithms and
models, called :term:`estimators`. Each estimator can be fitted to some data
using its :term:`fit` method.
Maybe it's not a bad idea to mention that it's the equivalent of the train
method in some other libraries.
Isn't `fit` the most common term though? I mean, scikit-learn pretty much set the standard ^^
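For readers coming from libraries that call this step "train", here is a minimal sketch of the `fit` call being discussed. The toy data and the choice of `LinearRegression` are assumptions made for illustration only; they are not taken from the guide under review.

```python
from sklearn.linear_model import LinearRegression

# Toy data (assumed for illustration): y equals x exactly.
X = [[0.0], [1.0], [2.0]]
y = [0.0, 1.0, 2.0]

model = LinearRegression()

# fit() plays the role that train() plays in some other libraries.
# It learns from (X, y) and returns the estimator itself.
returned = model.fit(X, y)

# Once fitted, the estimator can predict on new samples.
prediction = model.predict([[3.0]])
print(returned is model, prediction)
```

Because `fit` returns the estimator, calls such as `LinearRegression().fit(X, y).predict(...)` can be chained.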
>>> result['test_score']
array([1., 1., 1.])

Automatic parameter searches
Could even call this Hyper parameter tuning
lol, I like your subtle thumbs down. What would you call it instead, @jnothman?
I gave a thumbs up to the "aka hyperparameters" suggestion below.
There is various terminology used in scikit-learn that is not ideal in hindsight (as language changes), but for consistency it's best to call these "parameters". I'd also consider calling it "Model selection"
"Automatic parameter search and tuning" ?
search over the parameter state of a random forest with a
:class:`~sklearn.model_selection.RandomizedSearchCV` object. When the search
is over, the :class:`~sklearn.model_selection.RandomizedSearchCV` behaves as
a :class:`~sklearn.ensemble.RandomForestRegressor` that has been fitted with
we should use a pipeline instead of a single model, to illustrate how to treat the whole pipeline as an estimator, as well as how to pass grid search parameters of a pipeline.
Ugh, I agree, but I'm a bit concerned that the `estimator__param` logic is a bit too much/scary for such an introduction.
Would you be OK with adding a note that basically says that in practice you always want to use a pipeline, if only because it prevents data leaks?
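To make the `estimator__param` naming concrete, here is a toy sketch of how a double-underscore key can be routed to a nested step. `ToyStep` and `ToyPipeline` are hypothetical classes written for this illustration; this is not scikit-learn's implementation, only the idea behind the naming convention.

```python
class ToyStep:
    """Hypothetical stand-in for an estimator that stores its parameters."""

    def __init__(self, **params):
        self.params = dict(params)

    def set_params(self, **params):
        self.params.update(params)
        return self


class ToyPipeline:
    """Hypothetical stand-in for a pipeline holding named steps."""

    def __init__(self, **steps):
        self.steps = dict(steps)

    def set_params(self, **params):
        for key, value in params.items():
            # Split "<step>__<param>" at the first double underscore.
            step_name, sep, param_name = key.partition("__")
            if not sep:
                raise ValueError(f"expected '<step>__<param>', got {key!r}")
            # Forward the parameter to the step that owns it.
            self.steps[step_name].set_params(**{param_name: value})
        return self


pipe = ToyPipeline(scaler=ToyStep(with_mean=True), forest=ToyStep(max_depth=None))
pipe.set_params(forest__max_depth=9)
print(pipe.steps["forest"].params)
```

A parameter search only needs this one entry point: each candidate is a dictionary of `step__param` keys that the outer object forwards to the right step.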
I don't think it's scary. We may have to add a paragraph or two to explain how it works, but ideally, at the end of this tutorial, the user has a code snippet they can copy/paste and use. I know we want to discourage people from copy-pasting random code, but that's what they do. Also, if it's too scary for people to pass the pipeline parameters to a grid search, then we need to change the API, and since that's not gonna happen (any time soon at least), I think we really should have it in the first tutorial.
We also have a specific example for the combination of grid search and pipeline, we can have the code snippet here, with a bit of explanation, and then link to that example.
I'll try to come up with something, but I really don't think we should consider the getting-started guide as a tutorial.
The way I see it, the purpose of this guide is to showcase the main features of scikit-learn, ideally with as little cognitive overhead as possible. IMO grid searching a pipeline adds significant cognitive overhead when the point is simply to showcase grid search.
I made this:
>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.model_selection import RandomizedSearchCV
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.model_selection import train_test_split
>>> from scipy.stats import randint
...
>>> X, y = fetch_california_housing(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
...
>>> # create a pipeline
>>> pipe = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=0))
...
>>> # define the parameter space that will be searched over. We here only
>>> # search over the parameters of the random forest.
>>> param_distributions = {'randomforestregressor__n_estimators': randint(1, 5),
...                        'randomforestregressor__max_depth': randint(5, 10)}
...
>>> # now create a searchCV object and fit it to the data
>>> search = RandomizedSearchCV(estimator=pipe,
...                             n_iter=5,
...                             param_distributions=param_distributions,
...                             random_state=0)
>>> search.fit(X_train, y_train)
RandomizedSearchCV(estimator=Pipeline(steps=[('standardscaler',
                                              StandardScaler()),
                                             ('randomforestregressor',
                                              RandomForestRegressor(random_state=0))]),
                   n_iter=5,
                   param_distributions={'randomforestregressor__max_depth': ...,
                                        'randomforestregressor__n_estimators': ...},
                   random_state=0)
>>> search.best_params_
{'randomforestregressor__max_depth': 9, 'randomforestregressor__n_estimators': 4}
>>> # the search object now acts like a normal pipeline / estimator
>>> # with max_depth=9 and n_estimators=4
>>> search.score(X_test, y_test)
0.73...
I don't really like it because I think it's way too much code to illustrate grid searching (the current example is already too big IMO), and it's also a bad example because you don't care about scaling when using an RF.
I've tried to think of real-case scenarios, e.g. using an imputer, but that requires much more code, and it is irrelevant w.r.t. the point of the example, which is to illustrate grid search (in its simplest form).
I'm still open to the idea, but unless we can have a simple example, I would be -1. We need simple examples in this guide, otherwise users will be discouraged right from the start.
Related: I added a note in a0359a2
I'm very happy to see Pipeline presented up front. Thank you!!
doc/getting_started.rst
Outdated
- The samples matrix (or design matrix) :term:`X`. The size of ``X``
  is ``(n_samples, n_features)``, which means that samples are represented
  as rows and features are represented as columns.
- The target values :term:`y` which are real number for regression tasks, or
"real numbers" or "real valued"
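As a small illustration of the `(n_samples, n_features)` convention quoted in the hunk above, here is a sketch with made-up numbers. The values are assumptions chosen for the example, and only the regression case (real-valued `y`) is shown.

```python
import numpy as np

# Two samples (rows), three features (columns). Values are made up.
X = np.array([[1.0, 2.0, 3.0],
              [11.0, 12.0, 13.0]])

# One real-valued target per sample, as in a regression task.
y = np.array([0.5, 1.7])

n_samples, n_features = X.shape
print(n_samples, n_features, y.shape)
```

The first axis of `X` and the length of `y` must agree, since each row of `X` is paired with one target value.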
fit was the standard in python before scikit-learn as far as I can tell...
But everyone also calls it "train" in ml courses...
I'm happy with what we have now. As for the example you had in your comment, I'd use an SVM instead, which needs the scaler, instead of the RF. But I'm already quite happy with this, thanks @NicolasHug
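A sketch of what the SVM variant suggested here could look like, assuming the bundled iris dataset (so nothing is downloaded) and arbitrary search ranges for `C` and `gamma`. The dataset, the ranges, and `n_iter=5` are assumptions for illustration, not what was merged.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unlike a random forest, an SVM is sensitive to feature scaling,
# so the scaler is a meaningful pipeline step here.
pipe = make_pipeline(StandardScaler(), SVC())

# make_pipeline names each step after its lowercased class name,
# hence the "svc__" prefix. The ranges below are arbitrary.
param_distributions = {
    "svc__C": loguniform(1e-2, 1e2),
    "svc__gamma": loguniform(1e-3, 1e1),
}

search = RandomizedSearchCV(pipe, param_distributions, n_iter=5, random_state=0)
search.fit(X_train, y_train)

# The fitted search object behaves like the best pipeline found.
best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)
print(best_params, test_accuracy)
```

Because the scaler sits inside the pipeline, it is refitted on the training part of every cross-validation split, which is the data-leak protection mentioned earlier in the thread.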
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import accuracy_score
...
it's interesting that the doctest tests pass here. I remember they really needed a space after ...
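On the bare `...` question, here is a small check against the standard-library `doctest` parser. It is a sketch of the parser's behaviour only (the project's test runner may layer its own options on top): a `...` line with nothing after it is accepted, while a continuation line that has content but no space after the prompt is rejected.

```python
import doctest

# A bare "..." line after a complete statement, as in the hunk above.
BARE_PROMPT = """
>>> x = 1
...
>>> x + 1
2
"""

# A continuation line with content but no space after the "..." prompt.
MISSING_SPACE = """
>>> x = (1 +
...2)
"""

parser = doctest.DocTestParser()

# The bare prompt parses into two examples, and both run cleanly.
test = parser.get_doctest(BARE_PROMPT, {}, "bare_prompt", None, 0)
results = doctest.DocTestRunner(verbose=False).run(test)
print(results)

# The missing space is rejected at parse time with a ValueError.
try:
    parser.get_examples(MISSING_SPACE)
    missing_space_error = None
except ValueError as exc:
    missing_space_error = str(exc)
print(missing_space_error)
```

So the space is only required when something follows the `...` prompt, which would explain why the empty continuation lines pass.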
Thanks @NicolasHug!
I have replaced the current Quick Start guide with a Getting Started guide that illustrates some of the main features of scikit-learn. The "old" quick start is still available as the first tutorial. (I'm not sure what to think of the tutorials TBH, but that's another discussion.)
@thomasjpfan and I discussed that it could be relevant to add this guide to the header of the main page of the website, in the new design.
This is obviously just a proposal, please provide any feedback you may have!