
[MRG] DOC New Getting Started guide #14920


Merged
merged 12 commits into scikit-learn:master on Sep 18, 2019

Conversation

@NicolasHug (Member) commented Sep 6, 2019:

I have replaced the current Quick Start guide with a Getting Started guide that illustrates some of the main features of scikit-learn. The "old" quick start is still available as the first tutorial. (I'm not sure what to think of the tutorials TBH, but that's another discussion.)

@thomasjpfan and I discussed that it could be relevant to add this guide to the header of the main page of the website, in the new design.

This is obviously just a proposal; please provide any feedback you may have!

@adrinjalali (Member) left a comment:

Mostly small nitpicks; otherwise it looks really good to me.


Scikit-learn provides dozens of built-in machine learning algorithms and
models, called :term:`estimators`. Each estimator can be fitted to some data
using its :term:`fit` method.
Member:

Maybe it's not a bad idea to mention that it's the equivalent of the train method in some other libraries.

@NicolasHug (Member, Author):

Isn't fit the most common term though? I mean, scikit-learn pretty much set the standard ^^
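
For readers skimming this thread, the fit API the excerpt describes boils down to something like this (a minimal sketch; the estimator and toy data here are placeholders, not necessarily what the guide uses):

>>> from sklearn.ensemble import RandomForestClassifier
>>> clf = RandomForestClassifier(random_state=0)
>>> X = [[1, 2, 3], [11, 12, 13]]  # 2 samples, 3 features
>>> y = [0, 1]  # one target class per sample
>>> clf.fit(X, y)  # the "train" step in some other libraries
RandomForestClassifier(random_state=0)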

>>> result['test_score']
array([1., 1., 1.])

Automatic parameter searches
Member:

Could even call this "Hyperparameter tuning"

Member:

lol, I like your subtle thumbs down; what would you call it instead, @jnothman?

Member:

I gave a thumbs up to the "aka hyperparameters" suggestion below.

Member:

Various terminology used in scikit-learn is not ideal in hindsight (as language changes), but for consistency it's best to call these "parameters". I'd also consider calling it "Model selection".

@NicolasHug (Member, Author):

"Automatic parameter search and tuning" ?

search over the parameter space of a random forest with a
:class:`~sklearn.model_selection.RandomizedSearchCV` object. When the search
is over, the :class:`~sklearn.model_selection.RandomizedSearchCV` behaves as
a :class:`~sklearn.ensemble.RandomForestRegressor` that has been fitted with
Member:

We should use a pipeline instead of a single model, to illustrate how to treat the whole pipeline as an estimator, as well as how to pass pipeline parameters to a grid search.

@NicolasHug (Member, Author):

Ugh, I agree, but I'm a bit concerned that the estimator__param logic is a bit too much/scary for such an introduction.

Would you be OK with adding a note that basically says you always want to use a pipeline in practice, not least because it prevents data leaks?

Member:

I don't think it's scary. We may have to add a paragraph or two to explain how it works, but ideally, by the end of this tutorial, the user has a code snippet they can copy/paste and use. I know we want to discourage people from copy-pasting random code, but that's what they do. Also, if it's too scary for people to pass the pipeline parameters to a grid search, then we need to change the API; since that's not gonna happen (any time soon at least), I think we really should have it in the first tutorial.

We also have a specific example for the combination of grid search and pipeline; we can have the code snippet here, with a bit of explanation, and then link to that example.

@NicolasHug (Member, Author) commented Sep 10, 2019:

I'll try to come up with something, but I really don't think we should consider the getting-started guide as a tutorial.

The way I see it, the purpose of this guide is to showcase the main features of scikit-learn, ideally with as little cognitive overload as possible. IMO, grid searching a pipeline adds significant cognitive overload when the point is simply to showcase grid search.

@NicolasHug (Member, Author):

I made this:

>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.model_selection import RandomizedSearchCV
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.model_selection import train_test_split
>>> from scipy.stats import randint
...
>>> X, y = fetch_california_housing(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
...
>>> # create a pipeline
>>> pipe = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=0))
...
>>> # define the parameter space that will be searched over. Here we only
>>> # search over the parameters of the random forest.
>>> param_distributions = {'randomforestregressor__n_estimators': randint(1, 5),
...                        'randomforestregressor__max_depth': randint(5, 10)}
...
>>> # now create a searchCV object and fit it to the data
>>> search = RandomizedSearchCV(estimator=pipe,
...                             n_iter=5,
...                             param_distributions=param_distributions,
...                             random_state=0)
>>> search.fit(X_train, y_train)
RandomizedSearchCV(estimator=Pipeline(steps=[('standardscaler',
                                              StandardScaler()),
                                             ('randomforestregressor',
                                              RandomForestRegressor(random_state=0))]),
                   n_iter=5,
                   param_distributions={'randomforestregressor__max_depth': ...,
                                        'randomforestregressor__n_estimators': ...},
                   random_state=0)
>>> search.best_params_
{'randomforestregressor__max_depth': 9, 'randomforestregressor__n_estimators': 4}

>>> # the search object now acts like a normal pipeline / estimator
>>> # with max_depth=9 and n_estimators=4
>>> search.score(X_test, y_test)
0.73...

I don't really like it because I think it's way too much code for illustrating grid search (the current example is already too big IMO), and it's also a bad example because you don't care about scaling when using an RF.

I've tried to think of real-case scenarios, e.g. using an imputer, but that requires much more code, and it is irrelevant w.r.t. the point of the example, which is to illustrate grid search (in its simplest form).

I'm still open to the idea, but unless we can have a simple example, I would be -1. We need simple examples in this guide, else users will be discouraged right from the start.

@NicolasHug (Member, Author):

Related: I added a note in a0359a2

@jnothman (Member) left a comment:

I'm very happy to see Pipeline presented up front. Thank you!!

- The samples matrix (or design matrix) :term:`X`. The size of ``X``
is ``(n_samples, n_features)``, which means that samples are represented
as rows and features are represented as columns.
- The target values :term:`y` which are real number for regression tasks, or
Member:

"real numbers" or "real valued"

@jnothman (Member) commented Sep 9, 2019 via email

@adrinjalali (Member) left a comment:

I'm happy with what we have now. As for the example you had in your comment, I'd use an SVM, which needs the scaler, instead of the RF. But I'm already quite happy with this, thanks @NicolasHug
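
A sketch of that suggestion, swapping the random forest for an SVR so the scaler actually matters (the search ranges below are made up for illustration):

>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.svm import SVR
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.model_selection import RandomizedSearchCV, train_test_split
>>> from scipy.stats import loguniform
...
>>> X, y = fetch_california_housing(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
...
>>> # scaling is genuinely needed here: SVR is sensitive to feature scales
>>> pipe = make_pipeline(StandardScaler(), SVR())
>>> param_distributions = {'svr__C': loguniform(1e-1, 1e2),
...                        'svr__gamma': loguniform(1e-3, 1e0)}
>>> search = RandomizedSearchCV(pipe, param_distributions,
...                             n_iter=5, random_state=0)
>>> search.fit(X_train, y_train)  # slow on the full dataset; subsample if needed
RandomizedSearchCV(...)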

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import accuracy_score
...
Member:

It's interesting that the doctest tests pass here. I remember they really needed a space after ...
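
To close the loop on that excerpt, a sketch of the kind of flow those imports support (the classifier here is a stand-in; the guide's actual snippet may differ):

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import accuracy_score
>>> from sklearn.linear_model import LogisticRegression
...
>>> X, y = load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
>>> accuracy_score(y_test, clf.predict(X_test))  # fraction of correct predictions
0.9...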

@jnothman merged commit 57cb9e8 into scikit-learn:master on Sep 18, 2019.
@jnothman (Member):
Thanks @NicolasHug!
