[MRG] DOC New Getting Started guide #14920

NicolasHug · 2019-09-06T21:09:02Z

I have replaced the current Quick Start guide with a Getting Started guide that illustrates some of the main features of scikit-learn. The "old" quick start is still available as the first tutorial. (I'm not sure what to think of the tutorials TBH, but that's another discussion.)

@thomasjpfan and I discussed that it could be relevant to add this guide to the header of the website's main page in the new design.

This is obviously just a proposal, please provide any feedback you may have!

adrinjalali

Mostly small nitpicks, otherwise it looks really good to me.

adrinjalali · 2019-09-09T11:20:35Z

doc/getting_started.rst

+
+Scikit-learn provides dozens of built-in machine learning algorithms and
+models, called :term:`estimators`. Each estimator can be fitted to some data
+using its :term:`fit` method.


Maybe it's not a bad idea to mention that it's the equivalent of the train method in some other libraries.

isn't fit the most common term though? I mean, scikit-learn pretty much set the standard ^^
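To make the fit/train equivalence concrete, here is a minimal sketch. The toy data and the choice of `LogisticRegression` are assumptions made for illustration and are not taken from the guide under review.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical toy dataset: 4 samples, 2 features, 2 classes.
X = [[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [3.0, 4.0]]
y = [0, 0, 1, 1]

clf = LogisticRegression()

# `fit` is what other libraries or ML courses would call "train".
# It returns the estimator itself, so calls can be chained.
fitted = clf.fit(X, y)

# Once fitted, the estimator can predict targets for new samples.
predictions = clf.predict([[0.0, 0.5], [3.0, 3.5]])
```

Any other estimator could be swapped in here, since they all expose the same `fit` method.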

adrinjalali · 2019-09-09T12:14:14Z

doc/getting_started.rst

+  >>> result['test_score']
+  array([1., 1., 1.])
+
+Automatic parameter searches


Could even call this "Hyperparameter tuning".

lol, I like your subtle thumbs down. What would you call it instead, @jnothman?

I gave a thumbs up to the "aka hyperparameters" suggestion below.

Various terms used in scikit-learn are not ideal in hindsight (as language changes), but for consistency it's best to call these "parameters". I'd also consider calling it "Model selection".

"Automatic parameter search and tuning" ?
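As a small illustration of why "parameters" is the consistent term, the sketch below shows that the estimator API itself exposes these values through `get_params` and `set_params`. The specific values used are arbitrary assumptions.

```python
from sklearn.ensemble import RandomForestRegressor

# Constructor arguments are what scikit-learn calls "parameters"
# (often called "hyperparameters" elsewhere).
est = RandomForestRegressor(n_estimators=10, max_depth=3)

# get_params returns them as a plain dict.
params = est.get_params()

# set_params changes them in place; this is what the search
# objects rely on when trying out candidate values.
est.set_params(max_depth=5)
```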

adrinjalali · 2019-09-09T12:19:25Z

doc/getting_started.rst

+search over the parameter state of a random forest with a
+:class:`~sklearn.model_selection.RandomizedSearchCV` object. When the search
+is over, the :class:`~sklearn.model_selection.RandomizedSearchCV` behaves as
+a :class:`~sklearn.ensemble.RandomForestRegressor` that has been fitted with


We should use a pipeline instead of a single model, to illustrate how to treat the whole pipeline as an estimator, as well as how to pass a pipeline's parameters to a grid search.

Ugh, I agree but I'm a bit concerned that the estimator__param logic is a bit too much/scary for such an introduction.

Would you be OK with adding a note that basically says that in practice you always want to use a pipeline, if only because it prevents data leaks?
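The data-leak argument can be sketched as follows. When the whole pipeline is cross-validated, the scaler is refitted on the training part of each split only, so statistics from the held-out part never influence preprocessing. The dataset and the estimator are assumptions chosen to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The scaler and the classifier are wrapped in a single estimator.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))

# For each of the 5 splits, the pipeline (scaler included) is fitted on
# the training folds only and then scored on the held-out fold.
scores = cross_val_score(pipe, X, y, cv=5)
```

Scaling `X` once up front and then cross-validating only the classifier would instead let the held-out folds contribute to the scaling statistics.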

I don't think it's scary. We may have to add a paragraph or two to explain how it works, but ideally, at the end of this tutorial, the user has a code snippet they can copy/paste and use. I know we want to discourage people from copy-pasting random code, but that's what they do. Also, if it's too scary for people to pass the pipeline parameters to a grid search, then we need to change the API. Since that's not gonna happen (any time soon at least), I think we really should have it in the first tutorial.

We also have a specific example for the combination of grid search and pipeline, we can have the code snippet here, with a bit of explanation, and then link to that example.
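For readers unfamiliar with the `estimator__param` convention discussed here, a minimal sketch follows. It assumes a pipeline built with `make_pipeline`, which names each step after its lowercased class name.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=0))

# make_pipeline derives step names from the lowercased class names.
step_names = [name for name, _ in pipe.steps]

# A parameter of a step is addressed as <step name>__<parameter name>.
pipe.set_params(randomforestregressor__max_depth=3)
max_depth = pipe.named_steps["randomforestregressor"].max_depth

# The same keys are the ones a search object expects in its parameter grid.
searchable_keys = sorted(k for k in pipe.get_params() if "__" in k)
```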

I'll try to come up with something, but I really don't think we should consider the getting-started guide as a tutorial.

The way I see it, the purpose of this guide is to showcase the main features of scikit-learn, ideally with as little cognitive overhead as possible. IMO grid searching a pipeline adds significant cognitive overhead when the point is simply to showcase grid search.

I made this

>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.model_selection import RandomizedSearchCV
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.model_selection import train_test_split
>>> from scipy.stats import randint
...
>>> X, y = fetch_california_housing(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
...
>>> # create a pipeline
>>> pipe = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=0))
...
>>> # define the parameter space that will be searched over. We here only
>>> # search over the parameters of the random forest.
>>> param_distributions = {'randomforestregressor__n_estimators': randint(1, 5),
...                        'randomforestregressor__max_depth': randint(5, 10)}
...
>>> # now create a searchCV object and fit it to the data
>>> search = RandomizedSearchCV(estimator=pipe,
...                             n_iter=5,
...                             param_distributions=param_distributions,
...                             random_state=0)
>>> search.fit(X_train, y_train)
RandomizedSearchCV(estimator=Pipeline(steps=[('standardscaler', StandardScaler()),
                                             ('randomforestregressor',
                                              RandomForestRegressor(random_state=0))]),
                   n_iter=5,
                   param_distributions={'randomforestregressor__max_depth': ...,
                                        'randomforestregressor__n_estimators': ...},
                   random_state=0)
>>> search.best_params_
{'randomforestregressor__max_depth': 9, 'randomforestregressor__n_estimators': 4}
>>> # the search object now acts like a normal pipeline / estimator
>>> # with max_depth=9 and n_estimators=4
>>> search.score(X_test, y_test)
0.73...

I don't really like it because I think it's way too much code for illustrating the grid searching (the current example is already too big IMO), and it's also a bad example because you don't care about scaling when using a RF.

I've tried to think of real-case scenarios, e.g. using an imputer, but that requires much more code, which is irrelevant w.r.t. the point of the example: illustrating grid search (in its simplest form).

I'm still open to the idea, but unless we can have a simple example, I would be -1. We need simple examples in this guide, otherwise users will be discouraged right from the start.

Related: I added a note in a0359a2.

jnothman

I'm very happy to see Pipeline presented up front. Thank you!!

jnothman · 2019-09-09T12:39:21Z

doc/getting_started.rst

+- The samples matrix (or design matrix) :term:`X`. The size of ``X``
+  is ``(n_samples, n_features)``, which means that samples are represented
+  as rows and features are represented as columns.
+- The target values :term:`y` which are real number for regression tasks, or


"real numbers" or "real valued"
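The `(n_samples, n_features)` convention described in this hunk can be checked on any bundled dataset. The sketch below uses `load_iris` purely as an assumed example.

```python
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Rows of X are samples, columns are features.
n_samples, n_features = X.shape  # (150, 4) for iris

# y holds one target value per sample (class labels here, since iris
# is a classification task; a regression task would use real numbers).
n_targets = y.shape[0]  # 150
```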

jnothman · 2019-09-09T22:18:14Z

fit was the standard in Python before scikit-learn as far as I can tell... But everyone also calls it "train" in ML courses...

adrinjalali

I'm happy with what we have now. As for the example in your comment, I'd use an SVM, which needs the scaler, instead of the RF. But I'm already quite happy with this, thanks @NicolasHug.
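A rough sketch of what the SVM variant suggested here could look like. This is not code from the PR: the iris dataset, the `SVC` classifier, and the search ranges are all assumptions chosen so the example runs quickly.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unlike a random forest, an SVM is sensitive to feature scales,
# so the scaler step is meaningful here.
pipe = make_pipeline(StandardScaler(), SVC())

# Search only over the SVC step, using the <step>__<param> naming.
# The ranges below are arbitrary illustrative choices.
param_distributions = {
    "svc__C": loguniform(1e-2, 1e2),
    "svc__gamma": loguniform(1e-3, 1e1),
}

search = RandomizedSearchCV(
    pipe,
    param_distributions=param_distributions,
    n_iter=5,
    random_state=0,
)
search.fit(X_train, y_train)

best = search.best_params_
# The fitted search object behaves like the best pipeline found.
test_score = search.score(X_test, y_test)
```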

adrinjalali · 2019-09-10T18:19:20Z

doc/getting_started.rst

+  >>> from sklearn.datasets import load_iris
+  >>> from sklearn.model_selection import train_test_split
+  >>> from sklearn.metrics import accuracy_score
+  ...


It's interesting that the doctests pass here. I remember they really needed a space after "...".
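One possible explanation, sketched below with the standard-library `doctest` parser: a bare "..." line is accepted as an empty continuation line, while a "..." prompt followed directly by code (no space) is rejected. This reflects the behaviour of Python's `doctest` module as far as I understand it. The project's actual test runner may apply additional rules, so treat this as an assumption rather than a diagnosis.

```python
import doctest

parser = doctest.DocTestParser()

# A bare "..." line (no trailing space) after a complete statement,
# as in the snippet under review.
bare_ellipsis = """
>>> from math import sqrt
...
>>> sqrt(4.0)
2.0
"""
examples = parser.get_examples(bare_ellipsis)

# A continuation line with code glued directly to the "..." prompt.
missing_space = """
>>> total = (1 +
...2)
"""
try:
    parser.get_examples(missing_space)
    rejected = False
except ValueError:
    # doctest complains that the line lacks a blank after the prompt.
    rejected = True
```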

jnothman · 2019-09-18T22:04:07Z

Thanks @NicolasHug!

Added getting started guide

a18a705

NicolasHug mentioned this pull request Sep 8, 2019

[MRG+1] Updates entire website design #14849

Merged

adrinjalali reviewed Sep 9, 2019

jnothman reviewed Sep 9, 2019

NicolasHug added 2 commits September 9, 2019 12:42

addressed comments

dc4360e

Added comments to code snippets

e0d2f46

NicolasHug added 5 commits September 10, 2019 08:48

Merge branch 'master' of github.com:scikit-learn/scikit-learn into ge…

b0082bb

…tting_start_guide

nits

932322b

Added note about always searching over a pipeline

a0359a2

typo

44cf8a7

minimal adjustments

1fc48ef

adrinjalali approved these changes Sep 10, 2019

NicolasHug added 4 commits September 10, 2019 14:35

Merge branch 'master' of github.com:scikit-learn/scikit-learn into ge…

15bedb4

…tting_start_guide

Addressed comments

8f42aef

typ

d947a84

Merge branch 'master' of github.com:scikit-learn/scikit-learn into ge…

feed31c

…tting_start_guide

jnothman approved these changes Sep 18, 2019

jnothman merged commit 57cb9e8 into scikit-learn:master Sep 18, 2019

adrinjalali mentioned this pull request Sep 23, 2019

DOC link to the new getting started #15060

Merged

NicolasHug mentioned this pull request Aug 25, 2020

RFC Plan for reworking the tutorials #18257

Open
