Better documentation for random_state #15222

NicolasHug · 2019-10-12T17:13:27Z

Sort of like #14228, but for random_state.

For any public object that accepts a random_state parameter, we should document what parts of the algorithm are randomized. It's not always obvious what is and what isn't randomized. We should also always link to the glossary, where the different possible values of random_state are clearly explained.

For example, for the random forest estimators, it would be helpful to indicate that random_state determines in particular the subsampling of the samples and the subsampling of the features. Something like:


random_state : int, np.random.RandomState instance or None, default=None
    Controls the randomness of the estimator, in particular the subsampling
    of the samples and the subsampling of the features. See
    :term:`random_state` for details.
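
To make the proposed docstring style concrete, here is a minimal sketch of a toy class documented this way. `ToySubsampler` and its attributes are hypothetical and not part of scikit-learn, and Python's `random.Random` stands in for NumPy's `RandomState` so the example needs only the standard library. The point is only that the docstring names the two randomized steps that the code actually performs.

```python
import random


class ToySubsampler:
    """Hypothetical toy estimator (illustration only, not scikit-learn).

    Parameters
    ----------
    n_samples : int
        Number of row indices to draw, with replacement.
    n_features : int
        Number of column indices to draw, without replacement.
    random_state : int or None, default=None
        Controls the randomness of the estimator, in particular the
        subsampling of the samples and the subsampling of the features.
        Pass an int for reproducible results across multiple calls.
    """

    def __init__(self, n_samples, n_features, random_state=None):
        self.n_samples = n_samples
        self.n_features = n_features
        self.random_state = random_state

    def fit(self, X):
        # A fresh generator per fit, so an int seed gives repeatable fits.
        rng = random.Random(self.random_state)
        n_rows, n_cols = len(X), len(X[0])
        # Randomized step 1: bootstrap-style draw of row indices.
        self.sample_indices_ = [
            rng.randrange(n_rows) for _ in range(self.n_samples)
        ]
        # Randomized step 2: random subset of column indices.
        self.feature_indices_ = sorted(
            rng.sample(range(n_cols), self.n_features)
        )
        return self


X = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
first = ToySubsampler(5, 2, random_state=0).fit(X)
second = ToySubsampler(5, 2, random_state=0).fit(X)
# The same int seed reproduces both randomized steps.
print(first.sample_indices_ == second.sample_indices_)
print(first.feature_indices_ == second.feature_indices_)
```

With `random_state=None` the generator is seeded from the system, so repeated fits would generally draw different indices.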


mschaffenroth · 2019-10-12T18:49:26Z

The script from #14228 (see here), adapted for the random_state parameter, gave the following results:

sklearn/multioutput.py - 561, 721
sklearn/multiclass.py - 669
sklearn/random_projection.py - 170, 229, 447, 567
sklearn/dummy.py - 52
sklearn/kernel_approximation.py - 41, 143, 468
sklearn/inspection/permutation_importance.py - 86
sklearn/impute/_iterative.py - 124
sklearn/covariance/robust_covariance.py - 63, 233, 328, 545
sklearn/covariance/elliptic_envelope.py - 40
sklearn/utils/testing.py - 566
sklearn/utils/__init__.py - 484, 629
sklearn/utils/random.py - 39
sklearn/utils/extmath.py - 185, 297
sklearn/tree/tree.py - 652, 1033, 1320, 1549
sklearn/neighbors/nca.py - 111
sklearn/neighbors/kde.py - 209
sklearn/feature_extraction/image.py - 324, 456
sklearn/mixture/gaussian_mixture.py - 504
sklearn/mixture/bayesian_mixture.py - 166
sklearn/mixture/base.py - 139
sklearn/metrics/cluster/unsupervised.py - 80
sklearn/manifold/locally_linear.py - 146, 252, 584
sklearn/manifold/t_sne.py - 552
sklearn/manifold/spectral_embedding_.py - 171, 387
sklearn/manifold/mds.py - 51, 198, 314
sklearn/neural_network/_multilayer_perceptron.py - 782, 1174
sklearn/neural_network/_rbm.py - 59
sklearn/decomposition/fastica_.py - 205, 344
sklearn/decomposition/nmf.py - 282, 467, 958, 1158
sklearn/decomposition/online_lda.py - 60, 79, 225
sklearn/decomposition/kernel_pca.py - 78
sklearn/decomposition/factor_analysis.py - 90
sklearn/decomposition/truncated_svd.py - 59
sklearn/decomposition/dict_learning.py - 364, 485, 692, 1135, 1325
sklearn/decomposition/sparse_pca.py - 82, 285
sklearn/decomposition/pca.py - 192
sklearn/gaussian_process/gpc.py - 110, 527
sklearn/gaussian_process/gpr.py - 109, 366
sklearn/preprocessing/data.py - 2138, 2557
sklearn/feature_selection/mutual_info_.py - 226, 335, 414
sklearn/model_selection/_split.py - 373, 578, 1080, 1185, 1239, 1379, 1481, 1594, 2036
sklearn/model_selection/_validation.py - 1006, 1176
sklearn/model_selection/_search.py - 214, 1301
sklearn/linear_model/ransac.py - 152
sklearn/linear_model/perceptron.py - 55
sklearn/linear_model/theil_sen.py - 243
sklearn/linear_model/stochastic_gradient.py - 357, 798, 1403
sklearn/linear_model/logistic.py - 587, 753, 1087, 1262, 1812
sklearn/linear_model/passive_aggressive.py - 76, 322
sklearn/linear_model/coordinate_descent.py - 579, 859, 1312, 1486, 1664, 1850, 2013, 2187
sklearn/linear_model/ridge.py - 324, 692, 846
sklearn/linear_model/base.py - 65
sklearn/linear_model/sag.py - 154
sklearn/svm/classes.py - 90, 310, 544, 750
sklearn/svm/base.py - 852
sklearn/datasets/kddcup99.py - 79
sklearn/datasets/twenty_newsgroups.py - 187
sklearn/datasets/olivetti_faces.py - 64
sklearn/datasets/samples_generator.py - 127, 323, 429, 520, 604, 666, 738, 875, 936, 1001, 1077, 1130, 1189, 1229, 1278, 1339, 1391, 1454, 1542, 1633
sklearn/datasets/rcv1.py - 114
sklearn/datasets/base.py - 146
sklearn/datasets/covtype.py - 69
sklearn/cluster/k_means_.py - 56, 242, 382, 585, 701, 1148, 1366
sklearn/cluster/spectral.py - 41, 197, 313
sklearn/cluster/bicluster.py - 235, 379
sklearn/cluster/mean_shift_.py - 48
sklearn/cluster/tests/test_k_means.py - 294, 305
sklearn/ensemble/gradient_boosting.py - 1919, 2402
sklearn/ensemble/bagging.py - 500, 900
sklearn/ensemble/weight_boosting.py - 197, 330, 471, 890, 1012
sklearn/ensemble/iforest.py - 109
sklearn/ensemble/forest.py - 941, 1256, 1531, 1822, 2051
sklearn/ensemble/base.py - 56
sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py - 738, 918
sklearn/ensemble/_hist_gradient_boosting/binning.py - 37, 112
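
The script itself is not reproduced in this thread. As a rough sketch of how a list like the one above could be produced, the following scans Python files for numpydoc-style parameter lines that begin with `random_state :` and reports their line numbers. The function names and the regular expression are assumptions for illustration, not the script from #14228, and the match is only a heuristic.

```python
import re
from pathlib import Path

# Heuristic: a numpydoc parameter line such as "    random_state : int, ..."
# (numpydoc puts a space before the colon, which excludes most code lines).
PARAM_LINE = re.compile(r"^\s*random_state\s+:")


def find_param_lines(source):
    """Return 1-based numbers of lines that document random_state."""
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if PARAM_LINE.match(line)
    ]


def scan_tree(root):
    """Map each .py file under root to its matching line numbers."""
    results = {}
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        hits = find_param_lines(text)
        if hits:
            results[str(path)] = hits
    return results


def format_report(results):
    """Render results as 'path - n1, n2' lines, like the list above."""
    return "\n".join(
        f"{path} - {', '.join(str(n) for n in hits)}"
        for path, hits in results.items()
    )
```

Calling `print(format_report(scan_tree("sklearn")))` from a checkout of the repository would print one line per file. Each hit still has to be read by hand to decide whether the description already says what is randomized.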

matsmaiwald · 2019-10-13T06:13:29Z

Just had a brief look at this.

It seems that appropriate documentation is already in place, e.g. for these two:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/base.py
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/ridge.py

albertcthomas · 2019-10-14T10:48:48Z

+1 for this, thanks for the report @mschaffenroth!

It was already done for sklearn/svm/classes.py - 90, 544 and 750. See issue #9497 and PR #9703.

For the svm module, it thus only remains to do svm/_base.py 852 and svm/_classes.py 310, as detailed in scikit-learn#15222.

sara-es · 2019-10-19T13:42:55Z

Hi, I am new to contributing and would like to help out with this. I have done the two instances mentioned by @albertcthomas in the svm module above and can continue working through the above list.

jnothman · 2019-10-19T22:02:17Z

I think it would be helpful, @NicolasHug, to give some examples of what this should look like. Thanks for continuing this work.

glemaitre · 2020-01-08T13:52:39Z

closing in favour of #10548

NicolasHug added the Documentation, Moderate, and help wanted labels Oct 12, 2019

sara-es added a commit to sara-es/scikit-learn that referenced this issue Oct 19, 2019

Improved docstrings related to random_state parameter in svm/_base.py 852 and svm/_classes.py 310 as detailed in scikit-learn#15222. (2b4a379)

sara-es mentioned this issue Oct 19, 2019

[WIP] #15222 Improve docstrings relating to random_state #15300

Closed

NicolasHug mentioned this issue Oct 21, 2019

[MRG] DOC Better explain the source of randomness for tree based models #15264

Merged

MDouriez mentioned this issue Nov 2, 2019

[MRG] documentation for random_state in forest.py #15516

Merged

edwardcqian mentioned this issue Nov 9, 2019

documentation for random_state in model_selection/split #15575

Merged

happilyeverafter95 mentioned this issue Nov 9, 2019

DOC improve random_state docstring random modue #15576

Merged

This was referenced Nov 28, 2019

DOC improve random_state docsting in _ridge and small clean-up #15728

Merged

DOC random_state in _logistic #15729

Merged

cmarmo mentioned this issue Jan 8, 2020

Make random_state descriptions more informative and refer to Glossary #10548

Closed

74 tasks

glemaitre closed this as completed Jan 8, 2020

keyianpai mentioned this issue Jan 11, 2020

Better documentation for random_state in model selection module #16096

Closed
