ENH Replaced RandomState with Generator compatible calls #22271


Merged: 4 commits into scikit-learn:main, Jan 28, 2022

Conversation

@Micky774 (Contributor) commented Jan 22, 2022

Reference Issues/PRs

Issue #20669
Towards #16988

What does this implement/fix? Explain your changes.

Per the discussion in #20669, we will likely need an eventual change towards adopting NumPy's Generator interface instead of remaining on RandomState. To ease that transition, this PR replaces RandomState.random_sample calls with the corresponding RandomState.uniform calls, matching the Generator.uniform syntax. This preserves functionality while allowing a drop-in replacement of the underlying object without syntax errors.
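
As a minimal sketch (mine, not from the PR) of the drop-in idea: uniform(size=...) is spelled identically on both interfaces, whereas random_sample exists only on RandomState.

import numpy as np

legacy = np.random.RandomState(0)
modern = np.random.default_rng(0)  # a Generator instance

x = legacy.uniform(size=3)  # valid on RandomState
y = modern.uniform(size=3)  # valid on Generator too
# modern.random_sample(3) would raise AttributeError, since Generator
# has no random_sample method; uniform is the portable spelling.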

Any other comments?

Thank you to @thomasjpfan for the direction and guidance for this PR.

In working on this PR, I also looked at similarly changing other RandomState methods to the equivalent methods with overlapping names/signatures between the two interfaces, namely: randn -> standard_normal, and random_sample, rand -> uniform.
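
As a quick illustration (my own check, not part of the PR) that the first of these pairs shares a stream on RandomState, and that Generator only exposes the new names:

import numpy as np

# Seeded draws from randn and standard_normal are bitwise identical,
# since RandomState.randn delegates to standard_normal internally.
a = np.random.RandomState(42).randn(5)
b = np.random.RandomState(42).standard_normal(5)
assert (a == b).all()

# Generator keeps standard_normal and uniform but drops the legacy aliases.
gen = np.random.default_rng(42)
assert hasattr(gen, "standard_normal") and not hasattr(gen, "randn")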

@Micky774 Micky774 changed the title Replaced RandomState.random_sample calls to RandomState.uniform [WIP] Replaced RandomState.random_sample calls to RandomState.uniform Jan 22, 2022
@Micky774 Micky774 changed the title [WIP] Replaced RandomState.random_sample calls to RandomState.uniform Replaced RandomState.random_sample calls to RandomState.uniform Jan 24, 2022
@thomasjpfan (Member) commented Jan 25, 2022

There look to be a few more uses of random_sample. You can run the following (some false positives):

grep -r "random_sample" sklearn --exclude="*tests*" --exclude="*.pyc" \
    --exclude="*.pxi" --exclude="*.c"  --exclude="*.html" --exclude="*.so" -in
Results
sklearn/ensemble/_gb.py:36:from ._gradient_boosting import _random_sample_mask
sklearn/ensemble/_gb.py:660:                sample_mask = _random_sample_mask(n_samples, n_inbag, random_state)
sklearn/ensemble/_gradient_boosting.pyx:237:def _random_sample_mask(np.npy_intp n_total_samples,
sklearn/cluster/_kmeans.py:212:        rand_vals = random_state.random_sample(n_local_trials) * current_pot
sklearn/multiclass.py:1058:        self.code_book_ = random_state.random_sample((n_classes, code_size_))
sklearn/neural_network/_rbm.py:199:        return rng.random_sample(size=p.shape) < p
sklearn/neural_network/_rbm.py:220:        return rng.random_sample(size=p.shape) < p
sklearn/neighbors/_kde.py:108:    >>> X = rng.random_sample((100, 3))

For randn (some false positives):

grep -r "randn" sklearn --exclude="*tests*" --exclude="*.pyc" \
    --exclude="*.pxi" --exclude="*.c"  --exclude="*.html" --exclude="*.so" -in
Results
sklearn/cluster/_affinity_propagation.py:174:    ) * random_state.randn(n_samples, n_samples)
sklearn/datasets/_samples_generator.py:233:    X[:, :n_informative] = generator.randn(n_samples, n_informative)
sklearn/datasets/_samples_generator.py:262:        X[:, -n_useless:] = generator.randn(n_samples, n_useless)
sklearn/datasets/_samples_generator.py:598:        X = generator.randn(n_samples, n_features)
sklearn/datasets/_samples_generator.py:1025:        + noise * generator.randn(n_samples)
sklearn/datasets/_samples_generator.py:1091:    ) ** 0.5 + noise * generator.randn(n_samples)
sklearn/datasets/_samples_generator.py:1156:    ) + noise * generator.randn(n_samples)
sklearn/datasets/_samples_generator.py:1221:    u, _ = linalg.qr(generator.randn(n_samples, n), mode="economic", check_finite=False)
sklearn/datasets/_samples_generator.py:1223:        generator.randn(n_features, n), mode="economic", check_finite=False
sklearn/datasets/_samples_generator.py:1283:    D = generator.randn(n_features, n_components)
sklearn/datasets/_samples_generator.py:1292:        X[idx, i] = generator.randn(n_nonzero_coefs)
sklearn/datasets/_samples_generator.py:1522:    X += noise * generator.randn(3, n_samples)
sklearn/datasets/_samples_generator.py:1564:    X += noise * generator.randn(3, n_samples)
sklearn/linear_model/_quantile.py:92:    >>> y = rng.randn(n_samples)
sklearn/linear_model/_quantile.py:93:    >>> X = rng.randn(n_samples, n_features)
sklearn/linear_model/_sag.py:216:    >>> X = rng.randn(n_samples, n_features)
sklearn/linear_model/_sag.py:217:    >>> y = rng.randn(n_samples)
sklearn/linear_model/_ridge.py:971:    >>> y = rng.randn(n_samples)
sklearn/linear_model/_ridge.py:972:    >>> X = rng.randn(n_samples, n_features)
sklearn/linear_model/_stochastic_gradient.py:1890:    >>> y = rng.randn(n_samples)
sklearn/linear_model/_stochastic_gradient.py:1891:    >>> X = rng.randn(n_samples, n_features)
sklearn/kernel_ridge.py:126:    >>> y = rng.randn(n_samples)
sklearn/kernel_ridge.py:127:    >>> X = rng.randn(n_samples, n_features)
sklearn/utils/estimator_checks.py:171:    X = rng.randn(10, 5)
sklearn/feature_selection/_mutual_info.py:293:            1e-10 * means * rng.randn(n_samples, np.sum(continuous_mask))
sklearn/feature_selection/_mutual_info.py:298:        y += 1e-10 * np.maximum(1, np.mean(np.abs(y))) * rng.randn(n_samples)
sklearn/svm/_classes.py:1196:    >>> y = rng.randn(n_samples)
sklearn/svm/_classes.py:1197:    >>> X = rng.randn(n_samples, n_features)
sklearn/svm/_classes.py:1388:    >>> y = np.random.randn(n_samples)
sklearn/svm/_classes.py:1389:    >>> X = np.random.randn(n_samples, n_features)
sklearn/manifold/_t_sne.py:994:            X_embedded = 1e-4 * random_state.randn(n_samples, self.n_components).astype(
sklearn/manifold/_spectral_embedding.py:339:        X = random_state.randn(laplacian.shape[0], n_components + 1)
sklearn/manifold/_spectral_embedding.py:370:            X = random_state.randn(laplacian.shape[0], n_components + 1)
sklearn/mixture/_base.py:459:                    mean + rng.randn(sample, n_features) * np.sqrt(covariance)
sklearn/model_selection/_split.py:1006:    >>> X = np.random.randn(12, 2)
sklearn/decomposition/_nmf.py:317:        H = avg * rng.randn(n_components, n_features).astype(X.dtype, copy=False)
sklearn/decomposition/_nmf.py:318:        W = avg * rng.randn(n_samples, n_components).astype(X.dtype, copy=False)
sklearn/decomposition/_nmf.py:372:        W[W == 0] = abs(avg * rng.randn(len(W[W == 0])) / 100)
sklearn/decomposition/_nmf.py:373:        H[H == 0] = abs(avg * rng.randn(len(H[H == 0])) / 100)
sklearn/neighbors/_nca.py:454:                transformation = self.random_state_.randn(n_components, X.shape[1])

@Micky774 (Contributor, Author) commented

Thank you for that. I'm on Windows, and its string searching/parsing utilities aren't... great. Will update the PR.

@Micky774 (Contributor, Author) commented
How would I log this in the changelog?

@thomasjpfan (Member) commented Jan 25, 2022

How would I log this in the changelog?

I do not think this needs a changelog entry, since there should be no user-facing changes. The title should be updated, though, since the scope has increased.

(The failing test is unrelated; looking into it now.)

@Micky774 Micky774 changed the title Replaced RandomState.random_sample calls to RandomState.uniform Replaced RandomState-specific calls to equivalent calls that match signature with Generator calls Jan 25, 2022
@thomasjpfan thomasjpfan changed the title Replaced RandomState-specific calls to equivalent calls that match signature with Generator calls MAINT Replaced RandomState-specific calls to equivalent calls that match signature with Generator calls Jan 25, 2022
@thomasjpfan thomasjpfan changed the title MAINT Replaced RandomState-specific calls to equivalent calls that match signature with Generator calls ENH Replaced RandomState-specific calls to equivalent calls that match signature with Generator calls Jan 25, 2022
@thomasjpfan (Member) left a comment

LGTM!

For future reviewers: this change does not alter any generated random numbers. It is mainly to make NumPy Generators easier to adopt.

  1. For reference, randn and standard_normal are the same call: https://github.com/numpy/numpy/blob/6077afd650a503034d0a8a5917bb9a5fa3f115fd/numpy/random/mtrand.pyx#L1243-L1246

  2. rand calls random_sample: https://github.com/numpy/numpy/blob/6077afd650a503034d0a8a5917bb9a5fa3f115fd/numpy/random/mtrand.pyx#L1179-L1182

  3. random_sample and uniform generate the same values, as they use the same underlying C code. Through testing, they are all the same:

import numpy as np
from numpy.testing import assert_allclose

# Verify identical draws across several seeds and square shapes.
for i in range(20):
    rng1 = np.random.RandomState(i)
    rng2 = np.random.RandomState(i)

    for n in range(0, 1000, 100):
        x1 = rng1.random_sample((n, n))
        x2 = rng2.uniform(size=(n, n))
        assert_allclose(x1, x2)

The only difference is that uniform expands the API a little by allowing low and high arguments.
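
To illustrate, a quick check of my own, assuming RandomState.uniform computes low + (high - low) * u over the same draws u as random_sample:

import numpy as np
from numpy.testing import assert_allclose

rng1 = np.random.RandomState(0)
rng2 = np.random.RandomState(0)

# uniform(-1, 1) should be the affine transform of the same underlying draws.
assert_allclose(rng1.uniform(-1, 1, size=10), 2 * rng2.random_sample(10) - 1)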

@jeremiedbb (Member) left a comment

LGTM. Thanks @Micky774!

@thomasjpfan (Member) left a comment

LGTM

@thomasjpfan thomasjpfan changed the title ENH Replaced RandomState-specific calls to equivalent calls that match signature with Generator calls ENH Replaced RandomState with Generator compatible calls Jan 28, 2022
@thomasjpfan thomasjpfan merged commit 254ea8c into scikit-learn:main Jan 28, 2022
@thomasjpfan (Member) commented

As a follow-up PR, there are rand -> uniform calls that need to be updated too:

grep -r "rand(" sklearn --exclude="*tests*" --exclude="*.pyc" --exclude="*.cpp" --exclude="*.h" \
    --exclude="*.pxi" --exclude="*.c"  --exclude="*.html" --exclude="*.so" -in
Results (some false positives possible)
sklearn/metrics/pairwise.py:1655:    >>> X = np.random.RandomState(0).rand(5, 3)
sklearn/ensemble/_gradient_boosting.pyx:259:          random_state.rand(n_total_samples)
sklearn/semi_supervised/_label_propagation.py:39:>>> random_unlabeled_points = rng.rand(len(iris.target)) < 0.3
sklearn/semi_supervised/_label_propagation.py:411:    >>> random_unlabeled_points = rng.rand(len(iris.target)) < 0.3
sklearn/semi_supervised/_label_propagation.py:567:    >>> random_unlabeled_points = rng.rand(len(iris.target)) < 0.3
sklearn/semi_supervised/_self_training.py:133:    >>> random_unlabeled_points = rng.rand(iris.target.shape[0]) < 0.3
sklearn/datasets/_samples_generator.py:229:        centroids *= generator.rand(n_clusters, 1)
sklearn/datasets/_samples_generator.py:230:        centroids *= generator.rand(1, n_informative)
sklearn/datasets/_samples_generator.py:242:        A = 2 * generator.rand(n_informative, n_informative) - 1
sklearn/datasets/_samples_generator.py:249:        B = 2 * generator.rand(n_informative, n_redundant) - 1
sklearn/datasets/_samples_generator.py:257:        indices = ((n - 1) * generator.rand(n_repeated) + 0.5).astype(np.intp)
sklearn/datasets/_samples_generator.py:266:        flip_mask = generator.rand(n_samples) < flip_y
sklearn/datasets/_samples_generator.py:271:        shift = (2 * generator.rand(n_features) - 1) * class_sep
sklearn/datasets/_samples_generator.py:275:        scale = 1 + 100 * generator.rand(n_features)
sklearn/datasets/_samples_generator.py:394:    p_c = generator.rand(n_classes)
sklearn/datasets/_samples_generator.py:397:    p_w_c = generator.rand(n_features, n_classes)
sklearn/datasets/_samples_generator.py:412:            c = np.searchsorted(cumulative_p_c, generator.rand(y_size - len(y)))
sklearn/datasets/_samples_generator.py:430:        words = np.searchsorted(cumulative_p_w_sample, generator.rand(n_words))
sklearn/datasets/_samples_generator.py:614:    ground_truth[:n_informative, :] = 100 * generator.rand(n_informative, n_targets)
sklearn/datasets/_samples_generator.py:1019:    X = generator.rand(n_samples, n_features)
sklearn/datasets/_samples_generator.py:1082:    X = generator.rand(n_samples, 4)
sklearn/datasets/_samples_generator.py:1147:    X = generator.rand(n_samples, 4)
sklearn/datasets/_samples_generator.py:1377:    A = generator.rand(n_dim, n_dim)
sklearn/datasets/_samples_generator.py:1379:    X = np.dot(np.dot(U, 1.0 + np.diag(generator.rand(n_dim))), Vt)
sklearn/datasets/_samples_generator.py:1439:    aux = random_state.rand(dim, dim)
sklearn/datasets/_samples_generator.py:1443:    ) * random_state.rand(np.sum(aux > alpha))
sklearn/datasets/_samples_generator.py:1507:        t = 1.5 * np.pi * (1 + 2 * generator.rand(n_samples))
sklearn/datasets/_samples_generator.py:1508:        y = 21 * generator.rand(n_samples)
sklearn/datasets/_samples_generator.py:1515:        parameters = generator.rand(2, n_samples) * np.array([[np.pi], [7]])
sklearn/datasets/_samples_generator.py:1558:    t = 3 * np.pi * (generator.rand(1, n_samples) - 0.5)
sklearn/datasets/_samples_generator.py:1560:    y = 2.0 * generator.rand(1, n_samples)
sklearn/random_projection.py:512:    >>> X = rng.rand(25, 3000)
sklearn/random_projection.py:662:    >>> X = rng.rand(25, 3000)
sklearn/utils/estimator_checks.py:801:    X = rng.rand(40, 3)
sklearn/utils/estimator_checks.py:805:    y = (4 * rng.rand(40)).astype(int)
sklearn/utils/estimator_checks.py:1091:    X = _pairwise_estimator_convert_X(rng.rand(40, 10), estimator_orig)
sklearn/utils/random.py:92:                class_probability_nz_norm.cumsum(), rng.rand(nnz)
sklearn/manifold/_mds.py:87:        X = random_state.rand(n_samples * n_components)
sklearn/mixture/_base.py:151:            resp = random_state.rand(n_samples, self.n_components)
sklearn/decomposition/_truncated_svd.py:143:    >>> X_dense = np.random.rand(100, 100)
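
For illustration, a sketch (not from the thread) of the mechanical rewrite such a follow-up would apply, modeled on the first _samples_generator.py hit above; n_clusters is a placeholder value:

import numpy as np

rng = np.random.RandomState(0)
n_clusters = 4  # placeholder for illustration

# Before (RandomState-only spelling):
#     centroids = rng.rand(n_clusters, 1)
# After (same values on RandomState, and also valid on a Generator):
centroids = rng.uniform(size=(n_clusters, 1))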

@lorentzenchr (Member) commented

This is a great first step for Generator, thanks!
