Bug in AdaBoostRegressor with randomstate #7408

StevenLOL · 2016-09-13T04:37:53Z

Description

Consider the following regressor:

xlf1 = Pipeline([('svd', PCA(n_components=pca_n_components)),
                 ('regressor', AdaBoostRegressor(
                     #random_state=random_state,
                     base_estimator=MLPRegressor(random_state=random_state,
                                                 early_stopping=True,
                                                 max_iter=2000),
                     n_estimators=30,
                     learning_rate=0.01)),
                 ])

If random_state is set to some value, the performance is worse than when it is left unset.
I created a project for this problem here.

By the way, there is not much difference when LinearSVR is used as the base_estimator.

Expected Results

Actual Results

Versions

Linux-4.4.0-31-generic-x86_64-with-Ubuntu-16.04-xenial
('Python', '2.7.12 (default, Jul 1 2016, 15:12:24) \n[GCC 5.4.0 20160609]')
('NumPy', '1.11.1')
('SciPy', '0.17.0')
('Scikit-Learn', '0.18.dev0')

nelson-liu · 2016-09-13T05:03:07Z

I'm not sure that I understand correctly. Are you saying that the performance of your estimator is lower with a fixed random_state than without providing one?

If so, be aware that (in certain methods) the results of training are stochastic and depend on random_state. Thus, in your specific case, setting a random state just happened to (one could say randomly) lead to a slightly worse result than seeding the training routine with the global numpy random state.

The purpose of random_state is to enhance reproducibility, and minor differences are to be expected across different values of random_state (try setting the seed to different values and observe that the performance will increase or decrease).
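To make the point about seeds concrete, here is a minimal sketch using only the standard library. `noisy_fit_score` is a hypothetical stand-in for a stochastic training run (it is not scikit-learn code), and the "true" score of 1.0 is an arbitrary illustrative value. The same seed reproduces the same score, while different seeds scatter around it.

```python
import random


def noisy_fit_score(seed, n_draws=200):
    """Toy stand-in for a stochastic training run (hypothetical helper)."""
    rng = random.Random(seed)
    # Average Gaussian noise around an arbitrary "true" score of 1.0.
    return sum(rng.gauss(1.0, 0.1) for _ in range(n_draws)) / n_draws


# Same seed: the result is reproduced exactly on every call.
same_seed = [noisy_fit_score(0) for _ in range(3)]

# Different seeds: results vary slightly from one seed to the next.
different_seeds = [noisy_fit_score(seed) for seed in range(5)]

print(same_seed)
print(different_seeds)
```

A single seed that happens to score slightly worse would be ordinary seed-to-seed variation of this kind; a gap that persists when averaged over many seeds points to something else.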

If my interpretation of your question was incorrect, please clarify.

StevenLOL · 2016-09-13T05:46:44Z

I just tested with random_state=int(time.time()).

The average over 30 rounds shows that it is worse than not setting it at all (0.732 MAE vs 0.621).

It seems that this happens whenever a value is assigned, not only for a fixed random_state.

jnothman · 2016-09-13T05:55:06Z

I can imagine this sort of thing coming about if the same random_state value (rather than a shared generator, or values drawn from one) is passed to multiple sub-estimators, so that they all have the same randomisation.

jnothman · 2016-09-13T05:56:22Z

Indeed, that's what's happening: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/weight_boosting.py#L1001
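A minimal sketch of the failure mode described above, using the standard library's `random.Random` as a stand-in for the generators inside each sub-estimator. The function names are hypothetical and this is not the scikit-learn implementation. Passing one integer seed to every member makes them all draw the same numbers; drawing a distinct seed per member from one parent generator keeps the run reproducible while letting the members differ.

```python
import random


def members_with_shared_seed(seed, n_members):
    # Every member builds its own generator from the same integer seed,
    # so all members see an identical random stream.
    return [random.Random(seed) for _ in range(n_members)]


def members_with_derived_seeds(seed, n_members):
    # One parent generator hands a distinct integer seed to each member.
    parent = random.Random(seed)
    return [random.Random(parent.randrange(2**32)) for _ in range(n_members)]


def first_draws(members):
    # First random number each member would use, e.g. for weight initialisation.
    return [member.random() for member in members]


shared = first_draws(members_with_shared_seed(2016, 5))
derived = first_draws(members_with_derived_seeds(2016, 5))
derived_again = first_draws(members_with_derived_seeds(2016, 5))

print(len(set(shared)))          # 1: every member drew the same value
print(len(set(derived)))         # more than 1: members differ
print(derived == derived_again)  # True: still reproducible from the top-level seed
```

With identical randomisation in every member, an ensemble loses much of the diversity it relies on, which would be consistent with the worse scores reported whenever a value is assigned.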

StevenLOL · 2016-09-13T06:01:49Z

OK, here is a quick test.

Set random_state=int(time.time()) in AdaBoostRegressor to see the difference: roughly 16 vs roughly 49.

import time

import numpy as np

from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem with a fixed seed so the data is identical across runs
trainx, trainy = make_regression(n_samples=2000, n_features=100, random_state=0, noise=4.0,
                       bias=2.0)

print(trainx[0][0:5])
print(trainy[0:5])
def evalTrainData(trainDatax,trainV,random_state=2016,eid=''):
    fold=10

    scores=[]
    pccs=[]
    cvcount=0
    assert len(trainDatax)==len(trainV)

    for roundindex in range(0,3):
        skf=KFold(fold,shuffle=True,random_state=random_state+roundindex)
        for trainIndex,evalIndex in skf.split(trainDatax):
            t1=time.time()
            cvTrainx,cvTrainy=trainDatax[trainIndex],trainV[trainIndex]
            cvEvalx,cvEvaly=trainDatax[evalIndex],trainV[evalIndex]

            scaler=StandardScaler()
            cvTrainy=scaler.fit_transform(cvTrainy.reshape(-1, 1)).ravel()
            lsvr=getxlf(random_state=random_state)
            lsvr.fit(cvTrainx,cvTrainy)
            predict=lsvr.predict(cvEvalx)
            predict=scaler.inverse_transform(predict.reshape(-1,1)).ravel()
            score=mean_absolute_error(cvEvaly,predict)
            pcc=np.corrcoef(cvEvaly,predict)[0, 1]
            print (cvcount,'MAE',score,'PCC',pcc,time.time()-t1,time.asctime( time.localtime(time.time()) ) ,'Train shape:',cvTrainx.shape,'eval shape:', cvEvalx.shape)
            scores.append(score)
            pccs.append(pcc)
            cvcount+=1

    print ('###',eid,'MAE',np.mean(scores),'PCC',np.mean(pccs))


pca_n_components=100
def getxlf(random_state=2016):
    xlf1 = Pipeline([
        ('svd', PCA(n_components=pca_n_components)),
        ('regressor', AdaBoostRegressor(
            #random_state=int(time.time()),
            base_estimator=MLPRegressor(random_state=random_state,
                                        early_stopping=True,
                                        max_iter=2000),
            n_estimators=30,
            learning_rate=0.01)),
    ])

    return xlf1

evalTrainData(trainx,trainy)

jnothman · 2016-09-13T06:09:05Z

I have a half-completed patch. Will post in a few hours.


jnothman · 2016-09-13T10:30:51Z

See #7411. For good or bad, that PR goes beyond the scope of just fixing this issue, so it might take a little time to review and merge.

StevenLOL · 2016-09-14T02:49:41Z

Cool, thanks.

jnothman · 2016-09-14T04:04:22Z

We usually close after the bug is fixed.

StevenLOL · 2016-09-14T05:25:33Z

OK, thanks.


* FIX adaboost estimators not randomising correctly (fixes #7408); FIX ensure nested random_state is set in ensembles
* DOC add what's new
* Only affect *__random_state, not *_random_state for now
* TST More informative assertions for ensemble tests
* More specific testing of different random_states
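The idea of setting nested random_state parameters can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: `nested_seed_updates` and the parameter name `shuffle_random_state` are hypothetical, and the parameter names are given as plain strings in the `<component>__<parameter>` form that scikit-learn's `get_params(deep=True)` uses. Only `random_state` itself and keys ending in `__random_state` receive a fresh seed; a key that merely ends in `_random_state` is left alone, matching the "for now" note above.

```python
import random


def nested_seed_updates(param_names, seed):
    """Return {parameter name: new integer seed} for nested random_state keys.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    updates = {}
    # Sort the names so the seed assigned to each key does not depend on input order.
    for name in sorted(param_names):
        if name == "random_state" or name.endswith("__random_state"):
            updates[name] = rng.randrange(2**32)
    return updates


params = [
    "n_estimators",
    "random_state",
    "base_estimator__random_state",
    "base_estimator__max_iter",
    "shuffle_random_state",  # single underscore: deliberately not matched
]

updates = nested_seed_updates(params, seed=2016)
print(sorted(updates))
# ['base_estimator__random_state', 'random_state']
```

With a real estimator, a dictionary like this could be applied with `estimator.set_params(**updates)` for each sub-estimator in turn, so that each one gets its own seed.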


jnothman added the Bug label Sep 13, 2016

jnothman mentioned this issue Sep 13, 2016

[MRG+2] FIX adaboost estimators not randomising correctly #7411

Merged

jnothman added a commit to jnothman/scikit-learn that referenced this issue Sep 13, 2016

FIX adaboost estimators not randomising correctly

8e9e0c1

(fixes scikit-learn#7408) ENH add utility to set nested random_state FIX ensure nested random_state is set in ensembles

StevenLOL closed this as completed Sep 14, 2016

jnothman reopened this Sep 14, 2016

jnothman added a commit to jnothman/scikit-learn that referenced this issue Sep 14, 2016

FIX adaboost estimators not randomising correctly

9e94d95

(fixes scikit-learn#7408) FIX ensure nested random_state is set in ensembles

jnothman added a commit to jnothman/scikit-learn that referenced this issue Sep 21, 2016

FIX adaboost estimators not randomising correctly

ded20af

(fixes scikit-learn#7408) FIX ensure nested random_state is set in ensembles

jnothman added a commit to jnothman/scikit-learn that referenced this issue Sep 22, 2016

FIX adaboost estimators not randomising correctly

54d788c

(fixes scikit-learn#7408) FIX ensure nested random_state is set in ensembles

jnothman closed this as completed in #7411 Sep 23, 2016
