scikit's GridSearch and Python in general are not freeing memory #3973


Closed
rasbt opened this issue Dec 16, 2014 · 6 comments

@rasbt
Contributor

rasbt commented Dec 16, 2014

Hi,
I recently asked a question on StackOverflow here about an issue that I encountered with scikit-learn's GridSearch and memory utilization.

Basically, it allocates more and more memory the longer it runs, until the job fails when it hits the 128 GB available on the system I am running it on. More details are in the StackOverflow question linked above, and I also created a GitHub repo with the script and data if you want to reproduce the issue.

https://github.com/rasbt/bugreport/tree/master/scikit-learn/gridsearch_memory
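For context, here is a minimal sketch (not the actual script from the repo above; data, estimator, and grid are made up) of the kind of loop that exposes this: repeatedly fit a fresh GridSearchCV and watch the process's peak RSS. With a leak, the reported value keeps climbing across iterations.

```python
# Minimal sketch of watching GridSearchCV's memory footprint grow
# across repeated fits. All names here are illustrative, not the
# original reproduction script.
import resource

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.rand(500, 20)
y = (X[:, 0] > 0.5).astype(int)

param_grid = {"C": [0.1, 1.0, 10.0]}

rss_after_fit = []
for _ in range(3):
    gs = GridSearchCV(SVC(), param_grid, cv=3, n_jobs=1)
    gs.fit(X, y)
    # ru_maxrss is the peak resident set size so far (KiB on Linux,
    # bytes on macOS); a value that keeps growing across iterations
    # would hint at memory not being released between fits.
    rss_after_fit.append(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

print(rss_after_fit)
```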

@amueller amueller added the Bug label Jan 22, 2015
@rth
Member

rth commented Sep 27, 2016

As this is a two-year-old issue that concerns v0.15, and both GridSearchCV and joblib have probably seen significant changes since, @rasbt is this issue still relevant, or should it be closed?

@rasbt
Contributor Author

rasbt commented Sep 27, 2016

Haven't tried this particular setting in a while. I would say let's close it for the reasons you mentioned, but let me try to re-run this in the next few days to double-check.

@amueller amueller modified the milestone: 0.19 Sep 29, 2016
@rasbt
Contributor Author

rasbt commented Sep 30, 2016

I just ran the same code (the one I posted on GitHub in Dec 2014) overnight and it was very stable this time (sklearn 0.18 and Python 3.5) :); looks like the issue was resolved some time ago!

@rasbt rasbt closed this as completed Sep 30, 2016
@amueller
Member

thanks for checking :)

@rishabhgit

rishabhgit commented Apr 15, 2019

Hi @amueller , @rasbt ,
I've run into the same issue while trying to optimise hyper-parameters for a Random Forest Regressor.
I'm using Python 3.6 and sklearn version 0.20.3
My data set is not huge (~350k rows × 370 cols). Here's a snippet of the code that I'm using:

    # imports inferred from the snippet's usage
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor as sk_rfreg
    from sklearn.model_selection import RandomizedSearchCV

    y_train = np.array(df_train[label])
    X_train = np.array(df_train[x_cols])
    weights = np.array(df_train['PREMISES_COUNT'])
    print('X_train shape ', X_train.shape)
    print('y_train shape ', y_train.shape)
    print('Sample weight shape ', weights.shape)
    
    param_grid = {'n_estimators': [100, 150, 200],
                  'min_samples_leaf': [2, 10, 20],
                  'min_samples_split': [10, 15, 20],
                  'max_features': ['auto', 'sqrt', 'log2']
    }
    rf = sk_rfreg( random_state=0)
    #scorer = make_scorer(r2_score,sample_weight=weights)
    rs = RandomizedSearchCV(estimator=rf, param_distributions=param_grid, n_iter=15, 
                           scoring = 'neg_mean_absolute_error',
                           n_jobs=5, cv=3)
    
    rs = rs.fit(X_train, y_train)
    print('Best R^2 score', rs.best_score_)
    print ('Best Params:')
    print(rs.best_params_)

I'm running this code on an EC2 server with 128 GB RAM. RandomizedSearchCV spawns more than 5 Python processes, which slowly consume all the RAM and run out of memory before the execution completes. Here's the output of the `top` command:

[screenshot of `top` output]

Could this issue be reopened and investigated?
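One knob worth trying in setups like this (my own suggestion, not something proposed in the thread) is RandomizedSearchCV's documented `pre_dispatch` parameter, which caps how many tasks joblib queues up at once, so fewer copies of the data are held in flight. A small self-contained sketch with made-up data:

```python
# Sketch: limit concurrently dispatched tasks with pre_dispatch and
# keep n_jobs modest, so the workers do not hold every parameter
# setting's data in memory at the same time. Data and grid are toy
# placeholders, not the reporter's actual setup.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.RandomState(0)
X = rng.rand(300, 10)
y = rng.rand(300)

param_grid = {"n_estimators": [10, 20],
              "min_samples_leaf": [2, 10]}

rs = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_grid,
    n_iter=3,
    cv=3,
    n_jobs=2,
    pre_dispatch="1*n_jobs",  # dispatch at most n_jobs tasks at a time
)
rs.fit(X, y)
print(rs.best_params_)
```

Whether this fixes an actual leak or merely lowers the peak footprint depends on where the memory is going; profiling the workers would still be needed to tell.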

@jnothman
Member

@rishabhgit I don't think this issue is relevant...
