
RandomizedSearchCV's training time too much longer than cross_validate function sum of training times #22716


Closed
aurelQ opened this issue Mar 7, 2022 · 3 comments


aurelQ commented Mar 7, 2022

Describe the bug

I am currently working on a project in which I have to choose between 5 machine learning algorithms, but my dataset is very large and has more than 70 columns.

To size things up, I wanted to measure the training time of a cross-validation with a single set of parameters, to estimate roughly how long a RandomizedSearchCV over 10 or more parameter sets would take. I created a variable which is the sum of the per-fold training times, to compare it with the training time of my RandomizedSearchCV.
The issue: I expected a RandomizedSearchCV with 20 iterations (n_iter=20) to take about 20 times longer to execute than a cross_validate call that tests only one set of parameters, but that is really not the case.

Steps/Code to Reproduce

import numpy as np
from sklearn import metrics
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# X_train_ech, y_train_ech and numerical_features are defined earlier in my script
numeric_transformer = Pipeline(
    [
        ('imputer_num', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]
)
preprocessor = ColumnTransformer(
    [
        ('numericals', numeric_transformer, numerical_features)
    ], remainder='drop'
)
# Model choice
pipeline_log = Pipeline(
    [
        ('preprocessing', preprocessor)
    ]
)
X_train_pip = pipeline_log.fit_transform(X_train_ech, y_train_ech)
#X_train_pip = pipeline_log.fit_transform(X_train, y_train)
log_reg = LogisticRegression()

# Scorer definition
# needs_proba must be True to match 'roc_auc', since these metrics use the probabilities
scorer_1 = make_scorer(metrics.roc_auc_score, needs_proba=True)
scorer_2 = make_scorer(metrics.average_precision_score, needs_proba=True)
scoring = {'roc_auc': scorer_1, 'average_precision': scorer_2}

cv_results = cross_validate(log_reg, X_train_pip, y_train_ech, cv=5, scoring=scoring)
res = sum(cv_results["fit_time"])
print(cv_results["fit_time"], res)

-- AND SECOND ONE --

import numpy as np
from sklearn import metrics
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

numeric_transformer = Pipeline(
    [
        ('imputer_num', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]
)
preprocessor = ColumnTransformer(
    [
        ('numericals', numeric_transformer, numerical_features)
    ], remainder='drop'
)
# Model choice
pipeline_log = Pipeline(
    [
        ('preprocessing', preprocessor)
    ]
)

# Definition of the different parameters to test
params = {'class_weight': ['balanced'],
          'C': np.logspace(-4, 4, 20),
          'solver': ['liblinear'],
          'max_iter': [100],  # 1000
          'penalty': ['l1', 'l2']  # l1
          }
# Scorer definition
# needs_proba must be True to match 'roc_auc', since these metrics use the probabilities
scorer_1 = make_scorer(metrics.roc_auc_score, needs_proba=True)
scorer_2 = make_scorer(metrics.average_precision_score, needs_proba=True)
scoring = {'roc_auc': scorer_1, 'average_precision': scorer_2}

# Adding SMOTE to the pipeline
X_train_pip = pipeline_log.fit_transform(X_train_ech, y_train_ech)
#X_train_pip = pipeline_log.fit_transform(X_train, y_train)
log_reg = LogisticRegression(max_iter=100, solver='liblinear', class_weight='balanced')

# Randomized search definition
clf_log_w = RandomizedSearchCV(log_reg, params, cv=5, scoring=scoring, refit='average_precision', n_iter=10)
grid_result_log_w = clf_log_w.fit(X_train_pip, y_train_ech)
#grid_result_log_w = clf_log_w.fit(X_train_pip, y_train)
#clf_log_w.fit(X_train_pip, y_train)
score_w_log = grid_result_log_w.best_score_

print("Best: %f using %s" % (grid_result_log_w.best_score_, grid_result_log_w.best_params_), "\n")

Expected Results

For one cross-validation, my program takes 2.5 seconds to execute.
A RandomizedSearchCV with 20 iterations should therefore take about 20 times longer, i.e. roughly 2.5 * 20 = 50 seconds.

Actual Results

The RandomizedSearchCV does not take 50 seconds to execute; it takes so long that it really does not seem normal. Even with n_iter=2 (training time should be about 2.5 * 2 = 5 seconds), it takes more than 10 minutes (maybe much more). It does work in the end, but I wanted to understand the difference between what my program does and what I thought it should do. Did I do something wrong, or is there an issue in the library?
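
One way to see where the time actually goes once the search finishes (a minimal sketch; clf_log_w is the fitted search from the reproducer above) is to inspect the per-candidate fit times that RandomizedSearchCV records:

import numpy as np

# cv_results_ holds per-candidate timings: mean_fit_time is the mean fit
# time in seconds over the cv folds for each sampled parameter setting.
fit_times = np.asarray(clf_log_w.cv_results_["mean_fit_time"])
for cand, t in zip(clf_log_w.cv_results_["params"], fit_times):
    print(f"{t:8.2f}s/fold  {cand}")

# Approximate total fit time across all candidates and folds
# (scoring time is tracked separately in mean_score_time)
print("total fit time:", (fit_times * clf_log_w.n_splits_).sum(), "s")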

Versions

System:
    python: 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\aurel\anaconda3\python.exe
   machine: Windows-10-10.0.22000-SP0

Python dependencies:
          pip: 20.2.4
   setuptools: 50.3.1.post20201107
      sklearn: 0.24.2
        numpy: 1.21.2
        scipy: 1.5.2
       Cython: 0.29.23
       pandas: 1.3.4
   matplotlib: 3.3.2
       joblib: 0.17.0
threadpoolctl: 2.1.0

Built with OpenMP: True
aurelQ added the Bug and Needs Triage labels on Mar 7, 2022
NartayAikyn (Contributor) commented Mar 11, 2022

I'm relatively new to scikit-learn and hope I can provide some help with this problem.
I tried to reproduce the bug with your code on Colab, and here are the results.

I tested the fitting time for a 5-fold CV and a 10-iteration RandomizedSearchCV using a dataset of 90k samples. Five solvers were tested, each with the same LogisticRegression classifier and preprocessing pipeline as provided in this issue:

  • newton-cg: 11.06s for 5 fold CV, 58.10s for 10 iters RandomizedSearchCV
  • lbfgs: 1.86s for 5 fold CV, 17.51s for 10 iters RandomizedSearchCV
  • liblinear: 4.53s for 5 fold CV, 46.08s for 10 iters RandomizedSearchCV
  • sag: 16.17s for 5 fold CV, 158.64s for 10 iters RandomizedSearchCV
  • saga: 11.13s for 5 fold CV, 99.79s for 10 iters RandomizedSearchCV

Even though the solvers differ in efficiency, the overall time ratio between CV and RandomizedSearchCV does not deviate much from 1:10. So I suspect the problem is not in RandomizedSearchCV.

However, I think the problem might be related to the liblinear solver. When I tested it, liblinear seemed to take a very long time to train for some particular inputs. For example, if we gradually increase the number of training samples:

  • 96,000 samples: total 37.49s elapsed [7.25920844 6.88055277 7.19919753 7.88366318 8.11918521]
  • 98,000 samples: total 9.75s elapsed [1.81098437 1.80373168 1.91449213 1.89562345 2.15277672]
  • 100,000 samples: total 1028.38s elapsed [200.85734701 194.54049253 205.3806529 195.01785779 232.43018031]
  • 102,000 samples: total 7.33s elapsed [1.55084085 1.3190589 1.22186828 1.59017873 1.4933095 ]
  • 104,000 samples: total 56.57s elapsed [ 0.67970514 0.79958177 0.98124504 6.01783133 47.93945074]

At a sample size of 100k, the fitting time for a single 5-fold CV jumps to about 17 minutes; code for reproducing this is also in the colab notebook.
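
For reference, a minimal sketch of such a timing experiment (synthetic data via make_classification stands in for the issue's real dataset, so the exact numbers will differ):

import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import StandardScaler

for n_samples in [96_000, 98_000, 100_000, 102_000, 104_000]:
    X, y = make_classification(n_samples=n_samples, n_features=70, random_state=0)
    X = StandardScaler().fit_transform(X)  # centered data, as in the issue
    clf = LogisticRegression(solver='liblinear', penalty='l1', C=1.0,
                             class_weight='balanced')
    start = time.perf_counter()
    cv_results = cross_validate(clf, X, y, cv=5)
    total = time.perf_counter() - start
    print(f"{n_samples} samples: total {total:.2f}s, fit times {cv_results['fit_time']}")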

I don't know whether this is a bug or a characteristic of the liblinear solver. This problem might be related to issue #18264, as in our case the data is also centered with StandardScaler. The slowness of the liblinear solver can also be seen in other posts. For example, here, liblinear takes 10177s to fit a 1M-sample dataset, while lbfgs takes 10s. And here, for MNIST, liblinear takes 2893s, while lbfgs takes 53s.

For now, I guess just not using liblinear will solve the problem.

thomasjpfan (Member) commented

I agree with #22716 (comment). liblinear can take longer compared to solver="lbfgs", which would result in a longer runtime for the RandomizedSearchCV.
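
For completeness, a minimal sketch of that workaround applied to the issue's search (the scoring dict and training data are the ones defined in the reproducer above; note that lbfgs supports only the 'l2' penalty, so 'l1' is dropped from the grid):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Same grid as the reproducer, but with solver='lbfgs';
# lbfgs only supports the 'l2' penalty, so 'l1' is removed.
params = {'class_weight': ['balanced'],
          'C': np.logspace(-4, 4, 20),
          'solver': ['lbfgs'],
          'max_iter': [100],
          'penalty': ['l2']}
clf_log_w = RandomizedSearchCV(LogisticRegression(), params, cv=5,
                               scoring=scoring, refit='average_precision', n_iter=10)
clf_log_w.fit(X_train_pip, y_train_ech)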

PLaToniC48 commented
from scipy.stats import randint, uniform

param_distribs = {
    'kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
    'degree': randint(1, 10),
    'gamma': ['scale', 'auto'] + list(uniform(0.1, 5).rvs(10)),
    'C': uniform(0.1, 10).rvs(20),
}

How much time should RandomizedSearchCV be expected to take to train an SVM on 55,000 MNIST images with n_iter=50 and cv=3?
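
There is no single answer: kernel SVM fitting scales roughly between O(n^2) and O(n^3) in the number of samples, so a 50-candidate, 3-fold search over 55,000 images can easily take many hours. One rough way to estimate it, as a sketch (X and y stand for the MNIST training data, assumed to be loaded already), is to time a single fit on a subsample and extrapolate:

import time
import numpy as np
from sklearn.svm import SVC

# Time one fit on a subsample, then extrapolate. Kernel SVM fitting is
# roughly O(n^2)-O(n^3) in the sample count, so the quadratic scaling
# used here gives only a lower-bound ballpark, not a guarantee.
n_sub = 5_000
idx = np.random.RandomState(0).choice(len(X), n_sub, replace=False)
start = time.perf_counter()
SVC(kernel='rbf', C=1.0).fit(X[idx], y[idx])
t_sub = time.perf_counter() - start

n_iter, cv = 50, 3
n_train = len(X) * (cv - 1) / cv        # each fold trains on (cv-1)/cv of the data
t_one_fit = t_sub * (n_train / n_sub) ** 2
print(f"rough estimate: {t_one_fit * n_iter * cv / 3600:.1f} hours")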
