
RandomizedSearchCV's training time much longer than the sum of cross_validate training times #22716

Closed
@aurelQ

Description


Describe the bug

I am currently working on a project and I have to choose between 5 machine learning algorithms.
But my dataset is very large and it has more than 70 columns.

So to test my programs I wanted to measure the training time of a cross-validation with a single set of parameters, to estimate roughly how long a RandomizedSearchCV with 10 or more parameter sets would take.
I created a variable which is the sum of the training times, and I want to compare it with the training time of my RandomizedSearchCV.
The issue is: I assumed that a RandomizedSearchCV with 20 iterations (n_iter=20) would take about 20 times longer to run than a cross_validate call that tests only one set of parameters, but that is really not the case.
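For reference, here is a minimal, self-contained sketch of that comparison, using a synthetic dataset (make_classification) in place of the real data; the sizes and parameter ranges are placeholders, and the actual code is in the next section:

import numpy as np
from time import perf_counter
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, RandomizedSearchCV

# Synthetic stand-in for the real dataset (~70 columns)
X, y = make_classification(n_samples=5000, n_features=70, random_state=0)

# Baseline: one parameter set, 5-fold cross-validation
baseline = cross_validate(LogisticRegression(), X, y, cv=5)["fit_time"].sum()

# Search: 20 sampled parameter sets, same 5-fold cross-validation
params = {'C': np.logspace(-4, 4, 20), 'penalty': ['l1', 'l2'], 'solver': ['liblinear']}
search = RandomizedSearchCV(LogisticRegression(), params, cv=5, n_iter=20, random_state=0)
start = perf_counter()
search.fit(X, y)
elapsed = perf_counter() - start

# Naive expectation: elapsed is roughly 20 * baseline (plus one final refit on the full data)
print(f"baseline fit time: {baseline:.2f}s  search wall time: {elapsed:.2f}s")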

Steps/Code to Reproduce

import numpy as np
from sklearn import metrics
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

numeric_transformer = Pipeline(
    [
        ('imputer_num', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]
)
preprocessor = ColumnTransformer(
    [
        ('numericals', numeric_transformer, numerical_features)
    ], remainder='drop'
)
# Preprocessing pipeline
pipeline_log = Pipeline(
    [
        ('preprocessing', preprocessor)
    ]
)
X_train_pip = pipeline_log.fit_transform(X_train_ech, y_train_ech)
#X_train_pip = pipeline_log.fit_transform(X_train, y_train)
log_reg = LogisticRegression()

# Define the scorers
# needs_proba=True so that this matches 'roc_auc': these metrics use predicted probabilities
scorer_1 = make_scorer(metrics.roc_auc_score, needs_proba=True)
scorer_2 = make_scorer(metrics.average_precision_score, needs_proba=True)
scoring = {'roc_auc': scorer_1, 'average_precision': scorer_2}

cv_results = cross_validate(log_reg, X_train_pip, y_train_ech, cv=5, scoring=scoring)
res = sum(cv_results["fit_time"])
print(cv_results["fit_time"], res)

-- AND SECOND ONE --

import numpy as np
from sklearn import metrics
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

numeric_transformer = Pipeline(
    [
        ('imputer_num', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]
)
preprocessor = ColumnTransformer(
    [
        ('numericals', numeric_transformer, numerical_features)
    ], remainder='drop'
)
# Preprocessing pipeline
pipeline_log = Pipeline(
    [
        ('preprocessing', preprocessor)
    ]
)

# Parameter distributions to sample from
params = {'class_weight': ['balanced'],
          'C': np.logspace(-4, 4, 20),
          'solver': ['liblinear'],
          'max_iter': [100],  # 1000
          'penalty': ['l1', 'l2']  # l1
          }

# Define the scorers
# needs_proba=True so that this matches 'roc_auc': these metrics use predicted probabilities
scorer_1 = make_scorer(metrics.roc_auc_score, needs_proba=True)
scorer_2 = make_scorer(metrics.average_precision_score, needs_proba=True)
scoring = {'roc_auc': scorer_1, 'average_precision': scorer_2}

# Apply the preprocessing pipeline
X_train_pip = pipeline_log.fit_transform(X_train_ech, y_train_ech)
#X_train_pip = pipeline_log.fit_transform(X_train, y_train)
log_reg = LogisticRegression(max_iter=100, solver='liblinear', class_weight='balanced')

# Define the randomized search
clf_log_w = RandomizedSearchCV(log_reg, params, cv=5, scoring=scoring,
                               refit='average_precision', n_iter=10)
grid_result_log_w = clf_log_w.fit(X_train_pip, y_train_ech)
#grid_result_log_w = clf_log_w.fit(X_train_pip, y_train)
#clf_log_w.fit(X_train_pip, y_train)
score_w_log = grid_result_log_w.best_score_

print("Best: %f using %s" % (grid_result_log_w.best_score_, grid_result_log_w.best_params_), "\n")

Expected Results

A single cross-validation of my program takes about 2.5 seconds to execute.
A RandomizedSearchCV with 20 iterations should therefore take about 20 times longer, i.e. roughly 2.5 * 20 = 50 seconds.

Actual Results

The RandomizedSearchCV does not take 50 seconds to execute; it takes so long that it really does not seem normal. Even with n_iter=2 (training time should be about 2.5 * 2 = 5 seconds) it takes more than 10 minutes (maybe much more). It does work, but I wanted to understand the difference between what my program does and what I thought it should do. Did I do something wrong, or is there an issue in the library?
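One way to check whether a few individual candidates dominate the runtime, assuming the clf_log_w search from the snippet above has been fitted, is to look at the per-candidate fit times that RandomizedSearchCV records in cv_results_; a minimal sketch:

import pandas as pd

# Mean fit time of each sampled candidate, averaged over the 5 CV folds
timings = pd.DataFrame(clf_log_w.cv_results_)[["params", "mean_fit_time"]]
print(timings.sort_values("mean_fit_time", ascending=False))

# Total fitting work inside the search (all candidates, all folds),
# comparable to sum(cv_results["fit_time"]) from the single cross_validate run
print(timings["mean_fit_time"].sum() * 5)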

Versions

System:
    python: 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\aurel\anaconda3\python.exe
   machine: Windows-10-10.0.22000-SP0

Python dependencies:
          pip: 20.2.4
   setuptools: 50.3.1.post20201107
      sklearn: 0.24.2
        numpy: 1.21.2
        scipy: 1.5.2
       Cython: 0.29.23
       pandas: 1.3.4
   matplotlib: 3.3.2
       joblib: 0.17.0
threadpoolctl: 2.1.0

Built with OpenMP: True
