Description
Describe the bug
I am currently working on a project in which I have to choose between 5 machine learning algorithms.
My dataset is very large and has more than 70 columns.
So to test my programs, I wanted to measure the training time of a cross-validation with one set of parameters, to estimate approximately how long a RandomizedSearchCV with 10 or more sets of parameters would take.
I created a variable which is the sum of the training times, and I want to compare it with the training time of my RandomizedSearchCV.
The issue is: I assumed that a RandomizedSearchCV with 20 iterations (n_iter=20) would take about 20 times longer to execute than a cross_validate call that tests only one set of parameters, but that is really not the case.
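For reference, a minimal sketch (on synthetic data, not the original dataset) of that scaling assumption: RandomizedSearchCV with n_iter candidates and cv folds performs n_iter * cv fits plus one refit, so if every candidate costs about the same per fit, the total time should scale roughly linearly with n_iter.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, cross_validate

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# One set of parameters: cv=5 means 5 fits in total.
cv_results = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=5)
single_time = cv_results["fit_time"].sum()

# 20 candidates: 20 * 5 = 100 fits, plus one refit on the full training data.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": np.logspace(-4, 4, 20)},
    n_iter=20, cv=5, random_state=0,
)
search.fit(X, y)

# cv_results_ holds one row per sampled candidate.
print(len(search.cv_results_["params"]))  # 20
```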
Steps/Code to Reproduce
import numpy as np
from sklearn import metrics
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# numerical_features, X_train_ech and y_train_ech come from my dataset (not shown)
numeric_transformer = Pipeline(
    [
        ('imputer_num', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]
)
preprocessor = ColumnTransformer(
    [
        ('numericals', numeric_transformer, numerical_features)
    ], remainder='drop'
)
# Model choice
pipeline_log = Pipeline(
    [
        ('preprocessing', preprocessor)
    ]
)
X_train_pip = pipeline_log.fit_transform(X_train_ech, y_train_ech)
#X_train_pip = pipeline_log.fit_transform(X_train, y_train)
log_reg = LogisticRegression()
# Scorer definition
scorer_1 = make_scorer(metrics.roc_auc_score, needs_proba=True)  # needs_proba must be True to match 'roc_auc', since these metrics use probabilities
scorer_2 = make_scorer(metrics.average_precision_score, needs_proba=True)
scoring = {'roc_auc': scorer_1, 'average_precision': scorer_2}
cv_results = cross_validate(log_reg, X_train_pip, y_train_ech, cv=5, scoring=scoring)
res = sum(cv_results["fit_time"])
print(cv_results["fit_time"], res)
-- AND SECOND ONE --
import numpy as np
from sklearn import metrics
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# numerical_features, X_train_ech and y_train_ech come from my dataset (not shown)
numeric_transformer = Pipeline(
    [
        ('imputer_num', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]
)
preprocessor = ColumnTransformer(
    [
        ('numericals', numeric_transformer, numerical_features)
    ], remainder='drop'
)
# Model choice
pipeline_log = Pipeline(
    [
        ('preprocessing', preprocessor)
    ]
)
# Definition of the different parameters to test
params = {
    'class_weight': ['balanced'],
    'C': np.logspace(-4, 4, 20),
    'solver': ['liblinear'],
    'max_iter': [100],  # 1000
    'penalty': ['l1', 'l2']  # l1
}
# Scorer definition
scorer_1 = make_scorer(metrics.roc_auc_score, needs_proba=True)  # needs_proba must be True to match 'roc_auc', since these metrics use probabilities
scorer_2 = make_scorer(metrics.average_precision_score, needs_proba=True)
scoring = {'roc_auc': scorer_1, 'average_precision': scorer_2}
# Adding SMOTE to the pipeline
X_train_pip = pipeline_log.fit_transform(X_train_ech, y_train_ech)
#X_train_pip = pipeline_log.fit_transform(X_train, y_train)
log_reg = LogisticRegression(max_iter=100, solver='liblinear', class_weight='balanced')
# RandomizedSearchCV definition ('params' replaces the undefined name 'distributions')
clf_log_w = RandomizedSearchCV(log_reg, params, cv=5, scoring=scoring, refit='average_precision', n_iter=10)
grid_result_log_w = clf_log_w.fit(X_train_pip, y_train_ech)
#grid_result_log_w = clf_log_w.fit(X_train_pip, y_train)
#clf_log_w.fit(X_train_pip, y_train)
score_w_log = grid_result_log_w.best_score_
print("Best: %f using %s" % (grid_result_log_w.best_score_, grid_result_log_w.best_params_), "\n")
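As a side note, RandomizedSearchCV already records per-candidate timings in cv_results_, so slow candidates can be identified directly instead of estimating the total from a single cross_validate run. A minimal sketch on synthetic data (hypothetical, not from the report) of inspecting them:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
search = RandomizedSearchCV(
    LogisticRegression(solver="liblinear", max_iter=100),
    {"C": np.logspace(-4, 4, 20), "penalty": ["l1", "l2"]},
    n_iter=5, cv=5, random_state=0,
)
search.fit(X, y)

# One row per sampled candidate: its parameters and mean fit time per fold.
for p, t in zip(search.cv_results_["params"], search.cv_results_["mean_fit_time"]):
    print(p, round(t, 4), "s per fold")
```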
Expected Results
A single cross-validation with one set of parameters takes about 2.5 seconds to execute.
A RandomizedSearchCV with 20 iterations should therefore take roughly 20 times as long, i.e. about 2.5 * 20 = 50 seconds.
Actual Results
The RandomizedSearchCV does not take 50 seconds to execute; it takes so long that it really does not seem normal. Even with n_iter=2 (for which the training time should be about 2.5 * 2 = 5 seconds), it takes more than 10 minutes (maybe much more). It does finish, but I wanted to understand the difference between what my program does and what I thought it should do. Did I do something wrong, or is there an issue in the libraries?
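One assumption in the estimate above is that every sampled candidate costs the same per fit as the single cross_validate run; however, that run uses LogisticRegression() with its defaults, while the search samples liblinear with an l1 penalty and C values up to 1e4, which can need far more solver work. A hypothetical experiment on synthetic data (not from the original report) to compare the per-fit cost of the two configurations:

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=70, random_state=0)

# Compare the defaults used with cross_validate against one of the
# heavier candidates that RandomizedSearchCV may sample.
configs = {
    "defaults": {},
    "liblinear_l1_largeC": {"solver": "liblinear", "penalty": "l1", "C": 1e4},
}
times = {}
for name, kwargs in configs.items():
    start = time.perf_counter()
    LogisticRegression(max_iter=100, **kwargs).fit(X, y)
    times[name] = time.perf_counter() - start
    print(name, round(times[name], 3), "s")
```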
Versions
System:
python: 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\aurel\anaconda3\python.exe
machine: Windows-10-10.0.22000-SP0
Python dependencies:
pip: 20.2.4
setuptools: 50.3.1.post20201107
sklearn: 0.24.2
numpy: 1.21.2
scipy: 1.5.2
Cython: 0.29.23
pandas: 1.3.4
matplotlib: 3.3.2
joblib: 0.17.0
threadpoolctl: 2.1.0
Built with OpenMP: True