RandomizedSearchCV's training time is much longer than the sum of cross_validate training times #22716
Comments
I'm relatively new to scikit-learn and hope I can provide some help with this problem. I tested the fitting time for a 5-fold CV and a 10-iteration RandomizedSearchCV using a dataset of 90k samples. Five solvers were tested, each using the same LogisticRegression classifier and preprocessing pipeline provided in this issue.
Even though the solvers have different efficiencies, the overall time ratio between CV and RSCV does not deviate much from 1:10, so I don't think the problem comes from RandomizedSearchCV itself. I suspect it is related to the liblinear solver instead: when I tested it, liblinear seemed to take a very long time to train on some particular inputs.
For example, if we gradually increase the number of training samples, at 100k samples the fitting time for one 5-fold CV jumps to 17 minutes (code for reproducing this is also in the colab notebook). I don't know whether this is a bug or a characteristic of the liblinear solver. It might be related to issue #18264, since in our case the data is also centered with StandardScaler. The slowness of liblinear also shows up in other posts: here, liblinear takes 10177s to fit a 1M-sample dataset while lbfgs takes 10s, and here, for MNIST, liblinear takes 2893s while lbfgs takes 53s. For now, I guess simply not using liblinear will solve the problem.
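For reference, a minimal sketch of the kind of per-solver benchmark described above; the synthetic data, sizes, and solver list here are my assumptions, not the original colab notebook:

import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 90k-sample dataset mentioned above.
X, y = make_classification(n_samples=90_000, n_features=20, random_state=0)

# Time one 5-fold CV per solver; liblinear is the one reported as slow.
for solver in ["lbfgs", "liblinear", "newton-cg", "sag", "saga"]:
    pipe = make_pipeline(
        StandardScaler(), LogisticRegression(solver=solver, max_iter=1000)
    )
    start = time.perf_counter()
    cross_validate(pipe, X, y, cv=5)
    print(f"{solver}: {time.perf_counter() - start:.1f}s for one 5-fold CV")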
I agree with #22716 (comment).
param_distribs = {}  # contents not preserved in this comment
Describe the bug
I am currently working on a project and I have to choose between five machine learning algorithms.
My dataset is very large and has more than 70 columns.
To estimate runtimes, I first measured the training time of a cross-validation with one set of parameters, to see approximately how long a RandomizedSearchCV over 10 or more parameter sets would take.
I created a variable which is the sum of the training times, and I want to compare it with the training time of my RandomizedSearchCV.
The issue is: I expected a RandomizedSearchCV with 20 iterations (n_iter=20) to take roughly 20 times longer to execute than a cross_validate call that tests only one set of parameters, but that is really not the case.
Steps/Code to Reproduce
-- AND SECOND ONE --
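A minimal sketch of the two timing measurements being compared; the data shape, pipeline, and parameter distributions are assumptions based on the discussion above, not the original code:

import time
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed stand-in for the large dataset with 70+ columns.
X, y = make_classification(n_samples=50_000, n_features=70, random_state=0)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# First measurement: one 5-fold cross-validation with a single parameter set.
start = time.perf_counter()
cross_validate(pipe, X, y, cv=5)
print(f"cross_validate (1 candidate): {time.perf_counter() - start:.1f}s")

# Second measurement: 20 random candidates, naively expected to take ~20x longer.
# Including liblinear in the search space reproduces the blow-up discussed above.
search = RandomizedSearchCV(
    pipe,
    param_distributions={
        "clf__C": loguniform(1e-3, 1e3),        # hypothetical
        "clf__solver": ["lbfgs", "liblinear"],  # hypothetical
    },
    n_iter=20,
    cv=5,
    random_state=0,
)
start = time.perf_counter()
search.fit(X, y)
print(f"RandomizedSearchCV (20 candidates): {time.perf_counter() - start:.1f}s")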
Expected Results
A single cross-validation takes my program 2.5 seconds to execute,
so a RandomizedSearchCV with 20 iterations should take about 20 times longer: 2.5 * 20 = 50 seconds.
Actual Results
The RandomizedSearchCV does not take 50s to execute; it takes so long that it really does not seem normal. Even with n_iter=2 (for which training should take about 2.5 * 2 = 5 seconds) it takes more than 10 minutes (maybe much more). It does work, but I wanted to understand the difference between what my program does and what I thought it should do. Did I do something wrong, or is there an issue in the library?
Versions