@ghost ghost commented Nov 11, 2021

#21598 @adrinjalali

Changed the n_jobs parameter from 1 to -1 (auto-detect mode), which halved the time needed to run the module.
[Images: vc_before, vc_after]

@ogrisel
Member

ogrisel commented Nov 12, 2021

Running with -1 by default is problematic because on machines with a large number of CPUs (e.g. 64 or more), spawning the workers can dominate the runtime: starting the Python interpreters and importing the modules causes concurrent access to the hard disk. Furthermore, it can also use too much memory and cause crashes.

This is why we would rather use a small number of workers (e.g. 2 instead of -1) when we want to use parallelism in examples or tests in scikit-learn.
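
To make the worker-count argument concrete, here is a minimal sketch of how a negative `n_jobs` maps to a number of workers under joblib's documented convention (`-1` means all CPUs, `-2` all but one). The helper `resolve_n_jobs` and the CPU counts are hypothetical and only for illustration; this is not joblib's actual implementation.

```python
def resolve_n_jobs(n_jobs, cpu_count):
    """Hypothetical helper: number of workers implied by n_jobs.

    Simplified sketch of joblib's convention, not its actual code.
    """
    if n_jobs is None:
        return 1  # assumption: no parallel backend context is active
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all but one, never fewer than one worker
        return max(cpu_count + 1 + n_jobs, 1)
    return n_jobs


# Compare a laptop-sized machine with a large build server.
for cpus in (4, 64):
    print(
        f"{cpus} CPUs: n_jobs=-1 -> {resolve_n_jobs(-1, cpus)} workers, "
        f"n_jobs=2 -> {resolve_n_jobs(2, cpus)} workers"
    )
```

With `n_jobs=2` the number of spawned interpreters stays the same regardless of the machine, whereas `-1` scales with the CPU count.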

@adrinjalali adrinjalali changed the title Changed n_jobs parameter to increase speed Changed n_jobs parameter to increase speed in plot_validation_curve.py Nov 12, 2021
@adrinjalali
Member

I agree with @ogrisel, and I think the alternative is to find other ways to speed up the example. You can set n_jobs to 2 and then look for other ways to make the example faster.

@ghost
Author

ghost commented Nov 12, 2021

@ogrisel @adrinjalali Okay, that makes sense, thanks for the explanation :) I will set n_jobs to 2 as a first step and then look for further ways to speed up the example.

@adrinjalali adrinjalali mentioned this pull request Nov 12, 2021
41 tasks
Member

@ogrisel ogrisel left a comment

LGTM, let's hope it runs faster on circle ci :)

@adrinjalali
Member

This example uses the digits dataset, and I think that's the main source of it being slow. It'd be nice if you could try either iris or a synthetic dataset to see if you can get similar plots while making it significantly faster (I've seen a 100x speedup in some examples by getting rid of the digits dataset)
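
One way to check a candidate dataset is to time the validation curve on it. The sketch below is only an assumption about how that could be done: the helper name `timed_validation_curve` and the gamma grid are hypothetical, and timings will differ between machines.

```python
import time

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC


def timed_validation_curve(X, y, n_jobs=2):
    """Return (elapsed seconds, mean validation score per gamma value)."""
    param_range = np.logspace(-6, -1, 5)  # assumed gamma grid
    start = time.perf_counter()
    _, test_scores = validation_curve(
        SVC(), X, y, param_name="gamma", param_range=param_range, n_jobs=n_jobs
    )
    elapsed = time.perf_counter() - start
    # Average over the cross-validation folds (axis 1).
    return elapsed, test_scores.mean(axis=1)


# Example usage (timings depend on the machine):
# X, y = load_iris(return_X_y=True)
# elapsed, scores = timed_validation_curve(X, y)
```

Running the same helper on the digits data gives a direct comparison of both the runtime and the shape of the resulting curve.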

@ogrisel
Member

ogrisel commented Nov 12, 2021

I believe the combo of Gaussian RBF + digits is important to get such characteristic validation curves for gamma.

But maybe it would be possible to get similar results with a random sub-sample, or considering a binary classification subproblem such as 1 vs 2 (to make it non trivial):

import numpy as np
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
subset_mask = np.isin(y, [1, 2])  # binary classification: 1 vs 2
X, y = X[subset_mask], y[subset_mask]

Since SVC handles multiclass problems by training one binary classifier per pair of classes (one vs one), that should greatly help ;)

Edit: changed to 1 vs 2 which is slightly harder than 1 vs 7
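
Put together with `validation_curve`, the suggestion could look roughly like the sketch below. It is a hedged illustration, not necessarily the code that was merged: the function name `binary_digits_validation_curve` and the gamma grid are assumptions, and `n_jobs=2` follows the recommendation earlier in this thread.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC


def binary_digits_validation_curve(n_jobs=2):
    """Validation curve for SVC's gamma on the 1-vs-2 digits subproblem."""
    X, y = load_digits(return_X_y=True)
    subset_mask = np.isin(y, [1, 2])  # keep only digits 1 and 2
    X, y = X[subset_mask], y[subset_mask]

    # Assumed grid: five gamma values on a log scale.
    param_range = np.logspace(-6, -1, 5)
    train_scores, test_scores = validation_curve(
        SVC(),
        X,
        y,
        param_name="gamma",
        param_range=param_range,
        scoring="accuracy",
        n_jobs=n_jobs,
    )
    # Average over the cross-validation folds (axis 1).
    return param_range, train_scores.mean(axis=1), test_scores.mean(axis=1)
```

Calling `binary_digits_validation_curve()` returns the gamma values together with the mean training and validation accuracy for each, which can then be plotted against each other.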

@adrinjalali
Member

@sveneschlbeck could you please apply Olivier's suggestion?

@ghost
Author

ghost commented Nov 16, 2021

@adrinjalali Yes, am on it!

@ghost
Author

ghost commented Nov 16, 2021

@adrinjalali @ogrisel The change makes a big difference in execution time (18 sec vs. 3 sec), but the "C" isn't as big and clearly shaped as before. What do you think? Should I change the code given this result?
[Images: before1 1, before1 2, after1 1, after1 2]

@adrinjalali
Member

To me it still shows the effect the same way, I'd be happy with it.

@adrinjalali adrinjalali merged commit 7bfa9cc into scikit-learn:main Nov 16, 2021
@ghost ghost deleted the speed_increased_example_valcurve branch November 16, 2021 17:53
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Nov 22, 2021
* Changed n_jobs parameter to increase speed

* Update plot_validation_curve.py

* Update plot_validation_curve.py
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Nov 29, 2021
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Dec 24, 2021
glemaitre pushed a commit that referenced this pull request Dec 25, 2021