What's the point in this line and this function? #5820
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/bagging.py#L344
I understand that it slices the list of tasks into batches, but why do we need this if joblib provides the ability to adjust the batch size automatically and groups the tasks into batches for each process on its own? Even if you don't want to spend time on the automatic adjustment, you can just set
batch_size = int(n_tasks / n_jobs)
and it will work the same way as the explicit slicing. https://pythonhosted.org/joblib/parallel.html#parallel-reference-documentation
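For context, a minimal runnable sketch of the proposal above (illustrative only; the task function and the sizes are invented, not scikit-learn code):

```python
from joblib import Parallel, delayed

def task(i):
    # Stand-in for one per-estimator unit of work.
    return i * i

n_tasks, n_jobs = 100, 4

# Explicit batching, as suggested above:
results = Parallel(n_jobs=n_jobs, batch_size=int(n_tasks / n_jobs))(
    delayed(task)(i) for i in range(n_tasks)
)

# Or keep batch_size='auto' (the default) and let joblib grow or shrink
# batches at runtime based on how long recent batches took:
results_auto = Parallel(n_jobs=n_jobs)(
    delayed(task)(i) for i in range(n_tasks)
)
```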
Joblib didn't have the automatic batching back when this code was written, so it may not be necessary anymore. However, before removing it, we would need careful benchmarking.
Yes, I've checked the sources now, and this function is used only from two files, bagging.py and forest.py; in the latter case it is used only for computing the minimal number of threads. Not a big deal, as it turned out. I'll try to run some benchmarks.
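The helper resembles the following sketch (an approximation, not the exact scikit-learn source): the effective number of jobs is capped by the number of estimators, which are then spread as evenly as possible across the jobs.

```python
import numpy as np

def partition_estimators(n_estimators, n_jobs):
    # Approximation of the private helper discussed here: never use more
    # jobs than there are estimators, then split the estimators as
    # evenly as possible over the jobs.
    n_jobs = min(n_jobs, n_estimators)
    counts = np.full(n_jobs, n_estimators // n_jobs, dtype=int)
    counts[: n_estimators % n_jobs] += 1
    starts = np.concatenate([[0], np.cumsum(counts)])
    return n_jobs, counts.tolist(), starts.tolist()

# partition_estimators(10, 4) -> (4, [3, 3, 2, 2], [0, 3, 6, 8, 10])
```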
Results of current master:
Automatically adjusting batch sizes, from this branch: https://github.com/olologin/scikit-learn/blob/bagging_refactoring/sklearn/ensemble/bagging.py
I should say that in the current forest.py on master, all the code that works with joblib is written in the same manner (as in bagging.py from the branch above).
Can you run the same benchmarks on much larger datasets? The reason behind writing things this way was to minimize overhead by transferring the function arguments (i.e., copying X and y) only once per core.
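To illustrate that point, a sketch (with assumed names, not the actual bagging.py code) of the hand-rolled pattern: one delayed call per job, so X and y cross the process boundary once per worker rather than once per estimator.

```python
import numpy as np
from joblib import Parallel, delayed

def build_chunk(n_estimators, X, y, seed):
    # Stand-in for the per-job worker in bagging.py: receives the full
    # X, y once and then builds its whole chunk of estimators locally.
    rng = np.random.RandomState(seed)
    return [rng.choice(len(X), size=len(X)) for _ in range(n_estimators)]

def fit_sliced(n_estimators, n_jobs, X, y):
    # One task per job: the large arguments are pickled once per worker.
    counts = [len(s) for s in np.array_split(np.arange(n_estimators), n_jobs)]
    chunks = Parallel(n_jobs=n_jobs)(
        delayed(build_chunk)(c, X, y, seed) for seed, c in enumerate(counts)
    )
    return [est for chunk in chunks for est in chunk]
```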
Is the digits dataset big enough?
Hmm, interesting, I thought that joblib automatically serializes them only once.
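As far as I understand joblib's behavior (stated here as an assumption, not from its documentation): each delayed(...) call holds its own reference to its arguments, and joblib pickles the arguments of every batch it dispatches to a worker, so a large array shared by many tasks is serialized roughly once per dispatched batch rather than once per run. (joblib can also memmap sufficiently large arrays, which avoids some of these copies.)

```python
import numpy as np
from joblib import Parallel, delayed

X = np.random.rand(200, 50)  # small enough that joblib won't memmap it

# Each task below carries its own reference to X, and the arguments of
# every dispatched batch are pickled, so X is serialized roughly once
# per batch here (about 4 times), not once for the whole run.
out = Parallel(n_jobs=4, batch_size=25)(
    delayed(np.sum)(X) for _ in range(100)
)
```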
Results of current master:
Automatically adjusted batch sizes:
I tested both versions on the same machine, with the same number of processes and the same CPU load average, but these results look like random numbers, maybe because of the very small difference between the versions. I'm using IPython and Python 3.4.3 to obtain these results.
I'll add an additional script which performs the same test but with the timeit module (because it seems that IPython's %timeit ignores the -n parameter), and I think these results are more accurate.
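The script itself is not preserved in the thread; a sketch of what such a timeit comparison could look like (the model configuration and the clf1/clf2 names are assumptions on my part):

```python
import timeit

from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier

digits = load_digits()
X, y = digits.data, digits.target

# In practice this would be run once per code version (master vs. the
# refactoring branch); both fits are shown here only to mirror the
# clf1/clf2 names used in the discussion below.
clf1 = BaggingClassifier(n_estimators=100, n_jobs=4).fit(X, y)
clf2 = BaggingClassifier(n_estimators=100, n_jobs=4).fit(X, y)

for name, clf in [("clf1.predict", clf1), ("clf2.predict", clf2)]:
    best = min(timeit.repeat(lambda: clf.predict(X), number=5, repeat=3))
    print(name, best / 5, "seconds per call")
```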
Results:
Ratio Before/After:
We can see from it that some timing degradation occurs in clf1.predict and clf2.predict. I don't know why; maybe because these operations are relatively fast compared to some of joblib's internal routines (batch_size adjustment, maybe some serialization overhead).
I don't think this is relevant any more; feel free to reopen.