[WIP] Allowing optional list of Parallel keyworded parameters within estimators #15689
Conversation
Thanks for the PR. Adding this parameter to all estimators that support …
Force-pushed from 311d1ba to eeb7707
Yes, that could be a good point, provided that joblib is enhanced in the future to allow it. Meanwhile, I revised the PR to also support this. For example, we can now define:

```python
model = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', OneVsRestClassifier(
        LogisticRegression(solver='lbfgs', max_iter=1000),
        verbose=10,
        max_nbytes='1000M'))
])
```

and subsequently use a context manager to select a custom backend (e.g., to compare execution times across backends):

```python
with parallel_backend('threading', n_jobs=-1):
    model.fit(X_train, target_names)
```

Note that parameters defined on a specific classifier take precedence over the ones set by the context manager.
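A minimal sketch of the pattern this PR proposes. The class name `TinyOvRLike` and the `_fit_one` helper are illustrative stand-ins, not the actual scikit-learn API: the point is only that arbitrary keyword arguments are stored at construction time and forwarded verbatim to `joblib.Parallel`, so anything `Parallel` accepts (`n_jobs`, `verbose`, `max_nbytes`, ...) can be set per estimator.

```python
# Illustrative sketch only: `TinyOvRLike` is a made-up class showing how an
# estimator could accept arbitrary joblib.Parallel keyword arguments
# instead of a fixed n_jobs parameter.
from joblib import Parallel, delayed


def _fit_one(x):
    # Stand-in for fitting one binary sub-problem.
    return x * x


class TinyOvRLike:
    def __init__(self, **parallel_params):
        # Everything passed here is forwarded verbatim to Parallel.
        self.parallel_params = parallel_params

    def fit(self, X):
        self.results_ = Parallel(**self.parallel_params)(
            delayed(_fit_one)(x) for x in X
        )
        return self


est = TinyOvRLike(n_jobs=1, verbose=0)
print(est.fit([1, 2, 3]).results_)  # [1, 4, 9]
```

Because the keyword arguments are passed through unmodified, parameters set on the estimator naturally override whatever a surrounding `parallel_backend` context manager would supply as defaults.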
To show the benefit of allowing any `Parallel` parameter:

```python
with parallel_backend('loky', n_jobs=-1):
    model.fit(X_train, target_names)
```

Output:

vs.

```python
with parallel_backend('threading', n_jobs=-1):
    model.fit(X_train, target_names)
```

Output:

(Of course slower, due to the GIL.)
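The backend switching shown above does not depend on this PR's estimators; it works for any joblib-driven computation. A self-contained sketch with a plain `Parallel` call in place of `model.fit` (the `slow_double` helper is made up for illustration):

```python
import time

from joblib import Parallel, delayed, parallel_backend


def slow_double(x):
    time.sleep(0.01)  # simulate work that releases the GIL
    return 2 * x


# The context manager selects the backend for all nested Parallel calls
# that do not pick one explicitly.
with parallel_backend('threading', n_jobs=2):
    out = Parallel()(delayed(slow_double)(i) for i in range(4))

print(out)  # [0, 2, 4, 6]
```

Swapping `'threading'` for `'loky'` changes only the backend; results are identical, which is what makes the timing comparison above meaningful.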
Changing the `OneVsRestClassifier`, `OneVsOneClassifier` and `OutputCodeClassifier` multiclass learning algorithms within `multiclass.py`, replacing the `n_jobs` parameter with a keyworded, variable-length argument list, in order to allow any `Parallel` parameter to be passed and to support the `parallel_backend` context manager. `n_jobs` remains one of the possible parameters, but others can be added, including `max_nbytes`, which can be useful to avoid a `ValueError` when dealing with a large training set processed by concurrently running jobs (`n_jobs > 0` or `n_jobs = -1`).

More specifically, in parallel computing of large arrays with the "loky" backend, [Parallel](https://joblib.readthedocs.io/en/latest/parallel.html#parallel-reference-documentation) sets a default 1-megabyte [threshold](https://joblib.readthedocs.io/en/latest/parallel.html#automated-array-to-memmap-conversion) on the size of arrays passed to the workers. That threshold may not be enough for large arrays and can break jobs with the exception `ValueError: UPDATEIFCOPY base is read-only`. `Parallel` uses `max_nbytes` to control this threshold. Through this fix, the multiclass classifiers offer the option to customize the maximum array size.

Fixes scikit-learn#6614
See also scikit-learn#4597

Changed `_get_args` in `_testing.py` to also accept the `parallel_params` vararg.
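A hedged illustration of the memmapping threshold described above: with the default `max_nbytes='1M'`, arrays larger than one megabyte are converted to read-only memmaps before being handed to workers; raising the threshold (or disabling it with `None`) keeps them as ordinary ndarrays. The array size and the `total` helper below are arbitrary choices for the sketch.

```python
import numpy as np
from joblib import Parallel, delayed

# ~1.6 MB array of float64: above joblib's default 1-megabyte
# array-to-memmap conversion threshold.
X = np.ones((2000, 100))


def total(a):
    return float(a.sum())


# max_nbytes=None disables the automatic memmapping, so workers receive
# ordinary (writable) arrays instead of read-only memmaps.
sums = Parallel(n_jobs=1, max_nbytes=None)(
    delayed(total)(X[i::2]) for i in range(2)
)
print(sums)  # [100000.0, 100000.0]
```

Under this PR, such a value would be forwarded by passing `max_nbytes=None` (or a larger string like `'1000M'`) directly to the multiclass estimator.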
Force-pushed from eeb7707 to afec926
I think this would be solved by using the …
I see it was already proposed by @rth. It will most probably be available in the next …
Edited text and title to reflect the support of the `parallel_backend` context manager.