Multiprocessing fix for dictionary learning main loop #4773
Conversation
Copying X before the Lasso.fit call, for the lasso_cd algorithm, to avoid the buffer source-array read-only error.
```diff
@@ -110,7 +110,8 @@ def _sparse_encode(X, dictionary, gram, cov=None, algorithm='lasso_lars',
         clf = Lasso(alpha=alpha, fit_intercept=False, precompute=gram,
                     max_iter=max_iter, warm_start=True)
         clf.coef_ = init
-        clf.fit(dictionary.T, X.T)
+        # Copying X to avoid buffer source-array read only error
+        clf.fit(dictionary.T, X.copy().T)
```
Hum. That's a workaround but not nice. Can you add a test to see what's wrong with Lasso and a read-only X?
Debugging this with the data that causes the error, I cannot see any indication that X is read-only:

X.flags.writeable = True

Adding a unit test that would currently fail could be tricky, as the error is only triggered by a very large X.
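For reference, a minimal reproduction sketch (not from the PR), assuming the read-only buffer comes from joblib memmapping large inputs in the workers; on affected scikit-learn versions the fit raised the read-only error, while fixed versions simply pass:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(5, 49)            # (n_samples, n_features) batch
dictionary = rng.randn(8, 49)   # (n_components, n_features)

# Emulate what joblib memmapping does to large inputs in the workers:
# the buffer backing X becomes read-only.
X.flags.writeable = False

clf = Lasso(alpha=0.1, fit_intercept=False)
# Mirrors the call in _sparse_encode; on affected versions this raised
# "ValueError: buffer source array is read-only".
clf.fit(dictionary.T, X.T)
```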
```diff
         this_code = sparse_encode(this_X, dictionary.T, algorithm=method,
-                                  alpha=alpha, n_jobs=n_jobs).T
+                                  alpha=alpha, n_jobs=1).T
```
are you disabling joblib then?
Using sparse_encode with n_jobs=1 actually disables joblib.Parallel. I think this is the only way if we do not want to manage the workers manually.
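A minimal usage sketch (not from the PR) of what the changed line amounts to: calling sparse_encode with n_jobs=1 keeps the whole encoding in the calling process, so no worker pool is spawned for this step.

```python
import numpy as np
from sklearn.decomposition import sparse_encode

rng = np.random.RandomState(0)
X = rng.randn(10, 49)           # (n_samples, n_features) batch
dictionary = rng.randn(8, 49)   # (n_components, n_features)

# n_jobs=1: everything runs sequentially in this process, no joblib pool.
code = sparse_encode(X, dictionary, algorithm='lasso_cd', alpha=0.1, n_jobs=1)
print(code.shape)  # (10, 8)
```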
There is a gain in the transform / sparse_encode function, once the dictionary has been computed. However, during the alternating process of online dictionary learning (sparse_encode -> _update_dict), using n_jobs > 1 actually slows down the process (see #4769). This is an inherent joblib issue, as new workers are set up at each iteration.

However, this probably depends on the batch size: if it is large, multiprocessing should still improve performance.
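A rough timing sketch (not part of the PR) of the effect being described: creating a fresh Parallel pool at every iteration of a loop with tiny tasks, versus running the same work sequentially. Exact numbers depend on the machine, but the per-iteration pool setup and teardown cost is what dominates.

```python
from time import time

import numpy as np
from joblib import Parallel, delayed


def tiny_task(row):
    # stand-in for the per-sample sparse coding work
    return row.sum()


batches = [np.random.randn(3, 49) for _ in range(50)]

t0 = time()
for batch in batches:
    # a new worker pool is created and torn down at every iteration
    Parallel(n_jobs=2)(delayed(tiny_task)(row) for row in batch)
print("n_jobs=2, fresh pool each iteration: %.2fs" % (time() - t0))

t0 = time()
for batch in batches:
    [tiny_task(row) for row in batch]  # sequential, no pool overhead
print("sequential: %.2fs" % (time() - t0))
```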
Copying the data is the wrong fix to address the read-only crash. We should not copy the data but instead ensure that fitting works on read-only data, as discussed in #4772. I think we should close this PR.
OK, I can open another one to set n_jobs=1 within the dict_learning_online loop if batch_size is too small.
If batch_size is too small and n_jobs is big, it should display a warning, but I am afraid of magic parameter overwrites. Your problem setup is maybe not everyone's.
Granted. Should we make it possible for the user to have sequential dictionary learning and parallel sparse encoding?
Cannot tell without benchmarking.
The correct fix might be to merge joblib/joblib#157 in joblib and then synchronize the joblib that is embedded in master. @arthurmensch would you be interested in testing whether that branch of joblib actually solves your speed issues in dictionary learning? You can check out the joblib branch and then run:
```python
try:
    import joblib
    from sklearn.externals import joblib as skl_joblib
    print('Monkeypatching scikit-learn embedded joblib')
    for k, v in vars(joblib).items():
        setattr(skl_joblib, k, v)
except ImportError:
    pass
```
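A quick sanity check, assuming the old sklearn.externals.joblib vendoring used in the snippet above: after the monkeypatch, the embedded copy should report the same version as the standalone joblib.

```python
import joblib
from sklearn.externals import joblib as skl_joblib

# __version__ is among the attributes copied over by the monkeypatch above.
print(joblib.__version__, skl_joblib.__version__)
```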
I am not sure that this tackles the issue, as we do not have a large number of tasks to provide to the worker pool. It is possible that the algorithm still works running sparse coding and dictionary update fully in parallel (just like a coordinate descent does), but I have not seen any evidence of this in the literature.
I tested it with the joblib/joblib#157 PR: there is actually a slight drop in performance, which was expected. The only way to improve this would be to initialize the pool of workers only once and reuse it across iterations.
I agree with your analysis. joblib/joblib#157 solves the problem when we do a single call to Parallel. Calling Parallel repeatedly, as in the online dictionary learning loop, involves at each call: 1. starting the worker pool, 2. dispatching the jobs, 3. terminating the worker pool.

If 2. is short (e.g. very few jobs to dispatch), the overhead of steps 1. and 3. will dominate and will completely ruin the performance. This is the case for a typical call to sparse_encode on a small batch inside the online dictionary learning loop.
@arthurmensch can you try to increase the batch_size? If typical optimal values for batch_size turn out to be much larger than the current default, we should probably reconsider that default.
Benchmark: https://gist.github.com/arthurmensch/f688c9ee5fdce1803cde

On Linux, with vectors of dimension 49, batches must be of size 200 to hide the joblib overhead. I will compare the results quality for a comparable number of single-vector lasso resolutions. I think that this also depends on the vector dimension though, which makes the selection of a good default batch_size harder.
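Not the linked gist, but a rough sketch of how such a batch_size / n_jobs comparison can be run on random data with MiniBatchDictionaryLearning; the parameter values are illustrative only.

```python
from time import time

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.RandomState(0)
X = rng.randn(2000, 49)  # e.g. flattened 7x7 image patches

for batch_size in (3, 10, 50, 200):
    for n_jobs in (1, 2):
        est = MiniBatchDictionaryLearning(
            n_components=30, batch_size=batch_size, n_jobs=n_jobs,
            random_state=0)
        t0 = time()
        est.fit(X)
        print("batch_size=%3d  n_jobs=%d  %.2fs"
              % (batch_size, n_jobs, time() - t0))
```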
Please also put the figure generated by the benchmark in the pull request. It makes it easier for everybody.
I updated the gist; here is a benchmark for different p dimensions (X \in R^{n x p}), which correspond to the patch size when using an image as input. We can see that for the default

[1] J. Mairal, F. Bach, J. Ponce, G. Sapiro, 2009: Online dictionary learning for sparse coding (http://www.di.ens.fr/sierra/pdfs/icml09.pdf)
It strikes me that we probably need to increase the batch size a bit, say set it to 10. This matches my experience on other problems.
Using such a default we should then set
Is this a general remark or for this PR? According to the benchmark, even at @arthurmensch can you try at
> Is this a general remark or for this PR?

General remark.
Have you checked that all the models with different batch_size values give results of comparable quality?
What it means is that, in order to gain from multiprocessing, we have to use a large batch size.
Thanks @arthurmensch. Based on your analysis I would be in favor of setting Furthermore, in the
Do we want to support explicit pools that are not recreated all the time?
I think this would be a good idea, as this problem has been appearing several times in the past. What I am thinking of is to make it possible to use a Parallel instance as a context manager:

```python
with Parallel(self.n_jobs) as p:
    for stuff in stuffs:
        results = p(delayed(some_function)(x) for x in stuff)
        # do something else with intermediate results (synchronization barrier)
```

This would mean that the pool of workers is created once and reused across the iterations of the loop. Of course we would still keep the current behavior when the caller wants to call Parallel without the context manager.
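A hypothetical sketch of how this pattern would map onto the online dictionary learning loop; encode_one and update_dict are toy stand-ins, not the scikit-learn internals, and only the pool reuse is the point here.

```python
import numpy as np
from joblib import Parallel, delayed


def encode_one(x, dictionary):
    # toy "sparse code": least-squares fit instead of a real lasso solve
    return np.linalg.lstsq(dictionary.T, x, rcond=None)[0]


def update_dict(dictionary, batch, codes):
    # toy dictionary update nudging D toward reconstructing the batch
    return dictionary + 0.01 * (codes.T @ batch - codes.T @ codes @ dictionary)


rng = np.random.RandomState(0)
batches = [rng.randn(5, 49) for _ in range(10)]  # mini-batches of samples
dictionary = rng.randn(8, 49)                    # (n_components, n_features)

with Parallel(n_jobs=2) as parallel:             # pool created once
    for this_X in batches:
        codes = np.asarray(parallel(
            delayed(encode_one)(x, dictionary) for x in this_X))
        dictionary = update_dict(dictionary, this_X, codes)  # sync barrier
```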
BTW, even if we improve joblib to make it possible to reuse a pool with a context manager, I think it would still be good to implement #4773 (comment) as a stopgap / short-term solution.
That would be pretty awesome!
The joblib context manager API is implemented and available in #5016. We did some profiling with @arthurmensch and apparently we should benefit a lot from skipping redundant inner input data validation checks by introducing a way to bypass them in the inner calls.
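For illustration, a sketch of the general idea using the check_input argument that Lasso.fit exposes in current scikit-learn releases; whether this is the exact mechanism the comment had in mind is not stated here. The input is validated once, then the repeated per-call checks are skipped inside a hot loop.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.utils import check_array

rng = np.random.RandomState(0)
# validate / convert once: Fortran-ordered float64, as coordinate descent expects
X = check_array(rng.randn(100, 20), order='F', dtype=np.float64)
y = rng.randn(100)

clf = Lasso(fit_intercept=False)
for alpha in (0.1, 0.05, 0.01):
    clf.set_params(alpha=alpha)
    # check_input=False skips the repeated validation of X and y on each call
    clf.fit(X, y, check_input=False)
```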
It looks like there is a plan to address the issue in @ogrisel's last comment, but I presume no one's gone ahead and done it? Perhaps this should be added to the issue description and the PR closed as the wrong fix.
Any progress on this? Not sure if this is related, but I get the same error ('output array is read only') when running randomized Lasso with more than 1 CPU. Same as in:
- Setting n_jobs=1 during the online dictionary learning process.
- Copying X before the Lasso.fit call, for the lasso_cd algorithm, to avoid the buffer source-array read-only error (see discussion).

Fixes #4772, #4769.