
Multiprocessing fix for dictionary learning main loop #4773


Conversation

arthurmensch
Contributor

Setting n_jobs = 1 during online dictionary learning process

Copying X before Lasso.fit call, for lasso_cd algorithm, to avoid buffer source-array read-only error (see discussion)

Fixing issues: #4772, #4769

Copying X before Lasso.fit call, for lasso_cd algorithm, to avoid buffer source-array read-only error
@@ -110,7 +110,8 @@ def _sparse_encode(X, dictionary, gram, cov=None, algorithm='lasso_lars',
         clf = Lasso(alpha=alpha, fit_intercept=False, precompute=gram,
                     max_iter=max_iter, warm_start=True)
         clf.coef_ = init
-        clf.fit(dictionary.T, X.T)
+        # Copying X to avoid buffer source-array read only error
+        clf.fit(dictionary.T, X.copy().T)
Member

Hum. That's a workaround but not a nice one. Can you add a test to see what's wrong with Lasso and a read-only X?

Contributor Author

Debugging this with the data that causes the error, I cannot see any indication of X being read-only:

X.flags.writeable = True

Adding a unit test that would currently fail could be tricky, as the error is only triggered by a very large X.
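For what it's worth, a minimal sketch of how a read-only X might be simulated in a small test, by memory-mapping the array the way joblib does for large inputs (whether this reproduces the reported error is an assumption, not a confirmed test):

import os
from tempfile import mkdtemp

import numpy as np
from sklearn.decomposition import sparse_encode

rng = np.random.RandomState(0)
X = rng.randn(10, 8)
dictionary = rng.randn(5, 8)

# Memory-map X in read-only mode to mimic what joblib does with large arrays.
path = os.path.join(mkdtemp(), 'X.npy')
np.save(path, X)
X_readonly = np.load(path, mmap_mode='r')

# On affected versions, algorithm='lasso_cd' would be expected to raise the
# "buffer source array is read-only" error here.
codes = sparse_encode(X_readonly, dictionary, algorithm='lasso_cd')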


Contributor Author

I think it is related to the way joblib handles shared arrays. It has some common points with the closed bug #4597.
I added an example in issue #4772.

         this_code = sparse_encode(this_X, dictionary.T, algorithm=method,
-                                   alpha=alpha, n_jobs=n_jobs).T
+                                   alpha=alpha, n_jobs=1).T
Member

are you disabling joblib then?

Contributor Author

Using sparse_encode with n_jobs=1 actually disables joblib.Parallel. I think this is the only way if we do not want to manage workers manually.


Contributor Author

There is a gain in the transform / sparse_encode function, once the dictionary has been computed. However, during the alternating process of online dictionary learning (sparse_encode -> _update_dict), using n_jobs > 1 actually slows down the process (see #4769). This is an issue inherent to joblib, as new workers are set up at each iteration.

However, this should probably depend on the batch dimension: if it is large, multiprocessing should still improve performance.
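A rough way to see the per-iteration worker setup cost being described (a toy sketch with arbitrary numbers, not the actual dictionary learning code):

from time import time

from joblib import Parallel, delayed

def tiny_task(i):
    return i * i

t0 = time()
for _ in range(20):
    # Each call spawns and tears down a fresh pool of workers.
    Parallel(n_jobs=2)(delayed(tiny_task)(i) for i in range(3))
print('20 parallel calls: %.2fs' % (time() - t0))

t0 = time()
for _ in range(20):
    # Sequential execution, no worker management at all.
    [tiny_task(i) for i in range(3)]
print('20 sequential loops: %.4fs' % (time() - t0))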

@arthurmensch arthurmensch changed the title Multiprocessing fix for dictionary elarning main loop, copying design matrix to avoid lasso_cd error on large dataset Multiprocessing fix for dictionary learning main loop, copying design matrix to avoid lasso_cd error on large dataset May 27, 2015
@ogrisel
Member

ogrisel commented May 27, 2015

Copying the data is the wrong fix to address the read-only crash. We should not copy the data but instead ensure that fitting works on read-only data, as discussed in #4772. I think we should close this PR.

@arthurmensch
Contributor Author

OK, I can open another one to set n_jobs = 1 within the dict_learning_online loop if batch_size is too small.

@agramfort
Member

agramfort commented May 27, 2015 via email

@arthurmensch
Contributor Author

Granted. Should we make it possible for the user to have sequential dictionary learning and parallel sparse encoding?

@agramfort
Member

agramfort commented May 27, 2015 via email

@ogrisel
Member

ogrisel commented May 27, 2015

The correct fix might be to merge joblib/joblib#157 in joblib and then synchronize the joblib that is embedded in master.

@arthurmensch would you be interested in testing whether this branch of joblib actually solves your speed issues in dictionary learning with n_jobs > 1?

You can check out the branch of joblib, then run python setup.py develop to install it in dev mode, and finally create the following file in your home folder to use your joblib version instead of the one embedded in scikit-learn:

~/pythonstartup.py:

try:
    import joblib
    from sklearn.externals import joblib as skl_joblib
    print('Monkeypatching scikit-learn embedded joblib')
    for k, v in vars(joblib).items():
        setattr(skl_joblib, k, v)
except ImportError:
    pass

@arthurmensch
Contributor Author

I am not sure that this tackles the issue, as we do not have a large number of tasks to submit to _sparse_encode at the same time. At each iteration, we run sparse_encode on a small batch, wait for _update_dict to finish, and rerun sparse_encode; each call to sparse_encode invokes the atomic _sparse_encode batch_size times, so we cannot expect a large improvement from batching these calls within one sequential call to sparse_encode. I hope this is clear enough...

It is possible that the algorithm would still work running sparse coding and dictionary updates fully in parallel (just as a coordinate descent does), but I have not seen any evidence of this in the literature.
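For readers following along, a highly simplified sketch of the alternation being described; update_dictionary below is a hypothetical stand-in for the private dictionary update step, not the actual scikit-learn code:

import numpy as np
from sklearn.decomposition import sparse_encode

def online_dict_learning(X, dictionary, n_iter, batch_size, alpha, update_dictionary):
    # X: (n_samples, n_features), dictionary: (n_components, n_features)
    n_samples = X.shape[0]
    for it in range(n_iter):
        # 1) pick a small mini-batch
        start = (it * batch_size) % n_samples
        batch = X[start:start + batch_size]
        # 2) sparse coding: only batch_size small lasso problems to dispatch
        code = sparse_encode(batch, dictionary, algorithm='lasso_lars',
                             alpha=alpha, n_jobs=1)
        # 3) the dictionary update must wait for the codes (synchronization)
        dictionary = update_dictionary(dictionary, batch, code)
    return dictionary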

@arthurmensch arthurmensch changed the title Multiprocessing fix for dictionary learning main loop, copying design matrix to avoid lasso_cd error on large dataset Multiprocessing fix for dictionary learning main loop May 27, 2015
@arthurmensch
Contributor Author

I tested it with the joblib/joblib#157 PR: there is actually a slight drop in performance, which was expected. sparse_encode runs batch_size tasks, i.e. few jobs, and batching these tasks adds overhead without allowing a gain in performance, as there are too few tasks.

The only way to improve this would be to initialize n_jobs workers before the main loop and reuse them in every sparse_encode call. However, I gather that using multiprocessing directly is not permitted within scikit-learn code.

@ogrisel
Member

ogrisel commented May 28, 2015

I agree with your analysis. joblib/joblib#157 solves the problem when we do a single call to Parallel.__call__ with a very large number of short jobs, not the case where we call Parallel.__call__ many times with a small number of short jobs each time, as is the case here.

Calling Parallel.__call__ many times has a very large overhead. With the current multiprocessing backend, each call to Parallel.__call__ does:

  1. create a Pool of subprocesses and the queues and sockets to communicate with those,
  2. dispatch the processing to workers and collect the results,
  3. terminate the subprocesses and destroy all the communication infrastructure.

If 2. is short (e.g. very few jobs to dispatch), the overhead of steps 1. and 3. will dominate and completely ruin the performance. This is the case for a typical call to dict_learning_online where batch_size is small. The cost of 1. and 3. is typically on the order of 150 ms on a POSIX platform with multiprocessing's default fork start method. Under Windows it's typically twice that overhead.

batch_size is 3 in dict_learning_online by default, and it's likely that increasing batch_size will negatively impact the convergence rate of the outer online dictionary learning algorithm (to be confirmed, ping @GaelVaroquaux).

@arthurmensch can you try to increase the batch_size parameter to something high enough to hide the overhead without breaking the convergence (e.g. maybe try batch_size=100 or batch_size=1000)? Please compare the quality of the result with a model trained with the default parameters batch_size=3 and n_jobs=1.

If typical optimal values of batch_size for dict_learning_online are always in the range 1-10 and each individual call to sparse_encode is shorter than a couple of seconds, then I see no point in trying to use multiprocessing inside this loop. The joblib API does not make it possible to keep a pool open, and even if it did, the joblib dispatch overhead (the communication with the subprocesses) would probably never make it advantageous to parallelize the inner loop of dict_learning_online.

@arthurmensch
Contributor Author

Benchmark: https://gist.github.com/arthurmensch/f688c9ee5fdce1803cde

Output with batch_size = 200, n_jobs = 1, 2, 3:

Distorting image...
Extracting reference patches...
done in 0.02s.
Learning the dictionary...
done in 3.18s.
Learning the dictionary...
done in 3.29s.
Learning the dictionary...
done in 2.20s.

Process finished with exit code 0

On Linux, with vectors of dimension 49, batches must be of size 200 to hide the joblib overhead. I will compare the quality of the results for a comparable number of single-vector lasso resolutions (hence n_iter = 2000 / batch_size in the dictionary init parameters).

I think this also depends on the vector dimension though, which makes the selection of n_jobs more complex.
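A hypothetical reconstruction of that benchmark loop, keeping the total number of single-vector lasso resolutions comparable by scaling n_iter with batch_size (the parameter values are assumptions, not the actual gist):

from time import time

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

def bench(patches, batch_size, n_jobs):
    # Scale n_iter so the total number of coded vectors stays comparable.
    # n_iter was the parameter name at the time of this discussion
    # (newer scikit-learn releases use max_iter instead).
    dico = MiniBatchDictionaryLearning(n_components=100, alpha=1,
                                       n_iter=2000 // batch_size,
                                       batch_size=batch_size,
                                       n_jobs=n_jobs)
    t0 = time()
    dico.fit(patches)
    return time() - t0

rng = np.random.RandomState(0)
patches = rng.randn(2000, 49)   # vectors of dimension 49, as above
for n_jobs in (1, 2, 3):
    print('n_jobs=%d: %.2fs' % (n_jobs, bench(patches, 200, n_jobs)))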

@GaelVaroquaux
Member

GaelVaroquaux commented May 29, 2015 via email

@arthurmensch
Contributor Author

I updated the gist; here is a benchmark for different dimensions p (X \in R^{n x p}), which correspond to the patch size when using an image as input. We can see that for the default batch_size = 3, using more than one job is always slower. We obtain slightly better results using n_jobs > 1 for p = 81, batch_size = 300, but such a batch size is too large in an online setting, following [1]. I have not yet checked whether all the obtained results are correct; I will check this in the cases where setting n_jobs > 1 yields faster results.

[benchmark figure]

[1] J. Mairal, F. Bach, J. Ponce, G. Sapiro (2009). Online dictionary learning for sparse coding. http://www.di.ens.fr/sierra/pdfs/icml09.pdf

@GaelVaroquaux
Member

It strikes me that we probably need to increase a bit the batch size, say set it to 10. This matches my experience on other problems.


@arthurmensch
Contributor Author

With such a default we should then set n_jobs = 1 for the dictionary learning phase, given the benchmark above, or at least raise a warning.

@ogrisel
Member

ogrisel commented Jun 1, 2015

It strikes me that we probably need to increase a bit the batch size, say set it to 10. This matches my experience on other problems.

Is this a general remark or for this PR? According to the benchmark, even at batch_size=100, n_jobs=2 is no faster than n_jobs=1.

@arthurmensch can you try batch_size=1000 and report the run time for n_jobs=2 and n_jobs=1? If n_jobs=2 is not at least 1.2x faster, then I think we should force n_jobs=1 inside the mini-batch loop.

@arthurmensch
Contributor Author

At n = 1000, performance is consistently improved:

[benchmark figure]

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 1, 2015 via email

@ogrisel
Member

ogrisel commented Jun 2, 2015

At n = 1000, performance is consistently improved

Have you checked that all the models with different batch_size converge to solutions of similar quality (according to some metric computed on both a training and a validation set)?

@arthurmensch
Contributor Author

With batch_size from 3 to 1000, we obtain satisfactory results for image denoising (L2 distance plot to come). However, running dictionary learning with batch_size = 1000 and n_jobs = 1, the convergence time is 3 times larger than with batch_size = 10. This appears in the curves above, where n = 10 seems to be the argmin of the benchmark.

What this means is that, in order to gain from multiprocessing, we have to use a batch_size so large that it divides the single-processor performance by 3. Except in some exceptional use cases, this should prevent the user from setting such a large batch_size, and thus from ever benefiting from multiprocessing.

Around n = 200, the gains and losses of multiprocessing appear to compensate each other, but such batches are already too big for optimal performance.

@ogrisel
Member

ogrisel commented Jun 8, 2015

Thanks @arthurmensch. Based on your analysis I would be in favor of setting batch_size=10 by default, both in the dict_learning_online function and in the MiniBatchDictionaryLearning class.

Furthermore, in the fit and partial_fit methods of MiniBatchDictionaryLearning, I think we should call the inner dict_learning_online with n_jobs=self.n_jobs when batch_size > 100 and with n_jobs=1 otherwise (this should be the default). self.n_jobs would always be used to parallelize calls in the MiniBatchDictionaryLearning.transform method.
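A minimal sketch of that rule as a standalone helper (the threshold of 100 comes from the proposal above; the helper name is hypothetical, not scikit-learn code):

def effective_fit_n_jobs(n_jobs, batch_size, threshold=100):
    # Parallelism in the mini-batch fit loop only pays off when each batch
    # is large enough to amortize the worker setup/teardown overhead.
    return n_jobs if batch_size > threshold else 1

print(effective_fit_n_jobs(n_jobs=4, batch_size=3))     # -> 1
print(effective_fit_n_jobs(n_jobs=4, batch_size=1000))  # -> 4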

@amueller
Member

amueller commented Jun 8, 2015

Do we want to support explicit pools that are not recreated all the time?

@ogrisel
Member

ogrisel commented Jun 24, 2015

Do we want to support explicit pools that are not recreated all the time?

I think this would be a good idea, as this problem has appeared several times in the past. What I am thinking of is to make it possible to use a Parallel instance as a context manager:

with Parallel(self.n_jobs) as p:
    for stuff in stuffs:
        results = p(delayed(some_function)(x) for x in stuff)
        # do something else with the intermediate results (synchronization barrier)

This would mean that in Parallel.__enter__ we create a managed ._pool attribute on the Parallel object that can be reused by several quick consecutive calls to Parallel.__call__ and only gets deleted in Parallel.__exit__ (following the Python context manager protocol).

Of course we would still keep the current behavior when the caller wants to call Parallel.__call__ only once.
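A self-contained example of the reuse pattern being proposed, assuming the context-manager API behaves as described (one pool shared by consecutive calls inside the with block):

from joblib import Parallel, delayed

def square(x):
    return x * x

batches = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

with Parallel(n_jobs=2) as parallel:
    for batch in batches:
        # The same worker pool is reused for every call inside the block.
        results = parallel(delayed(square)(x) for x in batch)
        print(results)  # synchronization barrier between batches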

@ogrisel
Member

ogrisel commented Jun 24, 2015

BTW, even if we improve joblib to make it possible to reuse a pool with a context manager, I think it would still be good to implement #4773 (comment) as a stopgap / short term solution.

@amueller
Member

that would be pretty awesome!

@ogrisel
Member

ogrisel commented Jul 22, 2015

The joblib context manager API is implemented and available in #5016.

We did some profiling with @arthurmensch and apparently we should benefit a lot from skipping redundant inner input data validation checks by introducing a check_input=False flag for the functions called inside the fit loop (the default being check_input=True). Input validation takes a large fraction of the time and causes unnecessary GIL contention when using the threading backend.
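A rough illustration of the idea, assuming a check_input flag on Lasso.fit as proposed (with check_input=False the caller is responsible for supplying float64 arrays in the expected memory layout, since no validation is performed):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
D = np.asfortranarray(rng.randn(50, 10))   # design matrix, Fortran-ordered float64
y = np.ascontiguousarray(rng.randn(50))    # single target, contiguous float64

clf = Lasso(alpha=0.1, fit_intercept=False, warm_start=True)
clf.fit(D, y)                              # first call: full input validation
for _ in range(100):
    clf.fit(D, y, check_input=False)       # inner loop: skip redundant checks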

@jnothman jnothman added this to the 0.20 milestone Jun 14, 2017
@jnothman
Member

It looks like there is a plan to address the issue in @ogrisel's last comment, but I presume no one's gone ahead and done it? Perhaps this should be added to the issue description and the PR closed as the wrong fix.

@ghost

ghost commented Aug 18, 2017

Any progress on this? Not sure if this is related, but I get the same error ('output array is read-only') when running randomized Lasso with more than 1 CPU. Same as in #4597.

@glemaitre glemaitre modified the milestones: 0.20, 0.21 Jun 13, 2018
@amueller amueller closed this Aug 21, 2018