
Multiprocessing fix for dictionary learning main loop #4773


Conversation

arthurmensch
Contributor

Setting n_jobs = 1 during online dictionary learning process

Copying X before Lasso.fit call, for lasso_cd algorithm, to avoid buffer source-array read-only error (see discussion)

Fixing issues: #4772, #4769

Copying X before Lasso.fit call, for lasso_cd algorithm, to avoid buffer source-array read-only error
@@ -110,7 +110,8 @@ def _sparse_encode(X, dictionary, gram, cov=None, algorithm='lasso_lars',
         clf = Lasso(alpha=alpha, fit_intercept=False, precompute=gram,
                     max_iter=max_iter, warm_start=True)
         clf.coef_ = init
-        clf.fit(dictionary.T, X.T)
+        # Copying X to avoid buffer source-array read only error
+        clf.fit(dictionary.T, X.copy().T)
Member

Hum. That's a workaround but not a nice one. Can you add a test to see what's wrong with Lasso and a read-only X?

Contributor Author

Debugging this with the data that causes the error, I cannot see any indication of X being read-only:

X.flags.writeable = True

Adding a unit test that would currently fail could be tricky, as the error is only triggered by a very large X.
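For what it's worth, a minimal sketch of how a read-only X might be simulated in a small test, by memory-mapping the array the way joblib does for large inputs (whether this reproduces the reported error is an assumption, not a confirmed test):

import os
from tempfile import mkdtemp

import numpy as np
from sklearn.decomposition import sparse_encode

rng = np.random.RandomState(0)
X = rng.randn(10, 8)
dictionary = rng.randn(5, 8)

# Memory-map X in read-only mode to mimic what joblib does with large arrays.
path = os.path.join(mkdtemp(), 'X.npy')
np.save(path, X)
X_readonly = np.load(path, mmap_mode='r')

# On affected versions, algorithm='lasso_cd' would be expected to raise the
# "buffer source array is read-only" error here.
codes = sparse_encode(X_readonly, dictionary, algorithm='lasso_cd')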


Contributor Author

I think it is related to the way joblib handles shared arrays. It has some common points with the closed bug #4597.
I added an example in issue #4772.

         this_code = sparse_encode(this_X, dictionary.T, algorithm=method,
-                                   alpha=alpha, n_jobs=n_jobs).T
+                                   alpha=alpha, n_jobs=1).T
Member

are you disabling joblib then?

Contributor Author

Using sparse_encode with n_jobs=1 actually disables joblib.Parallel. I think this is the only way if we do not want to manage workers manually.


Contributor Author

There is a gain in the transform / sparse_encode function, once the dictionary has been computed. However, during the alternating process of online dictionary learning (sparse_encode -> _update_dict), using n_jobs > 1 actually slows down the process (see #4769). This is an issue inherent to joblib, as new workers are set up at each iteration.

However, this should probably depend on the batch dimension: if it is large, multiprocessing should still improve performance.
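A rough way to see the per-iteration worker setup cost being described (a toy sketch with arbitrary numbers, not the actual dictionary learning code):

from time import time

from joblib import Parallel, delayed

def tiny_task(i):
    return i * i

t0 = time()
for _ in range(20):
    # Each call spawns and tears down a fresh pool of workers.
    Parallel(n_jobs=2)(delayed(tiny_task)(i) for i in range(3))
print('20 parallel calls: %.2fs' % (time() - t0))

t0 = time()
for _ in range(20):
    # Sequential execution, no worker management at all.
    [tiny_task(i) for i in range(3)]
print('20 sequential loops: %.4fs' % (time() - t0))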

@arthurmensch arthurmensch changed the title Multiprocessing fix for dictionary elarning main loop, copying design matrix to avoid lasso_cd error on large dataset Multiprocessing fix for dictionary learning main loop, copying design matrix to avoid lasso_cd error on large dataset May 27, 2015
@ogrisel
Member

ogrisel commented May 27, 2015

Copying the data is the wrong fix to address the read-only crash. We should not copy the data but instead ensure that fitting works on read-only data, as discussed in #4772. I think we should close this PR.

@arthurmensch
Contributor Author

OK, I can open another one to set n_jobs = 1 within the dict_learning_online loop if batch_size is too small.

@agramfort
Member

agramfort commented May 27, 2015 via email

@arthurmensch
Contributor Author

Granted. Should we make it possible for the user to have sequential dictionary learning and parallel sparse encoding?

@agramfort
Member

agramfort commented May 27, 2015 via email

@ogrisel
Member

ogrisel commented May 27, 2015

The correct fix might be to merge joblib/joblib#157 in joblib and then synchronize the joblib that is embedded in master.

@arthurmensch would you be interested in testing whether this branch of joblib actually solves your speed issues in dictionary learning with n_jobs > 1?

You can check out the branch of joblib, then run python setup.py develop to install it in dev mode, and finally create the following file in your home folder to use your joblib version instead of the one embedded in scikit-learn:

~/pythonstartup.py:

try:
    import joblib
    from sklearn.externals import joblib as skl_joblib
    print('Monkeypatching scikit-learn embedded joblib')
    for k, v in vars(joblib).items():
        setattr(skl_joblib, k, v)
except ImportError:
    pass

@arthurmensch
Contributor Author

I am not sure that this tackles the issue, as we do not have a large number of tasks to submit to _sparse_encode at the same time. At each iteration, we run sparse_encode on a small batch, wait for _update_dict to finish, and rerun sparse_encode; each call to sparse_encode invokes the atomic _sparse_encode batch_size times, so we cannot expect a large improvement from batching these calls within one sequential call to sparse_encode. I hope this is clear enough...

It is possible that the algorithm would still work running sparse coding and dictionary updates fully in parallel (just as a coordinate descent does), but I have not seen any evidence of this in the literature.
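For readers following along, a highly simplified sketch of the alternation being described; update_dictionary below is a hypothetical stand-in for the private dictionary update step, not the actual scikit-learn code:

import numpy as np
from sklearn.decomposition import sparse_encode

def online_dict_learning(X, dictionary, n_iter, batch_size, alpha, update_dictionary):
    # X: (n_samples, n_features), dictionary: (n_components, n_features)
    n_samples = X.shape[0]
    for it in range(n_iter):
        # 1) pick a small mini-batch
        start = (it * batch_size) % n_samples
        batch = X[start:start + batch_size]
        # 2) sparse coding: only batch_size small lasso problems to dispatch
        code = sparse_encode(batch, dictionary, algorithm='lasso_lars',
                             alpha=alpha, n_jobs=1)
        # 3) the dictionary update must wait for the codes (synchronization)
        dictionary = update_dictionary(dictionary, batch, code)
    return dictionary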

@arthurmensch arthurmensch changed the title Multiprocessing fix for dictionary learning main loop, copying design matrix to avoid lasso_cd error on large dataset Multiprocessing fix for dictionary learning main loop May 27, 2015
@arthurmensch
Contributor Author

I tested it with the joblib/joblib#157 PR: there is actually a slight drop in performance, which was expected. sparse_encode runs batch_size tasks, i.e. few jobs, and batching these tasks adds overhead without allowing a gain in performance, as there are too few tasks.

The only way to improve this would be to initialize n_jobs workers before the main loop and reuse them in every sparse_encode call. However, I gather that using multiprocessing directly is not permitted within scikit-learn code.

@ogrisel
Member

ogrisel commented May 28, 2015

I agree with your analysis. joblib/joblib#157 solves the problem when we do a single call to Parallel.__call__ with a very large number of short jobs, not the case where we call Parallel.__call__ many times with a small number of short jobs each time, as is the case here.

Calling Parallel.__call__ many times has a very large overhead. With the current multiprocessing backend, each call to Parallel.__call__ does:

  1. create a Pool of subprocesses and the queues and sockets to communicate with those,
  2. dispatch the processing to workers and collect the results,
  3. terminate the subprocesses and destroy all the communication infrastructure.

If 2. is short (e.g. very few jobs to dispatch), the overhead of steps 1. and 3. will dominate and completely ruin the performance. This is the case for a typical call to dict_learning_online where batch_size is small. The cost of 1. and 3. is typically on the order of 150 ms on a POSIX platform with multiprocessing's default fork start method. Under Windows it's typically twice that overhead.

batch_size is 3 in dict_learning_online by default, and it's likely that increasing batch_size will negatively impact the convergence rate of the outer online dictionary learning algorithm (to be confirmed, ping @GaelVaroquaux).

@arthurmensch can you try to increase the batch_size parameter to something high enough to hide the overhead without breaking the convergence (e.g. maybe try batch_size=100 or batch_size=1000)? Please compare the quality of the result with a model trained with the default parameters batch_size=3 and n_jobs=1.

If typical optimal values of batch_size for dict_learning_online are always in the range 1-10 and each individual call to sparse_encode is shorter than a couple of seconds, then I see no point in trying to use multiprocessing inside this loop. The joblib API does not make it possible to keep a pool open, and even if it did, the joblib dispatch overhead (the communication with the subprocesses) would probably never make it advantageous to parallelize the inner loop of dict_learning_online.

@arthurmensch
Contributor Author

Benchmark: https://gist.github.com/arthurmensch/f688c9ee5fdce1803cde

Output with batch_size = 200, n_jobs = 1, 2, 3:

Distorting image...
Extracting reference patches...
done in 0.02s.
Learning the dictionary...
done in 3.18s.
Learning the dictionary...
done in 3.29s.
Learning the dictionary...
done in 2.20s.

Process finished with exit code 0

On Linux, with vectors of dimension 49, batches must be of size 200 to hide the joblib overhead. I will compare the quality of the results for a comparable number of single-vector lasso resolutions (hence n_iter = 2000 / batch_size in the dictionary init parameters).

I think this also depends on the vector dimension though, which makes the selection of n_jobs more complex.
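A hypothetical reconstruction of that benchmark loop, keeping the total number of single-vector lasso resolutions comparable by scaling n_iter with batch_size (the parameter values are assumptions, not the actual gist):

from time import time

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

def bench(patches, batch_size, n_jobs):
    # Scale n_iter so the total number of coded vectors stays comparable.
    # n_iter was the parameter name at the time of this discussion
    # (newer scikit-learn releases use max_iter instead).
    dico = MiniBatchDictionaryLearning(n_components=100, alpha=1,
                                       n_iter=2000 // batch_size,
                                       batch_size=batch_size,
                                       n_jobs=n_jobs)
    t0 = time()
    dico.fit(patches)
    return time() - t0

rng = np.random.RandomState(0)
patches = rng.randn(2000, 49)   # vectors of dimension 49, as above
for n_jobs in (1, 2, 3):
    print('n_jobs=%d: %.2fs' % (n_jobs, bench(patches, 200, n_jobs)))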

@GaelVaroquaux
Member

GaelVaroquaux commented May 29, 2015 via email

@arthurmensch
Contributor Author

I updated the gist; here is a benchmark for different dimensions p (X \in R^{n x p}), which correspond to the patch size when using an image as input. We can see that for the default batch_size = 3, using more than one job is always slower. We obtain slightly better results using n_jobs > 1 for p = 81, batch_size = 300, but such a batch size is too large in an online setting, following [1]. I have not yet checked whether all the obtained results are correct; I will check this in the cases where setting n_jobs > 1 yields faster results.

[benchmark figure]

[1] J. Mairal, F. Bach, J. Ponce, G. Sapiro (2009). Online dictionary learning for sparse coding. http://www.di.ens.fr/sierra/pdfs/icml09.pdf

@GaelVaroquaux
Member

It strikes me that we probably need to increase a bit the batch size, say set it to 10. This matches my experience on other problems.


@arthurmensch
Contributor Author

With such a default we should then set n_jobs = 1 for the dictionary learning phase, given the benchmark above, or at least raise a warning.

@ogrisel
Member

ogrisel commented Jun 1, 2015

It strikes me that we probably need to increase a bit the batch size, say set it to 10. This matches my experience on other problems.

Is this a general remark or for this PR? According to the benchmark, even at batch_size=100, n_jobs=2 is no faster than n_jobs=1.

@arthurmensch can you try batch_size=1000 and report the run time for n_jobs=2 and n_jobs=1? If n_jobs=2 is not at least 1.2x faster, then I think we should force n_jobs=1 inside the mini-batch loop.

@arthurmensch
Contributor Author

At n = 1000, performance is consistently improved:

[benchmark figure]

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 1, 2015 via email

@ogrisel
Member

ogrisel commented Jun 2, 2015

At n = 1000, performance is consistently improved

Have you checked that all the models with different batch_size converge to solutions of similar quality (according to some metric computed on both a training and a validation set)?

@arthurmensch
Contributor Author

With batch_size from 3 to 1000, we obtain satisfactory results for image denoising (L2 distance plot to come). However, running dictionary learning with batch_size = 1000 and n_jobs = 1, the convergence time is 3 times larger than with batch_size = 10. This appears in the curves above, where n = 10 seems to be the argmin of the benchmark.

What this means is that, in order to gain from multiprocessing, we have to use a batch_size so large that it divides the single-processor performance by 3. Except in some exceptional use cases, this should prevent the user from setting such a large batch_size, and thus from ever benefiting from multiprocessing.

Around n = 200, the gains and losses of multiprocessing appear to compensate each other, but such batches are already too big for optimal performance.

@ogrisel
Member

ogrisel commented Jun 8, 2015

Thanks @arthurmensch. Based on your analysis I would be in favor of setting batch_size=10 by default, both in the dict_learning_online function and in the MiniBatchDictionaryLearning class.

Furthermore, in the fit and partial_fit methods of MiniBatchDictionaryLearning, I think we should call the inner dict_learning_online with n_jobs=self.n_jobs when batch_size > 100 and with n_jobs=1 otherwise (this should be the default). self.n_jobs would always be used to parallelize calls in the MiniBatchDictionaryLearning.transform method.
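A minimal sketch of that rule as a standalone helper (the threshold of 100 comes from the proposal above; the helper name is hypothetical, not scikit-learn code):

def effective_fit_n_jobs(n_jobs, batch_size, threshold=100):
    # Parallelism in the mini-batch fit loop only pays off when each batch
    # is large enough to amortize the worker setup/teardown overhead.
    return n_jobs if batch_size > threshold else 1

print(effective_fit_n_jobs(n_jobs=4, batch_size=3))     # -> 1
print(effective_fit_n_jobs(n_jobs=4, batch_size=1000))  # -> 4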

@amueller
Member

amueller commented Jun 8, 2015

Do we want to support explicit pools that are not recreated all the time?

@ogrisel
Member

ogrisel commented Jun 24, 2015

Do we want to support explicit pools that are not recreated all the time?

I think this would be a good idea, as this problem has appeared several times in the past. What I am thinking of is to make it possible to use a Parallel instance as a context manager:

with Parallel(self.n_jobs) as p:
    for stuff in stuffs:
        results = p(delayed(some_function)(x) for x in stuff)
        # do something else with the intermediate results (synchronization barrier)

This would mean that in Parallel.__enter__ we create a managed ._pool attribute on the Parallel object that can be reused by several quick consecutive calls to Parallel.__call__ and only gets deleted in Parallel.__exit__ (following the Python context manager protocol).

Of course we would still keep the current behavior when the caller wants to call Parallel.__call__ only once.
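A self-contained example of the reuse pattern being proposed, assuming the context-manager API behaves as described (one pool shared by consecutive calls inside the with block):

from joblib import Parallel, delayed

def square(x):
    return x * x

batches = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

with Parallel(n_jobs=2) as parallel:
    for batch in batches:
        # The same worker pool is reused for every call inside the block.
        results = parallel(delayed(square)(x) for x in batch)
        print(results)  # synchronization barrier between batches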

@ogrisel
Member

ogrisel commented Jun 24, 2015

BTW, even if we improve joblib to make it possible to reuse a pool with a context manager, I think it would still be good to implement #4773 (comment) as a stopgap / short term solution.

@amueller
Member

that would be pretty awesome!

@ogrisel
Member

ogrisel commented Jul 22, 2015

The joblib context manager API is implemented and available in #5016.

We did some profiling with @arthurmensch and apparently we should benefit a lot from skipping redundant inner input data validation checks by introducing a check_input=False flag for the functions called inside the fit loop (the default being check_input=True). Input validation takes a large fraction of the time and causes unnecessary GIL contention when using the threading backend.
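A rough illustration of the idea, assuming a check_input flag on Lasso.fit as proposed (with check_input=False the caller is responsible for supplying float64 arrays in the expected memory layout, since no validation is performed):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
D = np.asfortranarray(rng.randn(50, 10))   # design matrix, Fortran-ordered float64
y = np.ascontiguousarray(rng.randn(50))    # single target, contiguous float64

clf = Lasso(alpha=0.1, fit_intercept=False, warm_start=True)
clf.fit(D, y)                              # first call: full input validation
for _ in range(100):
    clf.fit(D, y, check_input=False)       # inner loop: skip redundant checks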

@jnothman jnothman added this to the 0.20 milestone Jun 14, 2017
@jnothman
Member

It looks like there is a plan to address the issue in @ogrisel's last comment, but I presume no one's gone ahead and done it? Perhaps this should be added to the issue description and the PR closed as the wrong fix.

@ghost

ghost commented Aug 18, 2017

Any progress on this? Not sure if this is related, but I get the same error ('output array is read-only') when running randomized Lasso with more than 1 CPU. Same as in #4597.

@glemaitre glemaitre modified the milestones: 0.20, 0.21 Jun 13, 2018
@amueller amueller closed this Aug 21, 2018