Dictionary learning is slower with n_jobs > 1 #4769
The incriminated line is l. 692 of dict_learning.py:

```python
this_code = sparse_encode(this_X, dictionary.T, algorithm=method,
                          alpha=alpha, n_jobs=n_jobs).T
```

We have this_X.shape[0] == batch_size, which is typically of the same order as n_jobs, so sparse_encode's computation time may be dominated by joblib overhead rather than by the lasso computation itself.
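To make that overhead concrete, here is a minimal, hypothetical timing sketch (not from the issue): it calls sparse_encode repeatedly on a batch of only a few samples, roughly the shape it receives inside the mini-batch fit loop, so with n_jobs > 1 most of the time goes into joblib dispatch rather than lasso solves. The sizes and alpha are arbitrary.

```python
from time import time

import numpy as np
from sklearn.decomposition import sparse_encode

rng = np.random.RandomState(0)
n_components, n_features, batch_size = 100, 64, 3  # batch_size ~ n_jobs
dictionary = rng.randn(n_components, n_features)
this_X = rng.randn(batch_size, n_features)

for n_jobs in (1, 4):
    t0 = time()
    for _ in range(100):  # mimic the repeated calls made inside the fit loop
        sparse_encode(this_X, dictionary, algorithm='lasso_lars',
                      alpha=0.1, n_jobs=n_jobs)
    print("n_jobs=%d: %.3fs" % (n_jobs, time() - t0))
```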
Fix in pull request #4773
I believe I'm seeing this problem here: #4779
Agreed, I have run into this myself as well. Joblib would normally make sense for finding the sparse code for a precomputed dictionary. In this specific instance, however, it is being called in a tight loop as part of dictionary learning, and the overhead of starting Joblib in that tight loop is too costly. One is better off using the threaded parallelism that BLAS provides instead. Edit: Alternatively, it may make sense to start the Joblib thread or multiprocessing pool before the tight loop and keep it running, so that the tight loop merely schedules jobs on this pool. I have not tested this latter approach, but it may work reasonably well.
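As a rough illustration of the "start the pool once" idea, the sketch below uses joblib's Parallel as a context manager so the workers are created before the loop and reused for every batch. The encode_one helper and the array sizes are made up for the example; this is not scikit-learn's actual code path.

```python
import numpy as np
from joblib import Parallel, delayed


def encode_one(x, dictionary):
    # Stand-in for the per-sample sparse coding step.
    return dictionary @ x


rng = np.random.RandomState(0)
dictionary = rng.randn(20, 20)
batches = [rng.randn(3, 20) for _ in range(50)]

# Workers are started once here and reused for every batch, so each
# iteration only pays for job dispatch, not for pool start-up.
with Parallel(n_jobs=4) as parallel:
    for this_X in batches:
        codes = parallel(delayed(encode_one)(x, dictionary) for x in this_X)
```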
From #4773 (comment): the joblib context manager API is implemented and available in #5016. We did some profiling with @arthurmensch, and apparently we should benefit a lot from skipping redundant inner input data validation checks by introducing a check_input=False flag for the functions called inside the fit loop (the default being check_input=True). Input validation takes a large fraction of the time and causes unnecessary GIL contention when using the threading backend.
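A sketch of that pattern, assuming the check_input flag that sparse_encode exposes: the data and dictionary are validated once up front, and the per-batch calls inside the loop skip the redundant checks. The batch size and alpha here are arbitrary.

```python
import numpy as np
from sklearn.decomposition import sparse_encode
from sklearn.utils import check_array

rng = np.random.RandomState(0)
dictionary = check_array(rng.randn(15, 20))   # validated once, up front
X = check_array(rng.randn(300, 20))

for start in range(0, X.shape[0], 3):         # tight mini-batch loop
    this_X = X[start:start + 3]
    code = sparse_encode(this_X, dictionary, algorithm='lasso_lars',
                         alpha=0.1, check_input=False)  # skip re-validation
```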
FWIW there has been a lot of work on this (particularly in the past few weeks) and IMHO things are getting better. The main things that remain slow on the sparse coding side are this copy and … Most of these improvements have involved skipping the …
On the Joblib side, I would expect PR https://github.com/tomMoral/loky/pull/135 to end up being very helpful here, as the bigger issue is likely thread contention between the BLAS used and Joblib in sparse coding.
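One common way to sidestep that contention, shown here only as an illustrative sketch, is to cap the BLAS thread pool with threadpoolctl while joblib's threading backend provides the outer parallelism. Neither threadpoolctl nor the threading backend is prescribed by the comment above, and the array sizes are arbitrary.

```python
import numpy as np
from joblib import parallel_backend
from sklearn.decomposition import sparse_encode
from threadpoolctl import threadpool_limits

rng = np.random.RandomState(0)
dictionary = rng.randn(15, 20)
X = rng.randn(200, 20)

# With the threading backend all work stays in this process, so capping the
# BLAS pool here keeps BLAS threads and joblib threads from fighting over
# the same cores.
with threadpool_limits(limits=1, user_api='blas'):
    with parallel_backend('threading'):
        code = sparse_encode(X, dictionary, algorithm='lasso_lars',
                             alpha=0.1, n_jobs=4)
```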
The gists are no longer available.

```python
from time import time

import numpy as np
from sklearn.datasets import make_sparse_coded_signal
from sklearn.decomposition import MiniBatchDictionaryLearning

X, dictionary, code = make_sparse_coded_signal(
    n_samples=1000, n_components=15, n_features=20, n_nonzero_coefs=10,
    random_state=42)

n_jobs = [1, 2, 3, 4, 10]

for nj in n_jobs:
    dict_learner = MiniBatchDictionaryLearning(
        n_components=15, transform_algorithm='lasso_lars', random_state=42,
        n_jobs=nj
    )
    t0 = time()
    X_transformed = dict_learner.fit_transform(X)
    print("done in %0.3fs, with n_jobs = %d." % ((time() - t0), nj))
```

gives the following output:
With the following environment:
It seems that this issue is still relevant?
Thanks for rechecking this; that is a good data point to have. I had tried to Cythonize a good chunk of the code and release the GIL for the bulk of it with PR #11874. Maybe that helps? Unfortunately I don't have much time to work on that PR these days.
Removing the milestone. Maybe @jeremiedbb would like to have a look?
Setting n_jobs > 1 in MiniBatchDictionaryLearning (and in the dict_learning_online function) leads to worse performance.
Multiprocessing is handled in sklearn.decomposition, in the dict_learning function, l. 249.
Minimal example:
https://gist.github.com/arthurmensch/091d16c135f4a3ba5580
Output with n_jobs = 1:
Output with n_jobs = 2:
Output with n_jobs = 4:
We can see that the transform method of MiniBatchDictionaryLearning (which relies on the sparse_encode function) benefits from multiprocessing, as expected.
Dictionary learning itself relies on successive calls to the sparse_encode function: the slowness may come from this.
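Since the gist linked above is no longer available, here is a hedged sketch of the kind of comparison it presumably made: timing fit (many small sparse_encode calls inside the mini-batch loop) against transform (one large sparse_encode call) for several n_jobs values. The data sizes mirror the reproduction script shown earlier in this thread; the original gist's numbers are not reproduced.

```python
from time import time

from sklearn.datasets import make_sparse_coded_signal
from sklearn.decomposition import MiniBatchDictionaryLearning

X, dictionary, code = make_sparse_coded_signal(
    n_samples=1000, n_components=15, n_features=20, n_nonzero_coefs=10,
    random_state=42)

for n_jobs in (1, 2, 4):
    est = MiniBatchDictionaryLearning(n_components=15, random_state=42,
                                      transform_algorithm='lasso_lars',
                                      n_jobs=n_jobs)
    t0 = time()
    est.fit(X)          # parallelism sits inside the tight mini-batch loop
    t_fit = time() - t0
    t0 = time()
    est.transform(X)    # a single large sparse_encode call
    t_transform = time() - t0
    print("n_jobs=%d: fit %.3fs, transform %.3fs"
          % (n_jobs, t_fit, t_transform))
```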