What's the point in this line and this function? #5820

Closed · olologin opened this issue Nov 15, 2015 · 10 comments

@olologin (Contributor)

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/bagging.py#L344

I understand that it slices the list of tasks into batches, but why do we need it if joblib can automatically adjust the batch size (from the list of tasks) and group the tasks into batches for each process? Even if you don't want to spend time on the automatic adjustment, you can just set batch_size = int(n_tasks / n_jobs) and it will work the same way as explicit slicing.

https://pythonhosted.org/joblib/parallel.html#parallel-reference-documentation
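For illustration, a minimal sketch (not scikit-learn code; it assumes a joblib version with the batch_size parameter, and work is a toy stand-in for fitting one estimator) of the two dispatch strategies in question:

from joblib import Parallel, delayed

def work(i):
    # cheap CPU-bound stand-in for fitting a single estimator
    return sum(k * k for k in range(1000))

n_tasks, n_jobs = 400, 4

# Let joblib group tasks itself; 'auto' tunes the batch size at runtime.
out_auto = Parallel(n_jobs=n_jobs, batch_size='auto')(
    delayed(work)(i) for i in range(n_tasks))

# Fixed batches, roughly equivalent to slicing the task list by hand:
# one batch of n_tasks / n_jobs tasks per worker.
out_fixed = Parallel(n_jobs=n_jobs, batch_size=n_tasks // n_jobs)(
    delayed(work)(i) for i in range(n_tasks))

assert out_auto == out_fixed  # same results, different dispatch overhead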

@GaelVaroquaux (Member) commented Nov 15, 2015 via email

@olologin (Contributor, Author)

Yes, I've checked the sources now, and this function is used in only two files, bagging.py and forest.py; in the latter it's used only for computing the minimal number of threads:
n_jobs, _, _ = _partition_estimators(self.n_estimators, self.n_jobs)

Not a big deal, as it turned out.
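For reference, here is a simplified sketch of what _partition_estimators computes (a paraphrase, not the exact scikit-learn helper): it clips the number of jobs to the number of estimators, splits the estimators as evenly as possible, and returns the slice boundaries.

import numpy as np
from joblib import cpu_count

def partition_estimators(n_estimators, n_jobs):
    # Clip the effective number of jobs to the number of estimators.
    if n_jobs == -1:
        n_jobs = min(cpu_count(), n_estimators)
    else:
        n_jobs = min(n_jobs, n_estimators)

    # Distribute the estimators as evenly as possible between jobs.
    per_job = (n_estimators // n_jobs) * np.ones(n_jobs, dtype=int)
    per_job[:n_estimators % n_jobs] += 1

    # Slice boundaries into the flat list of estimators.
    starts = [0] + np.cumsum(per_job).tolist()
    return n_jobs, per_job.tolist(), starts

# e.g. partition_estimators(10, 4) -> (4, [3, 3, 2, 2], [0, 3, 6, 8, 10])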

I'll try to run some benchmarks.

@olologin (Contributor, Author)

from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.datasets import load_boston, load_iris
from sklearn.utils import check_random_state

rng = check_random_state(0)

# load the iris dataset and randomly permute it
iris = load_iris()
perm = rng.permutation(iris.target.size)
iris.data = iris.data[perm]
iris.target = iris.target[perm]

# load the boston dataset and randomly permute it
boston = load_boston()
perm = rng.permutation(boston.target.size)
boston.data = boston.data[perm]
boston.target = boston.target[perm]

reg = BaggingRegressor(n_estimators=100, bootstrap=True, oob_score=True, n_jobs=4, random_state=10)
clf = BaggingClassifier(n_estimators=100, bootstrap=True, oob_score=True, n_jobs=4, random_state=10)

%timeit -n 10 clf.fit(iris.data, iris.target)
%timeit -n 10 reg.fit(boston.data, boston.target)
%timeit -n 10 clf.predict(iris.data)
%timeit -n 10 reg.predict(boston.data)

reg = BaggingRegressor(n_estimators=400, bootstrap=True, oob_score=True, n_jobs=4, random_state=10)
clf = BaggingClassifier(n_estimators=400, bootstrap=True, oob_score=True, n_jobs=4, random_state=10)

%timeit -n 10 clf.fit(iris.data, iris.target)
%timeit -n 10 reg.fit(boston.data, boston.target)
%timeit -n 10 clf.predict(iris.data)
%timeit -n 10 reg.predict(boston.data)

Results on current master:

# 100 trees
10 loops, best of 3: 265 ms per loop
10 loops, best of 3: 409 ms per loop
10 loops, best of 3: 147 ms per loop
10 loops, best of 3: 150 ms per loop

# 400 trees
10 loops, best of 3: 635 ms per loop
10 loops, best of 3: 1.46 s per loop
10 loops, best of 3: 268 ms per loop
10 loops, best of 3: 392 ms per loop

Results with automatically adjusted batch sizes, from this branch: https://github.com/olologin/scikit-learn/blob/bagging_refactoring/sklearn/ensemble/bagging.py

# 100 trees
10 loops, best of 3: 261 ms per loop
10 loops, best of 3: 436 ms per loop
10 loops, best of 3: 146 ms per loop
10 loops, best of 3: 188 ms per loop

# 400 trees
10 loops, best of 3: 684 ms per loop
10 loops, best of 3: 1.33 s per loop
10 loops, best of 3: 266 ms per loop
10 loops, best of 3: 366 ms per loop

@GaelVaroquaux (Member) commented Nov 16, 2015 via email

@olologin (Contributor, Author)

I should say that in the current forest.py on master, all the code that works with joblib is written in the same manner (as bagging.py from the branch above), i.e.:

    n_jobs, _, _ = _partition_estimators(self.n_estimators, self.n_jobs)
    ... = Parallel(n_jobs=n_jobs, verbose=self.verbose,
                   backend="threading")(...)
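For completeness, a self-contained sketch of that pattern (the names below are illustrative, not the actual forest.py internals); with the threading backend the workers share memory, so X and y are never copied per batch:

from joblib import Parallel, delayed

def build_one_tree(tree_id, X, y):
    # stand-in for fitting a single tree on the shared X, y
    return tree_id

X, y = [[0.0], [1.0]], [0, 1]  # shared, never pickled with "threading"
n_estimators, n_jobs = 8, 4

trees = Parallel(n_jobs=n_jobs, backend="threading")(
    delayed(build_one_tree)(i, X, y) for i in range(n_estimators))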

@glouppe (Contributor) commented Nov 16, 2015

Can you run the same benchmarks on much larger datasets? The reason for writing things this way was to minimize overhead by transferring the function arguments (i.e., copying X and y) only once per core.
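To make that argument concrete, a hedged sketch (fit_chunk is hypothetical, and this ignores joblib's automatic memmapping of large arrays, which can hide the copies): with a multiprocessing backend the arguments of every dispatched batch are pickled, so one hand-made batch per core serializes X and y n_jobs times instead of up to n_estimators times.

import numpy as np
from joblib import Parallel, delayed

X = np.zeros((1000, 100))  # small enough to be pickled, not memmapped

def fit_chunk(X, n_estimators):
    # stand-in for fitting n_estimators estimators on one copy of X
    return n_estimators

n_jobs, n_estimators = 4, 100

# One hand-made batch per core: X crosses the process boundary n_jobs times.
Parallel(n_jobs=n_jobs)(
    delayed(fit_chunk)(X, n_estimators // n_jobs) for _ in range(n_jobs))

# One task per estimator: X may cross the boundary once per dispatched
# batch, up to n_estimators times if the batches end up small.
Parallel(n_jobs=n_jobs)(
    delayed(fit_chunk)(X, 1) for _ in range(n_estimators))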

@olologin (Contributor, Author)

> Can you run the same benchmarks on much larger datasets?

Is the digits dataset big enough?

> The reason for writing things this way was to minimize overhead by transferring the function arguments (i.e., copying X and y)

Hmm, interesting, I thought joblib automatically serializes them only once.

@olologin (Contributor, Author)

from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.datasets import load_digits, load_boston
from sklearn.utils import check_random_state

rng = check_random_state(0)

# load the digits dataset and randomly permute it
digits = load_digits()
perm = rng.permutation(digits.target.size)
digits.data = digits.data[perm]
digits.target = digits.target[perm]

# load the boston dataset and randomly permute it
boston = load_boston()
perm = rng.permutation(boston.target.size)
boston.data = boston.data[perm]
boston.target = boston.target[perm]

reg = BaggingRegressor(n_estimators=100, bootstrap=True, oob_score=True, n_jobs=4, random_state=10)
clf = BaggingClassifier(n_estimators=100, bootstrap=True, oob_score=True, n_jobs=4, random_state=10)

%timeit -n 10 clf.fit(digits.data, digits.target)
%timeit -n 10 reg.fit(boston.data, boston.target)
%timeit -n 10 clf.predict(digits.data)
%timeit -n 10 reg.predict(boston.data)

reg = BaggingRegressor(n_estimators=400, bootstrap=True, oob_score=True, n_jobs=4, random_state=10)
clf = BaggingClassifier(n_estimators=400, bootstrap=True, oob_score=True, n_jobs=4, random_state=10)

%timeit -n 10 clf.fit(digits.data, digits.target)
%timeit -n 10 reg.fit(boston.data, boston.target)
%timeit -n 10 clf.predict(digits.data)
%timeit -n 10 reg.predict(boston.data)

Results on current master:

# 100 trees
10 loops, best of 3: 1.41 s per loop
10 loops, best of 3: 406 ms per loop
10 loops, best of 3: 256 ms per loop
10 loops, best of 3: 148 ms per loop
# 400 trees
10 loops, best of 3: 4.87 s per loop
10 loops, best of 3: 1.35 s per loop
10 loops, best of 3: 734 ms per loop
10 loops, best of 3: 370 ms per loop

With automatically adjusted batch sizes:

# 100 trees
10 loops, best of 3: 1.38 s per loop
10 loops, best of 3: 468 ms per loop
10 loops, best of 3: 360 ms per loop
10 loops, best of 3: 192 ms per loop
# 400 trees
10 loops, best of 3: 5.02 s per loop
10 loops, best of 3: 1.38 s per loop
10 loops, best of 3: 983 ms per loop
10 loops, best of 3: 393 ms per loop

I tested both versions on the same machine with the same number of processes and the same CPU load average, but these results look like random noise, maybe because the difference between the versions is very small. I'm using IPython and Python 3.4.3 to obtain these results.

@olologin (Contributor, Author)

I'll add an additional script which performs the same test but uses the timeit module (because it seems that IPython's %timeit ignores the -n parameter); I think these results are more accurate.

import timeit
import numpy as np

setup = """
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.datasets import load_digits, load_boston
from sklearn.utils import check_random_state

rng = check_random_state(0)

# load the digits dataset and randomly permute it
digits = load_digits()
perm = rng.permutation(digits.target.size)
digits.data = digits.data[perm]
digits.target = digits.target[perm]

# load the boston dataset and randomly permute it
boston = load_boston()
perm = rng.permutation(boston.target.size)
boston.data = boston.data[perm]
boston.target = boston.target[perm]

reg1 = BaggingRegressor(n_estimators=100, bootstrap=True, oob_score=True, n_jobs=4, random_state=10)
clf1 = BaggingClassifier(n_estimators=100, bootstrap=True, oob_score=True, n_jobs=4, random_state=10)
reg2 = BaggingRegressor(n_estimators=400, bootstrap=True, oob_score=True, n_jobs=4, random_state=10)
clf2 = BaggingClassifier(n_estimators=400, bootstrap=True, oob_score=True, n_jobs=4, random_state=10)
"""

fits = ["reg1.fit(boston.data, boston.target)", "reg2.fit(boston.data, boston.target)",
        "clf1.fit(digits.data, digits.target)", "clf2.fit(digits.data, digits.target)"]

predicts = ["reg1.predict(boston.data)", "reg2.predict(boston.data)",
            "clf1.predict(digits.data)", "clf2.predict(digits.data)"]

for fit, predict in zip(fits, predicts):
    # fit timing: best of 10 repeats of 3 runs each
    print("{0}\t: {1}".format(
        np.min(timeit.Timer(setup=setup, stmt=fit).repeat(repeat=10, number=3)),
        fit
    ))
    # predict timing: the model is fitted once in setup, then predict is timed
    print("{0}\t: {1}".format(
        np.min(timeit.Timer(setup=setup + fit, stmt=predict).repeat(repeat=10, number=3)),
        predict
    ))

Results:

Master:

1.1000740239396691  : reg1.fit(boston.data, boston.target)
0.4071497058030218  : reg1.predict(boston.data)
3.166529825888574   : reg2.fit(boston.data, boston.target)
0.9816317800432444  : reg2.predict(boston.data)
3.070393343223259   : clf1.fit(digits.data, digits.target)
0.7177243148908019  : clf1.predict(digits.data)
11.313239770941436  : clf2.fit(digits.data, digits.target)
1.6799056760501117  : clf2.predict(digits.data)

After removing batches:

1.0993680758401752  : reg1.fit(boston.data, boston.target)
0.4194249368738383  : reg1.predict(boston.data)
3.1378495171666145  : reg2.fit(boston.data, boston.target)
1.0702078870963305  : reg2.predict(boston.data)
3.3732776790857315  : clf1.fit(digits.data, digits.target)
1.0525918728671968  : clf1.predict(digits.data)
11.62876814394258   : clf2.fit(digits.data, digits.target)
2.387485134182498   : clf2.predict(digits.data)

Ratio After/Before:

[0.99935827, 1.03014918, 0.99094267, 1.09023354, 1.09864675, 1.4665685, 1.02789019, 1.4212019]
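These ratios can be reproduced from the timings above (values truncated to eight decimals):

import numpy as np

# timings copied from the two runs above, in the order printed
master = np.array([1.10007402, 0.40714971, 3.16652983, 0.98163178,
                   3.07039334, 0.71772431, 11.31323977, 1.67990568])
no_batches = np.array([1.09936808, 0.41942494, 3.13784952, 1.07020789,
                       3.37327768, 1.05259187, 11.62876814, 2.38748513])

print(no_batches / master)  # the After/Before ratios quoted above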

We can see that some degradation occurs in clf1.predict and clf2.predict. I don't know why; maybe because these operations are relatively fast compared to some of joblib's internal routines (batch-size adjustment, maybe some serialization overhead).

@amueller (Member)

I don't think this is relevant any more, feel free to reopen.
