[MRG] Adds documentation for parallelisation of custom scorer #12813

fx86 · 2018-12-18T10:19:03Z

Reference Issues/PRs

Adds documentation on parallelisation of custom scorers. Unless a custom scoring function is
imported from a module, joblib is unable to run it in parallel.

Prompted by this discussion: #10054 (comment)

ping @jnothman @amueller

albertcthomas · 2018-12-18T10:30:30Z

I know this is a restriction if pickle is used but is this still needed now that joblib is using cloudpickle?
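For context, the pickle restriction mentioned here can be sketched with the standard library alone. Plain pickle stores a function by reference (module name plus qualified name), so a function imported from a module round-trips, while one that cannot be looked up by name does not. The module name `custom_scorer_module`, the toy metric, and the nested function used as a stand-in for an interactively defined scorer are all hypothetical choices for this illustration. It does not show joblib or cloudpickle behaviour.

```python
import importlib
import pickle
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical module holding the custom scoring function (name is an assumption).
module_source = textwrap.dedent('''
    def custom_scoring_function(y_true, y_pred):
        """Toy metric: mean absolute error, written without NumPy."""
        return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
''')

# Write the module to a temporary directory and make that directory importable.
tmp_dir = tempfile.mkdtemp()
Path(tmp_dir, "custom_scorer_module.py").write_text(module_source)
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()
custom_scorer_module = importlib.import_module("custom_scorer_module")

# A function that lives in an importable module is pickled by reference
# and can be restored wherever that module can be imported.
payload = pickle.dumps(custom_scorer_module.custom_scoring_function)
restored = pickle.loads(payload)
imported_result = restored([1.0, 2.0], [1.5, 2.5])


def make_local_scorer():
    # Stand-in for a function that cannot be found by name at import time.
    def local_scoring_function(y_true, y_pred):
        return 0.0
    return local_scoring_function


# Plain pickle cannot serialise the nested function; the exact exception
# type depends on the Python version, so both are caught.
try:
    pickle.dumps(make_local_scorer())
    local_is_picklable = True
except (pickle.PicklingError, AttributeError):
    local_is_picklable = False

print(imported_result, local_is_picklable)
```

Whether the same limitation still shows up through joblib depends on the backend and on cloudpickle, which is the open question in this thread.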

doc/modules/model_evaluation.rst

fx86 · 2018-12-19T18:18:05Z

> I know this is a restriction if pickle is used but is this still needed now that joblib is using cloudpickle?

@albertcthomas would it make sense to add this for pre-cloudpickle versions of sklearn? Could you let me know from which version onwards cloudpickle is used?

albertcthomas · 2018-12-19T18:47:19Z

I was just wondering. It might be relevant even with cloudpickle. cc @tomMoral

amueller · 2018-12-19T21:05:07Z

Thanks. Can you test if the issue persists in 0.20.1? If so, then cloudpickle doesn't help.

fx86 · 2018-12-20T05:53:22Z

I installed scikit-learn 0.20.1, and this code seems to run fine without loading the custom scorer from an external module:

import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def rmse_cv(y_true, y_pred):
    # Exponentiate both arrays before computing the RMSE
    # (this assumes the targets are log-transformed).
    assert len(y_true) == len(y_pred)
    y_pred = np.exp(y_pred)
    y_true = np.exp(y_true)
    return np.sqrt(mean_squared_error(y_true, y_pred))


rkfold = RepeatedKFold(n_splits=5, n_repeats=5)

url = 'https://github.com/GinoWoz1/Learnings/raw/master/'

X_train = pd.read_csv(url + 'X_trainGA.csv', index_col='Unnamed: 0')
y_train = pd.read_csv(url + 'y_trainGA.csv', header=None, index_col=0)

X_train.rename(columns={'Constant Term': 'tax'}, inplace=True)

elnet_final = ElasticNet()

elnet_pipe = Pipeline([('std', StandardScaler()),
                       ('elnet', elnet_final)])

# The custom scorer is defined in this script, not imported from a module,
# and is still used with all available cores (n_jobs=-1).
cross_val = cross_val_score(elnet_pipe,
                            X_train,
                            y_train,
                            scoring=make_scorer(rmse_cv, greater_is_better=False),
                            cv=rkfold,
                            n_jobs=-1)

albertcthomas · 2018-12-21T11:16:07Z

So now that joblib is using cloudpickle, I think it should work out of the box in most cases (if the loky backend is used). Instead of the example, we could add a remark in the doc saying that if a custom scoring function is used with n_jobs > 1, it must be serializable by joblib, and point to the section of the joblib documentation that explains how serialization is done.

fx86 · 2018-12-26T07:15:10Z

Thanks, @albertcthomas.

@amueller @adrinjalali, any comments?

jnothman

I think you need an example of importing.

doc/modules/model_evaluation.rst

jnothman

Now are we sure this is necessary with current joblib?

doc/modules/model_evaluation.rst

fx86 · 2019-01-02T13:23:07Z

@jnothman The new loky backend takes care of serialising interactively defined functions, but I'm not sure whether older versions of sklearn will work with this new backend. Can I suggest this to be kept for reasons of backward compatibility?

albertcthomas · 2019-01-02T13:54:29Z

> Can I suggest this to be kept for reasons of backward compatibility?

I think it would be better to say that this example is useful because backends other than loky (multiprocessing, custom backends) can be used.

fx86 · 2019-01-02T19:00:11Z

Roger that, @albertcthomas.

albertcthomas · 2019-01-02T19:03:36Z

Well, it would be nice if someone could confirm that this is the expected behavior both when loky is used and when a different backend is used.

doc/modules/model_evaluation.rst

albertcthomas

See my two comments below. Also, as rmse is not defined anywhere in the example and the example is not run, what do you think of using custom_scoring_function or custom_scorer instead of rmse?

doc/modules/model_evaluation.rst

albertcthomas · 2019-01-14T09:04:00Z

> Regarding the rest: rmse is only a placeholder name for a custom function which is not available directly as a scorer in sklearn.

IMO rmse would be fine if it were defined in the example. I agree that most users will know that rmse stands for root mean square error, but explicit is better than implicit.

fx86 · 2019-01-14T09:35:18Z

> > Regarding the rest: rmse is only a placeholder name for a custom function which is not available directly as a scorer in sklearn.
>
> IMO rmse would be fine if it were defined in the example. I agree that most users will know that rmse stands for root mean square error, but explicit is better than implicit.

We could probably keep an abstract function name, like `custom_scoring_function`?

albertcthomas · 2019-01-14T12:21:04Z

> We could probably keep an abstract function name, like `custom_scoring_function`?

Yes

albertcthomas

Thanks @fx86! Almost there :)

To make things even clearer I would also suggest putting your addition in a Notes section with a well-chosen title such as

.. note:: **Using custom scorers in functions where `n_jobs > 1`**

doc/modules/model_evaluation.rst

albertcthomas

One last comment. Otherwise LGTM! Thanks @fx86

doc/modules/model_evaluation.rst

jnothman

I doubt this will be a frequent issue with loky/cloudpickle, but I'm okay to merge. Thanks @fx86

This reverts commit d551ff2.

Adds doc, for custom scorer parallelisation

43980ff

adrinjalali reviewed Dec 18, 2018
