Skip to content

[MRG] Adds documentation for parallelisation of custom scorer #12813

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Jan 15, 2019
Merged

[MRG] Adds documentation for parallelisation of custom scorer #12813

merged 12 commits into from
Jan 15, 2019

Conversation

fx86
Copy link
Contributor

@fx86 fx86 commented Dec 18, 2018

Reference Issues/PRs

Adds documentation on parallelisation of custom scorers. Without importing from a module,
joblib is unable to run custom scoring functions in parallel.

Prompted from this discussion - #10054 (comment)

ping @jnothman @amueller

@albertcthomas
Copy link
Contributor

albertcthomas commented Dec 18, 2018

I know this is a restriction if pickle is used but is this still needed now that joblib is using cloudpickle?

@fx86
Copy link
Contributor Author

fx86 commented Dec 19, 2018

I know this is a restriction if pickle is used but is this still needed now that joblib is using cloudpickle?

@albertcthomas would it make sense to add this for pre-cloudpickle versions of sklearn ? Could you let me know which version onwards has cloudpickle ?

@albertcthomas
Copy link
Contributor

I was just wondering. It might be relevant even with cloudpickle. cc @tomMoral

@amueller
Copy link
Member

Thanks. Can you test if the issue persists in 0.20.1? If so, then cloudpickle doesn't help.

@fx86
Copy link
Contributor Author

fx86 commented Dec 20, 2018

Installed 0.20.1 version of scikit-learn and this code seems to run fine without loading the custom scorer from an external module.

from sklearn.model_selection import cross_val_score, RepeatedKFold
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, mean_squared_error

def rmse_cv(y_true, y_pred) :
	assert len(y_true) == len(y_pred)
	y_pred = np.exp(y_pred)
	y_true = np.exp(y_true)
	return np.sqrt(mean_squared_error(y_true,y_pred))


rkfold = RepeatedKFold(n_splits=5,n_repeats=5)

url = 'https://github.com/GinoWoz1/Learnings/raw/master/'

X_train = pd.read_csv(url + 'X_trainGA.csv',index_col= 'Unnamed: 0')
y_train = pd.read_csv(url +'y_trainGA.csv',header=None,index_col=0)

X_train.rename(columns={'Constant Term':'tax'},inplace=True)

elnet_final = ElasticNet()

elnet_pipe = Pipeline([('std',StandardScaler()),
('elnet',elnet_final)])

cross_val = cross_val_score(elnet_pipe,
						X_train,
						y_train,
						scoring=make_scorer(rmse_cv, greater_is_better=False),
						cv=rkfold,
						n_jobs=-1)

@albertcthomas
Copy link
Contributor

So now that joblib is using cloudpickle I think that it should work out of the box for most of the cases (if the loky backend is used). Instead of the example we can maybe add a remark in the doc saying that if a custom scoring function is used with n_jobs >1 then it must be serializable by joblib and we can point to the joblib documentation section explaining how serialization is done

@fx86
Copy link
Contributor Author

fx86 commented Dec 26, 2018

Thanks, @albertcthomas.

@amueller @adrinjalali - comments ?

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need an example of importing.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now are we sure this is necessary with current joblib?

@fx86
Copy link
Contributor Author

fx86 commented Jan 2, 2019

@jnothman The new Loky backend takes care of serialising the interactively-defined functions, but I'm not sure if older versions of sklearn will work with this new backend. Can I suggest this to be kept for reasons of backward compatibility ?

@albertcthomas
Copy link
Contributor

albertcthomas commented Jan 2, 2019

Can I suggest this to be kept for reasons of backward compatibility ?

I think it would be better to say that this example is useful as other backends than loky (multiprocessing, custom backends) can be used.

@fx86
Copy link
Contributor Author

fx86 commented Jan 2, 2019

Roger that, @albertcthomas.

@albertcthomas
Copy link
Contributor

Well it would be nice if someone could confirm that this is the expected behavior when loky is used and when a different backend is used

Copy link
Contributor

@albertcthomas albertcthomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my two comments below. Also as rmse is not defined anywhere in the example and as the example is not run, why do you think of using custom_scoring_function or custom_scorer instead of rmse?

@albertcthomas
Copy link
Contributor

Regarding the rest: rmse is only a placeholder-name for a custom function which is
not available as a custom scorer via sklearn, directly.

IMO rmse would be fine if it was defined in the example and I agree that most of the users will know that rmse stands for root mean square error but better explicit than implicit.

@fx86
Copy link
Contributor Author

fx86 commented Jan 14, 2019

Regarding the rest: rmse is only a placeholder-name for a custom function which is
not available as a custom scorer via sklearn, directly.

IMO rmse would be fine if it was defined in the example and I agree that most of the users will know that rmse stands for root mean square error but better explicit than implicit.

We could keep an abstract function name, like 'custom_scoring_function`, probably ?

@albertcthomas
Copy link
Contributor

albertcthomas commented Jan 14, 2019

We could keep an abstract function name, like 'custom_scoring_function`, probably ?

Yes

Copy link
Contributor

@albertcthomas albertcthomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @fx86! Almost there :)

To make things even clearer I would also suggest putting your addition in a Notes section with a well-chosen title such as

.. notes:: **Using custom scorers in functions where `n_jobs > 1`**

Copy link
Contributor

@albertcthomas albertcthomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last comment. Otherwise LGTM! Thanks @fx86

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt this will be a frequent issue with loky/cloudpickle, but I'm okay to merge. Thanks @fx86

@jnothman jnothman merged commit 9928713 into scikit-learn:master Jan 15, 2019
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Feb 19, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants