Skip to content

Conversation

jnothman
Copy link
Member

@jnothman jnothman commented Oct 11, 2018

as joblib 0.12 may introduce issues for some users, we should continue to support 0.11, at least in 0.20.X. (Let me know if you think I should base this on that branch rather than master.)

This adds a travis run with older joblib.

@jnothman jnothman added this to the 0.20.1 milestone Oct 11, 2018
@rth
Copy link
Member

rth commented Oct 11, 2018

I also think it's an important use case. @ogrisel mentioned that the main technical difficulty is that the following options (currently used) are not supported in joblib 0.11,

  • Parallel(..., n_jobs=None) extensively used, which could fall back to n_jobs=1 on 0.11. Backporting n_jobs=None to joblib 0.11.1 joblib/joblib#786 could be a way to make supporting this less painful from the scikit-learn perspective.
  • Parallel(.., prefer='threads') is currently only used in a few places and should be fairly to straightforward to make conditional depending on the joblib version.

I think it would make sense to start treating joblib as dependency that it is. Also related conda-forge/scikit-learn-feedstock#75

@rth
Copy link
Member

rth commented Oct 13, 2018

@jnothman I'll push a few commits to at least have an environment working so we can see what is the situation with joblib 0.11 support. Please don't force push on top :)

There was an install issue because joblib 0.11 is not built on conda for python 3.4. I added a python 3.5 build to the matrix, since anyway it looks like we are not testing it in any of the CI. Not sure if we want to keep it, but it can be a temporary solution to see where things are in this PR.

@jnothman
Copy link
Member Author

Thanks for that @rth. (And thanks for pushing on 0.20.1 issues.)

require='sharedmem' is currently breaking this.

@rth
Copy link
Member

rth commented Oct 16, 2018

require='sharedmem' is currently breaking this.

hmm it's strange that I wasn't seeing that locally. We can probably generalize,

Parallel(...,
         **_joblib_parallel_params(prefer='threads', require='sharedmem'))

and map those for joblib 0.11 as needed. Another thing I don't understand is why we are not seeing failures due to the new default n_jobs=None (haven't looked at it properly yet)..

@amueller
Copy link
Member

We still support 3.4 in 0.20.1.

@rth
Copy link
Member

rth commented Oct 16, 2018

We still support 3.4 in 0.20.1.

We do -- this adds a 3.5 build, doesn't affect the existing 3.4 one.

@amueller
Copy link
Member

ah, right, sorry.
I have a couple of deadlines this week and haven't really been able to catch up with all the joblib stuff that's happening.

@sklearn-lgtm
Copy link

This pull request introduces 1 alert when merging 1621987 into a1d0e96 - view on LGTM.com

new alerts:

  • 1 for Unused import

Comment posted by LGTM.com

@rth
Copy link
Member

rth commented Oct 17, 2018

So CI is green, and n_jobs=None might not need anything special since it appears to work with joblib 0.11 joblib/joblib#786 (comment).

@rth
Copy link
Member

rth commented Oct 17, 2018

@lesteve Would you be able to have a look at the diff (particularly in utils.fixes) to see if I got the mapping of parameters between joblib 0.11 and 0.12+ right.. Thanks!

Copy link
Member Author

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utils.fixes looks right as long as prefer!='processes' when requires='sharedmem'... I don't think we need to handle this case but an assert wouldn't hurt.

But I'm hardly the joblib expert around here.

@rth
Copy link
Member

rth commented Oct 17, 2018

utils.fixes looks right as long as prefer!='processes' when requires='sharedmem'... I don't think we need to handle this case but an assert wouldn't hurt.

So currently, if prefer='processes' and requires='sharedmem' with joblib 0.11 it will select backend='threading', as that's how I understood the requires docstring,

require: ‘sharedmem’ or None, default None
Hard constraint to select the backend. If set to ‘sharedmem’, the selected backend will be single-host and thread-based even if the user asked for a non-thread based backend with parallel_backend.

but a confirmation from someone more familiar with joblib would be good to have maybe Loic or @tomMoral

Copy link
Member Author

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @rth. There's no approve button for me to press... It needs a what's new, and perhaps we should state supported versions where we talk about the SITE_JOBLIB setting

Copy link
Member

@lesteve lesteve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mapping from prefer+require to backend looks right to me.

def _joblib_parallel_args(prefer=None, require=None):
"""Set joblib.Parallel arguments in a compatible way for 0.11 and 0.12+"""
from . import _joblib
if prefer not in ['threads', 'processes', None]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move this argument checks to the case joblib <= 0.11 (in the 0.12+ case just forward the arguments without checks). Also you may want to check backend too.

raise NotImplementedError('prefer=%s is not supported'
% prefer)
args = {}
if _joblib.__version__ >= LooseVersion('0.12.0'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend using LooseVersion('0.12') (i.e. only major.minor). There are a few caveats with using LooseVersion with PEP0444 versions. In my experience, the fewer numbers you use the better off you are (for example):

In [1]: LooseVersion('0.12.dev') < LooseVersion('0.12.0')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-16-f2e012a14c0d> in <module>()
----> 1 LooseVersion('0.12.dev') < LooseVersion('0.12.0')

~/miniconda3/lib/python3.6/distutils/version.py in __lt__(self, other)
     50 
     51     def __lt__(self, other):
---> 52         c = self._cmp(other)
     53         if c is NotImplemented:
     54             return c

~/miniconda3/lib/python3.6/distutils/version.py in _cmp(self, other)
    335         if self.version == other.version:
    336             return 0
--> 337         if self.version < other.version:
    338             return -1
    339         if self.version > other.version:

TypeError: '<' not supported between instances of 'str' and 'int'
In [1]: LooseVersion('0.12.dev') < LooseVersion('0.12.0')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-16-f2e012a14c0d> in <module>()
----> 1 LooseVersion('0.12.dev') < LooseVersion('0.12.0')

~/miniconda3/lib/python3.6/distutils/version.py in __lt__(self, other)
     50 
     51     def __lt__(self, other):
---> 52         c = self._cmp(other)
     53         if c is NotImplemented:
     54             return c

~/miniconda3/lib/python3.6/distutils/version.py in _cmp(self, other)
    335         if self.version == other.version:
    336             return 0
--> 337         if self.version < other.version:
    338             return -1
    339         if self.version > other.version:

TypeError: '<' not supported between instances of 'str' and 'int'

@lesteve
Copy link
Member

lesteve commented Oct 18, 2018

I haven't followed the situation very closely about the joblib problems that motivate this change, but I have to admit that in my mind SKLEARN_SITE_JOBLIB is a developer thing: basically you are on your own if you use it and you have no guarantee that it will work. In my mind it was meant more to use with joblib development versions rather than allow to go back to older joblib versions.

@jnothman
Copy link
Member Author

jnothman commented Oct 18, 2018 via email

if prefer is not None:
args['prefer'] = prefer
if require is not None:
args['require'] = require
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not simply args = {'prefer': prefer, 'require': require'}?
The default value for both argument is None so this check is not useful.

To avoid having to re-writte this part if we add some arguments to joblib.Parallel, I would suggest a function such as:

def _joblib_parallel_args(**kwargs):
    if _joblib.__version__ >= LooseVersion('0.12'):
        return kwargs
    
    # If version is earlier than 0.12, mock the prefer and request arguments
    args = {'backend': None}
    prefer = kwargs.get('prefer')
    if prefer not in ['threads', 'processes', None]:
        raise NotImplementedError('prefer=%s is not supported'
                                  % prefer)
    args['backend'] = {'threads': 'threading',
                       'processes': 'multiprocessing'
                       None: None}[prefer]

    if kwargs.get('sharedmem', False):
        args['backend'] = 'threading'
    return args

@rth
Copy link
Member

rth commented Oct 22, 2018

Thanks for the reviews!

I adapted a bit @tomMoral 's version to not make assumptions about default values of parameters in joblib and explicitly fail on unhanded parameters.

Also added tests that should check all cases in sklearn.utils.fixes._joblib_parallel_args, independently of the joblib version installed.

This should address remaining comments, I think.

@jnothman
Copy link
Member Author

We should add something to what's new...

@rth
Copy link
Member

rth commented Oct 23, 2018

Added a what's new..

@rth
Copy link
Member

rth commented Nov 6, 2018

Any other comments on this? I'll count Joel as +1, but we need a second review.

Maybe @amueller ? @lesteve I think I addressed all your comments. Thanks!

@amueller
Copy link
Member

amueller commented Nov 6, 2018

I'll try today or tomorrow maybe?

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a pass and it looks great. +1 for merge when CI is green.

Copy link
Contributor

@tomMoral tomMoral left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Just a small nitpick on the name of the function.

@@ -332,3 +334,51 @@ def _object_dtype_isnan(X):
from collections import Iterable as _Iterable # noqa
from collections import Mapping as _Mapping # noqa
from collections import Sized as _Sized # noqa


def _joblib_parallel_args(**kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just change it to _joblib_parallel_kwargs as you only set kwargs in it?

It would be clearer.

@rth
Copy link
Member

rth commented Nov 6, 2018

Thanks for the reviews @ogrisel and @tomMoral !

Pushed another commit that fixes some issues with the merge conflict (flake8 and a new test that was added meanwhile to fix CI). CI is green now.

Maybe just change it to _joblib_parallel_kwargs as you only set kwargs in it?

I kind of prefer the name _joblib_parallel_args as it's more general: it maps the arguments for joblib parallel, the input is indeed kwargs only and the output is a dict. Currently the output is passed to kwargs, but it doesn't strictly have to be.
Also I find the signature _joblib_parallel_kwargs(**kwargs) a bit confusing as it's unclear if the kwargs in the name have something to do with the kwargs parameters.

Merging as it is, it's a private function, we can always change it later.

@rth rth merged commit 1128094 into scikit-learn:master Nov 6, 2018
@rth rth mentioned this pull request Nov 6, 2018
4 tasks
thoo added a commit to thoo/scikit-learn that referenced this pull request Nov 7, 2018
* upstream/master:
  joblib 0.13.0 (scikit-learn#12531)
  DOC tweak KMeans regarding cluster_centers_ convergence (scikit-learn#12537)
  DOC (0.21) Make sure plot_tree docs are generated and fix link in whatsnew (scikit-learn#12533)
  ALL Add HashingVectorizer to __all__ (scikit-learn#12534)
  BLD we should ensure continued support for joblib 0.11 (scikit-learn#12350)
  fix typo in whatsnew
  Fix dead link to numpydoc (scikit-learn#12532)
  [MRG] Fix segfault in AgglomerativeClustering with read-only mmaps (scikit-learn#12485)
  MNT (0.21) OPTiCS change the default `algorithm` to `auto` (scikit-learn#12529)
  FIX SkLearn `.score()` method generating error with Dask DataFrames (scikit-learn#12462)
  MNT KBinsDiscretizer.transform should not mutate _encoder (scikit-learn#12514)
thoo added a commit to thoo/scikit-learn that referenced this pull request Nov 9, 2018
…ybutton

* upstream/master:
  FIX YeoJohnson transform lambda bounds (scikit-learn#12522)
  [MRG] Additional Warnings in case OpenML auto-detected a problem with dataset  (scikit-learn#12541)
  ENH Prefer threads for IsolationForest (scikit-learn#12543)
  joblib 0.13.0 (scikit-learn#12531)
  DOC tweak KMeans regarding cluster_centers_ convergence (scikit-learn#12537)
  DOC (0.21) Make sure plot_tree docs are generated and fix link in whatsnew (scikit-learn#12533)
  ALL Add HashingVectorizer to __all__ (scikit-learn#12534)
  BLD we should ensure continued support for joblib 0.11 (scikit-learn#12350)
  fix typo in whatsnew
  Fix dead link to numpydoc (scikit-learn#12532)
  [MRG] Fix segfault in AgglomerativeClustering with read-only mmaps (scikit-learn#12485)
  MNT (0.21) OPTiCS change the default `algorithm` to `auto` (scikit-learn#12529)
  FIX SkLearn `.score()` method generating error with Dask DataFrames (scikit-learn#12462)
  MNT KBinsDiscretizer.transform should not mutate _encoder (scikit-learn#12514)
thoo pushed a commit to thoo/scikit-learn that referenced this pull request Nov 14, 2018
thoo pushed a commit to thoo/scikit-learn that referenced this pull request Nov 14, 2018
jnothman added a commit to jnothman/scikit-learn that referenced this pull request Nov 14, 2018
jnothman added a commit to jnothman/scikit-learn that referenced this pull request Nov 14, 2018
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants