TST Actually compare feature importances when computed in parallel #25445


Closed

Conversation

@lesteve (Member) commented Jan 20, 2023

What does this implement/fix? Explain your changes.

It looks like the test was meant to compare feature importances computed with n_jobs=1 and n_jobs=2, but we were never actually calling .fit ...

Unless I am very tired and missing something subtle, of course ...
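
For illustration, here is a minimal sketch of the comparison the test was presumably meant to perform (the dataset and estimator parameters below are made up for this sketch, not copied from the actual test):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=10, random_state=0)

importances = []
for n_jobs in (1, 2):
    forest = RandomForestClassifier(n_estimators=10, n_jobs=n_jobs, random_state=0)
    forest.fit(X, y)  # the call that was missing from the original test
    importances.append(forest.feature_importances_)

# With a fixed random_state, parallelism must not change the result.
np.testing.assert_allclose(importances[0], importances[1])
```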

@lesteve added the No Changelog Needed and Quick Review labels on Jan 20, 2023
@lesteve (Member, Author) commented Jan 20, 2023

There is a failure on ARM that does not seem related to this PR ...

__________________________ test_logreg_l1_sparse_data __________________________
[gw3] linux -- Python 3.9.15 /miniconda/envs/testenv/bin/python3.9

    def test_logreg_l1_sparse_data():
        # Because liblinear penalizes the intercept and saga does not, we do not
        # fit the intercept to make it possible to compare the coefficients of
        # the two models at convergence.
        rng = np.random.RandomState(42)
        n_samples = 50
        X, y = make_classification(n_samples=n_samples, n_features=20, random_state=0)
        X_noise = rng.normal(scale=0.1, size=(n_samples, 3))
        X_constant = np.zeros(shape=(n_samples, 2))
        X = np.concatenate((X, X_noise, X_constant), axis=1)
        X[X < 1] = 0
        X = sparse.csr_matrix(X)
    
        lr_liblinear = LogisticRegression(
            penalty="l1",
            C=1.0,
            solver="liblinear",
            fit_intercept=False,
            multi_class="ovr",
            tol=1e-10,
        )
>       lr_liblinear.fit(X, y)

/miniconda/envs/testenv/lib/python3.9/site-packages/sklearn/linear_model/tests/test_logistic.py:1018: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/miniconda/envs/testenv/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:1216: in fit
    self.coef_, self.intercept_, self.n_iter_ = _fit_liblinear(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

X = <50x25 sparse matrix of type '<class 'numpy.float64'>'
	with 149 stored elements in Compressed Sparse Row format>
y = array([1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0,
       0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0,
       1, 1, 1, 1, 0, 0])
C = 1.0, fit_intercept = False, intercept_scaling = 1, class_weight = None
penalty = 'l1', dual = False, verbose = 0, max_iter = 100, tol = 1e-10
random_state = None, multi_class = 'ovr', loss = 'logistic_regression'
epsilon = 0.1
sample_weight = array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

    def _fit_liblinear(
        X,
        y,
        C,
        fit_intercept,
        intercept_scaling,
        class_weight,
        penalty,
        dual,
        verbose,
        max_iter,
        tol,
        random_state=None,
        multi_class="ovr",
        loss="logistic_regression",
        epsilon=0.1,
        sample_weight=None,
    ):
        if loss not in ["epsilon_insensitive", "squared_epsilon_insensitive"]:
            enc = LabelEncoder()
            y_ind = enc.fit_transform(y)
            classes_ = enc.classes_
            if len(classes_) < 2:
                raise ValueError(
                    "This solver needs samples of at least 2 classes"
                    " in the data, but the data contains only one"
                    " class: %r"
                    % classes_[0]
                )
    
            class_weight_ = compute_class_weight(class_weight, classes=classes_, y=y)
        else:
            class_weight_ = np.empty(0, dtype=np.float64)
            y_ind = y
        liblinear.set_verbosity_wrap(verbose)
        rnd = check_random_state(random_state)
        if verbose:
            print("[LibLinear]", end="")
    
        # LinearSVC breaks when intercept_scaling is <= 0
        bias = -1.0
        if fit_intercept:
            if intercept_scaling <= 0:
                raise ValueError(
                    "Intercept scaling is %r but needs to be greater "
                    "than 0. To disable fitting an intercept,"
                    " set fit_intercept=False." % intercept_scaling
                )
            else:
                bias = intercept_scaling
    
        libsvm.set_verbosity_wrap(verbose)
        libsvm_sparse.set_verbosity_wrap(verbose)
        liblinear.set_verbosity_wrap(verbose)
    
        # Liblinear doesn't support 64bit sparse matrix indices yet
        if sp.issparse(X):
            _check_large_sparse(X)
    
        # LibLinear wants targets as doubles, even for classification
        y_ind = np.asarray(y_ind, dtype=np.float64).ravel()
        y_ind = np.require(y_ind, requirements="W")
    
        sample_weight = _check_sample_weight(sample_weight, X, dtype=np.float64)
    
        solver_type = _get_liblinear_solver_type(multi_class, penalty, loss, dual)
        raw_coef_, n_iter_ = liblinear.train_wrap(
            X,
            y_ind,
            sp.isspmatrix(X),
            solver_type,
            tol,
            bias,
            C,
            class_weight_,
            max_iter,
            rnd.randint(np.iinfo("i").max),
            epsilon,
            sample_weight,
        )
        # Regarding rnd.randint(..) in the above signature:
        # seed for srand in range [0..INT_MAX); due to limitations in Numpy
        # on 32-bit platforms, we can't get to the UINT_MAX limit that
        # srand supports
        n_iter_max = max(n_iter_)
        if n_iter_max >= max_iter:
>           warnings.warn(
                "Liblinear failed to converge, increase the number of iterations.",
                ConvergenceWarning,
            )
E           sklearn.exceptions.ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.

/miniconda/envs/testenv/lib/python3.9/site-packages/sklearn/svm/_base.py:1244: ConvergenceWarning
=============================== warnings summary ===============================

@thomasjpfan (Member) left a comment


Thanks for the fix! LGTM

@thomasjpfan added the Waiting for Second Reviewer label on Jan 20, 2023
@thomasjpfan (Member) commented

As for the random ARM failure, I opened #25446 to stabilize the tests in test_logistic.
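
(The actual diff of #25446 is not reproduced here; as a hypothetical sketch, one common way to stabilize such a test is to tolerate the occasional liblinear ConvergenceWarning explicitly, or to raise max_iter until the solver reliably converges:)

```python
import pytest

# Hypothetical sketch, not necessarily what #25446 does: keep the flaky
# liblinear ConvergenceWarning from being promoted to an error by the
# suite's warning filters, so the coefficient assertions decide
# pass/fail rather than the iteration count.
@pytest.mark.filterwarnings("ignore::sklearn.exceptions.ConvergenceWarning")
def test_logreg_l1_sparse_data():
    ...  # test body unchanged
```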

@lesteve (Member, Author) commented Jan 23, 2023

Actually, from Olivier's comment in another issue, feature_importances_ is a property that computes the importances in parallel when it is accessed, so the existing test does exercise the parallel code path after all ... closing this one.

The relevant code is the feature_importances_ property in sklearn/ensemble/_forest.py.
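
For context, the property has roughly this shape (a paraphrase from memory, not a verbatim copy of sklearn/ensemble/_forest.py): merely reading feature_importances_ fans out over the fitted trees using the forest's own n_jobs.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.utils.validation import check_is_fitted


class BaseForest:  # illustrative stand-in, not the real class definition
    @property
    def feature_importances_(self):
        """Impurity-based importances averaged over the fitted trees."""
        check_is_fitted(self)
        # Per-tree importances are gathered in parallel with the forest's
        # own n_jobs, so simply reading this property exercises the
        # parallel code path; no second fit() is needed.
        all_importances = Parallel(n_jobs=self.n_jobs, prefer="threads")(
            delayed(getattr)(tree, "feature_importances_")
            for tree in self.estimators_
            if tree.tree_.node_count > 1
        )
        if not all_importances:
            return np.zeros(self.n_features_in_, dtype=np.float64)
        all_importances = np.mean(all_importances, axis=0, dtype=np.float64)
        return all_importances / np.sum(all_importances)
```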

@lesteve closed this on Jan 23, 2023
Labels
module:ensemble, No Changelog Needed, Quick Review, Waiting for Second Reviewer