
Conversation

@lesteve (Member) commented Jan 5, 2023

What does this implement/fix? Explain your changes.

With Cython 3.0 approaching, it would be great to test scipy-dev against the latest Cython sources rather than an alpha that we don't really know what it corresponds to. For example, the latest Cython alpha is from July 31, 2022: https://pypi.org/project/Cython/#history.

Any other comments?

This adds 2-3 minutes to the build; I don't think this is an issue since this job mostly runs in scheduled builds and a few specific PRs.

@lesteve added the Build / CI and Quick Review labels Jan 5, 2023
@glemaitre (Member) commented:

Since we already do the same for Pillow and joblib, I think that it makes sense.
Maybe we can switch back if Cython speeds up the release procedure once 3.0 is out.

@jjerphan (Member) left a comment:

LGTM given that the CI passes.

Thank you, @lesteve.

@jjerphan changed the title from "CI: use latest Cython sources in scipy-dev build" to "CI Use latest Cython sources in scipy-dev build" Jan 5, 2023
@lesteve (Member, Author) commented Jan 5, 2023

Turns out there are plenty of compilation errors with the latest Cython development version 😢. See https://dev.azure.com/scikit-learn/d9c6f05f-7f58-4a17-b143-2a4e7ff015af/_build/results?buildId=50541&view=logs&jobId=dfe99b15-50db-5d7b-b1e9-4105c42527cf.

I'll try to have a closer look at some point.

@lesteve added the cython label Jan 5, 2023
@glemaitre (Member) commented:

It seems that we were relying on some casts that are not allowed anymore ;)

@glemaitre (Member) commented Jan 5, 2023

And we have some new warnings :) (they always seem to be about the nogil keyword).

@jjerphan (Member) commented Jan 5, 2023

Starting with Cython 3.0, exceptions are propagated to Python by default when C interfaces are not noexcept-qualified, hence the builds screaming here.

Still, cython/cython#5094 adds a flag to keep the previous behavior, easing the migration to Cython 3.0 (for some context, see cython/cython#5088).
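
For illustration only (a minimal sketch, not scikit-learn code), this is roughly what the new default means for a cdef function used from nogil code:

    # Under Cython 3 defaults, a cdef function with a C return type and no
    # exception declaration propagates Python exceptions, which needs the GIL
    # and is what triggers the new warnings in nogil code.
    cdef double _squared(double x) nogil:
        return x * x

    # Adding `noexcept` restores the 0.29.x behavior for this one function
    # (exceptions are printed and swallowed instead of propagated); the
    # `legacy_implicit_noexcept` directive from cython/cython#5094 applies the
    # old behavior globally without touching each declaration.
    cdef double _squared_legacy(double x) noexcept nogil:
        return x * x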

@lesteve removed the Quick Review label Jan 5, 2023

@lesteve (Member, Author) commented Jan 5, 2023

Not sure whether the change in the result type of 2 ** exponent is an expected change in Cython 3.

This program does not compile with the latest development Cython:

cdef int exponent = 2
cdef int result

result = 2 ** exponent

with the following error:

Error compiling Cython file:
------------------------------------------------------------
...
cdef int exponent = 2
cdef int result

result = 2 ** exponent
           ^
------------------------------------------------------------

/tmp/test.pyx:4:11: Cannot assign type 'double' to 'int'

This compiles fine with the latest Cython release (0.29.32).
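
For what it's worth, an explicit cast is enough to make it compile again with the development version (just a workaround sketch, not necessarily the fix we want in scikit-learn):

    cdef int exponent = 2
    cdef int result

    # The power expression is now typed as double, so truncate it back
    # explicitly to get an int result.
    result = <int>(2 ** exponent)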

@jjerphan (Member) commented Jan 6, 2023

From cython/cython#5016:

<int>**<int> now usually returns a double (unless it knows the exponent is non-negative)
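
If I read that note correctly (an untested assumption on my part), keeping the exponent provably non-negative, e.g. a literal or an unsigned C type, should preserve an integral result:

    cdef unsigned int uexponent = 2
    cdef int result

    result = 2 ** 3          # literal non-negative exponent: stays integral
    result = 2 ** uexponent  # unsigned exponent: assumed to stay integral too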

@lesteve (Member, Author) commented Jan 6, 2023

Nice, thanks! It still seems a bit weird that the double is not cast to int automatically. Do you agree?

In other words, in the following snippet ivalue = fvalue works fine, but ivalue = 2 ** exponent yields the compilation error Cannot assign type 'double' to 'int'.

cdef int exponent = 2
cdef int ivalue
cdef fvalue = 2.

ivalue = fvalue

ivalue = 2 ** exponent

@betatim (Member) commented Jan 6, 2023

I don't know what the default type of a cdef'ed variable is, but if you explicitly type it as cdef double fvalue = 2. you get the same "can't assign type double to int" error message.

I guess not allowing doubles/floats to be assigned to an int makes some sense. You lose precision (3.141 becomes 3) but also not every real number (as in calligraphic R) can be represented by a double/float. So I guess if you are pedantic then you should ask the user to explicitly state "yes, I understand the danger" when performing the conversions.
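
A sketch of the explicit version of that conversion (assuming truncation really is what the code wants):

    cdef double fvalue = 2.
    cdef int ivalue

    # Explicit C cast: states that truncating the double to an int is intended.
    ivalue = <int>fvalue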

@lesteve (Member, Author) commented Jan 6, 2023

Oops indeed I forgot the double in the cdef ...

OK let's say it kind of makes sense then.

@lesteve (Member, Author) commented Jan 9, 2023

Not sure about the linux-arm64 failure: a ConvergenceWarning in sklearn/linear_model/tests/test_logistic.py::test_multinomial_binary_probabilities (see the build log).

This does not seem related to the changes in this PR. I restarted the CI to see whether it happens again.

Full error
____________________ test_multinomial_binary_probabilities _____________________
[gw0] linux -- Python 3.9.15 /home/circleci/miniconda/envs/testenv/bin/python3.9

    def test_multinomial_binary_probabilities():
        # Test multinomial LR gives expected probabilities based on the
        # decision function, for a binary problem.
        X, y = make_classification()
        clf = LogisticRegression(multi_class="multinomial", solver="saga", tol=1e-3)
>       clf.fit(X, y)

/home/circleci/miniconda/envs/testenv/lib/python3.9/site-packages/sklearn/linear_model/tests/test_logistic.py:254: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/circleci/miniconda/envs/testenv/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:1291: in fit
    fold_coefs_ = Parallel(n_jobs=self.n_jobs, verbose=self.verbose, prefer=prefer)(
/home/circleci/miniconda/envs/testenv/lib/python3.9/site-packages/joblib/parallel.py:1085: in __call__
    if self.dispatch_one_batch(iterator):
/home/circleci/miniconda/envs/testenv/lib/python3.9/site-packages/joblib/parallel.py:901: in dispatch_one_batch
    self._dispatch(tasks)
/home/circleci/miniconda/envs/testenv/lib/python3.9/site-packages/joblib/parallel.py:819: in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
/home/circleci/miniconda/envs/testenv/lib/python3.9/site-packages/joblib/_parallel_backends.py:208: in apply_async
    result = ImmediateResult(func)
/home/circleci/miniconda/envs/testenv/lib/python3.9/site-packages/joblib/_parallel_backends.py:597: in __init__
    self.results = batch()
/home/circleci/miniconda/envs/testenv/lib/python3.9/site-packages/joblib/parallel.py:288: in __call__
    return [func(*args, **kwargs)
/home/circleci/miniconda/envs/testenv/lib/python3.9/site-packages/joblib/parallel.py:288: in <listcomp>
    return [func(*args, **kwargs)
/home/circleci/miniconda/envs/testenv/lib/python3.9/site-packages/sklearn/utils/fixes.py:130: in __call__
    return self.function(*args, **kwargs)
/home/circleci/miniconda/envs/testenv/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:524: in _logistic_regression_path
    w0, n_iter_i, warm_start_sag = sag_solver(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

X = array([[-5.07052023e-01, -2.90901696e-01,  1.30130372e+00, ...,
        -3.70734951e-01, -1.65279250e-01, -8.97548514e...  [ 1.72112777e-03,  1.17575582e+00,  2.39366727e+00, ...,
        -1.49670827e+00, -8.28092631e-01, -5.06782103e-01]])
y = array([1., 0., 1., 0., 0., 0., 0., 1., 1., 0., 1., 0., 1., 1., 0., 0., 0.,
       0., 1., 1., 0., 1., 1., 0., 1., 1., ...0., 0., 0., 1., 0., 1., 1., 0., 1., 1., 1., 1., 1.,
       0., 1., 1., 1., 1., 0., 1., 1., 1., 1., 0., 1., 1., 1., 0.])
sample_weight = array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., ...1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
loss = 'multinomial', alpha = 1.0, beta = 0.0, max_iter = 100, tol = 0.001
verbose = 0, random_state = RandomState(MT19937) at 0xFFFFA773AE40
check_input = False, max_squared_sum = 38.88976567607865
warm_start_mem = {'coef': array([[-0.30446823,  0.30446823],
       [-0.18628909,  0.18628909],
       [ 0.31009769, -0.31009769],
    ...0.0371788 ],
       [-0.0087767 ,  0.0087767 ],
       [-0.07712977,  0.07712977],
       [-0.43543505,  0.43543505]])}
is_saga = True

    def sag_solver(
        X,
        y,
        sample_weight=None,
        loss="log",
        alpha=1.0,
        beta=0.0,
        max_iter=1000,
        tol=0.001,
        verbose=0,
        random_state=None,
        check_input=True,
        max_squared_sum=None,
        warm_start_mem=None,
        is_saga=False,
    ):
        """SAG solver for Ridge and LogisticRegression.
    
        SAG stands for Stochastic Average Gradient: the gradient of the loss is
        estimated each sample at a time and the model is updated along the way with
        a constant learning rate.
    
        IMPORTANT NOTE: 'sag' solver converges faster on columns that are on the
        same scale. You can normalize the data by using
        sklearn.preprocessing.StandardScaler on your data before passing it to the
        fit method.
    
        This implementation works with data represented as dense numpy arrays or
        sparse scipy arrays of floating point values for the features. It will
        fit the data according to squared loss or log loss.
    
        The regularizer is a penalty added to the loss function that shrinks model
        parameters towards the zero vector using the squared euclidean norm L2.
    
        .. versionadded:: 0.17
    
        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            Training data.
    
        y : ndarray of shape (n_samples,)
            Target values. With loss='multinomial', y must be label encoded
            (see preprocessing.LabelEncoder).
    
        sample_weight : array-like of shape (n_samples,), default=None
            Weights applied to individual samples (1. for unweighted).
    
        loss : {'log', 'squared', 'multinomial'}, default='log'
            Loss function that will be optimized:
            -'log' is the binary logistic loss, as used in LogisticRegression.
            -'squared' is the squared loss, as used in Ridge.
            -'multinomial' is the multinomial logistic loss, as used in
             LogisticRegression.
    
            .. versionadded:: 0.18
               *loss='multinomial'*
    
        alpha : float, default=1.
            L2 regularization term in the objective function
            ``(0.5 * alpha * || W ||_F^2)``.
    
        beta : float, default=0.
            L1 regularization term in the objective function
            ``(beta * || W ||_1)``. Only applied if ``is_saga`` is set to True.
    
        max_iter : int, default=1000
            The max number of passes over the training data if the stopping
            criteria is not reached.
    
        tol : float, default=0.001
            The stopping criteria for the weights. The iterations will stop when
            max(change in weights) / max(weights) < tol.
    
        verbose : int, default=0
            The verbosity level.
    
        random_state : int, RandomState instance or None, default=None
            Used when shuffling the data. Pass an int for reproducible output
            across multiple function calls.
            See :term:`Glossary <random_state>`.
    
        check_input : bool, default=True
            If False, the input arrays X and y will not be checked.
    
        max_squared_sum : float, default=None
            Maximum squared sum of X over samples. If None, it will be computed,
            going through all the samples. The value should be precomputed
            to speed up cross validation.
    
        warm_start_mem : dict, default=None
            The initialization parameters used for warm starting. Warm starting is
            currently used in LogisticRegression but not in Ridge.
            It contains:
                - 'coef': the weight vector, with the intercept in last line
                    if the intercept is fitted.
                - 'gradient_memory': the scalar gradient for all seen samples.
                - 'sum_gradient': the sum of gradient over all seen samples,
                    for each feature.
                - 'intercept_sum_gradient': the sum of gradient over all seen
                    samples, for the intercept.
                - 'seen': array of boolean describing the seen samples.
                - 'num_seen': the number of seen samples.
    
        is_saga : bool, default=False
            Whether to use the SAGA algorithm or the SAG algorithm. SAGA behaves
            better in the first epochs, and allow for l1 regularisation.
    
        Returns
        -------
        coef_ : ndarray of shape (n_features,)
            Weight vector.
    
        n_iter_ : int
            The number of full pass on all samples.
    
        warm_start_mem : dict
            Contains a 'coef' key with the fitted result, and possibly the
            fitted intercept at the end of the array. Contains also other keys
            used for warm starting.
    
        Examples
        --------
        >>> import numpy as np
        >>> from sklearn import linear_model
        >>> n_samples, n_features = 10, 5
        >>> rng = np.random.RandomState(0)
        >>> X = rng.randn(n_samples, n_features)
        >>> y = rng.randn(n_samples)
        >>> clf = linear_model.Ridge(solver='sag')
        >>> clf.fit(X, y)
        Ridge(solver='sag')
    
        >>> X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
        >>> y = np.array([1, 1, 2, 2])
        >>> clf = linear_model.LogisticRegression(
        ...     solver='sag', multi_class='multinomial')
        >>> clf.fit(X, y)
        LogisticRegression(multi_class='multinomial', solver='sag')
    
        References
        ----------
        Schmidt, M., Roux, N. L., & Bach, F. (2013).
        Minimizing finite sums with the stochastic average gradient
        https://hal.inria.fr/hal-00860051/document
    
        :arxiv:`Defazio, A., Bach F. & Lacoste-Julien S. (2014).
        "SAGA: A Fast Incremental Gradient Method With Support
        for Non-Strongly Convex Composite Objectives" <1407.0202>`
    
        See Also
        --------
        Ridge, SGDRegressor, ElasticNet, Lasso, SVR,
        LogisticRegression, SGDClassifier, LinearSVC, Perceptron
        """
        if warm_start_mem is None:
            warm_start_mem = {}
        # Ridge default max_iter is None
        if max_iter is None:
            max_iter = 1000
    
        if check_input:
            _dtype = [np.float64, np.float32]
            X = check_array(X, dtype=_dtype, accept_sparse="csr", order="C")
            y = check_array(y, dtype=_dtype, ensure_2d=False, order="C")
    
        n_samples, n_features = X.shape[0], X.shape[1]
        # As in SGD, the alpha is scaled by n_samples.
        alpha_scaled = float(alpha) / n_samples
        beta_scaled = float(beta) / n_samples
    
        # if loss == 'multinomial', y should be label encoded.
        n_classes = int(y.max()) + 1 if loss == "multinomial" else 1
    
        # initialization
        sample_weight = _check_sample_weight(sample_weight, X, dtype=X.dtype)
    
        if "coef" in warm_start_mem.keys():
            coef_init = warm_start_mem["coef"]
        else:
            # assume fit_intercept is False
            coef_init = np.zeros((n_features, n_classes), dtype=X.dtype, order="C")
    
        # coef_init contains possibly the intercept_init at the end.
        # Note that Ridge centers the data before fitting, so fit_intercept=False.
        fit_intercept = coef_init.shape[0] == (n_features + 1)
        if fit_intercept:
            intercept_init = coef_init[-1, :]
            coef_init = coef_init[:-1, :]
        else:
            intercept_init = np.zeros(n_classes, dtype=X.dtype)
    
        if "intercept_sum_gradient" in warm_start_mem.keys():
            intercept_sum_gradient = warm_start_mem["intercept_sum_gradient"]
        else:
            intercept_sum_gradient = np.zeros(n_classes, dtype=X.dtype)
    
        if "gradient_memory" in warm_start_mem.keys():
            gradient_memory_init = warm_start_mem["gradient_memory"]
        else:
            gradient_memory_init = np.zeros(
                (n_samples, n_classes), dtype=X.dtype, order="C"
            )
        if "sum_gradient" in warm_start_mem.keys():
            sum_gradient_init = warm_start_mem["sum_gradient"]
        else:
            sum_gradient_init = np.zeros((n_features, n_classes), dtype=X.dtype, order="C")
    
        if "seen" in warm_start_mem.keys():
            seen_init = warm_start_mem["seen"]
        else:
            seen_init = np.zeros(n_samples, dtype=np.int32, order="C")
    
        if "num_seen" in warm_start_mem.keys():
            num_seen_init = warm_start_mem["num_seen"]
        else:
            num_seen_init = 0
    
        dataset, intercept_decay = make_dataset(X, y, sample_weight, random_state)
    
        if max_squared_sum is None:
            max_squared_sum = row_norms(X, squared=True).max()
        step_size = get_auto_step_size(
            max_squared_sum,
            alpha_scaled,
            loss,
            fit_intercept,
            n_samples=n_samples,
            is_saga=is_saga,
        )
        if step_size * alpha_scaled == 1:
            raise ZeroDivisionError(
                "Current sag implementation does not handle "
                "the case step_size * alpha_scaled == 1"
            )
    
        sag = sag64 if X.dtype == np.float64 else sag32
        num_seen, n_iter_ = sag(
            dataset,
            coef_init,
            intercept_init,
            n_samples,
            n_features,
            n_classes,
            tol,
            max_iter,
            loss,
            step_size,
            alpha_scaled,
            beta_scaled,
            sum_gradient_init,
            gradient_memory_init,
            seen_init,
            num_seen_init,
            fit_intercept,
            intercept_sum_gradient,
            intercept_decay,
            is_saga,
            verbose,
        )
    
        if n_iter_ == max_iter:
>           warnings.warn(
                "The max_iter was reached which means the coef_ did not converge",
                ConvergenceWarning,
            )
E           sklearn.exceptions.ConvergenceWarning: The max_iter was reached which means the coef_ did not converge

/home/circleci/miniconda/envs/testenv/lib/python3.9/site-packages/sklearn/linear_model/_sag.py:350: ConvergenceWarning
=============================== warnings summary ===============================

@lesteve (Member, Author) commented Jan 9, 2023

The linux-arm64 error did not happen this time 🤷

@thomasjpfan (Member) left a comment:

Thank you for the PR!

}

if not Cython.__version__.startswith("0."):
    compiler_directives["legacy_implicit_noexcept"] = True
Review comment (Member):

Given that this flag will be removed in the future, may we include a TODO comment here that links to cython/cython#5088 and states that we need to fix this?
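
One possible phrasing (a sketch of what such a comment could look like, not the exact wording that ended up being merged):

    # TODO: remove this branch (and the `legacy_implicit_noexcept` directive)
    # once Cython 3.0 final is out and scikit-learn requires Cython >= 3,
    # after the relevant cdef functions have been made explicitly `noexcept`.
    # See https://github.com/cython/cython/issues/5088 for context.
    if not Cython.__version__.startswith("0."):
        compiler_directives["legacy_implicit_noexcept"] = True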

"cdivision": True,
}

if not Cython.__version__.startswith("0."):
Review comment (Member):

This setting will not allow scikit-learn to build with any of Cython's alpha releases because legacy_implicit_noexcept has not been released in any of the alphas. I wanted to suggest this:

    if parse(Cython.__version__) >= parse("3.0.0a11"):

But the dev branch of Cython is still "3.0.0a11": https://github.com/cython/cython/blob/master/Cython/Shadow.py#L4, which is the same as the most recent PyPI alpha release.

I am okay with the compromise of only allowing scikit-learn to build with Cython 3 on the dev branch.
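
For reference, the version-gated alternative discussed above would look roughly like this (a sketch only, not what was merged; it assumes packaging is available at build time and, as noted, it cannot yet tell the dev branch apart from the a11 alpha):

    import Cython
    from packaging.version import parse

    compiler_directives = {
        "cdivision": True,
    }

    # Enable the directive only for Cython versions new enough to know it.
    # This fails today because the dev branch still reports "3.0.0a11",
    # the same version string as the PyPI alpha that lacks the directive.
    if parse(Cython.__version__) >= parse("3.0.0a11"):
        compiler_directives["legacy_implicit_noexcept"] = True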

Review comment (Member):

FYI, I have opened cython/cython#5204 to ask for support of the development release segment in public version identifiers, as specified by PEP 440.
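
A quick illustration of why the dev segment would help (my own snippet, not from that issue):

    from packaging.version import parse

    # A dev release of the next alpha sorts after the latest published alpha
    # and before that alpha itself, so it can be told apart from both.
    assert parse("3.0.0a11") < parse("3.0.0a12.dev0") < parse("3.0.0a12")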

Review comment (Member):

Now that cython/cython#5204 has been addressed, we should be able to use this:

Suggested change:
- if not Cython.__version__.startswith("0."):
+ if not Cython.__version__.startswith("3.0.0a12.dev0"):

@lesteve (Member, Author) commented:

Your PR was likely merged too quickly and broke installing Cython from the master branch; see cython/cython#5204 (comment)

Review comment (Member):

cython/cython#5208 should fix that.

@jjerphan (Member) left a comment:

One last comment. This still looks mergeable to me if the CI jobs pass.

# TODO: once Cython 3 is released and we require Cython>=3 we should get
# rid of the `legacy_implicit_noexcept` directive.
# This should mostly consist in:
#

@thomasjpfan merged commit f5ae73d into scikit-learn:main Jan 11, 2023
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Jan 12, 2023
@lesteve deleted the use-dev-cython branch January 16, 2023 09:09
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 23, 2023