RFC Guideline for usage of Cython types #25572


Closed
lorentzenchr opened this issue Feb 9, 2023 · 16 comments · Fixed by #28865

@lorentzenchr
Member

Goal

Have a documented consensus on which types to use in Cython code.

Types

We should distinguish between floating point numbers and integers. We may also split the use cases of integers: As data value and as index for pointers and memoryviews.

Linked issues

#24153, #25555 (comment), #23865 (comment), #23865 (comment)

Also note the Cython bug with const types: cython/cython#5230

@github-actions github-actions bot added the Needs Triage Issue requires triage label Feb 9, 2023
@lorentzenchr lorentzenchr changed the title Guidline for usage of Cython types RFC Guidline for usage of Cython types Feb 9, 2023
@jjerphan jjerphan added RFC cython and removed Needs Triage Issue requires triage labels Feb 9, 2023
@ogrisel
Member

ogrisel commented Feb 10, 2023

I am +1 on stopping the use of DTYPE and ITYPE aliases whose meaning is not well defined. When a given piece of code works for a given precision, let's directly use the explicit Cython names cnp.float64_t or cnp.float32_t.

When a fused type is used to support several precision levels, then I think our usual floating convention is fine, but we could change it when we want to make it explicit that these are numerical feature values (or results of arithmetic computations derived from them and internal model parameters that typically use the same type).

For tempita variable names, maybe we could also reuse the floating name (if it does not conflict with the fused type name in the same Cython source file). Alternatively we could use float_precision in {32, 64} as the template variable and then use cnp.float{float_precision}_t for the type declarations.
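As a rough plain-Python sketch of that second option (the variable name float_precision is the one suggested above; a real tempita template would do this expansion at build time):

```python
# Sketch of the proposed tempita convention: a float_precision template
# variable in {32, 64} expands to explicit Cython type names.
declarations = []
for float_precision in (32, 64):
    declarations.append(f"cnp.float{float_precision}_t")
print(declarations)  # ['cnp.float32_t', 'cnp.float64_t']
```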

For integers, when they are used for data in our Cython code, they are mostly used as contiguous category or class indices into a precomputed list of possible values, typically the classes_ fit attribute computed by a LabelEncoder, or similarly the outcome of calling OrdinalEncoder on categorical features. But they might also be used to represent count data (e.g. bin counts in HGBDT histograms, in KBinsDiscretizer, and in the text / dict / hashing vectorizers).
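A minimal NumPy sketch of that "contiguous class indices" use case (this mimics the spirit of LabelEncoder's output, not its actual implementation):

```python
import numpy as np

# Categories become contiguous integer indices into a sorted list of
# classes, similar in spirit to LabelEncoder / OrdinalEncoder output.
classes = np.array(["a", "b", "c"])
y = np.array(["a", "c", "b", "a"])
codes = np.searchsorted(classes, y)
print(codes.tolist())  # [0, 2, 1, 0]
```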

@ogrisel
Member

ogrisel commented Feb 10, 2023

I am against using type names with implicit precision levels (float, double, int, long...): it's harder to reason about, and it is less explicit whether the code behavior (numerical precision of the computation, maximum count values before integer overflow) becomes platform dependent as a result.
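The overflow concern can be demonstrated from Python with NumPy's fixed-width types (a sketch for illustration, not scikit-learn code):

```python
import numpy as np

# Fixed-width types make overflow behavior predictable across platforms:
# int32 wraps at 2**31 regardless of what C "int" or "long" mean locally.
x = np.array([2**31 - 1], dtype=np.int32)
y = x + np.int32(1)  # wraps around
print(int(y[0]))     # -2147483648
# By contrast, C "long" is 32-bit on 64-bit Windows but 64-bit on 64-bit
# Linux/macOS, so code written against "long" overflows at different values
# depending on the platform.
```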

@ogrisel ogrisel changed the title RFC Guidline for usage of Cython types RFC Guideline for usage of Cython types Feb 10, 2023
@thomasjpfan
Member

I also want to stop using all the DTYPE, ITYPE, etc. aliases and use names that represent the precision.

My preference is to avoid cimporting numpy when we can and to use the C types directly. Functionally, that means copying the typedefs we need from numpy defined here, but with our own names:

# sklearn/utils/_typedefs.pxd

ctypedef Py_ssize_t         intp_t
ctypedef unsigned char      bool_t
ctypedef signed int         int32_t
ctypedef signed long long   int64_t
ctypedef float              float32_t
ctypedef double             float64_t
# etc

and use those typedefs everywhere. This means our Cython code would look like:

import numpy as np
from sklearn.utils._typedefs cimport float32_t

cdef float32_t[:] X = np.zeros(10, dtype=np.float32)

I like having the precision levels explicit in the type, but I also want to avoid including NumPy headers just for the types. The cost is copying some of the typedefs from NumPy.
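A quick Python-side sanity check that the C types behind these proposed typedefs have the widths their names promise (using ctypes; this is an illustration, not part of the proposal):

```python
import ctypes
import numpy as np

# Each proposed typedef pairs a C type with an explicit-width name; the
# widths can be checked against numpy's dtypes from pure Python.
assert ctypes.sizeof(ctypes.c_float) == np.dtype(np.float32).itemsize == 4
assert ctypes.sizeof(ctypes.c_double) == np.dtype(np.float64).itemsize == 8
assert ctypes.sizeof(ctypes.c_int) == np.dtype(np.int32).itemsize == 4
assert ctypes.sizeof(ctypes.c_longlong) == np.dtype(np.int64).itemsize == 8
assert ctypes.sizeof(ctypes.c_ssize_t) == np.dtype(np.intp).itemsize
print("all widths match")
```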

@jjerphan
Member

jjerphan commented Feb 13, 2023

I agree with what @ogrisel and @thomasjpfan propose.

In sklearn.tree._tree, the type definitions are misleading, as some names clash with other type definitions elsewhere. For instance, DTYPE_t in this module is float:

ctypedef cnp.npy_float32 DTYPE_t # Type of X
ctypedef cnp.npy_float64 DOUBLE_t # Type of y, sample_weight
ctypedef cnp.npy_intp SIZE_t # Type for indices and counters
ctypedef cnp.npy_int32 INT32_t # Signed 32 bit integer
ctypedef cnp.npy_uint32 UINT32_t # Unsigned 32 bit integer

Yet, it is double in all other places (see the global search in the codebase).
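The hazard can be illustrated from Python (the alias names below are stand-ins; the real aliases live in the Cython modules):

```python
import numpy as np

# The same alias name bound to different precisions in different modules:
TREE_DTYPE = np.float32    # what sklearn.tree._tree's DTYPE_t means
GLOBAL_DTYPE = np.float64  # what DTYPE means elsewhere in the codebase

x = np.array([1.0000000001], dtype=GLOBAL_DTYPE)
# Reusing x with code compiled for the other alias silently drops precision:
print(x.astype(TREE_DTYPE)[0] == x[0])  # False
```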

Moreover, should we create our own fused types?

@lorentzenchr
Member Author

I think we have an agreement on @thomasjpfan's proposal in #25572 (comment). Now, we need a volunteer...

One remark about predefined fused types like cython.floating: this works fine as long as everything has the same type. But consider

def fun(floating[:] a, floating[:] b):
    cdef:
        int i
        int n = a.shape[0]
    for i in range(n):
        a[i] += b[i]

This enforces a and b to always have the same (d)type. And there are situations where we do not want that.
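For contrast, a NumPy-level sketch of the mixed-precision situation that a single cython.floating cannot express:

```python
import numpy as np

# With a single fused type, a and b must share one dtype; mixed inputs
# like these would require two independent fused types in Cython.
a = np.zeros(5, dtype=np.float64)
b = np.ones(5, dtype=np.float32)
a += b  # NumPy handles the mix by casting; one fused type cannot
print(a.dtype)  # float64
```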

@thomasjpfan
Member

thomasjpfan commented Feb 24, 2023

For fused types, I prefer using Cython's floating when we can. But as mentioned in #25572 (comment) there are cases where we have to create our own.

For commonly fused types, I can see doing something like scikit-image:

https://github.com/scikit-image/scikit-image/blob/main/skimage/_shared/fused_numerics.pxd

@Micky774
Contributor

I agree with @thomasjpfan's proposal, and regarding fused types I'd like to echo my sentiment from a prior discussion: #24153 (comment)

TL;DR
Explicit is better than implicit. Explicit library-level aliases as suggested by @thomasjpfan are very helpful. I am not a fan of working with custom library-defined fused types. I think we should avoid making global fused types and instead defer to creating per-file/module fused types as needed.

With that being said, I think Cython standard fused types (e.g. floating) are fine since they're usually pretty helpful, and very obvious.

@jeremiedbb
Member

I'd also be glad to get rid of all the DTYPE_t and friends.
I opened #25739 to start the work.

@jeremiedbb
Member

jeremiedbb commented Mar 2, 2023

Note that we won't be able to complete this until we require cython 3 due to a bug I just found where cython doesn't properly infer types:

%%cython
cimport numpy as cnp

ctypedef Py_ssize_t intp_t  # defined the same way as numpy's npy_intp / intp_t

cdef void f(cnp.npy_intp* x):
    x[0] += 1  # body is irrelevant; the error happens at the call site

cdef intp_t a = 0

f(&a)  # cython error: Cannot assign type 'intp_t *' to 'npy_intp *'

The function expects a pointer to cnp.npy_intp, but we pass a pointer to intp_t, and compilation fails although they are defined the same way.
It does work if we don't use pointers, and it does work if the function expects a pointer to cnp.intp_t.

I checked and it's fixed in the master branch of cython, but not in the 0.29.x branch.
We can however keep using the numpy types in the meantime where the problem occurs (in _binary_tree for instance).

@jeremiedbb
Member

I opened #25942 to finalize the removal of old defined types in utils._typedefs.

The next step will be to use the newly defined types from utils._typedefs in the Cython modules that used to not import types from there but defined DTYPE_t, ITYPE_t, ... locally.

@lorentzenchr
Member Author

What's left to do? A short entry in the dev doc? Anything else?

@jeremiedbb
Member

There are still several files that use custom typedefs defined locally that should now use the types defined in _typedefs.
We should also have an entry in the dev docs, indeed.

@lorentzenchr
Member Author

lorentzenchr commented May 1, 2023

Some are useful. For HGBT, for instance, you can easily play around with the effect of different types only because it uses its own typedefs.

@adam2392
Member

adam2392 commented Oct 12, 2023

I noticed this when building a downstream application that leverages Cython code from scikit-learn.

In #27352, I changed cnp.npy_uint32 (i.e. UINT32_t) to the uint32_t typedef (backed by unsigned int) from _typedefs.pxd. However, I think this may not be valid specifically on Windows machines.

https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.uintc
https://stackoverflow.com/questions/76155091/np-uint32-np-uintc-on-windows

It seems that numpy's uint32 is an alias for uintc (aka unsigned int) on Mac/Linux, which is fine for the new code, but an alias for uint (aka unsigned long) on Windows. Lmk if my analysis is incorrect? I can propose a few solutions to fix this.

cc: @lorentzenchr and @jjerphan
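A small Python check of the widths involved in the analysis above (both candidate C types are 4 bytes on Windows, which is why the typedef can still end up correct on all platforms; this is an illustration, not a fix):

```python
import ctypes
import numpy as np

# unsigned int is 4 bytes on all mainstream platforms; unsigned long is
# 4 bytes on Windows but 8 bytes on 64-bit Linux/macOS, which is why
# np.uint32 can map to different C type names per platform while staying
# 4 bytes wide everywhere.
print(ctypes.sizeof(ctypes.c_uint))   # 4 everywhere
print(ctypes.sizeof(ctypes.c_ulong))  # 4 on Windows, 8 on 64-bit Unix
assert np.dtype(np.uint32).itemsize == 4
```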

@jjerphan
Member

jjerphan commented Oct 12, 2023

@ogrisel
Member

ogrisel commented Mar 19, 2024

The above links no longer work with the latest version of Cython.

But I found:

https://github.com/cython/cython/blob/3.0.9/Cython/Includes/libc/stdint.pxd#L15

and on numpy's side:

https://github.com/numpy/numpy/blob/v1.26.4/numpy/__init__.pxd#L280

which are both consistent with aliasing unsigned int as uint32_t as we do in _typedefs.pxd.
