Dtype Brainstorming
A brainstorming session at the SciPy 2018 sprints, July 14, 2018
- units
- unyt (ndarray subclass)
- Pint (wrapper via array_prepare and array_wrap, see comment)
- astropy.units (ndarray subclass)
- several others, see https://www.youtube.com/watch?v=N-edLdxiM40
- enumerations / categorical
- text
- encoded fixed width text (utf8, latin1, ...)
- variable width
- datetime
- 360 day calendar
- Ora: https://github.com/alexhsamuel/ora
- shapely/GEOS geometries
- does this include jagged arrays of polygons?
- Numerics
- Novel floating point formats
- Decimal (arbitrary precision)
- Big int
- finite fields
- Rationals https://github.com/numpy/numpy-dtypes
- float16?
- Missing values
- sentinels
- bitmask
- record-like array
- optional
- quaternion
- https://github.com/martinling/numpy_quaternion (outdated)
- https://github.com/moble/quaternion (maintained)
- a general pointer dtype that does memory management
- xdress (https://github.com/xdress/xdress)
- ndtypes (https://ndtypes.readthedocs.io/)
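The sentinel and bitmask approaches to missing values listed above can be compared using tools NumPy already ships. The following is only an illustrative sketch: it uses NaN as the sentinel and `np.ma.masked_array` as the mask-based variant, and the sample data (including the `-999` placeholder) is made up for the example.

```python
import numpy as np

# Sentinel approach: a reserved value (NaN) marks the missing entry.
with_sentinel = np.array([1.0, 2.0, np.nan, 4.0])
sentinel_mean = np.nanmean(with_sentinel)  # NaN-aware reduction skips the sentinel

# Mask approach: a separate boolean array marks the missing entry,
# so integer data needs no reserved value.
raw = np.array([1, 2, -999, 4])  # -999 is an arbitrary placeholder
mask = np.array([False, False, True, False])
with_mask = np.ma.masked_array(raw, mask=mask)
masked_mean = with_mask.mean()  # masked entries are ignored
```

Both reductions average the three valid entries. A dtype-level solution would presumably let plain ufuncs and reductions do this without a separate function family or array subclass.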
- Replacing subclassing, which is quite fragile.
- Hold data (e.g., categories, datetime64, units)
- Needs to be able to override dtype specific functionality:
  - Arithmetic
  - Ufuncs
  - Sorting
  - Coercion rules
- Handle life cycle (e.g. GEOS/shapely)
- Push API up to the ndarray
  - For example, can a unit dtype push a method `.to` up to the ndarray class to convert to a different unit?
  - Or can a datetime dtype push a `.year` or `.dayofweek` up to the ndarray class?
  - This can be done -- but should it be done?
- Two use cases: writing high-level dtypes in Python and low-level dtypes in C:
  - We need new capabilities for C dtypes:
    - At the C level, the current interface is quite cumbersome. It would be nice to have something easier for use with C/C++/Cython.
    - At a low level, ufunc loops need access to dtype metadata (e.g., this is why we don't have ufuncs for strings in NumPy).
    - A new primitive data type for pointers would be broadly useful (e.g., for managing strings or geometric objects).
  - We need to be able to write custom dtypes in Python:
    - This would be particularly useful for high-level dtypes like units or categorical, which can be written in terms of a primitive data type plus some metadata.
    - Ideally, custom dtypes would reuse existing protocols for duck arrays, e.g., `__array_ufunc__` and `__array_function__`.
- Mechanism for extended dtypes to go from strings to dtypes
  - Parse `dtype='my_dtype[options]'` into the dtype constructor somehow.
  - DSL? Are we parsing?
  - Handle conflicting names by convention (maybe raise a warning)
  - Possibly need a registration mechanism (so `np.array([1, 2, 3], dtype='my_dtype')` would work)
- Scalar types should not need to be NumPy scalars?
- Should it allow for mix-in like paradigms (say, have `mydtype` based off of `np.float64`)?
- Should we have something like `isinstance(dtype, (np.float64, np.float32))`?
- Should not require every `.dtype` attribute to be a NumPy dtype (e.g., `pandas_series.dtype == np.dtype(np.float64)` currently breaks)
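The string-to-dtype mechanism discussed above (parsing `'my_dtype[options]'`, warning on conflicting names, and a registration step) could look roughly like the sketch below. Nothing here is a NumPy API: `register_dtype`, `parse_dtype`, the registry dict, and the comma-separated option syntax are all hypothetical names and choices made for illustration.

```python
import re
import warnings

# Hypothetical registry: maps a dtype name to a constructor callable.
_DTYPE_REGISTRY = {}

# Matches "name" or "name[options]".
_SPEC_PATTERN = re.compile(r"^(?P<name>\w+)(?:\[(?P<options>.*)\])?$")


def register_dtype(name, constructor):
    """Register a constructor under a name, warning if the name is taken."""
    if name in _DTYPE_REGISTRY:
        warnings.warn(f"dtype name {name!r} is already registered; overriding")
    _DTYPE_REGISTRY[name] = constructor


def parse_dtype(spec):
    """Turn 'name[opt1, opt2]' into constructor('opt1', 'opt2')."""
    match = _SPEC_PATTERN.match(spec)
    if match is None or match["name"] not in _DTYPE_REGISTRY:
        raise TypeError(f"data type {spec!r} not understood")
    options = match["options"]
    args = [part.strip() for part in options.split(",")] if options else []
    return _DTYPE_REGISTRY[match["name"]](*args)


# Stand-in constructor: a real one would return a dtype instance.
register_dtype("units", lambda unit: ("units", unit))
feet = parse_dtype("units[ft]")
```

Passing options as plain strings keeps the parser trivial and leaves their interpretation to each constructor, which is one way to avoid designing a full DSL up front.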
Suggestion: strawman proposal for what writing a dtype should look like.
From Nathan, based on unyt_array
import numpy as np

class float64_with_unit(np.dtype):
    array_dtype = np.float64
    unit = None

    def __init__(self, unit):
        self.unit = unit

    def __array_ufunc_proxy__(self, ufunc, method, *input_dtypes, **kwargs):
        # do ufunc dispatch
        raise NotImplementedError

    def __array_function_proxy__(self, function, *input_dtypes, **kwargs):
        # do function dispatch
        raise NotImplementedError

    def __setstate__(self, state):
        # do pickle serialization
        raise NotImplementedError
- Do we want to give dtypes the ability to change all functions, or just ufuncs?
- `__array_func__` should call dtype.coerce
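The "all functions, or just ufuncs" question maps onto two protocols that already exist at the array level. As a rough illustration (not a dtype-level design), the hypothetical `Recorder` duck array below logs which protocol NumPy dispatches through: `np.add` is a ufunc and goes through `__array_ufunc__`, while `np.sum` is a regular function and goes through `__array_function__`.

```python
import numpy as np


class Recorder:
    """Hypothetical duck array that logs which protocol handled each call."""

    def __init__(self, data):
        self.data = np.asarray(data)
        self.log = []

    def _unwrap(self, values):
        # Replace Recorder instances with their underlying ndarrays.
        return tuple(v.data if isinstance(v, Recorder) else v for v in values)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        self.log.append(("ufunc", ufunc.__name__))
        return getattr(ufunc, method)(*self._unwrap(inputs), **kwargs)

    def __array_function__(self, func, types, args, kwargs):
        self.log.append(("function", func.__name__))
        return func(*self._unwrap(args), **kwargs)


r = Recorder([1, 2, 3])
shifted = np.add(r, 1)  # dispatched through __array_ufunc__
total = np.sum(r)       # dispatched through __array_function__
```

A dtype that could only hook ufuncs would see the first call but not the second, which is the trade-off the question above is about.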
From Ryan, out of thin air:
import numpy as np

class UnitDType(np.dtype):
    _ndarray_api = ['convert']

    def __init__(self, unit, baseType=np.float64):
        self._unit = unit
        self._base = baseType

    def convert(self, unit):
        # astype()?
        pass

    def __add__(self, other):
        self._check(other)
        self._base.add(self, other)

    def __mul__(self, other):
        self._base.mul(self, other)
        self._update_dimensionality(other)

    def _check(self, other):
        if self._dimensions != other._dimensions:
            raise UnitError

    def _update_dimensionality(self, other):
        self._dimensions[...]

a = np.ones((5,), dtype=UnitDType('meters'))
b = np.ones((5,), dtype=UnitDType('seconds'))
a + b  # UnitError
a * b == np.ones((5,), dtype=UnitDType('meters*seconds'))
- `.astype('units[ft]')` could work, but it would be nice to specify just `.convert('ft')`
- `__add__` etc. should be handled by `__array_ufunc__`
- Units here might be a specific case of something more general
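The point that `__add__` and friends should be handled by `__array_ufunc__` can be tried today at the array level rather than the dtype level. The sketch below is a hypothetical `Quantity` wrapper, not the proposed dtype: it relies on NumPy's `NDArrayOperatorsMixin` to route operators to ufuncs, and the string-based unit arithmetic is a deliberately naive assumption.

```python
import numpy as np
from numpy.lib.mixins import NDArrayOperatorsMixin


class UnitError(TypeError):
    """Raised when units are incompatible (hypothetical)."""


class Quantity(NDArrayOperatorsMixin):
    """Hypothetical wrapper: float64 data plus a unit string."""

    def __init__(self, data, unit):
        self.data = np.asarray(data, dtype=np.float64)
        self.unit = unit

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Only plain calls between two Quantity operands are handled here.
        if method != "__call__" or kwargs or len(inputs) != 2:
            return NotImplemented
        if not all(isinstance(x, Quantity) for x in inputs):
            return NotImplemented
        left, right = inputs
        if ufunc in (np.add, np.subtract):
            if left.unit != right.unit:
                raise UnitError(f"cannot combine {left.unit!r} and {right.unit!r}")
            unit = left.unit
        elif ufunc is np.multiply:
            unit = f"{left.unit}*{right.unit}"
        elif ufunc is np.true_divide:
            unit = f"{left.unit}/{right.unit}"
        else:
            return NotImplemented
        return Quantity(ufunc(left.data, right.data), unit)


a = Quantity(np.ones(5), "meters")
b = Quantity(np.ones(5), "seconds")
speed = a / b    # the mixin turns "/" into np.true_divide
product = a * b  # the mixin turns "*" into np.multiply
```

Here `a + b` raises `UnitError`, and no operator method is written by hand: the unit logic lives in one place, which is the property a dtype-level hook would need to keep.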
From Stephan:
class VariableLengthString(np.LogicalDtype):
    physical_dtype = np.object
    name = 'String'

    def __array_ufunc__(self, ufunc, method, args, **kwargs):
        if any(not isinstance(a.dtype, VariableLengthString)
               for a in args):
            return NotImplemented
        physical_args = tuple(a.astype(object) for a in args)
        result = getattr(ufunc, method)(*physical_args, **kwargs)
        return result.astype(VariableLengthString)

    def __array_function__(self, func, types, args, kwargs):
        # can't do it! types only exposes type information, not dtype
        raise NotImplementedError

    def __dtype_promote__(self, dtypes):
        if all(d in [VariableLengthString, np.unicode_, np.string_]
               for d in dtypes):
            return VariableLengthString()
        return NotImplemented

    def __array_coerce__(self, array, casting):
        if array.dtype.kind == 'U':
            result = array.astype(object)
            result.dtype = VariableLengthString()
        elif array.dtype.kind == 'S':
            # decode as ascii? raise?
            raise NotImplementedError
        elif array.dtype.kind == 'O':
            # check for all string object
            raise NotImplementedError
        else:
            raise TypeError
        return result
I used LogicalDtype above to say that this is based off of another dtype so that numpy knows how to handle it. I just want to implement a little bit on top of that.
The `__array_function__` protocol doesn't work that well because the dtype wasn't explicitly provided for all of the arrays.
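The limitation described here can be observed with the existing array-level protocol. In the sketch below, the hypothetical `Probe` subclass captures the `types` argument that NumPy passes to `__array_function__`. Two arrays with different dtypes contribute a single entry, the Python class, so any dtype information has to be dug out of `args` by hand.

```python
import numpy as np

captured = {}


class Probe(np.ndarray):
    """Hypothetical ndarray subclass that records the `types` argument."""

    def __array_function__(self, func, types, args, kwargs):
        captured["types"] = set(types)
        # Defer to NumPy's default implementation.
        return super().__array_function__(func, types, args, kwargs)


strings = np.array(["a", "bc"], dtype=object).view(Probe)
numbers = np.arange(2.0).view(Probe)

# The operands have different dtypes, but `types` only reports their class.
combined = np.concatenate([strings, numbers])
```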
From Joris
class CategoricalDtype():
    def __init__(self, categories=None, ordered=False):
        self.categories = categories
        self.ordered = ordered

    @classmethod
    def _construct_dtype_from_string(cls, string):
        ...

    def _array_constructor(self, values):
        # convert values to codes
        codes = ...
        # update self to reflect values
        return np.array(codes, dtype=self)

    def _array_repr(self):
        # override the repr of the array with this dtype
        ...

    def _validate_scalar(self):
        # validate if scalar can be stored in the array
        ...

np.array(['Red', 'Green', 'Blue', 'Red'], dtype=CategoricalDtype())
np.array(['Red', 'Green', 'Blue', 'Red'], dtype=CategoricalDtype(categories=['Red', 'Green', 'Blue', 'Yellow']))
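The "convert values to codes" step in `_array_constructor` is left open above. One way it might work with plain NumPy, shown only as a sketch, is to store small integer codes plus a categories array. Inferring categories with `np.unique` (which sorts them) and using `int8` codes are assumptions made for this example, not part of the proposal.

```python
import numpy as np

values = np.array(["Red", "Green", "Blue", "Red"])

# Case 1: categories inferred from the data (np.unique returns them sorted).
categories, codes = np.unique(values, return_inverse=True)
decoded = categories[codes]  # round-trips back to the original values

# Case 2: categories given explicitly, including one that never occurs.
explicit = ["Red", "Green", "Blue", "Yellow"]
lookup = {category: index for index, category in enumerate(explicit)}
explicit_codes = np.array([lookup[v] for v in values], dtype=np.int8)
```

In both cases the physical storage is an integer array, and the categories are exactly the kind of metadata the dtype instance would need to hold.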
- Don't need to implement `__add__`, etc., due to `__array_ufunc__`
- Should we limit the functions that can go in `__array_function__` for dtypes? Do we need `__array_function__`?
- Mixins for units -- don't want to write a separate dtype for each variation.
- Should dtypes specify width?
- protocols: `__array_ufunc__`, `__array_function__`
- inheritance - subclassing dtype. Probably not a good idea
- duckdtype - what is the minimum viable set of methods and attributes a dtype needs?
- creating a dtype tutorial https://github.com/stefanv/teaching/tree/master/2013_scipy_austin_dive_into_numpy/slides
- It would be great to get past subclassing
- It would be nice to write something in Python
- It would be nice to be able to interoperate between different array duck types using the same dtype