TSK: Follow-up things for stringdtype #25693
Added a few things I had on my personal list. Thanks for opening this and for all the review this week!
@asmeurer pointed out to me that we should add a proper scalar type. Right now stringdtype's scalar is `str`. We can fix this by defining a proper scalar type that wraps a single packed string entry and exposes an API that is duck-compatible with `str`. This may also lead to performance improvements, since scalar indexing won't need to copy to Python's internal string representation. I don't think this needs to happen for 2.0, since this is something we can improve later and I doubt it will have major user-facing impacts, since object string arrays also effectively have a `str` scalar.
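To make the current behavior concrete (a minimal sketch, assuming NumPy 2.0-era behavior before a dedicated scalar type was added in #28165):

```python
import numpy as np
from numpy.dtypes import StringDType

arr = np.array(["hello", "world"], dtype=StringDType())
s = arr[0]
print(type(s))                    # <class 'str'> - indexing copies into a Python str
print(isinstance(s, np.generic))  # False - not a NumPy scalar type
```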
I personally don't see this as a big issue. Although it might be friendly to not convert to a scalar as often as we currently do if it is a string scalar (bad timeline too, though).
I think the main problem may be that people generally expect that they can treat the scalar like a `str`. If we're going to be "not quite right" for 2.0, should we perhaps err on the side of not creating a scalar at all?
Are you saying that we should create a 0D array instead?
If so, I think it's much more important for the scalar to be duck-compatible with `str`.
Yes, I was; I really dislike array scalars and wish everything was 0-D arrays instead. But you make a good point that we want to make sure that object arrays can easily be replaced... EDIT: because I dislike how the type that comes out of …
I also completely agree that scalars are terrible for a dozen different reasons, and wish numpy just had 0-d arrays. But I guess this was too ambitious for NumPy 2.0. As it stands, scalars exist. Nathan makes a good point that object scalars are already kind of specially broken because they aren't even numpy types, and this new dtype replaces object string arrays.
Making stringdtype scalars subclass from both `str` and `np.generic` …
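For reference, the legacy fixed-width string scalar `np.str_` already subclasses both `str` and `np.generic`, so it shows what such a hybrid looks like in practice:

```python
import numpy as np

s = np.str_("hello")
print(isinstance(s, str), isinstance(s, np.generic))  # True True
print(s[0])                # 'h' - string indexing, not array indexing
print(np.asarray(s).ndim)  # 0  - but it still converts to a 0-d array
```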
Even subclassing from generic gives you crazy things, like indexing must be string indexing and not array indexing! In other words: … IMO, part of the problem with scalars is that they pretend to be array-likes, even though it is currently vital for much code that they do (because we create them too often). I envisioned we could create a new … In short, I don't care about the hierarchy at all, but I agree there are two things:
I suggest moving discussion about point 2 to gh-24897. I could imagine making my try-0d-preservation opt-in more aggressively for all but current dtypes (minus object maybe), even if we don't dare change it otherwise without 2.0.
I'm excited about `StringDType`! But this wouldn't now work with … I also did some basic performance tests in …

```python
import random
import timeit

import numpy
from numpy.dtypes import StringDType

print(numpy.__version__)  # '2.0.0rc2'

options = ["a", "bb", "ccc", "dddd"]
lst = random.choices(options, k=1000)
arr_s = numpy.fromiter(lst, dtype=StringDType(), count=len(lst))
arr_o = numpy.fromiter(lst, dtype="O", count=len(lst))
arr_u = numpy.fromiter(lst, dtype="U5", count=len(lst))
print(timeit.timeit(lambda: numpy.unique(arr_s), number=10000))  # 4.270 <- why so much slower?
print(timeit.timeit(lambda: numpy.unique(arr_o), number=10000))  # 2.357
print(timeit.timeit(lambda: numpy.unique(arr_u), number=10000))  # 0.502
print(timeit.timeit(lambda: sorted(set(lst)), number=10000))     # 0.078
```
Supporting structured dtypes is possible in principle, but we punted on it for the initial version because it would add a bunch of complexity. The structured dtype code inside numpy (let alone in external libraries) makes strong assumptions about dtype instances being interchangeable, which aren't valid for stringdtype.

One path forward would be adding a mode to stringdtype that always heap-allocates. This isn't great for performance or cache locality, but it does mean that the packed strings always store heap pointers and the dtype instance does not store any sidecar data. I'm not sure how hard that would be - the structured dtype code is quite complicated and I'm not sure how easy it is to determine whether a stringdtype value lives inside a structured dtype vs a stand-alone stringdtype instance.

Another thing I have in mind is to eventually add a fixed-width string mode to stringdtype, which we could also straightforwardly support in structured dtypes, although if you need variable-width strings that doesn't help much.

Thanks for the report about unique being slow, I opened #26510 to track that. Please also open follow-up issues with reports about performance if you have others; it's very appreciated, especially with a script to reproduce what you're seeing.
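A minimal sketch of the limitation described above; the exact exception type and message depend on the NumPy version:

```python
import numpy as np
from numpy.dtypes import StringDType

# Variable-width string fields inside structured dtypes are currently rejected.
try:
    dt = np.dtype([("name", StringDType()), ("value", np.float64)])
    arr = np.zeros(3, dtype=dt)
except Exception as exc:
    print("not supported yet:", exc)
```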
Thanks, and thanks for your great work!
Variable-width strings are the thing we are interested in :) Essentially improving performance with some columnar type of data processing that may have free-form strings. There is the …
FWIW, I think we should allow putting stringdtype into structs. I think the main issue is that stringdtype requires a new dtype instance to be created in some places, and that property must be "inherited" by the structured dtype. But I am not sure if anyone will prioritize working on this soon; if anyone is interested, I think we will definitely help out to get started.
I just opened #26747, which is likely a duplicate/part of this issue here. TLDR: add …
Hi, may I ask what the status of the scalar-type issue is, since I do not see it in the task list at the top of this issue. I still seem unable to correctly type a …
Thanks for the reminder. I think this should be doable before NumPy 2.2 gets branched if @jorenham can give me some code review. I'll try to poke at this tomorrow. Also, contributions are always welcome; the main blocker here is my lack of facility with Python static typing.
I'd be happy to :) |
Unfortunately I'm a little out of my depth. I defined

```python
_ArrayLikeString_co: TypeAlias = _SupportsArray[np.dtypes.StringDType]
```

And then if I try to use it in …

```python
Str_co = _ArrayLikeString_co
...

@overload
def equal(x1: Str_co, x2: Str_co) -> NDArray[np.bool]: ...
```

This for some reason causes the typing tests to fail:
I don't actually know why that's considered a type error; NumPy is perfectly happy to accept arrays of different dtypes like that at runtime. But regardless, I also don't understand why adding one overload would break overloads for entirely different types.
I pushed my branch here.
My guess is that the error is about … So before being able to properly use …

In the meantime, I'll have a look at …
In numpy/numtype#333 I noticed that the …

```python
>>> import numpy as np
>>> np.isnan(np.array("nan", "T"))
np.False_
```

But everything I tried seems to be … So was this backdoor intentionally left open for some use-case I'm missing, or should the door be closed, after all?
I'm not near a computer so can't copy-paste an example, but if you create a StringDType object with `na_object=np.nan`, `np.isnan` will work on arrays with that dtype. Something like this:
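Presumably the lost example looked roughly like this (a sketch assuming `na_object=np.nan`, which is what makes `np.isnan` meaningful for this dtype):

```python
import numpy as np
from numpy.dtypes import StringDType

dt = StringDType(na_object=np.nan)  # nan acts as the missing-value sentinel
arr = np.array(["hello", np.nan, "world"], dtype=dt)
print(np.isnan(arr))                # [False  True False]
```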
Unless I'm completely misremembering, that should print `True` for the missing entry. The reason for this is because object string arrays behave the same way if any of the values is `np.nan`.
Also you're not creating a StringDType array in that example, you're creating a …
I explicitly passed the `"T"` dtype there.
Ah, sorry, I'm not used to seeing dtype passed as a positional argument, never mind.
Over at h5py/h5py#2557 (comment) we found a case where it's annoying that the packed strings are not null-terminated. It would be easiest to provide both APIs if we moved to storing null-terminated strings internally. You could then create a … Of course the downside is an extra byte per string in the heap allocation, at least in the most straightforward implementation. @mhvk I'm curious what you think about this. Of course anyone else too :)
@ngoldbaum - I'd hesitate to start guaranteeing zero-terminated strings. My hesitation is partially that I'm not sure I understand the use case all that well: for something like …

But maybe more importantly, I worry that by guaranteeing a zero-terminated string, we close off potential future speed-ups that may be quite useful. E.g., with the current scheme, it is possible to have a read-only … I guess a compromise would be to have an option in …

Aside, just thinking through: if one of the main cases of interest is one where one is handing over the array (i.e., the numpy array is no longer needed, and all associated memory can be reused), then given that the arena has at least one byte in front of each string for the length, for all but the last string one can in fact make them zero-terminated by overwriting the length of the next string. Even the last string would be OK unless the arena happens to be exactly full. Similarly, even for 15-character-long short strings one can make them zero-terminated by overwriting the flags byte of the next string (except for the last string, if it is exactly 15 characters long; need to check byte ordering, though!). So, if this type of transfer is a common theme, we could consider allocating 1 extra byte for anything on the heap so that almost all strings can be done with zero copy. Though of course, this would mostly be to reduce memory usage; writing all those zero bytes may not be all that much faster than just copying...

I guess overall my tendency would be to provide some helper functions that convert to/from …
Correct; …

If the option is opt-in, in the case of writing to …

This is definitely not the … Unless of course you have a reverse function that undoes the conversion ndarray of npystrings <-> null-terminated …

I think that a low-hanging fruit would be to offer a variant of …
@crusaderky - isn't there a problem for any type of string? Fixed-width "S" and "U" also do not store zero-terminated strings (if they are of the fixed width). But having a helper function seems a good idea, and something that we can improve on as need be.
'S' fixed-width strings are not zero-terminated in HDF5.
Ah, yes, makes sense for 'S'. Maybe by having a helper function it may become possible to support "U" as well, if indirectly, via … Anyway, for the helper function, given that files are typically read much more often than written, the reverse function, to get a …
I'm not sure it's a good idea to try to tamper with H5Dread. The key problem is that the data is read into temporary memory managed by HDF5 as a monolithic chunk read, then potentially decompressed in bulk, also into memory managed by HDF5, and finally converted into a numpy array.
Yes, actually dealing with ownership may well be tricky, though if that were possible, I think the current StringDType machinery might well be able to interpret the uncompressed bit of memory. In any case, if a copy on read is basically required, the case for changing to null-terminated strings for h5py's sake becomes much weaker.
I think I agree with that. Maybe we'll find other arguments for the trailing nulls but this may not be compelling enough. That said, I also think we can probably avoid copies, particularly for many short strings, if we provide the …
Am I correct that currently NumPy does not support representing a list of varlen strings where the strings are stored concatenated in the array's buffer (e.g. represented as UTF-8 bytes)? I read through https://numpy.org/devdocs/user/basics.strings.html, and it seems that varlen strings are only supported via np.object (btw, np.object is not mentioned in https://numpy.org/devdocs/user/basics.strings.html) or via StringDType, but both just store pointers to string objects that live on the Python object heap or in some string heap?

But wouldn't it make sense to also support some packed buffer format for readonly string arrays? Or does StringDType do this already? Then NumPy could offer zero-copy conversions from Arrow, and even mmap a super-large text file directly (e.g. finding the string lengths by scanning for newlines, doing no decoding if the user asserts it's already UTF-8 encoded, or letting the user provide the string lengths separately if they have this info already), and also provide a better memory layout for sequential vectorized processing... This also makes it simpler to serialize/deserialize such an array, share it between processes, etc.
Fair point - we could probably document that np.object is still a thing, but we really wanted to point people to StringDType rather than object string arrays. Our hope is that StringDType provides a superset of the functionality of object strings.
Very much agree. If StringDType had a mode where the array is immutable and it exposed exactly the Arrow format, that would be really cool. This can't be the default though, because then StringDType couldn't replace object string arrays.
I stumbled on np.object as the representation for strings when doing something like … I was looking for a modern NumPy way of representing a large dataset with three string columns which plays nicely with serialization (so three independent columns of packed strings, or somehow an array of varlen records, where each record packs three strings). I hoped there was a zero-copy way for a readonly column of strings obtained from Arrow or HDF5 disk formats (and maybe some other raw formats) to be representable by NumPy - or at least a way to not leave millions of Python string objects around.
Yes, unfortunately we can't switch the default inferred representation until NumPy 3.0, probably ☹. Yet another reason that page should definitely mention object strings; when I was updating those docs to talk about StringDType I wasn't thinking about it from that perspective. PyArrow might certainly be able to do something smarter on newer NumPy versions that support StringDType; I haven't followed PyArrow development recently or tried to push that.
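To make the copy-heavy path available today concrete (a sketch assuming PyArrow's `to_numpy(zero_copy_only=False)`, which materializes Python `str` objects before any conversion to `StringDType`):

```python
import numpy as np
import pyarrow as pa
from numpy.dtypes import StringDType

col = pa.array(["alpha", "beta", "gamma"])

# Copy 1: every value becomes a Python str inside an object array.
obj = col.to_numpy(zero_copy_only=False)
print(obj.dtype)  # object

# Copy 2: the object array is cast to a StringDType array.
arr = obj.astype(StringDType())
print(arr.dtype)  # StringDType()
```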
I also discussed these matters in PyTorch (and various useful internal representations):
Basically, there are currently problems with sharing large readonly string arrays across processes - and packed memory layouts are much more beneficial than sharing tons of objects (and it appears there are still problems even with NumPy arrays in some cases): So if NumPy proposes new, modern designs here, it's more likely that PyTorch would follow and implement some support for readonly string arrays and some GPU-accelerated string processing. Also, np.object string arrays appear in the Triton Inference Server Python bindings:
@vadimkantorov - I think the new …
Regarding the "default" internal representation, I think there could be several supported internal representations concurrently - for various use cases - and the default might be switched later. E.g. it can make sense to support both UTF-8 and UTF-32 (for fixed-size chars and indexing simplicity), and to support zero-byte termination or "termination" with a newline (or another custom char). At least for readonly ways of representing an existing dataset without copies, and indexing into it, these can be useful. Some more discussion is in the first PyTorch issue I linked. Another nice use case would be parallelized transcoding of the whole array of strings into another encoding (also, NumPy could just copy the buffer of the Python string object if it's already encoded in the correct encoding).

For PyTorch, there is now in core a concept of NestedTensor, which is a tuple of a data storage and an indexes tensor, so I thought maybe this structure could represent varlen strings (a downside is that currently indexes is always int64, which is sometimes wasteful for small strings); e.g. int8 could be interpreted by string-processing functions as ASCII, uint8 as UTF-8 and uint32/int32 as UTF-32...
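As a sketch of the kind of packed, read-only layout being discussed (Arrow's classic variable-size binary layout: one contiguous UTF-8 buffer plus an offsets array; not something NumPy provides today):

```python
import numpy as np

strings = ["alpha", "bb", "céleri"]
encoded = [s.encode("utf-8") for s in strings]

# One shared UTF-8 buffer plus integer offsets - the "data storage + indexes" idea.
values = np.frombuffer(b"".join(encoded), dtype=np.uint8)
offsets = np.cumsum([0] + [len(b) for b in encoded])

def get(i: int) -> str:
    # Slicing the shared buffer is zero-copy; only the final decode allocates.
    return values[offsets[i]:offsets[i + 1]].tobytes().decode("utf-8")

print(get(2))  # 'céleri'
```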
We intentionally made the packed string an opaque pointer so the heap allocation can be changed in the future. I'm also planning to add support for encodings besides UTF-8 and a fixed-width mode to replace the fixed-width legacy string dtypes, but am focusing on other projects for at least the short to medium term. If you're interested in working on this I'm happy to help.
Also, maybe studying some of Arrow's binary formats for string columns might be useful, as then zero-copy conversion from Arrow could be supported. https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout seems quite tricky and has some optimizations (and even padding) for short strings.
Also, currently these …

```python
import numpy as np

a = b = np.array(['abc', 'def'], dtype=object)
np.save('a.npy', a)
_a = np.load('a.npy')
# np.load('a.npy')
#   File "/home/inferencer/.local/lib/python3.10/site-packages/numpy/lib/_npyio_impl.py", line 488, in load
#     return format.read_array(fid, allow_pickle=allow_pickle,
#   File "/home/inferencer/.local/lib/python3.10/site-packages/numpy/lib/format.py", line 822, in read_array
#     raise ValueError("Object arrays cannot be loaded when "
# ValueError: Object arrays cannot be loaded when allow_pickle=False

# res = np.rec.fromarrays([a, b], names=['a', 'b'])
# np.save('res.npy', res)
# _ab = np.load('res.npy')
```

Basically I was looking for a way to load a bunch of string columns from parquet and save them to a plain …
I would suggest that maybe both saving/loading an array of str objects and saving/loading a StringDType array should not require pickling - similar to how PyTorch now disables pickle by default for saving/loading... Is there currently a way of saving/loading an array of varlen strings without requiring pickle?
Unfortunately that requires updating the npz format, since npz doesn't have any support for sidecar files or non-strided data. There's earlier discussion about this on the issue tracker. I'd love to fix that, it just takes time and effort to get it right, since any update to the npz format can't break existing readers of the existing npz format.
Hopefully packing metadata + packed string data could circumvent any npz buffer format restrictions? (e.g. …)
See #24110 (comment). FWIW, @seberg and I are working on a PEP to make StringDType work with the buffer protocol by allowing libraries to define their own formats. That might help with this sort of thing, or at least codify something that a new npz format could use. See https://discuss.python.org/t/buffer-protocol-and-arbitrary-data-types/26256/13. If you're interested in working on any of this I'd be happy to provide feedback or help. As I said upthread, I'm focusing on other things (ecosystem free-threaded support mostly) for the short to medium term.
Proposed new feature or change:
A number of points arose in discussion of #25347 that were deemed better done in follow-up, to not distract from the meat of that (very large) PR. This issue is to remind us of those.
- `StringDType` has a size argument that was needed in development but not for actual release. It can be removed. This should be done before the NumPy 2.0 release because it is public API. (Though might it be useful for a short version that only uses the arena? See below. Probably can put it back if needed...)
- The `add` ufunc needs a promoter so addition with a Python string works.
- Move `new_stringdtype_instance` into `tp_init`.
- Replace the `#define` in `casts.c` with templating (or `.c.src`).
- Combine ufunc implementations by passing on `*auxdata` (see here, here, here, and here) [e.g., `minimum` and `maximum`; the various comparison functions; the templated unary functions; `find`, `rfind` and maybe `count`; `startswith` and `endswith`; `lstrip`, `rstrip` and `strip`, plus `whitespace` versions]. Attempt in MAINT: combine string ufuncs by passing on auxilliary data #25796.
- `array2string` formatter overrides.
- … (`a.view('2u8')` currently errors with `"Cannot change data-type for object array."`).
- `StringDType` ufuncs to use `out` arguments, also for in-place operations.

Things possibly for longer term

- … `object`).
- … `NpyString` API. This will depend on feedback from users.
- Casting `longdouble` to string, which is listed as broken in a `TODO` item in `casts.c`. Isn't `dragon` able to deal with it?
- … `PyArray_FromAny` finishes filling a new array). We could use this to trim the arena buffer to the exact size needed by the array. Currently we over-allocate because the buffer grows exponentially with a growth factor of 1.25. (Maybe a use for the `size` argument...)
- `.view(StringDType())` could be possible in some cases (e.g., to change the NA behaviour). Would need to share the allocator (and thus have reference counts for that...).
- `str` scalars - see also the more general discussion about array scalars in ENH: No longer auto-convert array scalars to numpy scalars in ufuncs (and elsewhere?) #24897. ENH: add a StringDType scalar type that wraps a UTF-8 string #28165

Small warts, possibly not solvable

- `StringDType` is added to `add_dtype_helper` late in the initialization of `multiarraymodule`; can this be easier?
- `load_non_nullable_string` with its `has_gil` argument.
- Having `dtype.hasobject` be true is logical but not quite the right name.