TSK: Follow-up things for stringdtype #25693

Open
5 of 14 tasks
mhvk opened this issue Jan 25, 2024 · 59 comments

Comments

@mhvk
Contributor

mhvk commented Jan 25, 2024

Proposed new feature or change:

A number of points arose in discussion of #25347 that were deemed better done in follow-up, to not distract from the meat of that (very large) PR. This issue is to remind us of those.

  1. The StringDType doc describes a size argument that was needed during development but not for the actual release. It can be removed. This should be done before the NumPy 2.0 release because it is public API. (Though might it be useful for a short version that only uses the arena? See below. It can probably be put back if needed...)
  2. The add ufunc needs a promoter so that addition with a Python string works (see the sketch after this list).
  3. Add a Cython interface for the stringdtype C API.
  4. It is likely better to use a flag for "long strings" (stored outside of the numpy array proper) instead of one for short ones (stored inside), so that an all-zero entry correctly implies a zero-length string (see API: Introduce stringdtype [NEP 55] #25347 (comment)).
  5. Refactor the flags in the vstring implementation to use bitfields. This will improve clarity and eliminate complicated bitflag math.
  6. Possibly, the arena should have more recoverable flags/size information (see API: Introduce stringdtype [NEP 55] #25347 (comment))
  7. Investigate refactoring new_stringdtype_instance into tp_init
  8. Replace the long #define in casts.c with templating (or .c.src)
  9. Replace ufunc wrappers with passing functions into *auxdata (see here, here, here, and here) [e.g., minimum and maximum; the various comparison functions; the templated unary functions; find, rfind and maybe count; startswith and endswith; lstrip, rstrip and strip, plus whitespace versions]. Attempt in MAINT: combine string ufuncs by passing on auxilliary data #25796
  10. Check array2string formatter overrides.
  11. Adjust error messages referring to "object array" (e.g., a.view('2u8') currently errors with "Cannot change data-type for object array.").
  12. Have some helper functions that make it easy for StringDType ufuncs to use out arguments, also for in-place operations.
  13. See whether null-handling code in ufunc loops and casts can be consolidated into a helper function to reduce code duplication.
  14. Add checks for very long strings, see MAINT: Ensure correct handling for very large unicode strings #27875
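
A minimal sketch of the behaviour item 2 is after (whether this succeeds or raises depends on whether the promoter has landed in your NumPy build):

import numpy as np
from numpy.dtypes import StringDType

arr = np.array(["alpha", "beta"], dtype=StringDType())

# With the promoter registered, the Python str on the right-hand side is
# promoted to StringDType instead of np.add failing to resolve a loop.
print(np.add(arr, "!"))  # expected: ['alpha!' 'beta!']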

Things possibly for longer term

  • Support in structured arrays (perhaps not super useful, but could be seen as similar to object).
  • Expose more of the currently private NpyString API. This will depend on feedback from users.
  • Fix longdouble to string, which is listed as broken in a TODO item in casts.c. Isn't dragon4 able to deal with it?
  • Add a DType API callback that triggers after the initial filling of a newly created array (e.g. after PyArray_FromAny finishes filling a new array). We could use this to trim the arena buffer to the exact size needed by the array. Currently we over-allocate because the buffer grows exponentially with a growth factor of 1.25.
  • Might it make sense on 64-bit systems, where normally the size is 16 bytes, to have an 8-byte version (short strings up to 7 bytes, only arena allocations for long ones; might use the size argument...)?
  • In principle, .view(StringDType()) could be possible in some cases (e.g., to change the NA behaviour). Would need to share the allocator (and thus have reference counts for that...).
  • Dealing with array scalars vs str scalars - see also the more general discussion about array scalars in ENH: No longer auto-convert array scalars to numpy scalars in ufuncs (and elsewhere?) #24897, and ENH: add a StringDType scalar type that wraps a UTF-8 string #28165.

Small warts, possibly not solvable

  • StringDType is added to add_dtype_helper late in the initialization of multiarraymodule; can this be easier?
  • Can the cases of having and not having the GIL be factored out, so that one doesn't get the kind of hybrid code in load_non_nullable_string with its has_gil argument?
  • Having dtype.hasobject be true is logical, but it is not quite the right name.
@ngoldbaum
Member

Added a few things I had on my personal list. Thanks for opening this and for all the review this week!

@ngoldbaum changed the title from "Follow-up things for stringdtype" to "TSK: Follow-up things for stringdtype" on Jan 31, 2024
@ngoldbaum
Member

@asmeurer pointed out to me that we should add a proper scalar type. Right now stringdtype's scalar is str, which doesn't inherit from np.generic, so stringdtype ends up as an oddball in the API from the perspective of the type hierarchy of the scalar types.

We can fix this by defining a proper scalar type that wraps a single packed string entry and exposes an API that is duck-compatible with str, making use of the ufunc implementations we're adding.

This may also lead to performance improvements, since scalar indexing won't need to copy to python's internal string representation.

I don't think this needs to happen for 2.0 since this is something we can improve later and I doubt it will have major user-facing impacts, since object string arrays also effectively have a str scalar.
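
A rough, hypothetical sketch of the idea in plain Python (the class name and internals are illustrative only, not an actual NumPy API): a scalar that holds the UTF-8 payload and delegates str-like behaviour:

# Hypothetical sketch: names and internals are illustrative, not NumPy API.
class StringScalar:
    """Wraps a single string entry and is duck-compatible with str."""

    __slots__ = ("_utf8",)

    def __init__(self, value):
        # In NumPy itself this would wrap the packed string entry directly;
        # here we just store the encoded bytes.
        self._utf8 = str(value).encode("utf-8")

    def __str__(self):
        return self._utf8.decode("utf-8")

    def __repr__(self):
        return f"StringScalar({str(self)!r})"

    def __eq__(self, other):
        return str(self) == str(other)

    def __hash__(self):
        return hash(str(self))

    # Delegate the rest of the str API (upper, find, split, ...) lazily.
    def __getattr__(self, name):
        return getattr(str(self), name)


s = StringScalar("hello")
print(s.upper(), s == "hello")  # HELLO True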

@seberg
Member

seberg commented Feb 6, 2024

so stringdtype ends up as an oddball in the API from the perspective of the type hierarchy of the scalar types.

I personally don't see this as a big issue. Although it might be friendly to not convert to a scalar as often as we currently do when it is a string scalar (bad timeline for that too, though).

@mhvk
Contributor Author

mhvk commented Feb 6, 2024

I think the main problem may be that people generally expect that they can treat array[something] as array-ish, with a .shape, etc. In that sense we probably do need a scalar type (or, perhaps preferably, just not drop to a scalar in the first place...).

If we're going to be "not quite right" for 2.0, should we perhaps err on the side of not creating a str object?

@ngoldbaum
Member

Are you saying that we should create a 0D array instead?

@ngoldbaum
Member

If so, I think it's much more important for the scalar to be duck-compatible with str than with ndarray. Especially if the goal is to replace object string arrays in downstream packages.

@mhvk
Contributor Author

mhvk commented Feb 6, 2024

Yes, I was. I really dislike array scalars and wish everything were 0-D arrays instead. But you make a good point that we want to make sure that object arrays can easily be replaced...

EDIT: because I dislike how the type that comes out of __getitem__ or .sum() depends on the arguments. Why should axis=None give me a different instance than axis=-1? And it is just a hassle if one subclasses ndarray...
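
For reference, the asymmetry in question, shown with a plain float array:

import numpy as np

a = np.ones((2, 3))
print(type(a.sum(axis=0)))     # <class 'numpy.ndarray'>
print(type(a.sum(axis=None)))  # <class 'numpy.float64'> -- an array scalar
print(type(a[0]))              # <class 'numpy.ndarray'>
print(type(a[0, 0]))           # <class 'numpy.float64'>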

@asmeurer
Member

asmeurer commented Feb 7, 2024

I also completely agree that scalars are terrible for a dozen different reasons, and wish numpy just had 0-d arrays. But I guess this was too ambitious for NumPy 2.0. As it stands, scalars exist. Nathan makes a good point that object scalars are already kind of specially broken because they aren't even numpy types, and this new dtype replaces object str arrays for various use-cases (tbf, it also replaces np.str_ arrays for many use-cases too). But at least if you are working with object arrays you probably (or at least hopefully) are doing it on purpose and can be careful about this. For every other dtype, the scalar is at least array-like in most respects. And this is documented too https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.object_:

The object type is also special because an array containing object_ items does not return an object_ object on item access, but instead returns the actual object that the array item refers to.

Making stringdtype scalars subclass from both str and generic seems like the best option from a usability point of view, although if actually making it a real str subclass is not good for performance, just making it duck type and defining __instancecheck__ is probably good enough.
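
A minimal sketch of the duck-typing route (hypothetical names, not what NumPy ships): a metaclass whose __instancecheck__ accepts str-like objects, so isinstance() works without a real str subclass:

class _StrDuckMeta(type):
    def __instancecheck__(cls, obj):
        # Accept real str instances and anything exposing the str API we
        # care about, without requiring actual inheritance.
        return isinstance(obj, str) or hasattr(obj, "encode")


class StringScalarDuck(metaclass=_StrDuckMeta):
    """Hypothetical stand-in for a duck-typed string scalar."""


print(isinstance("hello", StringScalarDuck))  # True, with no str subclassing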

@seberg
Member

seberg commented Feb 7, 2024

Even subclassing from generic gives you crazy things, like indexing must be string indexing and not array indexing! In other words: IMO, part of the problem with scalars is that they pretend to be array-likes, even though it is currently vital for much code that they do (because we create them too often).
Plus, the ABI is likely odd, so care has to be taken (numpy strings pull it off, so I am not sure if it is easy or not).

I envisioned we could create a new, fully abstract np.NumPyScalar class you can register with. But I am not sure it matters here.

In short, I don't care about the hierarchy at all, but I agree there are two things:

  1. You cannot write np.array(scalar) and np.array(str_array[0]) without a dtype because we do not pick the new dtype for those for BC reasons (see the snippet below).
  2. Just like for object dtype, the fact that we don't preserve 0-D arrays is more harmful for this DType compared to numerical dtypes where the scalars somewhat duck-type as arrays.

I suggest moving discussion about point 2 to gh-24897. I could imagine making my try-0d-preservation opt-in more aggressively for all but current dtypes (minus object maybe), even if we don't dare change it otherwise without a 2.0.
And yes, maybe we should never convert to scalars and make that something users always have to do explicitly with .item() or so, but I am not sure that is realistic.
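
The snippet for point 1 (on NumPy 2.x, assuming StringDType is available):

import numpy as np
from numpy.dtypes import StringDType

str_array = np.array(["hello", "world"], dtype=StringDType())

# Round-tripping an element through np.array() silently falls back to the
# legacy fixed-width dtype unless StringDType() is passed explicitly:
print(np.array(str_array[0]).dtype)                       # <U5
print(np.array(str_array[0], dtype=StringDType()).dtype)  # StringDType()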

@MarkusSintonen

MarkusSintonen commented May 23, 2024

Support in structured arrays (perhaps not super useful, but could be seen as similar to object)

I'm excited about StringDType but noticed that it's not working with structured arrays, which is a bummer. We are using structured arrays to efficiently identify "good values" via a boolean mask that is carried together with the actual data. It looks something like this:
numpy.dtype([("values", some_type), ("is_valid", numpy.dtype("?"))])

But this doesn't currently work with some_type=StringDType() 😞 I see there is the StringDType(na_object=None) way to define something like this, but it won't work for efficiently detecting the positions of good/bad values, compared to being able to carry a mask in a struct.
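
For reference, the failing construction (catching broadly, since the exact exception and message may differ between versions):

import numpy as np
from numpy.dtypes import StringDType

try:
    dt = np.dtype([("values", StringDType()), ("is_valid", "?")])
except Exception as exc:
    # NumPy 2.0 rejects StringDType fields inside structured dtypes.
    print("structured dtype with StringDType failed:", exc)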

I also did some basic performance tests in v2.0.0rc2 for the experimental StringDType, but it seems to be much slower than the "O" or "U" dtypes in certain cases; are you aware of these issues?

import random
import timeit

import numpy
from numpy.dtypes import StringDType

print(numpy.__version__)  # '2.0.0rc2'

options = ["a", "bb", "ccc", "dddd"]
lst = random.choices(options, k=1000)
arr_s = numpy.fromiter(lst, dtype=StringDType(), count=len(lst))
arr_o = numpy.fromiter(lst, dtype="O", count=len(lst))
arr_u = numpy.fromiter(lst, dtype="U5", count=len(lst))

print(timeit.timeit(lambda: numpy.unique(arr_s), number=10000))  # 4.270 <- why so much slower?
print(timeit.timeit(lambda: numpy.unique(arr_o), number=10000))  # 2.357
print(timeit.timeit(lambda: numpy.unique(arr_u), number=10000))  # 0.502
print(timeit.timeit(lambda: sorted(set(lst)), number=10000))     # 0.078

@ngoldbaum
Member

ngoldbaum commented May 23, 2024

Supporting structured dtypes is possible in principle but we punted on it for the initial version because it would add a bunch of complexity. The structured dtype code inside numpy (let alone in external libraries) makes strong assumptions about dtype instances being interchangeable, which aren't valid for stringdtype.

One path forward would be adding a mode to stringdtype that always heap-allocates. This isn't great for performance or cache locality, but does mean that the packed strings always store heap pointers and the dtype instance does not store any sidecar data. I'm not sure how hard that would be - the structured dtype code is quite complicated and I'm not sure how easy it is to determine whether a stringdtype value lives inside a structured dtype vs a stand-alone stringdtype instance.

Another thing I have in mind is to eventually add a fixed-width string mode to stringdtype, which we could also straightforwardly support in structured dtypes, although if you need variable-width strings that doesn't help much.

Thanks for the report about unique being slow; I opened #26510 to track that. Please also open follow-up issues with reports about performance if you have others; it's very appreciated, especially with a script to reproduce what you're seeing.

@MarkusSintonen

MarkusSintonen commented May 23, 2024

Thanks, and thanks for your great work!

Another thing I have in mind is to eventually add a fixed-width string mode to stringdtype, which would we could also straightforwardly support in structured dtypes, although if you need variable-width strings that doesn't help much.

Variable-width strings are the thing we are interested in :) Essentially improving performance of columnar data processing that may involve free-form strings. There is the "O" type for this, but it has some issues. E.g. the above shows how numpy.unique (for str-O) is quite slow compared to Python's own set. Potentially the more native StringDType could help there.

@seberg
Member

seberg commented May 23, 2024

FWIW, I think we should allow putting stringdtype into structs.

I think the main issue is that string dtype requires a new dtype instance to be created in some places and that property must be "inherited" by the structured dtype.
That might be as easy as implementing the hook, but may also need a new dtype flag and auditing some other paths (I am not sure, things tend to turn out harder than you think).

But I am not sure if anyone will prioritize working on this soon; if anyone is interested, I think we will definitely help them get started.

unique in NumPy doesn't use hashes (for the time being at least), but a custom sort would presumably get it comparable to the old strings (dunno if faster or slower). But comparing to Python has to be done with care: Python strings cache their hash and can "cheat" because you have identical objects which are known to be equal without even looking at the string itself!
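
A quick way to see why the pure-Python baseline above is so cheap: the 1000-element list references only four distinct string objects, so set() mostly does identity checks and reuses cached hashes instead of re-reading characters:

import random

options = ["a", "bb", "ccc", "dddd"]
lst = random.choices(options, k=1000)

# Only four distinct objects are referenced, just repeated many times:
print(len({id(s) for s in lst}))  # 4

# set() deduplicates them almost entirely via identity and cached hashes;
# NumPy's sort-based unique has to compare the actual string contents.
print(sorted(set(lst)))           # ['a', 'bb', 'ccc', 'dddd']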

@bersbersbers
Contributor

I just opened #26747 which is likely a duplicate/part of this issue here.

TLDR: add StringDType to the type stubs (dtypes.pyi, strings.pyi) and documentation. Type hints especially are something I find important - with type-checking in CI nowadays, people run the risk of spending more time getting the CI to pass than what they save by using the new API...

@bersbersbers
Contributor

Hi, may I ask what the status of the scalar-type issue is, since I do not see it in the task list at the top of this issue? I still seem unable to correctly type a StringDType array, see #26747 (comment). Given that #26528 and #26747 are closed and #27008 has been merged, I wonder where else this may be tracked.

@ngoldbaum
Member

Thanks for the reminder. I think this should be doable before NumPy 2.2 gets branched if @jorenham can give me some code review. I'll try to poke at this tomorrow.

Also, contributions are always welcome; the main blocker here is my lack of facility with Python static typing.

@jorenham
Member

I think this should be doable before NumPy 2.2 gets branched if @jorenham can give me some code review.

I'd be happy to :)

@ngoldbaum
Member

Unfortunately I'm a little out of my depth. I defined

_ArrayLikeString_co: TypeAlias = _SupportsArray[np.dtypes.StringDType]

And then if I try to use it in strings.pyi to add a type annotation for np.strings.equal:

Str_co = _ArrayLikeString_co

...


@overload
 def equal(x1: Str_co, x2: Str_co) -> NDArray[np.bool]: ...

This for some reason causes the typing tests to fail:

E           AssertionError: Error mismatch at line 7
E
E           Expression: np.strings.equal(AR_U, AR_S)
E           Expected error:
E           Observed error: 'incompatible type'

I don't actually know why that's considered a type error; NumPy is perfectly happy to accept arrays of different dtypes like that at runtime. But regardless, I also don't understand why adding one overload would break overloads for entirely different types.

@ngoldbaum
Member

I pushed my branch here.

@jorenham
Member

Unfortunately I'm a little out of my depth. I defined _ArrayLikeString_co [...] This for some reason causes the typing tests to fail [...] I also don't understand why adding one overload would break overloads for entirely different types.

My guess is that the error is about numpy.dtypes.StringDType currently being incorrect, and that's because, well, it is. Currently StringDType is annotated as a subtype of dtype[str], which is invalid because str isn't assignable to numpy.generic.

So before being able to properly use StringDType, its annotations in numpy.dtypes need to be updated so that it's a subtype of dtype[{{ insert_string_scalar_type_here }}]. Otherwise, mypy will reject every expression that uses it.

@jorenham
Member

In the meantime, I'll have a look at StringDType to see if I can find a workaround so that mypy considers it valid, without it becoming (too) incorrect.

@jorenham
Member

In numpy/numtype#333 I noticed that the np.isnan ufunc secretly also accepts StringDType:

>>> import numpy as np
>>> np.isnan(np.array("nan", "T"))
np.False_

But everything I tried seems to be False_, implying that a string is not not a number 🤔.

So was this backdoor intentionally left open for some use-case I'm missing, or should the door be closed, after all?
It's a cheesy analogy, but I can't resist an open door, apparently

@ngoldbaum
Member

ngoldbaum commented Mar 19, 2025

But everything I tried seems to be False_, implying that a string is not not a number 🤔.

I’m not near a computer so can’t copy-paste an example, but if you create a StringDType object with na_object=np.nan, np.isnan will return True for values that are set to be NaN.

Something like this:

import numpy as np

dt = np.dtypes.StringDType(na_object=np.nan)
print(np.isnan(np.array(["hello", "world", np.nan], dtype=dt)))

Unless I’m completely misremembering, that should print [False, False, True]

The reason for this is that object string arrays behave the same way if any of the values is np.nan.

@ngoldbaum
Member

Also you’re not creating a StringDType array in that example, you’re creating a U3 unicode fixed-width DType.

@jorenham
Member

Also you’re not creating a StringDType array in that example, you’re creating a U3 unicode fixed-width DType.

I explicitly passed the T typecode as dtype?

@ngoldbaum
Member

I explicitly passed the T typecode as dtype?

Ah, sorry, I’m not used to seeing dtype passed as a positional argument, never mind.

@ngoldbaum
Member

Over at h5py/h5py#2557 (comment) we found a case where it's annoying that the NpyString C API doesn't supply strings that are guaranteed to be null-terminated. There was also a need for a function that returns a char** pointer array of null-terminated strings, which also seems like a useful thing that people will often want when working with StringDType arrays in native code.

It would be easiest to provide both APIs if we moved to storing null-terminated strings internally. You could then create a char** array of C strings with no copies - just some integer math to generate pointers into the arena allocation.

Of course the downside is an extra byte per string in the heap allocation, at least in the most straightforward implementation.

@mhvk I'm curious what you think about this. Of course anyone else too :)

@mhvk
Contributor Author

mhvk commented Apr 8, 2025

@ngoldbaum - I'd hesitate to start guaranteeing zero-terminated strings. My hesitation is partially that I'm not sure I understand the use case all that well: for something like h5py, converting numpy->h5py presumably means one will in the end want to store the strings in some file, in which case a copy is needed anyway, and it is not obvious there is much gain from having zero-terminated strings. Conversely, for reading (h5py->numpy), it should already be possible to have a no-copy version, by simply setting appropriate pointers and lengths.

But maybe more importantly, I worry that by guaranteeing a zero-terminated string, we close off potential future speed-ups that may be quite useful. E.g., with the current scheme, it is possible to have a read-only view of subsets of all the strings without making a copy of the arena and heap, by simply copying over the array of pointers and adjusting the lengths (of course, the short strings would need to be adjusted). However, a full copy would have to be made if the subset strings have to be zero-terminated.

I guess a compromise would be to have an option in StringDType for strings to be zero-terminated, but whether that is useful depends on how big the benefits are...

Aside, just thinking through: if one of the main cases of interest is one where one is handing over the array (i.e., the numpy array is no longer needed, and all associated memory can be reused), then, given that the arena has at least one byte in front of each string for the length, for all but the last string one can in fact make them zero-terminated by overwriting the length of the next string. Even the last string would be OK unless the arena happens to be exactly full. Similarly, even for 15-character-long short strings one can make them zero-terminated by overwriting the flags byte of the next string (except for the last string, if it is exactly 15 characters long; need to check byte ordering, though!). So, if this type of transfer is a common theme, we could consider allocating 1 extra byte for anything on the heap so that almost all strings can be done with zero copies. Though of course, this would mostly be to reduce memory usage; writing all those zero bytes may not be all that much faster than just copying...

I guess overall my tendency would be to provide some helper functions that convert to/from char**, with perhaps the option to re-use memory.

@crusaderky
Contributor

for something like h5py, converting numpy->h5py presumably means one will in the end want to store the strings in some file, in which case a copy is needed anyway and it is not obvious there is much gain from having zero-terminated strings.

libhdf5 writes variable-width strings by taking as input a char** of null-terminated strings, in memory allocated by the user, externally to libhdf5. After H5Dwrite is finished, the user needs to release their own memory. For everything but variable-width strings, this means that the write() syscall reads directly from memory managed by NumPy.

Conversely, for reading (h5py->numpy), it should already be possible to have a no-copy version, by simply setting appropriate pointers and lengths.

Correct; H5Dread returns a char** of null-terminated strings managed internally by libhdf5, so, after NpyString_pack internally deep-copies the strings into memory managed by NumPy, one can call H5Dvlen_reclaim to release the temporary memory managed by libhdf5.

I guess a compromise would be to have an option in StringDType for strings to be zero terminated, but whether that is useful depends on how big the benefits are...

If the option is opt-in, the benefits for writing to h5py would be very unlikely to materialize, as most of the data will be created by end users without that option, so h5py would need to call .astype(StringDType(null_terminated=True)) ahead of writing, which would probably not be any faster (just marginally more convenient) than the manual copy to the null-terminated temporary char** we do today.

Aside, just thinking through: if one of the main cases of interest is one where one is handing over the array (i.e., the numpy array is no longer needed, and all associated memory can be reused)

This is definitely not the h5py use case unless the library were to make a temporary deep copy of the numpy array internally (again, something it already does); more generally, Python doesn't have the concept of "this function consumes its parameters" like e.g. Rust does. On the contrary, a frequent source of bugs is a function accidentally writing back to its inputs, and most unit tests won't check for it.

Unless of course you have a reverse function that undoes the conversion ndarray of npystrings <-> null-terminated char**, to be called after H5Dwrite, but honestly that seems quite convoluted to me?

@crusaderky
Contributor

I think a low-hanging fruit would be to offer a variant of NpyString_load that returns a null-terminated char*, which you then need to "release" (quotes are necessary) with another API call after you're done with it. For strings of 0~15 bytes, it could do so in-place and the "release" would be a no-op; for everything else it would need to see if there is a free byte lying around in the arena (again resulting in a no-op) and, if all else fails, make a temporary(?) copy.

@mhvk
Contributor Author

mhvk commented Apr 9, 2025

@crusaderky - isn't there a problem for any type of string? Fixed-width "S" and "U" also do not store zero-terminated strings (if they are of the fixed width).

But having a helper function seems a good idea, and something that we can improve on as need be.

@crusaderky
Contributor

'S' fixed-width strings are not zero-terminated in HDF5.
'U' is unsupported.

@mhvk
Contributor Author

mhvk commented Apr 9, 2025

'S' fixed-width strings are not zero-terminated in HDF5. 'U' is unsupported.

Ah, yes, makes sense for 'S'. Maybe a helper function would make it possible to support "U" as well, if indirectly, via StringDType!

Anyway, for the helper function, given that files are typically read much more often than written, the reverse function, to get a StringDType array out of an hdf5 file, is perhaps the one to think most about. Of particular interest is, I think, read-only access. Does hdf5 know about the lengths of all the strings? (From my reading of the docs, it didn't seem the length was stored, but is it perhaps guaranteed that strings are stored sequentially and that thus one can calculate it trivially?) If so, might one consider having everything initially be on the heap, with copy-on-access for short strings? If lengths are not known, one would seem stuck with having to scan the whole heap, and I guess one might then as well copy all the short strings, but otherwise keep things on the heap. I guess the question then is whether it is worth determining the length lazily (I think an empty string currently cannot be on the heap, so having a heap pointer and length zero (or "-1") could be the flag for determining the length on first access).

@crusaderky
Contributor

I'm not sure it's a good idea to try to tamper with H5Dread. The key problem is that the data is read into temporary memory managed by HDF5 as a monolithic chunk read, then potentially decompressed in bulk, also into memory managed by HDF5, and finally converted into a char**. I'm not familiar with the internal disk representation of HDF5, but I suspect that the memory pointed to by the char** is the raw output of the read/decompress, so there's no way you can pass it to numpy.

@mhvk
Contributor Author

mhvk commented Apr 9, 2025

Yes, actually dealing with ownership may well be tricky, though if that were possible, I think the current StringDType machinery might well be able to interpret the uncompressed bit of memory.

In any case, if a copy on read is basically required, the case for changing to null-terminated strings for StringDType becomes quite weak; writing on its own would not seem worth the increased inefficiency for the general case.

@ngoldbaum
Member

I think I agree with that. Maybe we’ll find other arguments for the trailing nulls but this may not be compelling enough.

That said, I also think we can probably avoid copies, particularly for many short strings, if we provide the gather API for h5py to consume. There may also be a way to encode the size prefix for arena strings in such a way that the first byte is almost always null, so arena strings happen to nearly always be null-terminated for free.

@vadimkantorov

vadimkantorov commented Apr 15, 2025

Am I correct that currently NumPy does not support representing a list of varlen strings where the strings are stored concatenated in the array's buffer (e.g. represented as UTF-8 bytes)?

I read through https://numpy.org/devdocs/user/basics.strings.html, and it seems that varlen strings are only supported via np.object (btw np.object is not mentioned in https://numpy.org/devdocs/user/basics.strings.html) or via StringDType, but both just store pointers to string objects stored on the Python object heap or in some string heap?

But wouldn't it make sense to also support some packed buffer formats for readonly string arrays? Or does stringdtype do this already?

Then NumPy could offer zero-copy conversions from Arrow, and could even mmap a super-large text file directly (and e.g. find the string lengths by scanning for newlines; do no decoding if the user asserts it's already utf-8 encoded; or let the user provide the string lengths separately if they have this info already), and also get a better memory layout for sequential vectorized processing... This would also make it simpler to serialize/deserialize such an array, to share it between processes, etc.

@ngoldbaum
Member

ngoldbaum commented Apr 15, 2025

btw np.object is not mentioned in https://numpy.org/devdocs/user/basics.strings.html

Fair point; we could probably document that np.object is still a thing, but we really wanted to point people to StringDType rather than object string arrays. Our hope is that StringDType provides a superset of the functionality of object strings.

But wouldn't it make sense to also support some packed buffer formats for readonly string arrays?

Very much agree. If StringDType had a mode where the array is immutable and it exposed exactly the arrow format that would be really cool. This can’t be the default though because then StringDType can’t replace object string arrays.

@vadimkantorov

vadimkantorov commented Apr 15, 2025

but we really wanted to point people to StringDType rather than object string arrays

I stumbled on np.object as the representation for strings when doing something like pyarrow.parquet.read_table("some.parquet")["string_col"].to_numpy().dtype, although I wish it did not create tons of Python string objects (which itself is maybe not nice when dealing with millions or billions of strings).

I was looking for a modern NumPy way of representing a large dataset with three string columns which plays nicely with serialization (so three independent columns of packed strings, or somehow an array of varlen records, where each record packs three strings).

I hoped there was a zero-copy way of representing a readonly column of strings obtained from Arrow or HDF5 disk formats (and maybe some other raw formats) in NumPy - or at least one that does not leave millions of Python string objects around.

@ngoldbaum
Member

Yes, unfortunately we probably can’t switch the default inferred representation until NumPy 3.0 ☹. Yet another reason that page should definitely mention object strings; when I was updating those docs to talk about StringDType I wasn’t thinking about it from that perspective.

PyArrow might certainly be able to do something smarter on newer NumPy versions that support StringDType; I haven’t followed PyArrow development recently or tried to push that.

@vadimkantorov

vadimkantorov commented Apr 15, 2025

I also discussed these matters in PyTorch (and various useful internal representations):

Basically, there are currently problems with sharing large readonly string arrays across processes - and packed memory layouts are much more beneficial than sharing tons of objects (and it appears there still are problems even with NumPy arrays in some cases):

So if NumPy proposes new modern designs here, it's more likely that PyTorch would follow and implement some support for string readonly arrays and some GPU-accelerated string processing.


Also, np.object as strings appear in Triton Inference Server Python bindings:

@mhvk
Contributor Author

mhvk commented Apr 15, 2025

@vadimkantorov - I think the new StringDType could in principle come close - although how numpy operates on arrays means that there always will have to be pointers (that elements are separated by a fixed stride is non-negotiable in the numpy design), but at least those pointers can in principle be to within a previously allocated block of memory (though that needs new transfer code to be written).

@vadimkantorov

vadimkantorov commented Apr 15, 2025

Regarding the "default" internal representation, I think there could be several supported internal representations concurrently - for various use cases - and the default might be switched later.

E.g. it can make sense to support both utf-8 and utf-32 (for fixed-size char and indexing simplicity), and support zero-byte termination or "termination" with newline (or another custom char). At least for readonly ways of representing an existing dataset without copies and indexing into it - these can be useful. Some more discussion in the first pytorch issue I linked.

Another nice use case would be parallelized transcoding of the whole array of strings into another encoding (also, NumPy should just copy the buffer of a Python string object if it's already encoded in the correct encoding).

For PyTorch, there is now in core a concept of NestedTensor, which is a tuple of a data storage and an indexes tensor, so I thought maybe this structure could represent varlen strings (a downside is that currently indexes is always int64, which is sometimes wasteful for small strings); e.g. int8 could be interpreted by string-processing functions as ascii, uint8 as utf-8, and uint32/int32 as utf-32...

@ngoldbaum
Member

We intentionally made the packed string an opaque pointer so the heap allocation can be changed in the future.

I’m also planning to add support for encodings besides UTF-8 and a fixed-width mode to replace the fixed-width legacy string dtypes, but am focusing on other projects for at least the short to medium term. If you’re interested in working on this I’m happy to help.

@vadimkantorov

vadimkantorov commented Apr 16, 2025

Also, maybe studying some of the Arrow binary formats for string columns might be useful, as then zero-copy conversion from Arrow can be supported:

https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout

It seems quite tricky and has some optimizations (and even padding) for short strings.

@vadimkantorov

vadimkantorov commented Apr 16, 2025

Also, currently these object-typed arrays (sourced from parquet/arrow or triton-server) cannot be np.load-ed without pickle (although np.save somehow passes okay). I would suggest that np.save should then also require allow_pickle=True (to make the np.load behaviour less unexpected).

import numpy as np

a = b = np.array(['abc', 'def'], dtype = object)
np.save('a.npy', a)
_a = np.load('a.npy')

#    np.load('a.npy')
#  File "/home/inferencer/.local/lib/python3.10/site-packages/numpy/lib/_npyio_impl.py", line 488, in load
#    return format.read_array(fid, allow_pickle=allow_pickle,
#  File "/home/inferencer/.local/lib/python3.10/site-packages/numpy/lib/format.py", line 822, in read_array
#    raise ValueError("Object arrays cannot be loaded when "
#ValueError: Object arrays cannot be loaded when allow_pickle=False

#res = np.rec.fromarrays([a, b], names = ['a', 'b'])
#np.save('res.npy', res)
#_ab = np.load('res.npy')

Basically I was looking for a way to load a bunch of string columns from parquet and save them to a plain .npy file. Seems not so easy :) What would be the recommended way of doing so (preferably without allow_pickle=True - for safety)?

@vadimkantorov

vadimkantorov commented Apr 16, 2025

import numpy as np; np.save('a.npy', np.array(['abc', 'def'], dtype = object).astype(np.dtypes.StringDType())) warns with:

/home/inferencer/.local/lib/python3.10/site-packages/numpy/lib/format.py:382: UserWarning: Custom dtypes are saved as python objects using the pickle protocol. Loading this file requires allow_pickle=True to be set.
  d['descr'] = dtype_to_descr(array.dtype)

I would suggest that maybe both saving/loading an array of str objects and saving/loading a StringDType array should not require pickling - similar to how PyTorch now disables pickle by default for saving/loading...

Is there currently a way of saving/loading an array of varlen strings without requiring pickle?

@ngoldbaum
Member

Unfortunately that requires updating the npz format, since npz doesn’t have any support for sidecar files or non-strided data. There’s earlier discussion about this on the issue tracker.

I’d love to fix that, it just takes time and effort to get it right, since any update to the npz format can’t break existing readers of the existing npz format.

@vadimkantorov

vadimkantorov commented Apr 16, 2025

Hopefully packing metadata + packed string data could circumvent any npz buffer format restrictions? (e.g. number of strings + lens/indices/extra metadata | raw packed string data - at least for serialization to disk... or maybe it could directly adopt some existing serialization formats like arrow for non-compressed / parquet for compressed)
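
A rough sketch of that idea as a workaround that already works today (not an official API; it just stores a concatenated UTF-8 buffer plus end offsets as two ordinary arrays in an .npz):

import numpy as np
from numpy.dtypes import StringDType


def save_strings(path, strings):
    # Concatenate the UTF-8 payloads and record where each one ends.
    encoded = [s.encode("utf-8") for s in strings]
    buf = np.frombuffer(b"".join(encoded), dtype=np.uint8)
    ends = np.cumsum([len(b) for b in encoded], dtype=np.int64)
    np.savez(path, buf=buf, ends=ends)


def load_strings(path):
    with np.load(path) as f:  # no allow_pickle needed
        buf, ends = f["buf"].tobytes(), f["ends"]
    starts = np.concatenate(([0], ends[:-1]))
    return np.array([buf[a:b].decode("utf-8") for a, b in zip(starts, ends)],
                    dtype=StringDType())


save_strings("cols.npz", ["abc", "def", "ghé"])
print(load_strings("cols.npz"))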

@ngoldbaum
Member

See #24110 (comment).

FWIW @seberg and I are working on a PEP to make StringDType work with the buffer protocol by allowing libraries to define their own formats. That might help with this sort of thing, or at least codify something that a new npz format could use. See https://discuss.python.org/t/buffer-protocol-and-arbitrary-data-types/26256/13.

If you’re interested in working on any of this I’d be happy to provide feedback or help. As I said upthread I’m focusing on other things (ecosystem free-threaded support mostly) for the short to medium term.
