
Fix #5339: cache dtype.__hash__. #5343


Merged: 1 commit merged into numpy:master from the cache_dtype_hash branch on Apr 4, 2015

Conversation

pitrou
Member

@pitrou pitrou commented Dec 3, 2014

Fix #5339: cache dtype.__hash__

@pitrou
Member Author

pitrou commented Dec 3, 2014

Small benchmarks:

python3.4 -m timeit -s "import numpy as np; t=np.dtype('uint64')" "hash(t)"
  • before patch: 1000000 loops, best of 3: 0.498 usec per loop
  • after patch: 10000000 loops, best of 3: 0.0616 usec per loop
python3.4 -m timeit -s "import numpy as np; t=np.dtype([(s, 'f') for s in 'abcdefghij'])" "hash(t)"
  • before patch: 100000 loops, best of 3: 4.43 usec per loop
  • after patch: 10000000 loops, best of 3: 0.0603 usec per loop
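For reference, the change boils down to something of this shape; this is only a sketch, and _compute_descr_hash is a made-up stand-in for the existing dtype-hashing routine:

    #include <Python.h>
    #include <numpy/arrayobject.h>

    /* Hypothetical stand-in for NumPy's existing dtype-hashing code. */
    static npy_hash_t _compute_descr_hash(PyArray_Descr *descr);

    /*
     * Sketch of the cached __hash__ slot: compute the hash once, store it
     * in the new `hash` field, and reuse it on later calls.  A value of -1
     * means "not yet computed".
     */
    static npy_hash_t
    cached_descr_hash(PyArray_Descr *self)
    {
        if (self->hash == -1) {
            self->hash = _compute_descr_hash(self);
        }
        return self->hash;
    }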

In numpy/core/include/numpy/ndarraytypes.h:

/* Cached hash value (-1 if not yet computed).
 * This was added for NumPy 2.0.0.
 */
npy_hash_t hash;
Contributor

This unfortunately breaks the ABI and causes forward-compatibility issues with Cython.
Though I am playing with breaking the ABI by hiding the ufunc structure (at least partially), maybe we could break this too.
But if we do, we should do it properly and add a pointer to a private object or something similar.
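For illustration, that "pointer to a private object" variant might look roughly like this (all names below are made up, not actual NumPy API):

    #include <Python.h>
    #include <numpy/arrayobject.h>

    /* The public struct would only grow by one opaque pointer, and any
     * private fields (such as a cached hash) would live behind it. */
    typedef struct {
        npy_hash_t cached_hash;   /* -1 if not yet computed */
        /* further private fields can be added without another ABI change */
    } descr_private_data;

    /* In the public PyArray_Descr, the only visible addition would then be
     * a single opaque pointer, e.g.:
     *
     *     void *reserved;    points to a descr_private_data, owned by NumPy
     */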

Member

Shouldn't adding a field at the end be fine, ABI-wise? We don't allow subclassing. What Cython issue are you thinking of?

I doubt there's any legitimate reason for anyone outside numpy to access anything besides a few fields at the beginning anyway (in particular itemsize), and I've also been pondering plans to rewrite the dtype struct, so I wouldn't rule that out either. Rather than using a pimpl, though, it might be sufficient to just hide the private fields behind a private struct definition that only we can cast to.

Member Author

But if we do we should do it properly and add a pointer to a private object or something similar.

That sounds reasonable. Do you have a patch for that somewhere?

Contributor

Extending structs breaks programs that embed this struct into other structures (unlikely), and Cython checks the size of these types and complains if they differ from what the program was built with. That's just an annoyance, but one that comes up very often.
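To make the Cython point concrete: the import-time check generated by Cython is roughly of this shape (a heavily simplified illustration, not the actual generated code):

    #include <Python.h>

    /* The struct size compiled into the extension module is compared
     * against the size of the type found at runtime; a RuntimeWarning is
     * emitted if they differ. */
    static int
    check_dtype_struct_size(PyTypeObject *dtype_type, Py_ssize_t compiled_size)
    {
        if (dtype_type->tp_basicsize != compiled_size) {
            return PyErr_WarnEx(PyExc_RuntimeWarning,
                    "numpy.dtype size changed, may indicate binary incompatibility",
                    1);
        }
        return 0;
    }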

Contributor

I haven't finished my ufunc patch yet, though it goes along the lines njsmith suggested: keep the current structure and internally use a larger one. Additionally, it might be interesting to mangle the public part in the development version to trigger failures for people who use the internals; for a release the mangling can be reverted.
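A minimal sketch of that "larger internal structure" idea, with hypothetical names rather than NumPy's actual layout:

    #include <Python.h>
    #include <numpy/arrayobject.h>

    /* The public PyArray_Descr layout stays untouched; NumPy allocates a
     * bigger struct that embeds it, so only NumPy-internal code can reach
     * the private tail. */
    typedef struct {
        PyArray_Descr base;   /* public part, unchanged ABI */
        npy_hash_t hash;      /* private cached hash, -1 when not computed */
    } _PyArray_DescrInternal;

    /* Only NumPy-internal code would cast down to the internal struct. */
    #define DESCR_INTERNAL(d) ((_PyArray_DescrInternal *)(d))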

Member Author

I'm less knowledgeable than you guys :) What should be the way forward?

Contributor

One way would be to implement the hiding of the object internals; another is to wait until someone else does it, but that might take a while.
Unfortunately the dtype internals are probably used far more commonly than the ufunc internals I want to hide, which gives us much less wiggle room for breaking the ABI.
Unless someone finds a way to do this without breaking it; maybe some kind of magic number in one of the existing fields could work.

Contributor

Maybe I am also just being overly cautious. In the absence of a better idea, if you want to work on it, I think the approach of adding a second private struct that embeds the public struct is worthwhile. Its only problem is that we technically can't trust the hash to be correct if others do unusual things with the public part.

Member Author

Would the general scheme work like PyArray_Fields currently does? Which dtype fields should be private vs. public exactly?

Member Author

Ping on my questions above :)

@pitrou
Member Author

pitrou commented Mar 24, 2015

Re-re-ping.

@charris
Member

charris commented Mar 26, 2015

@juliantaylor @njsmith Thoughts? I agree with Julian that the descr internals are accessed in a lot of code, and that embedding the structure might be a way forward, i.e., dividing it into private and public parts. Long term, we might want to add functions for access, but I don't think we can require them in any time frame less than several/many years.

@njsmith
Member

njsmith commented Mar 26, 2015

I'm all for doing some more general rearrangements to how the dtype struct works, but that's another discussion. I am totally unbothered by adding a field to the end of the struct now and deferring any more cleanups until later.

I'd worry more about correctness in the face of the fact that dtypes are mutable in general. (This is horrible, but I'm pretty sure it's true that e.g. you can modify the field names of a structured dtype.) One option would be to only cache the hashes of known-safe dtypes like the atomic types. Another would be to tell people sorry, if you go mutating dtypes in place then you deserve whatever you get.

Making the uninitialized value 0 as discussed above might be a minor improvement.

@pitrou
Member Author

pitrou commented Mar 26, 2015

Could you show me an example of how to mutate a dtype?

@pv
Member

pv commented Mar 26, 2015

x = np.dtype([('a', 'f8')])
x.names = ('foo',)

@pitrou
Member Author

pitrou commented Mar 26, 2015

Thanks. Apparently names is the only settable attribute on dtype, so we can invalidate the cached hash there (and in __setstate__ as well).
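The invalidation itself can be as small as resetting the sentinel wherever the dtype is mutated; a sketch (the setter name and surrounding details are illustrative, not the exact NumPy source):

    #include <Python.h>
    #include <numpy/arrayobject.h>

    /* Called when `dtype.names` is reassigned; the same one-liner would go
     * into __setstate__. */
    static int
    descr_names_set(PyArray_Descr *self, PyObject *newnames)
    {
        /* ... validate `newnames` and install the new field names ... */

        self->hash = -1;   /* invalidate: force recomputation on next hash() */
        return 0;
    }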

@charris
Member

charris commented Mar 26, 2015

Hmm, is there a good reason for the names being mutable? Should we consider removing that ability?

@pitrou
Member Author

pitrou commented Mar 26, 2015

As a matter of fact, there's the following comment in the setter for the names attribute:

    /*
     * This deprecation has been temporarily removed for the NumPy 1.7
     * release. It should be re-added after the 1.7 branch is done,
     * and a convenience API to replace the typical use-cases for
     * mutable names should be implemented.
     *
     * if (DEPRECATE("Setting NumPy dtype names is deprecated, the dtype "
     *                "will become immutable in a future version") < 0) {
     *     return -1;
     * }
     */

@pitrou
Member Author

pitrou commented Mar 26, 2015

(the deprecation was added in 935c5b0 and commented out in ff4aef4)

@pitrou
Member Author

pitrou commented Mar 26, 2015

I've updated the PR to invalidate the cache when necessary, and add tests.

@charris
Member

charris commented Mar 26, 2015

Why the merge from master? A few comments:

  • Should probably uncomment the deprecation and test for it, maybe as a separate PR. I assume it was commented out for a reason, so someone must be using the feature. Can views be used instead?
  • What was the circular include dependency? Is there a cleaner way to fix it? In any case, the name of the new include file doesn't sound right, it is less than minimal ;)

@pitrou
Member Author

pitrou commented Mar 26, 2015

What was the circular include dependency?

npy_3kcompat.h includes ndarraytypes.h, which includes ndarrayobject.h, which in turn includes npy_3kcompat.h.

Is there a cleaner way to fix it?

Depends what you call "cleaner" :-) Perhaps move the "npy_hash_t" typedef into npy_common.h?
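For reference, the typedef in question is roughly of this shape (reconstructed from memory, so treat the exact version guards as an assumption):

    #include <Python.h>

    /* Hash value compatibility: Python 3.2+ uses Py_hash_t, older versions
     * use a plain C long. */
    #if PY_VERSION_HEX >= 0x03020000
    typedef Py_hash_t npy_hash_t;
    #else
    typedef long npy_hash_t;
    #endif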

@charris
Member

charris commented Mar 26, 2015

Mailing list thread on the topic: http://thread.gmane.org/gmane.comp.python.numeric.general/43047/focus=43119. Note the comment that the dtype was made mutable to work around the inability to do the renames with a .view. We might want to take a look at that, as it seems that would be the preferred alternative.

@charris
Member

charris commented Mar 28, 2015

@pitrou Yeah, let's put npy_hash_t in npy_common.h. It's a late addition to the compat file, and that file contains no other typedefs. After that I think you can rebase cleanly on master and squash the commits into one. Please use the ENH: tag on the commit summary line, and add a note to the 1.10 release notes mentioning the change in the struct (although I don't think it will cause any problems except for Cython warnings). Please also expand the rest of the commit message, saying what you have done, and include the benchmark results. Finish the commit message with Closes #5339. After all that, do a force push ;)

@pitrou pitrou force-pushed the cache_dtype_hash branch 2 times, most recently from 30e4313 to 6a373cb on April 4, 2015 at 16:40
Computing the hash of a dtype can be slow for complex (e.g. structured)
dtypes.  Hashing dtypes can be useful in some applications, such as
when doing type-based dispatching, and speed can be critical in those
cases.  This enhancement caches the once-computed hash value in the
dtype structure, so as to save time on further lookups.  The cached
value is invalidated in the rare cases where the dtype is mutated.

Benchmark numbers:

python3.4 -m timeit -s "import numpy as np; t=np.dtype('uint64')" "hash(t)"
* before patch: 1000000 loops, best of 3: 0.498 usec per loop
* after patch: 10000000 loops, best of 3: 0.0616 usec per loop

python3.4 -m timeit -s "import numpy as np; t=np.dtype([(s, 'f') for s in 'abcdefghij'])" "hash(t)"
* before patch: 100000 loops, best of 3: 4.43 usec per loop
* after patch: 10000000 loops, best of 3: 0.0603 usec per loop

Closes numpy#5339.
@pitrou pitrou force-pushed the cache_dtype_hash branch from 6a373cb to cca2c1a on April 4, 2015 at 16:45
@pitrou
Member Author

pitrou commented Apr 4, 2015

@charris, ok, I think I have done everything you asked for.

charris added a commit that referenced this pull request Apr 4, 2015
@charris charris merged commit e05b758 into numpy:master Apr 4, 2015
@charris
Member

charris commented Apr 4, 2015

LGTM, thanks @pitrou. @juliantaylor @njsmith This simply extends the struct at the end. I don't think this will break anything in practice. If we intend to hide some stuff we should settle how to do that soon.

@juliantaylor
Contributor

I don't like starting the 1.7 Cython warning nonsense again, or taking a (minor) ABI break, for a performance improvement most users are never going to see...
Sorting this out once and for all needs to be a 1.10 blocker.

@pitrou
Member Author

pitrou commented Apr 4, 2015

Perhaps Cython simply needs fixing? What is the warning about exactly?

@juliantaylor
Contributor

A change of structure size, i.e. an ABI break. It's probably too noisy, as the ABI is only broken in (hopefully) very uncommon cases. Changing Cython will not help much, as the current structure size is already included in existing binaries, which you'd need to recompile to remove the warning.

@rgommers
Member

rgommers commented Apr 4, 2015

Details at #432. That was/is indeed annoying as hell.

@njsmith
Member

njsmith commented Apr 4, 2015

The change in #432 to make cython shut up about this nonsense is still in
effect, right? Doesn't that take care of things?


@pitrou
Member Author

pitrou commented Apr 4, 2015

PyObjects should only be allocated dynamically and passed by reference (the exception is static definition of a PyTypeObject), so I don't know why a size change would be a problem. CPython can change the size of builtin objects from one version to another: does Cython complain about that?

@njsmith
Member

njsmith commented Apr 4, 2015

It's never made a huge amount of sense to me either. You might bring it up
with the Cython devs...

(Does CPython really change the size of builtin objects in point releases?
IIRC Cython extensions don't restrict themselves to the stable ABI, so they
have to be recompiled for major releases anyway.)


@rgommers
Member

rgommers commented Apr 4, 2015

@njsmith I just checked: the filters in numpy/__init__.py are still there and the strings they filter on are unchanged, so this PR should be OK. There's no filter for numpy.ndarray though; maybe we should add that proactively, just in case.

@pitrou
Member Author

pitrou commented Apr 4, 2015

Does CPython really change the size of builtin objects in point releases?

No, normally not. Only in feature releases.

@njsmith
Member

njsmith commented Apr 4, 2015

There's no filter for numpy.ndarray though, maybe we should add that proactively just in case.

I agree :-)

rgommers pushed a commit to rgommers/numpy that referenced this pull request Apr 5, 2015
…n noise.

See numpygh-432 for details.  Motivation for adding this now is the discussion
on numpygh-5343.
@rgommers
Member

rgommers commented Apr 5, 2015

Okay, done in gh-5750.
