Implement test_*_like tests #18
Conversation
Great. Let me know if you have any questions or issues.
So currently this PR:
Good news is that I didn't catch any bugs in … You might be happy with just these items for the scope of this PR? Otherwise next week I can get started going over the whole test suite to see where you'd like nd-array support.

In regards to how you've structured strategies, I didn't want to touch … Interestingly, in … Also, would it be appropriate to rename …? Beyond this PR, please feel free to assign me any work.

I may be misremembering this, but if you wanted to use Black then obviously that could uglify some current conventions (e.g. the …). I would also like a small list of "house style" things (e.g. "test function signatures should mirror the respective Array API function signatures") in the README's contributing section. To this point, I personally would like to see the repo use explicit namespaces for both the top-level Hypothesis strategies and the stubbed array module, i.e. …
I wouldn't say there's a house style so much as just my own style, since I'm the only author of the code here so far. I would say you're worried about style too much. I don't really care that much about the way things look beyond very basic PEP 8 (which you seem to be following just fine). For instance, you use @given with keyword arguments and I generally use it with positional arguments, but either way is fine. I'm much more concerned about the behavior, like whether strategies properly test the full sample space.

The only somewhat stylistic thing I would say is that really long or duplicated strategies should be factored out into variables (in the hypothesis_helpers.py file) so that the @given decorators are easier to read. Other than that, pyflakes should pass (the linting test on CI). pyflakes only checks for logical errors, not stylistic ones.

The generate stubs build should start working if you merge master. It looks like NumPy is going to fail until I update upstream with some new functions, so maybe I should add some XFAILs for now so we can make sure everything else passes. I'm not really a huge fan of Black, so I'd prefer we don't use it.
I'm not exactly sure what you mean about the signatures thing. I agree about namespaces. I'm not sure why I did it separately for the creation function tests, but you can see the elementwise function tests do it that way. Actually, there is a little bit of a distinction which maybe isn't clear: some array functions are exported in array_helpers.py. These are supposed to be a subset of functions that need to be implemented for some tests to work, because they are themselves used to test other functions (the list of exported functions has unfortunately grown quite large, and we should probably try to cull it down).

After this PR, I think the next place to update stuff is in the elementwise tests. All those tests should use full arrays. Some of the tests aren't properly written with full arrays in mind, so they will need to be updated. You might want to start from the bottom of the file (ignore the tests that aren't written yet), as I was a little more considerate of this more recently. I don't expect any major issues, since any tests that already work on scalars should work on full arrays.
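To illustrate the point above about factoring long strategies out of the decorators, a minimal sketch of the pattern might look like the following. The strategy body and names here are invented for the example; xps is assumed to come from hypothesis.extra.array_api.make_strategies_namespace.

```python
# Sketch only: name a long strategy once in hypothesis_helpers.py so the
# @given decorators in the test files stay readable.
from hypothesis import given, strategies as st
from hypothesis.extra.array_api import make_strategies_namespace
from numpy import array_api as xp  # any conforming array module would do

xps = make_strategies_namespace(xp)

# hypothesis_helpers.py: define and name the strategy once...
finite_float32_arrays = xps.arrays(
    dtype=xp.float32,
    shape=xps.array_shapes(),
    elements=st.floats(-100, 100, width=32),
)

# test_elementwise_functions.py: ...so the decorator stays short.
@given(finite_float32_arrays, finite_float32_arrays)
def test_add(x1, x2):
    ...
```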
Ah, having spent a month getting headaches trying to adhere to "Black++", I didn't know what to do with myself 😅 Thanks for clarifying what you do expect!
Just to clarify, this goes back to our discussion on shared strategies vs …
I guess I see what you mean. You might have a better idea of what these "conventions" are than I do. Stuff like using draw() is something I would never really consider doing in the first place. So maybe if you think such a note in the README would be helpful you can write up something yourself and I can let you know if they all make sense.
I'm happy for another review. Will likely open a (draft) elementwise PR tomorrow. Didn't get round to using promotable dtypes in …
pytestmark = [pytest.mark.skipif(UNDEFINED_DTYPES, reason="undefined dtypes")]


def test_promotable_dtypes():
I've never thought about how to test strategies before. It's probably a good idea for some of the more complex strategies. Is this how strategies are typically tested in hypothesis itself?
Yep, although I'll clean it up now as it could just be a top-level @given() and assert in ... in the method body, instead of whatever set stuff I'm doing.
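For illustration only, such a strategy self-test might look roughly like the sketch below, assuming xp is the array module under test and promotable_dtypes is the composite from hypothesis_helpers.py; the expected set is just the integer dtypes, per the spec's promotion table.

```python
# Rough sketch of testing a strategy directly with a top-level @given and a
# plain assert.  promotable_dtypes is assumed to be the hypothesis_helpers.py
# composite, and xp the array module under test.
from hypothesis import given

@given(promotable_dtypes(xp.uint8))
def test_promotable_dtypes(dtype):
    # uint8 only promotes with the other integer dtypes in the spec's table
    assert dtype in (xp.uint8, xp.uint16, xp.uint32, xp.uint64,
                     xp.int8, xp.int16, xp.int32, xp.int64)
```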
@@ -129,8 +141,10 @@ def two_broadcastable_shapes(draw, shapes=shapes):

nonbroadcastable_ones_array_two_args = hypotheses_tuples(ones_arrays, ones_arrays)

# TODO: Generate general arrays here, rather than just scalars.
numeric_arrays = builds(full, just((1,)), floats())
numeric_arrays = xps.arrays(
Just a note here: we're going to want to split this across the different dtype categories. I guess this strategy is currently used in the special cases tests which are generated automatically, so we would have to also update the generate_stubs script to parse said categories from the spec. I think at present every function that has special cases accepts floating-point inputs which is why this works. This might end up being complicated, so it's something I'll take a look at myself at some point. We are also going to need to consider how things will work with complex dtypes once those start being added later this year.
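A very rough sketch of what that split could look like is below; the names are placeholders, and the real change would also need generate_stubs to know which dtype category each special-cased function accepts.

```python
# Placeholder sketch of per-category array strategies, not the suite's code.
from hypothesis.extra.array_api import make_strategies_namespace
from numpy import array_api as xp  # any conforming array module would do

xps = make_strategies_namespace(xp)
shapes = xps.array_shapes()  # stand-in for the suite's shared shapes helper

floating_arrays = xps.arrays(dtype=xps.floating_dtypes(), shape=shapes)
integer_arrays = xps.arrays(dtype=xps.integer_dtypes(), shape=shapes)
boolean_arrays = xps.arrays(dtype=xps.boolean_dtypes(), shape=shapes)
numeric_arrays = floating_arrays | integer_arrays  # union of strategies
```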
        dtype=shared_dtypes,
        shape=xps.array_shapes(),
    ),
    fill_value=shared_dtypes.flatmap(promotable_dtypes).flatmap(xps.from_dtype),
Ah, I didn't see that from_dtype was added. That is useful, and can probably replace some of the strategies in the hypothesis helpers (assuming it is implemented the same way).
Yep, it works just like hypothesis.extra.numpy.from_dtype, except you can't give abbreviated dtype names, e.g. xp.int8 and "int8" work, but "i8" does not. You could replace hypothesis_helpers.scalars with it, or redefine scalars like so:
def scalars(dtype_strat, finite=False):
    return dtype_strat.flatmap(lambda d: xps.from_dtype(d, allow_nan=not finite, allow_infinity=not finite))
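For instance, assuming the naming behaviour described above:

```python
# Assumed behaviour per the comment above: dtype objects and full dtype names
# are accepted by xps.from_dtype, abbreviated names are not.  xps and xp are
# as in the surrounding discussion; st is hypothesis.strategies.
from hypothesis import strategies as st

xps.from_dtype(xp.int8)    # works
xps.from_dtype("int8")     # works
# xps.from_dtype("i8")     # rejected: abbreviated names are not recognised

# the redefined scalars() could then be used like:
finite_float64s = scalars(st.just(xp.float64), finite=True)
```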
@given(
    a=xps.arrays(
        dtype=shared_dtypes,
        shape=xps.array_shapes(),
I think we need to keep the shapes strategy as the one defined in hypothesis helpers. This makes it easy to globally limit the array shape, e.g., if the default strategy generates shapes that use too much memory or are too slow. It also is necessary to work around this NumPy bug https://github.com/data-apis/array-api-tests/blob/master/array_api_tests/hypothesis_helpers.py#L102-L105. Actually, it would be much more convenient if the hypothesis array_shapes strategy had a max_size argument, which controlled the maximum prod(shape).
I resolved this, thanks.

Just noting I personally wouldn't want to implement a max_size arg in xps.array_shapes(), especially if one can just use .filter(lambda s: prod(s) <= max_size). First-party Hypothesis strategies try to avoid scenarios where a user can define "unsatisfiable" strategies like xps.array_shapes(min_dims=5, min_side=5, max_size=24).
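For reference, a size-capped shapes strategy along those lines could be sketched as follows; the helper name and cap value are made up for the example.

```python
# Sketch only: cap prod(shape) by filtering the first-party shapes strategy.
# MAX_ARRAY_SIZE and capped_shapes are made-up names.
from math import prod

from hypothesis.extra.array_api import make_strategies_namespace
from numpy import array_api as xp

xps = make_strategies_namespace(xp)

MAX_ARRAY_SIZE = 1024  # arbitrary global cap on the number of elements

capped_shapes = xps.array_shapes().filter(
    lambda shape: prod(shape) <= MAX_ARRAY_SIZE
)
```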
I added support for both promotable dtypes and broadcastable shapes in … Happy for a review again.
Looks like CI is going to be failing until I can get NumPy updated. Maybe I should push a temporary fix to xfail the known tests.
        dtype=shared_dtypes,
        shape=shapes,
    ),
    fill_value=promotable_dtypes(shared_dtypes).flatmap(xps.from_dtype),
So according to the spec, the dtype of a is irrelevant. When dtype=None, the dtype should be inferred from fill_value (this is the TODO in the test body). We should actually be using

@given(xps.arrays(dtype=dtypes, shape=shapes), scalars(shared_dtypes), one_of(none(), shared_dtypes))

similar to full() above.
Actually, this is the case for the other _like functions as well. If dtype=None, the dtype is the same as x. Otherwise the dtype is dtype, and the dtype of x is irrelevant. So in all these tests, the dtype of x (or a as we're calling it here, but we should change it; I like your idea of making the test signatures exactly match the signatures in the spec) should be drawn from all dtypes and be completely independent of the dtype parameter. So I think actually shared_optional_promotable_dtypes is not needed. We can just use arrays(dtype=dtypes) and dtype=one_of(none(), dtypes).
Oh I see now! I think I've fixed this now. Using the suggested @given(...) I got a falsifying example with (x=Array([], dtype=float32), fill_value=False, dtype=None), so I now generate fill_value from x.dtype if dtype == None.
For reference this is how I interpreted the spec:
>>> from numpy import array_api as xp
>>> x = xp.zeros(1, dtype=xp.uint8)
>>> xp.full_like(x, 24) # specified behaviour
Array([24], dtype=xp.uint8)
>>> xp.full_like(x, 4.2) # unspecified behaviour
Array([4], dtype=xp.uint8)
>>> xp.full_like(x, 256) # unspecified behaviour
Array([0], dtype=uint8)
Let me know if the updated test_full_like is good.
Here is the spec for full_like: https://data-apis.org/array-api/latest/API_specification/creation_functions.html#full-like-x-fill-value-dtype-none-device-none.

For the dtype parameter, it says "output array data type. If dtype is None, the output array data type must be inferred from fill_value (see full(shape, fill_value, *, dtype=None, device=None)). Default: None." The full spec says "output array data type. If dtype is None, the output array data type must be inferred from fill_value. If the fill value is an int, the output array data type must be the default integer data type. If the fill value is a float, the output array data type must be the default floating-point data type. If the fill value is a bool, the output array must have boolean data type. Default: None."
In both cases, the dtype of x is never mentioned. Only the value of dtype, or an inferred dtype from fill_value if dtype=None. So I think in your examples, the results should be
>>> full_like(x, 24) # dtype=None
array(24, dtype=int64) OR array(24, dtype=int32)
>>> full_like(x, 4.2) # dtype=None
array(4.2, dtype=float64) OR array(4.2, dtype=float32)
>>> full_like(x, 2**100) # dtype=None
UNSPECIFIED
The dtype of x doesn't matter here. It could be anything.
See https://data-apis.org/array-api/latest/API_specification/data_types.html#default-data-types. If dtype=None, the data type is the default floating-point dtype for float fill_value and the default integer dtype for int fill_value, which is either float32 or float64 and either int32 or int64, depending on the library. The only unspecified case is if the fill_value doesn't fit in the dtype. There's not really a clean way to infer the default dtype in the strategy, so we should just assume that the default is 32-bits for the purposes of generating fill_values within range when dtype=None.
So you're right that my @given isn't completely correct, because it doesn't account for unspecified fill values in the dtype=None case.
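To make the 32-bit assumption concrete, the dtype=None branch of the fill_value strategy could be sketched roughly as follows; the bounds and names are illustrative only, not the suite's actual strategy.

```python
# Illustrative sketch of fill_values for the dtype=None case, assuming the
# default dtypes are at least 32 bits so the drawn value always fits.
from hypothesis import strategies as st

none_dtype_fill_values = st.one_of(
    st.booleans(),
    st.integers(min_value=-(2**31), max_value=2**31 - 1),        # fits int32
    st.floats(allow_nan=False, allow_infinity=False, width=32),  # fits float32
)
```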
If you're still unsure about this, let's discuss it on Slack. I want to make sure the spec is clear about what behavior it requires and that we are matching that here.
Thanks for your explanation! I think I understand now, and have updated the test case accordingly.
So if I'm understanding correctly, this is currently a bug in NumPy:
>>> from numpy import array_api as xp
>>> x = xp.zeros(5, dtype=xp.int8)
>>> x
Array([0, 0, 0, 0, 0], dtype=int8)
>>> x_like = xp.full_like(x, fill_value=False)
>>> x_like
Array([0, 0, 0, 0, 0], dtype=int8)
i.e. x_like should be Array([False, False, False, False, False], dtype=bool).
Ah, you're right. This is why we have the test suite. I asked on Slack to double check, but it looks like this needs to be changed in NumPy. I'll make sure to include it in my next PR there.
    ),
    x2=xps.arrays(
        dtype=promotable_dtypes(shared_dtypes),
        shape=shared(two_broadcastable_shapes(), key="shape_pair").map(lambda pair: pair[1])
This is complicated enough that we should factor it out into a helper function. Maybe something like

@composite
def two_arrays(draw, dtypes=dtypes):
    # return two arrays with promotable dtypes drawn from dtypes and mutually broadcastable shapes

and we can then use

@given(*two_arrays(dtypes))  # replacing dtypes with the appropriate strategy

Also note that this should be two_mutually_broadcastable_shapes, not two_broadcastable_shapes. The latter only creates shapes that broadcast to the first shape. We should probably rename it to something clearer.
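For what it's worth, a fuller sketch of such a helper is below, under the assumption that xps exposes mutually_broadcastable_shapes (mirroring hypothesis.extra.numpy) and that mutually_promotable_dtype_pairs is the composite from hypothesis_helpers.py; it is not the repo's actual implementation.

```python
# Rough sketch only, not the repo's actual helper.
from hypothesis import given
from hypothesis.strategies import composite

@composite
def two_arrays(draw):
    # one pair of promotable dtypes plus one pair of mutually broadcastable shapes
    dtype1, dtype2 = draw(mutually_promotable_dtype_pairs())
    shape1, shape2 = draw(xps.mutually_broadcastable_shapes(num_shapes=2)).input_shapes
    x1 = draw(xps.arrays(dtype=dtype1, shape=shape1))
    x2 = draw(xps.arrays(dtype=dtype2, shape=shape2))
    return x1, x2

# the test then receives a single (x1, x2) tuple
@given(two_arrays())
def test_example(args):
    x1, x2 = args
    ...
```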
Good idea, thanks. I think I'll revert/drop all the test_elementwise_functions.py changes in this PR, as it'll get annoying working on my other one.

I updated this PR so there's now the shared_arrays<1|2> strategies (keeping in line with the matching signatures idea).
        and not isinstance(j, _UndefinedStub)]
    return dtype_pairs


def promotable_dtypes(dtype):
Drawing a dtype then using promotable_dtypes will have a different sampling distribution than mutually_promotable_dtype_pairs. I don't know if that matters. The asymmetry of promotable_dtypes does feel a little weird. I think we might want to instead try to use mutually_promotable_dtypes in the elementwise strategies so that they are clearer.
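A sketch of what that could look like in an elementwise test, assuming mutually_promotable_dtype_pairs and shapes come from hypothesis_helpers.py:

```python
# Sketch only: share one mutually-promotable dtype pair between both arrays,
# rather than drawing one dtype and then a promotable partner asymmetrically.
from hypothesis import given
from hypothesis.strategies import shared

dtype_pair = shared(mutually_promotable_dtype_pairs(), key="dtype_pair")

@given(
    x1=xps.arrays(dtype=dtype_pair.map(lambda p: p[0]), shape=shapes),
    x2=xps.arrays(dtype=dtype_pair.map(lambda p: p[1]), shape=shapes),
)
def test_add(x1, x2):
    ...
```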
Makes sense, fixed!
Just noting that if you could figure out a different place for dtype_mapping and promotion_table, you could remove the otherwise-unnecessary composite wrapper for mutually_promotable_dtype_pairs (see the comment I wrote).
The various data bits in the promotion test file should indeed be refactored and cleaned up. It's already used in quite a few places outside of that file. It's something I intend to do eventually, but if you see something that can help then go ahead.
I think I also may have made a mistake by using the string values of the dtypes as the main keys in the dictionaries rather than the dtype objects themselves, leading to a lot of code where you have multiple dict lookups to get at what you want. My only concern is that the dtype objects themselves might not actually be hashable, but I think this isn't a problem in practice, so we perhaps shouldn't worry about it.
I opened #21 for this.
I just noticed your strategies for the _like functions don't exactly match what the spec says, and are actually more complicated than they need to be. Let's go ahead and work on the elementwise stuff on a separate PR. That way we can just merge this straight away once the other issues are resolved.
I updated the xfail tests on the NumPy CI, so the CI should pass now if all the tests here are working properly.
I'm happy to see … 6.21.5 contains a useful patch for dtype stuff.
So I've updated the test_*_like tests. Introduced a …
assert all(equal(a, full((), ONE, **kwargs))), "ones() array did not equal 1"
assert out.dtype == dtype, f"{dtype=!s}, but ones() returned an array with dtype {out.dtype}"
assert out.shape == shape, "ones() produced an array with incorrect shape"
assert all(equal(out, full((), make_one(dtype), dtype=dtype))), "ones() array did not equal 1"
You used asarray() above but full() everywhere else. We're actually not very consistent about this in other places in the test suite, but it might be a good idea to pick only one. We want to keep the dependency graph of the tests simpler when possible, so that people can get meaningful test results if they've only implemented a subset of the API. It's not always possible, but it's something to strive for (and even the present tests don't do this as well as they could).
Thanks for explaining, makes sense. asarray() seems easier to use, so I think I prefer that as the de facto standard for these kinds of assertions.
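For example, the full()-based assertion above could be written with asarray() along these lines (a sketch reusing the helpers from that snippet):

```python
# Sketch of the same assertion written with asarray() instead of full().
assert all(equal(out, asarray(make_one(dtype), dtype=dtype))), "ones() array did not equal 1"
```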
This is looking good now. I only have a few minor comments. The CI is unfortunately bust again because I've started implementing linalg in NumPy. It might not work again until numpy/numpy#19980 is merged, so we'll just have to check things manually. I hadn't planned on bothering to update the XFAIL list because it will all be in flux until I finish up the linalg work, but if it's an issue for you we can.
Great, hopefully tomorrow I can bring this PR up to standard.

Yep, no problem, I'm always running pytest locally (it's just pyflakes I forgot, heh).
Fixed the remaining issues, I think. I didn't get round to updating the other creation tests with the …

Was trying to get an "assertions refactor" for … You might have encountered this already, but just noting that …
I actually was originally thinking the kwargs stuff should be a separate PR, so that's fine. I think this PR's already been open for too long, so let's just get it merged.
I haven't noticed that. I know that float shrinking is weird in general because the subset of floats that can break a function is often some fractal subset. I'm not too worried about it because we have to be very minimal about tests that rely on floating point behavior, since we shouldn't assume too much about implementation details about how libraries implement certain functions.
How does this work? The issue with the types is that the tests pick a dtype then a value for the dtype, but hypothesis doesn't know that the value picked may also have worked for a more minimal dtype. So you end up with results where the shrunk (dtype, value) is something like (uint32, 0), even though it could have been (int8, 0). I think this could happen even for a single dtype, although I haven't seen it yet. It's definitely worse when you draw two dtypes, because the strategies have multiple levels of dependencies on each other. If you know how to fix this that would be great.
Let's merge this. If there were any other things you wanted to do here, open new PRs. I don't want you to have PRs open for too long, because they will get stuck in review hell, and even though it's just me who's also working on this repo, it still opens the possibility of unnecessary merge conflicts if I start touching files you're working on (if it makes your life easier, I can switch to doing my own development in PRs). So let's focus on merging stuff in even if it's only partial, as long as it works. I also want to make sure you aren't ever blocked on my reviews, as I may not always be able to look at things in a timely fashion.
@asmeurer Oops, forgot to respond to this. So as you might have gathered, for list-y strategies Hypothesis just finds the first element (in this case, the tuple containing the dtype pair) that fails. Once we can get a test for minimisation behaviour in, it'll be easier to experiment with this (low priority for now, I imagine).
This PR introduces use of hypothesis.extra.array_api to generate (multi-dimensional) arrays for some existing tests, and to enable writing some new ones. I made a rudimentary proof-of-concept in #17, but this should be a serious attempt (and my main focus for the next two weeks). For now my priority is getting these tests up and running, and then iteratively I can follow the styles and idioms of the compliance suite properly.