Skip to content

Missingdata - add NA mask, multitude of other improvements #141

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 167 commits into from

Conversation

mwiebe
Copy link
Member

@mwiebe mwiebe commented Aug 19, 2011

Normally I would have preferred to do lots of smaller merges as I was developing this, but after the extensive design discussion phase, I thought it would be better to first write something that's in good, testable shape. Read the 2.0.0 release notes file for some details about the changes.

I believe this set of patches is in good shape to get merged, so needs the requisite testing and review to ensure things work well on a wide variety of platforms and with other dependent software.

@charris
Copy link
Member

charris commented Aug 19, 2011

Fails to build here:

numpy/core/src/multiarray/nditer_constr.c:2373:1: error: redefinition of ‘intp_abs’

@mwiebe
Copy link
Member Author

mwiebe commented Aug 19, 2011

Ah yes, single file build mode strikes again.

@mwiebe
Copy link
Member Author

mwiebe commented Aug 19, 2011

That should fix it.

@rgommers
Copy link
Member

Description of the new functionality sounds promising!

I noticed one problem after a first try: some test files have the executable bit set and are therefore not run. With np.test(extra_argv=['--exe']) I get 3541 tests run vs. 3072 for np.test(). The former number still seems to be low, since in 1.6.1 there are 3542.

@rgommers
Copy link
Member

Thanks for fixing all the test noise (deprecation warnings) as well. You also masked the message about putmask however, I thought that deprecation would be undone?

>>>     >>> x = np.arange(6).reshape(2, 3)
>>>     >>> np.putmask(x, x>2, x**2)
/Library/Frameworks/Python.framework/Versions/2.6/bin/ipython:1: DeprecationWarning: 
putmask has been deprecated. Use copyto with 'where' as the mask instead

@charris
Copy link
Member

charris commented Aug 19, 2011

Python 3 build still fails:

In file included from numpy/core/src/multiarray/multiarraymodule_onefile.c:53:0:
numpy/core/src/multiarray/na_singleton.c: At top level:
numpy/core/src/multiarray/na_singleton.c:708:25: error: ‘Py_TPFLAGS_CHECKTYPES’ undeclared here (not in a function)

Looks like the python 3 change in flags.

@charris
Copy link
Member

charris commented Aug 19, 2011

Ralph, I see the same number of tests both ways. I applied Mark's branch as patches, so I expect previous permissions to remain the same. How did you get the branch? The patch approach is nice as it applies on top of the current master and removes trailing whitespace.

@rgommers
Copy link
Member

I just fetched and checked it out from Mark's git repo. Then

$ git clean -xdf
$ python setup.py build_ext -i

I just rebased it on master and tried again, same result. What number of tests do you get?

@rgommers
Copy link
Member

Never mind my putmask comment; your branch is just older than the putmask deprecation.

@charris
Copy link
Member

charris commented Aug 19, 2011

Ralph, I see

Ran 3420 tests in 21.679s

OK (KNOWNFAIL=3, SKIP=1)

The way I handle pull requests is like:

 1002  git co -b pull-141
 1003  curl https://github.com/numpy/numpy/pull/141.patch | git am --whitespace=strip

And someday I'll even figure out how to make a git macro for that ;)

@mwiebe
Copy link
Member Author

mwiebe commented Aug 19, 2011

There is trickiness with the number of tests.

There are points where I've renamed the test .py file (test_iterator.py to test_nditer.py), and if you don't wipe out the destination install path, the installer doesn't remove test_iterator.py, giving an inflated number of tests. That would increase the number of tests, though.

Whenever I've encountered yield-based tests, I've perhaps cursed a little bit, then changed them to not use yield so it's actually possible to track down where the failure is happening and what went wrong. This typically reduces the number of tests.

I don't have an accounting for the number of tests, but it probably needs a closer look to see what precisely is causing the discrepencies.

@rgommers
Copy link
Member

I've cleaned out my site-packages and reinstalled directly from the released installer, to be sure I don't have leftover files hanging around. The correct number for 1.6.1 is 3541 tests (on OS X).

The ~500 tests missing is due to the executable bit being set. After that's fixed we can look for smaller discrepancies if necessary.

@mwiebe
Copy link
Member Author

mwiebe commented Aug 19, 2011

I've just done the following to try and find the files you refer to:

$ find . -iname test_*.py | xargs ls -l > test_files.txt

and all 201 of them are '-rw-rw-r--', no execute bit set anywhere. I also get 3064 tests on my machines (Fedora 15 64-bit)

@charris
Copy link
Member

charris commented Aug 19, 2011

Hmm, I'm also running Fedora 15 64 bit and git status doesn't show any untracked test files. Might be because the missing data branch is rebased to current master.

I don't see any executable test files either.

@charris
Copy link
Member

charris commented Aug 19, 2011

If I run numpy.test(label='full') I get

Ran 3439 tests in 40.484s

FAILED (KNOWNFAIL=3, SKIP=1, failures=1)

The failure is one of the f2py tests.

@charris
Copy link
Member

charris commented Aug 19, 2011

Just in case: nose-1.0.0-py2.7.egg

@rgommers
Copy link
Member

It seems to be even worse on some of the buildbots: http://buildbot.scipy.org/builders/Windows_XP_x86/builds/1109/steps/shell_1/logs/stdio

I've got 3619 tests run for commit c649139 (before the start of the deprecate_array_access_fields branch) and 3373 when that branch was merged in again. So 250 tests got lost on the way. I'll open a ticket and test this more thoroughly in some automated way. The problem is not (just) the missingdata branch anyway.

@mwiebe
Copy link
Member Author

mwiebe commented Aug 19, 2011

I have a feeling most or all of this is accounted for where I got rid of 'yield' in some of the tests. When a 'yield'-based test fails, its virtually impossible to track down where the failure is coming from, so in order to debug things I converted such tests to use asserts instead. Each 'yield' counts as a test, whereas each assert does not.

@rgommers
Copy link
Member

You could be right. Still, this isn't the first time that the test count is inconsistent. We've had bugs with executable files before, and with yield tests in classes, etc. It's not hard to write a script that goes back in time and checks per X commits what the number of executed tests is.

@charris
Copy link
Member

charris commented Aug 19, 2011

The NEP still has references to 'hasmaskna' the need to be updated to 'maskna', so you might want to update that separately from the rest of this branch.

I think the common methods of array creation -- ones, zeros, ones_like, etc. -- need to get the 'maskna' keyword. That will make it much easier for folks to play and shouldn't be a lot of work.

It would also help rationalize expectations to note that fancy indexing doesn't work for np.NA assignment at this point, boolean indexing only.

So far I haven't run into any problems with scipy or ipython .11.dev using the missing data branch.

@mwiebe
Copy link
Member Author

mwiebe commented Aug 20, 2011

I've updated the NEP to remove 'hasmaskna'. I've added ones and ones_like. I guess that commit is still in the branch though...

empty and zeros were done, I just postponed ones_like because it was a ufunc, and adding maskna as a parameter to ufuncs didn't seem good. I've renamed the ones_like ufunc to _ones_like, and made a function matching the others. I'll take a look at some others, like random.rand, etc.

I mentioned the fancy indexing thing in the release notes, but I guess I could point it out in a future email as well.

Great to hear it's working for you!

@charris
Copy link
Member

charris commented Aug 20, 2011

Re: missing tests, I updated to current master and lost about 350 tests. Git bisect fingers 4386275 as the culprit.

@charris
Copy link
Member

charris commented Aug 20, 2011

Reverting the commit fixes the problem, but I don't immediately see why. I assume something is failing during the build without derailing the whole thing.

@87
Copy link
Contributor

87 commented Aug 22, 2011

Hi, I noticed a subtle difference between numpy 1.6.1 and the missingdata branch.

a = np.arange(10)
b = np.arange(3).astype(bool)

1.6.1:
a[b] --> [0,1,2]

2.0.0.dev
a[b] --> ValueError: operands could not be broadcast together with shapes (10) (3)

The latest behaviour seems more consistent, but I've found at least one piece of code that relied on the former..

@mwiebe
Copy link
Member Author

mwiebe commented Aug 22, 2011

There's an email thread I started about the boolean broadcasting here:

http://mail.scipy.org/pipermail/numpy-discussion/2011-July/057870.html

but it didn't generated any discussion on the issue. I've added the specific point you bring up to the release notes. If it's possible, I'd like to keep the change in, but if it's really necessary based on discussion during the alpha/early beta release cycle, a hack can be introduced to make the behavior match.

What's the code you've found that relies on the former behavior?

@87
Copy link
Contributor

87 commented Aug 23, 2011

Ah, a specific note on the issue would be the best option, I think, as it adds more consistent behavior after all.

This line generated the error in PyTables, but it is easily fixable..:
https://github.com/PyTables/PyTables/blob/master/tables/tableExtension.pyx#L922

@charris
Copy link
Member

charris commented Aug 26, 2011

The github patch no longer applies cleanly to master. I'm beginning to have doubts about github patches... Anyway, it might be worth rebasing this.

@mwiebe
Copy link
Member Author

mwiebe commented Aug 26, 2011

I've rebased against master - the conflict was about the void dtype crash that worked in this branch and then got fixed in master later.

@charris
Copy link
Member

charris commented Aug 26, 2011

The file boolean_ops.c needs to be included in numpy/core/src/multiarray/multiarraymodule_onefile.c.

@charris
Copy link
Member

charris commented Aug 26, 2011

The imports in numpy/core/_methods.py need fixes for python3

from numpy.core import multiarray as mu
from numpy.core import umath as um
from numpy.core.numeric import asanyarray

@charris
Copy link
Member

charris commented Aug 26, 2011

Curiously, I needed to reinstall matplotlib on account of the import changes. I'm not sure what's up with that, so this note is just a heads up in case someone else runs into that.

mwiebe added 5 commits August 26, 2011 10:22
…safer for legacy code

These functions now reject inputs with NA, and there are alternative
functions PyArray_AllowNAConverter and PyArray_OutputAllowNAConverter
that functions should use when they intend to support NA.
@charris
Copy link
Member

charris commented Aug 27, 2011

It looks about time to merge this so it gets further testing and as preparation for a long release cycle. Once this is in I'll take a look at supporting NA in the various numpy fitting routines.

@mwiebe
Copy link
Member Author

mwiebe commented Aug 27, 2011

Sounds good, is there anything you guys would like me to do/check before this get merged?

@charris
Copy link
Member

charris commented Aug 27, 2011

It would be useful to have your thoughts on the way forward following the merge/release, but that is separate from the merge.

@mwiebe
Copy link
Member Author

mwiebe commented Aug 27, 2011

My basic thoughts are:

  • Similar to the nditer change in 1.6, the datetime + NA masks constitute a significant update to core-level functionality of NumPy. We need to be just as cautious in the release schedule for this version as we were for 1.6.
  • I'd suggest going with 1.7 as the release version, because the ABI is still compatible. I'm testing it with the SciPy 0.9.0 in Fedora 15 built against 1.5.1, for example. Just as in 1.6, compatibility tests are in order. I'm a fan of announcing that 1.7 will be the last version to support Python 2.4.
  • A lesson of the post-1.6 release period is that some of the changes we discussed a fair bit on the list didn't get proper mention in the release notes. We need to ensure that we explicitly mention these things. The feeling I got about switching to 'same_kind' default casting for ufuncs was that people generally like a change like this which will help catch programming errors, but want to be explicitly told about it.
  • I'm inclined to also change default assignment casting to 'same_kind' as well, what do people think about giving that a try? Some of the changes, like introducing new PyArray_Assign* functions, should make trying out such a change relatively easy. If we do this, it should probably be some time after the merge of this branch, so as to isolate testing of it.
  • After a 1.7 release, I think we want to go for a 1.8 to finish up an ABI/API deprecation phase. I did a big hunk of work creating a mechanism to allow hiding of the PyArrayObject details from the ABI, and the same thing needs to happen to at least PyArray_Descr and PyUFuncObject in the 1.x series before a 2.0 can be done which would throw out the deprecated access and fully hide the field details. (Of course, for people who want faster access, a NPY_DISABLE_ABI_COMPATIBILITY macro could be supported which would only allow loading of modules with that exact build of NumPy, i.e. it would enforce both the major and minor API versions)
  • There are still a bunch of NumPy features not supporting NA masks yet. Hopefully the C-API NA mask documentation is good enough to help more people than just me work on these things! I don't think these things should preclude a release, because the functionality is already useful, but it would be nice to have a continued trajectory towards full integration and support of the NA mask.
  • The next big thing for NA masks is supporting struct dtypes. I don't have any immediate plans to do that myself, but if someone skilled in C steps up to do it I'll be glad to assist them with it. After that, probably multi-NA, then the bitpattern NA dtypes.

This list looks pretty long to me, and I think the topics deserve multiple discussions on the mailing list over time. These are just my thoughts at the present, after all. ;)

@charris
Copy link
Member

charris commented Aug 27, 2011

That's a pretty long list ;) I think you should post it on the mailing list.

If no one complains, I'll merge this this evening.

@mwiebe
Copy link
Member Author

mwiebe commented Aug 27, 2011

I'll turn that into a "roadmap" email in the next few days.

@charris
Copy link
Member

charris commented Aug 28, 2011

Merged aa55ba7..9ecd91b.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants