High memory peak when MaskedArray creates a mask #6732

Closed
saimn opened this issue Nov 26, 2015 · 5 comments

@saimn
Contributor

saimn commented Nov 26, 2015

Hi,
We work with huge masked arrays and noticed that there is a high memory peak when creating a masked array with mask=False. In the example below you can see that with MaskedArray(mask=np.ma.nomask, ...) the memory does not increase and the masked array uses a view of the data array, but with MaskedArray(mask=False, ...) there is a memory peak of ~15 times the size of the boolean mask that is created!

In [1]: import numpy as np

In [2]: import ipython_memory_usage.ipython_memory_usage as imu

In [3]: a = np.zeros((10000, 10000))

In [4]: imu.start_watching_memory()
In [4] used 0.0039 MiB RAM in 16.27s, peaked 0.00 MiB above current, total RAM usage 34.67 MiB

In [5]: a[:] = 1
In [5] used 763.0234 MiB RAM in 0.20s, peaked 0.00 MiB above current, total RAM usage 797.70 MiB

In [6]: m = np.ma.MaskedArray(data=a, mask=np.ma.nomask, dtype=float, copy=False)
In [6] used 0.0000 MiB RAM in 0.10s, peaked 0.00 MiB above current, total RAM usage 797.70 MiB

In [7]: m = np.ma.MaskedArray(data=a, mask=False, dtype=float, copy=False)
In [7] used 95.4375 MiB RAM in 10.23s, peaked 1525.77 MiB above current, total RAM usage 893.13 MiB
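
For reference, the difference can also be checked without a memory profiler; a minimal sketch, assuming a numpy version where mask=False allocates a full-size boolean mask:

import numpy as np

a = np.zeros((1000, 1000))

# mask=np.ma.nomask: no per-element mask array is allocated, the data is a view
m = np.ma.MaskedArray(data=a, mask=np.ma.nomask, copy=False)
print(m.mask)                          # False, i.e. no per-element mask array
print(np.may_share_memory(m.data, a))  # True: m wraps a view of `a`

# mask=False: a boolean mask with the full shape of the data is created
m2 = np.ma.MaskedArray(data=a, mask=False, copy=False)
print(m2.mask.shape, m2.mask.dtype)    # (1000, 1000) bool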

The issue comes from the line mask = np.resize(mask, _data.shape) in https://github.com/numpy/numpy/blob/master/numpy/ma/core.py#L2770, where mask is just array([False], dtype=bool). np.resize then calls concatenate((a,)*n_copies), which causes the memory peak (https://github.com/numpy/numpy/blob/master/numpy/core/fromnumeric.py#L1149).
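
For illustration, a minimal sketch of the two allocation strategies (the np.resize line mimics what the scalar-mask code path effectively does, not the exact core.py code):

import numpy as np

shape = (10000, 10000)  # as in the session above; the boolean mask itself is ~95 MiB

# What the reported code path effectively does: blow array([False]) up to the
# data shape via np.resize, which goes through concatenate and creates large
# temporaries (the ~15x peak reported above).
mask_resized = np.resize(np.array([False]), shape)

# Direct allocation only needs the memory of the final mask.
mask_direct = np.zeros(shape, dtype=bool)

assert mask_resized.dtype == mask_direct.dtype
assert np.array_equal(mask_resized, mask_direct)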

Without knowing the reasons for this implementation, I wonder why the mask is not simply created with np.zeros(..., dtype=bool) (or np.ones, depending on the value of the mask parameter).
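
For what it's worth, a hypothetical helper sketching that suggestion (the name build_mask and the exact special-casing are illustrative, not numpy's API); it mirrors the approach described in the commit referenced further down:

import numpy as np
from numpy.ma import MaskType, nomask  # MaskType is np.bool_

def build_mask(data, mask):
    # Hypothetical sketch: avoid np.resize when the requested mask is just a
    # scalar True/False, and allocate the full boolean mask directly instead.
    if mask is nomask:
        return nomask
    if mask is False:
        return np.zeros(data.shape, dtype=MaskType)
    if mask is True:
        return np.ones(data.shape, dtype=MaskType)
    # An explicit mask array still has to be brought to the data shape.
    return np.resize(np.asarray(mask, dtype=MaskType), data.shape)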

@charris
Member

charris commented Nov 26, 2015

Just for reference, could you post the numpy version as well?

@saimn
Contributor Author

saimn commented Nov 26, 2015

Yep, sorry, I'm using 1.10.1.

@charris
Member

charris commented Nov 26, 2015

The reason I asked was that 1.10.1 has a known problem with structured/record arrays that shows up as long run times and huge memory consumption. You don't mention whether structured/record arrays are being used in addition to masked arrays, so it might not be the same problem. There was also at least one problem with masked arrays. Could you try 1.10.2rc1 by any chance? It would also be good to know whether numpy 1.9 was better, to determine if this is a regression or a long-standing problem.

@saimn
Contributor Author

saimn commented Nov 26, 2015

I don't use record arrays here (but I know about this issue with astropy.io.fits). I have just forked and pulled the master branch and I get the same behavior.
From a git blame, it seems that this code has hardly changed since its origin (but I didn't check resize and concatenate), so I guess the issue has always been present. It may have gone unnoticed since by default the mask is set to nomask, and then the peak is only temporary. We have also been using this for months without noticing :/

saimn added a commit to saimn/numpy that referenced this issue Nov 26, 2015
…se (numpy#6732).

When the `mask` parameter is set to True or False, directly create an `ndarray` of booleans instead of going through `np.resize`, which was causing a memory peak of ~15 times the size of the mask.
colbych added a commit to colbych/numpy that referenced this issue Dec 6, 2015
* 'master' of git://github.com/numpy/numpy: (24 commits)
  BENCH: allow benchmark suite to run on Python 3
  TST: test f2py, fallback on f2py2.7 etc., fixes numpy#6718
  BUG: link cblas library if cblas is detected
  BUG/TST: Fix for numpy#6724, make numpy.ma.mvoid consistent with numpy.void
  BUG/TST: Fix numpy#6760 by correctly describing mask on nested subdtypes
  BUG: resizing empty array with complex dtype failed
  DOC: Add changelog for numpy#6734 and numpy#6748.
  Use integer division to avoid casting to int.
  Allow to change the maximum width with a class variable.
  Add some tests for mask creation with mask=True or False.
  Test that the mask dtype is MaskType before using np.zeros/ones
  BUG/TST: Fix for numpy#6729
  ENH: Avoid memory peak and useless computations when printing a MaskedArray.
  ENH: Avoid memory peak when creating a MaskedArray with mask=True/False (numpy#6732).
  BUG: Readd fallback CBLAS detection on linux.
  TST: Fix travis-ci test for numpy wheels.
  MAINT: Localize variables only used with relaxed stride checking.
  BUG: Fix for numpy#6719
  MAINT: enable Werror=vla in travis
  BUG: Include relevant files from numpy/linalg/lapack_lite in sdist.
  ...
@saimn
Contributor Author

saimn commented Feb 4, 2016

Forgot to close this one (since #6734 was merged).

saimn closed this as completed Feb 4, 2016
jaimefrio pushed a commit to jaimefrio/numpy that referenced this issue Mar 22, 2016
…se (numpy#6732).

When the `mask` parameter is set to True or False, directly create an `ndarray` of booleans instead of going through `np.resize`, which was causing a memory peak of ~15 times the size of the mask.