High memory peak when MaskedArray creates a mask #6732

saimn · 2015-11-26T11:07:18Z

Hi,
We work with huge masked arrays and noticed a high memory peak when creating a masked array with mask=False. In the example below, with MaskedArray(mask=np.ma.nomask, ...) the memory does not increase and the masked array uses a view of the data array. But with MaskedArray(mask=False, ...) there is a memory peak of ~15 times the size of the boolean mask that is created!

In [1]: import numpy as np

In [2]: import ipython_memory_usage.ipython_memory_usage as imu

In [3]: a = np.zeros((10000, 10000))

In [4]: imu.start_watching_memory()
In [4] used 0.0039 MiB RAM in 16.27s, peaked 0.00 MiB above current, total RAM usage 34.67 MiB

In [5]: a[:] = 1
In [5] used 763.0234 MiB RAM in 0.20s, peaked 0.00 MiB above current, total RAM usage 797.70 MiB

In [6]: m = np.ma.MaskedArray(data=a, mask=np.ma.nomask, dtype=float, copy=False)
In [6] used 0.0000 MiB RAM in 0.10s, peaked 0.00 MiB above current, total RAM usage 797.70 MiB

In [7]: m = np.ma.MaskedArray(data=a, mask=False, dtype=float, copy=False)
In [7] used 95.4375 MiB RAM in 10.23s, peaked 1525.77 MiB above current, total RAM usage 893.13 MiB
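
The difference between the two calls can also be checked without a memory profiler. This is a minimal sketch on a small array (the names `m_nomask` and `m_false` are only for illustration), assuming a NumPy version where `copy=False` on a float array needs no conversion:

```python
import numpy as np

a = np.ones((100, 100))

# mask=nomask: no mask array is allocated, and the data is a view of `a`.
m_nomask = np.ma.MaskedArray(data=a, mask=np.ma.nomask, dtype=float, copy=False)
print(m_nomask.mask is np.ma.nomask)      # no full-size mask exists
print(np.shares_memory(m_nomask.data, a))

# mask=False: a full boolean mask with the shape of the data is allocated.
m_false = np.ma.MaskedArray(data=a, mask=False, dtype=float, copy=False)
print(m_false.mask.shape, m_false.mask.dtype)
```

So the data is shared in both cases; only the second call has to build a mask as large as the data, and that construction is where the peak shows up.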

The issue comes from the line mask = np.resize(mask, _data.shape) in https://github.com/numpy/numpy/blob/master/numpy/ma/core.py#L2770, where mask is just array([False], dtype=bool). np.resize then calls concatenate((a,)*n_copies), which causes the memory peak (https://github.com/numpy/numpy/blob/master/numpy/core/fromnumeric.py#L1149).
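
To see the effect in isolation, here is a rough sketch that compares the peak allocation of the two ways of building the mask, using the standard-library `tracemalloc` module. The helper `peak_bytes` is hypothetical, the shape is kept small so it runs quickly, and the numbers depend on the NumPy version: the sketch assumes `np.resize` still tiles a one-element array through `concatenate`, and that NumPy reports its buffers to `tracemalloc`.

```python
import tracemalloc
import numpy as np

def peak_bytes(make_mask):
    """Run make_mask() and return (result, peak traced bytes while it ran)."""
    tracemalloc.start()
    try:
        result = make_mask()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak

shape = (500, 500)
seed = np.array([False])  # what the mask looks like before np.resize

# Tiling path: one tuple entry per element of the target shape, then concatenate.
tiled, peak_resize = peak_bytes(lambda: np.resize(seed, shape))
# Direct path: allocate the boolean array once.
direct, peak_zeros = peak_bytes(lambda: np.zeros(shape, dtype=bool))

print("np.resize peak: %d bytes" % peak_resize)
print("np.zeros  peak: %d bytes" % peak_zeros)
```

Both calls return the same all-False array; only the temporary objects created along the way differ.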

Without knowing the reasons for this implementation, I wonder why the mask is not simply created with np.zeros(..., dtype=bool) (or np.ones, depending on the value of the mask parameter).
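
As a sketch of that idea (not the actual code in numpy/ma/core.py; `full_mask` is a hypothetical helper name), the scalar case could be special-cased like this:

```python
import numpy as np

def full_mask(shape, value):
    """Hypothetical helper: build a full boolean mask for a scalar True/False
    without tiling a one-element array."""
    if value is True:
        return np.ones(shape, dtype=bool)
    if value is False:
        return np.zeros(shape, dtype=bool)
    raise TypeError("full_mask only handles the scalar values True and False")

mask = full_mask((1000, 1000), False)
print(mask.shape, mask.dtype, mask.any())
```

Any other mask value (an array, a list, a structured mask) would still have to go through the existing code path; this only covers the two scalar cases.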


charris · 2015-11-26T13:41:57Z

Just for reference, could you post the numpy version as well?

saimn · 2015-11-26T13:47:26Z

Yep sorry, I'm using 1.10.1

charris · 2015-11-26T14:16:00Z

The reason I asked was that 1.10.1 has a known problem with structured/record arrays that shows up as long run times and huge memory consumption. You don't mention whether structured/record arrays are being used in addition to masked arrays, so it might not be the same problem. There was also at least one problem with masked arrays. Could you try 1.10.2rc1 by any chance? It would also be good to know if numpy 1.9 was better so as to determine if this is a regression or a long standing problem.

saimn · 2015-11-26T14:29:20Z

I don't use record arrays here (but I know about this issue with astropy.io.fits). I have just forked and pulled the master branch and I get the same behavior.
According to git blame, this code has barely changed since it was first written (though I didn't check resize and concatenate), so I guess the problem has always been present. It may have gone unnoticed because the mask defaults to nomask, and because the peak is only temporary. We have also been using this for months without noticing :/

…se (numpy#6732). When the `mask` parameter is set to True or False, directly create a boolean `ndarray` instead of going through `np.resize`, which was causing a memory peak of ~15 times the size of the mask.

* 'master' of git://github.com/numpy/numpy: (24 commits)
  BENCH: allow benchmark suite to run on Python 3
  TST: test f2py, fallback on f2py2.7 etc., fixes numpy#6718
  BUG: link cblas library if cblas is detected
  BUG/TST: Fix for numpy#6724, make numpy.ma.mvoid consistent with numpy.void
  BUG/TST: Fix numpy#6760 by correctly describing mask on nested subdtypes
  BUG: resizing empty array with complex dtype failed
  DOC: Add changelog for numpy#6734 and numpy#6748.
  Use integer division to avoid casting to int.
  Allow to change the maximum width with a class variable.
  Add some tests for mask creation with mask=True or False.
  Test that the mask dtype if MaskType before using np.zeros/ones
  BUG/TST: Fix for numpy#6729
  ENH: Avoid memory peak and useless computations when printing a MaskedArray.
  ENH: Avoid memory peak when creating a MaskedArray with mask=True/False (numpy#6732).
  BUG: Readd fallback CBLAS detection on linux.
  TST: Fix travis-ci test for numpy wheels.
  MAINT: Localize variables only used with relaxed stride checking.
  BUG: Fix for numpy#6719
  MAINT: enable Werror=vla in travis
  BUG: Include relevant files from numpy/linalg/lapack_lite in sdist.
  ...

saimn · 2016-02-04T09:25:28Z

Forgot to close this one (since #6734 was merged).

This was referenced Nov 26, 2015

ENH: Avoid memory peak when creating a MaskedArray with mask=True/False. #6734

Merged

BUG: str and repr special methods blow up memory usage #3544

Closed

charris added 00 - Bug component: numpy.ma masked arrays labels Nov 26, 2015

saimn closed this as completed Feb 4, 2016

ianozsvald mentioned this issue Mar 6, 2018

Upload to PyPI ianozsvald/ipython_memory_usage#4

Closed

ianozsvald mentioned this issue May 14, 2021

Collect pandas (and other) weird memory usage cases ianozsvald/ipython_memory_usage#30

Open
