np.memmap.__getitem__ leaks slice #10066

Closed
@f0k

Description

When I index an np.memmap instance with a slice, the slice object ends up with too high a refcount and is never collected (numpy 1.13.3, Python 2.7.12, Ubuntu 16.04).

Demonstration (uses pympler to collect all live objects, installable with pip install pympler; assumes /etc/fstab is readable, but any file will do):

import sys
import numpy as np
from collections import deque
from pympler import muppy

print(sys.version)
print(np.__version__)

# Baseline: number of slice objects alive before any indexing.
print('Before: %d live slices' %
      (len([o for o in muppy.get_objects() if isinstance(o, slice)])))

# Control: slicing a plain ndarray 1000 times.
# deque(..., maxlen=0) consumes the generator and discards every result.
x = np.empty(100)
deque((x[:3] for _ in range(1000)), maxlen=0)
print('1000 times __getitem__(slice) on ndarray: %d live slices' %
      (len([o for o in muppy.get_objects() if isinstance(o, slice)])))

# Slicing a memmap 1000 times with a plain slice.
x = np.memmap('/etc/fstab', dtype=np.int8, mode='r')
deque((x[:5] for _ in range(1000)), maxlen=0)
print('1000 times __getitem__(slice) on memmap: %d live slices' %
      (len([o for o in muppy.get_objects() if isinstance(o, slice)])))

# Same memmap, but indexed with a one-element tuple (note the trailing comma).
deque((x[:7,] for _ in range(1000)), maxlen=0)
print('1000 times __getitem__(tuple) on memmap: %d live slices' %
      (len([o for o in muppy.get_objects() if isinstance(o, slice)])))

# Show the most recently found slice objects.
print('some of these slices:')
print([o for o in muppy.get_objects() if isinstance(o, slice)][-3:])

Output:

[GCC 5.4.0 20160609]
1.13.3
Before: 4 live slices
1000 times __getitem__(slice) on ndarray: 4 live slices
1000 times __getitem__(slice) on memmap: 1004 live slices
1000 times __getitem__(tuple) on memmap: 1004 live slices
some of these slices:
[slice(0, 5, None), slice(0, 5, None), slice(0, 5, None)]

So every x[:5] on an np.memmap leaks a slice(0, 5, None) (the count rises from 4 to 1004), while x[:7,] leaks nothing (the count stays at 1004). We can inspect the refcount and referrers of one leaked slice further:

>>> s = [o for o in muppy.get_objects() if isinstance(o, slice)][-1]
>>> import gc
>>> sys.getrefcount(s)
4
>>> len(gc.get_referrers(s))
2

My colleague @superbock ran into this before and said it went away in Python 3; I haven't tested this yet. He also found a workaround: adding a comma (x[:7,] instead of x[:7]) makes the index a tuple, which takes a different code path.
Note that this leak can become a major problem: I'm using memmaps in my iteration code for training neural networks, and the leak ate up all available memory within 6 hours and had my process killed.
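
The comma workaround can be wrapped in a small helper so that iteration code does not need a trailing comma at every call site. This is only a minimal sketch: take_range is a hypothetical name, and the scratch file stands in for the real data file. It relies solely on the observation above that tuple indexing did not leak in this setup, and it only checks that the tuple form returns the same data as the plain slice.

```python
import os
import tempfile

import numpy as np


def take_range(arr, start, stop):
    """Return arr[start:stop], indexed via a one-element tuple.

    Hypothetical helper: the trailing comma turns the index into a tuple,
    the form that did not leak slices in the demonstration above.
    """
    return arr[start:stop,]


# Scratch file so the sketch does not depend on /etc/fstab.
fd, path = tempfile.mkstemp()
os.write(fd, bytes(bytearray(range(100))))
os.close(fd)

x = np.memmap(path, dtype=np.int8, mode='r')
chunk = take_range(x, 0, 7)
# The tuple-indexed result should match the plain slice element for element.
same = bool(np.array_equal(chunk, x[0:7]))
print(same)

# Drop all references to the mapping before removing the scratch file.
del chunk, x
os.remove(path)
```

Whether this avoids the leak on a given numpy/Python combination should be confirmed with the live-slice count from the demonstration script.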
