malloc issue with polyfit #12230


Closed
astrofrog opened this issue Oct 20, 2018 · 24 comments

@astrofrog
Contributor

astrofrog commented Oct 20, 2018

I tried installing the latest developer version of numpy today and am running into an error related to memory allocation when running polyfit.

Reproducing code example:

>>> np.polyfit([1,2,3], [1.2, 3.3, 4.5],  deg=1)
python(87909,0x7fffaed02380) malloc: *** mach_vm_map(size=18446744071717433344) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
init_dgelsd failed init
__main__:1: RankWarning: Polyfit may be poorly conditioned
array([5.50113414e-313, 1.34764833e-312])
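
For context (not part of the original report): `polyfit` reduces to a least-squares solve via `np.linalg.lstsq`, which wraps LAPACK's `dgelsd`, so this smaller call exercises the same code path. A minimal sketch:

```python
import numpy as np

# polyfit(deg=1) builds a Vandermonde matrix and solves the
# least-squares problem with np.linalg.lstsq, which wraps dgelsd.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.2, 3.3, 4.5])
A = np.vander(x, 2)  # columns [x, 1] for a degree-1 fit

coeffs, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
print(coeffs)  # [1.65, -0.3] on a working LAPACK
```

On a broken LAPACK this produces the same garbage values as the `polyfit` call above.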

Numpy/Python version information:

Platform: MacOS 10.13.6

Python:

Python 3.7.0 (default, Jun 28 2018, 07:39:16) 
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin

Numpy:

>>> np.__git_revision__
'2be154408c6a088a296da3cad274473ea7d03317'


@charris
Member

charris commented Oct 20, 2018

I'd be very surprised if this were a NumPy problem; we don't see any problems with our mac tests. How did you install?

@astrofrog
Contributor Author

@charris - I just did pip install git+https://github.com/numpy/numpy. I can try and git bisect to see if I see any change over time.

@eric-wieser
Member

eric-wieser commented Oct 20, 2018

There is a bug here in the error reporting: if init_@lapack_func@ in umath_linalg.c.src fails, we should be reporting an error, not just leaving the outputs uninitialized.

Bisecting will probably lead you to #9980, which added that malloc call and all the functions around it.

@charris
Member

charris commented Oct 20, 2018

pip install git+https://github.com/numpy/numpy

Any idea how that works? Where does it get the source? You can find compiled development wheels at https://7933911d6844c6c53a7d-47bd50c35cd79bd838daf386af554a83.ssl.cf2.rackcdn.com/

@astrofrog
Contributor Author

@charris - I just git cloned the repository and ran pip install ., which gives the same result (that's what that pip command was doing behind the scenes).

@eric-wieser
Member

eric-wieser commented Oct 20, 2018

This is the function that's failing:

static inline int
init_@lapack_func@(GELSD_PARAMS_t *params,
                   fortran_int m,
                   fortran_int n,
                   fortran_int nrhs)
{
    npy_uint8 *mem_buff = NULL;
    npy_uint8 *mem_buff2 = NULL;
    npy_uint8 *a, *b, *s, *work, *iwork;
    fortran_int min_m_n = fortran_int_min(m, n);
    fortran_int max_m_n = fortran_int_max(m, n);
    size_t safe_min_m_n = min_m_n;
    size_t safe_max_m_n = max_m_n;
    size_t safe_m = m;
    size_t safe_n = n;
    size_t safe_nrhs = nrhs;
    size_t a_size = safe_m * safe_n * sizeof(@ftyp@);
    size_t b_size = safe_max_m_n * safe_nrhs * sizeof(@ftyp@);
    size_t s_size = safe_min_m_n * sizeof(@ftyp@);
    fortran_int work_count;
    size_t work_size;
    size_t iwork_size;
    fortran_int lda = fortran_int_max(1, m);
    fortran_int ldb = fortran_int_max(1, fortran_int_max(m, n));

    mem_buff = malloc(a_size + b_size + s_size);
    if (!mem_buff)
        goto error;
    a = mem_buff;
    b = a + a_size;
    s = b + b_size;

    params->M = m;
    params->N = n;
    params->NRHS = nrhs;
    params->A = a;
    params->B = b;
    params->S = s;
    params->LDA = lda;
    params->LDB = ldb;

    {
        /* compute optimal work size */
        @ftyp@ work_size_query;
        fortran_int iwork_size_query;

        params->WORK = &work_size_query;
        params->IWORK = &iwork_size_query;
        params->RWORK = NULL;
        params->LWORK = -1;

        if (call_@lapack_func@(params) != 0)
            goto error;

        work_count = (fortran_int)work_size_query;
        work_size  = (size_t)work_size_query * sizeof(@ftyp@);
        iwork_size = (size_t)iwork_size_query * sizeof(fortran_int);
    }

    mem_buff2 = malloc(work_size + iwork_size);
    if (!mem_buff2)
        goto error;
    work = mem_buff2;
    iwork = work + work_size;

    params->WORK = work;
    params->RWORK = NULL;
    params->IWORK = iwork;
    params->LWORK = work_count;

    return 1;
 error:
    TRACE_TXT("%s failed init\n", __FUNCTION__);
    free(mem_buff);
    free(mem_buff2);
    memset(params, 0, sizeof(*params));

    return 0;
}

My guess is that your dgelsd is broken for workspace queries, which are these lines:

    {
        /* compute optimal work size */
        @ftyp@ work_size_query;
        fortran_int iwork_size_query;

        params->WORK = &work_size_query;
        params->IWORK = &iwork_size_query;
        params->RWORK = NULL;
        params->LWORK = -1;

        if (call_@lapack_func@(params) != 0)
            goto error;

        work_count = (fortran_int)work_size_query;
        work_size  = (size_t)work_size_query * sizeof(@ftyp@);
        iwork_size = (size_t)iwork_size_query * sizeof(fortran_int);
    }

    mem_buff2 = malloc(work_size + iwork_size);
    if (!mem_buff2)
        goto error;

LAPACK 3.2.1 and older exhibit a bug where they do not actually report a workspace size for the query. Perhaps we need to detect that and throw an exception telling users to use a non-broken LAPACK.
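
A side note (mine, not from the thread): the absurd allocation size in the traceback is consistent with a garbage workspace query. Casting a negative value to size_t produces exactly the kind of number mach_vm_map reported; a pure-Python sketch of the conversion:

```python
# Sketch: what (size_t)work_size_query does to a garbage/negative
# value left behind by a broken dgelsd workspace query.
def as_size_t(v: int) -> int:
    """Reinterpret an integer as an unsigned 64-bit size_t."""
    return int(v) % 2**64

# A sane query reports a small positive workspace size:
print(as_size_t(278))          # 278

# A negative garbage value becomes an enormous allocation request;
# this particular value reproduces the size in the malloc error above:
print(as_size_t(-1992118272))  # 18446744071717433344
```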

@tylerjereddy
Contributor

We've seen this issue before: it has been reported on MacPorts, and I discussed it with Eric a bit too. We disable acceleration in our CI at the moment.

@astrofrog
Contributor Author

Just in case it's helpful, I'm currently working in a conda environment with:

# packages in environment at /Users/tom/miniconda3/envs/alldev:
#
# Name                    Version                   Build  Channel
atomicwrites              1.2.1                     <pip>
attrs                     18.2.0                    <pip>
ca-certificates           2018.03.07                    0  
certifi                   2018.10.15               py37_0  
cycler                    0.10.0                    <pip>
cython                    0.29             py37h0a44026_0  
jinja2                    2.10                     py37_0  
kiwisolver                1.0.1                     <pip>
libcxx                    4.0.1                h579ed51_0  
libcxxabi                 4.0.1                hebd6815_0  
libedit                   3.1.20170329         hb402a30_2  
libffi                    3.2.1                h475c297_4  
markupsafe                1.0              py37h1de35cc_1  
matplotlib                3.0.0rc1+691.g682fe2ced           <pip>
more-itertools            4.3.0                     <pip>
ncurses                   6.1                  h0a44026_0  
nose                      1.3.7                     <pip>
numpy                     1.15.0                    <pip>
openssl                   1.0.2p               h1de35cc_0  
pip                       10.0.1                   py37_0  
pluggy                    0.8.0                     <pip>
py                        1.7.0                     <pip>
pyparsing                 2.2.2                     <pip>
pytest                    3.9.1                     <pip>
pytest-mpl                0.10                      <pip>
python                    3.7.0                hc167b69_0  
python-dateutil           2.7.3                     <pip>
readline                  7.0                  h1de35cc_5  
setuptools                40.4.3                   py37_0  
six                       1.11.0                    <pip>
sqlite                    3.25.2               ha441bb4_0  
tk                        8.6.8                ha441bb4_0  
wheel                     0.32.1                   py37_0  
xz                        5.2.4                h1de35cc_4  
zlib                      1.2.11               hf3cbc9b_2  

@eric-wieser
Member

disable acceleration

By that, you mean the mac "Accelerate" library?

@tylerjereddy
Contributor

I mean we aggressively disable as many linalg-accelerating things as we can for the mac CI testing on Azure at the moment:

    env:
      BLAS: None
      LAPACK: None
      ATLAS: None
      ACCELERATE: None
      CC: /usr/bin/clang

Related discussion

We should perhaps try to replicate what we do for wheels -- I think Stefan said that uses OpenBLAS v. something.

@astrofrog
Contributor Author

astrofrog commented Oct 20, 2018

I ran git bisect, and the first commit to fail on my computer is f5758d6 (in #11036 by @mattip)

f5758d6fe15c2b506290bfc5379a10027617b331 is the first bad commit
commit f5758d6fe15c2b506290bfc5379a10027617b331
Author: Matti Picus <matti.picus@gmail.com>
Date:   Thu May 10 04:50:23 2018 +0300

    BUG: optimizing compilers can reorder call to npy_get_floatstatus (#11036)
    
    * BUG: optimizing compilers can reorder call to npy_get_floatstatus
    
    * alternative fix for npy_get_floatstatus, npy_clear_floatstatus
    
    * unify test with pr #11043
    
    * use barrier form of functions in place of PyUFunc_{get,clear}fperr
    
    * update doc, prevent segfault
    
    * MAINT: Do some rewrite on the 1.15.0 release notes.
    
    [ci skip]

:040000 040000 6440d3f1eb6687a50df674350c52a8c648c2a694 4d3170ecb2b1c4108685f0f9f8c5809dafec11a6 M	doc
:040000 040000 1124a0713b50836a61e8c920698943570bec2d18 f01e4d5b3deeab66685856e1e6d6b1e8dacbd070 M	numpy

@mattip
Member

mattip commented Oct 21, 2018

The commit in question may have exposed an already-existing error state that previously was discarded. Could you try compiling without the buggy lapack library as our CI currently does?

@charris charris added this to the 1.15.4 release milestone Oct 24, 2018
@astrofrog
Contributor Author

@mattip - what is the best way to compile without the buggy lapack?

@mattip
Member

mattip commented Oct 27, 2018

Setting all the environment variables listed in the comment above to None will disable third-party libraries and compile using a local implementation of the LAPACK routines (lapack-lite). So using bash, I do
ACCELERATE=None ATLAS=None OPENBLAS=None BLAS=None LAPACK=None python runtests.py
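
To confirm which BLAS/LAPACK the resulting build actually linked against (a quick check of mine, not something mentioned in the thread), NumPy exposes its build configuration:

```python
import numpy as np

# Prints the BLAS/LAPACK libraries NumPy was built against; with all
# the accelerators disabled this should fall back to the bundled
# lapack_lite implementation rather than Accelerate.
np.__config__.show()
```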

@juliantaylor
Contributor

Running valgrind on a minimal test case could also help find where the broken memory handling is.

@juliantaylor
Contributor

Does mac have the madvise MADV_HUGEPAGE flag? Check the file numpy/core/include/numpy/config.h in the build folder after building numpy for the flag HAVE_MADV_HUGEPAGE.

That was a recent change to memory handling in the master branch, though it shouldn't be the cause.
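
A small helper (hypothetical, just to automate the check described above) that scans the generated config.h for the define:

```python
from pathlib import Path

def has_define(header_text: str, name: str) -> bool:
    """Return True if `#define <name>` appears in the header text."""
    return any(
        line.split()[:2] == ["#define", name]
        for line in header_text.splitlines()
    )

# Path is an assumption -- adjust to your build tree layout.
cfg = Path("build/numpy/core/include/numpy/config.h")
if cfg.exists():
    print(has_define(cfg.read_text(), "HAVE_MADV_HUGEPAGE"))
```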

@astrofrog
Contributor Author

Compiling with the flags @mattip suggested works fine.

I'd be happy to use valgrind on my failing example - what is the proper valgrind incantation? (I'm not too familiar with it)

There is no HAVE_MADV_HUGEPAGE in config.h - here's the full config.h file: https://gist.github.com/astrofrog/81e83535fbefdaab086f50de2d5de74b

@pv
Member

pv commented Oct 27, 2018

PYTHONMALLOC=malloc valgrind python ...

@astrofrog
Contributor Author

$ PYTHONMALLOC=malloc valgrind python fit.py 
valgrind: mmap-FIXED(0x7fff5f400000, 8388608) failed in UME (load_unixthread1) with error 22 (Invalid argument).

@astrofrog
Contributor Author

astrofrog commented Oct 27, 2018

@pv - I had a buggy version of valgrind. I updated and now get:

$ PYTHONMALLOC=malloc valgrind python fit.py 
==14366== Memcheck, a memory error detector
==14366== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==14366== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==14366== Command: python fit.py
==14366== 
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/bin/python"
warning: (x86_64) /tmp/lto.o unable to open object file: No such file or directory
warning: no debug symbols in executable (-arch x86_64)
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/lib/python3.7/lib-dynload/_heapq.cpython-37m-darwin.so"
warning: (x86_64) /tmp/lto.o unable to open object file: No such file or directory
warning: no debug symbols in executable (-arch x86_64)
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/lib/python3.7/site-packages/numpy-1.16.0.dev0+45718fd-py3.7-macosx-10.7-x86_64.egg/numpy/core/_multiarray_umath.cpython-37m-darwin.so"
--14366-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option
--14366-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 2 times)
--14366-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 4 times)
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/lib/python3.7/lib-dynload/math.cpython-37m-darwin.so"
warning: (x86_64) /tmp/lto.o unable to open object file: No such file or directory
warning: no debug symbols in executable (-arch x86_64)
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/lib/python3.7/lib-dynload/_datetime.cpython-37m-darwin.so"
warning: (x86_64) /tmp/lto.o unable to open object file: No such file or directory
warning: no debug symbols in executable (-arch x86_64)
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/lib/python3.7/lib-dynload/_ctypes.cpython-37m-darwin.so"
warning: (x86_64) /tmp/lto.o unable to open object file: No such file or directory
warning: no debug symbols in executable (-arch x86_64)
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/lib/libffi.6.dylib"
warning: no debug symbols in executable (-arch x86_64)
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/lib/python3.7/lib-dynload/_struct.cpython-37m-darwin.so"
warning: (x86_64) /tmp/lto.o unable to open object file: No such file or directory
warning: no debug symbols in executable (-arch x86_64)
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/lib/python3.7/lib-dynload/_pickle.cpython-37m-darwin.so"
warning: (x86_64) /tmp/lto.o unable to open object file: No such file or directory
warning: no debug symbols in executable (-arch x86_64)
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/lib/python3.7/site-packages/numpy-1.16.0.dev0+45718fd-py3.7-macosx-10.7-x86_64.egg/numpy/core/_multiarray_tests.cpython-37m-darwin.so"
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/lib/python3.7/site-packages/numpy-1.16.0.dev0+45718fd-py3.7-macosx-10.7-x86_64.egg/numpy/linalg/lapack_lite.cpython-37m-darwin.so"
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/lib/python3.7/site-packages/numpy-1.16.0.dev0+45718fd-py3.7-macosx-10.7-x86_64.egg/numpy/linalg/_umath_linalg.cpython-37m-darwin.so"
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/lib/python3.7/lib-dynload/zlib.cpython-37m-darwin.so"
warning: (x86_64) /tmp/lto.o unable to open object file: No such file or directory
warning: no debug symbols in executable (-arch x86_64)
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/lib/libz.1.2.11.dylib"
warning: no debug symbols in executable (-arch x86_64)
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/lib/python3.7/lib-dynload/_bz2.cpython-37m-darwin.so"
warning: (x86_64) /Users/tom/miniconda3/envs/alldev/lib/libbz2.a(blocksort.o) unable to open object file: No such file or directory
warning: (x86_64) /Users/tom/miniconda3/envs/alldev/lib/libbz2.a(huffman.o) unable to open object file: No such file or directory
warning: (x86_64) /Users/tom/miniconda3/envs/alldev/lib/libbz2.a(crctable.o) unable to open object file: No such file or directory
warning: (x86_64) /Users/tom/miniconda3/envs/alldev/lib/libbz2.a(randtable.o) unable to open object file: No such file or directory
warning: (x86_64) /Users/tom/miniconda3/envs/alldev/lib/libbz2.a(compress.o) unable to open object file: No such file or directory
warning: (x86_64) /Users/tom/miniconda3/envs/alldev/lib/libbz2.a(decompress.o) unable to open object file: No such file or directory
warning: (x86_64) /Users/tom/miniconda3/envs/alldev/lib/libbz2.a(bzlib.o) unable to open object file: No such file or directory
warning: (x86_64) /tmp/lto.o unable to open object file: No such file or directory
warning: no debug symbols in executable (-arch x86_64)
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/lib/python3.7/lib-dynload/_lzma.cpython-37m-darwin.so"
warning: (x86_64) /tmp/lto.o unable to open object file: No such file or directory
warning: no debug symbols in executable (-arch x86_64)
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/lib/liblzma.5.dylib"
warning: no debug symbols in executable (-arch x86_64)
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/lib/python3.7/lib-dynload/grp.cpython-37m-darwin.so"
warning: (x86_64) /tmp/lto.o unable to open object file: No such file or directory
warning: no debug symbols in executable (-arch x86_64)
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/lib/python3.7/lib-dynload/_decimal.cpython-37m-darwin.so"
warning: (x86_64) /tmp/lto.o unable to open object file: No such file or directory
warning: no debug symbols in executable (-arch x86_64)
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/lib/python3.7/site-packages/numpy-1.16.0.dev0+45718fd-py3.7-macosx-10.7-x86_64.egg/numpy/fft/fftpack_lite.cpython-37m-darwin.so"
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/lib/python3.7/site-packages/numpy-1.16.0.dev0+45718fd-py3.7-macosx-10.7-x86_64.egg/numpy/random/mtrand.cpython-37m-darwin.so"
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/lib/python3.7/lib-dynload/_hashlib.cpython-37m-darwin.so"
warning: (x86_64) /tmp/lto.o unable to open object file: No such file or directory
warning: no debug symbols in executable (-arch x86_64)
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/lib/libcrypto.1.1.dylib"
warning: no debug symbols in executable (-arch x86_64)
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/lib/python3.7/lib-dynload/_blake2.cpython-37m-darwin.so"
warning: (x86_64) /tmp/lto.o unable to open object file: No such file or directory
warning: no debug symbols in executable (-arch x86_64)
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/lib/python3.7/lib-dynload/_sha3.cpython-37m-darwin.so"
warning: (x86_64) /tmp/lto.o unable to open object file: No such file or directory
warning: no debug symbols in executable (-arch x86_64)
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/lib/python3.7/lib-dynload/_bisect.cpython-37m-darwin.so"
warning: (x86_64) /tmp/lto.o unable to open object file: No such file or directory
warning: no debug symbols in executable (-arch x86_64)
--14366-- run: /usr/bin/dsymutil "/Users/tom/miniconda3/envs/alldev/lib/python3.7/lib-dynload/_random.cpython-37m-darwin.so"
warning: (x86_64) /tmp/lto.o unable to open object file: No such file or directory
warning: no debug symbols in executable (-arch x86_64)
==14366== Conditional jump or move depends on uninitialised value(s)
==14366==    at 0x1004C902D: malloc (in /usr/local/Cellar/valgrind/3.14.0/lib/valgrind/vgpreload_memcheck-amd64-darwin.so)
==14366==    by 0x104160A0B: DOUBLE_lstsq (umath_linalg.c.src:2993)
==14366==    by 0x102869761: PyUFunc_GenericFunction (ufunc_object.c:2999)
==14366==    by 0x10286BC0C: ufunc_generic_call (ufunc_object.c:4657)
==14366==    by 0x10002F2A2: _PyObject_FastCallKeywords (in /Users/tom/miniconda3/envs/alldev/bin/python)
==14366==    by 0x10016FC23: call_function (in /Users/tom/miniconda3/envs/alldev/bin/python)
==14366==    by 0x10016D7AD: _PyEval_EvalFrameDefault (in /Users/tom/miniconda3/envs/alldev/bin/python)
==14366==    by 0x100161331: _PyEval_EvalCodeWithName (in /Users/tom/miniconda3/envs/alldev/bin/python)
==14366==    by 0x10002E5A6: _PyFunction_FastCallDict (in /Users/tom/miniconda3/envs/alldev/bin/python)
==14366==    by 0x10016D904: _PyEval_EvalFrameDefault (in /Users/tom/miniconda3/envs/alldev/bin/python)
==14366==    by 0x10002F097: function_code_fastcall (in /Users/tom/miniconda3/envs/alldev/bin/python)
==14366==    by 0x10016FB3D: call_function (in /Users/tom/miniconda3/envs/alldev/bin/python)
==14366== 
==14366== Warning: set address range perms: large range [0x106cbe040, 0x11bfa7b50) (undefined)
==14366== Warning: set address range perms: large range [0x106cbe028, 0x11bfa7b68) (noaccess)
==14366== 
==14366== HEAP SUMMARY:
==14366==     in use at exit: 3,508,165 bytes in 22,411 blocks
==14366==   total heap usage: 207,171 allocs, 184,760 frees, 398,010,815 bytes allocated
==14366== 
==14366== LEAK SUMMARY:
==14366==    definitely lost: 6,748 bytes in 143 blocks
==14366==    indirectly lost: 0 bytes in 0 blocks
==14366==      possibly lost: 1,328,777 bytes in 6,345 blocks
==14366==    still reachable: 2,154,197 bytes in 15,772 blocks
==14366==         suppressed: 18,443 bytes in 151 blocks
==14366== Rerun with --leak-check=full to see details of leaked memory
==14366== 
==14366== For counts of detected and suppressed errors, rerun with: -v
==14366== Use --track-origins=yes to see where uninitialised values come from
==14366== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 4 from 4)

Does this help?

@juliantaylor
Contributor

Thanks, it seems to confirm that the LAPACK workspace queries do not work.

@charris
Member

charris commented Oct 31, 2018

Any progress on a fix for this? I plan on releasing 1.15.4 about Nov 4. ISTR that we once had some workspace computation workarounds.

@charris
Member

charris commented Nov 17, 2018

Pushing off to 1.17.

@charris charris modified the milestones: 1.16.0 release, 1.17.0 release Nov 17, 2018
@charris
Member

charris commented May 22, 2019

Any reason not to close this? It doesn't seem to be a NumPy problem.
