[Numpy-discussion] numpy threads crash when allocating arrays

Hi Burlen,

Post by Burlen Loring
Hi All,
I'm working on a threaded pipeline where we want the end user to be able to code up Python functions to do numerical work. Threading is all done in C++11 and in each thread we've acquired gill before we invoke the user provided Python callback and release it only when the callback returns. We've used SWIG to expose bindings to C++ objects in Python.
When run with more than 1 thread I get inconsistent segv's in various numpy routines, and occasionally see a *** Reference count error detected: an attempt was made to deallocate 11 (f) ***.
To pass data from C++ to the Python callback as numpy array we have

206 // ****************************************************************************
207 template <typename NT>
208 PyArrayObject *new_object(teca_variant_array_impl<NT> *varrt)
209 {
210 // allocate a buffer
211 npy_intp n_elem = varrt->size();
212 size_t n_bytes = n_elem*sizeof(NT);
213 NT *mem = static_cast<NT*>(malloc(n_bytes));
214 if (!mem)
215 {
216 PyErr_Format(PyExc_RuntimeError,
217 "failed to allocate %lu bytes", n_bytes);
218 return nullptr;
219 }
220
221 // copy the data
222 memcpy(mem, varrt->get(), n_bytes);
223
224 // put the buffer in to a new numpy object
225 PyArrayObject *arr = reinterpret_cast<PyArrayObject*>(
226 PyArray_SimpleNewFromData(1, &n_elem, numpy_tt<NT>::code, mem));
227 PyArray_ENABLEFLAGS(arr, NPY_ARRAY_OWNDATA);
228
229 return arr;
230 }

This code would probably be much simpler if you let numpy allocate the
buffer with PyArray_SimpleNew and then did the memcpy. I doubt that's
your problem, though. Numpy should be assuming that "owned" data was
allocated using malloc(), and if it were using a different allocator
then I think you'd be seeing crashes much sooner.

Post by Burlen Loring
This is the only place we create numpy objects in the C++ side.

35 sys.stderr.write('descriptive_stats::execute MPI %d\n'%(rank))
36
37 mesh = as_teca_cartesian_mesh(data_in[0])
38
39 table = teca_table.New()
40 table.copy_metadata(mesh)
41
42 table.declare_columns(['step','time'], ['ul','d'])
43 table << mesh.get_time_step() << mesh.get_time()
44
46
47 table.declare_columns(['min '+var_name, 'avg '+var_name, \
48 'max '+var_name, 'std '+var_name, 'low_q '+var_name, \
49 'med '+var_name, 'up_q '+var_name], ['d']*7)
50
51 var = mesh.get_point_arrays().get(var_name).as_array()
52
53 table << float(np.min(var)) << float(np.average(var)) \
54 << float(np.max(var)) << float(np.std(var)) \
55 << map(float, np.percentile(var, [25.,50.,75.]))
56
57 return table
58 return execute

this callback is the only spot where numpy is used. the as_array call is implemented by new_object template above.
Further, If I remove our use of PyArray_SimpleNewFromData, by replacing line 51 in the Python code above with var = np.array(range(1, 1100), 'f'), the problem disappears. It must have something to do with use of PyArray_SimpleNewFromData.
I'm at a loss to see why things are going south. I'm using the GIL and I thought that would serialize the Python code. I suspect that numpy is using global or static variables some where internally and that it's inherently thread unsafe. Can anyone confirm/deny? maybe point me in the right direction?

Numpy does use global/static variables, and it is unsafe to call into
numpy simultaneously from different threads. But that's ok, because
you're not allowed to call numpy functions simultaneously from
different threads -- you have to hold the GIL first, and that
serializes access to all of numpy's internal state. Numpy is very
commonly used in threaded code and most people aren't seeing random
segfaults, so the problem is most likely in your code. Sorry I can't
help much more than that... I guess I'd start by triple-checking that
the code really truly does hold the GIL every time that it calls into
numpy/python APIs. I'd also try running it under valgrind in case it's
some other random memory corruption that's just showing up in a weird
way.

-n

Burlen Loring

2016-06-14 17:34:13 UTC

Nathaniel, Thanks for the feedback. Investigations made where I acquire
GIL in spots where I already hold it lead me to suspect that the issue
is caused by use of Py_BEGIN_ALLOW_THREADS or another macro
NPY_BEGIN_ALLOW_THREADS.

to put this point to rest, I have a high degree of confidence in our
code. I have run the code in valgrind, and it runs cleanly when I use
Python's valgrind suppression file. It is certainly the case the the
error is harder to reproduce with valgrind, and it runs so slowly under
valgrind that running in a bash loop (the usual way I reproduce) is not
practical. I have examined crashes with gdb and I have verified that
they occur while I hold the GIL.

However, based on your suggestion, I experimented with acquiring GIL in
certain spots in my code where it should not be necessary either because
those code pieces are invoked by the main thread when there is only one
thread running, or because the code in question is invoked while I'm
already holding the GIL. I also added trace output to stderr in those
spots as well.

Acquiring the GIL caused deadlocks as expected, except when I used it in
new_object template (the only spot we pass data to numpy) below.
Aquiring the GIL here fixes the issues!

206 //
****************************************************************************
207 template <typename NT>
208 PyArrayObject *new_object(teca_variant_array_impl<NT> *varrt)
209 {
*210 PyGILState_STATE gstate;**// experimental, I already hold
the GIL higher up in stack
**211 gstate = PyGILState_Ensure();*
212 TECA_STATUS("teca_py_array::new_object");
213
214 // allocate a buffer
215 npy_intp n_elem = varrt->size();
216 size_t n_bytes = n_elem*sizeof(NT);
217 NT *mem = static_cast<NT*>(malloc(n_bytes));
218 if (!mem)
219 {
220 PyErr_Format(PyExc_RuntimeError,
221 "failed to allocate %lu bytes", n_bytes);
222 return nullptr;
223 }
224
225 // copy the data
226 memcpy(mem, varrt->get(), n_bytes);
227
228 // put the buffer in to a new numpy object
229 PyArrayObject *arr = reinterpret_cast<PyArrayObject*>(
230 PyArray_SimpleNewFromData(1, &n_elem,
numpy_tt<NT>::code, mem));
231 PyArray_ENABLEFLAGS(arr, NPY_ARRAY_OWNDATA);
232
*233 PyGILState_Release(gstate);*
234 return arr;
235 }

now, this function is only used from within the user provided Python
callback which is invoked only while I'm holding the GIL. A fact that
I've verified via gdb stack traces. This function should be running
serially due to the fact that I hold the GIL. However, it's running
concurrently, as evidenced from the random crashes when it's used, and
the garbled stderr output, both of which go away with the above
addition. I don't think that I should have to acquire the GIL here, but
the evidence is against me!

In Python docs on the GIL, I noticed that there's a macro
Py_BEGIN_ALLOW_THREADS that Python uses internally around blocking I/O
and heavy computations. much to my surprise grep reveals
Py_BEGIN_ALLOW_THREADS is used in numpy. I think this can explain the
issues I'm experiencing. where numpy uses Py_BEGIN_ALLOW_THREADS it
would let my threads, who utilize the numpy C-API via new_object
template above, run concurrently despite the fact that I hold the GIL.
It seems to me that in order for numpy to be thread safe it should not
use this macro at all. At least this is my theory, I haven't yet had a
chance to modify numpy build to verify. It's possible that
Py_BEGIN_ALLOW_THREADS maybe used elsewhere.

here's my question: given Py_BEGIN_ALLOW_THREADS is used by numpy how
can numpy be thread safe? and how can someone using the C-API know where
it's necessary to acquire the GIL? Maybe someone can explain this?

$grep BEGIN_ALLOW_THREADS ./ -rIn
./doc/source/f2py/signature-file.rst:250: Use
``Py_BEGIN_ALLOW_THREADS .. Py_END_ALLOW_THREADS`` block
./doc/source/reference/c-api.array.rst:3092: .. c:macro::
NPY_BEGIN_ALLOW_THREADS
./doc/source/reference/c-api.array.rst:3094: Equivalent to
:c:macro:`Py_BEGIN_ALLOW_THREADS` except it uses
./build/src.linux-x86_64-2.7/numpy/core/src/multiarray/scalartypes.c:2109:
Py_BEGIN_ALLOW_THREADS
./build/src.linux-x86_64-2.7/numpy/core/src/multiarray/scalartypes.c:2125:
Py_BEGIN_ALLOW_THREADS
./build/src.linux-x86_64-2.7/numpy/core/src/multiarray/scalartypes.c:2143:
Py_BEGIN_ALLOW_THREADS
./build/src.linux-x86_64-2.7/numpy/core/src/multiarray/scalartypes.c:2159:
Py_BEGIN_ALLOW_THREADS
./build/src.linux-x86_64-2.7/numpy/core/src/multiarray/scalartypes.c:2177:
Py_BEGIN_ALLOW_THREADS
./build/src.linux-x86_64-2.7/numpy/core/src/multiarray/scalartypes.c:2193:
Py_BEGIN_ALLOW_THREADS
./build/src.linux-x86_64-2.7/numpy/core/src/multiarray/scalartypes.c:2211:
Py_BEGIN_ALLOW_THREADS
./build/src.linux-x86_64-2.7/numpy/core/src/multiarray/scalartypes.c:2227:
Py_BEGIN_ALLOW_THREADS
./build/lib.linux-x86_64-2.7/numpy/f2py/rules.py:419: {isthreadsafe:
'\t\t\tPy_BEGIN_ALLOW_THREADS'},
./build/lib.linux-x86_64-2.7/numpy/f2py/rules.py:457: {isthreadsafe:
'\tPy_BEGIN_ALLOW_THREADS'},
./build/lib.linux-x86_64-2.7/numpy/f2py/rules.py:495: {isthreadsafe:
'\tPy_BEGIN_ALLOW_THREADS'},
./build/lib.linux-x86_64-2.7/numpy/f2py/rules.py:541: {isthreadsafe:
'\tPy_BEGIN_ALLOW_THREADS'},
./build/lib.linux-x86_64-2.7/numpy/f2py/rules.py:583: {isthreadsafe:
'\t\tPy_BEGIN_ALLOW_THREADS'},
./numpy/f2py/rules.py:419: {isthreadsafe:
'\t\t\tPy_BEGIN_ALLOW_THREADS'},
./numpy/f2py/rules.py:457: {isthreadsafe:
'\tPy_BEGIN_ALLOW_THREADS'},
./numpy/f2py/rules.py:495: {isthreadsafe:
'\tPy_BEGIN_ALLOW_THREADS'},
./numpy/f2py/rules.py:541: {isthreadsafe:
'\tPy_BEGIN_ALLOW_THREADS'},
./numpy/f2py/rules.py:583: {isthreadsafe: '\t\tPy_BEGIN_ALLOW_THREADS'},
./numpy/fft/fftpack_litemodule.c:45: Py_BEGIN_ALLOW_THREADS;
./numpy/fft/fftpack_litemodule.c:98: Py_BEGIN_ALLOW_THREADS;
./numpy/fft/fftpack_litemodule.c:135: Py_BEGIN_ALLOW_THREADS;
./numpy/fft/fftpack_litemodule.c:191: Py_BEGIN_ALLOW_THREADS;
./numpy/fft/fftpack_litemodule.c:254: Py_BEGIN_ALLOW_THREADS;
./numpy/fft/fftpack_litemodule.c:295: Py_BEGIN_ALLOW_THREADS;
./numpy/core/src/multiarray/ctors.c:3237: NPY_BEGIN_ALLOW_THREADS;
./numpy/core/src/multiarray/ctors.c:3277: NPY_BEGIN_ALLOW_THREADS;
./numpy/core/src/multiarray/cblasfuncs.c:378: NPY_BEGIN_ALLOW_THREADS;
./numpy/core/src/multiarray/cblasfuncs.c:525: NPY_BEGIN_ALLOW_THREADS;
./numpy/core/src/multiarray/cblasfuncs.c:549: NPY_BEGIN_ALLOW_THREADS
./numpy/core/src/multiarray/cblasfuncs.c:576: NPY_BEGIN_ALLOW_THREADS
./numpy/core/src/multiarray/cblasfuncs.c:628: NPY_BEGIN_ALLOW_THREADS;
./numpy/core/src/multiarray/cblasfuncs.c:761: NPY_BEGIN_ALLOW_THREADS;
./numpy/core/src/multiarray/compiled_base.c:157:
NPY_BEGIN_ALLOW_THREADS;
./numpy/core/src/multiarray/compiled_base.c:181:
NPY_BEGIN_ALLOW_THREADS;
./numpy/core/src/multiarray/compiled_base.c:473:
NPY_BEGIN_ALLOW_THREADS;
./numpy/core/src/multiarray/compiled_base.c:810:
NPY_BEGIN_ALLOW_THREADS;
./numpy/core/src/multiarray/compiled_base.c:1014:
NPY_BEGIN_ALLOW_THREADS;
./numpy/core/src/multiarray/compiled_base.c:1049:
NPY_BEGIN_ALLOW_THREADS;
./numpy/core/src/multiarray/scalartypes.c.src:939:
Py_BEGIN_ALLOW_THREADS
./numpy/core/src/multiarray/scalartypes.c.src:955:
Py_BEGIN_ALLOW_THREADS
./numpy/core/src/multiarray/convert.c:98: NPY_BEGIN_ALLOW_THREADS;
./numpy/core/src/multiarray/convert.c:210: NPY_BEGIN_ALLOW_THREADS;
./numpy/core/src/multiarray/multiarraymodule.c:3981:
Py_BEGIN_ALLOW_THREADS;
./numpy/core/include/numpy/ndarraytypes.h:932:#define
NPY_BEGIN_ALLOW_THREADS Py_BEGIN_ALLOW_THREADS
./numpy/core/include/numpy/ndarraytypes.h:952:#define
NPY_BEGIN_ALLOW_THREADS

Post by Nathaniel Smith
Hi Burlen,

Post by Burlen Loring
This is the only place we create numpy objects in the C++ side.

Julian Taylor

2016-06-14 19:28:42 UTC

Post by Burlen Loring
here's my question: given Py_BEGIN_ALLOW_THREADS is used by numpy how
can numpy be thread safe? and how can someone using the C-API know where
it's necessary to acquire the GIL? Maybe someone can explain this?

numpy only releases the GIL when it is not accessing any python objects
or other non-threadsafe structures anymore.
That is usually during computation loops and IO.

Your problem is indeed a missing PyGILState_Ensure

I am assuming that the threads you are using are not created by python,
so you don't have a threadstate setup and no GIL.
You do set it up with that function, see
https://docs.python.org/2/c-api/init.html#non-python-created-threads

Burlen Loring

2016-06-14 19:38:32 UTC

numpy only releases the GIL when it is not accessing any python
objects or other non-threadsafe structures anymore.
That is usually during computation loops and IO.
Your problem is indeed a missing PyGILState_Ensure
I am assuming that the threads you are using are not created by
python, so you don't have a threadstate setup and no GIL.
You do set it up with that function, see
https://docs.python.org/2/c-api/init.html#non-python-created-threads

I'm already hold the GIL in each thread via the mechanism you pointed
to, and I have verified this with gdb, some how GIL is being released.
re-acquiring the GIL solves the issue, but it technically should cause a
deadlock to acquire 2x in the same thread. I suspect Numpy use of
Py_BEGIN_ALLOW_THREADS is cause of the issue. It will take some work to
verify.

Nathaniel Smith

2016-06-14 20:05:16 UTC

numpy only releases the GIL when it is not accessing any python objects

or other non-threadsafe structures anymore.

Post by Julian Taylor
That is usually during computation loops and IO.
Your problem is indeed a missing PyGILState_Ensure
I am assuming that the threads you are using are not created by python,

so you don't have a threadstate setup and no GIL.

Post by Julian Taylor
You do set it up with that function, see

https://docs.python.org/2/c-api/init.html#non-python-created-threads

I'm already hold the GIL in each thread via the mechanism you pointed to,

and I have verified this with gdb, some how GIL is being released.
re-acquiring the GIL solves the issue, but it technically should cause a
deadlock to acquire 2x in the same thread. I suspect Numpy use of
Py_BEGIN_ALLOW_THREADS is cause of the issue. It will take some work to
verify.

It's legal to call PyGILState_Ensure when you already have the GIL; the
whole point of that function is that you can use it whether you have the
GIL or not. However, if you already have the GIL, then it's a no-op, so it
shouldn't have fixed your problems. If it did help, then this strongly
suggests that you've missed something in your analysis of when you hold the
GIL.

While bugs are always possible, it's unlikely that this has anything to do
with numpy using Py_BEGIN_ALLOW_THREADS. In theory numpy's use is safe,
because it always follows the pattern of dropping the GIL, doing a chunk of
work that is careful not to touch any globals or the python api, and then
reacquiring the GIL. In practice it's possible that the code does something
else in some edge case, but if so then it's a pretty subtle issue that's
being triggered by some unusual thing about how you call into numpy.

-n

Burlen Loring

2016-06-16 18:19:50 UTC

Post by Burlen Loring
here's my question: given Py_BEGIN_ALLOW_THREADS is used by numpy how
can numpy be thread safe? and how can someone using the C-API know

where

Post by Burlen Loring
it's necessary to acquire the GIL? Maybe someone can explain this?

numpy only releases the GIL when it is not accessing any python

objects or other non-threadsafe structures anymore.

Post by Julian Taylor
That is usually during computation loops and IO.
Your problem is indeed a missing PyGILState_Ensure
I am assuming that the threads you are using are not created by

python, so you don't have a threadstate setup and no GIL.