Discussion:
[Numpy-discussion] Calling C code that assumes SIMD aligned data.
Øystein Schønning-Johansen
2016-05-05 09:38:51 UTC
Permalink
Hi!

I've written a small piece of NumPy code that does a neural network
feedforward calculation:

def feedforward(self, x):
    for activation, w, b in zip(self.activations, self.weights, self.biases):
        x = activation(np.dot(w, x) + b)
    return x

This works fine when my activation functions are in Python. However, I've
wrapped activation functions from a C implementation that requires the
array to be memory aligned (due to SIMD instructions in the C
implementation). So I need the operation np.dot(w, x) + b to return an
ndarray whose data pointer is aligned. How can I do that? Is it
possible at all?

(BTW: the function works correctly about 20% of the time I run it, and
otherwise it segfaults on the SIMD instruction in the C function.)

Thanks,
-Øystein
Francesc Alted
2016-05-05 11:55:36 UTC
Permalink
Post by Øystein Schønning-Johansen
So I need the operation np.dot(w, x) + b to return an
ndarray where the data pointer is aligned. How can I do that? Is it
possible at all?
Yes, np.dot() accepts an `out` parameter where you can pass your
aligned array. Testing whether NumPy has returned an aligned array
is easy:

In [15]: x = np.arange(6).reshape(2,3)

In [16]: x.ctypes.data % 16
Out[16]: 0

but:

In [17]: x.ctypes.data % 32
Out[17]: 16

so, in this case NumPy returned a 16-byte aligned array, which should be
enough for 128-bit SIMD (the SSE family). This kind of alignment is pretty
common on modern computers. If you need 256-bit (32-byte) alignment, then
you will need to build your container manually. See here for an example:
http://stackoverflow.com/questions/9895787/memory-alignment-for-fast-fft-in-python-using-shared-arrrays
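As a sketch of the `out` route: the array names and shapes below are made
up for illustration, and note that `out` must already have the matching
shape and dtype, and whether its buffer is aligned still depends on the
allocator.

```python
import numpy as np

w = np.random.rand(4, 3)   # weight matrix (illustrative shapes)
x = np.random.rand(3)      # input vector
b = np.random.rand(4)      # bias vector

out = np.empty(4)          # pre-allocated result buffer
np.dot(w, x, out=out)      # writes the product into `out`, no new array
out += b                   # in-place add keeps the same data pointer

print(out.ctypes.data % 16)  # alignment of the buffer, allocator-dependent
```

Reusing one `out` buffer per layer also avoids allocating a fresh array on
every call, which can matter in a tight feedforward loop.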

Francesc
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
--
Francesc Alted
Øystein Schønning-Johansen
2016-05-05 20:10:32 UTC
Permalink
Thanks for your answer, Francesc. Knowing that there is no NumPy solution
saves me the work of searching for one. I've not tried the solution described
at SO, but it looks like a real performance killer. I'd rather try to
override malloc with glibc's malloc hooks or LD_PRELOAD tricks. Do you think
that will do it? I'll try it and report back.

Thanks,
-Øystein
Charles R Harris
2016-05-05 20:32:46 UTC
Permalink
On Thu, May 5, 2016 at 2:10 PM, Øystein Schønning-Johansen wrote:
Post by Øystein Schønning-Johansen
I'd rather try to override malloc with glibc's malloc hooks or
LD_PRELOAD tricks. Do you think that will do it?
You might take a look at how NumPy handles this in
`numpy/core/src/umath/simd.inc.src`.

<snip>

Chuck
Julian Taylor
2016-05-06 20:22:38 UTC
Permalink
Note that anything larger than 16-byte alignment is unnecessary for
SIMD purposes on current hardware (>= Haswell); 16 bytes is the default
malloc alignment on amd64. And even on older chips (Sandy Bridge) the
penalty for unaligned access is pretty minor.
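To see what alignment your allocator actually hands out, one can sample the
data pointers of freshly allocated arrays. This is a quick diagnostic only;
the result depends on the platform's malloc, not on any NumPy guarantee.

```python
import numpy as np

# Record each fresh array's misalignment relative to 16 bytes.
# On a typical 64-bit glibc system malloc returns blocks aligned to
# at least 16 bytes, so this set is usually just {0}.
mods = {np.empty(1000).ctypes.data % 16 for _ in range(100)}
print(sorted(mods))
```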
Francesc Alted
2016-05-06 13:01:32 UTC
Permalink
Post by Øystein Schønning-Johansen
I'd rather try to override malloc with glibc's malloc hooks or
LD_PRELOAD tricks. Do you think that will do it?
I don't think you need that much weaponry. Just create an array with some
spare room for alignment. Say you want a 64-byte aligned double-precision
array. Create your desired array plus 64 additional bytes (8 doubles):

In [92]: a = np.zeros(int(1e6) + 8)

In [93]: a.ctypes.data % 64
Out[93]: 16

and compute the elements to shift this:

In [94]: shift = (64 // a.itemsize) - (a.ctypes.data % 64) // a.itemsize

In [95]: shift
Out[95]: 6

now, create a view with the required elements less:

In [98]: b = a[shift:-((64 // a.itemsize) - shift)]

In [99]: len(b)
Out[99]: 1000000

In [100]: b.ctypes.data % 64
Out[100]: 0

and voilà, b is now aligned to 64 bytes. As the view is a copy-free
operation, this is fast, and you only waste 64 bytes. Pretty cheap indeed.
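The steps above can be wrapped in a small helper. `aligned_zeros` is a
hypothetical name for illustration, not a NumPy API; it also handles the
case where the buffer happens to be aligned already (where the transcript's
negative-slice form would misbehave).

```python
import numpy as np

def aligned_zeros(n, alignment=64, dtype=np.float64):
    """Return a zeroed 1-D array of n items whose data pointer is aligned
    to `alignment` bytes, via the over-allocate-and-slice trick above.
    (Illustrative helper, not part of NumPy.)"""
    itemsize = np.dtype(dtype).itemsize
    extra = alignment // itemsize                    # spare items to shift into
    a = np.zeros(n + extra, dtype=dtype)
    # Number of items to skip to reach the first aligned element.
    shift = ((alignment - a.ctypes.data % alignment) % alignment) // itemsize
    return a[shift:shift + n]                        # copy-free aligned view

b = aligned_zeros(1_000_000)
```

One caveat: the result is a view, so it keeps the oversized base array alive,
and any code that later copies it (e.g. `np.ascontiguousarray` on a reshaped
version) may lose the alignment again.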

Francesc
Antoine Pitrou
2016-05-07 11:02:14 UTC
Permalink
Here's an obligatory plug for the two following PRs:
https://github.com/numpy/numpy/pull/5457
https://github.com/numpy/numpy/pull/5470

Regards

Antoine.

