Discussion:
[Numpy-discussion] Calling C code that assumes SIMD aligned data.
Øystein Schønning-Johansen
2016-05-05 09:38:51 UTC
Permalink
Hi!

I've written a small piece of NumPy code that does a neural network
feedforward calculation:

def feedforward(self, x):
    for activation, w, b in zip(self.activations, self.weights, self.biases):
        x = activation(np.dot(w, x) + b)
    return x

This works fine when my activation functions are in Python. However, I've
wrapped activation functions from a C implementation that requires the
array to be memory aligned (due to SIMD instructions in the C
implementation). So I need the operation np.dot(w, x) + b to return an
ndarray whose data pointer is aligned. How can I do that? Is it
possible at all?

(BTW: the function works correctly about 20% of the time I run it, and
otherwise it segfaults on the SIMD instruction in the C function.)

Thanks,
-Øystein
Francesc Alted
2016-05-05 11:55:36 UTC
Permalink
Post by Øystein Schønning-Johansen
So I need the operation np.dot(w, x) + b to return an
ndarray where the data pointer is aligned. How can I do that? Is it
possible at all?
Yes, np.dot() accepts an `out` parameter where you can pass your
aligned array. Testing whether NumPy has returned an aligned array
is easy:

In [15]: x = np.arange(6).reshape(2,3)

In [16]: x.ctypes.data % 16
Out[16]: 0

but:

In [17]: x.ctypes.data % 32
Out[17]: 16

so, in this case NumPy returned a 16-byte aligned array, which should be
enough for 128-bit SIMD (the SSE family). This kind of alignment is pretty
common on modern computers. If you need 256-bit (32-byte) alignment, then
you will need to build your container manually. See here for an example:
http://stackoverflow.com/questions/9895787/memory-alignment-for-fast-fft-in-python-using-shared-arrrays
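As a sketch of the `out` route: the array names and shapes below are made
up for illustration, and note that `out` must already have the matching
shape and dtype, and whether its buffer is aligned still depends on the
allocator.

```python
import numpy as np

w = np.random.rand(4, 3)   # weight matrix (illustrative shapes)
x = np.random.rand(3)      # input vector
b = np.random.rand(4)      # bias vector

out = np.empty(4)          # pre-allocated result buffer
np.dot(w, x, out=out)      # writes the product into `out`, no new array
out += b                   # in-place add keeps the same data pointer

print(out.ctypes.data % 16)  # alignment of the buffer, allocator-dependent
```

Reusing one `out` buffer per layer also avoids allocating a fresh array on
every call, which can matter in a tight feedforward loop.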

Francesc
_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
--
Francesc Alted
Øystein Schønning-Johansen
2016-05-05 20:10:32 UTC
Permalink
Thanks for your answer, Francesc. Knowing that there is no NumPy solution
saves me the work of searching for one. I've not tried the solution described
at SO, but it looks like a real performance killer. I'd rather try to
override malloc with glibc's malloc hooks or LD_PRELOAD tricks. Do you think
that will do it? I'll try it and report back.

Thanks,
-Øystein
Charles R Harris
2016-05-05 20:32:46 UTC
Permalink
On Thu, May 5, 2016 at 2:10 PM, Øystein Schønning-Johansen wrote:
Post by Øystein Schønning-Johansen
I'd rather try to override malloc with glibc's malloc hooks or
LD_PRELOAD tricks. Do you think that will do it?
You might take a look at how NumPy handles this in
`numpy/core/src/umath/simd.inc.src`.

<snip>

Chuck
Julian Taylor
2016-05-06 20:22:38 UTC
Permalink
Note that anything larger than 16-byte alignment is unnecessary for
SIMD purposes on current hardware (>= Haswell); 16 bytes is the default
malloc alignment on amd64. And even on older chips (Sandy Bridge) the
penalty for unaligned access is pretty minor.
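To see what alignment your allocator actually hands out, one can sample the
data pointers of freshly allocated arrays. This is a quick diagnostic only;
the result depends on the platform's malloc, not on any NumPy guarantee.

```python
import numpy as np

# Record each fresh array's misalignment relative to 16 bytes.
# On a typical 64-bit glibc system malloc returns blocks aligned to
# at least 16 bytes, so this set is usually just {0}.
mods = {np.empty(1000).ctypes.data % 16 for _ in range(100)}
print(sorted(mods))
```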
Francesc Alted
2016-05-06 13:01:32 UTC
Permalink
Post by Øystein Schønning-Johansen
I'd rather try to override malloc with glibc's malloc hooks or
LD_PRELOAD tricks. Do you think that will do it?
I don't think you need that much weaponry. Just create an array with some
spare room for alignment. Say you want a 64-byte aligned double-precision
array. Create your desired array plus 64 additional bytes (8 doubles):

In [92]: a = np.zeros(int(1e6) + 8)

In [93]: a.ctypes.data % 64
Out[93]: 16

and compute the elements to shift this:

In [94]: shift = (64 // a.itemsize) - (a.ctypes.data % 64) // a.itemsize

In [95]: shift
Out[95]: 6

now, create a view with the required elements less:

In [98]: b = a[shift:-((64 // a.itemsize) - shift)]

In [99]: len(b)
Out[99]: 1000000

In [100]: b.ctypes.data % 64
Out[100]: 0

and voilà, b is now aligned to 64 bytes. As the view is a copy-free
operation, this is fast, and you only waste 64 bytes. Pretty cheap indeed.
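The steps above can be wrapped in a small helper. `aligned_zeros` is a
hypothetical name for illustration, not a NumPy API; it also handles the
case where the buffer happens to be aligned already (where the transcript's
negative-slice form would misbehave).

```python
import numpy as np

def aligned_zeros(n, alignment=64, dtype=np.float64):
    """Return a zeroed 1-D array of n items whose data pointer is aligned
    to `alignment` bytes, via the over-allocate-and-slice trick above.
    (Illustrative helper, not part of NumPy.)"""
    itemsize = np.dtype(dtype).itemsize
    extra = alignment // itemsize                    # spare items to shift into
    a = np.zeros(n + extra, dtype=dtype)
    # Number of items to skip to reach the first aligned element.
    shift = ((alignment - a.ctypes.data % alignment) % alignment) // itemsize
    return a[shift:shift + n]                        # copy-free aligned view

b = aligned_zeros(1_000_000)
```

One caveat: the result is a view, so it keeps the oversized base array alive,
and any code that later copies it (e.g. `np.ascontiguousarray` on a reshaped
version) may lose the alignment again.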

Francesc
Antoine Pitrou
2016-05-07 11:02:14 UTC
Permalink
Here's an obligatory plug for the two following PRs:
https://github.com/numpy/numpy/pull/5457
https://github.com/numpy/numpy/pull/5470

Regards

Antoine.

