Discussion:
[Numpy-discussion] numpy arrays, data allocation and SIMD alignment
David Cournapeau
2007-08-03 06:06:58 UTC
Permalink
Hi,

Following an ongoing discussion with S. Johnson, one of the developers
of fftw3, I would be interested in what people think about adding
infrastructure to numpy related to SIMD alignment (that is, 16-byte
alignment for SSE/Altivec; I don't know anything about other archs).
The problem is that right now, it is difficult to get alignment
information in numpy (by alignment here, I mean something different from
what is normally meant in the numpy context: whereas, in my understanding,
NPY_ALIGNED refers to a pointer that is aligned wrt its type, here I
am talking about arbitrary alignment).
For example, for fftw3, we need to know whether a given data buffer is
16-byte aligned to get optimal performance; generally, SSE needs 16-byte
alignment for optimal performance, as does Altivec. I think it
would be nice to have some infrastructure to help developers get that
kind of information, and maybe to be able to request 16-byte aligned buffers.
Here is what I can think of:
- adding an API to know whether a given PyArrayObject has its data
buffer 16-byte aligned, and to request a 16-byte aligned
PyArrayObject. Something like NPY_ALIGNED, basically.
- forcing data allocation to be 16-byte aligned in numpy (e.g.
defining PyDataMem_NEW to use a 16-byte aligned allocator instead of
malloc). This would mean that many arrays would be "naturally" 16-byte
aligned without effort.

Point 2 is really easy to implement, I think: on some platforms
(Mac OS X and FreeBSD), malloc returns 16-byte aligned buffers
anyway, so I don't think the wasted space is a real problem. Linux with
glibc aligns to 8 bytes; I don't know about Windows. Implementing our
own 16-byte aligned memory allocator for cross-platform compatibility
should be relatively easy. I don't see any drawback, but I guess other
people will.
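As a data point, the alignment a platform's allocator actually delivers can be probed from Python today; here is a quick sketch (the helper name `data_alignment` is made up for illustration, and the observed values depend entirely on the platform's malloc):

```python
import numpy as np

def data_alignment(a):
    """Largest power of two (up to 64) dividing the address of a's data buffer."""
    addr = a.ctypes.data
    width = 1
    while width < 64 and addr % (width * 2) == 0:
        width *= 2
    return width

# Freshly allocated arrays of various sizes; glibc typically gives
# 8- or 16-byte alignment, Mac OS X gives 16.
alignments = [data_alignment(np.zeros(n)) for n in (1, 10, 1000)]
print(alignments)
```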

Point 1 is more tricky, as this requires much more changes in the code.

Do the main developers of numpy have an opinion on this?

cheers,

David
Matthew Brett
2007-08-03 13:38:36 UTC
Permalink
Hi,
Post by David Cournapeau
Following an ongoing discussion with S. Johnson, one of the developers
of fftw3, I would be interested in what people think about adding
infrastructure to numpy related to SIMD alignment (that is, 16-byte
alignment for SSE/Altivec; I don't know anything about other archs).
The problem is that right now, it is difficult to get alignment
information in numpy (by alignment here, I mean something different from
what is normally meant in the numpy context: whereas, in my understanding,
NPY_ALIGNED refers to a pointer that is aligned wrt its type, here I
am talking about arbitrary alignment).
Excellent idea if practical...

Matthew
Andrew Straw
2007-08-03 15:12:44 UTC
Permalink
Dear David,

Both ideas, particularly the 2nd, would be excellent additions to numpy.
I often use the Intel IPP (Integrated Performance Primitives) Library
together with numpy, but I have to do all my memory allocation with the
IPP to ensure fastest operation. I then create numpy views of the data.
All this works brilliantly, but it would be really nice if I could
allocate the memory directly in numpy.

IPP allocates, and says it wants, 32 byte aligned memory (see, e.g.
http://www.intel.com/support/performancetools/sb/CS-021418.htm ). Given
that fftw3 apparently wants 16 byte aligned memory, my feeling is that,
if the effort is made, the alignment width should be specified at
run-time, rather than hard-coded.

In terms of implementation of your 1st point, I'm not aware of how much
effort your idea would take (and it does sound nice), but some benefit
would be had just from a simple function numpy.is_mem_aligned( ndarray,
width=16 ) which returns a bool.
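Such a function would be only a few lines of Python; a sketch (`is_mem_aligned` is just the suggested name here, not an existing numpy function):

```python
import numpy as np

def is_mem_aligned(a, width=16):
    """True if ndarray a's data buffer starts on a width-byte boundary."""
    return a.ctypes.data % width == 0

a = np.zeros(1024)
# Alignment to 1 byte always holds; 16-byte alignment depends on malloc.
print(is_mem_aligned(a, 1), is_mem_aligned(a, 16))
```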

Cheers!
Andrew
_______________________________________________
Numpy-discussion mailing list
http://projects.scipy.org/mailman/listinfo/numpy-discussion
David Cournapeau
2007-08-04 03:28:34 UTC
Permalink
Post by Andrew Straw
Dear David,
Both ideas, particularly the 2nd, would be excellent additions to numpy.
I often use the Intel IPP (Integrated Performance Primitives) Library
together with numpy, but I have to do all my memory allocation with the
IPP to ensure fastest operation. I then create numpy views of the data.
All this works brilliantly, but it would be really nice if I could
allocate the memory directly in numpy.
IPP allocates, and says it wants, 32 byte aligned memory (see, e.g.
http://www.intel.com/support/performancetools/sb/CS-021418.htm ). Given
that fftw3 apparently wants 16 byte aligned memory, my feeling is that,
if the effort is made, the alignment width should be specified at
run-time, rather than hard-coded.
I think that doing it at runtime would be overkill, no? I was thinking
about making it a compile-time option. Generally, at the ASM level, you
need 16-byte alignment (for instructions like movaps, which takes 16
bytes from memory and puts them into an SSE register); this is not just
fftw. Maybe 32-byte alignment is useful for cache reasons, I don't know.

I don't think it would be difficult to implement and validate; what I
don't know at all is the implication of this at the binary level, if any.

cheers,

David
Charles R Harris
2007-08-04 05:30:46 UTC
Permalink
Charles R Harris
2007-08-04 06:06:15 UTC
Permalink
(1) Use static variables instead of dynamic (stack) variables
(2) Use in-line assembly code that explicitly aligns data
(3) In C code, use "malloc" to explicitly allocate variables

; procedure prologue
push ebp
mov ebp, esp
and esp, -8
sub esp, 12
; procedure epilogue
add esp, 12
pop ebp
ret

double *p, *newp;
p = (double *) malloc((sizeof(double) * NPTS) + 4);
newp = (double *) (((uintptr_t) p + 4) & ~(uintptr_t) 7);

This assures that newp is 8-byte aligned even if p is not (assuming
malloc returns at least 4-byte aligned memory). However, malloc() may
already follow Intel's recommendation that 32-byte or greater data
structures be aligned on a 32-byte boundary. In that case,
increasing the requested memory by 4 bytes and computing newp are
superfluous.
I think that for numpy arrays it should be possible to define the offset so
that the result is 32-byte aligned. However, this might break some people's
code if they haven't paid attention to the offset. Another possibility is
to allocate an oversized array, check the pointer, and take a range out of
it. For instance:

In [32]: a = zeros(10)

In [33]: a.ctypes.data % 32
Out[33]: 16

The array alignment is 16 bytes, consequently

In [34]: a[2:].ctypes.data % 32
Out[34]: 0

Voila, 32-byte alignment. I think a short python routine could do this,
which ought to serve well for 1-D FFTs. Multidimensional arrays will be
trickier if you want the rows to be aligned; aligning the columns just
isn't going to work.
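The oversize-and-slice trick can be wrapped in a short helper; a sketch (the name `aligned_zeros` is made up for illustration):

```python
import numpy as np

def aligned_zeros(n, dtype=np.float64, align=32):
    """1-D zeroed array of n items whose data buffer starts on an
    align-byte boundary: oversize a byte buffer, then slice and view."""
    itemsize = np.dtype(dtype).itemsize
    nbytes = n * itemsize
    buf = np.zeros(nbytes + align, dtype=np.uint8)
    offset = (-buf.ctypes.data) % align   # bytes to the next boundary
    return buf[offset:offset + nbytes].view(dtype)

a = aligned_zeros(10, align=32)
assert a.ctypes.data % 32 == 0
```

Note that the returned array keeps the oversized byte buffer alive through its .base attribute, which is part of the overhead of doing this at the Python level.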

Chuck
David Cournapeau
2007-08-04 06:25:38 UTC
Permalink
Post by Charles R Harris
I think that for numpy arrays it should be possible to define the
offset so that the result is 32-byte aligned. However, this might
break some people's code if they haven't paid attention to the offset.
Why? I really don't see how it can break anything at the source code
level. You don't have to care about things you didn't care about before:
the best proof of that is that numpy runs on different platforms where
malloc has different alignment guarantees (Mac OS X already aligns to
16 bytes, for the very reason of making SIMD optimization easier,
whereas glibc malloc only aligns to 8 bytes, at least on Linux).
Post by Charles R Harris
Another possibility is to allocate an oversized array, check the
pointer, and take a range out of it. For instance:
In [32]: a = zeros(10)
In [33]: a.ctypes.data % 32
Out[33]: 16
The array alignment is 16 bytes, consequently
In [34]: a[2:].ctypes.data % 32
Out[34]: 0
Voila, 32 byte alignment. I think a short python routine could do
this, which ought to serve well for 1D fft's. Multidimensional arrays
will be trickier if you want the rows to be aligned. Aligning the
columns just isn't going to work.
I am not suggesting realigning existing arrays. What I would like numpy
to support are the following cases:

- Check whether a given numpy array is SIMD aligned:

/* Simple case: if aligned, use the optimized function, the
   non-optimized one otherwise */
int simd_func(double* in, size_t n);
int nosimd_func(double* in, size_t n);

if (PyArray_ISALIGNED_SIMD(a)) {
    simd_func((double *)a->data, a->size);
} else {
    nosimd_func((double *)a->data, a->size);
}
- Explicitly request an aligned array from any PyArray_* function
which creates an ndarray, e.g.: ar = PyArray_FROM_OF(a, NPY_SIMD_ALIGNED);

Allocating a buffer with a given alignment is not the problem:
there is a POSIX function to do it, and we can easily implement a
fallback for the OSes that do not support it. This would be done in C,
not in Python.

cheers,

David
Anne Archibald
2007-08-04 07:24:55 UTC
Permalink
Steven G. Johnson
2007-08-05 03:20:31 UTC
Permalink
Post by Anne Archibald
* A mechanism for requesting numpy arrays with buffers aligned to an
arbitrary power-of-two size (basically just using posix_memalign or
some horrible hack on platforms that don't have it).
Right, you might as well allow the alignment (to a power-of-two size)
to be specified at runtime, as there is really no cost to implementing
an arbitrary alignment once you have any alignment.

Although you should definitely use posix_memalign (or the old
memalign) where it is available, unfortunately it's not implemented on
all systems, e.g. MacOS X and FreeBSD don't have it, last I checked
(although in both cases their malloc is 16-byte aligned). Microsoft
VC++ has a function called _aligned_malloc which is equivalent.

However, since MinGW (www.mingw.org) didn't have an _aligned_malloc
function, I wrote one for them a few years ago and put it in the
public domain (I use MinGW to cross-compile to Windows from Linux and
need the alignment). You are free to use it as a fallback on systems
that don't have a memalign function if you want. It should work on
any system where sizeof(void*) is a power of two (i.e. every extant
architecture, that I know of). You can download it and its test
program from:
ab-initio.mit.edu/~stevenj/align.c
ab-initio.mit.edu/~stevenj/tstalign.c
It just uses malloc with a little extra padding as needed to align the
data, plus a copy of the original pointer so that you can still free
and realloc (using _aligned_free and _aligned_realloc). It could be
made a bit more efficient, but it probably doesn't matter.
Post by Anne Archibald
* A macro (in C, and some way to get the same information from python,
perhaps just "a.ctypes.data % 16") to test for common alignment cases;
SIMD alignment and arbitrary power-of-two alignment are probably
sufficient.
In C this is easy, just ((uintptr_t) pointer) % 16 == 0.

You might also consider a way to set the default alignment of numpy
arrays at runtime, rather than requesting aligned arrays
individually. e.g. so that someone could come along at a later date
to a large program and just add one function call to make all the
arrays 16-byte aligned to improve performance using SIMD libraries.

Regards,
Steven G. Johnson
Lisandro Dalcin
2007-08-06 20:08:50 UTC
Permalink
Post by David Cournapeau
- adding an API to know whether a given PyArrayObject has its data
buffer 16 bytes aligned, and requesting a 16 bytes aligned
PyArrayObject. Something like NPY_ALIGNED, basically.
- forcing data allocation to be 16 bytes aligned in numpy (eg
define PyDataMem_Mem to a 16 bytes aligned allocator instead of malloc).
All this sounds pretty similar to the std::allocator found in the C++
STL (http://www.sgi.com/tech/stl/Allocators.html). Perhaps a NumPy
array could be associated with an instance of an 'allocator' object
(surely written in C, perhaps subclassable in Python) providing
appropriate methods for
alloc/dealloc(/realloc?/initialize(memset)?/copy(memcpy)?) memory.

This would be really nice, as it is extensible (you could even write a
custom allocator, perhaps making use of a preallocated, static pool;
use of C++ new/delete; use of any C++ std::allocator, shared memory,
etc. etc.). I think this is the direction to go, but I have no idea how
difficult it would be to implement.
--
Lisandro Dalcín
---------------
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594
David Cournapeau
2007-08-07 03:41:03 UTC
Permalink
Well, when I proposed the SIMD extension, I was willing to implement the
proposal, and it was for a simple goal: enabling better integration
with the many numeric libraries which need SIMD alignment.

As nice as a custom allocator might be, I will certainly not implement
it myself. For SIMD, I think the added complexity is worth the benefit
(since there is not much change to the API and implementation),
and I know more or less how to do it; a custom allocator is an
entirely different story. That's really more complex; static pools may
be useful in some cases (but that's not obvious, since only the data are
allocated with this buffer, everything else being allocated through the
Python memory allocator, and numpy arrays have pretty simple memory
allocation patterns).

David
Anne Archibald
2007-08-07 04:49:09 UTC
Permalink
I have to agree. I can hardly volunteer David for anything, and I
don't have time to implement this myself, but I think a custom
allocator is a rather special-purpose tool; if one were to implement
one, I think the way to go would be to implement a subclass of ndarray
(or just a constructor) that allocated the memory. This could be done
from python, since you can make an ndarray from scratch using a given
memory array. Of course, making temporaries be allocated with the
correct allocator will be very complicated, since it's unclear which
allocator should be used.

Adding SIMD alignment should be a very small modification; it can be
done as simply as using ctypes to wrap posix_memalign (or a portable
version, possibly written in python) and writing a simple python
function that checks the beginning data address. There's really no
need to make it complicated.

Anne
David Cournapeau
2007-08-07 05:00:20 UTC
Permalink
Anne, you said previously that it was easy to allocate buffers with a
given alignment at runtime. Could you point me to a document which
explains how? For platforms without posix_memalign, I don't see how to
implement a memory allocator with an arbitrary alignment (more
precisely, I don't see how to free it if I cannot assume a fixed
alignment: how do I know where the "real" pointer is?).

David
Anne Archibald
2007-08-07 05:33:24 UTC
Permalink
Post by David Cournapeau
Anne, you said previously that it was easy to allocate buffers with a
given alignment at runtime. Could you point me to a document which
explains how? For platforms without posix_memalign, I don't see how to
implement a memory allocator with an arbitrary alignment (more
precisely, I don't see how to free it if I cannot assume a fixed
alignment: how do I know where the "real" pointer is?).
Well, it can be done in Python: just allocate a too-big ndarray and
take a slice that's the right shape and has the right alignment. But
this sucks. Steven G. Johnson posted code earlier in this thread that
provides a portable aligned-memory allocator - it handles the freeing
by (always) storing enough information to recover the original pointer
in the padding space. (This means you always need to pad, which is a
pain, but there's not much you can do about that.) His implementation
stores the original pointer just before the beginning of the aligned
data, so _aligned_free is free(((void**)ptr)[-1]). If you were worried
about space (rather than time) you could store a single byte just
before the pointer whose value indicated how much padding was done, or
whatever.
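The pointer arithmetic behind that scheme is easy to verify; here is a sketch in plain Python that mimics the C bookkeeping on integer addresses (`PTR_SIZE` stands in for sizeof(void*), and the alignment must be a power of two for the mask trick to work):

```python
PTR_SIZE = 8  # stand-in for sizeof(void*)

def aligned_offset(raw_addr, align):
    """Offset from the raw malloc address to the first align-byte
    boundary that leaves room to stash the original pointer just
    before the user data (the _aligned_malloc layout)."""
    candidate = raw_addr + PTR_SIZE          # leave headroom for the pointer
    return ((candidate + align - 1) & ~(align - 1)) - raw_addr

# Simulate arbitrary malloc results and check the invariants.
for raw in range(1, 1024):
    off = aligned_offset(raw, 16)
    assert (raw + off) % 16 == 0      # user pointer is aligned
    assert off >= PTR_SIZE            # room for the stashed pointer
    assert off <= PTR_SIZE + 15       # padding bounded by align - 1
```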

These schemes all waste space, but unless malloc's internal structures
are the size of the alignment block, it's almost unavoidable to waste
some space; the only way around it I can see is if the program also
allocates lots of small, odd-shaped, unaligned blocks of memory that
can be used to fill the gaps (and even then I doubt any sensible
malloc implementation fills in little gaps like this, since it seems
likely to lead to memory fragmentation). A posix_memalign that is
built into malloc can do better than any implementation that isn't,
though, with the possible exception of a specialized pool allocator
built with aligned allocation in mind.

Anne
David Cournapeau
2007-08-07 06:11:52 UTC
Permalink
Post by Anne Archibald
Well, it can be done in Python: just allocate a too-big ndarray and
take a slice that's the right shape and has the right alignment. But
this sucks. Steven G. Johnson posted code earlier in this thread that
provides a portable aligned-memory allocator - it handles the freeing
by (always) storing enough information to recover the original pointer
in the padding space. (This means you always need to pad, which is a
pain, but there's not much you can do about that.)
This is indeed not rocket science; I feel a bit ashamed :) I don't see
the problem with padding (except wasted time)?
Post by Anne Archibald
His implementation
stores the original pointer just before the beginning of the aligned
data, so _aligned_free is free(((void**)ptr)[-1]). If you were worried
about space (rather than time) you could store a single byte just
before the pointer whose value indicated how much padding was done, or
whatever.
I really don't see how space would be a problem in our situation: it is
not like we will pad more than a few bytes; and in a case where it were,
I don't see how python would be the right choice anymore anyway. I will
try to prepare a patch in the next few days, then.

cheers,

David
Stefan van der Walt
2007-08-08 09:53:30 UTC
Permalink
Post by Anne Archibald
Well, it can be done in Python: just allocate a too-big ndarray and
take a slice that's the right shape and has the right alignment. But
this sucks.
Could you explain to me why this is such a bad idea?

Stéfan
Anne Archibald
2007-08-08 15:29:55 UTC
Permalink
Post by Stefan van der Walt
Post by Anne Archibald
Well, it can be done in Python: just allocate a too-big ndarray and
take a slice that's the right shape and has the right alignment. But
this sucks.
Could you explain to me why this is such a bad idea?
Oh. Well, it's not *terrible*; it gets you an aligned array. But you
have to allocate the original array as a 1D byte array (to allow for
arbitrary realignments) and then align it, reshape it, and reinterpret
it as a new type. Plus you're allocating an extra ndarray structure,
which will live as long as the new array does; this not only wastes
even more memory than the portable alignment solutions, it clogs up
python's garbage collector.

It's not outrageous, if you need aligned arrays *now*, on a released
version of numpy, but numpy itself should do better.

Anne
Charles R Harris
2007-08-08 16:04:27 UTC
Permalink
Post by Anne Archibald
Post by Stefan van der Walt
Post by Anne Archibald
Well, it can be done in Python: just allocate a too-big ndarray and
take a slice that's the right shape and has the right alignment. But
this sucks.
Could you explain to me why is this such a bad idea?
Oh. Well, it's not *terrible*; it gets you an aligned array. But you
have to allocate the original array as a 1D byte array (to allow for
arbitrary realignments) and then align it, reshape it, and reinterpret
it as a new type. Plus you're allocating an extra ndarray structure,
which will live as long as the new array does; this not only wastes
even more memory than the portable alignment solutions, it clogs up
python's garbage collector.
The ndarray structure doesn't take up much memory, it is the data that is
large and the data is shared between the original array and the slice. Nor
does the data type of the slice need changing, one simply uses the desired
type to begin with, or at least a type of the right size so that a view will
do the job without copies. Nor do I see how the garbage collector will get
clogged up, slices are a common feature of using numpy. The slice method
also has the advantage of being compiler and operating system independent,
there is a reason Intel used that approach.

Aligning multidimensional arrays might indeed be complicated, but I suspect
those complications will be easier to handle in Python than in C.

Chuck
Anne Archibald
2007-08-08 18:23:44 UTC
Permalink
Post by Charles R Harris
Post by Anne Archibald
Oh. Well, it's not *terrible*; it gets you an aligned array. But you
have to allocate the original array as a 1D byte array (to allow for
arbitrary realignments) and then align it, reshape it, and reinterpret
it as a new type. Plus you're allocating an extra ndarray structure,
which will live as long as the new array does; this not only wastes
even more memory than the portable alignment solutions, it clogs up
python's garbage collector.
The ndarray structure doesn't take up much memory, it is the data that is
large and the data is shared between the original array and the slice. Nor
does the data type of the slice need changing, one simply uses the desired
type to begin with, or at least a type of the right size so that a view will
do the job without copies. Nor do I see how the garbage collector will get
clogged up, slices are a common feature of using numpy. The slice method
also has the advantage of being compiler and operating system independent,
there is a reason Intel used that approach.
Aligning multidimensional arrays might indeed be complicated, but I suspect
those complications will be easier to handle in Python than in C.
Can we assume that numpy arrays allocated to contain (say) complex64s
are aligned to a 16-byte boundary? I don't think they will
necessarily, so the shift we need may not be an integer number of
complex64s. float96s pose even more problems. So to ensure alignment,
we do need to do type conversion; if we're doing it anyway, byte
arrays require the least trust in malloc().

The ndarray object isn't too big, probably some twenty or thirty
bytes, so I'm not talking about a huge waste. But it is a python
object, and the garbage collector needs to walk the whole tree of
accessible python objects every time it runs, so this is one more
object on the list.

As an aside: numpy's handling of ndarray objects is actually not
ideal; if you want to exhaust memory on your system, do:

a = arange(5)
while True:
    a = a[::-1]

Each ndarray object keeps alive the ndarray object it is a slice of,
so this operation creates an ever-growing linked list of ndarray
objects. Seems to me it would be better to keep a pointer only to the
original object that holds the address of the buffer (so it can be
freed).

Aligning multidimensional arrays is an interesting question. To first
order, aligning the first element should be enough. If the row size in
bytes is not divisible by the alignment, though, lower-dimensional
complete slices may not be aligned:

A = aligned_empty((7,5),dtype=float,alignment=16)

Then A is aligned, as is A[0,:], but A[1,:] is not.

So in this case we might want to actually allocate a 7-by-6 array and
take a slice, so that each row stride is a multiple of the alignment.
This does mean it won't be contiguous in memory, so that
flattening it requires a copy (which may not wind up aligned). This is
something we might want to do - that is, make available as an option -
in python.
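The row-padding idea can be sketched as a small helper (the name `aligned_empty_2d` is hypothetical; it assumes `align` is a multiple of the itemsize, pads the row length so the row stride is a multiple of `align`, and returns a column slice, so the result is non-contiguous exactly as described above):

```python
import numpy as np

def aligned_empty_2d(shape, dtype=np.float64, align=16):
    """2-D zeroed array where every row starts on an align-byte boundary."""
    rows, cols = shape
    itemsize = np.dtype(dtype).itemsize
    # Pad the row length up so the row stride is a multiple of align
    # (assumes align is a multiple of itemsize).
    padded_cols = -(-cols * itemsize // align) * align // itemsize
    buf = np.zeros(rows * padded_cols * itemsize + align, dtype=np.uint8)
    offset = (-buf.ctypes.data) % align
    full = buf[offset:offset + rows * padded_cols * itemsize].view(dtype)
    return full.reshape(rows, padded_cols)[:, :cols]

A = aligned_empty_2d((7, 5), align=16)
assert all(A[i].ctypes.data % 16 == 0 for i in range(7))
assert not A.flags['C_CONTIGUOUS']   # flattening would need a copy
```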

Anne
Charles R Harris
2007-08-08 18:58:13 UTC
Permalink
Anne,
Post by Anne Archibald
Can we assume that numpy arrays allocated to contain (say) complex64s
are aligned to a 16-byte boundary? I don't think they will
necessarily, so the shift we need may not be an integer number of
complex64s. float96s pose even more problems. So to ensure alignment,
we do need to do type conversion; if we're doing it anyway, byte
arrays require the least trust in malloc().
I think that is a safe assumption, it is probably almost as safe as assuming
binary and two's complement, likely more safe than assuming IEEE 754. I
expect almost all 32 bit OS's to align on 4 byte boundaries at worst, 64 bit
machines to align on 8 byte boundaries. Even C structures are typically
filled out with blanks to preserve some sort of alignment. That is because
of addressing efficiency, or even the impossibility of odd addressing --
depends on the architecture. Sometimes even byte addressing is easier to get
by putting a larger integer on the bus and extracting the relevant part. In
addition, I expect the heap implementation to make some alignment decisions
for efficiency.

My 64 bit linux on Intel aligns arrays, whatever the data type, on 16 byte
boundaries. It might be interesting to see what happens with the Intel and
MSVC compilers, but I expect similar results. PPC's, Sun and SGI need to be
checked, but I don't expect problems. I think that will cover almost all
architectures numpy is likely to run on.
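The alignment claim above is easy to probe empirically; a quick sketch (the distribution you see depends on your platform's allocator, which is exactly the point of the discussion):

```python
import numpy as np
from collections import Counter

# Tally the data-pointer alignment (mod 16) of freshly allocated arrays.
# On 64-bit glibc, malloc typically hands out 16-byte-aligned blocks;
# other platforms and allocators may differ.
counts = Counter()
for n in range(1, 200):
    a = np.empty(n, dtype=np.float64)
    counts[a.ctypes.data % 16] += 1
print(dict(counts))   # e.g. mostly {0: ...} on a 16-byte-aligned allocator
```

Running this on the machines mentioned here (PPC, Sun, SGI, Windows) would settle the question without reading allocator source.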
Post by Anne Archibald
The ndarray object isn't too big, probably some twenty or thirty
bytes, so I'm not talking about a huge waste. But it is a python
object, and the garbage collector needs to walk the whole tree of
accessible python objects every time it runs, so this is one more
object on the list.
As an aside: numpy's handling of ndarray objects is actually not
ideal; consider repeatedly executing
a = arange(5)
a = a[::-1]
Well, that's a pathological case present in numpy. Fixing it doesn't seem to
be a high priority although there is a ticket somewhere.

Post by Anne Archibald
Each ndarray object keeps alive the ndarray object it is a slice of,
so this operation creates an ever-growing linked list of ndarray
objects. Seems to me it would be better to keep a pointer only to the
original object that holds the address of the buffer (so it can be
freed).
Aligning multidimensional arrays is an interesting question. To first
order, aligning the first element should be enough. If the dimensions
of the array are not divisible by the alignment, though, this means
that lower-dimensional complete slices may not be aligned:
A = aligned_empty((7,5),dtype=float,alignment=16)
Then A is aligned, as is A[0,:], but A[1,:] is not.
So in this case we might want to actually allocate an 8-by-5 array and
take a slice. This does mean it won't be contiguous in memory, so that
flattening it requires a copy (which may not wind up aligned). This is
something we might want to do - that is, make available as an option -
in python.
I think that is better viewed as need-based. I suspect that if you really
need such alignment it is better to start with array dimensions that will
naturally align the rows. It will be impossible to naturally align all the
columns unless the data type is the correct size.

Chuck
Matthieu Brucher
2007-08-08 19:01:59 UTC
Permalink
Post by Charles R Harris
My 64 bit linux on Intel aligns arrays, whatever the data type, on 16 byte
boundaries. It might be interesting to see what happens with the Intel and
MSVC compilers, but I expect similar results.
According to the doc on the msdn, the data should be 16-bits aligned.

Matthieu
Charles R Harris
2007-08-08 19:16:41 UTC
Permalink
Post by Charles R Harris
My 64 bit linux on Intel aligns arrays, whatever the data type, on 16 byte
Post by Charles R Harris
boundaries. It might be interesting to see what happens with the Intel and
MSVC compilers, but I expect similar results.
According to the doc on the msdn, the data should be 16-bits aligned.
Shades of DOS and 16 bit machines. Have you checked what actually happens on
modern hardware?

Chuck
David Cournapeau
2007-08-09 03:08:45 UTC
Permalink
Post by Charles R Harris
Anne,
Post by Anne Archibald
Oh. Well, it's not *terrible*; it gets you an aligned array. But you
have to allocate the original array as a 1D byte array (to allow for
arbitrary realignments) and then align it, reshape it, and reinterpret
it as a new type. Plus you're allocating an extra ndarray structure,
which will live as long as the new array does; this not only wastes
even more memory than the portable alignment solutions, it clogs up
python's garbage collector.
Post by Charles R Harris
The ndarray structure doesn't take up much memory, it is the data that is
large and the data is shared between the original array and the slice. Nor
does the data type of the slice need changing, one simply uses the desired
type to begin with, or at least a type of the right size so that a view will
do the job without copies. Nor do I see how the garbage collector will get
clogged up, slices are a common feature of using numpy. The slice method
also has the advantage of being compiler and operating system independent,
there is a reason Intel used that approach.
I am not sure I understand which approach to which problem you are
talking about here?

IMHO, the discussion is becoming a bit carried away. What I was
suggesting is
- being able to check whether a given data buffer is aligned to a
given alignment (easy)
- being able to request an aligned data buffer: requires aligned
memory allocators, and some additions to the API for creating arrays.

This all boils down to the following case: I have a C function which
requires N-byte-aligned data, and I want the numpy API to provide this
capability. I don't understand the discussion on doing it in python:
first, this means you cannot request a data buffer at the C level, and I
don't understand the whole discussion on slices, multiple dimensions and so
on either: at the C level, different libraries may need different array
formats, and in the case of fftw, all it cares about is the alignment of
the data pointer. For contiguous, C-order arrays, as long as the data
pointer is aligned, I don't think we need more; are some people familiar
with the MKL who could tell whether we need more?
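The "check" half of the proposal is a one-line address computation: a buffer is N-byte aligned iff its address is a multiple of N. The same arithmetic applies to a raw C pointer; here is the python form (`is_aligned` is an illustrative helper, not numpy API):

```python
import numpy as np

def is_aligned(arr, alignment=16):
    # A data pointer is N-byte aligned iff its address is a multiple of N.
    return arr.ctypes.data % alignment == 0

a = np.empty(100, dtype=np.float64)
print(is_aligned(a, 16))   # platform-dependent: True where malloc returns 16-byte blocks
assert is_aligned(a, 8)    # malloc guarantees at least the natural alignment of a double
```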

cheers,

David
Charles R Harris
2007-08-09 07:17:28 UTC
Permalink
Post by David Cournapeau
Post by Charles R Harris
Anne,
Post by Anne Archibald
Oh. Well, it's not *terrible*; it gets you an aligned array. But you
have to allocate the original array as a 1D byte array (to allow for
arbitrary realignments) and then align it, reshape it, and reinterpret
it as a new type. Plus you're allocating an extra ndarray structure,
which will live as long as the new array does; this not only wastes
even more memory than the portable alignment solutions, it clogs up
python's garbage collector.
Post by Charles R Harris
The ndarray structure doesn't take up much memory, it is the data that is
large and the data is shared between the original array and the slice. Nor
does the data type of the slice need changing, one simply uses the desired
type to begin with, or at least a type of the right size so that a view will
do the job without copies. Nor do I see how the garbage collector will get
clogged up, slices are a common feature of using numpy. The slice method
also has the advantage of being compiler and operating system independent,
there is a reason Intel used that approach.
Post by David Cournapeau
I am not sure I understand which approach to which problem you are
talking about here?
IMHO, the discussion is becoming a bit carried away. What I was
suggesting is:
- being able to check whether a given data buffer is aligned to a
given alignment (easy)
- being able to request an aligned data buffer: requires aligned
memory allocators, and some additions to the API for creating arrays.
This all boils down to the following case: I have a C function which
requires N-byte-aligned data, I want the numpy API to provide this
Well, what you want might be very easy to do in python, we just need to
check the default alignments for doubles and floats for some of the other
compilers, architectures, and OS's out there. On the other hand, you might
not be able to request a c malloc that is aligned in a portable way without
resorting to the same tricks as you do in python. So why not use python and
get the reference counting and garbage collection along with it? What we
want are doubles 8 byte aligned and floats 4 byte aligned. That seems to be
the case with gcc, linux, and the Intel architecture. The idea is to create
a slightly oversize array, then use a slice of the proper size that is 16
byte aligned.

Chuck
David Cournapeau
2007-08-09 07:52:38 UTC
Permalink
Post by Charles R Harris
Well, what you want might be very easy to do in python, we just need
to check the default alignments for doubles and floats for some of the
other compilers, architectures, and OS's out there. On the other hand,
you might not be able to request a c malloc that is aligned in a
portable way without resorting to the same tricks as you do in python.
So why not use python and get the reference counting and garbage
collection along with it?
First, doing it in python means that I cannot use the facility from C
easily. But this is exactly where I need it, and where I would guess
most people need it. People want to interface numpy with the mkl ? They
will do it in C, right ? And maybe I am just too dumb to see the
problem, but I don't see the need for garbage collection and so on :)
Again, what is needed is:
- aligned allocator -> we can use the one from Steven Johnson, used
in fftw, which supports more or less the same archs as numpy
- Refactor the array creation functions in C such that the
implementation takes one additional alignment argument, and the
original functions are kept identical to before
- Add a few utility functions to check whether a buffer is SSE aligned,
arbitrarily aligned, etc...

The only non-trivial point is 2. Actually, when I first thought about
it, I thought about fixing alignment at compile time, which would have
made it totally avoidable: it would have been a simple change of the
definition of PyDataMem_New to an aligned malloc with a constant. I have
already the code for this, and besides the aligned malloc code, it is like a
5-line change of numpy code, nothing terrible, really.
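The bookkeeping an aligned allocator needs (the "where is the real pointer" question raised elsewhere in the thread) can be sketched in pure python: over-allocate, shift to an aligned address, and keep the original allocation reachable so it can be freed. In C the original pointer would be stashed just before the returned block; here the reference held by `view.base` plays that role. `aligned_buffer` is an illustrative name, not a numpy function:

```python
import numpy as np

def aligned_buffer(nbytes, alignment=16):
    # Over-allocate by `alignment` bytes so an aligned start always exists.
    raw = np.empty(nbytes + alignment, dtype=np.uint8)   # the "real" allocation
    offset = -raw.ctypes.data % alignment                # bytes to skip
    view = raw[offset:offset + nbytes]
    # view.base keeps `raw` alive; the memory is released when both die.
    return view

buf = aligned_buffer(1024, 16)
assert buf.ctypes.data % 16 == 0
assert buf.base is not None   # the original allocation is still reachable
```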

David
Charles R Harris
2007-08-09 08:05:03 UTC
Permalink
Post by David Cournapeau
Post by Charles R Harris
Well, what you want might be very easy to do in python, we just need
to check the default alignments for doubles and floats for some of the
other compilers, architectures, and OS's out there. On the other hand,
you might not be able to request a c malloc that is aligned in a
portable way without resorting to the same tricks as you do in python.
So why not use python and get the reference counting and garbage
collection along with it?
First, doing it in python means that I cannot use the facility from C
easily. But this is exactly where I need it, and where I would guess
most people need it. People want to interface numpy with the mkl ? They
will do it in C, right ? And maybe I am just too dumb to see the
problem, but I don't see the need for garbage collection and so on :)
- aligned allocator -> we can use the one from Steven Johnson, used
in fftw, which supports more or less the same archs as numpy
- Refactor the array creation functions in C such that the
implementation takes one additional alignment argument, and the
original functions are kept identical to before
- Add a few utility functions to check whether a buffer is SSE aligned,
arbitrarily aligned, etc...
The only non-trivial point is 2. Actually, when I first thought about
it, I thought about fixing alignment at compile time, which would have
made it totally avoidable: it would have been a simple change of the
definition of PyDataMem_New to an aligned malloc with a constant. I have
already the code for this, and besides the aligned malloc code, it is like a
5-line change of numpy code, nothing terrible, really.
Ah, you want it in C. Well, I think it would not be too difficult to change
PyDataMem_New, however, the function signature would change and all the code
that used it would break. That is pretty drastic. Better to define
PyDataMem_New_Aligned, then redefine PyDataMem_New to use the new function.
That way nothing breaks and you get the function you need. I don't think
Travis would get upset if you added such a function and documented it.

Chuck
David Cournapeau
2007-08-09 08:26:12 UTC
Permalink
Post by Charles R Harris
Ah, you want it in C.
What would be the use of SIMD-aligned arrays in python?

David
Charles R Harris
2007-08-09 08:58:24 UTC
Permalink
Post by David Cournapeau
Post by Charles R Harris
Ah, you want it in C.
What would be the use of SIMD-aligned arrays in python?
If I wanted a fairly specialized routine and didn't want to touch the guts
of numpy, I would pass the aligned array to a C function and use the data
pointer. The python code would be just a high level wrapper. You might even
be able to use ctypes to pass the pointer into a library function. It's not
necessary to code everything in C using the python C API.

Chuck
David Cournapeau
2007-08-09 15:03:31 UTC
Permalink
Post by Charles R Harris
Post by David Cournapeau
Post by Charles R Harris
Ah, you want it in C.
What would be the use to get SIMD aligned arrays in python ?
If I wanted a fairly specialized routine and didn't want to touch the guts
of numpy, I would pass the aligned array to a C function and use the data
pointer. The python code would be just a high level wrapper. You might even
be able to use ctypes to pass the pointer into a library function. It's not
necessary to code everything in C using the python C API.
I certainly do not argue on this point. But if it was specialized,
there would be no point putting it in numpy in the first place. What I
hope is that at some point, the aligned allocators can be used inside
core numpy to optimize things internally (ufunc, etc...). Those
facilities would be really useful for many optimized libraries, which
are all C: as such, doing it in C makes sense, no ?

David
Stefan van der Walt
2007-08-09 09:40:23 UTC
Permalink
Post by David Cournapeau
Post by Charles R Harris
Well, what you want might be very easy to do in python, we just need
to check the default alignments for doubles and floats for some of the
other compilers, architectures, and OS's out there. On the other hand,
you might not be able to request a c malloc that is aligned in a
portable way without resorting to the same tricks as you do in python.
So why not use python and get the reference counting and garbage
collection along with it?
First, doing it in python means that I cannot use the facility from C
easily. But this is exactly where I need it, and where I would guess
most people need it. People want to interface numpy with the mkl ? They
will do it in C, right ?
It doesn't really matter where the memory allocation occurs, does it?
As far as I understand, the underlying fftw function has some flag to
indicate when the data is aligned. If so, we could expose that flag
in Python, and do something like

x = align16(data)
_fft(x, is_aligned=True)

I am not intimately familiar with the fft wrappers, so maybe I'm
missing something more fundamental.

Cheers
Stéfan
David Cournapeau
2007-08-09 15:14:05 UTC
Permalink
Post by Stefan van der Walt
It doesn't really matter where the memory allocation occurs, does it?
As far as I understand, the underlying fftw function has some flag to
indicate when the data is aligned. If so, we could expose that flag
in Python, and do something like
x = align16(data)
_fft(x, is_aligned=True)
I am not intimately familiar with the fft wrappers, so maybe I'm
missing something more fundamental.
You can do that, but this is only a special case of what I have in
mind. For example, what if you want to call functions which are
relatively cheap, but called many times, and want an aligned array ?
Going back and forth would be a huge waste. Also, having aligned
buffers internally (in C), even for non-array data, can be useful (eg
filters, and maybe even core numpy functionalities like ufunc,
etc...). Another point I forgot to mention before is that we can
define a default alignment which would already be SIMD friendly (as
done on Mac OS X or FreeBSD by default malloc) for *all* numpy arrays
at 0 cost: for fft, this means that most arrays would already be as
wanted, meaning a huge boost of performance for free.

Basically, the functionalities would be more usable in C, without too
much constraint, because frankly, the implementation is not difficult:
I have something almost ready, and the patch is 7kb, including code to
detect platform dependent aligned allocator. The C code can be tested
really easily (since it is independent of python).

David

Charles R Harris
2007-08-09 07:55:50 UTC
Permalink
Post by Charles R Harris
Post by David Cournapeau
Post by Charles R Harris
Anne,
Post by Anne Archibald
Oh. Well, it's not *terrible*; it gets you an aligned array. But you
have to allocate the original array as a 1D byte array (to allow for
arbitrary realignments) and then align it, reshape it, and reinterpret
it as a new type. Plus you're allocating an extra ndarray structure,
which will live as long as the new array does; this not only wastes
even more memory than the portable alignment solutions, it clogs up
python's garbage collector.
Post by Charles R Harris
The ndarray structure doesn't take up much memory, it is the data that is
large and the data is shared between the original array and the slice. Nor
does the data type of the slice need changing, one simply uses the desired
type to begin with, or at least a type of the right size so that a view will
do the job without copies. Nor do I see how the garbage collector will get
clogged up, slices are a common feature of using numpy. The slice method
also has the advantage of being compiler and operating system independent,
there is a reason Intel used that approach.
Post by David Cournapeau
I am not sure I understand which approach to which problem you are
talking about here?
IMHO, the discussion is becoming a bit carried away. What I was
suggesting is:
- being able to check whether a given data buffer is aligned to a
given alignment (easy)
- being able to request an aligned data buffer: requires aligned
memory allocators, and some additions to the API for creating arrays.
This all boils down to the following case: I have a C function which
requires N-byte-aligned data, I want the numpy API to provide this
Post by Charles R Harris
Well, what you want might be very easy to do in python, we just need to
check the default alignments for doubles and floats for some of the other
compilers, architectures, and OS's out there. On the other hand, you might
not be able to request a C malloc that is aligned in a portable way without
resorting to the same tricks as you do in python. So why not use python and
get the reference counting and garbage collection along with it? What we
want are doubles 8 byte aligned and floats 4 byte aligned. That seems to be
the case with gcc, linux, and the Intel architecture. The idea is to create
a slightly oversize array, then use a slice of the proper size that is 16
byte aligned.
Chuck
For instance, in the case of linux-x86 and linux-x86_64, the following
should work:

In [68]: def align16(n, dtype=float64):
   ....:     size = dtype().dtype.itemsize            # bytes per element
   ....:     over = 16/size                           # extra elements, so we can shift
   ....:     data = empty(n + over, dtype=dtype)
   ....:     skip = (-data.ctypes.data % 16)/size     # elements to skip to reach a 16-byte boundary
   ....:     return data[skip:skip + n]

Of course, now you need to fill in the data.
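A self-contained version of the same trick, with explicit imports and integer division so it also runs on python 3 (the name `align16` follows Chuck's example above):

```python
import numpy as np

def align16(n, dtype=np.float64):
    size = np.dtype(dtype).itemsize
    over = 16 // size                          # extra elements so we can shift
    data = np.empty(n + over, dtype=dtype)
    skip = (-data.ctypes.data % 16) // size    # elements to skip for a 16-byte boundary
    return data[skip:skip + n]

a = align16(1000)
assert a.ctypes.data % 16 == 0   # data pointer is 16-byte aligned
assert a.shape == (1000,)
assert a.base is not None        # the oversize parent array is kept alive by the slice
```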

Chuck
Matthieu Brucher
2007-08-07 06:23:15 UTC
Permalink
Post by David Cournapeau
For platforms without posix_memalign, I don't see how to
implement a memory allocator with an arbitrary alignment (more
precisely, I don't see how to free it if I cannot assume a fixed
alignment: how do I know where the "real" pointer is ?).
Visual Studio seems to offer a counterpart (also note that malloc is
supposed to return a pointer on a 16-byte boundary) which is called
_aligned_malloc (
http://msdn2.microsoft.com/en-us/library/8z34s9c6(VS.80).aspx). It should be
what you need, at least for Windows/MSVC.

Matthieu
Charles R Harris
2007-08-07 20:26:27 UTC
Permalink
Post by Anne Archibald
Post by David Cournapeau
Well, when I proposed the SIMD extension, I was willing to implement the
proposal, and this was for a simple goal: enabling better integration
with many numeric libraries which need SIMD alignment.
As nice as a custom allocator might be, I will certainly not implement
it myself. For SIMD, I think the complexity/benefit tradeoff is worth
it (since there is not much change to the API and implementation),
and I know more or less how to do it; for custom allocator, that's an
entirely different story. That's really more complex; static pools may
be useful in some cases (but that's not obvious, since only the data are
allocated with this buffer, everything else being allocated through the
python memory allocator, and numpy arrays have pretty simple memory
allocation patterns).
I have to agree. I can hardly volunteer David for anything, and I
don't have time to implement this myself, but I think a custom
allocator is a rather special-purpose tool; if one were to implement
one, I think the way to go would be to implement a subclass of ndarray
(or just a constructor) that allocated the memory. This could be done
from python, since you can make an ndarray from scratch using a given
memory array. Of course, making temporaries be allocated with the
correct allocator will be very complicated, since it's unclear which
allocator should be used.
Maybe I'm missing something, but handling the temporaries is automatic. Just
return the appropriate slice from an array created in a subroutine. The
original array gets its reference count decremented when the routine exits
but the slice will still hold one. When the slice is deleted all the
allocated memory will get garbage collected.
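The reference-counting behaviour described here is visible from python: `base` is the attribute through which a slice keeps the oversize temporary alive after the allocating routine returns (the helper name below is illustrative):

```python
import sys
import numpy as np

def make_aligned_view():
    parent = np.empty(1003)                      # slightly oversize temporary
    skip = -parent.ctypes.data % 16 // parent.itemsize
    return parent[skip:skip + 1000]              # only the view escapes the function

v = make_aligned_view()
# The oversize parent went out of scope, but the view's .base still
# holds a reference, so the buffer is not freed until the view dies.
assert v.base is not None
assert sys.getrefcount(v.base) >= 2
v2 = v                                           # extra references don't copy the data
assert v2.ctypes.data == v.ctypes.data
```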

Chuck