Discussion:
[Numpy-discussion] Question about unaligned access
Francesc Alted
2015-07-06 15:18:13 UTC
Permalink
Hi,

I have stumbled into this:

In [62]: sa = np.fromiter(((i,i) for i in range(1000*1000)), dtype=[('f0',
np.int64), ('f1', np.int32)])

In [63]: %timeit sa['f0'].sum()
100 loops, best of 3: 4.52 ms per loop

In [64]: sa = np.fromiter(((i,i) for i in range(1000*1000)), dtype=[('f0',
np.int64), ('f1', np.int64)])

In [65]: %timeit sa['f0'].sum()
1000 loops, best of 3: 896 µs per loop

The first structured array is made of 12-byte records, while the second is
made of 16-byte records, yet the latter performs 5x faster. Also, a
structured array made of 8-byte records is the fastest (as expected):

In [66]: sa = np.fromiter(((i,) for i in range(1000*1000)), dtype=[('f0',
np.int64)])

In [67]: %timeit sa['f0'].sum()
1000 loops, best of 3: 567 µs per loop

Now, my laptop has an Ivy Bridge processor (i5-3380M) that should perform
quite well on unaligned data:

http://lemire.me/blog/archives/2012/05/31/data-alignment-for-speed-myth-or-reality/

So, if 4-year-old Intel architectures carry no penalty for unaligned
access, why am I seeing one in NumPy? That strikes me as quite strange.
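The slowdown can be traced to the record layout: with the packed dtype above, every other 'f0' element starts at an address that is not a multiple of 8. A minimal sketch for inspecting this (my own illustration, not part of the original post):

```python
import numpy as np

# Packed (default) structured dtype: int64 + int32 = 12-byte records,
# so consecutive 'f0' elements sit 12 bytes apart and half of them
# start at addresses not divisible by 8.
packed = np.dtype([('f0', np.int64), ('f1', np.int32)])
print(packed.itemsize)        # 12 -- no padding inserted

sa = np.zeros(4, dtype=packed)
f0 = sa['f0']                 # a strided view into the records
print(f0.strides)             # (12,) -- stride is not a multiple of 8
print(f0.flags['ALIGNED'])    # False on typical 64-bit platforms
```

Because the view is flagged as unaligned, reductions over it cannot take NumPy's fast contiguous path.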

Thanks,
Francesc
--
Francesc Alted
Francesc Alted
2015-07-06 15:28:37 UTC
Permalink
Oops, forgot to mention my NumPy version:

In [72]: np.__version__
Out[72]: '1.9.2'

Francesc
--
Francesc Alted
Jaime Fernández del Río
2015-07-06 16:04:11 UTC
Permalink
I believe that, the way NumPy is set up, it never does unaligned access,
regardless of the platform, in case it gets run on one that would go up in
flames if you tried to. So my guess would be that you are seeing chunked
copies into a buffer, as opposed to bulk copying or no copying at all, and
that would explain your timing differences. But Julian or Sebastian can
probably give you a more informed answer.
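A quick way to see the two code paths side by side (my own sketch, assuming the buffered-copy explanation above) is to compare the strided field view with an explicit contiguous copy of it:

```python
import numpy as np

sa = np.fromiter(((i, i) for i in range(1000 * 1000)),
                 dtype=[('f0', np.int64), ('f1', np.int32)])

f0 = sa['f0']            # strided, unaligned view -> buffered reduction
f0_copy = f0.copy()      # contiguous, aligned copy -> fast bulk reduction

assert f0_copy.flags['C_CONTIGUOUS'] and f0_copy.flags['ALIGNED']
assert f0.sum() == f0_copy.sum()   # same result, different code path
```

Timing `f0_copy.sum()` should land close to the 8-byte-record case, since the copy restores alignment and contiguity.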

Jaime
_______________________________________________
NumPy-Discussion mailing list
http://mail.scipy.org/mailman/listinfo/numpy-discussion
--
(\__/)
( O.o)
( > <) This is Rabbit. Copy Rabbit into your signature and help him with
his plans for world domination.
Francesc Alted
2015-07-06 16:21:20 UTC
Permalink
Yes, my guess is that you are right. I suppose it would be possible to
improve the NumPy codebase to accelerate this particular access pattern on
Intel platforms, but given that structured arrays are not that widely used
(pandas probably leads this use case by far, and as far as I know, it does
not use structured arrays internally in DataFrames), maybe it is not worth
worrying about this too much.

Thanks anyway,
Francesc
--
Francesc Alted
Julian Taylor
2015-07-06 18:32:34 UTC
Permalink
sorry for the 3 empty mails, my client bugged out...

as a workaround you can align structured dtypes to avoid this issue:

sa = np.fromiter(((i, i) for i in range(1000*1000)),
                 dtype=np.dtype([('f0', np.int64), ('f1', np.int32)],
                                align=True))
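Following this workaround, the aligned dtype pads each record so 'f0' always starts on an 8-byte boundary, and the field view regains the fast path (a small sketch built on Julian's snippet):

```python
import numpy as np

# align=True pads the int64+int32 record from 12 to 16 bytes.
aligned = np.dtype([('f0', np.int64), ('f1', np.int32)], align=True)
print(aligned.itemsize)            # 16 (4 bytes of trailing padding)

sa = np.fromiter(((i, i) for i in range(1000 * 1000)), dtype=aligned)
print(sa['f0'].flags['ALIGNED'])   # True -- the sum can now go fast
```

The trade-off is 4 wasted bytes per record in exchange for aligned field access.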
Todd
2015-07-07 07:53:50 UTC
Permalink
That may be more of a chicken-and-egg problem. Structured arrays are pretty
complicated to set up, which means they don't get used much, which means
they don't get much attention, which means they remain complicated.
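As one illustration of that setup overhead (my example, not Todd's): the same two-field record can be spelled several ways, and none of them is aligned unless you remember to ask for it.

```python
import numpy as np

# Two equivalent spellings of the same packed record type.
list_spec = np.dtype([('f0', np.int64), ('f1', np.int32)])
dict_spec = np.dtype({'names': ['f0', 'f1'],
                      'formats': [np.int64, np.int32]})
assert list_spec == dict_spec

# Both default to packed 12-byte records; alignment is opt-in.
assert list_spec.itemsize == 12
aligned = np.dtype([('f0', np.int64), ('f1', np.int32)], align=True)
assert aligned.itemsize == 16
```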