Discussion:
[Numpy-discussion] automatically avoiding temporary arrays
Julian Taylor
2016-09-30 13:38:37 UTC
hi,
Temporary arrays generated in expressions are expensive as they imply
extra memory bandwidth, which is the bottleneck in most numpy operations.
For example:

r = a + b + c

creates the b + c temporary and then adds a to it.
This can be rewritten to be more efficient using inplace operations:

r = b + c
r += a

This saves some memory bandwidth and can speed up the operation by 50%
for very large arrays, or even more if the in-place operation allows it
to be completed entirely in the CPU cache.

The problem is that in-place operations are a lot less readable, so they
are often only used in well optimized code. But due to Python's
refcounting semantics we can actually do some in-place conversions
transparently.
If an operand in Python has a reference count of one, it must be a
temporary, so we can use it as the destination array. CPython itself does
this optimization for string concatenations.
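
A small illustration of the refcount idea from the Python side (illustrative
only; the actual refcount check happens inside numpy's C code):

import sys
import numpy as np

a = np.ones(1000)
# A named array is referenced by the name 'a' plus the temporary reference
# created by passing it to getrefcount, so the count is at least 2.
print(sys.getrefcount(a))      # typically 2

# In r = (a + 1) + a, the array produced by (a + 1) is bound to no name;
# by the time it reaches the second add its only reference is the one held
# by that call, i.e. a refcount of 1, which marks it as a reusable temporary.
r = (a + 1) + a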

In numpy we have the issue that we can be called from the C API directly,
where the reference count may be one for other reasons.
To solve this we can check the backtrace up to the Python frame
evaluation function. If there are only numpy and Python functions in
between that and our entry point, we should be able to elide the temporary.

This PR implements this:
https://github.com/numpy/numpy/pull/7997

It currently only supports Linux with glibc (which has reliable
backtraces via unwinding) and maybe macOS, depending on how good its
backtrace is. On Windows the backtrace APIs are different and I don't
know them, but in theory it could also be done there.

A problem is that checking the backtrace is quite expensive, so it should
only be enabled when the involved arrays are large enough for it to be
worthwhile. In my testing this seems to be around 180-300 KiB sized
arrays, basically where they start spilling out of the CPU L2 cache.

I made a little crappy benchmark script to test this cutoff in this branch:
https://github.com/juliantaylor/numpy/tree/elide-bench

If you are interested you can run it with:
python setup.py build_ext -j 4 --inplace
ipython --profile=null check.ipy

At the end it will plot the ratio between elided and non-elided runtime.
It should get larger than one around 180 KiB on most CPUs.

If no one points out some flaw in the approach, I'm hoping to get this
into the next numpy version.

cheers,
Julian
j***@gmail.com
2016-09-30 21:09:04 UTC
On Fri, Sep 30, 2016 at 9:38 AM, Julian Taylor
Post by Julian Taylor
hi,
Temporary arrays generated in expressions are expensive as they imply
extra memory bandwidth, which is the bottleneck in most numpy operations.
r = a + b + c
creates the b + c temporary and then adds a to it.
r = b + c
r += a
general question (I wouldn't understand the details even if I looked.)

how is this affected by broadcasting and type promotion?

One of the main reasons I don't like to use in-place operations in
general is that I'm often not sure when type promotion occurs and when
arrays expand during broadcasting.

for example b + c is 1-D, a is 2-D, and r has the broadcasted shape.
another case where I switch away from in-place operations is when b + c
is int or bool and a is float. Thankfully, we get error messages for
casting now.
Post by Julian Taylor
This saves some memory bandwidth and can speed up the operation by 50%
for very large arrays, or even more if the in-place operation allows it
to be completed entirely in the CPU cache.
I didn't realize the difference can be so large. That would make
streamlining some code worth the effort.

Josef
Julian Taylor
2016-09-30 21:50:00 UTC
Post by j***@gmail.com
general question (I wouldn't understand the details even if I looked.)
how is this affected by broadcasting and type promotion?
One of the main reasons I don't like to use in-place operations in
general is that I'm often not sure when type promotion occurs and when
arrays expand during broadcasting.
for example b + c is 1-D, a is 2-D, and r has the broadcasted shape.
another case where I switch away from in-place operations is when b + c
is int or bool and a is float. Thankfully, we get error messages for
casting now.
the temporary is only avoided when the casting follows the safe rule, so
it should be the same as what you get without in-place operations. E.g.
a float32 temporary + float64 will not be converted to the unsafe float32
+= float64 which a normal in-place operation would allow. But
float64-temp + float32 is transformed.

Currently the only broadcasting case that will be transformed is
temporary + scalar value; otherwise it only works on matching array
shapes. There is nothing that fundamentally prevents full broadcasting,
but it is not implemented yet in the PR.
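
To make the rule concrete, a hedged sketch of which expressions are
eligible (illustrative shapes and dtypes, not taken from the PR's tests):

import numpy as np

a64 = np.zeros(100_000)                    # float64
b32 = np.zeros(100_000, dtype=np.float32)

# float64 temporary + float32 operand: adding float32 into the float64
# temporary is a safe cast, so the temporary can be reused as the output.
r1 = (a64 + 1.0) + b32

# float32 temporary + float64 operand: reusing the float32 temporary would
# need an unsafe float32 += float64, so a fresh float64 result is allocated.
r2 = (b32 + np.float32(1.0)) + a64

# temporary + scalar is the only broadcasting case currently transformed.
r3 = (a64 + a64) + 2.0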
Chris Barker
2016-10-01 18:38:16 UTC
Julian,

This is really, really cool!

I have been wanting something like this for years (over a decade? wow!),
but always thought it would require hacking the interpreter to intercept
operations. This is a really inspired idea, and could buy numpy a lot of
performance.

I'm afraid I can't say much about the implementation details -- but great
work!

-Chris




--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Benjamin Root
2016-10-02 13:10:45 UTC
Just thinking aloud, an idea I had recently takes a different approach.
The problem with temporaries isn't so much that they exist, but rather
that they keep getting malloc'ed and cleared. What if numpy kept a small
LRU cache of weakref'ed temporaries? Whenever a new numpy array is
requested, numpy could see if there is already one in its cache of
matching size and use it. If you think about it, expressions that result
in many temporaries would quite likely have many of them being the same
size in memory.

Don't know how feasible it would be to implement though.
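
A toy sketch of the general idea in pure Python (hypothetical and
simplified -- it skips the weakref part and is not how numpy's allocator
actually works):

from collections import OrderedDict
import numpy as np

class BufferCache:
    """Toy LRU cache of released scratch buffers, keyed by size in bytes."""

    def __init__(self, max_items=4):
        self.max_items = max_items
        self._free = OrderedDict()          # nbytes -> raw uint8 buffer

    def acquire(self, shape, dtype):
        dtype = np.dtype(dtype)
        nbytes = dtype.itemsize * int(np.prod(shape))
        raw = self._free.pop(nbytes, None)  # reuse a buffer of matching size
        if raw is None:
            raw = np.empty(nbytes, dtype=np.uint8)
        return raw.view(dtype).reshape(shape)

    def release(self, arr):
        # Hand the raw bytes back to the cache instead of freeing them.
        raw = np.ascontiguousarray(arr).view(np.uint8).reshape(-1)
        self._free[raw.nbytes] = raw
        while len(self._free) > self.max_items:
            self._free.popitem(last=False)  # evict the least recently used

Reusing a cached buffer would skip not only the malloc but also the page
faults of touching freshly mapped memory.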

Cheers!
Ben Root
Julian Taylor
2016-10-03 10:16:48 UTC
the problem with this approach is that we don't really want numpy
hogging hundreds of megabytes of memory by default, so it would need to
be a user option.
A context manager could work too, but it would probably lead to premature
optimization.

Very new Linux versions (4.6+) now finally support MADV_FREE, which gives
memory back to the system but does not require refaulting it if nothing
else needed it. So this might be an option now.
But libc implementations will probably use that at some point too, and
then numpy doesn't need to do this.
Chris Barker
2016-10-03 18:23:53 UTC
Post by Julian Taylor
the problem with this approach is that we don't really want numpy
hogging hundreds of megabytes of memory by default, so it would need to
be a user option.
indeed -- but one could set an LRU cache to be very small (few items, not
small memory), and then it gets used within expressions but doesn't hold
on to much outside of expressions.

However, is the allocation the only (or even biggest) source of the
performance hit?

If you generate a temporary as a result of an operation, rather than doing
it in-place, that temporary needs to be allocated, but it also means that
an additional array needs to be pushed through the processor -- and that
can make a big performance difference too.

I'm not entirely sure how to profile this correctly, but this seems to
indicate that the allocation is cheap compared to the operations (for a
million-element array)

* Regular old temporary creation

In [24]: def f1(arr1, arr2):
...: result = arr1 + arr2
...: return result

In [26]: %timeit f1(arr1, arr2)
1000 loops, best of 3: 1.13 ms per loop

* Completely in-place, no allocation of an extra array

In [27]: def f2(arr1, arr2):
...: arr1 += arr2
...: return arr1

In [28]: %timeit f2(arr1, arr2)
1000 loops, best of 3: 755 µs per loop

So that's about 30% faster

* allocate a temporary that isn't used -- but should catch the creation cost

In [29]: def f3(arr1, arr2):
...: result = np.empty_like(arr1)
...: arr1 += arr2
...: return arr1

In [30]: %timeit f3(arr1, arr2)

1000 loops, best of 3: 756 µs per loop

only a µs slower!

Profiling is hard, and I'm not good at it, but this seems to indicate that
the allocation is cheap.

-CHB
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Julian Taylor
2016-10-03 18:43:16 UTC
Post by Chris Barker
indeed -- but one could set an LRU cache to be very small (few items, not
small memory), and then it gets used within expressions but doesn't hold
on to much outside of expressions.
numpy doesn't see the whole expression so we can't really do much.
(Technically we could in Python 3.5 by using PEP 523, but that would be a
larger undertaking.)
Post by Chris Barker
However, is the allocation the only (or even biggest) source of the
performance hit?
on large arrays the allocation is insignificant. What does cost some
time is faulting the memory into the process, which implies writing zeros
into the pages (a page at a time as they are being used).
By storing memory blocks in numpy we would save this portion. This is
really the job of the libc, but these are usually tuned for general
purpose workloads and thus tend to give memory back to the system much
earlier than numerical workloads would like.

Note that numpy already has a small memory block cache, but it's only
used for very small arrays where the allocation cost itself is
significant; it is limited to a couple of megabytes at most.
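
A rough way to see this from Python (illustrative only; exact timings
depend heavily on the allocator and kernel):

import timeit
import numpy as np

n = 50_000_000  # ~400 MB of float64, large enough to be freshly mmap'ed

# The allocation itself is cheap: the kernel mostly just reserves address space.
print(timeit.timeit(lambda: np.empty(n), number=10))

# Touching the pages for the first time is where the cost is paid: each page
# is faulted in and zero-filled by the kernel as it is written.
a = np.empty(n)
print(timeit.timeit(lambda: a.fill(0), number=1))   # first touch: slow
print(timeit.timeit(lambda: a.fill(0), number=1))   # pages resident: faster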
Benjamin Root
2016-10-03 19:07:28 UTC
With regards to arguments about holding onto large arrays, I would like
to emphasize that my original suggestion mentioned weakref'ed numpy
arrays. Essentially, the idea is to claw back only the raw memory blocks
during that limbo period between discarding the numpy array Python object
and when Python garbage-collects it.

Ben Root
Pauli Virtanen
2016-10-03 19:33:57 UTC
Post by Benjamin Root
With regards to arguments about holding onto large arrays, I would like
to emphasize that my original suggestion mentioned weakref'ed numpy
arrays.
Essentially, the idea is to claw back only the raw memory blocks during
that limbo period between discarding the numpy array python object and
when python garbage-collects it.
CPython AFAIK deallocates immediately when the refcount hits zero. It's
relatively rare that you have arrays hanging around waiting for cycle
breaking by the gc. If you have them hanging around, I don't think it's
possible to distinguish these from other arrays without running the gc.

Note also that an "is an array in use" check probably always requires
Julian's stack-based hack, since you cannot rely on the refcount.

Pauli
Marten van Kerkwijk
2016-10-04 04:45:04 UTC
Note that numpy does store some larger arrays already, in the fft
module. (In fact, this was a cache of unlimited size until #7686.) It
might not be bad if the same cache were used more generally.

That said, if newer versions of Python offer ways of doing this better,
maybe that is the best way forward.

-- Marten
Hush Hush
2016-10-03 00:15:13 UTC
The same idea was published two years ago:

http://hiperfit.dk/pdf/Doubling.pdf
srean
2016-10-05 06:45:11 UTC
Good discussion, but I was surprised by the absence of numexpr in the
discussion, given how relevant it (numexpr) is to the topic.

Is the goal to fold the numexpr functionality (and beyond) into NumPy?

Francesc Alted
2016-10-05 09:46:21 UTC
Post by srean
Good discussion, but I was surprised by the absence of numexpr in the
discussion, given how relevant it (numexpr) is to the topic.
Is the goal to fold the numexpr functionality (and beyond) into NumPy?
Yes, the question about merging numexpr into numpy is something that
periodically shows up on this list. I think mostly everyone agrees that
it is a good idea, but things are not so easy, and so far nobody has
provided a good patch for this. Also, the fact that numexpr relies on
grouping an expression by using a string (e.g. y = ne.evaluate("x**3 +
tanh(x**2) + 4")) does not play well with the way numpy evaluates
expressions, so something should be suggested to cope with this too.
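
For reference, a minimal sketch of the numexpr usage being discussed
(illustrative array size, not from the thread):

import numpy as np
import numexpr as ne

x = np.random.rand(1_000_000)

# Plain numpy: each intermediate (x**2, tanh(x**2), x**3, ...) is a
# full-size temporary streamed through main memory.
y_np = x**3 + np.tanh(x**2) + 4

# numexpr: the whole expression is compiled from the string and evaluated
# block by block in cache, without full-size temporaries.
y_ne = ne.evaluate("x**3 + tanh(x**2) + 4")

assert np.allclose(y_np, y_ne)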
--
Francesc Alted
Robert McLeod
2016-10-05 10:56:20 UTC
All,
As Francesc said, Numexpr is going to get most of its power through
grouping a series of operations so it can send blocks to the CPU cache and
run the entire series of operations on the cache before returning the block
to system memory. If it was just used to back-end NumPy, it would only
gain from the multi-threading portion inside each function call. I'm not
sure how one would go about grouping successive numpy expressions without
modifying the Python interpreter?

I put a bit of effort into extending numexpr to use 4-byte word opcodes
instead of 1-byte. Progress has been very slow, however, due to time
constraints, but I have most of the numpy data types (u[1-4], i[1-4],
f[4,8], c[8,16], S[1-4], U[1-4]). On Tuesday I finished writing a Python
generator script that writes all the C-side opcode macros for opcodes.hpp.
Now I have about 900 opcodes, and this could easily grow into thousands if
more functions are added, so I also built a reverse lookup tree (based on
collections.defaultdict) for the Python-side of numexpr.

Robert
--
Robert McLeod, Ph.D.
Center for Cellular Imaging and Nano Analytics (C-CINA)
Biozentrum der Universität Basel
Mattenstrasse 26, 4058 Basel
Work: +41.061.387.3225
***@unibas.ch
***@bsse.ethz.ch <***@ethz.ch>
***@gmail.com
srean
2016-10-05 11:11:15 UTC
Thanks Francesc, Robert for giving me a broader picture of where this fits
in. I believe numexpr does not handle slicing, so that might be another
thing to look at.
Post by Robert McLeod
As Francesc said, Numexpr is going to get most of its power through
grouping a series of operations so it can send blocks to the CPU cache and
run the entire series of operations on the cache before returning the block
to system memory. If it was just used to back-end NumPy, it would only
gain from the multi-threading portion inside each function call.
Is that so?

I thought numexpr also cuts down on the number of temporary buffers that
get filled (in other words, copy operations) compared to writing the same
expression as a series of operations. My understanding may be wrong, and
I would appreciate correction.

The 'out' parameter in ufuncs can eliminate extra temporaries, but it's
not composable. Right now I have to manually carry along the array where
the in-place operations take place. I think the goal here is to eliminate
that.
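
To illustrate the point about manually carrying the output array, a hedged
sketch (not from the thread):

import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
c = np.random.rand(1_000_000)

# Expression form: allocates a temporary for b + c and another for the result.
r = a + b + c

# Hand-carried form: one output buffer reused for every step.  This is what
# the temporary elision (or numexpr) aims to do automatically.
out = np.empty_like(a)
np.add(b, c, out=out)
np.add(out, a, out=out)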
Robert McLeod
2016-10-05 12:06:06 UTC
Post by srean
Thanks Francesc, Robert for giving me a broader picture of where this fits
in. I believe numexpr does not handle slicing, so that might be another
thing to look at.
Dereferencing would be relatively simple to add into numexpr, as it would
just be some getattr() calls. Personally I will add that at some point
because it will clean up my code.

Slicing, maybe only for contiguous blocks in memory?

I.e.

imageStack[0,:,:]

would be possible, but

imageStack[:, ::2, ::2]

would not be trivial (I think...). I seem to remember someone asked David
Cooke about slicing and he said something along the lines of, "that's what
Numba is for." Perhaps NumPy backended by Numba is more so what you are
looking for, as it hooks into the byte compiler? The main advantage of
numexpr is that a series of numpy functions in <expression> can be enclosed
in ne.evaluate( "<expression>" ) and it provides a big acceleration for
little programmer effort, but it's not nearly as sophisticated as Numba or
PyPy.
Post by srean
Is that so?
I thought numexpr also cuts down on the number of temporary buffers that
get filled (in other words, copy operations) compared to writing the same
expression as a series of operations. My understanding may be wrong, and
I would appreciate correction.
The 'out' parameter in ufuncs can eliminate extra temporaries, but it's
not composable. Right now I have to manually carry along the array where
the in-place operations take place. I think the goal here is to eliminate
that.
The numexpr virtual machine does create temporaries where needed when it
parses the abstract syntax tree for all the operations it has to do. I
believe the main advantage is that the temporaries are created in the CPU
cache, and not in system memory. It's certainly true that numexpr doesn't
create a lot of OP_COPY operations, rather it's optimized to minimize
them, so it's probably fewer ops than naive successive calls to numpy
within Python, but I'm unsure if there's any difference in operation count
between a hand-optimized numpy with out= set and numexpr. Numexpr just
does it for you.

This blog post from Tim Hochberg is useful for understanding the
performance advantages of blocking versus multithreading:

http://www.bitsofbits.com/2014/09/21/numpy-micro-optimization-and-numexpr/


Robert
--
Robert McLeod, Ph.D.
Center for Cellular Imaging and Nano Analytics (C-CINA)
Biozentrum der Universität Basel
Mattenstrasse 26, 4058 Basel
Work: +41.061.387.3225
***@unibas.ch
***@bsse.ethz.ch <***@ethz.ch>
***@gmail.com
srean
2016-10-06 09:51:07 UTC
Post by Robert McLeod
It's certainly true that numexpr doesn't create a lot of OP_COPY
operations, rather it's optimized to minimize them, so it's probably fewer
ops than naive successive calls to numpy within Python, but I'm unsure if
there's any difference in operation count between a hand-optimized numpy
with out= set and numexpr. Numexpr just does it for you.
That was my understanding as well. If it automatically does what one
could achieve by carrying the state along in the 'out' parameter, that's
as good as it can get in terms of removing unnecessary ops. There are
other speedup opportunities of course, but that's a separate matter.
Post by Robert McLeod
This blog post from Tim Hochberg is useful for understanding the
performance advantages of blocking versus multithreading:
http://www.bitsofbits.com/2014/09/21/numpy-micro-optimization-and-numexpr/
Hadn't come across that one before. Great link. Thanks. Using caches and
vector registers well trumps threading, unless one has a lot of data and
it helps to disable hyper-threading.
Julian Taylor
2017-02-25 10:17:55 UTC
hi,
This PR has now been merged. As it has the potential to break stuff,
please test your applications, in particular ones that use Cython and
C extensions, against master.

It will only do something when working on large arrays (> 256 KiB) on
platforms providing the backtrace() function, so Linux and Mac but
probably not Windows (though Windows could probably be added if someone
volunteers to write the equivalent WinAPI code).

If you compile from source and want to see for sure if it is doing
anything, you can set the NPY_ELIDE_DEBUG define in
./numpy/core/src/multiarray/temp_elide.c to 1 or 2 and it will print out
when it is eliding something.
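
Another rough way to check from Python, without rebuilding with the debug
define (a hedged sketch; exact numbers will vary by machine):

import timeit
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
c = np.random.rand(1_000_000)

def chained():
    return a + b + c          # eligible for temporary elision on master

def inplace():
    r = b + c
    r += a                    # the hand-written equivalent
    return r

# With elision active the two should be close; on older releases the
# chained version is noticeably slower.
print(timeit.timeit(chained, number=100))
print(timeit.timeit(inplace, number=100))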

cheers,
Julian
Benjamin Root
2017-02-27 18:43:23 UTC
Permalink
What's the timeline for the next release? I have the perfect use case for
this (a Haversine calculation on large arrays that takes up ~33% of one of
my processing scripts). However, to test it out, I have a huge dependency
mess to wade through first, and there are no resources devoted to that
project for at least a few weeks. I want to make sure I get feedback to y'all.
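
For reference, the expression looks roughly like this (a sketch with
made-up array names, not my actual script):

import numpy as np

def haversine(lat1, lon1, lat2, lon2, radius=6371.0):
    # lat/lon arrays are in radians; every binary op below allocates a
    # full-size temporary, which is exactly where elision should help
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 2.0 * radius * np.arcsin(np.sqrt(a))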

Cheers!
Ben Root


On Sat, Feb 25, 2017 at 5:17 AM, Julian Taylor <
Post by Julian Taylor
hi,
This PR has now been merged. As it has the potential to break stuff
please test your applications, in particular ones that use cython and
C-extensions against master.
It will only do something when working on large arrays (> 256 KiB) on
platforms providing the backtrace() function. So linux and mac but
windows probably not (though windows could probably be added if someone
volunteers to write the equivalent winapi code).
If you compile from source and want to see for sure if it is doing
anything you can set the NPY_ELIDE_DEBUG define in
./numpy/core/src/multiarray/temp_elide.c to 1 or 2 and it will print out
when it is eliding something.
cheers,
Julian
Post by Julian Taylor
hi,
Temporary arrays generated in expressions are expensive as the imply
extra memory bandwidth which is the bottleneck in most numpy operations.
r = a + b + c
creates the b + c temporary and then adds a to it.
r = b + c
r += a
This saves some memory bandwidth and can speedup the operation by 50%
for very large arrays or even more if the inplace operation allows it to
be completed completely in the cpu cache.
The problem is that inplace operations are a lot less readable so they
are often only used in well optimized code. But due to pythons
refcounting semantics we can actually do some inplace conversions
transparently.
If an operand in python has a reference count of one it must be a
temporary so we can use it as the destination array. CPython itself does
this optimization for string concatenations.
In numpy we have the issue that we can be called from the C-API directly
where the reference count may be one for other reasons.
To solve this we can check the backtrace until the python frame
evaluation function. If there are only numpy and python functions in
between that and our entry point we should be able to elide the
temporary.
Post by Julian Taylor
https://github.com/numpy/numpy/pull/7997
It currently only supports Linux with glibc (which has reliable
backtraces via unwinding) and maybe MacOS depending on how good their
backtrace is. On windows the backtrace APIs are different and I don't
know them but in theory it could also be done there.
A problem is that checking the backtrace is quite expensive, so should
only be enabled when the involved arrays are large enough for it to be
worthwhile. In my testing this seems to be around 180-300KiB sized
arrays, basically where they start spilling out of the CPU L2 cache.
I made a little crappy benchmark script to test this cutoff in this
https://github.com/juliantaylor/numpy/tree/elide-bench
python setup.py build_ext -j 4 --inplace
ipython --profile=null check.ipy
At the end it will plot the ratio between elided and non-elided runtime.
It should get larger than one around 180KiB on most cpus.
If no one points out some flaw in the approach, I'm hoping to get this
into the next numpy version.
cheers,
Julian
Charles R Harris
2017-02-27 19:27:48 UTC
Permalink
Post by Benjamin Root
What's the timeline for the next release? I have the perfect usecase for
this (Haversine calculation on large arrays that takes up ~33% of one of my
processing scripts). However, to test it out, I have a huge dependency mess
to wade through first, and there are no resources devoted to that project
for at least a few weeks. I want to make sure I get feedback to y'all.
I'd like to branch 1.13.x at the end of March. The planned features that
still need to go in are the `__array_ufunc__` work and the `lapack_lite`
update. The first RC should not take much longer. I believe Matthew is
building wheels for testing on the fly but I don't know where you can find
them.

Chuck
Matthew Brett
2017-02-27 19:31:53 UTC
Permalink
Hi,

On Mon, Feb 27, 2017 at 11:27 AM, Charles R Harris
Post by Charles R Harris
Post by Benjamin Root
What's the timeline for the next release? I have the perfect usecase for
this (Haversine calculation on large arrays that takes up ~33% of one of my
processing scripts). However, to test it out, I have a huge dependency mess
to wade through first, and there are no resources devoted to that project
for at least a few weeks. I want to make sure I get feedback to y'all.
I'd like to branch 1.13.x at the end of March. The planned features that
still need to go in are the `__array_ufunc__` work and the `lapack_lite`
update. The first RC should not take much longer. I believe Matthew is
building wheels for testing on the fly but I don't know where you can find
them.
Latest wheels at:
https://7933911d6844c6c53a7d-47bd50c35cd79bd838daf386af554a83.ssl.cf2.rackcdn.com/

PRE_URL=https://7933911d6844c6c53a7d-47bd50c35cd79bd838daf386af554a83.ssl.cf2.rackcdn.com
pip install -f $PRE_URL --pre numpy
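
To confirm the pre-release wheel actually got picked up, you can check the
installed version afterwards, e.g.:

python -c "import numpy; print(numpy.__version__)"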

Cheers,

Matthew
Benjamin Root
2017-02-27 19:46:18 UTC
Permalink
Ah, that wheel would be a huge help. Most of the packages I have as
dependencies for this project were compiled against v1.10, so I am hoping
that there won't be too big of a problem. Are the manylinux wheels still
compatible with CentOS5, or has the base image been bumped to CentOS6?

Cheers!
Ben Root
Post by Matthew Brett
Hi,
On Mon, Feb 27, 2017 at 11:27 AM, Charles R Harris
Post by Charles R Harris
Post by Benjamin Root
What's the timeline for the next release? I have the perfect usecase for
this (Haversine calculation on large arrays that takes up ~33% of one
of my
Post by Charles R Harris
Post by Benjamin Root
processing scripts). However, to test it out, I have a huge dependency
mess
Post by Charles R Harris
Post by Benjamin Root
to wade through first, and there are no resources devoted to that
project
Post by Charles R Harris
Post by Benjamin Root
for at least a few weeks. I want to make sure I get feedback to y'all.
I'd like to branch 1.13.x at the end of March. The planned features that
still need to go in are the `__array_ufunc__` work and the `lapack_lite`
update. The first RC should not take much longer. I believe Matthew is
building wheels for testing on the fly but I don't know where you can
find
Post by Charles R Harris
them.
https://7933911d6844c6c53a7d-47bd50c35cd79bd838daf386af554a83.ssl.cf2.rackcdn.com/
PRE_URL=https://7933911d6844c6c53a7d-47bd50c35cd79bd838daf386af554a83.ssl.cf2.rackcdn.com
pip install -f $PRE_URL --pre numpy
Cheers,
Matthew
Matthew Brett
2017-02-27 19:52:31 UTC
Permalink
Hi,
Post by Benjamin Root
Ah, that wheel would be a huge help. Most of the packages I have as
dependencies for this project were compiled against v1.10, so I am hoping
that there won't be too big of a problem.
I don't think so - the problems generally arise when you try to import
a package compiled against a more recent version of numpy than the one
you have installed.
Post by Benjamin Root
Are the manylinux wheels still
compatible with CentOS5, or has the base image been bumped to CentOS6?
Yup, still CentOS5. When it shifts to CentOS6 (probably not very soon),
you'll likely see a version number bump to manylinux2 (manylinux1 is the
current version).

Cheers,

Matthew
