Discussion:
[Numpy-discussion] Numpy arrays shareable among related processes (PR #7533)
Matěj Týč
2016-04-11 12:39:41 UTC
Permalink
Dear Numpy developers,
I propose a pull request https://github.com/numpy/numpy/pull/7533 that
features numpy arrays that can be shared among processes (with some
effort).

Why:
In CPython, multiprocessing is the only way to exploit
multi-core CPUs if your parallel code can't avoid creating Python
objects, because in that case CPython's GIL makes threads unusable.
However, unlike with threading, sharing data among processes is
non-trivial and platform-dependent.

Although numpy (and certainly some other packages) implements some
operations in a way that the GIL is not a concern, consider another case:
you have a large amount of data in the form of a numpy array and you
want to pass it to a function of an arbitrary Python module that also
expects a numpy array (e.g. a list of vertex coordinates as input and
an array of the corresponding polygon as output). Here it is clear that
the GIL is an issue, and since you want a numpy array on both ends, you
currently have to copy your numpy array into a multiprocessing.Array (to
pass the data) and then convert it back to an ndarray in the worker
process.
This contribution would streamline that a bit - you would create an
array as you are used to, pass it to the subprocess as you would a
multiprocessing.Array, and the worker process could use it as a numpy
array right away.

How:
The idea is to create a numpy array in a buffer that can be shared
among processes. Python has support for this in its standard library,
so the current solution creates a multiprocessing.Array and then
passes it as the "buffer" argument to ndarray.__new__. That would be all
on Unixes, but on Windows a custom pickle method is needed, otherwise
the array "forgets" that its buffer is special and the sharing doesn't
work.
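
To make the mechanism concrete, here is a minimal sketch of the general
pattern (illustrative only - the helper name and the use of
RawArray/frombuffer are my choices, not the PR's exact code, and the
Windows pickling issue mentioned above is not handled here):

```
import multiprocessing as mp
import numpy as np

def shared_ndarray(shape, dtype=np.float64):
    """Allocate an unlocked shared-memory buffer and view it as an ndarray."""
    n_bytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    raw = mp.RawArray('b', n_bytes)                  # lives in shared memory
    return np.frombuffer(raw, dtype=dtype).reshape(shape)

a = shared_ndarray((1000, 1000))
a[:] = 0.0   # a worker that inherits `a` (e.g. via fork) sees the same memory
```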

Some of what has been said in the pull request & my answer to that:

* ... I do see some value in providing a canonical right way to
construct shared memory arrays in NumPy, but I'm not very happy with
this solution, ... terrible code organization (with the global
variables):
* I understand that, however this is a pattern of Python
multiprocessing, and everybody who wants to use a Pool with shared
data is either familiar with this approach or has to become familiar
with it [2, 3] (see the sketch below). A good compromise is to have a
separate module for each parallel calculation, so the global variables
are not a problem.
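
For readers who have not seen it, this is roughly the Pool-plus-module-global
idiom from [2, 3] that the discussion refers to (a sketch with made-up names,
not code from the PR):

```
import multiprocessing as mp
import numpy as np

_shared = None  # module-level global filled in by the Pool initializer

def _init_worker(arr):
    # Runs once per worker; the array is made reachable through a global so
    # that it is not pickled again for every task submitted to the pool.
    global _shared
    _shared = arr

def row_sum(i):
    return _shared[i].sum()

if __name__ == '__main__':
    data = np.random.rand(100, 1000)
    with mp.Pool(processes=4, initializer=_init_worker, initargs=(data,)) as pool:
        sums = pool.map(row_sum, range(100))
```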

* Can you explain why the ndarray subclass is needed? Subclasses can
be rather annoying to get right, and also for other reasons.
* The shmarray class needs the custom pickler (but only on Windows).

* If there's some way we can paper over the boilerplate such that
users can use it without understanding the arcana of multiprocessing,
then yes, that would be great. But otherwise I'm not sure there's
anything to be gained by putting it in a library rather than referring
users to the examples on StackOverflow [1] [2].
* What about telling users: "You can use numpy with multiprocessing.
Remember the multiprocessing.Value and multiprocessing.Array classes?
numpy.shm works exactly the same way, which means that it shares their
limitations. Refer to an example: <link to numpy doc>." Note that
although those SO links contain all of the information, it was very
difficult to get this up and running for a newcomer like me a few
years ago.

* This needs tests and justification for custom pickling methods,
which are not used in any of the current examples. ...
* I am sorry, but I don't fully understand that point. The custom
pickling method of shmarray has to be there on Windows, but users
don't have to know about it at all. As noted earlier, the global
variable is the only way of using the standard Python
multiprocessing.Pool with shared objects.

[1]: http://stackoverflow.com/questions/10721915/shared-memory-objects-in-python-multiprocessing
[2]: http://stackoverflow.com/questions/7894791/use-numpy-array-in-shared-memory-for-multiprocessing
[3]: http://stackoverflow.com/questions/1675766/how-to-combine-pool-map-with-array-shared-memory-in-python-multiprocessing
Stephan Hoyer
2016-04-11 16:37:41 UTC
Permalink
Post by Matěj Týč
* ... I do see some value in providing a canonical right way to
construct shared memory arrays in NumPy, but I'm not very happy with
this solution, ... terrible code organization (with the global variables):
* I understand that, however this is a pattern of Python
multiprocessing, and everybody who wants to use a Pool with shared
data is either familiar with this approach or has to become familiar
with it [2, 3]. A good compromise is to have a separate module for each
parallel calculation, so the global variables are not a problem.
OK, we can agree to disagree on this one. I still don't think I could get
code using this pattern checked in at my work (for good reason).
Post by Matěj Týč
* If there's some way we can paper over the boilerplate such that
users can use it without understanding the arcana of multiprocessing,
then yes, that would be great. But otherwise I'm not sure there's
anything to be gained by putting it in a library rather than referring
users to the examples on StackOverflow [1] [2].
* What about telling users: "You can use numpy with multiprocessing.
Remember the multiprocessing.Value and multiprocessing.Array classes?
numpy.shm works exactly the same way, which means that it shares their
limitations. Refer to an example: <link to numpy doc>." Note that
although those SO links contain all of the information, it was very
difficult to get this up and running for a newcomer like me a few
years ago.
I guess I'm still not convinced this is the best we can do with the
multiprocessing library. If we're going to do this, then we definitely need
to have the fully canonical example.

For example, could you make the shared array a global variable and then
still pass references to it into the functions called by the worker
processes? The examples on StackOverflow that we're both looking at are
varied enough that it's not obvious to me that this is as good as it gets.

Post by Matěj Týč
* This needs tests and justification for custom pickling methods,
which are not used in any of the current examples. ...
* I am sorry, but I don't fully understand that point. The custom
pickling method of shmarray has to be there on Windows, but users
don't have to know about it at all. As noted earlier, the global
variable is the only way of using the standard Python
multiprocessing.Pool with shared objects.
That sounds like a fine justification, but given that it wasn't obvious, it
needs a comment saying as much in the source code :). Also, it breaks
pickle, which is another limitation that needs to be documented.
Sturla Molden
2016-05-11 08:29:02 UTC
Permalink
I did some work on this some years ago. I have more or less concluded that
it was a waste of effort. But first let me explain why the suggested
approach does not work. As it uses memory mapping to create shared memory
(i.e. shared segments are not named), the segments must be created ahead of
spawning processes. But if you really want this to work smoothly, you want
named shared memory (Sys V IPC or POSIX shm_open), so that shared arrays
can be created in the spawned processes and passed back.

Now for the reason I don't care about shared memory arrays anymore, and
what I am currently working on instead:

1. I have come across very few cases where threaded code cannot be used in
numerical computing. In fact, multithreading nearly always happens in the
code where I write pure C or Fortran anyway. Most often it happens in
library code that is already multithreaded (Intel MKL, Apple Accelerate
Framework, OpenBLAS, etc.), which means using it requires no extra effort
from my side. A multithreaded LAPACK library is not less multithreaded if I
call it from Python.

2. Getting shared memory right can be difficult because of hierarchical
memory and false sharing. You might not see it if you only have a multicore
CPU with a shared cache. But your code might not scale up on computers with
more than one physical processor. False sharing acts like the GIL, except
it happens in hardware and affects your C code invisibly without any
explicit locking you can pinpoint. This is also why MPI code tends to scale
much better than OpenMP code. If nothing is shared there will be no false
sharing.

3. Raw C level IPC is cheap – very, very cheap. Even if you use pipes or
sockets instead of shared memory it is cheap. There are very few cases
where the IPC tends to be a bottleneck.

4. The reason IPC appears expensive with NumPy is because multiprocessing
pickles the arrays. It is pickle that is slow, not the IPC. Some would say
that the pickle overhead is an integral part of the IPC overhead, but I
will argue that it is not. The slowness of pickle is a separate problem
altogether.

5. Shared memory does not remove the pickle overhead, because NumPy
arrays backed by shared memory must also be pickled. Multiprocessing can
bypass pickling the RawArray object, but the rest of the NumPy array is
pickled. Using shared memory arrays thus has no speed advantage over normal
NumPy arrays when we use multiprocessing.

6. It is much easier to write concurrent code that uses queues for message
passing than anything else. That is why using a Queue object has been the
popular Pythonic approach to both multithreading and multiprocessing. I
would like this to continue.

I am therefore focusing my effort on the multiprocessing.Queue object. If
you understand the six points I listed, you will see where this is going:
what we really need is a specialized queue that has knowledge about NumPy
arrays and can bypass pickle. That is the NumPy-aware queue object I am
working on.
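
To illustrate what "bypassing pickle" can look like, here is only a toy
sketch of the idea, not my implementation; the helper names are made up.
The array's metadata is sent as a tiny pickle, and the raw buffer goes
through the connection without being pickled at all:

```
import multiprocessing as mp
import numpy as np

def send_array(conn, arr):
    """Send dtype/shape via a tiny pickle, then the raw buffer unpickled."""
    arr = np.ascontiguousarray(arr)
    conn.send((arr.dtype.str, arr.shape))
    conn.send_bytes(arr)

def recv_array(conn):
    """Rebuild the array by receiving the bytes directly into a fresh buffer."""
    dtype, shape = conn.recv()
    arr = np.empty(shape, dtype=dtype)
    conn.recv_bytes_into(arr)
    return arr

def _child(conn):
    send_array(conn, np.arange(1000000, dtype='float64'))

if __name__ == '__main__':
    parent_conn, child_conn = mp.Pipe()
    p = mp.Process(target=_child, args=(child_conn,))
    p.start()
    print(recv_array(parent_conn).sum())
    p.join()
```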

We are not doing the users a favor by encouraging the use of shared memory
arrays. They help with nothing.


Sturla Molden
Allan Haldane
2016-05-11 18:01:02 UTC
Permalink
Post by Sturla Molden
4. The reason IPC appears expensive with NumPy is because multiprocessing
pickles the arrays. It is pickle that is slow, not the IPC. Some would say
that the pickle overhead is an integral part of the IPC overhead, but I
will argue that it is not. The slowness of pickle is a separate problem
altogether.
That's interesting. I've also used multiprocessing with numpy and didn't
realize that. Is this true in python3 too?

In python2 it appears that multiprocessing uses pickle protocol 0, which
must cause a big slowdown (a factor of 100) relative to protocol 2, and
that it uses pickle instead of cPickle.

a = np.arange(40*40)

%timeit pickle.dumps(a)
1000 loops, best of 3: 1.63 ms per loop

%timeit cPickle.dumps(a)
1000 loops, best of 3: 1.56 ms per loop

%timeit cPickle.dumps(a, protocol=2)
100000 loops, best of 3: 18.9 µs per loop

Python 3 uses protocol 3 by default:

%timeit pickle.dumps(a)
10000 loops, best of 3: 20 µs per loop
Benjamin Root
2016-05-11 18:22:54 UTC
Permalink
Oftentimes, if one needs to share numpy arrays for multiprocessing, I would
imagine that it is because the array is huge, right? So, the pickling
approach would copy that array for each process, which defeats the purpose,
right?

Ben Root
Feng Yu
2016-05-11 22:38:15 UTC
Permalink
Hi,

I've been thinking about and exploring this for some time. If we are to
start some effort I'd like to help. Here are my comments, mostly in
response to Sturla's comments.

1. If we are talking about shared memory and copy-on-write
inheritance, then we are using 'fork'. If we are free to use fork,
then a large chunk of the concerns regarding the Python standard
library's multiprocessing is no longer relevant - especially the
"functions must be defined in a module" limitation, which tends to
impose special requirements on the software design.

2. Pickling of an inherited shared-memory array can be done minimally by
just pickling the array interface and the pointer address (see the sketch
after this list). This works because the child process and the parent
share the same address-space layout, guaranteed by the fork call.

3. The RawArray and RawValue implementation in std multiprocessing has
its own memory allocator for managing small variables. It is a huge
overkill (in terms of implementation) if we only care about very large
memory chunks.

4. There is a hidden synchronization cost on multi-CPU (NUMA?) systems. One
choice is to defer the responsibility of avoiding races to the developer.
Simple constructs for working on slices of an array in parallel can cover a
huge fraction of use cases and fully avoid this issue.

5. Whether to delegate parallelism to the underlying low-level
implementation, or to implement the parallelism in Python while
keeping the underlying low-level implementation sequential, probably
depends on the problem. Given the current state of parallelism support
in Python it may be convenient to delegate, but will that forever be
the case?

For example, after the MPI FFTW binding had been stuck for a long time,
someone wrote a parallel Python FFT package
(https://github.com/spectralDNS/mpiFFT4py) that uses FFTW for the
sequential part, writes all the parallel semantics in Python with mpi4py,
and uses a more efficient domain decomposition.

6. If we are to define a set of operations, I would recommend taking a
look at OpenMP as a reference -- it has been out there for decades and
is used widely. An equivalent to the 'omp parallel for' construct in
Python would be a very good starting point and immediately useful.
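
As a rough illustration of point 2 above (a hypothetical helper of my own,
assuming a C-contiguous array, a child created by fork, and a parent that
keeps the original array alive):

```
import ctypes
import numpy as np

def view_from_interface(iface):
    """Rebuild an ndarray view from a parent's __array_interface__ dict.
    Only valid in a fork()ed child, where the address-space layout is
    inherited, and only while the parent keeps the original array alive.
    Assumes a C-contiguous array."""
    addr, _readonly = iface['data']
    dtype = np.dtype(iface['typestr'])
    n_bytes = int(np.prod(iface['shape'])) * dtype.itemsize
    cbuf = (ctypes.c_char * n_bytes).from_address(addr)
    return np.frombuffer(cbuf, dtype=dtype).reshape(iface['shape'])
```

For a read-only scatter this works on ordinary copy-on-write pages; for
two-way sharing the underlying buffer must itself live in shared memory
(e.g. an inherited mmap or RawArray).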

- Yu
Sturla Molden
2016-05-11 23:02:06 UTC
Permalink
Post by Feng Yu
1. If we are talking about shared memory and copy-on-write
inheritance, then we are using 'fork'.
Not available on Windows. On Unix it only allows one-way communication,
from parent to child.
Post by Feng Yu
2. Pickling of an inherited shared-memory array can be done minimally by
just pickling the array interface and the pointer address. This works
because the child process and the parent share the same address-space
layout, guaranteed by the fork call.
Again, not everyone uses Unix.

And on Unix it is not trivial to pass data back from the child process. I
solved that problem with Sys V IPC (pickling the name of the segment).
Post by Feng Yu
6. If we are to define a set of operations, I would recommend taking a
look at OpenMP as a reference -- it has been out there for decades and
is used widely. An equivalent to the 'omp parallel for' construct in
Python would be a very good starting point and immediately useful.
If you are on Unix, you can just use a context manager. Call os.fork in
__enter__ and os.waitpid in __exit__.
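
A minimal sketch of that suggestion (my own illustration, Unix only; the
class name is made up): both the parent and the forked child execute the
body of the with-block, and the parent reaps the child on exit.

```
import os

class forked:
    """fork() in __enter__, waitpid() in __exit__ - a bare-bones parallel section."""
    def __enter__(self):
        self.pid = os.fork()      # both processes continue into the with-body
        return self.pid           # 0 in the child, the child's pid in the parent
    def __exit__(self, *exc):
        if self.pid == 0:
            os._exit(0)           # child: stop here instead of running code after the block
        os.waitpid(self.pid, 0)   # parent: wait for the child to finish
        return False

with forked() as pid:
    if pid == 0:
        pass  # work done in the child (sees a copy-on-write view of the parent's arrays)
    else:
        pass  # work done in the parent
```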

Sturla
Niki Spahiev
2016-05-12 07:06:27 UTC
Permalink
Post by Sturla Molden
Post by Feng Yu
1. If we are talking about shared memory and copy-on-write
inheritance, then we are using 'fork'.
Not available on Windows. On Unix it only allows one-way communication,
from parent to child.
Apparently next Win10 will have fork as part of bash integration.

Niki
Sturla Molden
2016-05-12 23:14:35 UTC
Permalink
Post by Niki Spahiev
Apparently next Win10 will have fork as part of bash integration.
That would be great. The lack of fork on Windows is very annoying.

Sturla
Sturla Molden
2016-05-13 17:29:13 UTC
Permalink
Post by Niki Spahiev
Apparently next Win10 will have fork as part of bash integration.
It is Interix/SUA rebranded as the "Subsystem for Linux". It remains to be
seen how long it will stay this time. Also, a Python built for this
subsystem will not run on the Win32 subsystem, so there are no graphics.
And it will not be installed by default, just like SUA.
Feng Yu
2016-05-12 20:46:14 UTC
Permalink
Post by Sturla Molden
Again, not everyone uses Unix.
And on Unix it is not trivial to pass data back from the child process. I
solved that problem with Sys V IPC (pickling the name of the segment).
I wonder if it is necessary to insist on being able to pass large amounts
of data back from the child to the parent process.

In most (half?) situations the result can be written directly into a shared
array that was preallocated before the workers were spawned. Then there is
no need to pass data back with named segments.

Here I am just doodling some possible use cases along the OpenMP line.
The sample below just copies the data from s to r, in two different
ways. On systems that do not support multiprocessing + fork, the
semantics are still preserved if threading is used.

```
import ...... as mp

# the access attribute of inherited variables is at least 'privatecopy',
# but with a threading backend it becomes 'shared'
s = numpy.arange(10000)

with mp.parallel(num_threads=8) as section:
    # variables defined via section.empty will always be 'shared'
    r = section.empty(10000)

    def work():
        # variables defined in the body are 'private'
        tid = section.get_thread_num()
        size = section.get_num_threads()
        sl = slice(tid * r.size // size, (tid + 1) * r.size // size)
        r[sl] = s[sl]

    status = section.run(work)
    assert not any(status.errors)

    # support for the following could be implemented with section.run

    chunksize = 1000

    def work(i):
        sl = slice(i, i + chunksize)
        r[sl] = s[sl]
        return s[sl].sum()

    status = section.loop(work, range(0, r.size, chunksize), schedule='static')
    assert not any(status.errors)
    total = sum(status.results)
```
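
For what it's worth, here is a runnable approximation of the chunked loop
above using only the standard library (threads stand in for the
hypothetical mp.parallel backend, so `r` and `s` are genuinely shared):

```
import numpy as np
from concurrent.futures import ThreadPoolExecutor

s = np.arange(10000)
r = np.empty_like(s)
chunksize = 1000

def work(i):
    sl = slice(i, i + chunksize)
    r[sl] = s[sl]            # threads see the same r and s, no pickling involved
    return s[sl].sum()

with ThreadPoolExecutor(max_workers=8) as pool:
    total = sum(pool.map(work, range(0, r.size, chunksize)))

assert (r == s).all() and total == s.sum()
```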
Sturla Molden
2016-05-12 23:32:35 UTC
Permalink
Post by Feng Yu
In most (half?) situations the result can be written directly into a shared
array that was preallocated before the workers were spawned. Then there is
no need to pass data back with named segments.
You can work around it in various ways, this being one of them.

Personally I prefer a parallel programming style with queues – either to
scatter arrays to workers and collect arrays from workers, or to chain
workers together in a pipeline (without using coroutines). But exactly how
you program is a matter of taste. I want to make it as inexpensive as
possible to pass a NumPy array through a queue. If anyone else wants to
help improve parallel programming with NumPy using a different paradigm,
that is fine too. I just wanted to clarify why I stopped working on shared
memory arrays.

(As for the implementation, I am also experimenting with platform-dependent
asynchronous I/O (IOCP, GCD or kqueue, epoll) to pass NumPy arrays through a
queue as inexpensively and scalably as possible. And no, there is no public
repo, as I like to experiment with my pet project undisturbed before I let
it out in the wild.)


Sturla
Feng Yu
2016-05-13 19:44:34 UTC
Permalink
Post by Sturla Molden
Personally I prefer a parallel programming style with queues – either to
scatter arrays to workers and collect arrays from workers, or to chain
workers together in a pipeline (without using coroutines). But exactly how
you program is a matter of taste. I want to make it as inexpensive as
possible to pass a NumPy array through a queue. If anyone else wants to
help improve parallel programming with NumPy using a different paradigm,
that is fine too. I just wanted to clarify why I stopped working on shared
memory arrays.
Even though I am not very obsessed with functional style and queues, I
still have to agree with you that queues tend to produce more readable and
less verbose code -- given the right tool.
Post by Sturla Molden
(As for the implementation, I am also experimenting with platform dependent
asynchronous I/O (IOCP, GCD or kqueue, epoll) to pass NumPy arrays though a
queue as inexpensively and scalably as possible. And no, there is no public
repo, as I like to experiment with my pet project undisturbed before I let
it out in the wild.)
It would be wonderful if there were a way to pass numpy arrays around
without a huge dependency list.

After all, we know the address of the array and, in principle, we are
able to find the physical pages and map them on the receiver side.

Also, did you check out http://zeromq.org/blog:zero-copy ?
ZeroMQ is a dependency of Jupyter, so it is quite widely available.

- Yu
Sturla Molden
2016-05-13 23:15:59 UTC
Permalink
Post by Feng Yu
Also, did you checkout http://zeromq.org/blog:zero-copy ?
ZeroMQ is a dependency of Jupyter, so it is quite available.
ZeroMQ is great, but it lacks some crucial features. In particular it does
not support IPC on Windows. Ideally one should e.g. use Unix domain sockets
on Linux and named pipes on Windows. Most MPI implementations seem to
prefer shared memory over these mechanisms, though. Also I am not sure
about ZeroMQ and async I/O. I would e.g. like to use IOCP on Windows, GCD
on Mac, and a thread pool plus epoll on Linux.

Sturla
Sturla Molden
2016-05-11 22:48:23 UTC
Permalink
Post by Benjamin Root
Oftentimes, if one needs to share numpy arrays for multiprocessing, I would
imagine that it is because the array is huge, right?
That is a case for shared memory, but what I was talking about is more
common than this. In order for processes to cooperate, they must
communicate. So we need a way to pass around NumPy arrays quickly.
Sometimes we want to use shared memory because of the size of the data, but
more often it is just used as a form of inexpensive IPC.
Post by Benjamin Root
So, the pickling
approach would copy that array for each process, which defeats the purpose,
right?
I am not sure what you mean. When I made shared memory arrays I used named
segments, and made sure only the names of the segments were pickled, not the
contents of the buffers.

Sturla
Joe Kington
2016-05-11 22:39:49 UTC
Permalink
Post by Allan Haldane
In python2 it appears that multiprocessing uses pickle protocol 0 which
must cause a big slowdown (a factor of 100) relative to protocol 2, and
uses pickle instead of cPickle.
Even on Python 2.x, multiprocessing uses protocol 2, not protocol 0. The
default for the `pickle` module changed, but multiprocessing has always
used a binary pickle protocol to communicate between processes. Have a
look at multiprocessing's forking.py in Python 2.7.

As some context here for folks who may not be aware, Sturla is referring
to the earlier shared memory implementation
<https://github.com/sturlamolden/sharedmem-numpy> he wrote, which avoids
actually pickling the data and instead essentially pickles a pointer to an
array in shared memory. As Sturla very nicely summed up, it saves memory
usage, but doesn't help with the deeper issues. You're far better off just
communicating between processes as opposed to using shared memory.
Sturla Molden
2016-05-11 23:02:05 UTC
Permalink
Post by Joe Kington
You're far better off just
communicating between processes as opposed to using shared memory.
Yes.
Allan Haldane
2016-05-11 23:26:03 UTC
Permalink
Post by Allan Haldane
In python2 it appears that multiprocessing uses pickle protocol 0 which
must cause a big slowdown (a factor of 100) relative to protocol 2, and
uses pickle instead of cPickle.
Post by Joe Kington
Even on Python 2.x, multiprocessing uses protocol 2, not protocol 0.
The default for the `pickle` module changed, but multiprocessing has
always used a binary pickle protocol to communicate between processes.
Have a look at multiprocessing's forking.py in Python 2.7.
Are you sure? As far as I understood the code, it uses the default
protocol 0. The file forking.py no longer exists either, by the way.

https://github.com/python/cpython/tree/master/Lib/multiprocessing
(see reduction.py and queues.py)
http://bugs.python.org/issue23403
Sturla Molden
2016-05-11 22:48:23 UTC
Permalink
Post by Allan Haldane
That's interesting. I've also used multiprocessing with numpy and didn't
realize that. Is this true in python3 too?
I am not sure. As you have noticed, pickle is faster by two orders of
magnitude on Python 3. But several microseconds is still a lot, particularly
if we are going to do this often during a computation.

Sturla
Matěj Týč
2016-05-17 09:04:07 UTC
Permalink
I did some work on this some years ago. ...
I am sorry, I have missed this discussion when it started.

There are two cases in which I felt I had to use this functionality:

- parallel processing of HUGE data, and

- using parallel processing in an application that had plug-ins which
operated on one shared array (that was updated every now and then - it
was a producer-consumer pattern thing). Once everything got set up, it
worked like a charm.

The thing I especially like about the proposed module is the lack of
external dependencies + it works if one knows how to use it.

The bad thing about it is its fragility - I admit that using it as-is
is not particularly intuitive. Unlike Sturla, I think that this is not a
dead end, but it indeed feels clumsy. However, I dislike the necessity
of writing Cython or C to get true multithreading, for the reasons I have
mentioned - what if you want to run high-level Python functions in parallel?

So, what I would really like to see is some kind of numpy documentation
on how to approach parallel computing with numpy arrays (depending on
what kind of task one wants to achieve). Maybe just using a queue is
good enough, or maybe there are third-party modules with known
limitations? Plenty of people start off with numpy, so some kind of
overview should be part of the numpy docs.
Sturla Molden
2016-05-17 12:13:42 UTC
Permalink
Post by Matěj Týč
- Parallel processing of HUGE data, and
This is mainly a Windows problem, as copy-on-write fork() will solve this
on any other platform. I am more in favor of asking Microsoft to fix their
broken OS.

Also observe that the usefulness of shared memory is very limited on
Windows, as we in practice never get the same base address in a spawned
process. This prevents sharing data structures with pointers and Python
objects. Anything more complex than an array cannot be shared.

What this means is that shared memory is seldom useful for sharing huge
data, even on Windows. It is only useful for this on Unix/Linux, where base
addresses can stay the same. But on non-Windows platforms, COW will in
99.99% of the cases be sufficient, which makes shared memory superfluous
anyway. We don't need shared memory to scatter large data on Linux, only
fork.
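
A minimal illustration of the fork/COW scatter pattern being described (my
own sketch; Unix only, and only correct as long as the workers treat the
inherited array as read-only):

```
import multiprocessing as mp
import numpy as np

big = np.random.rand(10**7)   # allocated before any worker exists

def partial_sum(k):
    # Under the 'fork' start method the child sees `big` through
    # copy-on-write pages: nothing is pickled or copied while we only read.
    return big[k::4].sum()

if __name__ == '__main__':
    ctx = mp.get_context('fork')   # Unix only
    with ctx.Pool(4) as pool:
        print(sum(pool.map(partial_sum, range(4))))
```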

As I see it, shared memory is mostly useful as a means to construct an
inter-process communication (IPC) protocol.

Sturla
Matěj Týč
2016-05-17 20:49:20 UTC
Permalink
Post by Sturla Molden
Post by Matěj Týč
- Parallel processing of HUGE data, and
This is mainly a Windows problem, as copy-on-write fork() will solve this
on any other platform. ...
That sounds interesting, could you elaborate on it a bit? Does it mean
that if you pass the numpy array to the child process using Queue, no
significant amount of data will flow through it? Or I shouldn't pass it
using Queue at all and just rely on inheritance? Finally, I assume that
passing it as an argument to the Process class is the worst option,
because it will be pickled and unpickled.

Or maybe you are referring to modules such as joblib that use this
functionality and expose only a nice interface?
And finally, COW means that returning large arrays still involves moving
data between processes, whereas the shm approach has the workaround that
you can preallocate the result array in the parent process and let the
worker process write to it.
Post by Sturla Molden
What this means is that shared memory is seldom useful for sharing huge
data, even on Windows. It is only useful for this on Unix/Linux, where base
addresses can stay the same. But on non-Windows platforms, COW will in
99.99% of the cases be sufficient, which makes shared memory superfluous
anyway. We don't need shared memory to scatter large data on Linux, only
fork.
I am actually quite comfortable with sharing numpy arrays only. It is a
nice format for sharing large amounts of numbers, which is what I want
and what many modules accept as input (e.g. the "shapely" module).
Sturla Molden
2016-05-17 21:03:04 UTC
Permalink
Post by Matěj Týč
Does it mean
that if you pass the numpy array to the child process using Queue, no
significant amount of data will flow through it?
This is what my shared memory arrays do.
Post by Matěj Týč
Or I shouldn't pass it
using Queue at all and just rely on inheritance?
This is what David Baddeley's shared memory arrays do.
Post by Matěj Týč
Finally, I assume that
passing it as an argument to the Process class is the worst option,
because it will be pickled and unpickled.
My shared memory arrays only pickle the metadata, and can be used in this
way.
Post by Matěj Týč
Or maybe you refer to modules s.a. joblib that use this functionality
and expose only a nice interface?
Joblib creates "share memory" by memory mapping a temporary file, which is
back by RAM on Libux (tempfs). It is backed by a physical file on disk on
Mac and Windows. In this resepect, joblib is much better on Linux than Mac
or Windows.
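
To make that concrete, a small illustration of the mechanism (not joblib's
actual code; the use of /dev/shm as the tmpfs mount is an assumption that
holds on most Linux systems): back the array by a file and share it by
filename.

```
import os
import tempfile
import numpy as np

# /dev/shm is a tmpfs mount on most Linux systems, so this "file" lives in RAM.
path = os.path.join(tempfile.mkdtemp(dir='/dev/shm'), 'buf.dat')

a = np.memmap(path, dtype='float64', mode='w+', shape=(1000,))
a[:] = 1.0
a.flush()

# Another process only needs the path, dtype and shape to map the same data:
b = np.memmap(path, dtype='float64', mode='r+', shape=(1000,))
b[0] = 2.0      # visible through `a` as well, since both map the same file
```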
Post by Matěj Týč
And finally, COW means that returning large arrays still involves moving
data between processes, whereas the shm approach has the workaround that
you can preallocate the result array in the parent process and let the
worker process write to it.
My shared memory arrays need no workaround for this. They also allow shared
memory arrays to be returned to the parent process. No preallocation is
needed.


Sturla
Elliot Hallmark
2016-05-11 14:41:46 UTC
Permalink
Sturla, this sounds brilliant! To be clear, you're talking about
serializing the numpy array and reconstructing it in a way that's faster
than pickle? Or using shared memory and signaling array creation around
that shared memory rather than using pickle?

For what it's worth, I have used shared memory with numpy arrays as IPC (no
queue), with one process writing to it and one process reading from it, and
liked it. Your point #5 did not apply because I was reusing the shared
memory.

Do you have a public repo where you are working on this?

Thanks!
Elliot
Sturla Molden
2016-05-11 22:48:22 UTC
Permalink
Post by Elliot Hallmark
Sturla, this sounds brilliant! To be clear, you're talking about
serializing the numpy array and reconstructing it in a way that's faster
than pickle?
Yes. We know the binary format of NumPy arrays. We don't need to invoke the
machinery of pickle to serialize an array and write the bytes to some IPC
mechanism (pipe, TCP socket, Unix socket, shared memory). The choice of IPC
mechanism might not even be relevant, and could even be deferred to a
library like ZeroMQ. The point is that if multiple processes are to
cooperate efficiently, we need a way to let them communicate NumPy arrays
quickly. That is where using multiprocessing hurts today, and shared memory
does not help here.

Sturla
Allan Haldane
2016-05-11 23:30:14 UTC
Permalink
Post by Sturla Molden
Post by Elliot Hallmark
Sturla, this sounds brilliant! To be clear, you're talking about
serializing the numpy array and reconstructing it in a way that's faster
than pickle?
Yes. We know the binary format of NumPy arrays. We don't need to invoke the
machinery of pickle to serialize an array and write the bytes to some IPC
mechanism (pipe, TCP socket, Unix socket, shared memory). The choice of IPC
mechanism might not even be relevant, and could even be deferred to a
library like ZeroMQ. The point is that if multiple processes are to
cooperate efficiently, we need a way to let them communicate NumPy arrays
quickly. That is where using multiprocessing hurts today, and shared memory
does not help here.
Sturla
You probably already know this, but I just wanted to note that the
mpi4py module has worked around pickle too. They discuss how they
efficiently transfer numpy arrays in MPI messages here:
http://pythonhosted.org/mpi4py/usrman/overview.html#communicating-python-objects-and-array-data

Of course, not everyone is able to install MPI easily.
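
For readers who have not used mpi4py, the distinction is roughly the
lowercase pickle-based API versus the uppercase buffer-based one; a minimal
sketch (run under MPI, e.g. with something like `mpiexec -n 2 python
script.py`):

```
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    a = np.arange(1000, dtype='d')
    comm.send(a, dest=1, tag=1)                 # lowercase: goes through pickle
    comm.Send([a, MPI.DOUBLE], dest=1, tag=2)   # uppercase: raw buffer, no pickle
elif rank == 1:
    a = comm.recv(source=0, tag=1)
    buf = np.empty(1000, dtype='d')
    comm.Recv([buf, MPI.DOUBLE], source=0, tag=2)
```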
Sturla Molden
2016-05-12 06:27:43 UTC
Permalink
Post by Allan Haldane
You probably already know this, but I just wanted to note that the
mpi4py module has worked around pickle too. They discuss how they
efficiently transfer numpy arrays in MPI messages here:
http://pythonhosted.org/mpi4py/usrman/overview.html#communicating-python-objects-and-array-data
Unless I am mistaken, they use the PEP 3118 buffer interface to support
NumPy as well as a number of other Python objects. However, this protocol
makes buffer acquisition an expensive operation. You can see this in Cython
if you use typed memoryviews: assigning a NumPy array to a typed
memoryview (i.e. buffer acquisition) is slow. They are correct that avoiding
pickle means we save some memory. It also avoids creating and destroying
temporary Python objects, and the associated reference counting. However,
because of the expensive buffer acquisition, I am not sure how much faster
their approach will be. I prefer to use the NumPy C API, and bypass any
unnecessary overhead. The idea is to make IPC of NumPy arrays fast, and
then we cannot have an expensive buffer acquisition in there.

Sturla
Antoine Pitrou
2016-05-12 15:38:10 UTC
Permalink
On Thu, 12 May 2016 06:27:43 +0000 (UTC)
Post by Sturla Molden
Post by Allan Haldane
You probably already know this, but I just wanted to note that the
mpi4py module has worked around pickle too. They discuss how they
efficiently transfer numpy arrays in MPI messages here:
http://pythonhosted.org/mpi4py/usrman/overview.html#communicating-python-objects-and-array-data
Unless I am mistaken, they use the PEP 3118 buffer interface to support
NumPy as well as a number of other Python objects. However, this protocol
makes buffer acquisition an expensive operation.
Can you define "expensive"?
Post by Sturla Molden
You can see this in Cython
if you use typed memoryviews: assigning a NumPy array to a typed
memoryview (i.e. buffer acquisition) is slow.
You're assuming this is the cost of "buffer acquisition", while most
likely it's the cost of creating the memoryview object itself.

Buffer acquisition itself only calls a single C callback and uses a
stack-allocated C structure. It shouldn't be "expensive".

Regards

Antoine.
Sturla Molden
2016-05-12 23:14:36 UTC
Permalink
Post by Antoine Pitrou
Can you define "expensive"?
Slow enough to cause complaints on the Cython mailing list.
Post by Antoine Pitrou
You're assuming this is the cost of "buffer acquisition", while most
likely it's the cost of creating the memoryview object itself.
Constructing a typed memoryview from a typed memoryview or a slice is fast.
Numerical code doing this intensively is still within 80-90% of the speed
of plain C code using pointer arithmetic.
Post by Antoine Pitrou
Buffer acquisition itself only calls a single C callback and uses a
stack-allocated C structure. It shouldn't be "expensive".
I don't know the reason, only that buffer acquisition from NumPy arrays
with typed memoryviews is very expensive compared to assigning a typed
memoryview to another or slicing a typed memoryview.

Sturla
Antoine Pitrou
2016-05-24 11:22:51 UTC
Permalink
On Thu, 12 May 2016 23:14:36 +0000 (UTC)
Post by Sturla Molden
Post by Antoine Pitrou
Can you define "expensive"?
Slow enough to cause complaints on the Cython mailing list.
What kind of complaints? Please be specific.
Post by Sturla Molden
Post by Antoine Pitrou
Buffer acquisition itself only calls a single C callback and uses a
stack-allocated C structure. It shouldn't be "expensive".
I don't know the reason, only that buffer acquisition from NumPy arrays
with typed memoryviews
Again, what have memoryviews to do with it? "Acquiring a buffer" is
just asking the buffer provider (the Numpy array) to fill a Py_buffer
structure's contents. That has nothing to do with memoryviews.

When writing C code to interact with buffer-providing objects, you
usually don't bother with memoryviews at all. You just use a Py_buffer
structure.

Regards

Antoine.
Sturla Molden
2016-05-24 21:03:44 UTC
Permalink
Post by Antoine Pitrou
When writing C code to interact with buffer-providing objects, you
usually don't bother with memoryviews at all. You just use a Py_buffer
structure.
I was talking about "typed memoryviews", which is a Cython abstraction for a
Py_buffer struct. I was not talking about Python memoryviews, which are
something else. When writing Cython code that interacts with a buffer, we
usually use typed memoryviews, not Py_buffer structs directly.

Sturla

Dave
2016-05-12 23:25:55 UTC
Permalink
Post by Antoine Pitrou
Post by Sturla Molden
Unless I am mistaken, they use the PEP 3118 buffer interface to support
NumPy as well as a number of other Python objects. However, this protocol
makes buffer acquisition an expensive operation.
Can you define "expensive"?
Post by Sturla Molden
You can see this in Cython
if you use typed memoryviews: assigning a NumPy array to a typed
memoryview (i.e. buffer acquisition) is slow.
You're assuming this is the cost of "buffer acquisition", while most
likely it's the cost of creating the memoryview object itself.
Buffer acquisition itself only calls a single C callback and uses a
stack-allocated C structure. It shouldn't be "expensive".
Regards
Antoine.
When I looked at it, using a typed memoryview was between 7-50 times
slower than using numpy directly:

http://thread.gmane.org/gmane.comp.python.cython.devel/14626


It looks like there was some improvement since then:

https://github.com/numpy/numpy/pull/3779


...and repeating my experiment shows the deficit is down to 3-11 times
slower.


In [5]: x = randn(10000)

In [6]: %timeit echo_memview(x)
The slowest run took 14.98 times longer than the fastest. This could
mean that an intermediate result is being cached.
100000 loops, best of 3: 5.31 µs per loop

In [7]: %timeit echo_memview_nocast(x)
The slowest run took 10.80 times longer than the fastest. This could
mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.58 µs per loop

In [8]: %timeit echo_numpy(x)
The slowest run took 58.81 times longer than the fastest. This could
mean that an intermediate result is being cached.
1000000 loops, best of 3: 474 ns per loop



-Dave