Julian Taylor
2017-03-13 11:21:24 UTC
Hi,
As numpy often allocates large arrays and one factor in its performance
is faulting memory from the kernel to the process. This has some cost
that is relatively significant. For example in this operation on large
arrays it accounts for 10-15% of the runtime:
import numpy as np
a = np.ones(10000000)
b = np.ones(10000000)
%timeit (a * b)**2 + 3
54.45% ipython umath.so [.] sse2_binary_multiply_DOUBLE
20.43% ipython umath.so [.] DOUBLE_add
16.66% ipython [kernel.kallsyms] [k] clear_page
The reason for this is that the glibc memory allocator uses memory
mapping for large allocations instead of reusing already faulted memory.
The reason for this is to return memory back to the system immediately
when it is free to keep the whole system more robust.
This makes a lot of sense in general but not so much for many numerical
applications that often are the only thing running.
But despite if have been shown in an old paper that caching memory in
numpy speeds up many applications, numpys usage is diverse so we
couldn't really diverge from the glibc behaviour.
Until Linux 4.5 added support for madvise(MADV_FREE). This flag of the
madvise syscall tells the kernel that a piece of memory can be reused by
other processes if there is memory pressure. Should another process
claim the memory and the original process want to use it again the
kernel will fault new memory into its place so it behaves exactly as if
it was just freed regularly.
But when no other process claims the memory and the original process
wants to reuse it, the memory do not need to be faulted again.
So effectively this flag allows us to cache memory inside numpy that can
be reused by the rest of the system if required.
Doing gives the expected speedup in the above example.
An issue is that the memory usage of numpy applications will seem to
increase. The memory that is actually free will still show up in the
usual places you look at memory usage. Namely the resident memory usage
of the process in top, /proc etc. The usage will only go down when the
memory is actually needed by other processes.
This probably would break some of the memory profiling tools so probably
we need a switch to disable the caching for the profiling tools to use.
Another concern is that using this functionality is actually the job of
the system memory allocator but I had a look at glibcs allocator and it
does not look like an easy job to make good use of MADV_FREE
retroactively, so I don't expect this to happen anytime soon.
Should it be agreed that caching is worthwhile I would propose a very
simple implementation. We only really need to cache a small handful of
array data pointers for the fast allocate deallocate cycle that appear
in common numpy usage.
For example a small list of maybe 4 pointers storing the 4 largest
recent deallocations. New allocations just pick the first memory block
of sufficient size.
The cache would only be active on systems that support MADV_FREE (which
is linux 4.5 and probably BSD too).
So what do you think of this idea?
cheers,
Julian
As numpy often allocates large arrays and one factor in its performance
is faulting memory from the kernel to the process. This has some cost
that is relatively significant. For example in this operation on large
arrays it accounts for 10-15% of the runtime:
import numpy as np
a = np.ones(10000000)
b = np.ones(10000000)
%timeit (a * b)**2 + 3
54.45% ipython umath.so [.] sse2_binary_multiply_DOUBLE
20.43% ipython umath.so [.] DOUBLE_add
16.66% ipython [kernel.kallsyms] [k] clear_page
The reason for this is that the glibc memory allocator uses memory
mapping for large allocations instead of reusing already faulted memory.
The reason for this is to return memory back to the system immediately
when it is free to keep the whole system more robust.
This makes a lot of sense in general but not so much for many numerical
applications that often are the only thing running.
But despite if have been shown in an old paper that caching memory in
numpy speeds up many applications, numpys usage is diverse so we
couldn't really diverge from the glibc behaviour.
Until Linux 4.5 added support for madvise(MADV_FREE). This flag of the
madvise syscall tells the kernel that a piece of memory can be reused by
other processes if there is memory pressure. Should another process
claim the memory and the original process want to use it again the
kernel will fault new memory into its place so it behaves exactly as if
it was just freed regularly.
But when no other process claims the memory and the original process
wants to reuse it, the memory do not need to be faulted again.
So effectively this flag allows us to cache memory inside numpy that can
be reused by the rest of the system if required.
Doing gives the expected speedup in the above example.
An issue is that the memory usage of numpy applications will seem to
increase. The memory that is actually free will still show up in the
usual places you look at memory usage. Namely the resident memory usage
of the process in top, /proc etc. The usage will only go down when the
memory is actually needed by other processes.
This probably would break some of the memory profiling tools so probably
we need a switch to disable the caching for the profiling tools to use.
Another concern is that using this functionality is actually the job of
the system memory allocator but I had a look at glibcs allocator and it
does not look like an easy job to make good use of MADV_FREE
retroactively, so I don't expect this to happen anytime soon.
Should it be agreed that caching is worthwhile I would propose a very
simple implementation. We only really need to cache a small handful of
array data pointers for the fast allocate deallocate cycle that appear
in common numpy usage.
For example a small list of maybe 4 pointers storing the 4 largest
recent deallocations. New allocations just pick the first memory block
of sufficient size.
The cache would only be active on systems that support MADV_FREE (which
is linux 4.5 and probably BSD too).
So what do you think of this idea?
cheers,
Julian