Discussion:
[Numpy-discussion] caching large allocations on gnu/linux
Julian Taylor
2017-03-13 11:21:24 UTC
Hi,
Numpy often allocates large arrays, and one factor in its performance
is faulting memory from the kernel into the process. This cost is
relatively significant. For example, in this operation on large arrays
it accounts for 10-15% of the runtime:

import numpy as np
a = np.ones(10000000)
b = np.ones(10000000)

%timeit (a * b)**2 + 3

54.45% ipython umath.so [.] sse2_binary_multiply_DOUBLE
20.43% ipython umath.so [.] DOUBLE_add
16.66% ipython [kernel.kallsyms] [k] clear_page
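The fault cost can also be observed without perf; here is a rough sketch
using the stdlib resource module (Linux-specific counters; the buffer size
and variable names are illustrative, not numpy internals):

```python
import mmap
import resource

# Same amount of memory as one of the 10M-element float64 arrays above.
nbytes = 10_000_000 * 8

before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
buf = mmap.mmap(-1, nbytes)              # mapped, but no pages faulted in yet
for off in range(0, nbytes, mmap.PAGESIZE):
    buf[off] = 1                         # first touch faults each page in
after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt

# Roughly one minor fault per touched page (~19500 with 4 KiB pages).
print(after - before)
```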

The reason for this is that the glibc memory allocator uses memory
mapping for large allocations instead of reusing already faulted memory.
It does this so memory is returned to the system immediately when it is
freed, which keeps the whole system more robust.
That makes a lot of sense in general, but not so much for many numerical
applications, which are often the only thing running.
And although an old paper has shown that caching memory in numpy speeds
up many applications, numpy's usage is diverse, so we couldn't really
diverge from the glibc behaviour.

That changed when Linux 4.5 added support for madvise(MADV_FREE). This
flag of the madvise syscall tells the kernel that a piece of memory can
be reused by other processes if there is memory pressure. Should another
process claim the memory and the original process then want to use it
again, the kernel will fault new memory into its place, so it behaves
exactly as if it had just been freed regularly.
But when no other process claims the memory and the original process
wants to reuse it, the memory does not need to be faulted again.
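Since Python 3.8 the mechanism is easy to demonstrate from Python itself
via mmap.madvise; a minimal sketch, assuming a Linux 4.5+ kernel (the
MADV_FREE constant is only exposed by the mmap module where the platform
supports it):

```python
import mmap

nbytes = 8 * 1024 * 1024
buf = mmap.mmap(-1, nbytes)     # anonymous private mapping
buf[:] = b"x" * nbytes          # fault all pages in

# On "free", mark the pages reclaimable instead of unmapping them.  The
# kernel only takes them back under memory pressure; if nobody does, a
# later reuse of this buffer pays no new page faults.
if hasattr(mmap, "MADV_FREE"):      # Linux >= 4.5
    buf.madvise(mmap.MADV_FREE)
else:
    buf.madvise(mmap.MADV_DONTNEED)  # fallback: pages are dropped immediately
```

After MADV_FREE the contents are undefined (either the old data or zeros,
depending on whether the kernel reclaimed the pages), which is why this
only works for memory the application considers freed.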

So effectively this flag allows us to cache memory inside numpy that can
be reused by the rest of the system if required.
Doing so gives the expected speedup in the above example.

An issue is that the memory usage of numpy applications will seem to
increase. The memory that is actually free will still show up in the
usual places you look at memory usage. Namely the resident memory usage
of the process in top, /proc etc. The usage will only go down when the
memory is actually needed by other processes.
This would probably break some memory profiling tools, so we likely need
a switch that disables the caching when profiling.
Another concern is that using this functionality is really the job of
the system memory allocator, but I had a look at glibc's allocator and
it does not look like an easy job to make good use of MADV_FREE
retroactively, so I don't expect that to happen anytime soon.


Should it be agreed that caching is worthwhile, I would propose a very
simple implementation. We only really need to cache a small handful of
array data pointers for the fast allocate/deallocate cycles that appear
in common numpy usage.
For example, a small list of maybe 4 pointers storing the 4 largest
recent deallocations. New allocations just pick the first memory block
of sufficient size.
The cache would only be active on systems that support MADV_FREE (which
is Linux 4.5 and probably BSD too).
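In Python pseudocode the proposed policy could look roughly like this (a
hypothetical sketch; `raw_alloc`/`raw_free` stand in for the real C
allocator, and the madvise call on the cached pages is elided):

```python
CACHE_SLOTS = 4
_cache = []  # recently freed (block, nbytes) pairs, newest first

def cached_alloc(nbytes, raw_alloc):
    """Return the first cached block of sufficient size, else allocate."""
    for i, (block, size) in enumerate(_cache):
        if size >= nbytes:
            del _cache[i]
            return block
    return raw_alloc(nbytes)

def cached_free(block, nbytes, raw_free):
    """Cache the freed block; evict the smallest entry when full."""
    if len(_cache) >= CACHE_SLOTS:
        # dropping the smallest keeps the largest recent deallocations
        smallest = min(range(len(_cache)), key=lambda i: _cache[i][1])
        evicted, _ = _cache.pop(smallest)
        raw_free(evicted)
    # the real implementation would madvise(MADV_FREE) the pages here
    _cache.insert(0, (block, nbytes))
```

A freed block is handed back for any request that fits; with MADV_FREE
the unused tail of an oversized block costs nothing, which is what makes
such a crude policy viable at all.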

So what do you think of this idea?

cheers,
Julian
Anne Archibald
2017-03-13 15:21:57 UTC
On Mon, Mar 13, 2017 at 12:21 PM Julian Taylor <
***@googlemail.com> wrote:

Post by Julian Taylor
Should it be agreed that caching is worthwhile I would propose a very
simple implementation. We only really need to cache a small handful of
array data pointers for the fast allocate deallocate cycle that appear
in common numpy usage.
For example a small list of maybe 4 pointers storing the 4 largest
recent deallocations. New allocations just pick the first memory block
of sufficient size.
The cache would only be active on systems that support MADV_FREE (which
is linux 4.5 and probably BSD too).
So what do you think of this idea?
This is an interesting thought, and potentially a nontrivial speedup with
zero user effort. But coming up with an appropriate caching policy is going
to be tricky. The thing is, for each array, numpy grabs a block "the right
size", and that size can easily vary by orders of magnitude, even within
the temporaries of a single expression as a result of broadcasting. So
simply giving each new array the smallest cached block that will fit could
easily result in small arrays in giant allocated blocks, wasting
non-reclaimable memory. So really you want to recycle blocks of the same
size, or nearly, which argues for a fairly large cache, with smart indexing
of some kind.
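One cheap way to get that "same size or nearly" matching, as a sketch
(the binning scheme and all names here are illustrative, not a numpy
proposal): bucket freed blocks by the power of two of their size, so a
lookup only ever considers blocks within 2x of the request.

```python
from collections import defaultdict

_bins = defaultdict(list)  # size class -> freed (block, nbytes) pairs

def _size_class(nbytes):
    # ceil(log2(nbytes)): blocks of 65..128 bytes share class 7, etc.
    return max(nbytes - 1, 1).bit_length()

def binned_free(block, nbytes):
    _bins[_size_class(nbytes)].append((block, nbytes))

def binned_alloc(nbytes, raw_alloc):
    # Only search the request's own size class; a hit wastes < 2x memory.
    for entry in _bins[_size_class(nbytes)]:
        block, size = entry
        if size >= nbytes:
            _bins[_size_class(nbytes)].remove(entry)
            return block
    return raw_alloc(nbytes)
```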

How much difference is this likely to make? Note that numpy is now in some
cases able to eliminate allocation of temporary arrays.

I think the only way to answer these questions is to set up a trial
implementation, with user-switchable behaviour (which should include the
ability for users to switch it on even when MADV_FREE is not available) and
sensible statistics reporting. Then volunteers can run various numpy
workloads past it.

Anne
Julian Taylor
2017-03-13 17:11:40 UTC
Post by Anne Archibald
On Mon, Mar 13, 2017 at 12:21 PM Julian Taylor
Should it be agreed that caching is worthwhile I would propose a very
simple implementation. We only really need to cache a small handful of
array data pointers for the fast allocate deallocate cycle that appear
in common numpy usage.
For example a small list of maybe 4 pointers storing the 4 largest
recent deallocations. New allocations just pick the first memory block
of sufficient size.
The cache would only be active on systems that support MADV_FREE (which
is linux 4.5 and probably BSD too).
So what do you think of this idea?
This is an interesting thought, and potentially a nontrivial speedup
with zero user effort. But coming up with an appropriate caching policy
is going to be tricky. The thing is, for each array, numpy grabs a block
"the right size", and that size can easily vary by orders of magnitude,
even within the temporaries of a single expression as a result of
broadcasting. So simply giving each new array the smallest cached block
that will fit could easily result in small arrays in giant allocated
blocks, wasting non-reclaimable memory. So really you want to recycle
blocks of the same size, or nearly, which argues for a fairly large
cache, with smart indexing of some kind.
The nice thing about MADV_FREE is that we don't need any clever cache.
The same process that marked the pages free can reclaim them in another
allocation; at least that is what my testing indicates it allows.
So a small allocation getting a huge memory block does not waste memory,
as the top unused part will get reclaimed when needed, either by numpy
itself doing another allocation or by a different program on the system.

An issue that does arise, though, is that this memory is not available
for the page cache that is used to cache on-disk data. Too large a cache
might then be detrimental for IO-heavy workloads that rely on the page
cache. So we might want to cap it to some max size, provide an explicit
on/off switch and/or have numpy IO functions clear the cache.
Francesc Alted
2017-03-13 18:54:20 UTC
Post by Julian Taylor
Post by Anne Archibald
On Mon, Mar 13, 2017 at 12:21 PM Julian Taylor
Should it be agreed that caching is worthwhile I would propose a very
simple implementation. We only really need to cache a small handful of
array data pointers for the fast allocate deallocate cycle that appear
in common numpy usage.
For example a small list of maybe 4 pointers storing the 4 largest
recent deallocations. New allocations just pick the first memory block
of sufficient size.
The cache would only be active on systems that support MADV_FREE (which
is linux 4.5 and probably BSD too).
So what do you think of this idea?
This is an interesting thought, and potentially a nontrivial speedup
with zero user effort. But coming up with an appropriate caching policy
is going to be tricky. The thing is, for each array, numpy grabs a block
"the right size", and that size can easily vary by orders of magnitude,
even within the temporaries of a single expression as a result of
broadcasting. So simply giving each new array the smallest cached block
that will fit could easily result in small arrays in giant allocated
blocks, wasting non-reclaimable memory. So really you want to recycle
blocks of the same size, or nearly, which argues for a fairly large
cache, with smart indexing of some kind.
The nice thing about MADV_FREE is that we don't need any clever cache.
The same process that marked the pages free can reclaim them in another
allocation, at least that is what my testing indicates it allows.
So a small allocation getting a huge memory block does not waste memory
as the top unused part will get reclaimed when needed, either by numpy
itself doing another allocation or a different program on the system.
Well, what you say makes a lot of sense to me, so if you have tested
that then I'd say this is worth a PR, and we can see how it works on
different workloads.
Post by Julian Taylor
An issue that does arise though is that this memory is not available for
the page cache used for caching on disk data. A too large cache might
then be detrimental for IO heavy workloads that rely on the page cache.
Yeah. Also, memory mapped arrays use the page cache intensively, so we
should test this use case and see how the caching affects memory map
performance.
Post by Julian Taylor
So we might want to cap it to some max size, provide an explicit on/off
switch and/or have numpy IO functions clear the cache.
Definitely, dynamically allowing this feature to be disabled would be
desirable. That would provide an easy path for testing how it affects
performance. Would that be feasible?


Francesc
Julian Taylor
2017-03-14 15:21:33 UTC
Post by Francesc Alted
Post by Julian Taylor
So we might want to cap it to some max size, provide an explicit on/off
switch and/or have numpy IO functions clear the cache.
Definitely, dynamically allowing this feature to be disabled would be
desirable. That would provide an easy path for testing how it affects
performance. Would that be feasible?
I have created a PR with such a simple cache implemented:
https://github.com/numpy/numpy/pull/8783

This sets the maximum number of memory pointers to save and returns the
old value:
np.core.multiarray.set_memory_cache_size(4)
On systems where it works it returns a value greater than 0 (currently
4). The size of the cache in bytes is currently unbounded.
Setting the value to 0 clears and disables the cache.

You should probably not expect too large performance improvements. The
cache only has an effect in applications with measurable page-faulting
overhead, which requires lots of relatively short operations that create
copies of large arrays: mostly operations with temporaries and maybe
some indexing operations.
