Discussion:
[Numpy-discussion] record array performance issue / bug
G Jones
2015-11-22 03:54:27 UTC
Hi,
Using the latest numpy from anaconda (1.10.1) on Python 2.7, I found that
the following code works OK if npackets = 2, but acts bizarrely if npackets
is large (2**12):

-----------

import numpy as np

npackets = 2**12
dlen = 2048

# One record: a float64 timestamp plus two large int8 sub-arrays.
PacketType = np.dtype([
    ('timestamp', 'float64'),
    ('pkts', ('int8', (npackets, dlen))),
    ('data', ('int8', (npackets * dlen,))),
])

b = np.zeros((1,), dtype=PacketType)

b['timestamp']  # Should return array([0.0])

----------------

Specifically, if npackets is large, e.g. 2**12 or 2**16, accessing
b['timestamp'] pegs the CPU at 100% while memory consumption grows by
hundreds of MB per second. When I interrupt, the traceback points into
numpy/core/_internal.pyc, in _get_all_field_offsets.
Since it works for small values of npackets, I suspect that given enough
memory and time the access to b['timestamp'] would eventually return, so the
issue seems to be that the algorithm doesn't scale well for record dtypes
made up of lots of bytes.
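For a rough sense of scale (a back-of-the-envelope sketch of my own, not a
claim about the exact code path numpy takes): each record of this dtype is
roughly 16 MB, so anything that enumerates per-byte field offsets of a record
has tens of millions of entries to build, which would be consistent with the
CPU and memory behaviour above.

-----------

import numpy as np

npackets = 2**12
dlen = 2048
PacketType = np.dtype([
    ('timestamp', 'float64'),
    ('pkts', ('int8', (npackets, dlen))),
    ('data', ('int8', (npackets * dlen,))),
])

# 8 bytes of timestamp plus two int8 blocks of npackets*dlen bytes each.
print(PacketType.itemsize)        # 16777224, i.e. ~16 MB per record
print(8 + 2 * npackets * dlen)    # same number, computed by hand

----------------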
Looking on GitHub, I can see this code has been in flux recently, but I
can't quite tell whether the problem I'm seeing is covered by the issues
being discussed and tackled there.

Thanks,
Glenn
Charles R Harris
2015-11-22 16:52:04 UTC
Post by G Jones
Using the latest numpy from anaconda (1.10.1) on Python 2.7, I found that
the following code works OK if npackets = 2, but acts bizarrely if npackets
is large (2**12). Accessing b['timestamp'] pegs the CPU at 100% while memory
consumption grows by hundreds of MB per second, and the traceback points
into numpy/core/_internal.pyc, in _get_all_field_offsets.
This should be fixed in 1.10.2. 1.10.2rc1 is up on sourceforge if you want
to test it.
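A quick way to check whether the interpreter you are running already has the
fix described above is to compare the installed version against 1.10.2 (the
cutoff here is taken from this thread, not from the release notes, so treat
it as an assumption):

-----------

import numpy as np
from distutils.version import LooseVersion

# Assumed cutoff, based on the statement above that 1.10.2 contains the fix.
if LooseVersion(np.__version__) < LooseVersion('1.10.2'):
    print("numpy %s: field access on large record dtypes may be very slow"
          % np.__version__)
else:
    print("numpy %s should contain the fix" % np.__version__)

----------------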

Chuck
