Discussion:
[Numpy-discussion] record array performance issue / bug
G Jones
2015-11-22 03:54:27 UTC
Hi,
Using the latest numpy from anaconda (1.10.1) on Python 2.7, I found that
the following code works OK if npackets = 2, but acts bizarrely if npackets
is large (2**12):

-----------

import numpy as np

npackets = 2**12
dlen = 2048

# One record: a float64 timestamp plus two large int8 sub-arrays.
PacketType = np.dtype([
    ('timestamp', 'float64'),
    ('pkts', ('int8', (npackets, dlen))),
    ('data', ('int8', (npackets * dlen,))),
])

b = np.zeros((1,), dtype=PacketType)

b['timestamp']  # Should return array([0.0])

----------------

Specifically, if npackets is large, e.g. 2**12 or 2**16, accessing
b['timestamp'] pegs the CPU at 100% while memory consumption grows by
hundreds of MB per second. When I interrupt, the traceback points into
numpy/core/_internal.pyc, in _get_all_field_offsets.
Since it works for small values of npackets, I suspect that given enough
memory and time the access to b['timestamp'] would eventually return, so the
issue seems to be that the algorithm doesn't scale well for record dtypes
made up of lots of bytes.
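For a rough sense of scale (a back-of-the-envelope sketch of my own, not a
claim about the exact code path numpy takes): each record of this dtype is
roughly 16 MB, so anything that enumerates per-byte field offsets of a record
has tens of millions of entries to build, which would be consistent with the
CPU and memory behaviour above.

-----------

import numpy as np

npackets = 2**12
dlen = 2048
PacketType = np.dtype([
    ('timestamp', 'float64'),
    ('pkts', ('int8', (npackets, dlen))),
    ('data', ('int8', (npackets * dlen,))),
])

# 8 bytes of timestamp plus two int8 blocks of npackets*dlen bytes each.
print(PacketType.itemsize)        # 16777224, i.e. ~16 MB per record
print(8 + 2 * npackets * dlen)    # same number, computed by hand

----------------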
Looking on GitHub, I can see this code has been in flux recently, but I
can't quite tell whether the problem I'm seeing is covered by the issues
being discussed and tackled there.

Thanks,
Glenn
Charles R Harris
2015-11-22 16:52:04 UTC
Post by G Jones
Using the latest numpy from anaconda (1.10.1) on Python 2.7, I found that
the following code works OK if npackets = 2, but acts bizarrely if npackets
is large (2**12). Accessing b['timestamp'] pegs the CPU at 100% while memory
consumption grows by hundreds of MB per second, and the traceback points
into numpy/core/_internal.pyc, in _get_all_field_offsets.
This should be fixed in 1.10.2. 1.10.2rc1 is up on sourceforge if you want
to test it.
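A quick way to check whether the interpreter you are running already has the
fix described above is to compare the installed version against 1.10.2 (the
cutoff here is taken from this thread, not from the release notes, so treat
it as an assumption):

-----------

import numpy as np
from distutils.version import LooseVersion

# Assumed cutoff, based on the statement above that 1.10.2 contains the fix.
if LooseVersion(np.__version__) < LooseVersion('1.10.2'):
    print("numpy %s: field access on large record dtypes may be very slow"
          % np.__version__)
else:
    print("numpy %s should contain the fix" % np.__version__)

----------------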

Chuck
