G Jones
2015-11-22 03:54:27 UTC
Hi,
Using the latest numpy from anaconda (1.10.1) on Python 2.7, I found that
the following code works OK if npackets = 2, but acts bizarrely if npackets
is large (2**12):
-----------
import numpy as np

npackets = 2**12
dlen = 2048
PacketType = np.dtype([('timestamp', 'float64'),
                       ('pkts', np.dtype(('int8', (npackets, dlen)))),
                       ('data', np.dtype(('int8', (npackets*dlen,)))),
                       ])
b = np.zeros((1,), dtype=PacketType)
b['timestamp']  # Should return array([0.0])
----------------
Specifically, if npackets is large, e.g. 2**12 or 2**16, trying to access
b['timestamp'] results in 100% CPU usage while memory consumption
increases by hundreds of MB per second. When I interrupt, the traceback
points to numpy/core/_internal.pyc : _get_all_field_offsets.
Since it seems to work for small values of npackets, I suspect that if I
had the memory and time, the access to b['timestamp'] would eventually
return, so I think the issue is that the algorithm doesn't scale well with
record dtypes made up of lots of bytes.
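A minimal sketch of how the scaling could be checked, assuming it is safe to try only small values of npackets. The helper names make_packet_dtype and time_field_access are made up for illustration and are not part of numpy. It prints the dtype's itemsize next to the time taken to read b['timestamp'], so any growth with npackets should show up before memory becomes a problem:

```python
from timeit import default_timer  # available on both Python 2.7 and 3

import numpy as np


def make_packet_dtype(npackets, dlen=2048):
    """Build the record dtype from the example above (hypothetical helper)."""
    return np.dtype([
        ('timestamp', 'float64'),
        ('pkts', np.dtype(('int8', (npackets, dlen)))),
        ('data', np.dtype(('int8', (npackets * dlen,)))),
    ])


def time_field_access(npackets, dlen=2048):
    """Return (itemsize in bytes, seconds spent reading b['timestamp'])."""
    dtype = make_packet_dtype(npackets, dlen)
    b = np.zeros((1,), dtype=dtype)
    start = default_timer()
    b['timestamp']  # the access that appears to hang for large npackets
    elapsed = default_timer() - start
    return dtype.itemsize, elapsed


if __name__ == "__main__":
    # Deliberately small values only; larger ones may exhaust memory
    # on an affected numpy version.
    for npackets in (1, 2, 4, 8, 16):
        itemsize, elapsed = time_field_access(npackets)
        print("npackets=%d itemsize=%d bytes access=%.6f s"
              % (npackets, itemsize, elapsed))
```

If the access time grows roughly in step with itemsize, that would be consistent with the guess that the cost depends on the number of bytes in the record rather than on the number of fields.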
Looking on GitHub, I can see this code has been in flux recently, but I
can't quite tell whether the problem I'm seeing is covered by the issues
being discussed and tackled there.
Thanks,
Glenn