Discussion:
[Numpy-discussion] read not byte aligned records
Gmail
10 years ago
Hi,

I am developing code to read binary files (MDF, Measurement Data File).
In the previous version 3 of the format, data was always byte aligned. I
made extensive use of the numpy.core.records module (fromstring,
fromfile), which showed good performance for reading and unpacking data
on the fly.
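For the byte-aligned case, a structured dtype already handles such records in one call. A minimal sketch (the field names and layout here are invented for illustration):

```python
import numpy as np

# Hypothetical byte-aligned record: a uint32 counter plus a float32 sample
dt = np.dtype([('counter', '<u4'), ('value', '<f4')])

raw = np.array([(1, 0.5), (2, 1.5)], dtype=dt).tobytes()  # fake file contents
rec = np.frombuffer(raw, dtype=dt)  # zero-copy view, one element per record
```

Each field is then available as `rec['counter']`, `rec['value']`, with no per-record Python loop.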
However, in the latest version 4, data is not necessarily byte aligned.
This reduces file size, especially when the raw data does not fill whole
bytes, e.g. 10-bit samples from an analog-to-digital converter. For
instance, a record structure could be:
uint64, float32, uint8, uint10, 6 bits padding, uint9, 7 bits padding,
uint24, uint24, uint24, etc.

I found a way to read these non-aligned records using the bitstring
module instead of numpy.core.records, but performance is much worse:
roughly 10x slower in pure Python (I have not tried its Cython
implementation yet).

Is there a pure numpy way to do this?

Regards

Aymeric
Jerome Kieffer
10 years ago
Hi,
If you want to play with 10-bit data blocks, read 5 bytes and work with 4 entries at a time...
--
Jérôme Kieffer
Data analysis unit - ESRF
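A pure-numpy sketch of that suggestion, assuming 4 little-endian 10-bit values packed into every 5 bytes (the function name and packing convention are assumptions, and the input length must be a multiple of 5 bytes):

```python
import numpy as np

def unpack_10bit(raw):
    """Unpack little-endian 10-bit unsigned values packed 4 per 5 bytes."""
    groups = np.frombuffer(raw, dtype=np.uint8).reshape(-1, 5).astype(np.uint64)
    # Fold each 5-byte group into a single 40-bit integer (little-endian)
    word = (groups[:, 0] | (groups[:, 1] << 8) | (groups[:, 2] << 16)
            | (groups[:, 3] << 24) | (groups[:, 4] << 32))
    out = np.empty((groups.shape[0], 4), dtype=np.uint16)
    for k in range(4):  # slice the four 10-bit fields out of each word
        out[:, k] = (word >> (10 * k)) & 0x3FF
    return out.ravel()
```

All the shifting and masking is vectorised over the whole file, so the only Python-level loop is over the 4 field positions.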
Nathaniel Smith
10 years ago
Post by Jerome Kieffer
Hi,
If you want to play with 10-bit data blocks, read 5 bytes and work with 4 entries at a time...
NumPy arrays don't have any support for sub-byte alignment. So if you
want to handle such data, you either need to write some manual
packing/unpacking code (using bitshift operators, or perhaps
np.unpackbits, or whatever), or use another library designed for doing
this. You may find Cython useful to write the core packing/unpacking,
since bit-by-bit processing in a for loop is not something that
CPython is super well suited to.
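One way the np.unpackbits route could be sketched, assuming little-endian bit numbering within fixed-size records (the helper and its arguments are invented here, and `bitorder='little'` requires NumPy >= 1.17):

```python
import numpy as np

def extract_field(raw, bit_offset, bit_len, record_bytes):
    """Pull an unsigned little-endian bit field out of every record."""
    recs = np.frombuffer(raw, dtype=np.uint8).reshape(-1, record_bytes)
    bits = np.unpackbits(recs, axis=1, bitorder='little')  # one column per bit
    field = bits[:, bit_offset:bit_offset + bit_len].astype(np.uint64)
    weights = 2 ** np.arange(bit_len, dtype=np.uint64)  # bit i weighs 2**i
    return (field * weights).sum(axis=1)
```

This trades 8x memory expansion (one byte per bit) for fully vectorised extraction at arbitrary, non-byte-aligned offsets.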

Good luck,
-n
--
Nathaniel J. Smith -- http://vorpus.org
a***@gmail.com
10 years ago
Hi,
To answer Jerome (I hope): the data is sometimes spread over bytes shared with other data within the record. 10 bits was just an example; fields can be 24, 2, 8, 7 bits etc., all combined, with padding between them. I am not sure I have understood your suggestion...

To Nathaniel: yes, indeed I could read the records into wide integers and apply the right_shift and bitwise_and functions to extract each channel. I am a bit worried about performance, though.

I am currently using the bitstring module, which does exactly this bit handling. It is implemented in both pure Python and Cython.
With the pure Python implementation, the performance penalty compared to byte-aligned data is around 2-3x for similar file sizes.
--> I will try bitstring's Cython implementation.
--> I will also try the approach using right_shift and bitwise_and.
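For illustration, the right_shift/bitwise_and approach on a hypothetical 4-byte record holding a uint10 at bit 0 and a uint9 at bit 16 (the layout and names are invented, not from the MDF spec):

```python
import numpy as np

def split_channels(raw):
    """Extract two packed channels from little-endian 4-byte records."""
    rec = np.frombuffer(raw, dtype='<u4')                 # one uint32 per record
    ch1 = np.bitwise_and(rec, 0x3FF)                      # 10 bits at offset 0
    ch2 = np.bitwise_and(np.right_shift(rec, 16), 0x1FF)  # 9 bits at offset 16
    return ch1, ch2
```

Each channel costs one shift and one mask over the whole array, so the per-record work stays inside numpy.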
The best one will win, but at least your answers confirm I am not missing any trick or optimisation and that I am heading in the right direction.
Thanks !
Regards
Aymeric
...
Benjamin Root
10 years ago
I have been very happy with the bitarray package. I don't know whether it is
faster than bitstring, but it is worth a mention. Just watch out for hashing
operations on its objects (set(), dict(), etc.); it doesn't seem to do them
right, but comparison operations work just fine.

Ben Root
Gmail
10 years ago
For the archive: I tried bitarray instead of bitstring, and parsing the
same file went from 180 ms to 60 ms. The code ended up shorter and
simpler, but harder to get into at first (the documentation is thinner).


Performance is still far from fromstring or fromfile, which take around
5 ms for a similar file size when the data is byte aligned.

Aymeric


my code is below:

def readBitarray(self, bita, channelList=None):
    """Reads a stream of record bytes using the bitarray module,
    needed for data that is not byte aligned.

    Parameters
    ----------
    bita : stream
        stream of bytes
    channelList : list of str, optional
        channels to extract; defaults to all channels

    Returns
    -------
    rec : numpy recarray
        matrix of raw data (one attribute per channel name)
    """
    from bitarray import bitarray
    from numpy import recarray, asarray
    B = bitarray(endian="little")  # little endian by default
    B.frombytes(bytes(bita))
    # initialise data structure
    if channelList is None:
        channelList = self.channelNames
    format = []
    for channel in self:
        if channel.name in channelList:
            format.append(channel.RecordFormat)
    buf = recarray(self.numberOfRecords, format)
    # read data
    record_bit_size = self.CGrecordLength * 8
    for chan in range(len(self)):
        if self[chan].name in channelList:
            # slice each channel's bits out of every record
            temp = [B[self[chan].posBitBeg + record_bit_size * i:
                      self[chan].posBitEnd + record_bit_size * i]
                    for i in range(self.numberOfRecords)]
            nbytes = len(temp[0].tobytes())
            if nbytes != self[chan].nBytes and \
                    self[chan].signalDataType not in (6, 7, 8, 9, 10, 11, 12):
                # not a C-type byte length: pad each slice with zero bits
                # so its byte length matches what numpy expects
                padding = 8 * (self[chan].nBytes - nbytes) * bitarray([False])
                for i in range(self.numberOfRecords):
                    temp[i].extend(padding)
            temp = [self[chan].CFormat.unpack(temp[i].tobytes())[0]
                    for i in range(self.numberOfRecords)]
            buf[self[chan].name] = asarray(temp)
    return buf
_______________________________________________
NumPy-Discussion mailing list
http://mail.scipy.org/mailman/listinfo/numpy-discussion