views: 462
answers: 3
Using Python (3.1 or 2.6), I'm trying to read data from binary data files produced by a GPS receiver. Data for each hour is stored in a separate file, each of which is about 18 MiB. The data files have multiple variable-length records, but for now I need to extract data from just one of the records.

I've got as far as being able to decode, somewhat, the header. I say somewhat because some of the numbers don't make sense, but most do. After spending a few days on this (I've started learning to program using Python), I'm not making progress, so it's time to ask for help.

The reference guide gives me the message header structure and the record structure. Headers can be variable length but are usually 28 bytes.

Header
Field #  Field Name    Field Type    Desc                 Bytes    Offset
1        Sync          char          Hex 0xAA             1        0
2        Sync          char          Hex 0x44             1        1
3        Sync          char          Hex 0x12             1        2
4        Header Lgth   uchar         Length of header     1        3
5        Message ID    ushort        Message ID of log    2        4
8        Message Lgth  ushort        length of message    2        8
11       Time Status   enum          Quality of GPS time  1        13
12       Week          ushort        GPS week number      2        14
13       Milliseconds  GPSec         Time in ms           4        16


Record
Field #  Data                        Bytes         Format     Units       Offset
1        Header                                                           0
2        Number of SV Observations   4             integer    n/a         H
         *For first SV Observation*  
3        PRN                         4             integer    n/a         H+4
4        SV Azimuth angle            4             float      degrees     H+8
5        SV Elevation angle          4             float      degrees     H+12
6        C/N0                        8             double     db-Hz       H+16
7        Total S4                    8             double     n/a         H+24
...
27       L2 C/N0                     8             double     db-Hz       H+148
28       *For next SV Observation*
         An SV Observation is one satellite; there could be anywhere
         from 8 to 13 in view.

Here's my code for trying to make sense of the header:

import struct

filename = "100301_110000.nvd"

f = open(filename, "rb")
s = f.read(28)
x, y, z, lgth, msg_id, mtype, port, mlgth, seq, idletime, timestatus, week, millis, recstatus, reserved, version = struct.unpack("<cccBHcBHHBcHLLHH", s)

print(x, y, z, lgth, msg_id, mtype, port, mlgth, seq, idletime, timestatus, week, millis, recstatus, reserved, version)

It outputs:

b'\xaa' b'D' b'\x12' 28 274 b'\x02' 32 1524 0 78 b'\xa0' 1573 126060000 10485760 3545 35358

The 3 sync fields should return 0xAA 0x44 0x12. (b'D' is the ASCII equivalent of 0x44, I assume.)

The record ID for which I'm looking is 274 - that seems correct.

GPS week is returned as 1573 - that seems correct.

Milliseconds is returned as 126060000 - I was expecting 126015000.

How do I go about finding the records identified as 274 and extracting them? (I'm learning Python, and programming, so keep in mind the answer you give an experienced coder might be over my head.)

+2  A: 

18 MB should fit comfortably in memory, so I'd just gulp the whole thing into one big string of bytes with a single with open(thefile, 'rb') as f: data = f.read() and then perform all the "parsing" on slices to advance record by record. It's more convenient, and may well be faster than doing many small reads from here and there in the file (though it doesn't affect the logic below, because in either case the "current point of interest in the data" is always moving (always forward, as it happens) by amounts computed based on the struct-unpacking of a few bytes at a time, to find the lengths of headers and records).

Given the "start of a record" offset, you can determine its header's length by looking at just one byte ("field four", offset 3 from start of header that's the same as start of record) and look at message ID (next field, 2 bytes) to see if it's the record you care about (so a struct unpack of just those 3 bytes should suffice for that).

Whether it's the record you want or not, you next need to compute the record's length (either to skip it or to get it all): take the start of the record, add the length of the header, then add the next field of the record (the 4 bytes right after the header, giving the number of observations) times the length of one observation (32 bytes, if I read you correctly).

This way you either isolate the substring to be given to struct.unpack (when you've finally reached the record you want), or just add the total length of header + record to the "start of record" offset, to get the offset for the start of the next record.
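A minimal sketch of that record-by-record walk (the names are mine; one deliberate change: instead of deriving the record length from the observation count, it reads the Message Lgth field at offset 8 of the header from the question's table, which works for record types of any layout):

```python
import struct

SYNC = b'\xaa\x44\x12'
TARGET_ID = 274  # the record ID the question is after

def walk_records(data):
    """Yield (message_id, header_length, body) slices, record by record.

    If the format appends a CRC after each message, add its size to the
    step at the bottom of the loop as well.
    """
    pos = 0
    while pos + 10 <= len(data):
        if data[pos:pos + 3] != SYNC:
            raise ValueError('lost sync at offset %d' % pos)
        # field 4: header length (1 byte), field 5: message ID (2 bytes)
        hdr_len, msg_id = struct.unpack_from('<BH', data, pos + 3)
        # Message Lgth field, 2 bytes at offset 8 of the header
        msg_len, = struct.unpack_from('<H', data, pos + 8)
        yield msg_id, hdr_len, data[pos + hdr_len:pos + hdr_len + msg_len]
        pos += hdr_len + msg_len

# Tiny synthetic stream: one record we want (ID 274) and one we skip.
def fake(msg_id, body):
    header = SYNC + struct.pack('<BH', 10, msg_id) + b'\x00\x00'
    return header + struct.pack('<H', len(body)) + body

stream = fake(274, struct.pack('<i', 2)) + fake(99, b'skip me')
found = [body for msg_id, _, body in walk_records(stream) if msg_id == TARGET_ID]
print(len(found))  # -> 1
```

On the real file you would slurp everything once, as described above (with open(filename, 'rb') as f: data = f.read()), and pass data to walk_records.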

Alex Martelli
The OS does some buffering, the overhead by using the read is not so high.
ondra
@ondra, OS's do different amounts of buffering (especially if you `seek` within the file), and in my experience it's often faster (and generally handier), for files of up to a few megabytes, to slurp them all in at once and then work on in-memory data. If performance is a crucial bottleneck, of course, it's well worth benchmarking both possibilities!
Alex Martelli
Thank you Alex. The length of an observation is 148 bytes. I tried to list just a few fields to give a sense of the data with which I'm working - edited my question to show the correct length of the observation.
ljt
@ljt, my answer is still fully applicable if you use 148 as the "length of an observation" multiplier instead of 32!-)
Alex Martelli
+4  A: 

You have to read in pieces. Not because of memory constraints, but because of the parsing requirements. 18MiB fits in memory easily. On a 4Gb machine it fits in memory 200 times over.

Here's the usual design pattern.

  1. Read the first 4 bytes only. Use struct to unpack just those bytes. Confirm the sync bytes and get the header length.

    If you want the rest of the header, you know the length, read the rest of the bytes.

    If you don't want the header, use seek to skip past it.

  2. Read the first four bytes of a record to get the number of SV Observations. Use struct to unpack it.

    Do the math and read the indicated number of bytes to get all the SV Observations in the record.

    Unpack them and do whatever it is you're doing.

    I strongly suggest building namedtuple objects from the data before doing anything else with it.
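As a hedged illustration of the namedtuple idea (the field names are mine, taken from the record table; '<iffdd' covers only the first few listed fields, assuming little-endian byte order):

```python
from collections import namedtuple
import struct

# Only the leading fields of one SV observation, per the record table:
# PRN (int), azimuth (float), elevation (float), C/N0 (double), Total S4 (double)
SVObservation = namedtuple('SVObservation', 'prn azimuth elevation cn0 total_s4')

def unpack_observation(chunk):
    """Unpack the leading fields of one SV observation."""
    return SVObservation._make(struct.unpack_from('<iffdd', chunk, 0))

# Round-trip a made-up observation to show the named field access.
raw = struct.pack('<iffdd', 21, 123.5, 45.0, 43.2, 0.15)
obs = unpack_observation(raw)
print(obs.prn, obs.elevation)  # -> 21 45.0
```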

If you want all the data, you have to actually read all the data.

"and without reading an 18 MiB file one byte at a time)?" I don't understand this constraint. You have to read all the bytes to get all the bytes.

You can use the length information to read the bytes in meaningful chunks. But you can't avoid reading all the bytes.

Also, lots of reads (and seeks) are often fast enough. Your OS buffers for you, so don't worry about trying to micro-optimize the number of reads.

Just follow the "read length -- read data" pattern.
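A sketch of that read-length / read-data loop (hypothetical names; like the steps above, it assumes every record in the file carries the observation-count layout, so for mixed record types you would skip unwanted records using the Message Lgth header field instead):

```python
import struct

OBS_LEN = 148  # bytes per SV observation (the asker's corrected figure)

def read_log(path, wanted_id=274):
    """'Read length -- read data': consume one header/record pair at a time."""
    out = []
    with open(path, 'rb') as f:
        while True:
            start = f.read(6)               # sync (3), header length, message ID
            if len(start) < 6:
                break                       # clean end of file
            if start[:3] != b'\xaa\x44\x12':
                raise ValueError('lost sync')
            hdr_len, msg_id = struct.unpack('<BH', start[3:])
            f.read(hdr_len - 6)             # skip the rest of the header
            n_obs, = struct.unpack('<i', f.read(4))
            data = f.read(n_obs * OBS_LEN)  # do the math, read the record
            if msg_id == wanted_id:
                out.append((n_obs, data))
    return out
```

Each pass through the loop reads exactly one header and one record, so memory use stays small no matter how large the file grows.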

S.Lott
Thanks. By one byte at a time, I meant I'm not constrained by memory or CPU.
ljt
@ljt: "Constrained"? 18MiB is hardly in the neighborhood of a "constraint". If you're as old as I am, you might remember when 18K was a back-breakingly huge file. My laptop has 4G of RAM; an 18M file fits in there 200 times over. You don't need to worry about "constrained" until your files get into the 1Gb or larger size.
S.Lott
+1  A: 

Apart from writing a parser that correctly reads the file, you may try a somewhat brute-force approach: read the data into memory and split it on the 'Sync' sentinel. Warning - you might get some false positives. But...

f = open('filename', 'rb')
data = f.read()
messages = data.split(b'\xaa\x44\x12')
# after the split each chunk starts at the Header Lgth byte, so the
# little-endian message ID 274 (0x0112) sits at bytes 1-2 of the chunk
mymessages = [msg for msg in messages if len(msg) > 5 and msg[1:3] == b'\x12\x01']

But it is rather a nasty hack...

ondra
Thank you. While I'm not convinced I know what I'm doing, I'm perhaps a little less confused than I was yesterday.
ljt