I need to extract financial price data from a binary file. This price data is normally extracted by a piece of C# code. The biggest problem I'm having is getting a meaningful datetime.

The binary data looks like this:

'\x14\x11\x00\x00{\x14\xaeG\xe1z(@\x9a\x99\x99\x99\x99\x99(@q=\n\xd7\xa3p(@\x9a\x99\x99\x99\x99\x99(@\xac\x00\x19\x00\x00\x00\x00\x00\x08\x01\x00\x00\x00"\xd8\x18\xe0\xdc\xcc\x08'

The C# code that extracts it correctly is:

StockID = reader.ReadInt32();
Open = reader.ReadDouble();
High = reader.ReadDouble();
Low = reader.ReadDouble();
Close = reader.ReadDouble();
Volume = reader.ReadInt64();
TotalTrades = reader.ReadInt32();
Timestamp = reader.ReadDateTime();

This is where I've gotten in Python. I have a couple of concerns about it.

In [1]: barlength = 56; barformat = 'i4dqiq'
In [2]: pricebar = f.read(barlength)
In [3]: pricebar
Out[3]: '\x95L\x00\x00)\\\x8f\xc2\xf5\xc8N@D\x1c\xeb\xe26\xcaN@\x7fj\xbct\x93\xb0N@\xd7\xa3p=\n\xb7N@\xf6\xdb\x02\x00\x00\x00\x00\x00J\x03\x00\x00\x00"\xd8\x18\xe0\xdc\xcc\x08'
In [4]: struct.unpack(barformat, pricebar)
Out[4]: 
(19605,                # stock id
 61.57,                # open
 61.579800000000006,   # high
 61.3795,              # low
 61.43,                # close
 187382,               # volume -- seems reasonable
 842,                  # TotalTrades -- seems reasonable
 634124502600000000L   # datetime -- no idea what this means!
)

I used Python's built-in struct module but have some concerns about it.

  1. I'm not sure which format characters correspond to Int32 and Int64 in the C# code, though several different attempts returned the same Python tuple.

  2. The output for some of the fields doesn't seem to be very sensitive to the format I specify. For example, the TotalTrades field returns the same value whether I specify it as a signed or unsigned int or a signed or unsigned long (i, I, l, or L).

  3. I can't make any sense of the value returned for the date field. This is actually my biggest problem.

A: 

Without seeing the C# source containing the ReadInt32, ReadDouble, ReadDateTime, etc. methods, it is impossible to give a definitive answer, but...

  1. I'm not really sure what the difference is between the i and l format characters, but I think you're correct in using i/l for Int32 and q for Int64.

  2. Again, I don't know the difference between the i/l or I/L format characters, but since they all represent 32-bit integers then their binary representation should be the same for all values between 0 and 2147483647 inclusive. If it's possible for TotalTrades to be negative, or exceed 2147483647, then you should investigate further. If not then don't worry about it.

  3. It looks to me like your serialized date field is probably equivalent to DateTime.Ticks.

    If that's the case then the serialized value will be the number of ticks -- that is, the number of 100 nanosecond intervals -- since 00:00:00 on 1 January 0001.

    By that reckoning, the value shown in your question -- 634124502600000000 -- would represent 09:31:00 on 18 June 2010.
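
As a rough sketch of points 1 and 2, the record can be unpacked with an explicit `<` prefix, which tells `struct` to use little-endian byte order, standard sizes, and no alignment padding. This assumes the file was written by .NET's BinaryWriter (which is little-endian) with the fields packed back to back; the variable names are illustrative only. The sample bytes are the 56-byte record from the question.

```python
import struct

# '<' = little-endian, standard sizes, no padding between fields.
# Assumption: the file holds the fields back to back, as BinaryWriter emits them.
BAR_FORMAT = '<i4dqiq'    # Int32, 4 x Double, Int64, Int32, Int64 (ticks)
BAR_LENGTH = struct.calcsize(BAR_FORMAT)

# Sample record copied from the question, split per field for readability.
pricebar = (b'\x95L\x00\x00'                       # StockID
            b')\\\x8f\xc2\xf5\xc8N@'               # Open
            b'D\x1c\xeb\xe26\xcaN@'                # High
            b'\x7fj\xbct\x93\xb0N@'                # Low
            b'\xd7\xa3p=\n\xb7N@'                  # Close
            b'\xf6\xdb\x02\x00\x00\x00\x00\x00'    # Volume
            b'J\x03\x00\x00'                       # TotalTrades
            b'\x00"\xd8\x18\xe0\xdc\xcc\x08')      # Timestamp (raw 64-bit value)

(stock_id, open_, high, low, close,
 volume, total_trades, ticks) = struct.unpack(BAR_FORMAT, pricebar)

print(BAR_LENGTH, stock_id, volume, total_trades, ticks)
```

Without the prefix, `struct` uses the platform's native sizes and alignment, so the same format string may expect padding bytes and a different record length on another machine.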

LukeH
i/I and l/L are for signed/unsigned int and long respectively. Thanks for the response.
Arthur Dent
@Arthur: I meant that I wasn't sure of the difference between `i` and `l` (both described as signed 32-bit integers) or between `I` and `L` (both described as unsigned 32-bit integers). I guess the naming is a throwback to C/C++ where the size of ints and longs is implementation-dependent, although as far as the `struct` module is concerned they appear to be exactly the same.
LukeH
A: 

As far as I know, .NET timestamps are ticks since 0001-01-01T00:00:00Z, where a tick is 100 nanoseconds. So:

>>> x = 634124502600000000
>>> secs = x / 10.0 ** 7
>>> secs
63412450260.0
>>> import datetime
>>> delta = datetime.timedelta(seconds=secs)
>>> delta
datetime.timedelta(733940, 34260)
>>> ts = datetime.datetime(1,1,1) + delta
>>> ts
datetime.datetime(2010, 6, 18, 9, 31)
>>>

The date part is 2010-06-18. Are you in a timezone that's 9.5 hours away from UTC? It would be rather useful in verifying this calculation if you were to supply TWO timestamp values together with the expected answers.
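
The tick arithmetic above could be wrapped in a small helper. This is only a sketch: `ticks_to_datetime` is a made-up name, it assumes the stored value is a plain tick count as guessed above, and it returns a naive datetime with no timezone attached. It uses integer division instead of floating point so that sub-second ticks are not lost to rounding; anything finer than a microsecond is truncated, because that is the resolution of Python's `datetime`.

```python
from datetime import datetime, timedelta

TICKS_PER_MICROSECOND = 10          # one tick is 100 nanoseconds
DOTNET_EPOCH = datetime(1, 1, 1)    # tick 0 = 0001-01-01 00:00:00

def ticks_to_datetime(ticks):
    """Convert a tick count to a naive datetime (hypothetical helper).

    Assumes `ticks` counts 100 ns intervals since 0001-01-01 00:00:00.
    """
    return DOTNET_EPOCH + timedelta(microseconds=ticks // TICKS_PER_MICROSECOND)

print(ticks_to_datetime(634124502600000000))
```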

Addressing your concern that "the output for some of the fields doesn't seem to be very sensitive to the format I specify" (TotalTrades returns the same amount for l, L, i, or I): the results are not sensitive because (1) "long" and "int" are the same size (32 bits) in struct's standard-size mode, and in native mode on platforms where a C long is 32 bits, and (2) the lower half of the unsigned range has the same representation as the non-negative signed numbers. For example, in 8-bit numbers, the values 0 to 127 inclusive have the same bit pattern whether signed or unsigned.
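
A quick way to see both effects, using the TotalTrades bytes from the question and, as a made-up contrast case, four 0xFF bytes where the high bit is set. The `<` prefix forces standard sizes so the result does not depend on the platform.

```python
import struct

raw = b'J\x03\x00\x00'    # TotalTrades bytes from the question

# All four codes are 4 bytes wide in standard-size mode...
sizes = {code: struct.calcsize('<' + code) for code in 'iIlL'}

# ...and a small non-negative value decodes identically with each of them.
values = {code: struct.unpack('<' + code, raw)[0] for code in 'iIlL'}

# The interpretations only diverge once the high bit is set.
high_bit = b'\xff\xff\xff\xff'
as_signed = struct.unpack('<i', high_bit)[0]
as_unsigned = struct.unpack('<I', high_bit)[0]

print(sizes, values, as_signed, as_unsigned)
```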

John Machin
Thank you for the explanation about signed/unsigned integers. I didn't really know, but now I'm fairly certain that I should be using the unsigned ones, since total trades should never be negative.
Arthur Dent