I have a task to compress some stock market data... the data is in a file where the stock value for each day is given on one line, and so on... so it's a really big file.

E.g.,
123.45
234.75
345.678
889.56
.....

Now the question is how to compress the data (i.e. reduce the redundancy) using standard algorithms like Huffman coding, arithmetic coding or LZ coding... which of these is preferable for this sort of data?

I have noticed that if I take the first value and then consider the differences between consecutive values, there is a lot of repetition in the difference values... this makes me wonder whether first taking these differences, finding their frequencies and hence probabilities, and then using Huffman coding would be a way to go.

Am I right? Can anyone give me some suggestions?
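Roughly, this is the pipeline I have in mind (just a sketch; the file name prices.txt is made up, and the prices are scaled to integer cents so the differences are exact):

import heapq
from collections import Counter

# read the prices and scale to integer cents so the differences are exact
prices = [int(round(float(line) * 100)) for line in open('prices.txt')]

base = prices[0]                                      # first value, stored verbatim
deltas = [b - a for a, b in zip(prices, prices[1:])]  # consecutive differences

freq = Counter(deltas)                                # frequency of each difference value

# classic heap-based Huffman build: repeatedly merge the two lightest
# nodes and prefix '0'/'1' to the codes of everything underneath them
heap = [[weight, [sym, '']] for sym, weight in freq.items()]
heapq.heapify(heap)
while len(heap) > 1:
    lo = heapq.heappop(heap)
    hi = heapq.heappop(heap)
    for pair in lo[1:]:
        pair[1] = '0' + pair[1]
    for pair in hi[1:]:
        pair[1] = '1' + pair[1]
    heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])

codes = dict((sym, code) for sym, code in heapq.heappop(heap)[1:])

# the encoded stream; the base value and the code table would also have
# to be written out so that the file can be decoded again
bits = ''.join(codes[d] for d in deltas)
print('%d deltas -> %d bits' % (len(deltas), len(bits)))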

+2  A: 

Many compression tools these days use a combination of these techniques to give good ratios on a variety of data. It might be worth starting out with something fairly general and modern like bzip2, which uses Huffman coding combined with various tricks that shuffle the data around to bring out various kinds of redundancy (the linked page contains links to various implementations further down).
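If you just want a quick baseline before writing anything yourself, Python's bz2 module gives you bzip2 directly (the file name here is just a placeholder):

import bz2

# compress the raw price file at the highest compression level (9)
raw = open('prices.txt', 'rb').read()
packed = bz2.compress(raw, 9)
print('%d bytes -> %d bytes' % (len(raw), len(packed)))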

SimonJ
A: 

Run-length encoding might be suitable? Check it out here. To give an extremely simple example of how it works, here's a line of data in ASCII... 30 bytes long:

HHHHHHHHHEEEEEEELLLLLLLLOOOOOO

Apply RLE to it and you get this in 8 bytes:

9H7E8L6O
  • Nine H's
  • Seven E's
  • Eight L's
  • Six O's

The compressed line is about 27% of the original size (8/30 for the example line), i.e. a reduction of roughly 73%.
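A minimal sketch of such an encoder in Python (just to illustrate; the input string is the example above):

from itertools import groupby

def rle(s):
    # collapse each run of identical characters into "<count><char>"
    return ''.join('%d%s' % (len(list(run)), ch) for ch, run in groupby(s))

print(rle('HHHHHHHHHEEEEEEELLLLLLLLOOOOOO'))   # -> 9H7E8L6O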

What do you think?

Hope this helps, Best regards, Tom.

tommieb75
A: 

Calculate the differences between consecutive values, and then use run-length encoding (RLE).

You also need to convert the data to integers before calculating the differences.
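A rough sketch of that idea (the values are made up and already scaled to integers):

from itertools import groupby

# hypothetical prices, already scaled to integers (e.g. cents)
values = [12345, 12350, 12355, 12360, 12360, 12365]
deltas = [b - a for a, b in zip(values, values[1:])]   # [5, 5, 5, 0, 5]

# run-length encode the difference stream as (difference, run length) pairs
encoded = [(d, len(list(run))) for d, run in groupby(deltas)]
print(encoded)   # [(5, 3), (0, 1), (5, 1)]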

pierr
+2  A: 

I think your problem is more complex than merely subtracting the stock prices. You also need to store the date (unless you have a consistent time span that can be inferred from the file name).

The amount of data is not very large, though. Even if you had data for every second of every day for every year over the last 30 years for 300 stocks, you could still manage to store all of that on a higher-end home computer (say, a Mac Pro), as it amounts to about 5 TB uncompressed.

I wrote a quick and dirty script which fetches the IBM stock history from Yahoo for every day and stores it "normally" (only the adjusted close) as well as using the "difference method" you mention, then compresses both with gzip. You do obtain savings: 16K vs. 10K. The problem is that I did not store the date, and I don't know which value corresponds to which date; you would have to include this, of course.

Good luck.

import urllib as ul

# root URL
url = 'http://ichart.finance.yahoo.com/table.csv?%s'

# dictionary of options appended to URL (encoded)
opt = ul.urlencode({
    's':'IBM',       # Stock symbol or ticker; IBM
    'a':'00',        # Month January; index starts at zero
    'b':'2',         # Day 2
    'c':'1978',      # Year 2009
    'd':'10',        # Month November; index starts at zero
    'e':'30',        # Day 30
    'f':'2009',      # Year 2009
    'g':'d',         # Get daily prices
    'ignore':'.csv', # CSV format
    })

# get the data
data = ul.urlopen(url % opt)

# get only the "Adjusted Close" (last column of every row; the 7th)

close = []

for entry in data:
    close.append(entry.strip().split(',')[6])

# get rid of the first element (it is only the string 'Adj Close') 
close.pop(0)

# write to file
f1 = open('raw.dat','w')
for element in close:
    f1.write(element+'\n')
f1.close()

# simple function to convert string to scaled number
def scale(x):
    return int(float(x)*100)

# apply the previously defined function to the list
close = map(scale,close)

# it is important to store the first element (it is the base scale)
base = close[0]

# replace each value with the difference from the previous value
close = [ close[k+1] - close[k] for k in range(len(close)-1)]

# introduce the base to the data
close.insert(0,base)



# define a simple function to convert the list to a single string
def l2str(list):
    out = ''
    for item in list:
        if item>=0:
            out += '+'+str(item)
        else:
            out += str(item)
    return out

# convert the list to a string
close = l2str(close)

f2 = open('comp.dat','w')
f2.write(close)
f2.close()

Now compare the "raw data" (raw.dat) versus the "compressed format" you propose (comp.dat)

:sandbox jarrieta$ ls -lh
total 152
-rw-r--r--  1 jarrieta  staff    23K Nov 30 09:28 comp.dat
-rw-r--r--  1 jarrieta  staff    47K Nov 30 09:28 raw.dat
-rw-r--r--  1 jarrieta  staff   1.7K Nov 30 09:13 stock.py
:sandbox jarrieta$ gzip --best *.dat
:sandbox jarrieta$ ls -lh
total 64
-rw-r--r--  1 jarrieta  staff    10K Nov 30 09:28 comp.dat.gz
-rw-r--r--  1 jarrieta  staff    16K Nov 30 09:28 raw.dat.gz
-rw-r--r--  1 jarrieta  staff   1.7K Nov 30 09:13 stock.py
Arrieta
Notice that my comment in the options dictionary says "Year 2009" for the starting date, when in fact it is 1978... so I am fetching 30 years of daily stock data for IBM.
Arrieta
Hey, thanks a lot, this is really helpful... well, for now the date is actually not important... my main focus is just to compress the stock index data as much as possible... and I have to develop (code) the compression myself, I cannot use prebuilt software... I will only be using one algorithm... so my problem will be to get the probability distribution of these 'difference' values for Huffman coding, say, for now, and then to decode (decompress) it as well.
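For the decoding side, this is roughly what I am aiming for (the code table, base price and bit string below are made-up placeholders standing in for whatever the encoder produced):

codes = {0: '0', 5: '10', -5: '11'}   # made-up Huffman codes for three difference values
base = 12345                          # first price (in cents), stored verbatim
bits = '1010011'                      # made-up encoded difference stream

# invert the code table; the prefix property guarantees a unique match
decode = dict((code, sym) for sym, code in codes.items())

deltas, buf = [], ''
for bit in bits:
    buf += bit
    if buf in decode:
        deltas.append(decode[buf])
        buf = ''

# undo the differencing with a running sum starting from the base value
prices = [base]
for d in deltas:
    prices.append(prices[-1] + d)
print(prices)   # [12345, 12350, 12355, 12355, 12350]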
sfactor