ansaurus

Question

Script to convert Huge Three column table into table

Answer 1

+1 A:

When all you have is a hammer . . . . .

Conceptually, what you are trying to do is simple but because of the size of your data, it is computationally difficult. I tend to use R for it's analytic and graphics capacity, not it's data wrangling skills. When I need to move around a bunch of data, I usually just stick everything into a database.

Lately I have had quite a bit of success with SQLite and R. The best part is that you can actually use R to read in your data, which makes it easy to import large SPSS files or other sources of data that SQLite can't really handle but R can.

http://cran.r-project.org/web/packages/RSQLite/index.html

Here's my recommended work flow.

Import your data into R. (Done)
Library(RSQLite)
Move your data frame to SQLite.
Create Indexes on columns A and B.
Create a view that builds your table.
Query your view from R and coerce the returns into a table.

Choens 2010-10-28 16:21:13

Another nice aspect of dumping all your data into SQLite - Once your data is in SQLite you can easily use Python or Perl to play with your data. You can even use other tools like SOFA Statistics. Connectivity with SQLite is very widespread. In contrast, data stored in R's native data structures are not easily accessible to other programming languages or tools.

Choens 2010-10-28 20:02:51

Answer 2

A:

If you could use table(x,y,z) in R, then how about trying out the R out of memory packages that handle such huge data sets? Use the read.big.matrix function in the package bigmemory to read in the data set and the bigtable function in the package bigtabulate to create the table.

See vignettes.

eyjo 2010-10-28 16:23:45

Answer 3

+1 A:

In R I can do this:

N <- 1000000
x <- sample(1:400,N,TRUE)
y <- sample(1:400,N,TRUE)
z <- sample(1:400,N,TRUE)

w <- table(x,y,z)

And memory peak is lower then 800MB.

So what limitations you have?

EDIT. This peace of R-code:

N <- 1000000
mydata <- data.frame(
    A=sample(runif(400),N,TRUE),
    B=sample(runif(400),N,TRUE),
    C=runif(N)
)

require(reshape)
results <- cast(mydata, A~B, value="C")
write.table(as.matrix(results),na="",sep="\t",file="results.txt")

create what you want with less then 300MB of RAM.

On my data it gives warning cause there are non-unique A-B combinations but for yours should be ok.

Marek 2010-10-28 16:24:47

Those are all integer vectors. Does it still work if you convert to numeric?

Joshua Ulrich 2010-10-28 16:46:24

@Joshua I replace `1:400` with `rnorm(400)` and it still works. It use more RAM ~1.1GB.

Marek 2010-10-28 19:57:43

This was tested with a *partial* set of my data: summary(mydata)> names(mydata)[1] "A" "B" "C"> nrow(mydata)[1] 180227newtab<-table(mydata$A, mydata$B, mydata$C)Error: cannot allocate vector of size 1.1 Gb>

xtracto 2010-10-29 09:48:35

@xtracto As I see your edit you don't want `table(x,y,z)`.

Marek 2010-10-29 10:51:16

Answer 4

A:

Your example of desired output doesn't look like a 3-way contingency table to me. That would be a mapping from (key1, key2, key3) to a count of occurences. Your example looks like a mapping from (key1, key2) to some number. You don't say what to do when (key1, key2) is duplicated: average, total, something else?

Assuming that you want a total, here's one memory-saving approach in Python, using nested defaultdicts:

>>> from collections import defaultdict as dd
>>> d = dd(lambda: dd(float))
>>> d[3277][4733] += 54.1
>>> d
defaultdict(<function <lambda> at 0x00D61DF0>, {3277: defaultdict(<type 'float'>, {4733: 54.1})})
>>> d[3278][4741] += 51.0
>>> d
defaultdict(<function <lambda> at 0x00D61DF0>, {3277: defaultdict(<type 'float'>, {4733: 54.1}), 3278: defaultdict(<type 'float'>, {4741: 51.0})})
>>>

and another approach using a single defaultdict with a composite key:

>>> d2 = dd(float)
>>> d2[3277,4733] += 54.1
>>> d2
defaultdict(<type 'float'>, {(3277, 4733): 54.1})
>>> d2[3278,4741] += 51.0
>>> d2
defaultdict(<type 'float'>, {(3277, 4733): 54.1, (3278, 4741): 51.0})
>>>

It might help if you were to say what you want to do with this data after you've got it grouped together ...

If you want (for example) an average, you have two options: (1) two data structures, one for total, one for count, then do "average = total - count" (2) sort your data on the first 2 columns, user itertools.groupby to collect your duplicates together, do your calculation, and add the results into your "average" data structure. Which of these approaches would use less memory is hard to tell; Python being Python you could try both rather quickly.

John Machin 2010-10-28 20:09:07

Thanks for your reply... Indeed what I am looking for is not a contingency table but a mapping of C to A and B in row,column. Thanks for the correction. Hence, I do not need to add up any value but only *position* the values in the correct row/column combination. There are also no repeated keys. Ideally I am looking for a simple way to transform an input file with A,B,C to an output file with the table.

xtracto 2010-10-29 09:40:45

Answer 5

A:

A small subclasse of dict can provide you a confortable object to work with the table. 500.000 items should not be a problem on a desktop PC - if you happen to have 500.000.000 items, a similar class could map from the keys to positions in the file itself (that would be way more cool to implement :-) )

import csv
class ContingencyTable(dict):
    def __init__(self):
        self.a_keys=set()
        self.b_keys=set()
        dict.__init__(self)
    def __setitem__(self, key,value):
        self.a_keys.add(key[0])
        self.b_keys.add(key[1])
        dict.__setitem__(self, key, value)
    def feed(self, file):
        reader = csv.reader(file)
        reader.next()
        for a, b, c in reader:
            self[int(a),int(b)] = float(c)

table = ContingencyTable()
table.feed(open("yourfile.csv"))

jsbueno 2010-10-28 20:16:17

Answer 6

A:

Whole new story deserves a whole new answer.

Don't need defaultdict, don't even want defaultdict, because using it carelessly would suck memory like the Death Star's tractor beam.

This code is untested, may not even compile; I may have swapped rows and columns somewhere; fixes/explanations later ... must rush ...

d = {}
col_label_set = set()
row_label_set = set()
input =  open("input.txt")
output = open("output.txt","w")
for line in input:
    line = line.strip()
    splat = line.split(',')
    if len(splat) != 3:
        break # error message???
    k1, k2, v = splat
    try:
        subdict = d[k1]
    except KeyError:
        subdict = {}
        d[k1] = subdict
    subdict[k2] = v
    row_label_set.add(k1)
    col_label_set.add(k2)
col_labels = sorted(col_label_set)
row_labels = sorted(row_label_set
output.write("\t")
for v in col_labels::
    output.write(v + "\t")
output.write("\n")
for r in row_labels:
    output.write(r + "\t")
    for c in col_labels:
        output.write(d[r].get(c, "") + "\t")
    output.write("\n")

Update Here's a fixed and refactored version, tested to the extent shown:

class SparseTable(object):

    def __init__(self, iterable):
        d = {}
        col_label_set = set()
        for row_label, col_label, value in iterable:
            try:
                subdict = d[row_label]
            except KeyError:
                subdict = {}
                d[row_label] = subdict
            subdict[col_label] = value
            col_label_set.add(col_label)
        self.d = d
        self.col_label_set = col_label_set

    def tabulate(self, row_writer, corner_label=u"", missing=u""):
        d = self.d
        col_labels = sorted(self.col_label_set)
        row_labels = sorted(d.iterkeys())
        orow = [corner_label] + col_labels
        row_writer(orow)
        for row_label in row_labels:
            orow = [row_label]
            subdict = d[row_label]
            for col_label in col_labels:
                orow.append(subdict.get(col_label, missing))
            row_writer(orow)

if __name__ == "__main__":

    import sys

    test_data = u"""
    3277,4733,54.1
    3278,4741,51.0
    3278,4750,28.4
    3278,4768,36.0
    3278,4776,50.1
    3278,4784,51.4
    3279,4792,82.6
    3279,4806,78.2
    3279,4814,36.4
    """.splitlines(True)

    def my_writer(row):
        sys.stdout.write(u"\t".join(row))
        sys.stdout.write(u"\n")

    def my_reader(iterable):
        for line in iterable:
            line = line.strip()
            if not line: continue
            splat = line.split(u",")
            if len(splat) != 3:
                raise ValueError(u"expected 3 fields, found %d" % len(splat))
            yield splat

    table = SparseTable(my_reader(test_data))
    table.tabulate(my_writer, u"A/B", u"....")

Here's the output:

A/B     4733    4741    4750    4768    4776    4784    4792    4806    4814
3277    54.1    ....    ....    ....    ....    ....    ....    ....    ....
3278    ....    51.0    28.4    36.0    50.1    51.4    ....    ....    ....
3279    ....    ....    ....    ....    ....    ....    82.6    78.2    36.4

John Machin 2010-10-29 11:26:29

Yay! you are my hero today. Yep, this works like acharm (just after some minor corrections :: and a missing ) Thanks a lot!

xtracto 2010-10-29 11:38:17

@xtracto: I'm glad I could help. Please leave a comment about the minor corrections and I'll tidy the answer up in the morning (it's sleep time in this TZ) so we don't leave a buggy non-compiling answer for other people to stumble over.

John Machin 2010-10-29 11:54:33

ansaurus

tags:

views:

answers:

Script to convert Huge Three column table into table

related questions