views: 85
answers: 4

I have a simple text file containing two columns, both integers:

1 5
1 12
2 5
2 341
2 12

and so on..

I need to group the dataset by the second value, so that the output will be:

5 1 2
12 1 2
341 2

Now the problem is that the file is very big, around 34 GB in size. I tried writing a Python script to group the rows into a dictionary, with each value mapping to an array of integers, but it still takes way too long. (I guess a lot of the time goes into allocating the array('i') objects and extending them on append.)

I am now planning to write a Pig script, which I plan to run on a pseudo-distributed Hadoop machine (an Amazon EC2 high-memory instance).

data = LOAD 'Net.txt';
gdata = GROUP data BY $1; -- I know it will lead to (5, {(1,5), (2,5)}) but that's okay for this snippet
STORE gdata INTO 'res.txt';

I wanted to know if there was any simpler way of doing this.

Update: keeping such a big file in memory is out of the question. For the Python solution, my plan was to conduct 4 runs: in the first run, only second-column values from 1 to 10 million are considered; in the next run, 10 million to 20 million are considered; and so on. But this turned out to be really slow.

The Pig/Hadoop solution is interesting because it keeps everything on disk (well, most of it).

For better understanding: this dataset contains information about the connectivity of ~45 million Twitter users, and the format means that the user ID given by the second number is following the first one.

The solution I used:

import array

class AdjDict(dict):
    """
    A special dictionary class to hold an adjacency list.
    """
    def __missing__(self, key):
        """
        When a key is not found, initialize it to an empty integer array.
        """
        self[key] = array.array('i')
        return self[key]

Adj = AdjDict()

for line in open("net.txt"):
    entry = line.strip().split('\t')
    node = int(entry[1])
    follower = int(entry[0])
    if node < 10 ** 6:
        Adj[node].append(follower)

# Code for writing Adj to the file:
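(The write-out step itself was omitted from the post. A minimal sketch of what it might look like, with adj_part.txt as a made-up output file name:)

# Hypothetical write-out: one line per node, followed by all of its followers.
with open("adj_part.txt", "w") as out:
    for node in sorted(Adj):
        out.write("%d %s\n" % (node, " ".join(str(f) for f in Adj[node])))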
+2  A: 

Assuming you have ~17 characters per line (a number I picked randomly to make the math easier), you have about 2 billion records in this file. Unless you are running with much physical memory on a 64-bit system, you will thrash your pagefile to death trying to hold all this in memory in a single dict. And that's just to read it in as a data structure - one presumes that after this structure is built, you plan to actually do something with it.

With such a simple data format, I should think you'd be better off doing something in C instead of Python. Cracking this data shouldn't be difficult, and you'll have much less per-value overhead. At a minimum, just holding 2 billion 4-byte integers would take 8 GB (unless you can make some simplifying assumptions about the possible range of the values you currently list as 1 and 2 - if they will fit within a byte or a short, then you can use smaller int variables, which will be worth the trouble for a data set of this size).

Paul McGuire
+1 for pointing to C. For opening such large files, the `_FILE_OFFSET_BITS` macro might be interesting.
phimuemue
+1  A: 

If I had to solve this on my current hardware, I'd probably write a few small programs:

The first would work on 500-megabyte chunks of the file, swapping columns and writing the result to new files. (You'll get 70 or more.) (This won't take much memory.)

Then I'd call the OS-supplied sort(1) on each small file. (This might take a few gigs of memory.)

Then I'd write a merge-sort program that would merge together the lines from all 70-odd sub-files. (This won't take much memory.)

Then I'd write a program that would run through the large sorted list; you'll have a bunch of lines like:

5 1
5 2
12 1
12 2

and you'll need to return:

5 1 2
12 1 2

(This won't take much memory.)

By breaking it into smaller chunks, hopefully you can keep the RSS down to something that would fit a reasonable machine -- it will take more disk I/O, but on anything but astonishing hardware, swap use would kill attempts to handle this in one big program.
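Not part of the answer itself, but a minimal sketch of that final collapse step, assuming the merged file already has the columns swapped to "node follower", is whitespace-separated, and is sorted on the first column (sorted_merged.txt and grouped.txt are placeholder names):

import itertools

# Collapse a sorted "node follower" file into "node follower1 follower2 ..." lines.
with open("sorted_merged.txt") as src, open("grouped.txt", "w") as dst:
    rows = (line.split() for line in src)
    for node, group in itertools.groupby(rows, key=lambda r: r[0]):
        dst.write("%s %s\n" % (node, " ".join(r[1] for r in group)))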

sarnold
I also considered a solution based on sorting; however, the elegant way to do it seems to be via Hadoop, since it has a sort as an intermediate step.
largescaled
+1  A: 

Maybe you can make multiple passes through the file.

Process a range of keys on each pass through the file; for example, if you picked a range size of 100:

1st pass - work out all the keys from 0-99
2nd pass - work out all the keys from 100-199
3rd pass - work out all the keys from 200-299
4th pass - work out all the keys from 300-399
..and so on.

For your sample, the 1st pass would output

5 1 2
12 1 2

and the 4th pass would output

341 2

Choose the range size so that the dict you are creating fits into your RAM.

I wouldn't bother using multiprocessing to try to speed it up with multiple cores; unless you have a very fast hard drive, this should be I/O bound and you would just end up thrashing the disk.
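As a rough illustration of this multi-pass idea (the range size, the 45-million cap, and the file names are assumptions drawn from the question, not part of the answer):

import array
from collections import defaultdict

RANGE_SIZE = 10 ** 6       # assumed value: tune so one pass fits comfortably in RAM
MAX_KEY = 45 * 10 ** 6     # ~45 million user ids, per the question

for low in range(0, MAX_KEY, RANGE_SIZE):
    high = low + RANGE_SIZE
    groups = defaultdict(lambda: array.array('i'))
    with open("net.txt") as src:
        for line in src:
            cols = line.split()
            follower, node = int(cols[0]), int(cols[1])
            if low <= node < high:
                groups[node].append(follower)
    # Append this key range's groups to the output before starting the next pass.
    with open("res.txt", "a") as dst:
        for node in sorted(groups):
            dst.write("%d %s\n" % (node, " ".join(str(f) for f in groups[node])))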

gnibbler
Nice way to avoid the mess of 70+ temporary files that my solution involves -- but it works only if keys don't go up to MAX_INT. :)
sarnold
@sarnold, it will work fine with `long`. Why do you think maxint is a problem? It's pretty large on 64-bit builds anyway :) I would hope that someone processing 34 GB files is using a 64-bit platform (with lots of RAM) these days.
gnibbler
@gnibbler, going from 0 to 2^31 in increments of 100 would take 21 million passes over the file. At 100,000 it would still take about 21 thousand passes over the file. So, I hope he doesn't have to count his way up to huge numbers.
sarnold
@sarnold, the idea would be to choose the range as large as possible while not exceeding the available RAM. Can't tell from the sample data given whether a range of 100 or a range of 100,000 would be better. To process 34 GB of data with, say, 2 GB of RAM, you'd expect to need quite a few passes, though.
gnibbler
Since values in the second column range from 0 to 45 million, I had tried using steps of 10 million; maybe because the whole program was written in Python, it turned out to be slow.
largescaled
@largescaled, at steps of 10 million, are you sure you are not hitting swap? Keep an eye on your memory consumption. You should try 1 million or even 100,000.
gnibbler
@gnibbler I have 6 GB of RAM, and I could load the whole Wikipedia link dataset, which is around ~10 million nodes with nearly the same average degree as this one. I also monitored the memory usage; it was nowhere near 6 GB, and the run had already been going for nearly 2 hours.
largescaled
+1  A: 

If you are working with a 34 GB file, I'm assuming that the hard drive, in terms of both storage and access time, is not a problem. How about reading the pairs sequentially, and when you find a pair (x,y), open file "x", append " y", and close file "x"? In the end, you will have one file per Twitter user ID, each containing all the users this one is connected to. You can then concatenate all those files if you want the result in the output format you specified.
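A minimal sketch of that one-file-per-user approach, assuming whitespace-separated input; net.txt comes from the question, while the by_user directory is a made-up name:

import os

# Append each follower to a file named after the user id it follows.
if not os.path.isdir("by_user"):
    os.mkdir("by_user")

with open("net.txt") as src:
    for line in src:
        follower, node = line.split()
        with open(os.path.join("by_user", node), "a") as dst:
            dst.write(follower + "\n")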


THAT SAID HOWEVER, I really do think that: (a) for such a large data set, exact resolution is not appropriate and that (b) there is probably some better way to measure connectivity, so perhaps you'd like to tell us about your end goal.

Indeed, you have a very large graph and a lot of efficient techniques have been devised to study the shape and properties of huge graphs---most of these techniques are built to work as streaming, online algorithms.

For instance, a technique called triangle counting, coupled with probabilistic cardinality estimation algorithms, efficiently provides information on the cliques contained in your graph. For a better idea of the triangle counting aspect and how it is relevant to graphs, see for example this (randomly chosen) article.

Jérémie