views: 85
answers: 4

I have a simple text file containing two columns, both integers:

1 5
1 12
2 5
2 341
2 12

and so on..

I need to group the dataset by the second value, so that the output will be:

5 1 2
12 1 2
341 2

Now the problem is that the file is very big, around 34 GB in size. I tried writing a Python script to group the rows into a dictionary, with each value mapping to an array of integers, but it still takes way too long. (I guess a lot of the time goes into allocating the array('i') objects and extending them on append.)

I am now planning to write a Pig script, which I plan to run on a pseudo-distributed Hadoop machine (an Amazon EC2 high-memory instance).

data = LOAD 'Net.txt';
gdata = GROUP data BY $1; -- I know it will lead to (5, {(1,5), (2,5)}) but that's okay for this snippet
STORE gdata INTO 'res.txt';

I wanted to know if there was any simpler way of doing this.

Update: keeping such a big file in memory is out of the question. For the Python solution, my plan was to conduct 4 runs: in the first run, only second-column values from 1 to 10 million are considered; in the next run, 10 million to 20 million are considered; and so on. But this turned out to be really slow.

The Pig/Hadoop solution is interesting because it keeps everything on disk (well, most of it).

For better understanding: this dataset contains information about the connectivity of ~45 million Twitter users, and the format means that the user ID given by the second number is following the first one.

The solution I used:

import array

class AdjDict(dict):
    """
    A special dictionary class to hold an adjacency list.
    """
    def __missing__(self, key):
        """
        When a key is not found, initialize it to an empty integer array.
        """
        self[key] = array.array('i')
        return self[key]

Adj = AdjDict()

for line in open("net.txt"):
    entry = line.strip().split('\t')
    node = int(entry[1])
    follower = int(entry[0])
    if node < 10 ** 6:
        Adj[node].append(follower)

# Code for writing Adj to the file:
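(The write-out step itself was omitted from the post. A minimal sketch of what it might look like, with adj_part.txt as a made-up output file name:)

# Hypothetical write-out: one line per node, followed by all of its followers.
with open("adj_part.txt", "w") as out:
    for node in sorted(Adj):
        out.write("%d %s\n" % (node, " ".join(str(f) for f in Adj[node])))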
+2  A: 

Assuming you have ~17 characters per line (a number I picked randomly to make the math easier), you have about 2 billion records in this file. Unless you are running with much physical memory on a 64-bit system, you will thrash your pagefile to death trying to hold all this in memory in a single dict. And that's just to read it in as a data structure - one presumes that after this structure is built, you plan to actually do something with it.

With such a simple data format, I should think you'd be better off doing something in C instead of Python. Cracking this data shouldn't be difficult, and you'll have much less per-value overhead. At a minimum, just holding 2 billion 4-byte integers would take 8 GB (unless you can make some simplifying assumptions about the possible range of the values you currently list as 1 and 2 - if they will fit within a byte or a short, then you can use smaller int variables, which will be worth the trouble for a data set of this size).

Paul McGuire
+1 for pointing to C. For opening such large files, the `_FILE_OFFSET_BITS` macro might be interesting.
phimuemue
+1  A: 

If I had to solve this on my current hardware, I'd probably write a few small programs:

The first would work on 500-megabyte chunks of the file, swapping columns and writing the result to new files. (You'll get 70 or more.) (This won't take much memory.)

Then I'd call the OS-supplied sort(1) on each small file. (This might take a few gigs of memory.)

Then I'd write a merge-sort program that would merge together the lines from all 70-odd sub-files. (This won't take much memory.)

Then I'd write a program that would run through the large sorted list; you'll have a bunch of lines like:

5 1
5 2
12 1
12 2

and you'll need to return:

5 1 2
12 1 2

(This won't take much memory.)

By breaking it into smaller chunks, hopefully you can keep the RSS down to something that would fit a reasonable machine -- it will take more disk I/O, but on anything but astonishing hardware, swap use would kill attempts to handle this in one big program.
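Not part of the answer itself, but a minimal sketch of that final collapse step, assuming the merged file already has the columns swapped to "node follower", is whitespace-separated, and is sorted on the first column (sorted_merged.txt and grouped.txt are placeholder names):

import itertools

# Collapse a sorted "node follower" file into "node follower1 follower2 ..." lines.
with open("sorted_merged.txt") as src, open("grouped.txt", "w") as dst:
    rows = (line.split() for line in src)
    for node, group in itertools.groupby(rows, key=lambda r: r[0]):
        dst.write("%s %s\n" % (node, " ".join(r[1] for r in group)))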

sarnold
I also considered a solution based on sorting; however, the elegant way to do it seems to be via Hadoop, since it has a sort as an intermediate step.
largescaled
+1  A: 

Maybe you can make multiple passes through the file.

Process a range of keys on each pass through the file; for example, if you picked a range size of 100:

1st pass - work out all the keys from 0-99
2nd pass - work out all the keys from 100-199
3rd pass - work out all the keys from 200-299
4th pass - work out all the keys from 300-399
..and so on.

For your sample, the 1st pass would output

5 1 2
12 1 2

and the 4th pass would output

341 2

Choose the range size so that the dict you are creating fits into your RAM.

I wouldn't bother using multiprocessing to try to speed it up with multiple cores; unless you have a very fast hard drive, this should be I/O bound and you would just end up thrashing the disk.
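As a rough illustration of this multi-pass idea (the range size, the 45-million cap, and the file names are assumptions drawn from the question, not part of the answer):

import array
from collections import defaultdict

RANGE_SIZE = 10 ** 6       # assumed value: tune so one pass fits comfortably in RAM
MAX_KEY = 45 * 10 ** 6     # ~45 million user ids, per the question

for low in range(0, MAX_KEY, RANGE_SIZE):
    high = low + RANGE_SIZE
    groups = defaultdict(lambda: array.array('i'))
    with open("net.txt") as src:
        for line in src:
            cols = line.split()
            follower, node = int(cols[0]), int(cols[1])
            if low <= node < high:
                groups[node].append(follower)
    # Append this key range's groups to the output before starting the next pass.
    with open("res.txt", "a") as dst:
        for node in sorted(groups):
            dst.write("%d %s\n" % (node, " ".join(str(f) for f in groups[node])))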

gnibbler
Nice way to avoid the mess of 70+ temporary files that my solution involves -- but it works only if keys don't go up to MAX_INT. :)
sarnold
@sarnold, it will work fine with `long`. Why do you think maxint is a problem? It's pretty large on 64-bit builds anyway :) I would hope that someone processing 34 GB files is using a 64-bit platform (with lots of RAM) these days.
gnibbler
@gnibbler, going from 0 to 2^31 in increments of 100 would take 21 million passes over the file. At 100,000 it would still take about 21 thousand passes over the file. So, I hope he doesn't have to count his way up to huge numbers.
sarnold
@sarnold, the idea would be to choose the range as large as possible while not exceeding the available RAM. Can't tell from the sample data given whether a range of 100 or a range of 100,000 would be better. To process 34 GB of data with, say, 2 GB of RAM, you'd expect to need quite a few passes, though.
gnibbler
Since values in the second column range from 0 to 45 million, I had tried using steps of 10 million; maybe because the whole program was written in Python, it turned out to be slow.
largescaled
@largescaled, at steps of 10 million, are you sure you are not hitting swap? Keep an eye on your memory consumption. You should try 1 million or even 100,000.
gnibbler
@gnibbler I have 6 GB of RAM, and I could load the whole Wikipedia link dataset, which is around ~10 million nodes with nearly the same average degree as this one. I also monitored the memory usage; it was nowhere near 6 GB, and the run had already been going for nearly 2 hours.
largescaled
+1  A: 

If you are working with a 34 GB file, I'm assuming that the hard drive, in terms of both storage and access time, is not a problem. How about reading the pairs sequentially, and when you find a pair (x,y), open file "x", append " y", and close file "x"? In the end, you will have one file per Twitter user ID, each containing all the users this one is connected to. You can then concatenate all those files if you want the result in the output format you specified.
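A minimal sketch of that one-file-per-user approach, assuming whitespace-separated input; net.txt comes from the question, while the by_user directory is a made-up name:

import os

# Append each follower to a file named after the user id it follows.
if not os.path.isdir("by_user"):
    os.mkdir("by_user")

with open("net.txt") as src:
    for line in src:
        follower, node = line.split()
        with open(os.path.join("by_user", node), "a") as dst:
            dst.write(follower + "\n")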


THAT SAID HOWEVER, I really do think that: (a) for such a large data set, exact resolution is not appropriate and that (b) there is probably some better way to measure connectivity, so perhaps you'd like to tell us about your end goal.

Indeed, you have a very large graph and a lot of efficient techniques have been devised to study the shape and properties of huge graphs---most of these techniques are built to work as streaming, online algorithms.

For instance, a technique called triangle counting, coupled with probabilistic cardinality estimation algorithms, efficiently provides information on the cliques contained in your graph. For a better idea of the triangle counting aspect and how it is relevant to graphs, see for example this (randomly chosen) article.

Jérémie