I have a simple text file containing two columns, both integers:
1 5
1 12
2 5
2 341
2 12
and so on.
I need to group the dataset by the second value, such that the output will be:
5 1 2
12 1 2
341 2
Now the problem is that the file is very big, around 34 GB in size. I tried writing a Python script to group the rows into a dictionary with each value being an array of integers, but it still takes way too long (I guess a large part of the time goes into allocating the array('i') objects and extending them on each append).
I am now planning to write a Pig script to run on a pseudo-distributed Hadoop machine (an Amazon EC2 High-Memory Large instance):
data = LOAD 'Net.txt';
gdata = GROUP data BY $1; -- I know it will lead to (5, {(1,5),(2,5)}) but that's okay for this snippet
STORE gdata INTO 'res.txt';
I wanted to know if there was any simpler way of doing this.
Update: keeping such a big file in memory is out of the question. For the Python solution, what I planned was to make 4 passes: in the first pass only second-column values from 1 to 10 million are considered, in the next pass 10 million to 20 million are considered, and so on. But this turned out to be really slow.
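For reference, a minimal sketch of that multi-pass idea (the pass boundaries, the whitespace-delimited format, and the filenames here are just illustrative, based on the description above):

import array

def one_pass(path, lo, hi):
    """Collect followers only for node ids in [lo, hi) on this pass."""
    adj = {}
    for line in open(path):
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank/malformed lines
        follower, node = int(parts[0]), int(parts[1])
        if lo <= node < hi:
            adj.setdefault(node, array.array('i')).append(follower)
    return adj

# Four passes over the file, each covering a 10-million-id slice of the second column.
for lo in range(0, 4 * 10 ** 7, 10 ** 7):
    adj = one_pass('net.txt', lo, lo + 10 ** 7)
    # ... write adj out to disk here, then discard it before the next pass

Every pass still scans the full 34 GB, which is presumably a big part of why this was slow.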
The Pig/Hadoop solution is interesting because it keeps everything on disk [well, most of it].
For better understanding: this dataset contains information about the connectivity of ~45 million Twitter users, and the file format means that the user id given by the second number is following the first one.
The solution which I ended up using:
import array

class AdjDict(dict):
    """
    A special dictionary class to hold an adjacency list.
    """
    def __missing__(self, key):
        """
        __missing__ is overridden so that when a key is not found,
        an empty integer array is initialized for it.
        """
        self[key] = array.array('i')
        return self[key]

Adj = AdjDict()

for line in open("net.txt"):
    entry = line.strip().split('\t')
    node = int(entry[1])
    follower = int(entry[0])
    if node < 10 ** 6:  # only keep nodes with id below 10 ** 6 on this pass
        Adj[node].append(follower)

# Code for writing the Adj matrix to the file:
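(The write-out itself is straightforward; roughly something like the following, where the output filename and the "node follower1 follower2 ..." line format are just an example matching the desired output at the top:)

# One line per node: the node id followed by all of its followers.
with open("adj_part_0.txt", "w") as out:
    for node in sorted(Adj):
        out.write("%d %s\n" % (node, " ".join(str(f) for f in Adj[node])))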