tags:
views: 101
answers: 4

Hi,

Currently I am looking for a way to develop an algorithm to analyse a large dataset (about 600M records). Each record has the fields "calling party", "called party", and "call duration", and I would like to create a graph of weighted connections among phone users.

The whole dataset consists of similar records - people mostly talk to their friends and don't dial random numbers, but occasionally a person calls "random" numbers as well. For analysing the records I was thinking about the following logic:

  1. Create an array of numbers to indicate which records (row numbers) have already been scanned.
  2. Start scanning from the first line. For that line's combination of "calling party" and "called party", find the same combination throughout the dataset, sum the call durations, divide the result by the sum of all call durations, and add the numbers of the summed lines to the array created in step 1.
  3. Check the array to see whether the next record number has already been summed; if it has, skip the record, else perform step 2. (A rough sketch of this logic appears below.)
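In Python, the logic above might look roughly like this (the loader and the in-memory record list are assumptions made for the sketch):

    # Rough sketch of steps 1-3, assuming `records` is an in-memory list
    # of (calling_party, called_party, duration) tuples; load_records()
    # is a hypothetical loader.
    records = load_records()
    total = sum(d for _, _, d in records)   # sum of all call durations
    seen = set()                            # step 1: rows already summed
    weights = {}                            # (caller, callee) -> weight

    for i, (caller, callee, _) in enumerate(records):
        if i in seen:                       # step 3: skip summed rows
            continue
        pair_sum = 0.0
        for j in range(i, len(records)):    # step 2: find the same pair
            if records[j][0] == caller and records[j][1] == callee:
                pair_sum += records[j][2]
                seen.add(j)
        weights[(caller, callee)] = pair_sum / total

As written, step 2 rescans all remaining records for every new pair, which is what the answers below improve on.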

I would appreciate it if any of you could suggest improvements to the logic described above.

P.S. The edges are directed, so (calling party, called party) is not equal to (called party, calling party).

Although this fact is not programming related, I would like to emphasize that, due to the law and out of respect for user privacy, all information that could possibly reveal a user's identity was hashed before the analysis.

+1  A: 

As always with large datasets, the more information you have about the distribution of values in them, the better you can tailor an algorithm. For example, if you knew that there were only, say, 1000 different telephone numbers to consider, you could create a 1000x1000 array into which to write your statistics.
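To illustrate, a minimal sketch of that dense-array idea in Python (the 1000-number limit and the `records` iterable are assumptions):

    # Dense accumulation, viable only when the number of distinct phone
    # numbers is small (here assumed <= 1000).
    N = 1000
    index = {}                              # phone number -> array index
    call_count = [[0] * N for _ in range(N)]
    total_duration = [[0.0] * N for _ in range(N)]

    for caller, callee, duration in records:    # `records` is assumed
        i = index.setdefault(caller, len(index))
        j = index.setdefault(callee, len(index))
        call_count[i][j] += 1
        total_duration[i][j] += duration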

Your first step should be to analyse the distribution(s) of data in your dataset.

In the absence of any further information about your data I'm inclined to suggest that you create a hash table. Read each record in your 600M dataset and calculate a hash address from the concatenation of calling and called numbers. Into the table at that address write the calling and called numbers (you'll need them later, and bear in mind that the hash is probably irreversible), add 1 to the number of calls and add the duration to the total duration. Repeat 600M times.

Now you have a hash table which contains the data you want.
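A minimal sketch of that single pass in Python, where a dict plays the role of the hash table and stores the pair alongside its statistics (`read_records()` is an assumed generator yielding (caller, callee, duration) tuples):

    from collections import defaultdict

    # pair -> [number of calls, total duration]; the dict keeps the
    # original (caller, callee) key, so it stays recoverable even though
    # hashing itself is irreversible.
    stats = defaultdict(lambda: [0, 0.0])

    for caller, callee, duration in read_records():   # assumed generator
        entry = stats[(caller, callee)]
        entry[0] += 1                 # add 1 to the number of calls
        entry[1] += duration          # add the duration to the total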

High Performance Mark
A: 

Since there are 600M records, the dataset seems large enough to leverage a database (and not too large to require a distributed database). So you could simply load it into a DB (MySQL, SQL Server, Oracle, etc.) and run the following query:

    select calling_party, called_party,
           sum(call_duration), avg(call_duration),
           min(call_duration), max(call_duration),
           count(*)
    from call_log
    group by calling_party, called_party
    order by 7 desc

That would be a start.

Next, you would want to run some association analysis (possibly using Weka), or perhaps you would want to analyze this information as cubes (possibly using Mondrian/OLAP). If you tell us more, we can help you more.

Algorithmically, what the DB is doing internally is similar to what you would do yourself programmatically:

  1. Scan each record
  2. Find the record for each (calling_party, called_party) combination, and update its stats.

A good way to store and find records for (calling_party, called_party) would be to use a hash function and find the matching record in the corresponding bucket.
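If you were to implement the bucketed hash table yourself rather than rely on a library one, a sketch might look like this (the bucket count and the use of Python's built-in hash() are assumptions):

    # Hash table with chaining: each bucket is a small list of
    # [caller, callee, call_count, total_duration] entries.
    NUM_BUCKETS = 1 << 20
    buckets = [[] for _ in range(NUM_BUCKETS)]

    def update(caller, callee, duration):
        bucket = buckets[hash((caller, callee)) % NUM_BUCKETS]
        for entry in bucket:                 # find the matching record
            if entry[0] == caller and entry[1] == callee:
                entry[2] += 1
                entry[3] += duration
                return
        bucket.append([caller, callee, 1, duration])   # new pair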

Although it may be tempting to create a two-dimensional array for (calling_party, called_party), that would be a very sparse array (very wasteful).

Amrinder Arora
A: 
  1. How often will you need to perform this analysis? If this is a large, unique dataset that will be analysed only once or twice, don't worry too much about the performance; just get it done, e.g. as Amrinder Arora says, by using simple, existing tooling you happen to know.
  2. You really want more information about the distribution, as High Performance Mark says. For starters, it'd be nice to know the count of unique phone numbers, the count of unique phone number pairs, and the mean, variance, and maximum of the count of calling/called phone numbers per unique phone number.
  3. You really want more information about the analysis you want to perform on the result. For instance, are you more interested in holistic statistics or in identifying individual clusters? Do you care more about following the links forward (determining who X frequently called) or following the links backward (determining who X was frequently called by)? Do you want to project overviews of this graph into low-dimensional spaces, e.g. 2D? Should it be easy to identify indirect links - e.g. X is near {A, B, C}, all of whom are near Y, so X is sorta near Y?

If you want fast and frequently adapted results, then be aware that a dense representation with good memory and temporal locality can easily make a huge difference in performance. In particular, it can easily outweigh a factor of ln N in big-O notation; you may benefit from a dense, sorted representation over a hashtable. And databases? Those are really slow. Don't touch them if you can avoid it at all; they are likely to be a factor of 10000 slower, or more, the more complex the queries you want to perform on the result.
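As an illustration of the dense, sorted alternative, assuming the pair statistics have already been aggregated into a dict `stats` (an assumption of this sketch; note that in pure Python the locality benefit is largely lost, so the sketch only shows the structure):

    import bisect

    # Two parallel, sorted lists looked up by binary search instead of
    # hashing; in a compact language this layout is cache-friendly.
    pairs = sorted(stats)                    # sorted (caller, callee) keys
    totals = [stats[p] for p in pairs]       # parallel list of statistics

    def lookup(caller, callee):
        i = bisect.bisect_left(pairs, (caller, callee))
        if i < len(pairs) and pairs[i] == (caller, callee):
            return totals[i]
        return None                          # pair never observed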

Eamon Nerbonne
A: 

Just sort the records by "calling party" and then by "called party". That way, each unique pair will have all its occurrences in consecutive positions. Hence, you can calculate the weight of each (calling party, called party) pair in one pass with little extra memory.

For sorting, you can sort small chunks separately and then do an N-way merge sort. That's memory efficient and can be easily parallelized.
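A sketch of that pipeline in Python, assuming the input is a text file with one "caller,callee,duration" line per record (the file layout and chunk size are assumptions):

    import heapq
    import itertools
    import tempfile

    CHUNK = 1_000_000   # lines sorted in memory at a time (illustrative)

    def sorted_chunks(path):
        """Sort the file in chunks; return the sorted temporary files."""
        chunks = []
        with open(path) as src:
            while True:
                lines = list(itertools.islice(src, CHUNK))
                if not lines:
                    break
                lines.sort()          # sorts by calling, then called party
                tmp = tempfile.TemporaryFile("w+")
                tmp.writelines(lines)
                tmp.seek(0)
                chunks.append(tmp)
        return chunks

    def pair_durations(path):
        """N-way merge the chunks, then total durations in one pass."""
        merged = heapq.merge(*sorted_chunks(path))
        keyfunc = lambda line: line.rsplit(",", 1)[0]   # "caller,callee"
        for pair, group in itertools.groupby(merged, key=keyfunc):
            yield pair, sum(float(g.rsplit(",", 1)[1]) for g in group)

Dividing each pair's total by the grand total of all durations then gives the weights the question asks for.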

Sheldon L. Cooper