tags:
views: 291
answers: 9

This is a technical test question. Given one million records in a log file of website hits for an online shopping website, where each record is of the form:

TimeStamp:Date,time ; IP address; ProductName

Find the distinct IP addresses and the most popular product. What is the most efficient way to do this? One solution is hashing. If the solution is hashing, please provide an explanation of how to hash this efficiently, since there are one million records.

A: 

In the real world, if this is a once-off or occasional thing, I would simply insert the data into a database and run a few basic queries.

Since this is a homework assignment though, that's probably not what the prof is looking for. An IP address is really just a 32-bit number. I could convert each IP to its 32-bit equivalent instead of creating a hash.
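
For illustration only, here is a minimal sketch of that conversion in Python (the language and the helper name ip_to_int are my own choices, not part of the answer; it assumes IPv4 addresses):

import socket
import struct

def ip_to_int(ip_string):
    # Pack the dotted-quad string into 4 bytes, then read those bytes back
    # as a single unsigned 32-bit integer in network byte order.
    return struct.unpack("!I", socket.inet_aton(ip_string))[0]

print(ip_to_int("10.0.1.1"))  # 167772417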

Since this is homework, the rest "is left as an exercise to the reader."

Eric J.
It is not a homework question. It was asked by a company in a technical test for campus placement just yesterday. I can't disclose the name of the company due to an NDA. I'm not even supposed to ask this question, but I want to know the best solution so I can perform well in the future.
nowonder
That's the same as homework. It's something that's meant to measure _you_. You want it to measure your ability to get someone else to do it for you.
John Saunders
Why the -1? The question is tagged as homework, so I gave a reasonable hint without giving away the solution.
Eric J.
+3  A: 

I recently did something similar for homework. I'm not sure of the total number of lines, but it was quite a lot. The point is that your computer can probably do this very quickly, even with a million records.

I agree with you on the hashtable, and I would handle the two questions slightly differently. For the first, I would check every IP against the hashtable; if it exists, do nothing. If it does not exist, add it to the hashtable and increment a counter. At the end of the program, the counter tells you how many unique IPs there were.

For the second, I would hash the product name and use it as the key in the hashtable, incrementing the associated value every time I find that product. At the end, loop through all the keys and values of the hashtable and find the highest value; that is the most popular product.
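
A minimal sketch of that approach in Python (the language, file name, and variable names are my own assumptions; the answer doesn't specify any):

ip_seen = {}          # hashtable of IPs already encountered
product_counts = {}   # hashtable mapping product name -> hit count
unique_ips = 0

for line in open('logfile.txt'):
    timestamp, ip, product = line.rstrip('\n').split(';')
    if ip not in ip_seen:
        # First time this IP appears: store it and bump the counter.
        ip_seen[ip] = True
        unique_ips += 1
    # Count every occurrence of the product.
    product_counts[product] = product_counts.get(product, 0) + 1

most_popular = max(product_counts, key=product_counts.get)
print("%d unique IPs; most popular product: %s" % (unique_ips, most_popular))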

Marius
+3  A: 

A million log records is really a very small number: just read them in and keep a set of the IP addresses and a dict mapping product names to number of mentions. You don't mention any specific language constraint, so I assume a language that will do (implicitly) excellent hashing of such strings on your behalf is acceptable (Perl, Python, Ruby, Java, C#, etc. all have fine facilities for the purpose).

E.g., in Python:

import collections
import heapq

ipset = set()
prodcount = collections.defaultdict(int)

numlines = 0
for line in open('logfile.txt', 'r'):
  timestamp, ip, product = line.strip().split(';')
  ipset.add(ip)
  prodcount[product] += 1
  numlines += 1

print "%d distinct IP in %d lines" % (len(ipset), numlines)
print "Top 10 products:"

top10 = heapq.nlargest(10, prodcount, key=prodcount.get)
for product in top10:
  print "%6d %s" % (prodcount[product], product)
Alex Martelli
+1  A: 

First of all, one million lines is not at all a huge file.

A simple Perl script can chew through a 2.7 million line log in 6 seconds, without having to think much about the algorithm.

In any case hashing is the way to go and, as shown, there's no need to bother with hashing over an integer representation.

If we were talking about a really huge file, then I/O would become the bottleneck and thus the hashing method gets less and less relevant as the file grows bigger.

Theoretically, in a language like C it would probably be faster to hash an integer than a string, but I doubt that would really make a difference in a language suited to this task. Things like reading the file efficiently would matter much, much more.

Code

vinko@parrot:~$ more hash.pl
use strict;
use warnings;

my %ip_hash;
my %product_hash;

open my $fh, "<", "log2.txt" or die $!;

while (<$fh>) {
        my ($timestamp, $ip, $product) = split (/;/,$_); # split the record into its three fields
        $ip_hash{$ip} = 1 if (!defined $ip_hash{$ip});
        if (!defined $product_hash{$product}) {
                $product_hash{$product} = 1
        } else {
                $product_hash{$product} = $product_hash{$product} + 1;
        }
}

for my $ip (keys %ip_hash) {
        print "$ip\n";
}

my @pkeys = sort {$product_hash{$b} <=> $product_hash{$a}} keys %product_hash;

print "Top product: $pkeys[0]\n";

Sample

vinko@parrot:~$ wc -l log2.txt
2774720 log2.txt
vinko@parrot:~$ head -1 log2.txt
1;10.0.1.1;DuctTape
vinko@parrot:~$ time perl hash.pl
11.1.3.3
11.1.3.2
10.0.2.2
10.1.2.2
11.1.2.2
10.0.2.1
11.2.3.3
10.0.1.1
Top product: DuctTape

real    0m6.295s
user    0m6.230s
sys     0m0.030s
Vinko Vrsalovic
+1  A: 

I would also read the file into a database, and link it to another table of log filenames and the date/time they were imported.

This is because in the real world you're going to need to do this regularly. The company is going to want to be able to check trends over time, so you're quickly going to be asked questions like "is that more or less unique IP addresses than last month's log file?" and "how are the most popular products changing from week to week".

In my experience the best way to answer these questions in an interview scenario like the one you describe is to show awareness of real-world situations. A tool that parses the log files (produced daily? weekly? monthly?) and reads them into a database, where queries, graphs, etc. can pull all the data out, especially across multiple log files, will take a bit longer to write but be infinitely more useful and usable.
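
As an illustration only, here is a minimal sketch of the load-it-into-a-database idea using SQLite from Python (SQLite, the table layout, and the file name are my own assumptions, not something the answer prescribes):

import sqlite3

conn = sqlite3.connect('hits.db')
conn.execute("CREATE TABLE IF NOT EXISTS hits (ts TEXT, ip TEXT, product TEXT)")

# Load one log file; in practice you would also record the filename and import date.
with open('logfile.txt') as f:
    rows = (line.rstrip('\n').split(';') for line in f)
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", rows)
conn.commit()

# Distinct IP addresses.
print(conn.execute("SELECT COUNT(DISTINCT ip) FROM hits").fetchone()[0])

# Most popular product.
print(conn.execute("SELECT product, COUNT(*) AS n FROM hits "
                   "GROUP BY product ORDER BY n DESC LIMIT 1").fetchone())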

h4xxr
Then why do we still have log files and log file analysis tools? I'm not saying you're wrong, but you don't require a database. The amount of raw data may be too much to store in a database, so you might need to write software that will pre-process the data.
Robert Paulson
Fair point. For me the clue is the request for the most popular product. That sounds very much like business/sales information being required, which I'd always want in some kind of database-backed analysis tool. Fair point about log files and analysis tools, though.
h4xxr
Surely, this depends on your database. What if your database is fast enough? Also, writing text files isn't the fastest way to log. Windows Device Drivers use Event Tracing for Windows (http://msdn.microsoft.com/en-us/library/aa468736.aspx)
John Saunders
A: 

As others have written, you just need two hashtables: one for IPs and one for products. You can count occurrences in both, but you only care about the counts for the latter, the product popularity.

The key to hashing is an efficient hash key, and hash efficiency means that the computed hash values are evenly distributed. A poor choice of key means many collisions, and the performance of the hash table will suffer.
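
To make the distribution point concrete (my own example in Python, not part of the original answer): a naive key such as the sum of the four octets sends many different IPs to the same value, while the full 32-bit representation keeps them apart.

ips = ['10.0.1.1', '10.1.0.1', '10.1.1.0', '1.10.0.1', '0.1.10.1']

def octet_sum(ip):
    # Naive key: add the four octets. Position is ignored, so all of the
    # addresses above collapse onto the same value.
    return sum(int(part) for part in ip.split('.'))

def ip_as_int(ip):
    # Full 32-bit value: each octet keeps its position.
    a, b, c, d = (int(part) for part in ip.split('.'))
    return (a << 24) | (b << 16) | (c << 8) | d

print(set(octet_sum(ip) for ip in ips))  # one value -> every key collides
print(set(ip_as_int(ip) for ip in ips))  # five distinct values -> no collisions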

Being lazy, I'd be tempted to create a Dictionary<IPAddress,int> and hope that the implementor of the IPAddress class created the hash key appropriately.

Dictionary<IPAddress, int> ipAddresses = new Dictionary<IPAddress, int>();
Dictionary<string, int> products = new Dictionary<string,int>();

For the products list, after you've set up the hash table, just use LINQ to select them out.

var sorted = from o in products
             orderby o.Value descending
             where o.Value > 1
             select new { ProductName = o.Key, Count = o.Value };
Robert Paulson
A: 

Just sort the file by each of the two fields of interest. That avoids any need to worry about hash functions and will work just fine on a million record set.

Sorting the IP addresses this way also makes it easy to extract other interesting information such as accesses from the same subnet.

Captain Segfault
+3  A: 

Distinct IP addresses:

$ cut -f 2 -d \; log.file | sort | uniq

Most popular product:

$ cut -f 3 -d \; log.file | sort | uniq -c | sort -n

If you can do so, shell script it like that.

Autocracy
A: 

Unique IPs:

$ awk -F\; '{print $2}' log.file | sort -u

count.awk

# count occurrences of each distinct input line
{ a[$0]++ }

END {
    for(key in a) {
        print a[key] " " key;
    }
}

Top 10 favorite items:

$ awk -F\; '{print $3}' log.file | awk -f count.awk | sort -r -n | head -10
Will Hartung