Log file1 contains records of customers (name, id, date) who visited yesterday; log file2 contains records of customers (name, id, date) who visited today.

How would you display customers who visited yesterday but not today?

Constraint: don't use an auxiliary data structure, because each file contains millions of records [so, no hashes]. Is there a way to do this using Unix commands?

+1  A: 

Is a customer identified by id? Is it an int or long? If the answer to both questions is yes, an array of 10,000,000 ints shouldn't take more than 10M * 4 bytes = 40 MB of memory - not a big deal on decent hardware. Simply sort and compare them.

btw, sorting an array with 10M random ints takes less than 2 seconds on my machine - again, nothing to be afraid of.

Here's some very simple Java code:

import java.util.Arrays;
import java.util.Random;

public static void main(final String[] args) {

    // elements in each log file
    int count = 10000000;

    // "read" our log files
    Random r = new Random();
    int[] a1 = new int[count];
    int[] a2 = new int[count];
    for (int i = 0; i < count; i++) {
        a1[i] = Math.abs(r.nextInt());
        a2[i] = Math.abs(r.nextInt());
    }

    // start timer
    long start = System.currentTimeMillis();

    // sort logs
    Arrays.sort(a1);
    Arrays.sort(a2);

    // indexes into each input array and the result
    int i1 = 0, i2 = 0, i3 = 0;

    // result array
    int[] a3 = new int[count];

    // merge scan: always advance the side with the smaller value
    while (i1 < count && i2 < count) {
        if (a1[i1] == a2[i2]) {
            // we found a match; save the value if it isn't a duplicate
            if (i3 == 0 || a3[i3 - 1] != a1[i1]) a3[i3++] = a1[i1];
            i1++;
            i2++;
        } else if (a1[i1] < a2[i2]) {
            // a1's value is lower, advance it (its next value is higher)
            i1++;
        } else {
            // a2's value is lower, advance it (its next value is higher)
            i2++;
        }
    }

    // we found our results
    System.out.println(i3 + " common clients");
    System.out.println((System.currentTimeMillis() - start) + "ms");

}

result

// sample output on my machine:
46308 common clients
3643ms

As you can see, quite efficient for 10M records in each log.

sfussenegger
+1, I bet that's a lot faster than the script I just threw together and is definitely the right way to go.
Ninefingers
A: 

I was personally going to create a data structure of visit records, but I can see how you'd do it another way too.

In pseudocode - it looks something like Python, but could be rewritten in Perl, shell script, or ...

import fileinput
import subprocess

for line in fileinput.input(['myfile']):
    # split out the data. For the sake of it I'm assuming name\tid\tdate
    fields = line.rstrip('\n').split('\t')
    id = fields[1]

    # search yesterday's log for this id
    grepresult = subprocess.Popen(['grep', id, 'file1'],
                                  stdout=subprocess.PIPE).communicate()[0]

    if len(grepresult) == 0:
        print fields  # it wasn't in file1

That's not perfect and not tested, so treat it appropriately, but it gives you the gist of how you'd use Unix commands. That said, as sfussenegger points out, a sorted-array approach (in Java or C/C++) should be able to handle pretty large files.

Disclaimer: this is a not-so-neat solution (repeatedly calling grep) that matches the requirements of the question. If I were doing it, I would use C.
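A way to avoid spawning one grep per record is grep's -f option, which reads all patterns from a file and checks them in a single pass. A sketch under assumed conditions - tab-separated logs named file1 (yesterday) and file2 (today) with the id in field 2; tiny sample data is created inline just to make it self-contained:

```shell
# sample logs: name, id, date (tab-separated) - illustrative data only
printf 'alice\t1\t2010-05-01\nbob\t2\t2010-05-01\n' > file1   # yesterday
printf 'alice\t1\t2010-05-02\n' > file2                       # today

# collect today's ids once, then keep every yesterday record whose
# id does not appear among them (-F fixed strings, -w whole words)
cut -f2 file2 | sort -u > today_ids
grep -v -w -F -f today_ids file1    # prints bob's record
```

Note the caveat: matching the id anywhere on the line can collide with name or date fields, so in practice you'd want to restrict the match to the id column.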

Ninefingers
A: 

An example - but check the man page of comm for the exact options you want. Here -23 suppresses lines unique to today (column 2) and lines common to both files (column 3), leaving only those who visited yesterday:

comm -23 <(sort -u yesterday) <(sort -u today)

The other tool you can use is diff

diff <(sort -u yesterday) <(sort -u today)
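For reference, a worked example of the comm approach on tiny inline sample files (the filenames and ids here are illustrative):

```shell
# sample logs reduced to one sorted customer id per line
printf '1\n2\n3\n' | sort -u > yesterday_ids
printf '1\n3\n'    | sort -u > today_ids

# -23 suppresses ids unique to today and ids common to both,
# leaving the ids seen yesterday but not today
comm -23 yesterday_ids today_ids    # prints 2
```

Both inputs must be sorted for comm to work, which is why the sort -u step matters.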
ghostdog74