Log file1 contains records of customers (name, id, date) who visited yesterday; log file2 contains records of customers (name, id, date) who visited today.

How would you display customers who visited yesterday but not today?

Constraint: don't use an auxiliary data structure, because each file contains millions of records [so, no hashes]. Is there a way to do this using Unix commands?

+1  A: 

Is a customer identified by id? Is it an int or long? If the answer to both questions is yes, an array of 10,000,000 ints shouldn't take more than 10M * 4 bytes = 40 MB of memory - not a big deal on decent hardware. Simply sort and compare them.

btw, sorting an array with 10M random ints takes less than 2 seconds on my machine - again, nothing to be afraid of.

Here's some very simple Java code:

import java.util.Arrays;
import java.util.Random;

public static void main(final String[] args) {

    // elements in each log file
    int count = 10000000;

    // "read" our log files
    Random r = new Random();
    int[] a1 = new int[count];
    int[] a2 = new int[count];
    for (int i = 0; i < count; i++) {
        a1[i] = Math.abs(r.nextInt());
        a2[i] = Math.abs(r.nextInt());
    }

    // start timer
    long start = System.currentTimeMillis();

    // sort logs
    Arrays.sort(a1);
    Arrays.sort(a2);

    // indexes into each input array and the result
    int i1 = 0, i2 = 0, i3 = 0;

    // result array
    int[] a3 = new int[count];

    // merge scan: always advance the side with the smaller value
    while (i1 < count && i2 < count) {
        if (a1[i1] == a2[i2]) {
            // we found a match; save the value if it isn't a duplicate
            if (i3 == 0 || a3[i3 - 1] != a1[i1]) a3[i3++] = a1[i1];
            i1++;
            i2++;
        } else if (a1[i1] < a2[i2]) {
            // a1's value is lower, advance it (its next value is higher)
            i1++;
        } else {
            // a2's value is lower, advance it (its next value is higher)
            i2++;
        }
    }

    // we found our results
    System.out.println(i3 + " common clients");
    System.out.println((System.currentTimeMillis() - start) + "ms");

}

result

// sample output on my machine:
46308 common clients
3643ms

As you can see, quite efficient for 10M records in each log.

sfussenegger
+1, I bet that's a lot faster than the script I just threw together and is definitely the right way to go.
Ninefingers
A: 

I was personally going to create a data structure of visit records, but I can see how you'd do it another way too.

In pseudocode - it looks something like Python, but could be rewritten in Perl, shell script, or ...

import fileinput
import subprocess

for line in fileinput.input(['myfile']):
    # split out the data. For the sake of it I'm assuming name\tid\tdate
    fields = line.rstrip('\n').split('\t')
    id = fields[1]

    # search yesterday's log for this id
    grepresult = subprocess.Popen(['grep', id, 'file1'],
                                  stdout=subprocess.PIPE).communicate()[0]

    if len(grepresult) == 0:
        print fields  # it wasn't in file1

That's not perfect and not tested, so treat it appropriately, but it gives you the gist of how you'd use Unix commands. That said, as sfussenegger points out, a sorted-array approach (in Java or C/C++) should be able to handle pretty large files.

Disclaimer: this is a not-so-neat solution (repeatedly calling grep) that matches the requirements of the question. If I were doing it, I would use C.
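A way to avoid spawning one grep per record is grep's -f option, which reads all patterns from a file and checks them in a single pass. A sketch under assumed conditions - tab-separated logs named file1 (yesterday) and file2 (today) with the id in field 2; tiny sample data is created inline just to make it self-contained:

```shell
# sample logs: name, id, date (tab-separated) - illustrative data only
printf 'alice\t1\t2010-05-01\nbob\t2\t2010-05-01\n' > file1   # yesterday
printf 'alice\t1\t2010-05-02\n' > file2                       # today

# collect today's ids once, then keep every yesterday record whose
# id does not appear among them (-F fixed strings, -w whole words)
cut -f2 file2 | sort -u > today_ids
grep -v -w -F -f today_ids file1    # prints bob's record
```

Note the caveat: matching the id anywhere on the line can collide with name or date fields, so in practice you'd want to restrict the match to the id column.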

Ninefingers
A: 

An example - but check the man page of comm for the exact options you want. Here -23 suppresses lines unique to today (column 2) and lines common to both files (column 3), leaving only those who visited yesterday:

comm -23 <(sort -u yesterday) <(sort -u today)

The other tool you can use is diff

diff <(sort -u yesterday) <(sort -u today)
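For reference, a worked example of the comm approach on tiny inline sample files (the filenames and ids here are illustrative):

```shell
# sample logs reduced to one sorted customer id per line
printf '1\n2\n3\n' | sort -u > yesterday_ids
printf '1\n3\n'    | sort -u > today_ids

# -23 suppresses ids unique to today and ids common to both,
# leaving the ids seen yesterday but not today
comm -23 yesterday_ids today_ids    # prints 2
```

Both inputs must be sorted for comm to work, which is why the sort -u step matters.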
ghostdog74