tags:

views:

363

answers:

5

I've got two ways of fetching a bunch of data. The data is stored in a sorted vector<map<string, int> >.

I want to identify whether there are inconsistencies between the two vectors.

What I'm currently doing (pseudo-code):

for i in 0... min(length(vector1), length(vector2)):
    for (k, v) in vector1[i]:
        if v != vector2[i][k]:
            // report that k is bad for index i,
            // with vector1 having v, vector2 having vector2[i][k]

for i in 0... min(length(vector1), length(vector2)):
    for (k, v) in vector2[i]:
        if v != vector1[i][k]:
            // report that k is bad for index i,
            // with vector2 having v, vector1 having vector1[i][k]

This works in general, but breaks horribly if vector1 has a, b, c, d and vector2 has a, b, b1, c, d (it reports brokenness for b1, c, and d). I'm after an algorithm that tells me that there's an extra entry in vector2 compared to vector1.

I think I want to do something where when I encountered mismatches entries, I look at the next entries in the second vector, and if a match is found before the end of the second vector, store the index i of the entry found in the second vector, and move to matching the next entry in the first vector, beginning with vector2[i+1].

Is there a neater way of doing this? Some standard algorithm that I've not come across?

I'm working in C++, so C++ solutions are welcome, but solutions in any language or pseudo-code would also be great.

Example

Given the arbitrary map objects: a, b, c, d, e, f and g;

With vector1: a, b, d, e, f

and vector2: a, c, e, f

I want an algorithm that tells me either:

Extra b at index 1 of vector1, and vector2's c != vector1's d.

or (I'd view this as an effectively equivalent outcome)

vector1's b != vector2's c and extra d at index 2 of vector1

Edit

I ended up using std::set_difference, and then doing some matching on the diffs from both sets to work out which entries were similar but different, and which had entries completely absent from the other vector.

+4  A: 

Something like the std::mismatch algorithm

You could also use std::set_difference

Glen
don't forget about set_symmetric_difference().
D.Shawley
Both of those look interesting to me. This might seem like a stupid question, but do `set_difference` and `set_symmetrics_difference` deal in sets? (i.e. Are repeated elements effectively ignored, so `a, a, b, c` is equivalent to `a, b, c`?)
Dominic Rodger
It's not clear to me, but my reading of it is that std::set_difference requires a sorted set of data. i.e. no duplicates.
Glen
A: 

What exactly are you trying to achieve? Could you please define precisely what output you expect in terms of the input? Your pseudo code compares maps at the vector index. If that is not the correct semantics, then what is?

Ari
I've edited the question to include a more fully-fleshed out example. Let me know if you want further clarification.
Dominic Rodger
It seems like what you want is the so called "edit distance" between the vectors. Generally speaking, given two strings (in this case "strings" of map objects) there are many ways to transform one into the other using the basic operations "delete one character", "insert one character", and "change one character". The edit distance is the (lengtgh of) the shortest sequence of such operations that transforms one string to the other. Is this what you want? There are known algorithms for that.
Ari
@Ari - great point about edit distance. I'd definitely not been thinking in the right terms - I'm looking for something like Levenshtein distances, but with details as to precisely *what* elements need to be changed.
Dominic Rodger
The dynamic programming algo that computes the distance can also be made to give the sequence of operations realizing it, i.e., what elements need to changed/added/deleted.
Ari
A: 

Can you associate with each map some kind of checksum (or Blumen filter) - that at single check you could be able to decide if comparison has a sense.

Dewfy
I can fairly easily tell if the two vectors are different, what I care about is precisely *how* they are different.
Dominic Rodger
A: 

In your example, note that is not possible to differentiate between

Extra b at index 1 of vector1, and vector2's c != vector1's d.

and

Extra b at index 1 of vector 1, extra d at index 2 of v1, and extra c at 1 in v2

because it is not clear that "c" shoud be compared to "d", it could be compared to "b" either. I assume the vectors are not sorted, because std::map doesn't provide a relational operator. Rather are the maps, which is as far as I see completly irrelevant ;-) So your example is slightly misreading. It could even be

Compare b f e a d

with a c f e

You can check each element of the first vector against each element of the second vector. This has quadratic runtime.

for i in 0... length(vector1):
    foundmatch = false;

    for j in 0... length(vector2):
        mismatch = false;
        for (k, v) in vector1[i]:
            if v != vector2[j][k]:
                mismatch = true;
                break; // no need to compare against the remaining keys.

        if (!mismatch) // found matching element j in vector2 for element i in vector1
            foundmatch = true;
            break;  // no need to compare against the remaining elements in vector2

    if (foundmatch)
        continue;
    else
        // report that vector1[i] has no matching element in vector2[]
        // "extra b at i"

If you want the find the missing elements, just swap vector1 and vector2.

If you want to check in a element in vector2 mismatches to a element in vector1 in only a single key, you have to add additional code around "no need to compare against the remainig keys".

drhirsch
+1  A: 

It sounds like you're looking for the diff algorithm. The idea is to identify the longest common subsequence of the two vectors (using map equality), then recurse down the non-common portions. Eventually you'll have an alternating list of vector sub-sequences that are identical, and sub-sequences that have no common elements. You can then easily produce whatever output you like from this.

Apply it to the two vectors, and there you go.

Note that since map comparison is expensive, if you can hash the maps (use a strong hash - collisions will result in incorrect output) and use the hashes for comparisons you'll save a lot of time.

Once you're down to the mismatched subsequences at the end, you'll have something like:

Input vectors: a b c d e f, a b c' d e f
Output:
   COMMON a b
   LEFT c
   RIGHT c'
   COMMON d e f

You can then individually compare the maps c and c' to figure out how they differ.

If you have a mutation and insertion next to each other, it gets more complex:

Input vectors: a b V W d e f, a b X Y d e f
Output:
   COMMON a b
   LEFT V W
   RIGHT X Y
   COMMON d e f

Determining whether to match V and W against X or Y (or not at all) is something you'll have to come up with a heuristic for.

Of course, if you don't care about how the content of the maps differ, then you can stop here, and you have the output you want.

bdonlan
+1 - thanks, that's exactly what I'm trying to do. I'll take a look at the longest common subsequence problem.
Dominic Rodger