views: 935

answers: 12
I am trying to come up with an algorithm to compare two strings. It would register a match for any words that contain the same letters. For example, rent and tern would be equivalent because they both contain the letters r, e, n, t.

EDIT I apologize for being so vague. The comparison is going to be made between two sets of a few thousand words each, hundreds of times. This is only a small part of the overall code, so I don't want it to bog everything down.

For those who were asking: yes, overmatching would be very important. For example, rent would also match ternicate.

EDIT 2 For a match like rent == ternicate, ternicate would not match rent. It is more a question of whether word two contains the letters of word one. So extra letters would still give a match, so long as the second word contains all of the letters of the first word.

+6  A: 

Simply sort the characters of each string first, then compare them.

rent == tern
enrt == enrt
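
A minimal sketch of that in Java (the helper name is illustrative, not part of the answer):

import java.util.Arrays;

// Sort the characters of both words and compare the results;
// "rent" and "tern" both become "enrt".
static boolean sameLettersSorted(String s1, String s2) {
    char[] a = s1.toCharArray();
    char[] b = s2.toCharArray();
    Arrays.sort(a);
    Arrays.sort(b);
    return Arrays.equals(a, b);
}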
Brad Gilbert
I'd considered that however I didn't think that it would be efficient enough.
Steve
@Rob - it depends on how you sort them.
Stephen C
for small datasets, a seemingly overly simple approach is too often the approach that works best.
Nicholas Jordan
Keep it simple! Don't guess on the speed until you profile it with real life data.
Thorbjørn Ravn Andersen
Re: "I'd considered that however I didn't think that it would be efficient enough." I regularly do this kind of thing in python on a 175K word English dictionary and it takes about 2-3 seconds to do all the words. What is your efficiency concern based on?
hughdbrown
Comparing n strings of length k in this manner takes time O(n log k).
Apocalisp
+3  A: 

One alternative is to count the numbers of each character in each string and compare the counts. A simple implementation should take O(max(N, A)) time where N is the length of the larger of the strings, and A is the size of the array you use to store counts. For example, in Java:

public boolean equalIgnoringOrder(String s1, String s2) {
    if (s1.length() != s2.length()) {
        return false;
    }
    // Assuming characters in the range ASCII 0 to 127 
    int[] c1 = new int[128];
    int[] c2 = new int[128];
    for (int i = 0; i < s1.length(); i++) {
        c1[s1.charAt(i)]++;
        c2[s2.charAt(i)]++;
    }
    for (int i = 0; i < c1.length; i++) {
        if (c1[i] != c2[i]) {
            return false;
        }
    }
    return true;
}

There are some possible improvements to this. For example, you can cope with an arbitrary character set by doing a range reduction; i.e. do an initial pass through s1 and s2 looking for the smallest and largest characters in each one, and use this to determine the size of c1 and c2 and a base offset. This will use less space on average and reduce the time to initialize the count arrays. It also offers a short circuit for the comparison; e.g. when the smallest and largest characters for s1 and s2 are not the same.
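
A sketch of that range-reduction variant, following the description above (the method name is illustrative, and it keeps the two-array structure of the original code):

public static boolean equalIgnoringOrderRangeReduced(String s1, String s2) {
    if (s1.length() != s2.length()) {
        return false;
    }
    if (s1.isEmpty()) {
        return true;
    }
    // Initial pass: find the character range of each string.
    char min1 = Character.MAX_VALUE, max1 = Character.MIN_VALUE;
    char min2 = Character.MAX_VALUE, max2 = Character.MIN_VALUE;
    for (int i = 0; i < s1.length(); i++) {
        char a = s1.charAt(i), b = s2.charAt(i);
        if (a < min1) min1 = a;
        if (a > max1) max1 = a;
        if (b < min2) min2 = b;
        if (b > max2) max2 = b;
    }
    // Short circuit: different character ranges cannot hold the same characters.
    if (min1 != min2 || max1 != max2) {
        return false;
    }
    // Count arrays sized to the actual range, with min1 as the base offset.
    int[] c1 = new int[max1 - min1 + 1];
    int[] c2 = new int[max1 - min1 + 1];
    for (int i = 0; i < s1.length(); i++) {
        c1[s1.charAt(i) - min1]++;
        c2[s2.charAt(i) - min1]++;
    }
    for (int i = 0; i < c1.length; i++) {
        if (c1[i] != c2[i]) {
            return false;
        }
    }
    return true;
}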

By comparison, comparing strings sorted using heapsort or quicksort would be O(N log N) on average with O(N) space, where N is the larger string's length.

However, as @pst points out, the constants of proportionality can make an O(N log N) or even an O(N*N) algorithm better than an O(N) algorithm if N is not large. In this case, the average length of the strings being compared is probably the most important factor.

The code above is effectively performing a Radix Sort with a couple of short circuits. (Three if you include the short circuit associated with range reduction.) So ultimately it boils down to whether a Quick Sort / Heap Sort or a Radix Sort would be better. And that depends on input string lengths and character ranges.


On a different tack. @John's answer proposes that we compute a product of prime numbers. If we do the computation using an arbitrary precision representation, the resulting values will be unique for each distinct set of "equal ignoring order" strings. Unfortunately, the computation will be O(N*N). (Each intermediate product has O(N) digits, and multiplying an N-digit number by a constant is O(N). Do that for N characters and you get O(N*N).)

But if we do the computation modulo 2^64 (i.e. letting a 64-bit long wrap around), the result is effectively a really good hash that is insensitive to character order; e.g.

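// 'primes' is assumed to be a precomputed table mapping each character value to a distinct prime.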
long hash = 1;
for (int i = 0; i < s.length(); i++) {
    hash = hash * primes[s.charAt(i)];
}

So, I would claim that the algorithm that gives best performance and space usage on average for comparing randomly generated strings is likely to be of the form:

if (s1.length() != s2.length()) {
    return false;
}
if (hash(s1) != hash(s2)) { // computed as above
    return false;
}
// Compare using sorting or character counting as above.
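
Filled out as a sketch (names are illustrative; the prime table covers ASCII 0 to 127 as in the counting example above, and the final call is to the equalIgnoringOrder method from earlier):

private static final long[] PRIMES = buildPrimes(128);

// Build one distinct prime per character value, by trial division.
private static long[] buildPrimes(int size) {
    long[] primes = new long[size];
    long candidate = 2;
    for (int i = 0; i < size; i++) {
        while (!isPrime(candidate)) {
            candidate++;
        }
        primes[i] = candidate++;
    }
    return primes;
}

private static boolean isPrime(long n) {
    for (long d = 2; d * d <= n; d++) {
        if (n % d == 0) {
            return false;
        }
    }
    return true;
}

// Order-insensitive hash: product of per-character primes, wrapping modulo 2^64.
private static long orderInsensitiveHash(String s) {
    long hash = 1;
    for (int i = 0; i < s.length(); i++) {
        hash *= PRIMES[s.charAt(i)];
    }
    return hash;
}

public static boolean equalIgnoringOrderFast(String s1, String s2) {
    if (s1.length() != s2.length()) {
        return false;
    }
    if (orderInsensitiveHash(s1) != orderInsensitiveHash(s2)) {
        return false;
    }
    return equalIgnoringOrder(s1, s2);   // exact check, e.g. the counting method above
}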


One final point. If we assume that the string pointers are not identical and that the strings have equal lengths, any algorithm that computes this equals predicate has to be O(N) or worse. It has to examine every character in both strings to make this determination, and that takes O(N) operations.

Any algorithm that does less than 2 * N fetches or less than 2 * N further operations on the fetched values in this scenario is provably incorrect.

Stephen C
@Stephen Idea thief (just kidding). +1 for the good showing of code. However, this sorting can be performed as a radix sort so the O of the complexity of the sorting may be reducible considerably.
pst
@pst - this >>IS<< a radix sort :-)
Stephen C
@Stephen I challenge that :) It builds frequency tables and then compares those. It fails to perform the sorting.
pst
(Okay, okay, that was a cheap shot.)
pst
So your hash starts at 0 and is multiplied by a succession of primes? I'm betting you didn't run that code before posting.
hughdbrown
Duh!!!!!!!!!!!!!! Fixed.
Stephen C
+9  A: 
John Kugelman
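
The approach referenced here, per Stephen C's answer above and the comments below, maps each of the 26 letters to a distinct prime and multiplies them: equal products mean the same letters, and divisibility gives the overmatching check from the question. A minimal sketch, assuming lowercase a-z input and using BigInteger to sidestep the overflow issue discussed below:

import java.math.BigInteger;

public class PrimeProduct {
    // One distinct prime per letter a-z: the first 26 primes.
    private static final int[] PRIMES = {
        2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
        43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
    };

    static BigInteger product(String word) {
        BigInteger p = BigInteger.ONE;
        for (int i = 0; i < word.length(); i++) {
            p = p.multiply(BigInteger.valueOf(PRIMES[word.charAt(i) - 'a']));
        }
        return p;
    }

    // Same letters (with multiplicity), regardless of order.
    static boolean sameLetters(String a, String b) {
        return product(a).equals(product(b));
    }

    // Overmatching case from the question: every letter of 'sub'
    // (with multiplicity) also appears in 'sup'.
    static boolean containsLettersOf(String sub, String sup) {
        return product(sup).mod(product(sub)).signum() == 0;
    }
}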
Does it have to be "The first 26 primes" ? Can I just use any sequence of primes that I can generate easily? I would like to use your approach in my code.
Nicholas Jordan
Any set of unique primes will work.
John Kugelman
This can only work if you use a BigInteger representation for your numbers. If you use fixed sized integers you will get collisions and false positives.
Stephen C
Having said that, if you view the fixed precision version of this algorithm as a hash function, you could combine it with a sort / counting based algorithm to improve overall time and space complexity *in the average case*.
Stephen C
Use a 32-bit bitmap that fits inside an int.
peufeu
@peufeu I don't understand that. Multiplying (even the first 26) prime numbers together will quickly exhaust 32 bits of storage. [After at most 32 characters in the ideal case of all "a"s (prime 2), and fewer otherwise.]
pst
Thanks, John - I am considering going to BigInteger ( Stephen C ) to avoid overflow ( and related ) as I am writing this for a rather free-form project where I can use any code I can understand. I think peufeu's approach might work for some sort of String.hashCode() { are we in Java ? } Your answer ( approach ) is neither rid ( iculous ) nor in-efficient, but we could keep open the possibility of some bit-twisting as there are some remarkable advances in science done by offhand remark. See Kepler - major advance in Science while telling someone "The short version"
Nicholas Jordan
@Nicholas - see my updated answer for why BigInteger would be a bad idea ... and other thoughts.
Stephen C
This also works with the overmatching, as added to the question, because you can just check for divisibility instead of equality. Repeated letters would still need to be handled, but as the question does not specify how to handle them, it's not possible to design that part. It seems to me, though, that that case would be pretty simple too, no matter which (reasonable) handling is desired.
jk
@pst - see my answer for a 32-bit bitmap solution which appears to meet the vague requirements...
Alnitak
As jk points out, by checking for integer divisibility, this solution also handles the subset problem. :)
Nick Johnson
-1 - far too complicated
Alnitak
This may be disqualified because of the implementation language/platform. Check the size of these samples: artichokes = 16888470642670, typewriter = 15593877170360907, overflow = 3976815478207, numbers = 14866421691. On the other hand, it does show the equivalence of these phrases: womanhitler = 9204705917595826, motherinlaw = 9204705917595826.
hughdbrown
Could this also be done by first checking to see if the strings are of equal length, and then just adding up the ASCII values of each character? If the sums are equal, then they contain the same characters.
Zerobu
@Zerobu - No, that would fail for ADD and ACE.
John Kugelman
+4  A: 

The key here, given the ambiguity in the question, is that it does not appear necessary to count how many times any letter appears, only that it does appear.

Therefore, assuming that all letters are in the range a-z, and also assuming that it's possible to index the original word lists as arrays using integer indices:

1. create two arrays (one for each list).

2. for every word in both lists calculate bitmap as follows:

bitmap = 0
foreach (character in word) {
    bitmap |= (1 << (character - 'a'))
}
arrayX[index] = bitmap;

this bitmap represents the set of all letters which appear in that word.

3. then for each word in set A, iterate over set B, and match when

(arrayA[indexA] | arrayB[indexB]) == arrayB[indexB]

That test will only be true if the set of characters in that word A is a subset of the characters of word B. The "or" operation for bitsets is the equivalent of the union (∪) operator for real sets.

See the Wikipedia entry on set mathematics - A ⊆ B if and only if A ∪ B = B.

BTW, Step 3 is O(n^2), but should still be very fast because it's just a bitwise comparison. A couple of thousand words in each list (~4M tests) should take less than a second.
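
A minimal sketch of steps 1 to 3 in Java, assuming lowercase a-z words (the names are illustrative):

// Step 2: one bitmap per word; bit i is set if letter ('a' + i) occurs at all.
static int bitmap(String word) {
    int bits = 0;
    for (int i = 0; i < word.length(); i++) {
        bits |= 1 << (word.charAt(i) - 'a');
    }
    return bits;
}

// Steps 1 and 3: precompute bitmaps for both lists, then test every pair.
static void findMatches(String[] listA, String[] listB) {
    int[] bitmapsA = new int[listA.length];
    int[] bitmapsB = new int[listB.length];
    for (int i = 0; i < listA.length; i++) bitmapsA[i] = bitmap(listA[i]);
    for (int j = 0; j < listB.length; j++) bitmapsB[j] = bitmap(listB[j]);

    for (int i = 0; i < listA.length; i++) {
        for (int j = 0; j < listB.length; j++) {
            // True only if the letters of listA[i] are a subset of the letters of listB[j].
            if ((bitmapsA[i] | bitmapsB[j]) == bitmapsB[j]) {
                System.out.println(listA[i] + " matches " + listB[j]);
            }
        }
    }
}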

Alnitak
Be careful however to take any necessary precautions, like: 1) converting all `word`s to lowercase 2) special handling if you have characters other than 'a'-'z' (e.g. not 7bit-ASCII-English) 3) `bitmap` must be able to contain flags for each character, so at least 26 (?) bits wide
akavel
This isn't a bitmap - it's a histogram.
Nick Johnson
@Nick - huh? No, it's a bitmap (or bitset) representing which letters occur. It does not _count_ how many times each occur, which would be a histogram.
Alnitak
@akavel - it's java - ints are always 32 bits wide, and I did state the 'a-z' assumption too... :)
Alnitak
Excellent solution. Then, of course, we'd need to know whether he'll be going through each pair of sets multiple times, or only once per word. Because, of course, if he's going through them multiple times, he'll want to store your array with each word.
CPerkins
@CPerkins - indeed, although the bitwise test is the important part of this solution, not whatever collection is used for the mapping between word -> bitmap
Alnitak
+1  A: 

Maybe not the fastest, but likely the shortest solution, using java + google-collections/guava (for converting char[] -> List<Character>):

import com.google.common.collect.ImmutableMultiset;
import com.google.common.primitives.Chars;

public class EqualsOrderignore {
    private static boolean compareIgnoreOrder(final String s1, String s2) {
        return ImmutableMultiset.copyOf(Chars.asList(s1.toCharArray()))
                .equals(ImmutableMultiset.copyOf(Chars.asList(s2.toCharArray())));
    }
}

The runtime of this algorithm is O(s1.length + s2.length).

I am quite convinced this solution will perform on par with a handcrafted O(N1 + N2) solution on a -server VM.

As a plus, this solution will work for any characters, not just a-z.

Andreas Petersson
A: 

Assuming that:

  1. your words only consist of ascii characters
  2. case doesn't matter
  3. abc matches abcde and abcde does not match abc

You can go through the match string (s2) counting chars, then go through the value (s1) and check all chars are present in the other, something like (pseudo code, not checked):

boolean matches(String s1, String s2) {
   int[]  counts = new int[256];
   char[] c1;
   char[] c2;

   c1 = s1.toCharArray();
   c2 = s2.toCharArray();

   // count char occurrences in the longest string
   for (int n = 0; n < c2.length; n++) {
       counts[(int)c2[n]]++;
   }

   // check all chars in the shortest string are found in the longest
   for (int n = 0; n < c1.length; n++) {
       if (0 == counts[(int)c1[n]]) {
          return false;
       }
   }

   return true;
}

This would be O(n), where n is the sum of the argument lengths.

Edit: the question was changed to an asymmetric function between s1 and s2.

rsp
Why do you compare the number of counts to 0? That just says whether at least a single instance is found in the other string. Maybe it makes more sense to do the same count on s2 and see if `s1_count(c) <= s2_count(c)` for each c in s1.
hughdbrown
That is exactly what is asked: s1 matches s2 if all chars in s1 occur in s2 regardless of sequence. Therefore it does not match if one of the chars in s1 does not occur in s2, hence the comparison against 0.
rsp
+1  A: 

I've done a lot of code that worked with word games and anagrams. The usual approach is to convert the word into a sorted key so that, as mentioned above, 'rent' matches 'tern' because both map to 'enrt'. Once you start on this route, though, it becomes really useful to have a dictionary of character and count of occurrence. Here is some python code that converts an unsorted string to a dictionary with (key=character, value=count):

import collections

# Create a defaultdict(int) from a string
def create_collections_dict(key):
    dk = collections.defaultdict(int)
    for k in key:
        dk[k] += 1
    return dk

Now you can score words against others by instantly seeing how many letters they have in common:

# Score the similarity of a defaultdict(int) against a string
# (which is temporarily converted to a defaultdict(int))
def score(dk, cand) :
    dc = create_collections_dict(cand)
    return sum(min(dk[k], dc[k]) for k in dk.keys() if k in dc)

if __name__ == '__main__':
    base = create_collections_dict('rent')
    for word in ['tern', 'ternicate', 'foobar']:
        print word, score(base, word)

Results:

tern 4
ternicate 4
foobar 1
hughdbrown
+2  A: 

I have to agree with Stephen C - this is not well enough defined to answer.

I'm not going to downvote, but could you explain, for example, whether rent is equivalent to terrent? You've got answerers who are assuming that it is (people thinking that the number of occurrences doesn't matter), and other answerers who assume the worst. One of these groups is wasting their time.

Also, since your concern is about performance, we need to know more about your call pattern. Could you explain whether you'll look at a pair of sets more than once, or if the sets vary?

And just as a terminology twitch, you may already know this, but with the current formulation your algorithm isn't symmetric.

You say that rent would match ternicate, but obviously, ternicate would not match rent. So you're not really looking for equivalence. You're looking for something like "is found in", or "can be made from".

This means you have to care about order - you'll get different results depending on how you visit your sets.

Don't get me wrong: it's an interesting problem... I just don't know what the problem is.

CPerkins
it's well defined enough to answer now
Alnitak
A: 

It is pretty vague, but I'd use an associative array to solve it:

Use each letter of each word as a key into an associative array of integers. One word's letters increment the values, and the other word's letters decrement them. Then at the end you can run a foreach over all the keys and check that all the values are zero; if they are, it's a match. This gets you basic rent == tern functionality.

Caveats for vagueness:

1. If repeated letters are okay, e.g. rent == rreenntt, then when adding letters to the array, check whether the key already exists, and if it does, don't add it again.

2. If extra letters are okay, e.g. rent == renter but fernt != renter, then when checking the array values at the end, check that 1 and -1 aren't both in the array at the same time. In other words, 1 and 0 only are okay, or -1 and 0 only are okay, but 1 and -1 can't both be in the array at the same time.

I don't know how fast this would be relative to other approaches, but it would be easy to implement.
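
A minimal sketch of the basic rent == tern case in Java, using a HashMap as the associative array (the caveats above are left out; the helper name is illustrative):

import java.util.HashMap;
import java.util.Map;

static boolean sameLetters(String s1, String s2) {
    Map<Character, Integer> counts = new HashMap<>();
    for (int i = 0; i < s1.length(); i++) {
        counts.merge(s1.charAt(i), 1, Integer::sum);    // first word increments
    }
    for (int i = 0; i < s2.length(); i++) {
        counts.merge(s2.charAt(i), -1, Integer::sum);   // second word decrements
    }
    for (int count : counts.values()) {
        if (count != 0) {
            return false;                               // any non-zero value means a mismatch
        }
    }
    return true;
}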

SDGator
A: 

I think you should build a tree. I have written a bit of python code to illustrate the idea, but it's probably buggy:

class knot():
    def __init__(self, char, is_word, string = "", way = 0):
        self.children = []
        self.shortest_way = way
        self.char = char
        self.is_word = is_word
        self.string = string

def comparing(strings):
    sorted_strings = []
    for string in strings:
        array_of_string = []
        for char in string:
            array_of_string.append(char)
        array_of_string.sort()
        sorted_strings.append(array_of_string)
    sorted_strings.sort()

    start = []
    matches = []

    for array_of_string in sorted_strings:
        matches += insert_string(array_of_string, start)

def insert_string(array, start):
    for match_string in test_string(array, start):
        matches += (array, match_string.string)
    add_string(array, start, 0)

def add_string(array, knots, n):
    minimum = 0
    maximum = len(knots) - 1
    while minimum != maximum:
        num = int((minimum + maximum) / 2)
        if (knots[num].char > array[n]):
            minimum = num
        elif (knots[num].char < array[n]):
            maximum = num
        elif (knots[num].char == array[n]):
            return add_string(array, knots[num], n+1)
    knots.append(new_knots(array, n))
    knots.sort()

    """ more insertion routine needed"""

def search_children(array, knots):
    minimum = 0
    maximum = len(knots) - 1
    while minimum != maximum:
        num = int((minimum + maximum) / 2)
        if (knots[num].char > array[0]):
            minimum = num
        elif (knots[num].char < array[0]):
            maximum = num
        elif (knots[num].char == array[0]):
            return test_string(array, knots[num])
    return []

def test_string(array, target_knot):
    if len(array) > target_knot.shortest_way + 1:
        return []
    match_knots = []
    if len(array) == 1 and target_knot.is_word == True:
        match_knots.append(target_knot)
    for i in range(1, len(array)):
        match_knots += search_children(array[i:], target_knot.children)
    return match_knots
Christian
A: 

Assuming you're just looking for subsets, and are limited to the common English letters, then an efficient histogram will do. I'd look at using a 64-bit unsigned integer, with 2 bits for counting up to 2 occurrences of each letter, and the extra 12 bits used to add an overflow flag and to count up to 3 occurrences of 'e t a o i n s r h l d'. Bits are filled rather than used as a binary count (so for three 'e's you would have 111; otherwise you would need something more complicated than a bitwise AND to test containment). To test the subset relation, you check the overflow bit of the subset you are testing, and if it is not set you can just use bitwise AND to test for the subset. Fall back to O(length) checking of the sorted contents of the strings if the histogram does overflow.
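
A sketch of a simplified variant of this idea, using a uniform two saturating bits per letter (26 x 2 = 52 bits plus an overflow flag) instead of the mixed 2/3-bit layout described above; the filled rather than binary encoding is what makes a single AND a containment test:

// Bits 2*i and 2*i+1 hold a unary, saturating count (00 -> 01 -> 11) of letter
// ('a' + i); bit 63 flags that some letter occurred three or more times and the
// histogram is no longer exact.
static final long OVERFLOW = 1L << 63;

static long packedHistogram(String word) {
    long h = 0;
    for (int i = 0; i < word.length(); i++) {
        int c = word.charAt(i) - 'a';              // assumes lowercase a-z input
        long field = (h >>> (2 * c)) & 0b11;
        if (field == 0b11) {
            h |= OVERFLOW;                         // third occurrence: the count saturates
        } else {
            h |= ((field << 1) | 1) << (2 * c);    // fill the next bit
        }
    }
    return h;
}

// True when every letter of 'sub' (with multiplicity) also occurs in 'sup'.
static boolean lettersContainedIn(String sub, String sup) {
    long hSub = packedHistogram(sub);
    long hSup = packedHistogram(sup);
    if ((hSub & OVERFLOW) != 0) {
        return slowContainmentCheck(sub, sup);     // fall back as described above
    }
    // Filled (unary) fields make bitwise AND a per-letter containment test.
    return (hSub & hSup) == hSub;
}

// O(length) fallback using plain counts.
static boolean slowContainmentCheck(String sub, String sup) {
    int[] counts = new int[26];
    for (int i = 0; i < sup.length(); i++) counts[sup.charAt(i) - 'a']++;
    for (int i = 0; i < sub.length(); i++) {
        if (--counts[sub.charAt(i) - 'a'] < 0) return false;
    }
    return true;
}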

Pete Kirkham
A: 

For whatever algorithm you choose, an optimization might be made for strings of the same length: XOR together each character of both strings. If the result is non-zero, the strings cannot contain the same letters; a result of zero is inconclusive (e.g. "aa" and "bb" also XOR to zero), so it only short-circuits the negative case. This doesn't help in the substring case, but it might be a help to short circuit a more expensive compare.
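
A sketch of that pre-check (the helper name is illustrative; note it only answers "might match", never "definitely matches"):

// XOR pre-check for equal-length strings: a non-zero combined XOR proves the
// strings cannot contain the same letters, so the expensive comparison can be
// skipped; zero is inconclusive (e.g. "aa" and "bb").
static boolean xorPreCheckPasses(String s1, String s2) {
    int x = 0;
    for (int i = 0; i < s1.length(); i++) {
        x ^= s1.charAt(i) ^ s2.charAt(i);
    }
    return x == 0;
}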

Dolphin