views:

275

answers:

5

I have two strings containing letters and numbers separated by spaces. ex "elza7ma wa2fa fel matab" and "2ana ba7eb el za7ma 2awy 2awy"

What is the fastest way to compare this two string to find out whether or not they have a word in common?

I tried to split one of them using string.split and use string.compare on the whole array of words. but this is very slow since I will be comparing a lot of strings.

+11  A: 

A LINQ solution

"elza7ma wa2fa fel matab".Split()
                         .Intersect("2ana ba7eb el za7ma 2awy 2awy".Split())
                         .Any();

// as a string extension method
public static class StringExtensions
{
    public static bool OneWordMatches(this string theString, string otherString)
    {
        return theString.Split().Intersect(otherString.Split()).Any();
    }
}

// returns true
"elza7ma wa2fa fel matab 2ana".OneWordMatches("2ana ba7eb el za7ma 2awy 2awy");
Russ Cam
No overload of `Split` takes a single `char`. Probably best to also specify `RemoveEmptyEntries`
JaredPar
You can use the `Split()` without any parameters. In this case it will use spaces, tabs and new lines as separator.
Oliver
Is this really faster, or does Intersect() also loops through both arrays?
Sjoerd
Not sure about performance, will need to benchmark them. I suspect a hashset implementation like JaredPar's would be faster
Russ Cam
This one is implementation is excellent. I will be comparing hundreds of strings so performance is a must. Do you know whether this is faster or the hashset? how can I know?
Marwan
@Marwan: You can know by *trying it both ways and measuring the results*. How else could you possibly know?
Eric Lippert
@Russ: FYI Intersect already uses a Set<T> class internally to implement intersection.
Eric Lippert
@Eric - Thanks - I couldn't quite see what was going on in Reflector :)
Russ Cam
+4  A: 

I think the easiest way is to break up the strings into words and use a set structure like HashSet<string> to check for duplicates. For example

public bool HasMatchingWord(string left, string right) { 
  var hashSet = new HashSet<string>(
    left.Split(" ", StringSplitOptions.RemoveEmptyEntries)); 
  return right
    .Split(" ", StringSplitOptions.RemoveEmptyEntries)
    .Any(x => hashSet.Contains(x));
}
JaredPar
Might want to add an equality check as well to handle collisions (if there are any).
Jeff M
Is anyone sure if this is the best method performance wise?
Marwan
+1  A: 

You could split the two strings by word and build two hashtables/dictionaries. Then go through both and add keys incrementing an int in a third dictionary (Dictionary<string, int>). If any key in the third dictionary have a count of more than one, that word is in both original strings.

I would think that any algorithm to solve this problem would be 'slow' - especially for large input strings/many words.

mbanzon
Adding all words to the same HashSet and checking the return value of Add() is simpler.
Sjoerd
I re-read the original question - yes it would be much simpler. He is just asking if there are any words that are represented in both strings - not which and not how many occurences.
mbanzon
A: 

I'd probably take the initial performance hit and split the string and then order alphabetically and by word length. If you just have to find out if one word matches, break as soon as you find one. Once you have the split string arrays ordered alphabetically and by length, that limits the numbers of compares you would have to do.

Matt
A: 
  • The easiest way would be to compare all words to any other word. This is an easy solution, but slow.
  • Another way is to sort both lists, and then compare the top two entries. Like mergesort, but with the goal of finding equal words.
  • Another way is to compile the list of words into a tree, and match the words against that tree. A regex can do this, or you can do this yourself. In your example, the first letter should be 2, b, e or z. This way, each word is only inspected once and the least number of characters are inspected.
Sjoerd