views:

196

answers:

6

Let's say I have a set of keywords in an array {"olympics", "sports tennis best", "tennis", "tennis rules"}

I then have a large list (up to 50 at a time) of strings (or actually tweets), so they are a max of 140 characters.

I want to look at each string and see what keywords are present there. In the case where a keyword is composed of multiple words like "sports tennis best", the words don't have to be together in the string, but all of them have to show up.

I've having trouble figuring out an algorithm that does this efficiently.

Do you guys have suggestions on a way to do this? Thanks!

Edit: To explain a bit better each keyword has an id associated with it, so {1:"olympics", 2:"sports tennis best", 3:"tennis", 4:"tennis rules"}

I want to go through the list of strings/tweets and see which group of keywords match. The output should be, this tweet belongs with keyword #4. (multiple matches may be made, so anything that matches keyword 2, would also match 3 -since they both contain tennis).

When there are multiple words in the keyword, e.g. "sports tennis best" they don't have to appear together but have to all appear. e.g. this will correctly match: "i just played tennis, i love sports, its the best"... since this string contains "sports tennis best" it will match and be associated with the keywordID (which is 2 for this example).

Edit 2: Case insensitive.

A: 

I would suggest putting all of your keywords into a list of strings then going over your list of data (tweets, whatever) as another list of strings.

Do something like this:

Dim matchingStrings As Dictonary(String, String);
For Each stringToSearch As String In tweetList
   For Each keyword As String In keywordList
      If stringToSearch.Contains(keyword)
        matchingString.Add(stringToSearch, keyword);

break; End IF End For End For

Then MatchingString Contains all of your matches

EDIT: In C# And per your multiple words in a keyword list

Dictionary<string, string> matchingString = New Dictionary<string, string>; 
foreach (String stringToSearch In tweetList){
   foreach (String keyword In keywordList){
        If(stringToSearch.Contains(keyword){
            matchingString.Add(stringToSearch, keyword);
            break;
}
else if{
    List<string> split = keyword.Split(" ")
   foreach(String sKeyword In split){
          If(stringToSearch.Contains(keyword){
             matchingString.Add(stringToSearch, keyword);
             break;
          }
    }

 }

} }

msarchet
But what about the keywords that have multiple words in them? This wouldn't match that.
rksprst
Q is tagged c# not vb
Mikael Svenson
It will match the strings, if you need to match the individual words in the key words you will need to split up your keywords then. I'm going to rewrite this in C# in just a second
msarchet
@Mikael Svenson: So what? There’s a trivial 1:1 translation.
Konrad Rudolph
I'm working on it, I didn't actually notice that
msarchet
The curly braces are right I just can't get them to line up correctly for me.
msarchet
@Konrad, I agree it's trivial, but if you tag something C#, you expect the answer to be in C#. If not why tag it?
Mikael Svenson
+6  A: 
IEnumerable<string> tweets, keywords;

var x = tweets.Select(t => new
                           {
                               Tweet = t,
                               Keywords = keywords.Where(k => k.Split(' ')
                                                               .All(t.Contains))
                                                  .ToArray()
                           });
dtb
A: 

Whoops.

  foreach (var s in strings)
  {
      foreach (var keywordList in keywordSet) 
      {
          if (s.ContainsAll(keywordList))
          {
              // hit!
          }
      }
  }

...

private bool ContainsAll(this string s, string keywordList)
{    
    foreach (var singleWord in keywordList.Split(' '))
    {
        if (!s.Contains(singleWord)) return false;
    }
    return true;
}
Cheeso
+1  A: 

Multiple patterns can be searched very efficiently using several algorithms such as the algorithm of Aho-Corasick (using a trie) or the one from Wu and Manber.

If performance is critical, I suggest taking either of those. To search in multiple strings, it may be most efficient to concatenate all your 50 strings into one larger string, keeping book of the starting positions of individual strings.

Konrad Rudolph
+1  A: 

Maybe something like this?

        string[] keywords = new string[] {"olympics", "sports tennis best", "tennis", "tennis rules"};

        string testString = "I like sports and the olympics and think tennis is best.";

        string[] usedKeywords = keywords.Where(keyword => keyword.Split(' ').All(s => testString.Contains(s))).ToArray();
Robin Day
A: 

There are ways to pre-process strings to make searches more effective, but I think that the overhead is more than the gain for such short strings. It's not that much data, so I would just loop through the strings:

foreach (string tweet in tweets) {
  foreach (string keywords in theArray) {[
    string[] keyword = keywords.Split(' ');
    bool found = true;
    foreach (string word in keyword) {
      if (tweet.indexOf(word) == -1) {
        found = false;
        break;
      }
    }
    if (found) {
      // all words exist in the tweet
    }
  }
}
Guffa