views:

446

answers:

5

Suppose you have a list of acronym's that define a value (ex. AB1,DE2,CC3) and you need to check a string value (ex. "Happy:DE2|234") to see if an acronym is found in the string. For a short list of acronym's I would usually create a simple RegEx that used a separator (ex. (AB1|DE2|CC3) ) and just look for a match.

But how would I tackle this if there are over 30 acronym's to match against? Would it make sense to use the same technique (ugly) or is there a more effecient and elegant way to accomplish this task?

Keep in mind the example acronym list and example string is not the actual data format that I am working with, rather just a way to express my challenge.

BTW, I read a SO related question but didn't think it applied to what I was trying to accomplish.

Thanks in advance!

EDIT: I forgot to include my need to capture the matched value, hence the choice to use Regular Expressions...

A: 

If acronym's have fixed size (like in above example), you could calculate a hash for all of them (could be done once per application life) and then split the string in such overlapped pieces and calculate hashes for them too. Then all you'd have to do is to search for values from one array into another one.

You probably could create a suffix/prefix tree or something similar from acronyms and search using this information, there's plenty of algorithms in Wikipedia to do just that.

You could also create an deterministic automata for each of acronyms but it's very similar to previous approach.

vava
Unfortunately the acronym's don't have a fixed size, so I don't think a hash would help... but interesting idea, thanks!
Dscoduc
A: 

Why not simply split the string and compare the returned list? It seems like needless overhead to use a REGEX in this case. I know your format may differ, but it would seem that you could:

  • Split the string based on the 'title separator', in your case a colon :
  • Take the 2nd half of the result, the acronym string, and split it based on the acronym separator, in this case a pipe |
  • Finally, iterate over the newly split list of acronyms and compare each to your list of candidates with a nested for loop

EDIT: If you only need to know if a particular acronym or set of acronyms exist inside a string, use the .Search() method instead of .Match().

Soviut
I hadn't thought about a nested loop. The example string and acronym I gave is on;y a subset of the many different formats the acronym's are stored... so splitting the string wouldn't be as efficient as using a Regular Expression...
Dscoduc
I'm sorry, I meant the string I would be searching has many different formats so I couldn't reliably split the string value...
Dscoduc
A: 

The regex approach seems efficient and elegant enough. Of course, you'll have to watch out for unescaped characters when building the expression, or a failure to compile it because of complexity or size limitations.

Another way to do this would be to construct a trie data structure to represent all the acronyms (this may somewhat duplicate what the regex matcher is doing). As you step through each character in the string, you would create a new pointer to the root of the trie, and advance existing pointers to the appropriate child (if any). You get a match when any pointer reaches a leaf.

Zach Scrivena
Argh! I just glanced at the trie data structure link and now my brain hurts... Seriously though, I should look closer at this to see if it would work... Thanks for the suggestion!
Dscoduc
Yup, it's probably not worth the effort since regex is really much simpler.
Zach Scrivena
A: 

Here is what I came up with. I would appreciate any constructive criticism that you could offer...

First, create an enum that holds each of my acronym's:

enum acronym
{ AB1,DE2,CC3 }

Next I create a string array of the enum:

string[] acronyms = Enum.GetNames(typeof(acronym));

Finally I loop through the string array and peform the regex.match method:

foreach (string a in acronyms)
{
    Match aMatch = Regex.Match(input, a.ToString(), RegexOptions.None);
    if (aMatch.Success)
    {
        ...<do something>...
        break;
    }
}

See anything wrong with that?

Dscoduc
Perhaps you could just use if(input.Contains(a)) instead of Regex?
it depends
Using .Contains would probably run faster but I need to capture the matching value (sorry, I just updated the question)...
Dscoduc
If acronyms are just strings (not RegExs) then the matched value by definition equals to the acronym you are using to match against e.g., `if (input.Contains(a)) { matchedValue = a; ... }`.
J.F. Sebastian
+2  A: 

Personally I don't think 30 is particularly large for a regex so I wouldn't be too quick to rule it out. You can create the regex with a single line of code:

var acronyms = new[] { "AB", "BC", "CD", "ZZAB" };
var regex = new Regex(string.Join("|", acronyms), RegexOptions.Compiled);
for (var match = regex.Match("ZZZABCDZZZ"); match.Success; match = match.NextMatch())
    Console.WriteLine(match.Value);
// returns AB and CD

So the code is relatively elegant and maintainable. If you know the upper bound for the number of acronyms I would to some testing, who knows what kind of optimizations there are already built into the regex engine. You'll also be able to benefit for free from future regex engine optimizations. Unless you have reason to believe performance will be an issue keep it simple.

On the other hand regex may have other limitations e.g. by default if you have acronyms AB, BC and CD then it'll only return two of these as a match in "ABCD". So its good at telling you there is an acronym but you need to be careful about catching multiple matches.

When performance became an issue for me (> 10,000 items) I put the 'acronyms' in a HashSet and then searched each substring of the text (from min acronym length to max acronym length). This was ok for me because the source text was very short. I'd not heard of it before, but at first look the Aho-Corasick algorithm, referred to in the question you reference, seems like a better general solution to this problem.

it depends
Nice! One question... You mentioned the match would include both matches... wouldn't only the first match be returned (what I really want) since the for loop ends when the match.Success is met?
Dscoduc
It loops while there's success, so for me at least, it returns both AB nd CD, but doesn't return BC. If you want to check only for any match then you could use if(regex.IsMatch(inpuut)){..}
it depends
doh... Just figured that out... Couldn't you use the regex.Matches("ZZZABCDZZZ") instead of the for..loop?In any case, very elegant solution and answers my question! Thanks!
Dscoduc