tags:

views:

301

answers:

6
+3  Q: 

Regex misspellings

I have a regex created from a list in a database to match names for types of buildings in a game. The problem is typos, sometimes those writing instructions for their team in the game will misspell a building name and obviously the regex will then not pick it up (i.e. spelling "University" and "Unversity").

Are there any suggestions on making a regex match misspellings of 1 or 2 letters?

The regex is dynamically generated and run on a local machine that's able to handle a lot more load so I have as a last resort to algorithmically create versions of each word with a letter missing and then another with letters added in.

I'm using PHP but I'd hope that any solution to this issue would not be PHP specific.

+11  A: 

Allow me to introduce you to the Levenshtein Distance, a measure of the difference between strings as the number of transformations needed to convert one string to the other.

It's also built into PHP.

So, I'd split the input file by non-word characters, and measure the distance between each word and your target list of buildings. If the distance is below some threshold, assume it was a misspelling.

I think you'd have more luck matching this way than trying to craft regex's for each special case.

Triptych
stupid 200 rep cap.
Triptych
The price of being awesome ^^
Teifion
Combined with some custom code for checking for silly characters this works really well. Thanks :)
Teifion
+1  A: 

This might be overkill, but Peter Norvig of Google has written an excellent article on writing a spell checker in Python. It's definitely worth a read and might apply to your case.

At the end of the article, he's also listed contributed implementations of the algorithm in various other languages.

codelogic
+3  A: 

Google's implementation of "did you mean" by looking at previous results might also help:

http://stackoverflow.com/questions/41424/how-do-you-implement-a-did-you-mean

Neil Barnwell
+2  A: 

Back in the days we sometimes used Soundex() for these problems.

PEZ
What is Soundex() ?
Teifion
I think there is a strict definition using some algorithm to represent how a word sound. Many databases have a SOUNDEX() function that implements it. But less strictly it can be any function. It's very fun to write them! =) Search this site for "soundex". It shows up now and then.
PEZ
+2  A: 

You're in luck; the algorithms folks have done lots of work on approximate matching of regular expressions. The oldest of these tools is probably agrep originally developed at the University of Arizona and now available in a nice open-source version. You simply tell agrep how many mistakes you are willing to tolerate and it matches from there. It can also match other blocks of text besides lines. The link above has links to a newer, GPLed version of agrep and also a number of language-specific libraries for approximate matching of regular expressions.

Norman Ramsey
+3  A: 

What is Soundex() ? – Teifion (28 mins ago)

A soundex is similar to the levenshtein function Triptych mentions. It is a means of comparing strings. See: http://us3.php.net/soundex

You could also look at metaphone and similar_text. I would have put this in a comment but I don't have enough rep yet to do that. :D

Will Bickford
Well have 10 rep from me for a good answer :)
Teifion
10 rep from me too. Comment away.
PEZ