views:

213

answers:

4

First, a brief example, let's say I have this "/[0-9]{2}°/" regex and this text "24º". The text won't match, obviusly ... (?) really, it depends on the font.

Here is my problem, I do not have control on which chars the user uses, so, I need to cover all possibilities in the regex /[0-9]{2}[°º]/, or even better, assure that the text has only the chars I'm expecting °. But I can't just remove the unknow chars otherwise the regex won't work, I need to change it to the chars that looks like it and I'm expecting. I have done this through a little function that maps the "look like" to "what I expect" and change it, the problem is, I have not covered all possibilities, for example, today I found a new "-", now we got three of them, just like latex =D - -- --- ,cool , but the regex didn't work.

Does anyone knows how I might solve this?

A: 

There is no way to include characters with a "similar appearance" in a regular expression, so basically you can't.

For a specific character, you may have luck with the Unicode specification, which may list some of the most common mistakes, but you have no guarantee. In case of the degree sign, the Unicode code chart lists four similar characters (\u02da, \u030a, \u2070 and \u2218), but not your problematic character, the masculine ordinal indicator.

jarnbjo
A: 

Unfortunately not in PHP. ASP.NET has unicode character classes that cover things like this, but as you can see here, :So covers too much. Also as it's not PHP doesn't help anyway. :)

In PHP you are going to be limited to selecting the most common character sets and using them.

This should help: http://unicode.org/charts/charindex.html

There is only one degree symbol. Using something that looks similar is not correct. There are also symbols for degree Fahrenheit and celsius. There are tons of minus signs unfortunately.

Erik Nedwidek
+1  A: 

Your regular expression will indeed need to list all the characters that you want to accept. If you can't know the string's encoding in advance, you can specify your regular expression to be UTF-8 using the /u modifier in PHP: "/[0-9]{2}[°º]/u" Then you can include all Unicode characters that you want to accept in your character class. You will need to convert the subject string to UTF-8 also before using the regex on it.

Jan Goyvaerts
A: 

Ok, if you're looking to pull temp you'll probably need to start with changing a few things first.

temperatures can come in 1 to 3 digits so [0-9]{1,3} (and if someone is actually still alive to put in a four digit temperature then we are all doomed!) may be more accurate for you.

Now the degree signs are the tricky part as you've found out. If you can't control the user (more's the pity), can you just pull whatever comes next?

[0-9]{1,3}.

You might have to beef up the first part though with a little position handling like beginning of the string or end.

You may also exclude all the regular characters you don't want.

[0-9]{1,3}[^a-zA-Z]

That will pick up all the punctuation marks (only one though).

Keng