Hi all

I know that I can use [a-z] to match any letter from a to z in CF 8. However, is there a regex that detects Spanish letters like á, í, ó, é, ñ, etc.?

Thanks in advance, Monte

A: 

There was recently a discussion here about international regexes, which I cannot find right now. I believe the current situation is that regular expressions generally work only with the default Latin alphabet.

User
Second that. At this point it's pretty much still "build your own" when working outside the standard Latin alphabet.
patjbs
A: 

See whether the special "word character class" \w works for you. Caution: it also matches digits. Perhaps you could clarify with an example what exactly you want to accomplish?

\w should match a, ä or á (but also 0).

\w(?<!\d) will match a, ä or á (but not 0).

\w+ will match börk but also l33t.

\b(?:\w(?<!\d))+\b will match börk but not l33t.
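For comparison, here is a minimal sketch of the last pattern in an engine where \w can be made Unicode-aware. It assumes Java's java.util.regex with the (?U) flag (UNICODE_CHARACTER_CLASS), which is not what CF's built-in regex functions use; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWordClassDemo {
    // (?U) switches \w, \d and \b to their Unicode definitions in Java.
    // Without it, Java's \w only covers [a-zA-Z_0-9].
    private static final Pattern NON_DIGIT_WORD =
        Pattern.compile("(?U)\\b(?:\\w(?<!\\d))+\\b");

    // Returns every whole word that contains no digit characters.
    static List<String> nonDigitWords(String input) {
        List<String> words = new ArrayList<>();
        Matcher m = NON_DIGIT_WORD.matcher(input);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // \u00f6 is the letter o with diaeresis, so the input is "boerk"-style text plus "l33t"
        System.out.println(nonDigitWords("b\u00f6rk l33t"));
    }
}
```

Note that \w still accepts the underscore here, so this is "words without digits" rather than strictly "letters only".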

Tim Pietzcker
With CF regex, \w only matches alphanumeric and underscore, and it doesn't support negative lookbehinds, so none of your examples will work as intended.
Peter Boughton
+2  A: 

ColdFusion's own regex engine doesn't deal nicely with Unicode. You can use things like #Chr(375)# to get the characters into a regex string, but it's a bit messy having to do that.

However, Java does work with Unicode, and since CF can utilise Java easily, you can use Java regexes to do Unicode matching.


This will match a single Unicode letter in Java regex:

\p{L}

With more details on regex Unicode here: http://www.regular-expressions.info/unicode.html
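As a rough illustration of what \p{L} accepts, here is a small plain-Java sketch. The class and method names are hypothetical, and the accented characters are written as Unicode escapes so the file does not depend on source encoding.

```java
import java.util.regex.Pattern;

class UnicodeLetterDemo {
    // \p{L} matches any single code point in the Unicode "Letter" category.
    private static final Pattern LETTER = Pattern.compile("\\p{L}");

    // True when the whole input is exactly one Unicode letter.
    static boolean isSingleLetter(String s) {
        return LETTER.matcher(s).matches();
    }

    public static void main(String[] args) {
        // a, a-acute, n-tilde, then a digit and an underscore for contrast
        String[] samples = {"a", "\u00e1", "\u00f1", "0", "_"};
        for (String s : samples) {
            System.out.println(s + " -> " + isSingleLetter(s));
        }
    }
}
```

Unlike \w, this class excludes digits and the underscore, so it is closer to what the question asks for.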


As for using Java regex in CF: a CF string is a Java string underneath, so a simple replace is just this:

<cfset NewString = OldString.replaceAll('\p{L}','ReplaceWith') />

So if all you need is to replace strings, you can do that.
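The replaceAll call above is the ordinary java.lang.String method, so its behaviour can be checked in plain Java. A minimal sketch, with a made-up helper name:

```java
class ReplaceLettersDemo {
    // Replaces every Unicode letter in the input with the given replacement text.
    static String maskLetters(String input, String replacement) {
        return input.replaceAll("\\p{L}", replacement);
    }

    public static void main(String[] args) {
        // \u00f1 is n-tilde; the three letters are replaced, the space and digits are kept
        System.out.println(maskLetters("a\u00f1o 2009", "*")); // prints "*** 2009"
    }
}
```

One thing to watch: in replaceAll the replacement string treats $ and \ specially, so if the replacement comes from user input it is safer to pass it through java.util.regex.Matcher.quoteReplacement first.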

However, if you want matching (equivalent to rematch), or more complex functionality, then the simplest solution is to use a component that wraps the Java regex functionality in an easy-to-use CFC with regular CFML functions you can call, such as jre-utils.cfc.

This allows you to do:

<cfset jre = createObject('component','jre-utils').init() />

<cfset Matches = jre.match( '\p{L}++' , String ) />

Which will return an array of the (Unicode) words in the string.
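For reference, a wrapper like this presumably ends up doing something along the following lines with java.util.regex. This is only a sketch of the general technique, not the actual jre-utils.cfc code, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchAllDemo {
    // Collects every non-overlapping match of the regex in the input, in order.
    static List<String> matchAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // \p{L}++ is a possessive "one or more letters"; digits and spaces split the words
        String text = "El ni\u00f1o comi\u00f3 3 manzanas";
        System.out.println(matchAll("\\p{L}++", text));
    }
}
```

The same Pattern and Matcher classes can also be reached directly from CF via createObject('java', ...) if you would rather not add a component.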


Peter Boughton