Regular expression languages define \w to include A..Z, a..z, 0..9, and _, and \b as a word boundary.

How can I write a regular expression that matches all valid Spanish words, including characters such as: á, í, ó, é, ñ, etc.?

I'm using .NET.

+4  A: 

Use a Spanish locale and make your regex locale-sensitive.

Dave
+1  A: 

Your regex system should have something equivalent to Python's re.L (aka re.LOCALE) to make a regex locale-dependent, so that what's a word-character and what isn't changes with locale, as do "word boundaries" etc. Are you instead asking for a way to compensate for some given regex system not supporting locale, trying to force the issue anyway...?
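To illustrate the point in Python specifically: in Python 3, re.LOCALE is only accepted for bytes patterns, while str patterns already treat \w and \b according to Unicode, so no locale flag is needed for Spanish text. A minimal sketch (the sample sentence is made up for illustration) contrasting the default behaviour with re.ASCII, which mimics an engine that only knows A..Z:

```python
import re

# Hypothetical sample text, chosen only to exercise accented letters.
text = "El niño comió jamón y bebió café en Málaga."

# Default for str patterns in Python 3: \w and \b follow Unicode rules,
# so accented letters count as word characters.
unicode_words = re.findall(r"\b\w+\b", text)

# re.ASCII restricts \w to [A-Za-z0-9_]; accented letters then act as
# separators and words get split in the middle.
ascii_words = re.findall(r"\b\w+\b", text, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```

With the default flags "niño" comes back as a single word; under re.ASCII it is split into "ni" and "o", which is the behaviour the question is trying to avoid.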

Alex Martelli
A: 

I did some Googling.

This may help you use French words with a regex.

For more, search.

Ish Kumar
A: 

This depends heavily on the language (and regex engine) you're using.

In Perl, \w matches all word characters, regardless of language or alphabet, provided the text has been decoded as Unicode characters rather than raw bytes. Something like /\b(\w+)\b/ would (probably) match Spanish words as well as English or Russian words.

In languages using PCRE, \w (and therefore probably \b) do NOT match non-ASCII characters by default. You will probably need to build your own set. I suggest something like [\wáéíóúñ], which matches all word characters plus the accented characters you want; add ü and the uppercase forms if you need them. Note that the PCRE library has to be built with Unicode support before this will even work.
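As a rough sketch of the "build your own set" approach, here it is in Python with re.ASCII switched on to imitate an engine whose \w only covers A..Z, a..z, 0..9, and _. The names SPANISH_EXTRA, WORD_CHAR, spanish_words, and contains_word are made up for this example, and the letter list (with ü and uppercase forms added to the set suggested above) is an assumption about which characters you care about. Because \b still relies on the ASCII-only \w, whole-word matching uses lookarounds built from the same set instead:

```python
import re

# Assumed list of Spanish letters outside A-Z; extend it as needed.
SPANISH_EXTRA = "áéíóúüñÁÉÍÓÚÜÑ"

# Custom "word character" class: ASCII \w plus the letters above.
WORD_CHAR = rf"[\w{SPANISH_EXTRA}]"

# re.ASCII keeps \w limited to [A-Za-z0-9_], like a non-Unicode engine.
WORD_RUN = re.compile(rf"{WORD_CHAR}+", flags=re.ASCII)


def spanish_words(text):
    # Return every maximal run of characters from the custom class.
    return WORD_RUN.findall(text)


def contains_word(word, text):
    # Whole-word search: lookarounds on the custom class stand in for \b,
    # which would wrongly see a boundary next to an accented letter.
    pattern = rf"(?<!{WORD_CHAR}){re.escape(word)}(?!{WORD_CHAR})"
    return re.search(pattern, text, flags=re.ASCII) is not None


print(spanish_words("La cigüeña voló sobre el cañón."))
print(contains_word("año", "Feliz año nuevo"))
print(contains_word("año", "Hace muchos años"))
```

This assumes the input uses precomposed characters (for example ñ as a single code point); text with combining accents would need normalizing first.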

If you're using something else, good luck. Some regex engines don't even support Unicode.

Chris Lutz