ansaurus

Question

Answer 1

+5 A:

Try this:

(?:[\w\-](?<!_))+

It does a simple match on anything that \w matches (or a dash), then uses a zero-width negative lookbehind to ensure that the character just matched is not an underbar.

Otherwise you could pick this one:

(?:[^_\W]|-)+

which is a more set-based approach (note the uppercase W: \W is the negation of \w, so [^_\W] matches any word character except the underbar).
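
A minimal sketch of how the two patterns above behave. It is written in Python rather than PHP, which is an assumption made for illustration: Python's `re` treats \w as Unicode-aware for str patterns, which is only roughly comparable to PCRE with the u modifier. The sample string and variable names are hypothetical.

```python
import re

# Assumption: Python's `re` stands in for PHP's PCRE here.
# Variant 1: match \w or a dash, then reject the character if it was "_".
LOOKBEHIND = re.compile(r"(?:[\w\-](?<!_))+")
# Variant 2: "not underscore and not a non-word character", or a dash.
NEGATED_SET = re.compile(r"(?:[^_\W]|-)+")

# Hypothetical sample; \u00e9 and \u00e0 are precomposed accented letters.
sample = "foo_bar-baz d\u00e9j\u00e0_vu 42"

lookbehind_matches = LOOKBEHIND.findall(sample)
negated_matches = NEGATED_SET.findall(sample)

# Both variants split on "_" and whitespace but keep the dash.
print(lookbehind_matches)
print(negated_matches)
```

Both variants should produce the same list of matches; the second avoids the lookbehind entirely, which is usually easier to read.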

OK, I had a lot of fun with Unicode in PHP's flavor of PCRE :D Peekaboo says there is a simple solution available:

[\p{L}\p{N}\-]+

\p{L} matches anything in Unicode that qualifies as a letter (note: not a word character, thus no underbars), while \p{N} matches anything that looks like a number (including Roman numerals and more exotic things).
\- is just an escaped dash. Although not strictly necessary, I tend to make it a point to escape dashes in character classes... Note that there are dozens of different dashes in Unicode, thus giving rise to the following version:

[\p{L}\p{N}\p{Pd}]+

Where "Pd" is the Dash Punctuation category, including, but not limited to, our minus-dash-thingy. (Note: again, no underbar here.)
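
Python's built-in `re` module does not support \p{...}, so the following is only a sketch that emulates the [\p{L}\p{N}\p{Pd}]+ class with `unicodedata.category` instead of a regex. The helper names `is_allowed` and `allowed_runs` and the sample string are hypothetical.

```python
import unicodedata
from itertools import groupby


def is_allowed(ch):
    """True for letters (L*), numbers (N*) and dash punctuation (Pd)."""
    category = unicodedata.category(ch)
    return category[0] in "LN" or category == "Pd"


def allowed_runs(text):
    """Return maximal runs of allowed characters, like findall on [...]+."""
    return [
        "".join(group)
        for allowed, group in groupby(text, key=is_allowed)
        if allowed
    ]


# Hypothetical sample; \u2013 is an en dash, which is also in category Pd.
sample = "na\u00efve_caf\u00e9 x\u2013y 42"
runs = allowed_runs(sample)
print(runs)
```

The underbar falls in category Pc (connector punctuation), so it splits the runs, while both the ASCII hyphen and the en dash are kept.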

dionadar 2010-01-14 04:50:14

Good one, thanks!

Alix Axel 2010-01-14 04:51:36

Will negating \W not include the hyphen?

codaddict 2010-01-14 05:02:07

@dionadar - this doesn't match accented characters for me.

meder 2010-01-14 05:09:46

@codaddict As far as I know, the hyphen is not included in \w - and even if it were, it would not hurt to state it like this ;)

dionadar 2010-01-14 05:12:25

@meder OP states: "The \w [...] also matches UTF-8 chars if I have the u modifier set."

dionadar 2010-01-14 05:14:04

@meder: Both the regexes work for me, maybe it's the PHP version you're testing with?

Alix Axel 2010-01-14 05:17:19

@Alix, dionadar - perhaps I just missed something. Output: http://medero.org/dump/i18n.php, code: http://medero.org/dump/code.txt. Are the accented characters matching for you guys?

meder 2010-01-14 05:35:20

@meder: Yes, see http://www.rubular.com/regexes/12922, http://www.rubular.com/regexes/12923 and http://www.rubular.com/regexes/12924. I get the exact same matches on PHP 5.3.0 / Windows 7.

Alix Axel 2010-01-14 05:48:15

@meder: My guess is that your `i18n.php` file is not encoded as UTF-8 without a BOM?

Alix Axel 2010-01-14 05:55:47

@meder, Alix: Substituting the \w with \p{L} should reduce incompatibility problems, will update answer asap

dionadar 2010-01-14 05:58:53

@dionadar: The problem with `\p{L}` is that it matches only letters - I also need numbers, and I have no idea what the difference between `\p{Nd}`, `\p{Nl}` and `\p{No}` is. If you do, please let me know.

Alix Axel 2010-01-14 06:08:16

\p{N} includes all kinds of numbers - afaik Nd does the 0-9 dance (plus the decimal digits of other scripts), while Nl includes Roman numerals (in Unicode a Roman 1 is not the letter I, but rather something that looks like it) and No is pretty much everything they could not fit in the other two, but still is a number.

dionadar 2010-01-14 06:19:41
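
The three number categories from the comment above can be inspected directly. This is a small sketch using Python's `unicodedata` module (not PHP, which is an assumption made for illustration); the sample characters are chosen arbitrarily.

```python
import unicodedata

# Arbitrary sample characters, one or two per category of interest.
samples = [
    "7",       # ASCII digit
    "\u0663",  # Arabic-Indic digit three
    "\u2163",  # Roman numeral four (a single code point, not "IV")
    "\u00bd",  # vulgar fraction one half
    "I",       # the plain Latin letter, for comparison
]

categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, category in categories.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {category}")
```

Decimal digits of any script report Nd, the Roman numeral reports Nl, the fraction reports No, and the letter I reports Lu, which is why \p{N} does not match it.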

@dionadar: Thanks for the explanation, I had already posted a related question: http://stackoverflow.com/questions/2062521/regex-unicode-properties-reference-and-examples, hopefully I'll be able to understand `\p` classes a little better.

Alix Axel 2010-01-14 06:21:55

Oh, on a side note: I tested the last version with an English-US PHP 5.something on an Ubuntu VM against German umlauts (my Spanish does not even go far enough to spell movie quotes), and it works (without the /u switch, I might add) even if the file is not Unicode encoded.

dionadar 2010-01-14 06:22:41

Answer 2

+1 A:

I am not sure which language you use, but in Perl you can simply write [[:alnum:]-]+ when the correct locale is set.

Jiri Klouda 2010-01-14 05:33:30

That's nice to know, but I'm using PHP (PCRE engine).

Alix Axel 2010-01-14 05:34:18

Tried it in PHP and Rubular (Ruby), see http://www.rubular.com/regexes/12922 and http://www.rubular.com/regexes/12923.

Alix Axel 2010-01-14 05:39:37

I've corrected a small mistake there.

Jiri Klouda 2010-01-14 05:54:23

[:alnum:] could be replaced with \p{IsAlnum}; in PCRE you could try \p{L}\p{N}.

Jiri Klouda 2010-01-14 06:04:53

RegEx: \w - "_" + "-" in UTF-8
