ansaurus

Question

Regex word-break with unicode diacritics

Answer 1

+1 A:

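The equivalent of /(?:(?=\B).)*/ in a Unicode context, where letters (\p{L}), combining marks (\p{M}), numbers (\p{N}) and connector punctuation (\p{Pc}) all count as word characters, would be: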

/
(?:
  (?: (?<=[\p{L}\p{M}\p{N}\p{Pc}]) (?=[\p{L}\p{M}\p{N}\p{Pc}])
  |   (?<![\p{L}\p{M}\p{N}\p{Pc}]) (?![\p{L}\p{M}\p{N}\p{Pc}])
  )
  .
)*
/x

...or somewhat simplified:

/(?:[\p{L}\p{M}\p{N}\p{Pc}]+|[^\p{L}\p{M}\p{N}\p{Pc}]+)?/

This matches either a run of word characters or a run of non-word characters (spacing, punctuation, etc.). Because of the trailing ?, the match may also be empty.

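A normal or negated word boundary (\b or \B) is basically a double look-around: a look-behind that checks the type of the character preceding the current position, and a look-ahead that checks the type of the character following it. \b matches where the two types differ (one word, one non-word); \B matches where they are the same.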
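To make the double look-around concrete, here is a minimal sketch in Python. It is an illustration, not part of the regex itself: Python's standard `re` module does not support `\p{...}` classes, so the sketch assumes we can emulate `[\p{L}\p{M}\p{N}\p{Pc}]` with `unicodedata.category`. The helper names `is_word_char` and `is_word_boundary` are made up for this example.

```python
import unicodedata


def is_word_char(ch):
    # Emulates [\p{L}\p{M}\p{N}\p{Pc}]: letters, marks, numbers,
    # and connector punctuation (such as "_").
    cat = unicodedata.category(ch)
    return cat[0] in "LMN" or cat == "Pc"


def is_word_boundary(text, pos):
    # Look-behind: type of the character just before pos
    # (the start of the string counts as non-word).
    before = pos > 0 and is_word_char(text[pos - 1])
    # Look-ahead: type of the character at pos
    # (the end of the string counts as non-word).
    after = pos < len(text) and is_word_char(text[pos])
    # \b holds where the two types differ, \B where they agree.
    return before != after


# "cafe" + COMBINING ACUTE ACCENT (U+0301), a space, then "bar"
text = "cafe\u0301 bar"
boundaries = [i for i in range(len(text) + 1) if is_word_boundary(text, i)]
print(boundaries)  # [0, 5, 6, 9]
```

Note that there is no boundary at position 4, between the "e" and its combining accent, because the accent (category Mn) is treated as a word character.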

In the second regex, I removed the look-arounds and used simple character classes instead.
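The effect of applying the second regex repeatedly can be sketched without look-arounds at all: just group consecutive characters by whether they fall in the word class. As a rough illustration only (again assuming `unicodedata.category` as a stand-in for the `\p{...}` classes, with the hypothetical helper names `is_word_char` and `split_runs`):

```python
import unicodedata
from itertools import groupby


def is_word_char(ch):
    # Emulates [\p{L}\p{M}\p{N}\p{Pc}].
    cat = unicodedata.category(ch)
    return cat[0] in "LMN" or cat == "Pc"


def split_runs(text):
    # Each group is a maximal run of word characters or of
    # non-word characters, i.e. one non-empty match of the second regex.
    return ["".join(group) for _, group in groupby(text, key=is_word_char)]


# Precomposed "ï" (U+00EF) and decomposed "é" (e + U+0301) both stay
# inside their words.
runs = split_runs("na\u00efve, re\u0301sume\u0301!")
print(runs)
```

With an engine that does support `\p{...}` (Perl, PCRE, Java, or the third-party `regex` module for Python), the regex from the answer can be used directly instead.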

MizardX 2009-10-02 22:27:06
