I'm using the regular expression library icucore via RegKit on the iPhone to replace a pattern in a large string.

The pattern I'm looking for looks something like this:

| hello world (P1)|

I'm matching this pattern with the following regular expression

\|((\w*|.| )+)\((\w\d+)\)\|

When a match is found, this captures 3 groups, of which group 1 (the string) and group 3 (the string in parentheses) are of interest to me.

I'm converting these formatted strings into HTML links, so the above would be transformed into:

<a href="P1">Hello world </a>

My problem is the trailing space in the first group. When the link is highlighted and underlined, the underline extends beyond the printed characters.

While I know I could extract all the matches and process them manually, I would rather not, since the search and replace feature of the ICU library is a much cleaner solution.
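To make the problem concrete, here is a minimal sketch that reproduces it. It uses Python's `re` module as a stand-in for ICU (an assumption made for illustration; the pattern syntax used here is common to both, but ICU replacement strings write `$1` where Python writes `\1`). The name `to_link_original` is hypothetical.

```python
import re

# The pattern from the question. Group 1 is the link text, group 2 is the
# inner alternation, and group 3 is the code in parentheses.
ORIGINAL = re.compile(r"\|((\w*|.| )+)\((\w\d+)\)\|")

def to_link_original(text):
    # Python backreferences are \1 and \3; ICU would use $1 and $3.
    return ORIGINAL.sub(r'<a href="\3">\1</a>', text)

result = to_link_original("| hello world (P1)|")
print(repr(result))  # the space before (P1) ends up inside the link text
```

The captured text keeps the space that precedes `(P1)`, which is what makes the underline run past the last printed character.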

Many thanks as always

+1  A: 

Would the following work as an alternate regular expression?

\|((\w*|.| )+)\s+\((\w\d+)\)\|

where inserting the extra \s+ pulls the space outside the first group.

Though, given your example & regex, I'm not sure why you don't just do:

\|(.+)\s+\((\w\d+)\)\|
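A quick sketch of the simpler pattern in action, again using Python's `re` as a stand-in for ICU (an assumption; `to_link_simple` is a hypothetical name):

```python
import re

# The simpler pattern: group 1 is the link text, group 2 is the code.
SIMPLE = re.compile(r"\|(.+)\s+\((\w\d+)\)\|")

def to_link_simple(text):
    # Python backreferences are \1 and \2; ICU would use $1 and $2.
    return SIMPLE.sub(r'<a href="\2">\1</a>', text)

one_space = to_link_simple("| hello world (P1)|")
# Because .+ is greedy, it gives up only one whitespace character to \s+,
# so with two spaces before "(" one of them stays inside group 1.
two_spaces = SIMPLE.search("| hello world  (P1)|").group(1)

print(repr(one_space))
print(repr(two_spaces))
```

With a single space before the parenthesis the trailing space is gone. The leading space after the opening `|` is still captured, and extra trailing spaces are only partly removed.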

which will have the same effect. However, both your original regex and my simpler one would fail on:

| hello world (P1)| and on the same line | howdy world (P1)|

where the greedy match would roll both into a single match.
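The roll-up can be seen in a short sketch (Python's `re` standing in for ICU, which is an assumption). The second pattern, `PER_LINK`, is not part of this answer: it is one hypothetical way around the issue, assuming the link text never contains a `|`.

```python
import re

line = "| hello world (P1)| and on the same line | howdy world (P1)|"

# The simpler pattern from above: .+ is greedy and runs to the last "(P1)|".
GREEDY = re.compile(r"\|(.+)\s+\((\w\d+)\)\|")

# Hypothetical alternative: forbid "|" inside the text and make the
# quantifier lazy so each link is matched on its own.
PER_LINK = re.compile(r"\|([^|]+?)\s+\((\w\d+)\)\|")

greedy_matches = GREEDY.findall(line)
per_link_matches = PER_LINK.findall(line)

print(len(greedy_matches), greedy_matches)
print(len(per_link_matches), per_link_matches)
```

The greedy pattern reports one match spanning both links, while the restricted pattern reports two.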

Mike R
Thanks Mike, the first one you gave worked like a charm. (I wrote a small test suite, and this passed all the tests.) Nice to know I wasn't far off.
Jonathan
+1  A: 
\|\s*([\w ,.-]+)\s+\((\w\d+)\)\|

will put the trailing space(s) outside the capturing group. This will of course only work if there is always a space. Can you guarantee that?

If not, use

\|\s*([\w ,.-]+(?<!\s))\s*\((\w\d+)\)\|

This uses a lookbehind assertion to make sure the capturing group ends in a non-space character.
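A small sketch comparing the two patterns, using Python's `re` as a stand-in for ICU (an assumption; both support this fixed-width lookbehind). The names `NEEDS_SPACE`, `ANY_SPACING` and `to_link` are hypothetical.

```python
import re

# First pattern: requires at least one space before "(".
NEEDS_SPACE = re.compile(r"\|\s*([\w ,.-]+)\s+\((\w\d+)\)\|")
# Second pattern: the lookbehind forces group 1 to end on a non-space,
# so the space before "(" becomes optional.
ANY_SPACING = re.compile(r"\|\s*([\w ,.-]+(?<!\s))\s*\((\w\d+)\)\|")

def to_link(pattern, text):
    # Python backreferences are \1 and \2; ICU would use $1 and $2.
    return pattern.sub(r'<a href="\2">\1</a>', text)

with_space = to_link(NEEDS_SPACE, "| hello world (P1)|")
no_space_first = to_link(NEEDS_SPACE, "|hello world(P1)|")   # no match, unchanged
no_space_second = to_link(ANY_SPACING, "|hello world(P1)|")
punctuated = to_link(ANY_SPACING, "| well-known, e.g. this  (A12)|")

for value in (with_space, no_space_first, no_space_second, punctuated):
    print(repr(value))
```

Both patterns also drop the leading space, because `\s*` consumes it before the capturing group starts.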

Tim Pietzcker
Thanks Tim, both of these were close, and I especially like that they reduced the number of groups to the correct amount (2). However, I have to support punctuation and hyphenated words, and I couldn't work out where to place the OR to match non-letter characters. Lookahead/lookbehind assertions are something I need to get more familiar with.
Jonathan
Oh, OK. Character classes are your friend - and much easier and faster than alternation: `[\w .-]` matches the same stuff as `(\w| |\.|-)`. I've edited my answer - and of course it's easy to add more characters if you need them.
Tim Pietzcker
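The equivalence mentioned in the comment above can be checked with a short sketch (Python's `re` standing in for ICU, which is an assumption; the sample strings are made up for illustration):

```python
import re

# Character class: word characters, space, dot and hyphen.
CLASS = re.compile(r"[\w .-]+")
# The same set written as an alternation.
ALTERNATION = re.compile(r"(\w| |\.|-)+")

sample = "well-known e.g. text"
class_ok = CLASS.fullmatch(sample) is not None
alternation_ok = ALTERNATION.fullmatch(sample) is not None

# A comma is in neither set, so both forms reject this string.
class_rejects = CLASS.fullmatch("a,b") is None
alternation_rejects = ALTERNATION.fullmatch("a,b") is None

print(class_ok, alternation_ok, class_rejects, alternation_rejects)
```

Adding another permitted character, such as a comma, is a one-character change in the class form.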