Currently I use this regex:

"\bI([ ]{1,2})([a-zA-Z]|\d){2,13}\b"

It was just brought to my attention that the text that I use this against could contain a "\" (backslash). How do I add this to the expression?

Thanks

+2  A: 

Add |\\ inside the group, after the \d for instance.

It would be better to use just a character class: [a-zA-Z\d\\] than the alternations.
Jonathan Leffler
And '[[:alnum:]]' is probably clearer still. Yours works, but ...
Jonathan Leffler
A \\ followed by \b does not behave as intended, because a backslash is not a word character, so the whole problem needs to be refactored.
Axeman
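
To illustrate the two suggestions above, here is a minimal sketch using Python's re module. The question does not say which regex engine is in use, so Python is an assumption, as are the variable names and the sample text.

```python
import re

# Alternation form: "|\\" added inside the group, after the \d.
alternation = re.compile(r"\bI([ ]{1,2})([a-zA-Z]|\d|\\){2,13}\b")

# Character-class form suggested in the comments.
char_class = re.compile(r"\bI([ ]{1,2})([a-zA-Z\d\\]{2,13})\b")

# Hypothetical sample: the token after "I " contains a backslash.
sample = "I a\\b1"  # six characters: I, space, a, \, b, 1

alt_match = alternation.search(sample)
cls_match = char_class.search(sample)

# Both forms accept the backslash and match the whole sample.
print(alt_match.group(0))
print(cls_match.group(0))
```

Both patterns match the same text here; the character class is simply shorter to read. Neither addresses the trailing \b concern raised in the last comment.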
+1  A: 

This expression could be simplified if you're also willing to allow the underscore character in the second capture group and to use shorthand character classes such as \w. That changes this:

([a-zA-Z]|\d){2,13}

into this ...

([\w]{2,13})

and you can also add a test for the backslash character with this ...

([\w\x5c]{2,13})

which makes the regex just a tad easier to eyeball, depending on your personal preference.

"\bI([\x20]{1,2})([\w\x5c]{2,13})\b"

dreftymac
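
A quick way to check what the \w and \x5c version accepts, sketched in Python's re module (an assumption, since the question does not name an engine; the sample strings are made up for illustration):

```python
import re

# \x5c is the hexadecimal escape for the backslash character.
backslash = "\x5c"

original = re.compile(r"\bI([ ]{1,2})([a-zA-Z]|\d){2,13}\b")
simplified = re.compile(r"\bI([\x20]{1,2})([\w\x5c]{2,13})\b")

# Hypothetical samples, not taken from the question.
with_underscore = "I a_b"
with_backslash = "I a\\b1"

# The original pattern rejects the underscore; the \w version accepts it.
rejected = original.search(with_underscore)
accepted = simplified.search(with_underscore)
escaped = simplified.search(with_backslash)

print(rejected)
print(accepted.group(2))
print(escaped.group(2))
```

Note that in Python, \w on str patterns also matches non-ASCII word characters unless re.ASCII is passed, so the simplified pattern is broader than [a-zA-Z\d_].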
A: 

Both @slavy13 and @dreftymac give you the basic solution with pointers, but...

  • You can use \d inside a character class to mean a digit.
  • You don't need to put blank into a character class to match it (except, perhaps, for clarity, though that is debatable).
  • You can use [:alpha:] inside a character class to mean an alpha character, [:digit:] to mean a digit, and [:alnum:] to mean an alphanumeric (specifically not including underscore, unlike \w). Note that these character classes might match more characters than you expect; think of accented characters and non-Arabic digits, especially in Unicode.
  • If you want to capture the whole of the information after the space, you need the repetition inside the capturing parentheses.

Contrast the behaviour of these two one-liners:

perl -n -e 'print "$2\n" if m/\bI( {1,2})([a-zA-Z\d\\]){2,13}\b/'

perl -n -e 'print "$2\n" if m/\bI( {1,2})([a-zA-Z\d\\]{2,13})\b/'

Given the input line "I a123", the first prints "3" and the second prints "a123". If all you wanted was the last character of the second part of the string, then the original expression is fine, but that is unlikely to be the requirement. (If you're only interested in the whole match, '$&' gives you the matched text, but it carries a performance penalty.)
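
The same contrast can be reproduced outside Perl. This is a minimal sketch in Python's re module (chosen as an assumption; the variable names are illustrative), where a repeated capture group likewise keeps only its last iteration:

```python
import re

# Repetition outside the capturing parentheses: the group is overwritten
# on each iteration, so only the last matched character is kept.
outside = re.compile(r"\bI( {1,2})([a-zA-Z\d\\]){2,13}\b")

# Repetition inside the parentheses: the group spans the whole run.
inside = re.compile(r"\bI( {1,2})([a-zA-Z\d\\]{2,13})\b")

line = "I a123"

last_char = outside.search(line).group(2)
whole_run = inside.search(line).group(2)

print(last_char)  # 3
print(whole_run)  # a123
```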

I'd probably use this regex as it seems clearest to me:

m/\bI( {1,2})([[:alnum:]\\]{2,13})\b/

Time for the obligatory plug: read Jeff Friedl's "Mastering Regular Expressions".

Jonathan Leffler
A: 

As I pointed out in my comment on slavy's post, a \\ followed by \b is a problem because a backslash is not a word character. So my suggestion is

/\bI([ ]{1,2})([\p{IsAlnum}\\]{2,13})(?:[^\w\\]|$)/

I assumed that you wanted to capture the whole 2-13 characters, not just a single character, so I adjusted my RE.

You can make the last group a lookahead if the engine supports it and you don't want to consume the trailing character. That would look like:

/\bI([ ]{1,2})([\p{IsAlnum}\\]{2,13})(?=[^\w\\]|$)/
Axeman
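
A small sketch of the difference, in Python's re module (an assumption, as the question names no engine). Python's re has no \p{IsAlnum}, so [a-zA-Z\d\\] stands in for [\p{IsAlnum}\\] here, and the sample string is hypothetical:

```python
import re

# [a-zA-Z\d\\] substitutes for [\p{IsAlnum}\\], which Python's re lacks.
word_boundary = re.compile(r"\bI([ ]{1,2})([a-zA-Z\d\\]{2,13})\b")
consuming = re.compile(r"\bI([ ]{1,2})([a-zA-Z\d\\]{2,13})(?:[^\w\\]|$)")
lookahead = re.compile(r"\bI([ ]{1,2})([a-zA-Z\d\\]{2,13})(?=[^\w\\]|$)")

# Hypothetical sample: the token "ab\" ends in a backslash.
sample = "I ab\\ next"

# With a trailing \b, the match backs off to "ab" and drops the backslash.
truncated = word_boundary.search(sample).group(2)

# The consuming group keeps the backslash but also eats the following space.
consumed = consuming.search(sample)

# The lookahead keeps the backslash and leaves the space unconsumed.
peeked = lookahead.search(sample)

print(repr(truncated))
print(repr(consumed.group(0)), repr(consumed.group(2)))
print(repr(peeked.group(0)), repr(peeked.group(2)))
```

The lookahead form matters mostly when matches are adjacent or when the full matched text is used, since the consuming form includes the delimiter in the overall match.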