views:

1498

answers:

7

Can a regular expression match whitespace or the start of a string?

I'm trying to replace the currency abbreviation GBP with a £ symbol. I could just match anything starting with GBP, but I'd like to be a bit more conservative and look for certain delimiters around it.

>>> import re
>>> text = u'GBP 5 Off when you spend GBP75.00'

>>> re.sub(ur'GBP([\W\d])', ur'£\g<1>', text) # matches GBP with any prefix
u'\xa3 5 Off when you spend \xa375.00'

>>> re.sub(ur'^GBP([\W\d])', ur'£\g<1>', text) # matches at start only
u'\xa3 5 Off when you spend GBP75.00'

>>> re.sub(ur'(\W)GBP([\W\d])', ur'\g<1>£\g<2>', text) # matches whitespace prefix only
u'GBP 5 Off when you spend \xa375.00'

Can I do both of the latter examples at the same time?

+3  A: 

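\b is a word boundary: a zero-width position between a word character and a non-word character. Before GBP it therefore matches after whitespace, at the beginning of the string, or after a non-alphanumeric symbol (as in "\"GBP\"").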
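A minimal sketch of the word-boundary behaviour, assuming Python 3 (where the ur'' prefix used elsewhere on this page no longer exists) and a made-up sample string. It also shows why the pattern should be a raw string: in a normal string literal, '\b' is a backspace character rather than a regex word boundary.

```python
import re

# Sample text is an assumption for illustration only.
text = 'GBP 5 Off when you spend GBP75.00, "GBP" quoted, XGBP9'

# Raw string: \b reaches the regex engine as a word-boundary assertion.
# XGBP9 is left alone because there is no boundary between X and G.
fixed = re.sub(r'\bGBP', '£', text)
print(fixed)

# Non-raw string: '\b' is a backspace character (U+0008),
# so the pattern never matches and the text comes back unchanged.
unchanged = re.sub('\bGBP', '£', text)
print(unchanged == text)
```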

Motti
Cool. I've learnt two things from your answer. 1. I've never used word boundaries in regular expressions before. 2. Things (particularly \b) don't work well if you accidentally use u'' rather than r'' prefixes on Python regular expressions.
Mat
@Mat: Of course you could use ur"myregex"
nosklo
Cool. That makes sense now you mention it.
Mat
+1  A: 

Yes, why not?

re.sub(u'^\W*GBP...

matches the start of the string, 0 or more non-word characters (\W, which includes whitespace), then GBP...

edit: Oh, I think you want alternation, use the |:

re.sub(u'(^|\W)GBP...
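To compare the two patterns in this answer, here is a small sketch assuming Python 3 and the sample text from the question. The elided tail of each pattern ("...") is left out, so this only illustrates how the prefix part behaves.

```python
import re

text = 'GBP 5 Off when you spend GBP75.00'

# ^\W*GBP is anchored: it can only match at the start of the string,
# so the second GBP is never touched.
anchored = re.sub(r'^\W*GBP', '£', text)
print(anchored)

# (^|\W)GBP accepts either the start of the string or one non-word
# character. Group 1 is written back so the delimiter is not lost.
either = re.sub(r'(^|\W)GBP', r'\g<1>£', text)
print(either)
```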
Svante
A: 

You can always trim leading and trailing whitespace from the token before you search if it's not a matching/grouping situation that requires the full line.
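As a rough sketch of that idea, assuming Python 3 and a single hypothetical token rather than a full line of text, trimming first lets the start-anchored pattern from the question do the job:

```python
import re

# Hypothetical token with stray whitespace around it.
token = '  GBP75.00  '

# Trim first, then the start-anchored pattern is enough.
trimmed = token.strip()
result = re.sub(r'^GBP([\W\d])', r'£\g<1>', trimmed)
print(result)
```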

duffymo
+6  A: 

This replaces GBP if it's preceded by the start of the string or a word boundary (which the start of the string already is when GBP comes first), and GBP is followed by a digit or a word boundary:

re.sub(ur'\bGBP(?=\b|\d)', u'£', text)

This removes the need for any backreferencing by using a lookahead. Note the raw-string prefix: without it, \b is read as a backspace character. Inclusive enough?
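A hedged comparison of the lookahead form with the capturing-group form used elsewhere in this thread, assuming Python 3 and a sample string extended with a trailing GBP (my addition, not from the question). One visible difference is a GBP at the very end of the string: the lookahead accepts it because the end of the string is a word boundary, while [\W\d] needs an actual character to consume.

```python
import re

# The trailing "total in GBP" is an assumption added for illustration.
text = 'GBP 5 Off when you spend GBP75.00, total in GBP'

# Lookahead: nothing after GBP is consumed, so no backreference is needed.
lookahead = re.sub(r'\bGBP(?=\b|\d)', '£', text)
print(lookahead)

# Capturing groups: the delimiters are consumed and must be written back.
# The final GBP is skipped because no character follows it.
grouped = re.sub(r'(^|\W)GBP([\W\d])', r'\g<1>£\g<2>', text)
print(grouped)
```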

Martijn Laarman
"\d+": the plus sign is not necessary
ΤΖΩΤΖΙΟΥ
You're right. In fact, most regex engines don't allow repetition inside lookarounds, and some allow only fixed repetition through {MIN,MAX}, which would make the \d+ invalid. I was aware of this but completely missed it, so thanks; I've edited accordingly :)
Martijn Laarman
@Martijn, that only applies to lookBEHINDs; lookAHEADs have no such limitation (at least, not in any flavor I'm familiar with).
Alan Moore
It actually does apply to lookaheads in some flavor(s); I'll have to dig up the exact name, though. It's much more common for lookbehinds, indeed. Fewer flavors DO allow quantifying within lookbehinds than DON'T, whereas far fewer flavors DON'T allow it in lookaheads than DO.
Martijn Laarman
+7  A: 

Use the OR "|" operator:

>>> re.sub(r'(^|\W)GBP([\W\d])', u'\g<1>£\g<2>', text)
u'\xa3 5 Off when you spend \xa375.00'
Zach Scrivena
Excellent. I'd assumed ^ was forced to be at the very beginning of the string. Minor change necessary to maintain the spacing: re.sub(u'(^|\W)GBP([\W\d])', u'\g<1>£\g<2>', text). Accepted due to being the most intuitive solution to my immediate problem.
Mat
@Mat: Thanks, I've updated my answer as suggested.
Zach Scrivena
+1  A: 

I think you're looking for '(^|\W)GBP([\W\d])'

Christoph
A: 

It works in Perl:

$text = 'GBP 5 off when you spend GBP75';
$text =~ s/(\W|^)GBP([\W\d])/$1\$$2/g;
print "$text\n";

The output is:

$ 5 off when you spend $75

Note that I stipulated that the match should be global, to get all occurrences.
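For comparison with the Python answers above, a short sketch assuming Python 3: re.sub replaces every occurrence by default, so no equivalent of the /g flag is needed, and the count argument limits the number of replacements instead. The £ replacement here follows the question rather than the $ used in the Perl snippet.

```python
import re

text = 'GBP 5 off when you spend GBP75'
pattern = r'(\W|^)GBP([\W\d])'

# Default count=0 means "replace all", like Perl's /g.
all_replaced = re.sub(pattern, r'\g<1>£\g<2>', text)
print(all_replaced)

# count=1 stops after the first match, like Perl's s/// without /g.
first_only = re.sub(pattern, r'\g<1>£\g<2>', text, count=1)
print(first_only)
```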

joel.neely