ansaurus

Question

Regular expression: match start or whitespace

Answer 1

+3 A:

\b is a word boundary: the zero-width position between a word character and a non-word character, such as whitespace, the beginning of a line or a non-alphanumeric symbol (\bGBP\b).
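A rough sketch of where \b does and does not match, assuming Python 3 and a few made-up sample strings (the helper name starts_word is hypothetical, not from the thread):

```python
import re

def starts_word(s):
    # True if "GBP" appears with a word boundary immediately before the G.
    return re.search(r"\bGBP", s) is not None

# Hypothetical samples, chosen only to illustrate the boundary rule.
for s in ["GBP 5", "spend GBP75", "(GBP)", "xGBP"]:
    print(repr(s), starts_word(s))

# \b needs a word character on one side and a non-word character on the
# other, so there is no boundary between "P" and "7" in "GBP75".
print(re.search(r"\bGBP\b", "GBP75"))
```

Only "xGBP" fails the first check, because x and G are both word characters. The last line prints None for the same reason, which is worth keeping in mind when an amount follows GBP directly.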

Motti 2009-02-08 12:42:22

Cool. I've learnt two things from your answer. 1. I've never used word boundaries in regular expressions before. 2. Things (particularly \b) don't work well if you accidentally use u'' rather than r'' prefixes on Python regular expressions.
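The prefix problem can be seen in a small sketch. This assumes Python 3, where an unprefixed literal behaves like the u'' form for this purpose; the variable names are only illustrative:

```python
import re

plain = "\bGBP"   # "\b" is the backspace character (U+0008) in a normal literal
raw = r"\bGBP"    # backslash + "b" reach the regex engine as a word boundary

print(len(plain), len(raw))

# The non-raw pattern looks for a literal backspace followed by GBP.
print(re.search(plain, "GBP 5"))

# The raw pattern matches GBP at a word boundary.
print(re.search(raw, "GBP 5"))
```

The non-raw pattern is one character shorter and finds nothing in ordinary text, while the raw pattern matches.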

Mat 2009-02-08 12:59:59

@Mat: Of course you could use ur"myregex"

nosklo 2009-02-09 10:49:08

Cool. That makes sense now you mention it.

Mat 2009-02-09 19:07:49

Answer 2

+1 A:

Yes, why not?

re.sub(u'^\W*GBP...

matches the start of the string, 0 or more non-word characters (\W matches any non-alphanumeric character, not only whitespace), then GBP...

edit: Oh, I think you want alternation, use the |:

re.sub(u'(^|\W)GBP...
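To see the difference between the two forms, here is a minimal sketch, assuming Python 3. The patterns above are truncated, so this one stops at GBP; the sample text and the non-capturing group (used so findall returns whole matches) are assumptions:

```python
import re

text = "GBP 5 off when you spend GBP75"  # hypothetical sample input

# Anchored form: only an occurrence at the very start of the string can match.
anchored = re.findall(r"^\W*GBP", text)

# Alternation: start of string OR a single non-word character before GBP.
alternated = re.findall(r"(?:^|\W)GBP", text)

print(anchored)
print(alternated)
```

The anchored form finds only the first GBP. The alternation also finds the second one, together with the space in front of it.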

Svante 2009-02-08 12:43:25

Answer 3

A:

You can always trim leading and trailing whitespace from the token before you search if it's not a matching/grouping situation that requires the full line.

duffymo 2009-02-08 12:44:29

Answer 4

+6 A:

This replaces GBP if it's preceded by the start of the string or a word boundary (which the start of the string already is) and followed by a numeric value or a word boundary:

re.sub(r'\bGBP(?=\b|\d)', u'£', text)

This removes the need for any backreferencing by using a lookahead. Inclusive enough?
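A runnable sketch of the lookahead approach, assuming Python 3. The function name replace_gbp is hypothetical, and the sample text is inferred from the output shown elsewhere in this thread:

```python
import re

def replace_gbp(text):
    # \b before GBP; the lookahead (?=...) checks what follows GBP
    # without consuming it, so no backreference is needed.
    return re.sub(r"\bGBP(?=\b|\d)", "£", text)

print(replace_gbp("GBP 5 Off when you spend GBP75.00"))
print(replace_gbp("GBPX and AGBP 5"))  # neither occurrence qualifies
```

Because the lookahead consumes nothing, the space or digit after GBP stays in the text.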

Martijn Laarman 2009-02-08 12:46:39

"\d+": the plus sign is not necessary

ΤΖΩΤΖΙΟΥ 2009-02-08 18:06:44

You're right. In fact, most regex engines don't allow repetition inside lookarounds, and some allow only fixed repetition through {MIN,MAX}, which makes the \d+ invalid. I was aware of this but completely missed it, so thanks. I've edited accordingly :)

Martijn Laarman 2009-02-08 19:54:20

@Martijn, that only applies to lookBEHINDs; lookAHEADs have no such limitation (at least, not in any flavor I'm familiar with).
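For one concrete flavor, here is a small check against Python 3's built-in re module. It says nothing about other engines, and the patterns are illustrative only:

```python
import re

# A quantifier inside a lookahead compiles and matches.
ahead = re.search(r"GBP(?=\d+)", "GBP75")
print(ahead.group())

# A variable-width lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=\d+)GBP")
    lookbehind_ok = True
except re.error as exc:
    lookbehind_ok = False
    print("lookbehind rejected:", exc)
```

In this module the restriction applies to lookbehinds only, and a fixed-width lookbehind such as (?<=\d) compiles without complaint.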

Alan Moore 2009-02-09 01:37:50

It actually does apply to lookaheads in some flavor(s); I'll have to dig up the exact name, though. It is indeed much more common for lookbehinds. Fewer flavors DO allow quantifying within lookbehinds than DON'T, whereas (far) fewer flavors DON'T allow it in lookaheads than DO.

Martijn Laarman 2009-02-09 13:11:23

Answer 5

+7 A:

Use the OR "|" operator:

>>> re.sub(r'(^|\W)GBP([\W\d])', u'\g<1>£\g<2>', text)
u'\xa3 5 Off when you spend \xa375.00'

Zach Scrivena 2009-02-08 12:46:54

Excellent. I'd assumed ^ was forced to be at the very beginning of the string. Minor change necessary to maintain the spacing: re.sub(u'(^|\W)GBP([\W\d])', u'\g<1>£\g<2>', text). Accepted due to being the most intuitive solution to my immediate problem.
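The spacing point can be sketched as follows, assuming Python 3 and an input inferred from the output shown in the answer. The "dropped" variant is a guess at what a replacement without \g<1> would do, not a quote of any earlier version of the answer:

```python
import re

text = "GBP 5 Off when you spend GBP75.00"  # assumed input
pattern = r"(^|\W)GBP([\W\d])"

# Without \g<1>, the non-word character matched before GBP is discarded.
dropped = re.sub(pattern, r"£\g<2>", text)

# With \g<1>, that character (here a space, or nothing at ^) is put back.
kept = re.sub(pattern, r"\g<1>£\g<2>", text)

print(dropped)
print(kept)
```

The first result loses the space before the second amount ("spend£75.00"); the second keeps it.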

Mat 2009-02-08 12:54:49

@Mat: Thanks, I've updated my answer as suggested.

Zach Scrivena 2009-02-08 13:00:41

Answer 6

+1 A:

I think you're looking for '(^|\W)GBP([\W\d])'

Christoph 2009-02-08 12:47:27

Answer 7

A:

It works in Perl:

$text = 'GBP 5 off when you spend GBP75';
$text =~ s/(\W|^)GBP([\W\d])/$1\$$2/g;
print "$text\n";

The output is:

$ 5 off when you spend $75

Note that I stipulated that the match should be global, to get all occurrences.

joel.neely 2009-02-08 13:10:56
