tags:

views:

52

answers:

3

The other day I sat with a regular expression problem. Eventually I solved it a different way, without regular expressions, but I would still like to know how you do it :)

The problem I was having was running svn update via an automated script, and I wanted to detect conflicts. Doing this with or withouth regex is trivial, but it got me thinking about a more obscure problem: How do you match exactly ONE occurence of a character inside a fixed length field of whitespace ?

For instance, let's say we wanted to match "C" inside a six-byte wide field:

"C     " MATCH
"   C  " MATCH
" C C  " NO MATCH
"  M   " NO MATCH
"      " NO MATCH
"C      " NO MATCH (7 characters, not 6)
"   C " NO MATCH (5 characters, not 6)
+3  A: 
^(?=[ C]{6}$) *C(?! *C)

Explanation:

^             # start-of-string
(?=[ C]{6}$)  # followed by exactly 6 times " " or "C" and the end-of-string
 *C           # any number of spaces and a "C"
(?! *C)       # not followed by another C anywhere (negative lookahead)

Notes:

  • The ^(?=…{6}$) construct can be used anywhere you want to measure string length but not actually match anything yet.
  • Since the end of the string is already checked in the look-ahead, you do not need to put a $ at the end of the regex, but it does not hurt to do it.
Tomalak
It shouldn't be ^C, the user asked for a white space.
testalino
@testalino: Good point. That makes things easier, even.
Tomalak
+1 for awesome use of lookaheads, interesting example. However, the match fails on two occurences of C, this fixes it: ^(?=.{6}$)[^C]*?C(?!.*C)
KennethJ
on 2nd inspection, this fails when C is mixed with other characters
KennethJ
@KennethJ: I've just changed the regex to cater for that.
Tomalak
+3  A: 
^[^C]*C[^C]*$

but this will not verify the length of your string.

eumiro
+1 The negative character classes are better than my lookaheads.
Tomalak
Character classes are better. But this doesn’t take the overall length into account. So you will need to do something like: `^(C[^C]{5}|[^C]C[^C]{4}|[^C]{2}C[^C]{3}|[^C]{3}C[^C]{2}|[^C]{4}C[^C]|[^C]{5}C)$`
Gumbo
… but unfortunately, with the second requirement (spaces only) this does not work anymore. See @testalino's comment on my answer.
Tomalak
Solved it, you need to incorporate the length lookahead from Tomalak, and replace [^C]* with whitespace: ^(?=.{6}$) *C *$
KennethJ
@Gumbo: You're not being serious. I doubt that this will beat look-aheads in performance.
Tomalak
@Tomalak: Maybe not that one. But maybe this one: `^(C[^C]{5}|[^C](C[^C]{4}|[^C](C[^C]{3}|[^C](C[^C]{2}|[^C](C[^C]|[^C]C)))))$` ;)
Gumbo
@Tomalak: The point is rather that not every regular expression implementation does support look-aheads assertions.
Gumbo
@Gumbo, if not every RE implementation supports look-aheads, then every such implementation supports the length of a string. So just to it beside the RE matching...
eumiro
@eumiro: We don’t know if the whole string is always the subject. Maybe it’s just a substring.
Gumbo
+4  A: 

I know it's not right to answer your own question, but I basically merged your answers ... please don't flame :)

^(?=.{6}$) *C *$

Edit: Replacing . with Tomalak's response below [ C] increases the speed with about 4-5% or so

^(?=[ C]{6}$) *C *$

KennethJ
+1 Yeah, that's good, too. The final `$` still is unnecessary, see my note #2. *(And no, there is noting wrong with answering your own question at all. No danger of being flamed.)*
Tomalak
@tomalak if I remove the final $, it fails when something other than C occures after the C, and also on multiple occurences of C
KennethJ
See my changed answer for an initial look-ahead that fixes this.
Tomalak
I've just done some benchmarking, this is by far the fastest variant: `^(?=[ C]{6}$) *C *`. A little slower is the look-ahead based version from my answer `^(?=[ C]{6}$) *C(?! *C)`, and @Gumbo's two monster regexes are from the comments to @eumiro's answer are the slowest of the bunch, the second one slightly faster than the first one.
Tomalak
@Tomalak - did you use spaces instead of `[^C]` on Gumbo's regex? While we're being silly, you might as well test `C-----|-C----|--C---|---C--|----C-|-----C`.
Kobi
@tomalek Your modified variant still fails on two occurences of C. Face it, as much as you'd like to not have it, you need the ending $ :)
KennethJ
@Kenneth: Yeah, you're right, that's true for this regex. It's not true for mine, though.
Tomalak
@Kobi: Yes, I did use spaces instead of `[^C]`.
Tomalak