I have a string that looks like this:

"#Text() #SomeMoreText() #TextThatContainsDelimiter(#blah) #SomethingElse()"

I'd like to get back

[#Text(), #SomeMoreText(), #TextThatContainsDelimiter(#blah), #SomethingElse()]

One way I thought of doing this was to require that the # be escaped as \#, which makes the input string:

"#Text() #SomeMoreText() #TextThatContainsDelimiter(\#blah) #SomethingElse()"

I can then split it using /[^\\]#/ which gives me:

[#Text(), SomeMoreText(), TextThatContainsDelimiter(\#blah), SomethingElse()]

The first element will contain # but I can strip it out. However, is there a cleaner way to do this without having to escape the #, and which ensures that the first element will not contain a #? Basically I'd like it to split by # only if the # is not enclosed by parentheses.

My hunch is that since the meaning of the # is context-sensitive and regular expressions are only suited to regular languages, this may not be the right tool. If so, would I have to write a grammar for this and roll my own parser/lexer?
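
For reference, the rule "split on # only when it is not enclosed by parentheses" can also be written without a regex or a full grammar, by tracking nesting depth in a single pass. This is a minimal sketch in Python, assuming parentheses are balanced and that surrounding whitespace should be trimmed; the function name `split_outside_parens` is made up for illustration.

```python
def split_outside_parens(text, delimiter="#"):
    """Start a new token at each delimiter that is not inside parentheses."""
    tokens = []
    current = []
    depth = 0  # current parenthesis nesting level
    for ch in text:
        if ch == delimiter and depth == 0 and current:
            # A top-level delimiter begins a new token: flush the previous one.
            tokens.append("".join(current).strip())
            current = []
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        current.append(ch)
    if current:
        tokens.append("".join(current).strip())
    # Drop empty tokens, e.g. from leading whitespace.
    return [t for t in tokens if t]


s = "#Text() #SomeMoreText() #TextThatContainsDelimiter(#blah) #SomethingElse()"
print(split_outside_parens(s))
```

Because the scanner only looks at nesting depth, it does not care which character precedes a nested #, so an input such as #Text(/#blah/) stays in one piece.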

+2  A: 

Argh! I tend to lose my abilities here. The regex (?<!\()(?=#) works:

PS Home:\> $s -split '(?<!\()(?=#)'

#Text()
#SomeMoreText()
#TextThatContainsDelimiter(#blah)
#SomethingElse()

This combines a negative lookbehind (to make sure there isn't an opening parenthesis preceding the #) and a positive lookahead to look for the #.
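
The same pattern can be tried outside PowerShell. The following is a rough Python sketch, assuming Python 3.7 or later (earlier versions do not split on zero-width matches); the variable names are only illustrative. Since the match is zero-width, the separating spaces stay attached to the tokens and an empty string appears before the first #, so the sketch strips and filters the pieces.

```python
import re

s = "#Text() #SomeMoreText() #TextThatContainsDelimiter(#blah) #SomethingElse()"

# Split at every position that is followed by '#' (lookahead)
# and not preceded by '(' (negative lookbehind).
parts = re.split(r"(?<!\()(?=#)", s)

# Trim the trailing spaces and drop the empty leading piece.
tokens = [p.strip() for p in parts if p.strip()]
print(tokens)
```

Note that the lookbehind only inspects the single character before the #, so it protects a # directly after an opening parenthesis and nothing deeper inside the parentheses.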

Joey
Interesting! This works for the example, but let's say I have the following: #Text(/#blah/). It fails there (splits along / so that we get #Text(, #blah/)). I'm reasonably well-versed in regexes, but obviously nowhere near your level. Can you explain what you are doing?
Vivin Paliath
Not really, no. As I said, I don't understand this either. `(?!<\()(?=#)` would have been straightforward ... it would match between the `#` and the preceding character (provided said character isn't an opening parenthesis). But it doesn't work, sadly.
Joey
So I've determined that if the # inside the parentheses is preceded by a /, it is split along the / (and the / goes missing). I read up a little more on regular expressions and figured out that you are using a negative-lookahead and a positive-lookahead. Trying to figure out how your first negative-lookahead works :)
Vivin Paliath
@Vivin: Ok, found my mistake and corrected it. The first one was supposed to be a look*behind*.
Joey
Perfect! That works in my Perl script. I tried to do this in JavaScript, but it doesn't support any form of look-behind :( Thanks!
Vivin Paliath
+2  A: 

From your example, it looks like you want to split on whitespace that's immediately followed by a hash symbol:

/\s+(?=#)/

That leaves the leading # on all the tokens, but you won't need to treat the first token specially. You could also use this:

/(?:^|\s+)#/

That would strip the hash symbols at the cost of generating an empty string as the first token. But some languages provide a way to discard empty leading tokens. Note that JavaScript does support lookaheads, just not lookbehinds.
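
As a quick check of both patterns, here is a small Python sketch (Python is just a convenient choice here, and the variable names are illustrative). The first split keeps the leading #, the second consumes it and leaves an empty first token, which is filtered out afterwards.

```python
import re

s = "#Text() #SomeMoreText() #TextThatContainsDelimiter(#blah) #SomethingElse()"

# Split on whitespace that is immediately followed by '#'; the '#' is kept.
keep_hash = re.split(r"\s+(?=#)", s)

# Split on '#' at the start of the string or after whitespace; the '#' is consumed,
# which produces an empty string as the first token.
raw = re.split(r"(?:^|\s+)#", s)
drop_hash = [t for t in raw if t]  # discard the empty leading token

print(keep_hash)
print(drop_hash)
```

Neither pattern needs a lookbehind, and the # inside the parentheses is left alone because it follows neither whitespace nor the start of the string.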

Alan Moore
Perfect! That's what I wanted. Thanks!
Vivin Paliath