How do I, using a regular expression, split this string:

string = "a[a=d b&c[e[100&2=34]]] e[cheese=blue and white] x[a=a b]"

into this array:

string.split( regexp ) =>

[ "a[a=d b&c[e[100&2=34]]]", "e[cheese=blue and white]", "x[a=a b]" ]

The basic rule is that the string should be split at whitespace (\s), unless that whitespace is inside brackets ([ ]).

+5  A: 

You can't; regular expressions are based on state machines, which don't have a "stack" with which to remember the current nesting level.

But maybe you can use a trick: Try to convert the string into a valid JSON string. Then you can use eval() to parse it into a JavaScript object.

Aaron Digulla
A: 

Could you split on "(?<=\])\s(?=[a-z]\[)"? That is, a space preceded by a ] and followed by a letter and a [. This assumes you never have bracketed groups nested inside other brackets and separated by a space, like "a[b=d[x=y b] g[w=v b]]".
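
A quick way to try this idea is the Ruby sketch below. It assumes an engine with lookbehind support (Ruby's has it) and writes the pattern with the brackets escaped; the variable names are only illustrative. It shows the split working on the sample string and failing on the nested case mentioned above.

```ruby
str = "a[a=d b&c[e[100&2=34]]] e[cheese=blue and white] x[a=a b]"

# A whitespace character preceded by "]" and followed by a letter and "[".
sep = /(?<=\])\s(?=[a-z]\[)/

parts = str.split(sep)
p parts
# => ["a[a=d b&c[e[100&2=34]]]", "e[cheese=blue and white]", "x[a=a b]"]

# The caveat: a "] x[" sequence inside brackets is split as well.
nested = "a[b=d[x=y b] g[w=v b]]"
broken = nested.split(sep)
p broken
# => ["a[b=d[x=y b]", "g[w=v b]]"]
```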

Thomas Cowart
+4  A: 

If the rule is this simple, I would suggest just doing it manually. Step through each character and keep track of your nesting level by increasing by 1 for each [ and decreasing by 1 for each ]. If you reach a space with nesting == 0 then split.
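
The counting approach described above might look roughly like this in Ruby. This is a minimal sketch: the helper name split_top_level is hypothetical, and collapsing runs of whitespace and raising on unbalanced brackets are assumptions added here, not part of the stated rule.

```ruby
# Hypothetical helper: split on whitespace that lies outside all brackets.
def split_top_level(str)
  tokens  = []
  current = String.new
  depth   = 0

  str.each_char do |ch|
    depth += 1 if ch == "["
    depth -= 1 if ch == "]"
    # Assumption: a stray "]" is treated as an error.
    raise ArgumentError, "unbalanced ']'" if depth < 0

    if depth.zero? && ch =~ /\s/
      # Whitespace at nesting level 0 ends the current token.
      tokens << current unless current.empty?
      current = String.new
    else
      current << ch
    end
  end

  # Assumption: an unclosed "[" is treated as an error too.
  raise ArgumentError, "unbalanced '['" unless depth.zero?
  tokens << current unless current.empty?
  tokens
end

p split_top_level("a[a=d b&c[e[100&2=34]]] e[cheese=blue and white] x[a=a b]")
```

Because the nesting level is tracked explicitly, whitespace at any depth inside brackets is left alone, which is the case the pure-regex attempts struggle with.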

Edit: I might also mention that some languages have other pattern matching facilities that natively support this sort of thing. For example, in Lua you can use '%b[]' to match balanced nested []'s. (Of course, Lua doesn't have a built-in split function....)

Dolphin
+1 - good idea! back to basics, yo!
nickf
It is possible to do this with a regex when using Oniguruma, as it has an extension that allows certain types of nesting. However, I don't believe any JavaScript implementation includes this extension, and regardless, I agree with Dolphin that a non-regex approach is going to be faster and cleaner. You can implement a state machine for this pretty easily.
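
For illustration, here is a sketch of that kind of nesting in Ruby, whose regex engine is Oniguruma (1.9) or its fork Onigmo (2.0 and later). It relies on the \g<name> subexpression call; the group name br and the variable names are made up for this example. It matches tokens rather than splitting on the separators.

```ruby
str = "a[a=d b&c[e[100&2=34]]] e[cheese=blue and white] x[a=a b]"

# (?<br>...) matches one balanced [...] span; \g<br> calls the same group
# recursively for brackets nested inside it. A token is any run of
# non-space, non-bracket characters and balanced bracket spans.
token = /(?:[^\s\[\]]|(?<br>\[(?:[^\[\]]|\g<br>)*\]))+/

toks = []
# With groups in the pattern, scan yields captures, so collect the
# whole match via $& instead.
str.scan(token) { toks << $& }
p toks
# => ["a[a=d b&c[e[100&2=34]]]", "e[cheese=blue and white]", "x[a=a b]"]
```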
Kevin Ballard
A: 

Another option is a looping approach that deconstructs the nested brackets one level at a time; otherwise it's hard(TM) to ensure that a single regexp will work as expected.

Here's an example in ruby:

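str = "a[a=d b&c[e[100&2=34]]] e[cheese=blue and white] x[a=a b]"
left = str.dup
toks = []
# Deconstruct: each pass replaces one innermost [...] group (one with no
# brackets inside it) with a {n} placeholder, until no brackets remain.
loop do
  break unless left.sub!(/\[[^\]\[]*\]/, "{#{toks.size}}")
  toks << $&
end
left = left.split(/\s+/)
# Reconstruct: expand the placeholders in reverse order, outermost first.
# The block form inserts the saved text literally.
(toks.size - 1).downto(0) do |n|
  left.each { |part| part.sub!("{#{n}}") { toks[n] } }
end
p left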

The above uses {n} placeholders, where n is an integer, during the deconstruction, so original input that already contains text like {0} would break the reconstruction. This should illustrate the approach though.

Writing code that does the split by iterating through the characters is simpler and safer though.

Example in ruby:

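str = "a[a=d b&c[e[100&2=34]]] e[cheese=blue and white] x[a=a b]"
toks = []
level = st = en = 0
# each_char yields one-character strings, so they can be compared with
# '[' , ']' and ' ' directly (each_byte would yield integers instead).
str.each_char do |c|
  en += 1
  level += 1 if c == '['
  level -= 1 if c == ']'
  # A space outside all brackets ends the current token.
  if level == 0 && c == ' '
    toks.push(str[st, en - 1 - st])
    st = en
  end
end
# Push whatever remains after the last split point.
toks.push(str[st, en - st]) if st != en
p toks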