ansaurus

Question

Splitting a complex string with Regular Expressions

Answer 1

+5 A:

You can't; regular expressions are based on state machines, which have no "stack" with which to remember the number of nesting levels.

But maybe you can use a trick: Try to convert the string into a valid JSON string. Then you can use eval() to parse it into a JavaScript object.

Aaron Digulla 2009-05-26 14:17:58

Answer 2

A:

Could you split on "(?<=\])\s(?=[a-z]\[)"? That is, a space preceded by a ] and followed by a letter and a [. This assumes you never have a string nested inside brackets like "a[b=d[x=y b] g[w=v b]]".

Thomas Cowart 2009-05-26 14:21:59
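
As a rough illustration of the lookaround split suggested above and of its limitation, here is a minimal Ruby sketch. It assumes Ruby 1.9 or later (for lookbehind support) and writes the pattern with both brackets escaped. The constant name SPLIT_RE is a choice made for this sketch; the first sample string is the one used in Answer 4 of this thread.

```ruby
# Sketch only: split on a space that follows "]" and precedes a letter plus "[".
SPLIT_RE = /(?<=\])\s(?=[a-z]\[)/

flat = "a[a=d b&c[e[100&2=34]]] e[cheese=blue and white] x[a=a b]"
flat_parts = flat.split(SPLIT_RE)
# Only the spaces between top-level items match here.

nested = "a[b=d[x=y b] g[w=v b]]"
nested_parts = nested.split(SPLIT_RE)
# The space before "g[" sits inside the outer brackets but still matches,
# so this single item is cut in two.

p flat_parts
p nested_parts
```

The second input shows why the pattern is only safe when a "] letter[" sequence never occurs inside an outer pair of brackets.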

Answer 3

+4 A:

If the rule is this simple, I would suggest just doing it manually. Step through each character and keep track of your nesting level by increasing by 1 for each [ and decreasing by 1 for each ]. If you reach a space with nesting == 0 then split.

Edit: It's also worth mentioning that some languages have other pattern-matching facilities that natively support this sort of thing. For example, in Lua you can use '%b[]' to match balanced nested []'s. (Of course, Lua doesn't have a built-in split function....)

Dolphin 2009-05-26 14:59:56

+1 - good idea! back to basics, yo!

nickf 2009-05-26 15:02:44

It is possible to do this with a regex when using Oniguruma, as it has an extension that allows certain types of nesting. However, I don't believe any JavaScript implementation includes this extension, and regardless, I agree with Dolphin that a non-regex approach is going to be faster and cleaner. You can implement a state machine for this pretty easily.

Kevin Ballard 2009-05-27 03:30:03
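
To illustrate the kind of nesting extension mentioned in the comment above: Ruby's regex engine (Oniguruma in 1.9, its fork Onigmo in later versions) supports subexpression calls written \g<name>, which let a pattern refer to itself. The sketch below assumes that feature. TOKEN, the group name "balanced", and split_top_level are hypothetical names introduced here, not anything from the thread.

```ruby
# Sketch only: a token is a run of non-space, non-bracket characters
# and balanced bracket groups. Relies on the \g<name> subexpression call.
TOKEN = /
  (?:
    [^\s\[\]]                       # one ordinary character
    |
    (?<balanced>                    # one balanced bracket group
      \[
      (?: [^\[\]] | \g<balanced> )* # plain characters or a nested group
      \]
    )
  )+
/x

# Hypothetical helper: collect every whole match of TOKEN.
def split_top_level(str)
  tokens = []
  str.scan(TOKEN) { tokens << $& } # $& is the full match, not the capture
  tokens
end

p split_top_level("a[a=d b&c[e[100&2=34]]] e[cheese=blue and white] x[a=a b]")
```

Note that this pattern skips over unbalanced brackets instead of reporting them, so a character-by-character scan is still the easier approach to make strict.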

Answer 4

A:

Another option is a looping approach in which you deconstruct the nested brackets one level at a time; otherwise it's hard(TM) to ensure that a single regexp will work as expected.

Here's an example in Ruby:

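str = "a[a=d b&c[e[100&2=34]]] e[cheese=blue and white] x[a=a b]"
left = str.dup
toks = []
# Deconstruct: replace the first innermost [...] group with a {n} placeholder
# until no bracket groups remain
loop do
  break unless left.sub!(/\[[^\]\[]*\]/, "{#{toks.size}}")
  toks << $&   # remember the text that placeholder {n} stands for
end
parts = left.split(/\s+/)
# Reconstruct: restore placeholders from the highest n (outermost) down to 0
(toks.size - 1).downto(0) do |n|
  parts.each { |part| part.sub!("{#{n}}") { toks[n] } }
end
p parts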

The above uses {n}, where n is an integer, as a placeholder during the deconstruction, so an input string that already contains text of that form (such as {0}) would break the reconstruction. This should illustrate the approach though.

Writing code that does the split by iterating through the characters is simpler and safer though.

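Example in Ruby:

str = "a[a=d b&c[e[100&2=34]]] e[cheese=blue and white] x[a=a b]"
toks = []
level = 0   # current bracket nesting depth
st = 0      # start index of the current token
str.each_char.with_index do |c, i|
  level += 1 if c == '['
  level -= 1 if c == ']'
  if level == 0 && c == ' '
    toks.push(str[st...i])   # the token ends just before the space
    st = i + 1
  end
end
toks.push(str[st..-1]) if st < str.size   # trailing token, if any
p toks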

fd 2009-05-26 15:14:25
