I've been wrestling with an issue I was hoping to solve with regex.

Let's say I have a string that can contain any alphanumeric characters, with the possibility that some substrings within it are surrounded by square brackets. These bracketed substrings can appear anywhere in the string, and there can be any number of them.

Examples:

  • aaa[bb b]
  • aaa[bbb]ccc[d dd]
  • [aaa]bbb[c cc]

You can see that there is whitespace in some of the bracketed substrings, and that's fine. My main issue right now is when I encounter spaces outside of the brackets, like this:

  • a aa[bb b]

Now I want to preserve the spaces inside the brackets but remove them everywhere else.

This gets a little more tricky for strings like:

  • a aa[bb b]c cc[d dd]e ee[f ff]

Here I would want the return to be:

  • aaa[bb b]ccc[d dd]eee[f ff]

I've spent some time reading through different regex pages on lookarounds, negative assertions, etc., and it's making my head spin.

NOTE: for anyone visiting this, I was not looking for a solution that handles nested brackets. If that were the case, I'd probably do it programmatically, as some of the comments below suggest.

+7  A: 

This regex should do the trick:

[ ](?=[^\]]*?(?:\[|$))

Just replace the space that was matched with "".

Basically, the lookahead checks what follows the space: scanning forward, it must reach a "[" or the end of the string before it reaches a "]". A space inside brackets runs into the closing "]" first, so it is not matched.

That should work as long as you don't have nested square brackets, e.g.:

a a[b [c c]b]

Because in that case, the space after the first "b" will be removed and it will become:

aa[b[c c]b]
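
For anyone who wants to try this outside PHP, here is a minimal sketch in Python, assuming Python's `re` module treats this pattern the same way PCRE does. The function name `strip_outer_spaces` is made up for illustration.

```python
import re

# A space is matched only if, scanning forward without crossing a "]",
# we reach a "[" or the end of the string.
OUTSIDE_SPACE = re.compile(r"[ ](?=[^\]]*?(?:\[|$))")

def strip_outer_spaces(text):
    """Remove spaces outside [...] groups; assumes brackets are not nested."""
    return OUTSIDE_SPACE.sub("", text)

print(strip_outer_spaces("a aa[bb b]c cc[d dd]e ee[f ff]"))  # -> aaa[bb b]ccc[d dd]eee[f ff]
print(strip_outer_spaces("a a[b [c c]b]"))                   # -> aa[b[c c]b]  (nested case goes wrong)
```

The second call reproduces the nested-bracket limitation described above.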

Senseful
Works for the test case: php -r "var_dump(preg_replace('@[ ](?=[^[\]]*?(?:\[|$))@', '', 'a aa[bb b]c cc[d dd]e ee[f ff]'));"
Frank Farmer
+1 for answering the actual question: how to perform *this task* (ie, no nesting) *with regexes*.
Alan Moore
Awesome, thank you. I was somewhat close, but I couldn't handle past 2 sets of bracketed substrings. And I did not need nested brackets (phew!).
seano
Excellent solution! Two quick questions. Can you explain the need for '[' following the '^' and the '|$'? [ ](?=[^\]]*?(?:\[)) seems to work as well.
Doomspork
The '|$' at the end is required in case your string is something like 'a aa[bb b]c cc[d dd]e ee[f ff]g gg', to get rid of the space between the g's. They don't have a '[' following them, so you also want to check for end of string ('$'). You are correct that the '[' inside the first character class is not required. That is because '.*?b' is essentially the same as '[^b]*b' as long as that's the end of the regex. This was just left over from while I was writing it in the first place before I used the '?' character. It's interesting to note however that '.+?b' is not the same as '[^b]+b'.
Senseful
+8  A: 

This doesn't sound like something you really want regex for. It's very easy to parse directly by reading through. Pseudo-code:

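inside_brackets = false;
result = "";
for ( i = 0; i < length(str); i++) {
    if ( str[i] == '[' )
        inside_brackets = true;
    else if ( str[i] == ']' )
        inside_brackets = false;
    // copy every character except spaces outside brackets; building a new
    // string avoids skipping a character after deleting one mid-loop
    if ( inside_brackets || ! is_space(str[i]) )
        result += str[i];
}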

Anything involving regex is going to involve a lot of lookbehind stuff, which will be repeated over and over, and it'll be much slower and less comprehensible.

To make this work for nested brackets, simply change inside_brackets to a counter, starting at zero, incrementing on open brackets, and decrementing on close brackets.
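
A rough Python sketch of the counter variant, in case it helps. The function name is hypothetical, and ignoring an unmatched "]" is my own assumption about how to handle malformed input, not something the question specifies.

```python
def strip_outer_spaces_nested(text):
    """Drop whitespace at bracket depth 0; keep it at any depth inside [...]."""
    depth = 0
    kept = []
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1  # assumption: an unmatched "]" is ignored rather than an error
        # keep the character unless it is whitespace outside all brackets
        if depth > 0 or not ch.isspace():
            kept.append(ch)
    return "".join(kept)

print(strip_outer_spaces_nested("a a[b [c c]b]"))  # -> aa[b [c c]b]
```

Collecting the kept characters in a list, instead of deleting from the string in place, sidesteps index bookkeeping while looping.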

Jefromi
Heh, good thing I checked for new answers before posting mine. That's almost exactly what I had, except my pseudocode didn't look as much like PHP.
Michael Myers
Actually it shouldn't involve any look behind if there is no nesting, and your code also assumes no nesting.
Senseful
Depending on the language, this may need to be expanded to handle nested brackets. But this is probably the best approach.
derobert
eagle, I was a bit imprecise (read as "incorrect"). What I was thinking of was the fact that, nested or not, for every bracket you have to find the matching close. You're right, you're really looking for repetitions of the pattern /\[[^\]]*\]/.
Jefromi
+1  A: 

How to do this depends on what should be done with:

a b [ c [ d [ e ] f ] g

That is ambiguous; possible answers are at least:

  • ab[ c [ d [ e ] f ]g
  • ab[ c [ d [ e ]f]g
  • error out; the brackets don't match!

For the first two cases, you can use regexps. For the third case, you'd be much better off with a (small) parser.

For either case one or two, split the string on the first [. Strip spaces from everything before [ (that's obviously outside of the brackets). Next, look for .*\] (case 1) or .*?\] (case 2) and move that over to your output. Repeat until you're out of input.
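
The loop described above might look something like this in Python. This is only a sketch: `strip_by_chunks` and its `greedy` flag are names I made up, the pattern includes the opening "[" for convenience, and treating an unclosed "[" as outside text is an assumption.

```python
import re

def strip_by_chunks(text, greedy):
    """Alternate between outside chunks (spaces stripped) and bracketed chunks (copied as-is)."""
    # Case 1 (greedy) runs to the LAST "]"; case 2 (lazy) stops at the FIRST "]".
    bracketed = re.compile(r"\[.*\]" if greedy else r"\[.*?\]")
    out = []
    pos = 0
    while pos < len(text):
        start = text.find("[", pos)
        if start == -1:
            # no more brackets: everything left is outside
            out.append(text[pos:].replace(" ", ""))
            break
        out.append(text[pos:start].replace(" ", ""))
        m = bracketed.match(text, start)
        if m is None:
            # assumption: an unclosed "[" is treated as outside text
            out.append(text[start:].replace(" ", ""))
            break
        out.append(m.group(0))
        pos = m.end()
    return "".join(out)

print(strip_by_chunks("a b [ c [ d [ e ] f ] g", greedy=True))   # -> ab[ c [ d [ e ] f ]g
print(strip_by_chunks("a b [ c [ d [ e ] f ] g", greedy=False))  # -> ab[ c [ d [ e ]f]g
```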

derobert
+1  A: 

This works for me:

(\[.+?\])|\s

Then you simply pass in a replacement value of $1 when you call the replace function. The idea is to look for the patterns inside the brackets first and make sure they're untouched. And then every space outside the brackets gets replaced with nothing.

Note that I tested this with Regex Hero (a .NET regex tester), and not in PHP. So I'm not 100% sure this will work for you.
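
For comparison, a small Python sketch of the same idea (not .NET or PHP; the function name is hypothetical). A callback is used for the replacement so that a bare whitespace match, where group 1 did not participate, becomes an empty string.

```python
import re

# Either a whole bracketed group (captured) or a single whitespace character.
BRACKETED_OR_SPACE = re.compile(r"(\[.+?\])|\s")

def strip_keep_groups(text):
    # group(1) is None when only whitespace matched, so fall back to ""
    return BRACKETED_OR_SPACE.sub(lambda m: m.group(1) or "", text)

print(strip_keep_groups("a aa[bb b]c cc[d dd]e ee[f ff]"))  # -> aaa[bb b]ccc[d dd]eee[f ff]
```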

That was an interesting one. Sounded simple at first, then seemed rather difficult. And then the solution I finally arrived at was indeed simple. I was surprised the solution didn't require a lookaround of any sort. And it should be faster than any method that uses a lookaround.

Steve Wortham
A: 

The following will match start-of-line or end-of-bracket (which must come before any space you want to match) followed by anything that isn't start-of-bracket or a space, followed by some space.

/((^|\])[^ \[]*) +/

Replacing all matches with $1 removes the first block of spaces from each non-bracketed sequence. You will have to repeat the replacement until nothing changes to remove all the spaces.

Example:

abcd efg [hij klm]nop qrst u
abcdefg [hij klm]nopqrst u
abcdefg[hij klm]nopqrstu
done
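
The repeat-until-stable loop can be sketched in Python like this, assuming `re.sub` as the "replace all" step. The function name is made up, and it returns every intermediate string so the passes can be compared with the example above.

```python
import re

# Start of string or "]", then non-space/non-"[" characters, then a run of spaces.
FIRST_RUN = re.compile(r"((^|\])[^ \[]*) +")

def strip_repeatedly(text):
    """Apply the replacement until the string stops changing; return all passes."""
    passes = [text]
    while True:
        nxt = FIRST_RUN.sub(r"\1", passes[-1])
        if nxt == passes[-1]:
            return passes
        passes.append(nxt)

for line in strip_repeatedly("abcd efg [hij klm]nop qrst u"):
    print(line)
# abcd efg [hij klm]nop qrst u
# abcdefg [hij klm]nopqrst u
# abcdefg[hij klm]nopqrstu
```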
Draemon