Hello,

Those regular expressions drive me crazy. I'm stuck with this one:

test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not

Task:
Remove all [[ and ]], and where a link contains options separated by |, keep the last one, so the output should be:

test1:link test2:silver test3:out1insideout2 test4:this|not

I came up with (PHP)

$text = preg_replace("/\\[\\[|\\]\\]/",'',$text); // remove [[ or ]]

This works for part 1 of the task. But I think I should do the option split before that. My best attempt:

$text = preg_replace("/\\[\\[(.*\|)(.*?)\\]\\]/",'$2',$text);

Result:

test1:silver test3:[[out1[[inside]]out2]] test4:this|not

I'm stuck. Could someone with a few free minutes help me? Thanks!

A: 

Why try to do it all in one go? Remove the [[ ]] first and then deal with the options; do it in two lines of code.

When trying to get something going, favour clarity and simplicity.

Seems like you have all the pieces.

djna
That does not work. If I remove the [[ first, then "this|not" is split too. And my problem is that I don't have a working option-split expression...
cydo
+1  A: 

I think the easiest way to do this would be multiple passes. Use a regular expression like:

\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]

This will replace each option group with its last option. If you run it repeatedly until it no longer matches, you should get the right result (the first pass will replace [[out1[[inside]]out2]] with [[out1insideout2]] and the second will ditch the brackets).

Edit 1: By way of explanation,

\[\[        # Opening [[
(?:         # A non-capturing group (we don't want this bit)
    [^\[\]] # Any character that is not a bracket
    *       # Zero or more of them
    \|      # A literal '|' character marking the end of the discarded options
)?          # This group is optional: if there is only one option, it won't be present
(           # The group we're actually interested in ($1)
    [^\[\]] # Any character that is not a bracket
    +       # Must be at least one
)           # End of $1
\]\]        # Closing ]]

Edit 2: Changed expression to ignore ']' as well as '[' (it works a bit better like that).

Edit 3: There is no need to know the number of nested brackets as you can do something like:

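$pattern = '/\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]/';
$oldtext = "";
$newtext = $text;
while ($newtext !== $oldtext)
{
    $oldtext = $newtext;
    // Keep only the last option ($1) of each innermost [[...]] group
    $newtext = preg_replace($pattern, '$1', $oldtext);
}
$text = $newtext;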

Basically, this keeps running the regular expression replace until the output is the same as the input.

Note that I don't know PHP, so there are probably syntax errors in the above.

Al
That is applicable if the number of nested brackets is known (e.g. if n<=MAX then pass it MAX times).
streetpc
@streetpc: I don't think you need to know the number of nested brackets, I'll edit the above to explain why.
Al
Nice one. I'm still trying to figure out how this works; thanks for the explanation!
cydo
A: 

Why not just simply remove any brackets that are left?

$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$str = preg_replace('/\\[\\[(?:[^|\\]]+\\|)+([^\\]]+)\\]\\]/', '$1', $str);
$str = str_replace(array('[', ']'), '', $str);
Gumbo
Looks nice, though I've replaced the last line with `$str = str_replace(array('[[', ']]'), '', $str);` so that something like [thisnot] is not touched.
cydo
I like this "2 pass" solution, even though I have to figure out how you did this option split.
cydo
His regular expression will only match all groups with options, then it will replace them to the last option, without square brackets. Then the replace removes all remaining square brackets. It's quicker than multiple passes, at the expense of accuracy (but this might not be a problem to the OP.)
Blixt
Regarding accuracy: `test5:[[abc[[def|ghi]]jkl|[[mno|pqr]]]]`
Blixt
Blixt: you're right. With this solution there is a problem with nested links.
cydo
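To make the accuracy point from the comments concrete, here is a small sketch that runs the two-step replacement from this answer and a repeat-until-stable loop (using the pattern from Al's answer) on Blixt's test5 string. The function names `twoStep` and `multiPass` are made up for this illustration and do not come from the thread.

```php
<?php
// Hypothetical helper: one option replace, then strip every single bracket.
function twoStep(string $str): string
{
    $str = preg_replace('/\[\[(?:[^|\]]+\|)+([^\]]+)\]\]/', '$1', $str);
    return str_replace(array('[', ']'), '', $str);
}

// Hypothetical helper: repeat the innermost-link replace until nothing changes.
function multiPass(string $str): string
{
    $pattern = '/\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]/';
    do {
        // $count receives the number of replacements made in this pass
        $str = preg_replace($pattern, '$1', $str, -1, $count);
    } while ($count > 0);
    return $str;
}

$input = 'test5:[[abc[[def|ghi]]jkl|[[mno|pqr]]]]';
echo twoStep($input), "\n";   // test5:ghijkl|pqr
echo multiPass($input), "\n"; // test5:pqr
```

The two-step version lets its first match run from the outer [[ to the inner ]], so the leftover text keeps a stray "|". The loop resolves the innermost links first and only then picks the last option of the outer link.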
A: 

Well, I didn't stick to just regex, because I'm of a mind that trying to do stuff like this with one big regex leads you to the old joke about "Now you have two problems". However, give something like this a shot:

$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$reg = '/(.*?):(.*?)( |$)/';
preg_match_all($reg, $str, $m);
foreach ($m[2] as $pos => $match) {
  // Only split on '|' when the value is a [[...]] link
  if (strpos($match, '|') !== FALSE && strpos($match, '[[') !== FALSE) {
    $opt = explode('|', $match);
    $match = $opt[count($opt) - 1];
  }
  $m[2][$pos] = str_replace(array('[', ']'), '', $match);
}

$result = array();
foreach ($m[1] as $k => $v) $result[$k] = $v.':'.$m[2][$k];
TML
Of course, this is just a first stab at it - you probably would want to do the work of the 2nd foreach in the first foreach loop, with some logic to decide when you should be pulling values, etc.
TML
A: 

This is impossible to do in one regular expression since you want to keep content in multiple "hierarchies" of the content. It would be possible otherwise, using a recursive regular expression.
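As a rough sketch of how a recursive pattern could be used (this is not part of the original answer): PCRE's `(?R)` can match a balanced [[...]] link, and a callback can then resolve the inner links before picking the last option. The function name `resolveLinks` is hypothetical, and the sketch assumes well-formed, balanced input.

```php
<?php
// Hypothetical helper: resolve [[...]] links in a single preg_replace_callback call.
function resolveLinks(string $text): string
{
    // [[ , then any mix of ordinary characters and nested links via (?R), then ]]
    $pattern = '/\[\[((?:(?!\[\[|\]\]).|(?R))*)\]\]/s';

    $result = preg_replace_callback($pattern, function (array $m): string {
        // Resolve nested links first, then keep the text after the last '|'
        $inner = resolveLinks($m[1]);
        $pos = strrpos($inner, '|');
        return $pos === false ? $inner : substr($inner, $pos + 1);
    }, $text);

    // preg_replace_callback returns null on a PCRE error; fall back to the input
    return $result ?? $text;
}

echo resolveLinks('test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not'), "\n";
// test1:link test2:silver test3:out1insideout2 test4:this|not
```

The pattern only finds the outermost links; the "keep content from every level" part is done by the callback calling itself, which is why a plain replacement string is not enough.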

Anyways, here's the simplest, most greedy regular expression I can think of. It should only replace if the content matches your exact requirements.

You will need to escape all backslashes when putting it into a string (\ becomes \\.)

\[\[((?:[^][|]+|(?!\[\[|]])[^|])++\|?)*]]

As others have already explained, you use this with multiple passes. Keep looping while there are matches, performing replacement (only keeping match group 1.)

Difference from other regular expressions here is that it will allow you to have single brackets in the content, without breaking:

test1:[[link]] test2:[[gold|si[lv]er]]
test3:[[out1[[in[si]de]]out2]] test4:this|not

becomes

test1:link test2:si[lv]er
test3:out1in[si]deout2 test4:this|not
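
The looping described above might look like the following in PHP. This is a minimal sketch: the function name `stripLinks` is made up for the example, and the pattern is the one from this answer written as a single-quoted string, where the backslashes do not need doubling.

```php
<?php
// Hypothetical helper: apply the pattern repeatedly until a pass replaces nothing.
function stripLinks(string $text): string
{
    $pattern = '/\[\[((?:[^][|]+|(?!\[\[|]])[^|])++\|?)*]]/';
    do {
        // Group 1 holds the last option of each innermost link;
        // $count receives the number of replacements made in this pass.
        $text = preg_replace($pattern, '$1', $text, -1, $count);
    } while ($count > 0);
    return $text;
}

$input = 'test1:[[link]] test2:[[gold|si[lv]er]] test3:[[out1[[in[si]de]]out2]] test4:this|not';
echo stripLinks($input), "\n";
// test1:link test2:si[lv]er test3:out1in[si]deout2 test4:this|not
```

Using the count reported by `preg_replace` avoids comparing the whole string before and after each pass.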
Blixt
oh. something I haven't thought about. let me take a look
cydo
Yeah I probably went overboard with my regular expression... But I like optimizing them for accuracy and speed because regular expressions are slow and it can be really noticeable in some cases if you're not careful.
Blixt
`test1:[[works]] test2:[[failed|works]] test3:[[out1[[inside]]out2]] test4:dont|replace test5:[[with[inner]bracket]] test6:[[nested[[link]]]] test7:[[it[[failed|works]]yesit[[failed|works]]]]` all work. Thanks!
cydo
A: 

This is C# using only verbatim (non-escaped) strings, so you will have to double the backslashes in other languages.

String input = "test1:[[link]] " +
               "test2:[[gold|silver]] " +
               "test3:[[out1[[inside]]out2]] " +
               "test4:this|not";

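// Exclude brackets from the first option so a match cannot start in one link and end in another
String step1 = Regex.Replace(input, @"\[\[([^|\[\]]+)\|([^\]]+)\]\]", @"[[$2]]");
String step2 = Regex.Replace(step1, @"\[\[|\]\]", String.Empty);

// Prints "test1:link test2:silver test3:out1insideout2 test4:this|not"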
Console.WriteLine(step2);
Daniel Brückner
A: 
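$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$s = preg_split("/\s+/", $str);
foreach ($s as $k => $v) {
    $j = explode(":", $v, 2);
    // Only drop leading options when the value is a [[...]] link,
    // so a bare "this|not" is left alone
    if (strpos($j[1], "[[") !== false) {
        $j[1] = preg_replace("/.*\|/", "", $j[1]);
    }
    $j[1] = preg_replace("/\[\[|\]\]/", "", $j[1]);
    print implode(":", $j) . "\n";
}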
ghostdog74