<![Apple]!>some garbage text may be here<![Banana]!>some garbage text may be here<![Orange]!><![Pear]!><![Pineapple]!>

In the string above, I would like a regex that matches every <![FruitName]!>. There may be some garbage text between these tags. My first attempt looks like this:

<!\[[^\]!>]+\]!>

It works, but as you can see I've used this part:

[^\]!>]+

This kills some innocents: if the fruit name contains any one of the characters ] ! > then it is discarded, and we love eating fruit so much that this should not happen.

How do we construct a regex that disallows only the exact string ]!> inside the FruitName, so that all of these fruits can still be matched?

The example above is made up; I just want to know what the pattern would look like if it has to be done with a regex.

+6  A: 

The simplest way would be <!\[.+?]!> - just don't care about what is matched between the two delimiters at all. Only make sure that it always matches the closing delimiter at the earliest possible opportunity - therefore the ? to make the quantifier lazy.

(Also, there is no need to escape the ] here, since it is outside a character class.)
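A quick way to check the lazy pattern is a small Python sketch. Python's re module is an assumption here, since the question does not name a language, and the sample string with the awkward second fruit name is made up for illustration. It suggests that a name containing ], ! and > individually still survives, because the match only stops at the first full ]!>.

```python
import re

# Lazy pattern from the answer: after "<![", take as little as possible
# until the first "]!>". The "]" is left unescaped, as the answer suggests.
TAG = re.compile(r"<!\[.+?]!>")

# Hypothetical sample: the second fruit name contains ], ! and > on their own.
text = "<![Apple]!>garbage<![Ba]na!na>]!>more garbage<![Pear]!>"

tags = TAG.findall(text)
for tag in tags:
    print(tag)
```

With a greedy .+ in place of .+? the whole string would come back as a single match, which is why the ? matters.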

About the specification that the sequence ]!> should be "disallowed" within the fruit name - well that's implicit since it is the closing delimiter.
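If you would rather spell the exclusion out, one common way is a negative lookahead in front of every character of the name. This pattern is not from the answer above; it is only a sketch in Python (an assumed language) to show that, for this input, the explicit form and the lazy form appear to give the same names.

```python
import re

# Lazy form from the answer, with a capture group added around the name.
LAZY = re.compile(r"<!\[(.+?)]!>")

# Explicit form (an illustration, not from the answer): a character is
# consumed only if the closing delimiter "]!>" does not start at it.
EXPLICIT = re.compile(r"<!\[((?:(?!]!>).)+)]!>")

# Hypothetical sample with a name that contains ], ! and > individually.
text = "<![Apple]!>junk<![Ki]wi!>]!><![Pear]!>"

lazy_names = LAZY.findall(text)
explicit_names = EXPLICIT.findall(text)
print(lazy_names)
print(explicit_names)
```

The explicit form is mostly useful when more pattern follows the closing delimiter, because a lazy quantifier can be stretched by backtracking while the lookahead version cannot cross a ]!>.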

Tim Pietzcker
+1 Didn't know you could use `]` without escape :)
Andomar
+1  A: 

To match a fruit name, you could use:

<!\[(.*?)]!>

After the opening <![, this matches the least amount of text that is followed by ]!>, because .*? is lazy where a plain .* would be greedy.

Here's a full regex to match each fruit with the following text:

<!\[(.*?)]!>(.*?)(?=(<!\[)|$)

This uses positive lookahead (?=xxx) to match the beginning of the next tag or end-of-string. Positive lookahead matches but does not consume, so the next fruit can be matched by another application of the same regex.
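As a rough sketch of how that could be used, here is the same regex applied repeatedly in Python (an assumed language; the shortened sample string is made up). Each match yields the fruit name in group 1 and the text that follows it in group 2.

```python
import re

# Full regex from the answer: name, trailing text, then a lookahead for
# the next opening "<![" or the end of the string.
PAIR = re.compile(r"<!\[(.*?)]!>(.*?)(?=(<!\[)|$)")

# Hypothetical sample string.
text = "<![Apple]!>some garbage<![Banana]!>more garbage<![Pear]!>"

# The lookahead does not consume the next "<![", so finditer can start
# the next match exactly there.
pairs = [(m.group(1), m.group(2)) for m in PAIR.finditer(text)]
for name, trailing in pairs:
    print(repr(name), repr(trailing))
```

Note that . does not match a newline by default in Python, so input spanning several lines would probably need re.DOTALL.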

Andomar
Your information is very useful.
bobo
+1  A: 

Depending on the language you are using, you can use its built-in string methods instead: a simple split plus a simple regex that is easier to understand. Split the string using "!>" as the separator. Go through each field and check for <!. If it is found, remove everything from the start of the field up to and including <!. This gives you all the fruits. I use gawk to demonstrate, but the algorithm can be implemented in any language.

For example, with gawk:

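# set the field separator to !>
awk -F'!>' '
{
  # for each field
  for (i = 1; i <= NF; i++) {
    # check whether the field contains <!
    if ($i ~ /<!/) {
      # remove everything from the start of the field through <!
      sub(/.*<!/, "", $i)
      # print only fields that held a tag, so the empty field after
      # the last !> is skipped
      print $i
    }
  }
}
' file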

Output:

# ./run.sh
[Apple]
[Banana]
[Orange]
[Pear]
[Pineapple]

No complicated regex needed.
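The same splitting idea could look roughly like this in Python. This is only a sketch: the function name fruits_by_splitting is made up, and Python is an assumption since the question does not name a language.

```python
def fruits_by_splitting(text):
    """Split on "!>" and keep what follows the last "<!" in each field."""
    names = []
    for field in text.split("!>"):
        # rfind mirrors the greedy /.*<!/ in the awk version: everything
        # up to and including the last "<!" is dropped.
        start = field.rfind("<!")
        if start != -1:
            names.append(field[start + 2:])
    return names


sample = (
    "<![Apple]!>some garbage text may be here<![Banana]!>"
    "some garbage text may be here<![Orange]!><![Pear]!><![Pineapple]!>"
)
fruits = fruits_by_splitting(sample)
print("\n".join(fruits))
```

One caveat: because the split is on "!>" alone, a fruit name that itself contains !> would be cut in two, so this does not solve the original problem for such names.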

ghostdog74
I just want to know what regex syntax should be used. I've never heard of Gawk before. Thanks a lot for introducing me to this text-manipulating language.
bobo
Several people have posted regex solutions, so I guess you can look at those.
ghostdog74
If you want to learn about gawk, go to http://www.gnu.org/software/gawk/manual/
ghostdog74