ansaurus

Question

Answer 1

A:

This is because regular expressions tries to find always the longest matches possible. You should change .*

Try using

\[\[([A-Za-z][A-Za-z\d+]*)(\|\1)*\]\]

This will match only letters, | sign and numbers in double brackets + it checks if value starts with the letter.

RaYell 2009-07-28 11:29:25

Won't work since aaginor expects tags to be nested: [[Tag1|[[Tag2]]|Tag3]]

SealedSun 2009-07-28 11:34:15

As SealedSun points out, I need en RegEx that can deal with nested tags.

Aaginor 2009-07-28 11:55:00

Regex does not handle nesting/recursion very well.

Rob Elliott 2009-07-28 12:02:58

Answer 2

+2 A:

\[\[(.*?\]\] would do it.

The key is the .*? which means get any characters but as few a possible.

EDIT

For nested tags one approach would be:

\[\[(?<text>(?>\[\[(?<Level>)|\]\](?<-Level>)|(?! \[\[ | \]\] ).)+(?(Level)(?!)))\]\]

This ensures that the [[ and ]] match across the text as well.

Lazarus 2009-07-28 11:29:26

Will fail on nested tags: [[This is a [[NestedTag]].]]

SealedSun 2009-07-28 11:35:33

As SealedSun points out, I need an RegEx that can deal with nested tags.

Aaginor 2009-07-28 11:55:35

How do you want to deal with the nested tags? What do you want to do with the nested tag? Treat is separately or just remove the nested brackets?

Lazarus 2009-07-28 13:15:17

Many thanks Lazarus! This RegEx handles the nested tags properly! I want to treat nested tags the same way as stated above. So from "[[Text11 | Text12 | [[Text21 | Text22]] Text13]]" I need "Text22 Text13" as the result. Always the last item of a list separeted by |. If that would be possible with a regex too, that would be nice, but getting this by hand isn't that much of a problem with the result of your regex above.

Aaginor 2009-07-28 14:34:06

After checking the results from your Regex, I think the replacement of the inner tags won't be nessecary, because it seems, that they only occur in tags, that will be removed anyway.

Aaginor 2009-07-28 14:41:15

Answer 3

A:

If Expresso isn't working out for you, you may want to try RegexBuddy.

While not free, it does provide an excellent real time testing environment where you can see how your regex is going to match a section of sample text.

djch 2009-07-28 11:32:50

Answer 4

+4 A:

I wouldn't use regular expressions (since they don't handle recursion/nesting well).

Instead I would parse the text by hand*, which isn't particularly difficult in this case.

You could represent the text as a stream of elements whereas each element is either

a plain text chunk, or
a tag

A tag might contain multiple element streams, separated by |.

elementStream ::= element*
element ::= chunk | tag
chunk ::= TEXT
tag ::= "[[" elementStream otherStreams "]]"
otherStreams ::= "|" elementStream otherStreams

Your parser could represent each of those definitions with a method. So you'd have an elementStream method that would call element as long as there is text available and the next two characters are not "]]" or "|" (if you are inside a tag). Each call to element would return the element parsed, either a chunk or a tag. etc.

This would essentially be a recursive descent parser. Wikipedia: http://en.wikipedia.org/wiki/Recursive_descent_parser (the article is rather long/complicated, unfortunately)

SealedSun 2009-07-28 11:40:15

Yap, I think pasing by hand will be the one. I do not really understand the method your are suggesting yet, but I just spend one or two brain cells on it.

Aaginor 2009-07-28 12:31:57

You might want to have a look at the Wikipedia page I've added to the answer.

SealedSun 2009-07-28 13:52:10

Answer 5

A:

If GPL2 is not an issue for you, maybe you could check out the source code of Screwturn Wiki and see how an expert does it. It's in C#, BTW

Igor Brejc 2009-07-28 11:58:22

I took a glimpse at the code. He uses Regex, but his also doesn't deal with nested tags.

Aaginor 2009-07-28 12:29:14

ansaurus

tags:

views:

answers:

Regex for parsing Wikicode in C#

related questions