tags:

views:

250

answers:

5

Hi folks,

I try to parse articles from wikipedia. I use the *page-articles.xml file, where they backup all their articles in a wikicode-format. To strip the format and get the raw text, I try to use Regular Expressions, but I am not very used to it. I use C# as programming language. I tried a bit around with Expresso, a designer for Regular Expressions, but I am at the end of my wits. Here is what I want to achieve:

The text can contain the following structures: [[TextN]] or [[Text1|TextN]] or [[Text1|Text2|...|TextN]]

the [[ .... ]] pattern can appear within the Texti aswell. I want to replace these structure with TextN

For identifing the structures withhin the text I tried the following RegEx:

\[\[ ( .* \|?)* \]\]

Expresso seems to run and endless loop with this one. After 5 minutes for a relative small text, I canceled the Test Run.

Then I tried something more simple, I want to capture anything between the brackets:

\[\[ .* \]\]

but when I have a line like:

[[Word1]] text inbetween [[Word2]]

the expression returns the whole line, not

[[Word1]]

[[Word2]]

Any tips from Regex-Experts here to solve the problem?

Thanks in advance, Frank

A: 

This is because regular expressions tries to find always the longest matches possible. You should change .*

Try using

\[\[([A-Za-z][A-Za-z\d+]*)(\|\1)*\]\]

This will match only letters, | sign and numbers in double brackets + it checks if value starts with the letter.

RaYell
Won't work since aaginor expects tags to be nested: [[Tag1|[[Tag2]]|Tag3]]
SealedSun
As SealedSun points out, I need en RegEx that can deal with nested tags.
Aaginor
Regex does not handle nesting/recursion very well.
Rob Elliott
+2  A: 

\[\[(.*?\]\] would do it.

The key is the .*? which means get any characters but as few a possible.

EDIT

For nested tags one approach would be:

\[\[(?<text>(?>\[\[(?<Level>)|\]\](?<-Level>)|(?! \[\[ | \]\] ).)+(?(Level)(?!)))\]\]

This ensures that the [[ and ]] match across the text as well.

Lazarus
Will fail on nested tags: [[This is a [[NestedTag]].]]
SealedSun
As SealedSun points out, I need an RegEx that can deal with nested tags.
Aaginor
How do you want to deal with the nested tags? What do you want to do with the nested tag? Treat is separately or just remove the nested brackets?
Lazarus
Many thanks Lazarus! This RegEx handles the nested tags properly! I want to treat nested tags the same way as stated above. So from "[[Text11 | Text12 | [[Text21 | Text22]] Text13]]" I need "Text22 Text13" as the result. Always the last item of a list separeted by |. If that would be possible with a regex too, that would be nice, but getting this by hand isn't that much of a problem with the result of your regex above.
Aaginor
After checking the results from your Regex, I think the replacement of the inner tags won't be nessecary, because it seems, that they only occur in tags, that will be removed anyway.
Aaginor
A: 

If Expresso isn't working out for you, you may want to try RegexBuddy.

While not free, it does provide an excellent real time testing environment where you can see how your regex is going to match a section of sample text.

djch
+4  A: 

I wouldn't use regular expressions (since they don't handle recursion/nesting well).

Instead I would parse the text by hand*, which isn't particularly difficult in this case.

You could represent the text as a stream of elements whereas each element is either

  • a plain text chunk, or
  • a tag

A tag might contain multiple element streams, separated by |.

elementStream ::= element*
element ::= chunk | tag
chunk ::= TEXT
tag ::= "[[" elementStream otherStreams "]]"
otherStreams ::= "|" elementStream otherStreams

Your parser could represent each of those definitions with a method. So you'd have an elementStream method that would call element as long as there is text available and the next two characters are not "]]" or "|" (if you are inside a tag). Each call to element would return the element parsed, either a chunk or a tag. etc.

This would essentially be a recursive descent parser. Wikipedia: http://en.wikipedia.org/wiki/Recursive_descent_parser (the article is rather long/complicated, unfortunately)

SealedSun
Yap, I think pasing by hand will be the one. I do not really understand the method your are suggesting yet, but I just spend one or two brain cells on it.
Aaginor
You might want to have a look at the Wikipedia page I've added to the answer.
SealedSun
A: 

If GPL2 is not an issue for you, maybe you could check out the source code of Screwturn Wiki and see how an expert does it. It's in C#, BTW

Igor Brejc
I took a glimpse at the code. He uses Regex, but his also doesn't deal with nested tags.
Aaginor