ansaurus

Question

Answer 1

+1 A:

I think the best way is to merge all lines into one string, especially for the infobox.

Then something along the lines of

$reg = "\n(\* '''[^\n]*)";

for the first part (everything after a newline that starts with * ''' and runs up to the next newline).

And for the second part I'm not quite sure right now, but this is a nice place to play around a bit: http://www.solmetra.com/scripts/regex/index.php

And here is a short reference for regular expression syntax: http://www.regular-expressions.info/reference.html

Runeborg 2009-06-17 07:52:34

Answer 2

+3 A:

MediaWiki is open-source. Have a look at their source code ... ;-)

Philippe Gerber 2009-06-17 14:01:46

No better place to look than the actual implementation. :)

musicfreak 2009-06-18 07:19:47

Answer 3

+1 A:

I need to retrieve all lines in a single array which start with the pattern * '''

Enable multiline mode and ensure dotall mode is disabled, and use this:

^\* '''.*$

That expression dissected is:

(?xm-s) # Flags:
        # x enables comment mode (spaces ignored, hashes start comments)
        # m enables multiline mode (^ and $ match at line boundaries)
        # -s disables dotall (so . does not match newline)
^       # start of line
\*      # literal asterisk
[ ]     # literal space (needs brackets in comment mode, but not otherwise)
'''     # three literal apostrophes
.*      # any character (excluding newline), greedily matched zero or many times.
$       # end of line
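
As a rough illustration of how those flags map onto a real engine, here is a minimal sketch in Python. The language choice and the sample text are assumptions, since the thread does not say which language is in use. `re.MULTILINE` plays the role of the m flag, and leaving out `re.DOTALL` keeps dotall disabled.

```python
import re

# Hypothetical sample text; the real page content is not shown in the thread.
wikitext = (
    "Intro line\n"
    "* '''First''' item\n"
    "* plain bullet\n"
    "* '''Second''' item\n"
)

# re.MULTILINE makes ^ and $ match at line boundaries.
# re.DOTALL is deliberately not set, so . stops at the end of each line.
pattern = re.compile(r"^\* '''.*$", re.MULTILINE)

# findall returns every matching line as one list (the "single array").
lines = pattern.findall(wikitext)
print(lines)
```

The plain bullet is skipped because it lacks the three apostrophes right after the asterisk and space.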

Peter Boughton 2009-06-18 07:45:26

Answer 4

+1 A:

I need to extract the infobox ...

Try this, this time making sure dotall mode is enabled:

\{\{Infobox.*?(?=\}\} <!-- Infobox ends -->)

And again, explanation for that:

(?xs)    # x=comment mode, s=dotall mode
\{\{     # two opening braces (special char, so needs escaping here.)
Infobox  # literal text
.*?      # any char (including newlines), non-greedily match zero or more times.
(?=      # begin positive lookahead
\}\}     # two closing braces
[ ]<!--[ ]Infobox[ ]ends[ ]--> # literal text (spaces need brackets in comment mode)
)        # end positive lookahead

This will match up to (but excluding) the ending expression. If you need the ending included in the match, drop the lookahead wrapper and keep just its contents.
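
To make the difference concrete, here is a small sketch in Python (an assumed language; the thread does not name one) that runs both variants against a made-up page. The sample text and variable names are purely illustrative.

```python
import re

# Hypothetical sample; real infobox markup will differ.
page = (
    "Intro\n"
    "{{Infobox\n"
    "| name = X\n"
    "}} <!-- Infobox ends -->\n"
    "Body text\n"
)

# With the lookahead, the terminator is checked but not part of the match.
without_end = re.search(
    r"\{\{Infobox.*?(?=\}\} <!-- Infobox ends -->)", page, re.DOTALL
)

# Without the lookahead wrapper, the terminator is consumed and included.
with_end = re.search(
    r"\{\{Infobox.*?\}\} <!-- Infobox ends -->", page, re.DOTALL
)

print(repr(without_end.group(0)))
print(repr(with_end.group(0)))
```

Both searches need `re.DOTALL` so that `.*?` can cross the newlines inside the infobox.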

Update, based on comment to answer:

\{\{Infobox.*?(?=\n\}\}\n)

Same as above, but lookahead looks for two braces on their own line.

To optionally allow the comment also, use:

\{\{Infobox.*?(?=\n\}\}(?: <!-- Infobox ends -->)?\n)
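
A minimal sketch of this version in Python (assumed language), checking that braces nested on the same line as other text do not end the match early. The helper name `extract_infobox` and both sample pages are hypothetical, and the comment text is assumed to be spelled exactly as in the first expression above.

```python
import re
from typing import Optional

# Closing braces must sit on their own line, optionally followed by the
# end comment. DOTALL lets .*? span the lines inside the infobox.
INFOBOX_RE = re.compile(
    r"\{\{Infobox.*?(?=\n\}\}(?: <!-- Infobox ends -->)?\n)",
    re.DOTALL,
)


def extract_infobox(text: str) -> Optional[str]:
    """Return the infobox text without its closing braces, or None."""
    match = INFOBOX_RE.search(text)
    return match.group(0) if match else None


# Hypothetical page: a nested template sits on the same line as its field.
nested = (
    "Intro\n"
    "{{Infobox Person\n"
    "| born = {{birth date|1970|1|1}}\n"
    "| name = X\n"
    "}}\n"
    "Body\n"
)

# Hypothetical page that ends the infobox with the comment.
commented = "{{Infobox\n| a = 1\n}} <!-- Infobox ends -->\nBody\n"

print(repr(extract_infobox(nested)))
print(repr(extract_infobox(commented)))
```

The inner `}}` after the birth date is preceded by a digit rather than a newline, so the lookahead ignores it and the match runs on to the braces that stand alone on their line.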

Peter Boughton 2009-06-18 07:46:21

Thanks, but the issue with the infobox is that not all the pages have the infobox ending with the <!-- Infobox ends --> comment. From what I have noticed, the infobox definitely ends with two curly braces }} with a newline before and after, i.e. \n}}\n. The trick is that there can be curly braces within the string, but those within are on the same line. How do I solve this?

Ali 2009-06-18 07:57:10

As you suggest - use \n before and after - so with escaping that becomes \n\}\}\n

Peter Boughton 2009-06-18 08:12:56

Thanks a lot man for the help :)

Ali 2009-06-18 08:14:12


Need a simple Regular Expressions here

EDIT! HELP!
