I'm making a simple Textile parser and am trying to write a regular expression for "blockquote" but am having difficulty matching multiple new lines. Example:

bq. first line of quote
second line of quote
third line of quote

not part of the quote

It will be replaced with blockquote tags via preg_replace() so basically it needs to match everything between "bq." and the first double new line it comes across. The best I can manage is to get the first line of the quote. Thanks

A: 

Would this work?

'/(.+)\n\n/s'

I believe 's' stands for single line.

Maxwell Troy Milton King
`s` is usually called 'dot-all' (since it lets the `.` match line breaks as well). But doing what you suggest will cause the regex to match all text and then backtrack through the string to find the last `\n\n`. From the string `ABC\n\nDEF\n\nGHIJ\n\nKLM`, it will match `ABC\n\nDEF\n\nGHIJ\n\n`. Not what Fourjays is looking for, IMO.
Bart Kiers
That's true. Would have to make + non-hungry I guess.. Thanks.
Maxwell Troy Milton King
A: 

My instincts tell me something like...

preg_match("/^bq\. (.+?)\n\n/s", $input, $matches)

Just like the above fella says, the s flag after the / at the end of the RegEx means that the . will match new line characters. Without it, the . stops at the end of a line, so a match can't span multiple lines.

Then the question mark ? after the .+ denotes a non-greedy match, so that the .+ won't match as much as it can; instead it will match the minimum possible, so that the \n\n will match the first available double line break.
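To make the greedy/non-greedy difference concrete, here is a minimal sketch that reuses the sample string from the comment thread above. The variable names are just for illustration.

```php
<?php
// Sample string borrowed from the comment thread above.
$text = "ABC\n\nDEF\n\nGHIJ\n\nKLM";

// Greedy: .+ grabs everything, then backtracks to the LAST "\n\n".
preg_match('/(.+)\n\n/s', $text, $greedy);

// Non-greedy: .+? grows one character at a time and stops at the FIRST "\n\n".
preg_match('/(.+?)\n\n/s', $text, $lazy);

// $greedy[1] is "ABC\n\nDEF\n\nGHIJ", $lazy[1] is "ABC".
echo json_encode($greedy[1]), "\n";
echo json_encode($lazy[1]), "\n";
```

The only difference between the two patterns is the ? after the +.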

To what extent are you planning on supporting features of Textile? Because your RegEx can get pretty complicated, as Textile allows things like...

bq.. This is a block quote

This is still a block quote

or...

bq(funky). This is a block quote belonging to the class funky!

bq{color:red;}. Block quote with red text!

All of which your regex-replace technique won't be able to handle, methinks.

LeguRi
I'm only supporting what I'd consider the "most common" features (bold, italic, strikethrough, quotes, titles, paragraphs, bullets). Would only use the classes on the paragraph at the very most. So far the regex's work well for things like bold, but I suspect I'm going to have to do something more involved for the bulleted list.
Fourjays
... and nested bold/italics? I think that `*this is _not_ going to work with RegEx*`
LeguRi
Nested bold/italics seem to work fine. Using the regex's it does a straight replace of the contents of the bold "tag" so the italics are left untouched.
Fourjays
k, cool! Sorry to hassle you on the matter - I wrote my own Textile interpreter once for kicks and loved doing it, so I'm very interested.
LeguRi
No problem. I'm mostly doing it as an experiment too. Regex's are my nemesis though. ;)
Fourjays
A: 

Edit: Ehr, misread the question.. "bq." was significant.

echo preg_replace('/^bq\.(.+?)\n\n/s', '<blockquote>$1</blockquote>', $str, 1);

Sometimes data entered via web forms contains \r\n instead of just \n, which would make it:

echo preg_replace('/^bq\.(.+?)\r\n\r\n/s', '<blockquote>$1</blockquote>', $str, 1);

The question mark makes it add the closing blockquote tag after the first double return found ("non-greedy" I believe it's called), so any other double returns are left alone (if that is not what you want, take it out, obviously).
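If you don't know which line ending you'll get, one pattern can cover both. This is only a sketch under a few assumptions of mine: the function name quoteFirstBlock is made up, I added the m flag so that ^ also matches at the start of a line in the middle of the text, and I used a lookahead so the double return itself is left in place rather than swallowed.

```php
<?php
// Hypothetical helper, not part of the original answer.
function quoteFirstBlock(string $str): string
{
    // \r?\n accepts both "\n" and "\r\n".
    // (?=...) only checks for the double return; it does not consume it.
    return preg_replace(
        '/^bq\.[ \t]*(.+?)(?=(?:\r?\n){2})/sm',
        '<blockquote>$1</blockquote>',
        $str,
        1 // replace the first block only, as in the snippets above
    );
}

echo quoteFirstBlock("bq. first line\r\nsecond line\r\n\r\nnot part of the quote");
```

Note that a quote sitting at the very end of the input, with no double return after it, is not matched by this pattern.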

MSpreij
+4  A: 

Try this regex:

(?s)bq\.((?!(\r?\n){2}).)*+

meaning:

(?s)           # enable dot-all option
b              # match the character 'b'
q              # match the character 'q'
\.             # match the character '.'
(              # start capture group 1
  (?!          #   start negative look ahead
    (          #     start capture group 2
      \r?      #       match the character '\r' and match it once or none at all
      \n       #       match the character '\n'
    ){2}       #     end capture group 2 and repeat it exactly 2 times
  )            #   end negative look ahead
  .            #   match any character
)*+            # end capture group 1 and repeat it zero or more times, possessively

The \r?\n matches Windows, *nix and (newer) Mac OS line breaks. If you need to account for really old Mac computers, add the single \r as an alternative: \r?\n|\r
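One thing to watch when plugging this into preg_replace(): group 1 above is repeated, so after the match it only holds the last character it matched. A minimal sketch of one way around that, assuming you want the whole quote in $1: move the capture outside the repetition and make the inner groups non-capturing. The ^ anchor with the m flag and the optional space after "bq." are my additions, not part of the regex above.

```php
<?php
// Example input taken from the question.
$input = "bq. first line of quote\n"
       . "second line of quote\n"
       . "third line of quote\n"
       . "\n"
       . "not part of the quote";

// Same idea as the regex above, but the outer group now wraps the whole
// repetition, so $1 holds the complete quote text.
$pattern = '/^bq\. ?((?:(?!(?:\r?\n){2}).)*+)/sm';

$output = preg_replace($pattern, '<blockquote>$1</blockquote>', $input);

echo $output;
```

The double line break itself is never consumed, so the text after the quote keeps its spacing.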

Bart Kiers
I would probably use `{2}` in place of `{2,}` because you can't tell if eating all adjacent newlines is desirable. It's certainly not part of the requirement.
Tomalak
+1, very nice. Why the possessive quantifier, though, if nothing else follows the capturing group?
Tim Pietzcker
@Tomalak: agreed. Fixed.
Bart Kiers
@Tim: since the `.` matched by group 1 will never be one of the line breaks from `\n\n` (because of the negative look ahead), it is safe to match it possessively. That way, the regex engine does not keep track of all the states the `.` passes through (no need for backtracking). With relatively small strings it won't matter much, but when the strings get large, performance may very well improve by using the possessive quantifier.
Bart Kiers
This got the job done. Nice explanation too.
Fourjays
You're welcome Fourjays.
Bart Kiers
@Bart, did you use your auto-explainer for that? And have you gone public with it yet?
Alan Moore
@Alan, yes, I used my auto-explainer tool for that. It's still (heavily) under development, but you can download a beta-ish version here: http://big-o.nl/apps/pcreparser/ Besides an ASCII explanation, it also builds a nice JTree from the regex.
Bart Kiers