I am attempting to write a CF component that will parse wikiCreole text. I am having trouble getting the correct matches with some of my regular expressions, though. I feel like if I can just get my head around the first one, the rest will just click. Here is an example:

The following is sample input:

You can make things **bold** or //italic// or **//both//** or //**both**//.

Character formatting extends across line breaks: **bold,
this is still bold. This line deliberately does not end in star-star.

Not bold. Character formatting does not cross paragraph boundaries.

My first attempt was:

<cfset out = REreplace(out, "\*\*(.*?)\*\*", "<strong>\1</strong>", "all") />

Then I realized that it would not match where the closing ** is missing; in that case the bold text should end at the paragraph break (two consecutive carriage returns).

So I tried this:

<cfset out = REreplace(out, "\*\*(.*?)[(\*\*)|(\r\n\r\n)]", "<strong>\1</strong>", "all") />

and it is close but for some reason it gives you this:

You can make things <strong>bold</strong>* or //italic// or <strong>//both//</strong>* or //<strong>both</strong>*//.

Character formatting extends across line breaks: <strong>bold,</strong>
this is still bold. This line deliberately does not end in star-star.

Not bold. Character formatting does not cross paragraph boundaries.

Any ideas?

PS: If anyone has any suggestions for better tags, or a better title for this post I am all ears.

A: 

I always use a regex testing web page. It seems like I start from scratch every time I use regex.

Try using '$1' instead of \1 for this one - the replace is slightly different... but I think the pattern is what you need to get working.

Getting closer with this:

\*\*(.*?)\*\*|//(.*?)//

The tricky part is the //** or **//

Ok, first checking for **//bold//** then //**bold**// then **bold**, then //bold//

\*\*//(.*?)//\*\*|//\*\*(.*?)\*\*//|\*\*(.*?)\*\*|//(.*?)//

Kieveli
Thanks. I am using a testing page like this; I just can't seem to find the right regex to do what I am trying to do.
Ryan Guill
I tried the $1 but it put a literal $1 in there instead of the match.
Ryan Guill
The replace isn't working quite like I'd expected...
Kieveli
+1  A: 

You really should change your

(.*?)

to something like

[^*]*?

to match any character except the *. I don't know if that is the problem, but it could be that the any-character . is eating one of your stars. It is also a generally accepted "best practice", when trying to balance matching delimiters like the double star or HTML start/end tags, to explicitly exclude them from your match set for the inner text.

*Disclaimer, I didn't test this in ColdFusion for the nuances of the regex engine - but the idea should hold true.
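To see how the negated class behaves compared with the lazy dot, here is a rough sketch. It is written in Python rather than ColdFusion so it can be run as-is, which means Python's `re` module stands in for the CF engine and details may differ there. The names `LAZY_DOT`, `NEGATED` and `bold` are made up for this illustration.

```python
import re

# Hypothetical names; Python's re is an assumption standing in for ColdFusion.
LAZY_DOT = re.compile(r"\*\*(.*?)\*\*", re.DOTALL)  # original inner pattern
NEGATED = re.compile(r"\*\*([^*]*?)\*\*")           # inner text may not contain *

def bold(pattern, text):
    """Wrap each match of pattern in <strong> tags."""
    return pattern.sub(r"<strong>\1</strong>", text)

simple = "make things **bold** here"
lone_star = "**A * B**"
multiline = "**bold,\nstill bold**"

print(bold(NEGATED, simple))      # simple case is converted
print(bold(NEGATED, lone_star))   # left unchanged: the lone * blocks the match
print(bold(LAZY_DOT, lone_star))  # lazy dot handles the lone *
print(bold(NEGATED, multiline))   # [^*] also matches line breaks
```

So the negated class is fine for simple spans and does cross line breaks, but it gives up as soon as a single * appears inside the bold text.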

Goyuix
Thanks for that. That does seem to work somewhat better. Would this also match a carriage return though? If so, is there a way to exclude that?
Ryan Guill
That would fail for "**A * B**" which (presumably) should be replaced by "<strong>A * B</strong>".
Michael Carman
+6  A: 

The [...] represents a character class, so this:

[(\*\*)|(\r\n\r\n)]

Is effectively the same as this:

[()*|\r\n]

i.e. it matches a single character (such as one "*"), and the "|" is a literal pipe, not an alternation.
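The stray star in the output can be reproduced with a small sketch. This uses Python's `re` module as a stand-in (an assumption; ColdFusion's engine was not used here), and `original`, `text` and `result` are just illustrative names.

```python
import re

# The asker's second pattern, unchanged. The bracketed part is a character
# class, so it consumes exactly one character from the set ( ) * | \r \n.
original = r"\*\*(.*?)[(\*\*)|(\r\n\r\n)]"

text = "make things **bold** or **//both//**"
result = re.sub(original, r"<strong>\1</strong>", text)
print(result)

# The class matches one character at a time, never the two-character "**".
print(re.fullmatch(r"[(\*\*)|(\r\n\r\n)]", "*") is not None)   # True
print(re.fullmatch(r"[(\*\*)|(\r\n\r\n)]", "|") is not None)   # True
print(re.fullmatch(r"[(\*\*)|(\r\n\r\n)]", "**") is not None)  # False
```

Only the first closing star is consumed by the match, which is why the second one is left behind after each `</strong>`.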

Another problem is that you replace the double linefeed. Even if your match succeeded you would end up merging paragraphs. You need to either restore it or not consume it in the first place. I'd use a positive lookahead to do the latter.

In Perl I'd write it this way:

$string =~ s/\*\*(.*?)(?:\*\*|(?=\n\n))/<strong>$1<\/strong>/sg;

Taking a wild guess, the ColdFusion probably looks like this:

REreplace(out, "\*\*(.*?)(?:\*\*|(?=\r\n\r\n))", "<strong>\1</strong>", "all")
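As a sanity check of the idea (not of the ColdFusion syntax), here is the same pattern sketched in Python. Python's `re` is an assumption standing in for CF; `re.DOTALL` plays the role of Perl's /s flag, and `\r?\n` is my own tweak so the sample works with either line-ending style. `BOLD`, `text` and `result` are illustrative names.

```python
import re

# Lazy match that stops at a closing ** or just before a blank line.
# The lookahead (?=...) checks for the paragraph break without consuming it.
BOLD = re.compile(r"\*\*(.*?)(?:\*\*|(?=\r?\n\r?\n))", re.DOTALL)

text = (
    "You can make things **bold** here.\n"
    "\n"
    "Formatting extends across line breaks: **bold,\n"
    "this is still bold.\n"
    "\n"
    "Not bold."
)

result = BOLD.sub(r"<strong>\1</strong>", text)
print(result)
```

The unclosed bold is ended right before the blank line, and the blank line itself survives, so the paragraphs are not merged.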
Michael Carman
This doesn't seem to match anything at all, but I see what you are saying about the [] only matching one char. I am not sure what the : is doing, is it possible that the syntax for that is different in CF?
Ryan Guill
The (?:...) is a non-capturing grouping. It bounds the alternation between the literal "**" and the "\n\n". From what I can see the only syntax difference (here) is that in CF a "." matches newline where in Perl it doesn't by default.
Michael Carman
Okay, I changed it to this: \*\*([^*]*?)(?:\*\*|(?=\r\n)) and it starts to work, except I think it needs to be changed to look only for double carriage returns. Two in a row is a new paragraph. So would that be: \*\*([^*]*?)(?:\*\*|(?=(\r\n){2})) Is that right?
Ryan Guill
Basically, yeah, although I'd use either "\r\n\r\n" or "(?:\r\n){2}" instead. The "(\r\n){2}" will capture. Apparently in ColdFusion "\n" is just a linefeed and not a semi-magical platform-neutral newline. I've updated the answer to reflect this.
Michael Carman
A: 

I find this app immensely helpful when I'm doing anything with regex: http://www.gskinner.com/RegExr/desktop/ Still doesn't help with your actual issue, but could be useful going forward.

Ryan McIlmoyl