Hi, I have the following HTML

<p>Some text <a title="link" href="http://link.com/" target="_blank">my link</a> more 
text <a title="link" href="http://link.com/" target="_blank">more link</a>.</p>
<p>Another paragraph.</p>
<p>[code:cf]</p>
<p>&lt;cfset ArrFruits = ["Orange", "Apple", "Peach", "Blueberry", </p>
<p>"Blackberry", "Strawberry", "Grape", "Mango", </p>
<p>"Clementine", "Cherry", "Plum", "Guava", </p>
<p>"Cranberry"]&gt;</p>
<p>[/code]</p>
<p>Another line</p>
<p><img src="http://image.jpg" alt="Array" />
</p>
<p>More text</p>
<p>[code:cf]</p>
<p>&lt;table border="1"&gt;</p>
<p> &lt;cfoutput&gt;</p>
<p> &lt;cfloop array="#GroupsOf(ArrFruits, 5)#" index="arrFruitsIX"&gt;</p>
<p>  &lt;tr&gt;</p>
<p> &lt;cfloop array="#arrFruitsIX#" index="arrFruit"&gt;</p>
<p>     &lt;td&gt;#arrFruit#&lt;/td&gt;</p>
<p> &lt;/cfloop&gt;</p>
<p>  &lt;/tr&gt;</p>
<p> &lt;/cfloop&gt;</p>
<p> &lt;/cfoutput&gt;</p>
<p>&lt;/table&gt;</p>
<p>[/code]</p>
<p>With an output that looks like:</p>
<p><img src="another_image.jpg" alt="" width="342" height="85" /></p>

What I'm trying to do is write a regular expression that removes every <p> and </p> tag, replacing each </p> with a line break.

So far, my pattern looks like this:

/\<p\>(.*?)(<\/p>)/g

And I'm replacing the matches with:

$1\n

It all looks good, but it also replaces the <p> tags inside the [code][/code] blocks, which should be left untouched. In other words, I would like to get rid of the <p> tags only when the content isn't inside [code] tags.

I can never get negation right. I know it will be something along the lines of

\<p\>^\[code*\](.*?)(<\/p>)

But obviously this doesn't work :-)

Could anyone please lend me a hand with this regex?

BTW, I know I shouldn't be using regular expressions to parse HTML at all. I'm fully aware of that, but still, for this specific case, I'd like to use regex.

Thanks in advance

+1  A: 

I assume that you have special knowledge about the application that generated the HTML you are venturing to parse; otherwise you would not even be considering regular expressions for the task. (Part of that, I assume, is the knowledge that <p> tags always appear after a newline and that </p> closing tags always appear before a newline.)

That said, you cannot easily or efficiently achieve what you are after with regular expressions alone. You would have to use complex nested look-behind and look-ahead assertions to verify that a <p>...</p> occurrence is not inside a [code]...[/code] block. Non-fixed-length look-behind assertions are particularly limited and, IIRC, plain buggy prior to JDK 1.6.

You should first iterate over the input sequence, breaking it down into code and non-code chunks, and then transfer the chunks into the output sequence either unchanged (for code chunks) or with the <p>...</p> substitution applied via regex or simple string replacement (for non-code chunks).

It is up to you whether (and how) you deal with nested or mismatched code chunks.
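
The chunking approach described above could be sketched roughly as follows. This is only an illustration in Python (the question does not say which language is in use). The function name strip_p_outside_code and the two pattern names are made up for this example. It assumes that code blocks are never nested, that every [code...] has a matching [/code], and that the <p> and </p> directly wrapping a marker belong to the code chunk. The DOTALL flag is also an assumption, added so that a paragraph spanning two lines (like the first <img> one) still matches.

```python
import re

# One [code...]...[/code] block, including the <p> and </p> that wrap the
# opening and closing markers. Nested or mismatched blocks are NOT handled.
CODE_BLOCK = re.compile(
    r"((?:<p>)?\[code[^\]]*\].*?\[/code\](?:</p>)?)", re.DOTALL
)

# The asker's pattern: one paragraph with its content captured lazily.
PARAGRAPH = re.compile(r"<p>(.*?)</p>", re.DOTALL)


def strip_p_outside_code(html: str) -> str:
    """Replace <p>...</p> with its content plus a newline, outside code blocks only."""
    # Splitting on a pattern with a capturing group keeps the delimiters, so
    # the list alternates: non-code, code, non-code, code, ...
    chunks = CODE_BLOCK.split(html)
    out = []
    for i, chunk in enumerate(chunks):
        if i % 2 == 1:
            out.append(chunk)  # code chunk: copy unchanged
        else:
            out.append(PARAGRAPH.sub(r"\1\n", chunk))  # non-code chunk
    return "".join(out)


sample = (
    "<p>Some text</p>\n"
    "<p>[code:cf]</p>\n"
    "<p>&lt;cfset x = 1&gt;</p>\n"
    "<p>[/code]</p>\n"
    "<p>More text</p>"
)
print(strip_p_outside_code(sample))
```

Because each non-code chunk is processed on its own, a substitution can never reach across a code block, which is what makes this simpler to reason about than a single large pattern.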

Cheers, V.

vladr
A: 

The syntax for a negative lookahead is (?!...).

(?![code.*?]([^\[]|\[\/[^c]|\[\/c[^o]|\[\/co[^d]|\[\/cod[^e]|\[\/code[^\]])*)<p>.*?</p>
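
As a rough illustration of how a negative lookahead might be applied to this problem, here is a sketch in Python (an arbitrary choice, since the question does not name a language). It is not the pattern above. The names PARAGRAPH_OUTSIDE_CODE and strip_p_with_lookahead are hypothetical, and the sketch assumes code blocks are never nested and always closed. The idea is that a paragraph is inside a code block if a [/code] appears ahead of it before any new [code.

```python
import re

PARAGRAPH_OUTSIDE_CODE = re.compile(
    # Opening tag, unless the paragraph is itself a [code...] or [/code] marker.
    r"<p>(?!\[/?code)"
    # Paragraph body: any characters, but never running past its own </p>.
    r"((?:(?!</p>)[\s\S])*)</p>"
    # Reject the match if a [/code] lies ahead before the next [code,
    # because that means this paragraph sits inside a code block.
    r"(?!(?:(?!\[code)[\s\S])*\[/code\])"
)


def strip_p_with_lookahead(html: str) -> str:
    """Replace <p>...</p> with its content plus a newline, skipping code blocks."""
    return PARAGRAPH_OUTSIDE_CODE.sub(r"\1\n", html)


sample = (
    "<p>Some text</p>\n"
    "<p>[code:cf]</p>\n"
    "<p>&lt;cfset x = 1&gt;</p>\n"
    "<p>[/code]</p>\n"
    "<p>More text</p>"
)
print(strip_p_with_lookahead(sample))
```

Note that the trailing lookahead rescans the rest of the input for every candidate paragraph, so this may get slow on large pages.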

SHiNKiROU
This leaves the <p> tags on only the first line of my code block; every other line loses them, and it only works for one code block on the page.
Marcos Placona
The code you posted doesn't seem to do anything now
Marcos Placona
+1  A: 

I know I shouldn't be using regular expressions to parse HTML at all. I'm fully aware of that, but still, for this specific case, I'd like to use regex.

Can you explain this a bit more?

Will
"I know I shouldn't pound nails with a screwdriver, but this time, I'd like to use a screwdriver." Just Say No!
TrueWill