ansaurus

Question

Process a block of HTML, ignoring content within specific tags

Answer 1

+3 A:

Hi,

The first solution that comes to my mind looks like this :

extract all the codes
remove the codes, replacing them with a special marker, that will not be affected by your string manipulations -- that marker has to be really special (and you could verify it's not present in the input string, btw)
do your manipulations on the string
put back the codes, where there are markers now

In code, it could be something like this (sorry, it's quite long, and I didn't include any checks; it's up to you to add those):

$str = <<<A
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec sodales lacus et erat accumsan consectetur. Sed lacinia enim vitae erat suscipit fermentum. Quisque lobortis nisi et lacus imperdiet ac malesuada dui imperdiet. <pre><code>ThIs Is 
CoDe 1</code></pre>Donec vestibulum commodo quam rhoncus luctus. Nam vitae ipsum sed nibh dignissim condimentum. Sed ultrices fermentum dapibus. Vivamus mattis nisi nec enim convallis quis aliquet arcu accumsan. Suspendisse potenti. Nullam eget fringilla nunc. Nulla porta justo justo. Nunc consectetur egestas malesuada. Mauris ac nisi ipsum, et accumsan lorem. Quisque interdum accumsan pellentesque. Sed at felis metus. Nulla gravida tincidunt tortor, <pre><code>AnD cOdE 2</code></pre>nec aliquam tortor ultricies vel. Integer semper libero eu magna congue eget lacinia purus auctor. Nunc volutpat ultricies feugiat. Nullam id mauris eget ipsum ultricies ullamcorper non vel risus. Proin volutpat volutpat interdum. Nulla orci odio, ornare sit amet ullamcorper non, condimentum sagittis libero. <pre><code>aNd
CoDe
NuMbEr 3
</code></pre>Ut non justo at neque convallis luctus ultricies amet. 
A;
var_dump($str);

// Extract the codes
$matches = array();
preg_match_all('#<pre><code>(.*?)</code></pre>#s', $str, $matches);
var_dump($matches);

// Remove the codes
$str_nocode = preg_replace('#<pre><code>.*?</code></pre>#s', 'THIS_IS_A_NOCODE_MARKER', $str);
var_dump($str_nocode);

// Do whatever you want with $str_nocode
$str_nocode = strtoupper($str_nocode);
var_dump($str_nocode);

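// And put back the codes, one marker at a time.
// A callback is used so that "$1" or "\1" inside a code block is inserted
// literally instead of being read as a backreference by preg_replace().
$str_codes = $str_nocode;
foreach ($matches[0] as $code) {
    $str_codes = preg_replace_callback('#THIS_IS_A_NOCODE_MARKER#', function () use ($code) {
        return $code;
    }, $str_codes, 1);
}
var_dump($str_codes);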

I've tried with :

code on one line,
code on 2 lines,
and code on multiple lines

Note : you should really test more than I did -- but this could give you a first idea...
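The checks left out above (a marker that is verified to be absent from the input, and code put back literally) could be sketched roughly as follows. This is only an illustration under assumptions: process_outside_code() and the "@@n:i@@" marker format are made-up names, and it assumes your own processing leaves "@", ":" and digits untouched.

```php
<?php
// Hypothetical helper (not from the original answer): hides <pre><code> blocks
// behind markers, runs $process on the rest, then restores the blocks.
function process_outside_code(string $html, callable $process): string
{
    $pattern = '#<pre><code>.*?</code></pre>#s';

    // Find a marker prefix that does not occur anywhere in the input.
    $salt = 0;
    while (strpos($html, "@@{$salt}:") !== false) {
        $salt++;
    }

    // Replace each block with its own numbered marker and remember the block.
    $blocks = array();
    $stripped = preg_replace_callback(
        $pattern,
        function ($m) use (&$blocks, $salt) {
            $marker = "@@{$salt}:" . count($blocks) . '@@';
            $blocks[$marker] = $m[0];
            return $marker;
        },
        $html
    );

    $processed = $process($stripped);

    // strtr() does plain string substitution, so "$1" or "\" in a block stays literal.
    return $blocks ? strtr($processed, $blocks) : $processed;
}

echo process_outside_code('some text <pre><code>KeEp $1 Me</code></pre> more text', 'strtoupper');
// prints: SOME TEXT <pre><code>KeEp $1 Me</code></pre> MORE TEXT
```

The text is still processed only once, and because every block gets its own marker, the order of the markers no longer matters when putting the code back.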

Hope this helps :-)

As a side note: parsing HTML with regexes is generally considered bad practice and often leads to trouble. Maybe something like DOMDocument::loadHTML would be worth a look?
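For what it's worth, a DOM-based version might look something like the sketch below. It is an untested-in-production illustration, not a recommendation: process_text_nodes() and the "wrap" wrapper id are hypothetical, the input is assumed to be an ASCII (or otherwise correctly decoded) HTML fragment, and the serializer may normalize the markup on the way out.

```php
<?php
// Hypothetical helper: applies $process to every text node that is not inside a <pre>.
function process_text_nodes(string $html, callable $process): string
{
    // Wrap the fragment in a known element so it can be found again afterwards.
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc->loadHTML('<div id="wrap">' . $html . '</div>');
    libxml_clear_errors();

    // Select text nodes under the wrapper, skipping anything with a <pre> ancestor.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//div[@id="wrap"]//text()[not(ancestor::pre)]') as $node) {
        $node->nodeValue = $process($node->nodeValue);
    }

    // Serialize the wrapper's children back into an HTML fragment.
    $out = '';
    $wrapper = $xpath->query('//div[@id="wrap"]')->item(0);
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo process_text_nodes('some text <pre><code>KeEp Me</code></pre> more text', 'strtoupper');
```

Note that here the processing callback only ever sees plain text, one node at a time, so this fits processing that works on text rather than on markup.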

Pascal MARTIN 2009-07-20 19:43:41

Ahh I think I get it. As far as DOMDocument goes: so you're thinking it's better to use something like DOMDocument::getElementsByTagName to pull the <pre> block than using preg_match?

Darren Newton 2009-07-20 20:06:30

Didn't say it *is*, but it *might* be, in some cases (yeah, I'm kinda overly careful :-D ) ^^ I don't know much about your application, and if you really know what kind of stuff comes in, a couple of regexes would be just fine (in your situation, I would probably use those, and only look for another solution if the regexes don't work after some tuning). But in a more complex case, regexes are often not the best tool for HTML "parsing". About getElementsByTagName, maybe; I'd probably first try something with XPath, though, at least for fun ^^

Pascal MARTIN 2009-07-20 21:23:09

Thanks Pascal, this works well. I'm using microtime() to salt the token so it's always unique and shouldn't collide with any user text. I really like rojoca's solution below, but I am going to use this one as I only have to process the text once, instead of multiple times.

Darren Newton 2009-07-21 12:05:57

OK :-) You're welcome !

Pascal MARTIN 2009-07-21 17:07:47

Answer 2

A:

I recommend using Textile which allows for markdown-like text formatting and HTML. It's super easy to use and I think it should solve the problem if I understand it right.

Jesse Kochis 2009-07-20 19:55:45

Answer 3

A:

2009-07-20 19:56:34

I'm doing a lot more than converting quotes and I am not stripping tags. The point is it shouldn't matter what processing I'm doing, I don't want it to happen to specific chunks within the text.

Darren Newton 2009-07-20 20:00:48

You should take a look at http://stackoverflow.com/editing-help

Gumbo 2009-07-20 20:12:03

You're missing the point. I know how markdown works. I'm asking how to selectively process only certain parts of a text while ignoring others.

Darren Newton 2009-07-20 20:14:16

I don't think Gumbo was talking to you, darren_n (though that was my first impression, too). I believe he was advising Brent to learn how to use the formatting tools here on SO--advice which I heartily endorse.

Alan Moore 2009-07-21 01:26:17

Ah sorry Gumbo! My Bad!

Darren Newton 2009-07-21 11:33:47

Answer 4

+1 A:

If you're getting everything you need back from preg_match_all() then you could leverage preg_split() like so:

$pattern = '/(<pre><code>(.*?)<\/code><\/pre>)/s';

// get the code blocks
preg_match_all($pattern, $text, $matches);
$code_blocks = $matches[0];

// split up the text around the code blocks into an array
$unprocessed = preg_split($pattern, $text);
$processed_text = '';
foreach($unprocessed as $block) {

    // process the text here
    $processed_text .= process($block); 

    // add the next code block
    if(!empty($code_blocks)) $processed_text .= array_shift($code_blocks);
}

// any remaining
$processed_text .= implode('', $code_blocks);

This has the unfortunate drawback of requiring multiple process() calls, so depending on how intensive that is and how often you do it, this may not be the best solution. It is pretty clear and safe, though, and you don't have to add any special markers for replacement later.
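As a possible variant, preg_split() can hand back the code blocks itself when given the PREG_SPLIT_DELIM_CAPTURE flag, which would make the separate preg_match_all() call unnecessary. A minimal sketch, where process_around_code() is a made-up name and $process stands in for whatever processing you do:

```php
<?php
// Hypothetical helper: splits once and keeps the code blocks in the result array.
function process_around_code(string $text, callable $process): string
{
    // A single capturing group around the whole block, so each block comes back as one piece.
    $pattern = '#(<pre><code>.*?</code></pre>)#s';
    $pieces = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    foreach ($pieces as $i => $piece) {
        // Even indexes hold the text between blocks, odd indexes hold the captured blocks.
        $out .= ($i % 2 === 0) ? $process($piece) : $piece;
    }
    return $out;
}

echo process_around_code('ab <pre><code>Cd</code></pre> ef', 'strtoupper');
// prints: AB <pre><code>Cd</code></pre> EF
```

This still calls the processing function once per text chunk, so the drawback described above applies here as well.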

If your processing is doing whitelisting and safety-type stuff, then have a look at HTMLPurifier, which can do some sophisticated filtering of HTML that may let you avoid this type of stuff altogether (don't quote me on that though).

rojoca 2009-07-21 00:26:02

This is nice, and your solution follows along the same thought processes I was having to solve the problem. I am going with Pascal's answer because I only have to process the text once, but I will probably be using a variant of your solution in the future.

Darren Newton 2009-07-21 12:08:06
