Hi, I want to dynamically remove specific tags and their content from an HTML file. I thought of using preg_replace but can't get the syntax right. Basically it should, for example, replace everything between (and including) "" with nothing.

Could anybody help me out with this, please?

+2  A: 

I would suggest not trying to do this with a regular expression. A safer approach would be to use something like

Simple HTML DOM

Here is the link to the API Reference: Simple HTML DOM API Reference

Another option would be to use DOMDocument

The idea here is to use a real HTML parser to parse the data and then you can move/traverse through the tree and remove whichever elements/attributes/text you need to. This is a much cleaner approach than trying to use a regular expression to replace data within the HTML.

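<?php
    $doc = new DOMDocument;
    // Collect parser warnings instead of printing them for non-standard markup
    libxml_use_internal_errors(true);
    $doc->loadHTMLFile('blah.html');

    // Find the first <table> anywhere in the document
    $table = $doc->getElementsByTagName('table')->item(0);
    if ($table !== null) {
        // removeChild() has to be called on the node's direct parent
        $table->parentNode->removeChild($table);
    }

    echo $doc->saveHTML();
?>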
RC
Thanks for your quick answer RC, trouble is that the html tags are not plain standard but "customized" so I'm afraid DOM won't work unless I restructure all the articles in the database :/
Argo
Argo, DOM will work with any valid XML
Mez
This is assuming the file is valid XML, HTML... or valid anything. Which in a real-life scenario, it probably isn't. :)
Ilari Kajaste
You should be able to turn validation off. If it's somewhat reasonable I would think it should be able to do the job.
RC
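To illustrate the point about non-standard markup, here is a minimal sketch, assuming PHP's DOM extension is enabled. The removeTags() helper and the <note> tag are made up for the example; libxml_use_internal_errors(true) only silences the warnings libxml raises for tags it doesn't recognise, and the unknown elements should still end up in the tree.

```php
<?php
// Hypothetical helper: drop every <$tag> element, content included.
function removeTags(string $html, string $tag): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);   // keep warnings quiet
    $doc->loadHTML('<html><body>' . $html . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The node list is live, so copy it before removing anything
    foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Serialize only what is left inside <body>
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// <note> stands in for one of the "customized" tags
$article = '<h1>Title</h1><note>internal remark</note><p>Body text</p>';
echo removeTags($article, 'note');
```

Whether this copes with a given set of custom tags depends on how far they stray from ordinary tag syntax, so it is worth trying on a few real articles first.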
+2  A: 

If you are trying to sanitize your data, it is often recommended that you use a whitelist as opposed to blacklisting certain terms and tags. A whitelist makes it easier to sanitize the input and prevent XSS attacks. There's a well-known library called HTML Purifier that, although large and somewhat slow, does a very good job of purifying your data.

cballou
Maybe I should have given a bit more context, but I was just hoping for a line of code, and didn't expect completely different approaches in return. In fact, I have a content website, with articles stored in a database. I formatted them with custom HTML to avoid lengthy HTML code in the articles, and replace it with the real HTML in PHP when the page is shown. I also wanted to add the possibility to view only the headings, to get an idea of the structure and content of the article, and that is where I ran into the problem of how to hide <p>s, <table>s, etc. and only keep <h>s.
Argo
You may or may not be able to achieve what you're looking for with a line of code, but the larger question is how robust is it? I can write tons of one-line solutions that work 99% of the time but fail sensationally 1% of the time. If that meets your needs, then by all means go for it.
RC
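For the headings-only view described above, one possible sketch with DOMDocument and DOMXPath follows. It assumes the headings are ordinary <h1> to <h6> elements once the custom markup has been converted; headingsOnly() is a hypothetical name, not something from the thread.

```php
<?php
// Hypothetical helper: return the text of every heading, in document order.
function headingsOnly(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);   // tolerate sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $headings = [];
    // The union of the six heading levels comes back in document order
    foreach ($xpath->query('//h1 | //h2 | //h3 | //h4 | //h5 | //h6') as $node) {
        $headings[] = trim($node->textContent);
    }
    return $headings;
}

$article = '<h1>Intro</h1><p>text</p><table><tr><td>x</td></tr></table><h2>Details</h2>';
print_r(headingsOnly($article));
```

Picking out what you want to keep, rather than deleting everything else, sidesteps the question of which tags need removing.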
+1  A: 

PSEUDO CODE

function replaceMe($html_you_want_to_replace,$html_dom) {
   return preg_replace(/^$html_you_want_to_replace/, '', $html_dom);
}

HTML Before

<div>I'm Here</div><div>I'm next</div>

<?php
$html_dom = "<div>I'm Here</div><div>I'm next</div>";
$get_rid_of = "<div>I'm Here</div>";
echo replaceMe($get_rid_of, $html_dom);
?>

HTML After

<div>I'm next</div>

I know it's a hack job

Phill Pafford
There's an error in your regex. You need to start and end the regex with a delimiter, like # or /
mabwi
Thanks, this is pseudo-code and not tested. But you're correct ;)
Phill Pafford
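A runnable take on the pseudo-code above might look like the sketch below. The # delimiter and the preg_quote() call are additions that the original answer doesn't have: preg_quote() escapes any regex metacharacters in the fragment, so the fragment is matched literally.

```php
<?php
// Remove one known, literal fragment from the start of the markup.
function replaceMe(string $html_you_want_to_replace, string $html_dom): string
{
    // Second argument to preg_quote() makes sure the delimiter is escaped too
    $pattern = '#^' . preg_quote($html_you_want_to_replace, '#') . '#';
    return preg_replace($pattern, '', $html_dom);
}

$html_dom   = "<div>I'm Here</div><div>I'm next</div>";
$get_rid_of = "<div>I'm Here</div>";
echo replaceMe($get_rid_of, $html_dom);   // prints <div>I'm next</div>
```

Since the fragment is a fixed string, plain str_replace() would probably do the same job here without a regex, as long as the fragment doesn't have to be anchored to the start.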
+2  A: 

If you don't know what is between the tags, Phill's response won't work.

This will work if there are no other tags in between, and is definitely the easier case. You can obviously replace div with whatever tag you need.

preg_replace('#<div>[^<]+</div>#','',$html);

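If there could be other tags in the middle, this should work, but it could cause problems: .+ is greedy, so with more than one div it removes everything from the first <div> to the last </div>, and it won't cross line breaks without the s modifier. You're probably better off going with the DOM solution above, if so.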

preg_replace('#<div>.+</div>#','',$html);

These aren't tested
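A quick check of both patterns on a made-up sample string shows the difference. The third, lazy variant (.+? with the s modifier) is an addition for comparison and is not part of the patterns above.

```php
<?php
$html = '<div>one</div><p>keep</p><div><b>two</b></div>';

// Pattern 1: only removes a <div> whose content contains no tags
$noInnerTags = preg_replace('#<div>[^<]+</div>#', '', $html);
// $noInnerTags is '<p>keep</p><div><b>two</b></div>'

// Pattern 2: greedy .+ runs from the first <div> to the last </div>
$greedy = preg_replace('#<div>.+</div>#', '', $html);
// $greedy is '' (the <p> in between is lost as well)

// Lazy variant: .+? stops at the nearest </div>; s lets . match newlines
$lazy = preg_replace('#<div>.+?</div>#s', '', $html);
// $lazy is '<p>keep</p>'
```

None of these handle a div nested inside another div correctly, which is the usual argument for a parser.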

mabwi
+5  A: 

Easy dude.

To get an ungreedy regexp, use the U modifier. To let the dot match newlines as well, so that a match can span several lines, use the s modifier. Knowing that, to remove all paragraphs, use this pattern:

#<p[^>]*>(.*)?</p>#sU

Explanation:

  • I use the # delimiter so that I don't have to escape the / characters (to have a more readable pattern)
  • <p[^>]*> : the part detecting an opening paragraph tag (with optional attributes, such as a style)
  • (.*)? : everything in between (in "ungreedy mode")
  • </p> : obviously, the closing paragraph tag
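Applied with preg_replace(), the pattern can be tried out as below. The sample strings are invented. The stricter variant with \b is a suggested tweak, not part of the pattern above: without it, <p[^>]*> also matches tags that merely start with "p", such as <pre>.

```php
<?php
$html = "<h2>Title</h2>\n<p class=\"intro\">First\nparagraph</p>\n<h3>Sub</h3>\n<p>Second</p>";

// Original pattern: U makes quantifiers lazy, s lets . match newlines
$result = preg_replace('#<p[^>]*>(.*)?</p>#sU', '', $html);
// $result is "<h2>Title</h2>\n\n<h3>Sub</h3>\n"

// Caveat: the opening part also matches <pre>, so this wipes everything
$sample  = '<pre>code</pre><p>x</p>';
$tooWide = preg_replace('#<p[^>]*>(.*)?</p>#sU', '', $sample);
// $tooWide is ''

// Suggested tweak: \b requires the tag name to end right after "p"
$strict = preg_replace('#<p\b[^>]*>.*?</p>#s', '', $sample);
// $strict is '<pre>code</pre>'
```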

Hope that helps!

Grokwik