Hi everyone, I was just wondering if anyone knew of a function to remove ALL classes from a string in PHP. Basically, I only want <p> tags rather than <p class="...">, if that makes sense :)

+2  A: 

A fairly naive regex will probably work for you:

$html=preg_replace('/class=".*?"/', '', $html);

I say naive because it would also strip class="something" from your body text if it happened to appear there for some reason! It could be made a little more robust by only matching class="..." inside angle-bracketed tags if need be.
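
The more robust variant mentioned above could look something like the sketch below. It is only an illustration, not code from this answer: `strip_class_attributes` is a hypothetical helper name. It assumes that no attribute value contains a literal `>` and that unquoted values are not followed directly by `/`.

```php
<?php
// Hypothetical helper: removes class attributes only where they appear
// inside an opening tag, so class="..." in body text is left alone.
function strip_class_attributes(string $html): string
{
    $result = preg_replace_callback(
        '/<[a-z][^>]*>/i', // each opening tag, any case
        function (array $m): string {
            // Drop class=... whether the value is double-quoted,
            // single-quoted, or unquoted, with optional spaces around "=".
            return preg_replace(
                '/\s+class\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]+)/i',
                '',
                $m[0]
            );
        },
        $html
    );

    // preg_replace_callback returns null on a regex error; fall back to the input
    return $result ?? $html;
}

echo strip_class_attributes('<p class="intro">Use class="x" here</p>');
// prints: <p>Use class="x" here</p>
```

Matching the tag first and then editing inside it keeps the two concerns separate, which makes the pattern easier to extend than one large expression.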

Paul Dixon
Thanks so much, works like a charm :)
SoulieBaby
Does the code work with upper/lower case, single/double/no quotes, spaces in between, and spaces before and after the class?
Jon Winstanley
No - only the cases indicated by the OP. Anything else is left as an exercise for the reader :)
Paul Dixon
+2  A: 

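I would do something like this with jQuery. Place this in your page header:

$(document).ready(function () {
    // Strip the class attribute from every p element once the DOM is ready
    $("p").each(function () {
        $(this).removeAttr("class");
        // or, to remove a single class: $(this).removeClass("className");
    });
});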

Teknotica
Not PHP, but a better solution
Draemon
Not sure how that could be better without knowing why the OP wanted to do this.
Paul Dixon
Not better, just another way to do it :)
Teknotica
+2  A: 

Maybe it's a bit overkill for your needs, but to parse/validate/clean HTML data, the best tool I know is HTML Purifier.

It allows you to define which tags and which attributes are OK, and/or which ones are not, and it gives valid/clean (X)HTML as output.

(Using regexes to "parse" HTML seems OK at the beginning... And then, when you want to add specific stuff, it generally becomes hell to understand/maintain)

Pascal MARTIN
Correct me if I'm wrong, but don't the lexical analyzers that true XML parsers use pick the XML apart with regex anyway? I think the real issue is that when people try to write regex parsers themselves, they try to jump to the middle or end of a string instead of starting at the beginning of the string like a true parser does.
joebert
I don't think they do -- not sure about it, but... seems odd. Anyway, even if they do, they are probably more tested (because they are widely used) than the regex you will write yourself for your own project.
Pascal MARTIN
A: 

You load the HTML into a DOMDocument object, then import that into SimpleXML. Then you run an XPath query for all p elements and loop through them. On each iteration, you set the class attribute to a placeholder value like "killmeplease".

When that's done, output the SimpleXML object as XML again (which, by the way, may change the HTML, but usually only for the better), and you will have an HTML string where each p has a class of "killmeplease". Use str_replace to actually remove them.

Example:

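$html_file = "somehtmlfile.html";

// Parse the HTML file and hand the tree to SimpleXML
$dom = new DOMDocument();
$dom->loadHTMLFile($html_file);

$xml = simplexml_import_dom($dom);

// Find every p element and overwrite its class with a placeholder
$paragraphs = $xml->xpath("//p");

foreach ($paragraphs as $paragraph) {
    $paragraph['class'] = "killmeplease";
}

$new_html = $xml->asXML();

// Strip the placeholder attribute, including the space in front of it
$better_html = str_replace(' class="killmeplease"', "", $new_html);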

Or, if you want simpler code and don't mind tangling with preg_replace, you could go with:

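$html_file = "somehtmlfile.html";
$html_string = file_get_contents($html_file);

// Match a p tag up to its double-quoted class attribute, then the rest of the tag
$bad_p_class = '/(<p\b[^>]*?)\s+class="[^"]*"([^>]*>)/i';

// Keep everything in the tag except the class attribute
$better_html = preg_replace($bad_p_class, '$1$2', $html_string);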

The tricky part with regular expressions is that they tend to be greedy, and trying to turn that off can cause problems if your p element's tag has a line break in it. But give either of those a shot.
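
If the placeholder-and-replace step feels roundabout, one variation is to delete the attribute on the DOM node itself. This is a minimal sketch rather than the method described above: `remove_p_classes` is a hypothetical helper name, and it assumes the input is an HTML fragment held in a string rather than a file.

```php
<?php
// Hypothetical helper: removes the class attribute from every p element
// by editing the parsed DOM, so no placeholder value is needed.
function remove_p_classes(string $html): string
{
    $dom = new DOMDocument();

    // Collect parser warnings about imperfect markup instead of printing them
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML('<html><body>' . $html . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Only p elements are touched; classes on other elements stay
    foreach ($dom->getElementsByTagName('p') as $p) {
        $p->removeAttribute('class');
    }

    // Serialize just the children of <body>, dropping the wrapper we added
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo remove_p_classes('<p class="intro">Hello</p>');
```

As with the SimpleXML route, the parser may normalize the markup on the way out, and text with non-ASCII characters may need an explicit encoding hint before loading.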

Anthony
A: 

HTML Purifier

HTML can be very tricky to regex because of the hundreds of different ways code can be written or formatted.

HTML Purifier is a mature open source library for cleaning up HTML. I would advise using it in this case.

HTML Purifier's configuration, described in its documentation, lets you specify which classes and attributes should be allowed and what the purifier should do when it finds them.

http://htmlpurifier.org/docs/
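
For the case in the question, a configuration along these lines might be enough. Treat it as a sketch under assumptions: the `require_once` path depends on how the library is installed (with Composer it would be the autoloader instead), `$dirty_html` is a made-up sample, and HTML Purifier will also apply its other default clean-up rules to the input.

```php
<?php
// Assumed install location; adjust to wherever HTML Purifier lives
require_once 'HTMLPurifier.auto.php';

// Made-up sample input for illustration
$dirty_html = '<p class="intro">Hello <span class="x">world</span></p>';

$config = HTMLPurifier_Config::createDefault();

// Forbid the class attribute on every element;
// use 'p@class' instead to forbid it on p elements only
$config->set('HTML.ForbiddenAttributes', '*@class');

$purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify($dirty_html);

echo $clean_html;
```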

Jon Winstanley