ansaurus

Question

Answer 1

+2 A:

A fairly naive regex will probably work for you:

$html=preg_replace('/class=".*?"/', '', $html);

I say naive because it would also match if your body text happened to contain class="something" for some reason! It could be made a little more robust by only looking for class="..." inside angle-bracketed tags, if need be.

Paul Dixon 2009-07-23 10:38:05

Thanks so much, works like a charm :)

SoulieBaby 2009-07-23 10:43:54

Does the code work with upper/lower case, single/double/no quotes, spaces in between, and spaces before and after the class?

Jon Winstanley 2009-07-23 11:17:22

No - only the cases indicated by the OP. Anything else is left as an exercise for the reader :)

Paul Dixon 2009-07-23 12:45:33
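
The "more robust" variant mentioned in the answer, covering the cases raised in the comments above, could look roughly like the sketch below. It is only an illustration: the function name strip_p_classes is hypothetical, and it assumes attribute values never contain a ">" character. It first isolates each p opening tag, then removes class attributes inside that tag only, so body text and other elements are left alone.

```php
<?php
// Hypothetical helper (not from the original answer): strips class
// attributes from <p> opening tags only, leaving body text untouched.
function strip_p_classes(string $html): string
{
    // Step 1: find each <p ...> opening tag (case-insensitive; \b keeps <pre> out).
    return preg_replace_callback(
        '/<p\b[^>]*>/i',
        function (array $m): string {
            // Step 2: inside that tag, drop class="x", class='x' or class=x,
            // allowing optional whitespace around the "=".
            return preg_replace(
                '/\s+class\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+)/i',
                '',
                $m[0]
            );
        },
        $html
    );
}
```

Calling strip_p_classes($html) returns the modified string. Because the match is limited to the opening tag, a literal class="something" in the paragraph text is not affected.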

Answer 2

+2 A:

I would do something like this with jQuery. Place this in your page header:

$(document).ready(function(){
    $("p").each(function(){
        $(this).removeAttr("class");
        // or $(this).removeClass("className"); to remove a single class
    });
});

Teknotica 2009-07-23 10:42:01

Not PHP, but a better solution

Draemon 2009-07-23 10:45:53

Not sure how that could be better without knowing why the OP wanted to do this.

Paul Dixon 2009-07-23 10:57:20

Not better, just another way to do it :)

Teknotica 2009-07-23 10:59:50

Answer 3

+2 A:

Maybe it's a bit overkill for your needs, but to parse/validate/clean HTML data, the best tool I know is HTML Purifier.

It allows you to define which tags and which attributes are OK, and/or which ones are not, and it gives valid/clean (X)HTML as output.

(Using regexes to "parse" HTML seems OK at the beginning... and then, when you want to add specific stuff, it generally becomes hell to understand/maintain.)

Pascal MARTIN 2009-07-23 10:42:09

Correct me if I'm wrong, but don't the lexical analyzers that true XML parsers use pick the XML apart with regex anyway? I think the real issue is that when people try to write regex parsers themselves, they try to jump to the middle or end of a string instead of starting at the beginning of the string like a true parser does.

joebert 2009-07-23 11:04:57

I don't think they do -- not sure about it, but... seems odd. Anyway, even if they do, they are probably more tested (because they are widely used) than the regex you will write yourself for your own project.

Pascal MARTIN 2009-07-23 11:10:13

Answer 4

A:

You load the HTML into a DOMDocument and import that into SimpleXML. Then you do an XPath query for all p elements and loop through them. On each pass, you set the class attribute to a placeholder value like "killmeplease".

When that's done, re-output the SimpleXML as XML (which, by the way, may change the HTML, but usually only for the better), and you will have an HTML string where each p has a class of "killmeplease". Use str_replace to actually remove them.

Example:

$html_file = "somehtmlfile.html";

$dom = new DOMDocument();
$dom->loadHTMLFile($html_file);

$xml = simplexml_import_dom($dom);

$paragraphs = $xml->xpath("//p");

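foreach($paragraphs as $paragraph) {
    // Overwrite (or add) the class attribute with a known placeholder value.
    $paragraph['class'] = "killmeplease";
}

$new_html = $xml->asXML();

// Include the leading space so the result is <p> rather than <p >.
$better_html = str_replace(' class="killmeplease"', "", $new_html);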

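Or, if you want simpler code but are willing to tangle with preg_replace, you could go with:

$html_file = "somehtmlfile.html";
$html_string = file_get_contents($html_file);

// Match a <p ...> opening tag up to its class attribute, lazily, without crossing ">".
$bad_p_class = '/(<p\b[^>]*?)\s+class=("[^"]*"|\'[^\']*\'|[^\s>]+)/i';

// Keep everything before the class attribute; the rest of the tag is untouched.
$better_html = preg_replace($bad_p_class, '$1', $html_string);

The tricky part with regular expressions is that quantifiers are greedy by default, so a pattern like .* can swallow far more than one attribute, and . does not match a line break unless you add the s modifier. Using [^>] instead of . keeps the match inside a single tag even if your p element tag has a line break in it. But give either of those a shot.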

Anthony 2009-07-23 11:09:37
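
As a variation on the DOM approach in this answer, the attribute can probably be dropped directly with DOMElement::removeAttribute, which avoids the placeholder value and the str_replace step. This is a minimal sketch, not the answerer's method: the function name strip_p_classes_dom is hypothetical, and it takes an HTML string rather than a file name.

```php
<?php
// Hypothetical helper (not from the original answer): removes the class
// attribute from every <p> element using the DOM extension only.
function strip_p_classes_dom(string $html): string
{
    $dom = new DOMDocument();

    // Collect libxml warnings internally instead of emitting them,
    // since real-world HTML is rarely perfectly valid.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('p') as $p) {
        // Does nothing when the element has no class attribute.
        $p->removeAttribute('class');
    }

    // saveHTML() serializes the whole document, so the result includes
    // the doctype and html/body wrappers that loadHTML() added.
    return $dom->saveHTML();
}
```

Note that the output is a full document rather than the original fragment, so some post-processing may be needed if only the fragment is wanted.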

Answer 5

A:

HTML Purifier

HTML can be very tricky to handle with regex because of the many different ways markup can be written or formatted.

HTML Purifier is a mature open source library for cleaning up HTML. I would advise using it in this case.

HTML Purifier's configuration documentation describes how to specify which elements and attributes should be allowed, and what the purifier should do with anything else it finds.

http://htmlpurifier.org/docs/
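
For illustration, a configuration along these lines might do the job. Treat it as a sketch under assumptions: the include path and the sample input are made up, and it relies on the HTML.ForbiddenAttributes directive with its tag@attr syntax, so check the documentation above for the version you have installed.

```php
<?php
// Assumption: HTML Purifier is installed and this path points at its autoloader.
require_once 'htmlpurifier/library/HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();

// Forbid the class attribute on p elements only (note the tag@attr syntax).
$config->set('HTML.ForbiddenAttributes', 'p@class');

$purifier = new HTMLPurifier($config);

// Hypothetical input; class attributes on other tags are not targeted.
$dirty_html = '<div class="box"><p class="intro">Hello</p></div>';
$clean_html = $purifier->purify($dirty_html);
```

Keep in mind that purify() also applies the library's default whitelist, so markup it does not consider safe or valid is cleaned up as well, not just the class attributes.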

Jon Winstanley 2009-07-23 11:13:49
