ansaurus

Question

How to use preg in php to add html properties

Answer 1

+2 A:

Please please, don't use a regexp to parse HTML.

$dom = new simple_html_dom();
$dom->load($html);
foreach ($dom->find("[style=float: left],[style=float: right]") as $fragment)
{
   // find() yields one element per iteration, so no [0] index is needed
   if ($fragment->style == 'float: left')
   {
      $fragment->align = 'left';
      $fragment->style = '';
   }
   ...
}
echo $dom;

Byron Whitlock 2010-08-05 21:05:47

Wrikken 2010-08-05 21:08:13

@wrikken because simple_html_dom is simpler and easier. Have you ever tried it? It will blow you away ;)

Byron Whitlock 2010-08-05 21:11:10

Maybe I'm a fool for having learned `DOM` (mainly using javascript earlier on) and `XPath`, but I don't find it a lick easier, and even so, most of those methods could easily be implemented by creating a few helper functions when extending `DOM`.

Wrikken 2010-08-05 21:14:08
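
A minimal sketch of the kind of helper mentioned in the comment above, assuming a subclass of `DOMDocument`. The class name `QueryableDocument` and its `find()` method are hypothetical and not part of PHP or of any library named in this thread; only `DOMDocument` and `DOMXPath` are built in.

```php
<?php
// Hypothetical helper class; the name and the find() method are illustrative only.
class QueryableDocument extends DOMDocument
{
    // Run an XPath expression against this document and return the matching nodes.
    public function find(string $expression, ?DOMNode $context = null): DOMNodeList
    {
        $xpath = new DOMXPath($this);
        $result = $xpath->query($expression, $context);
        if ($result === false) {
            // DOMXPath::query() returns false for a malformed expression
            throw new InvalidArgumentException("Invalid XPath expression: $expression");
        }
        return $result;
    }
}

$doc = new QueryableDocument();
$doc->loadHTML('<p><img src="a.png" style="float: left"><img src="b.png"></p>');

// Same transformation as in the answer, expressed through the helper.
foreach ($doc->find("//img[contains(@style, 'float: left')]") as $img) {
    $img->setAttribute('align', 'left');
    $img->removeAttribute('style');
}
echo $doc->saveHTML();
```

With a wrapper like this, a one-line `find()` call is available without giving up the real DOM underneath.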

Suggested third party alternatives that actually use DOM instead of String Parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org).

Gordon 2010-08-05 21:15:11

@Wrikken, fair enough. I've worked with tcldom and the expat C lib, and it was not easy. Sounds like I need to add another tool to my PHP box!

Byron Whitlock 2010-08-05 21:16:38

But reading this back, I seem to be attacking you. That is not my intention. Please, go ahead and use `simplehtmldom`, it was just some venting / inability to understand its popularity.

Wrikken 2010-08-05 21:16:50

@Wrikken, no worries, seeing your example, I think you have a point. Thanks for the heads up!

Byron Whitlock 2010-08-05 21:20:12

@Byron: thanks for not taking it the wrong way :) @Gordon: nice ones, will check some of them out / compare them to my own toolbox.

Wrikken 2010-08-05 21:22:46

@Wrikken I have to thank you. There is finally someone who shares my thoughts about SimpleHtmlDom. You don't know how lonely I was, blankly staring at answers getting massive upvotes for just suggesting SimpleHtmlDom, without even giving examples, like it was the holy grail. Now I know I am not alone. For that I `define('A_TOKEN_OF_APPRECIATION', '♥')` for you.

Gordon 2010-08-05 21:35:20

@Wrikken, @Gordon, You just found another convert. I am working on a scraping project and for the current website, I tried using the PHP DOM as Wrikken suggested. Holy smokes that sucker is fast! AND I can use firebug's "copy XPATH" instead of counting by hand. You just saved me at least an hour sirs! THANK YOU VERY MUCH! And thank you for the rant Wrikken. I wish I could buy you a beer!!!!!

Byron Whitlock 2010-08-05 22:09:35

Wrikken 2010-08-05 22:28:13

Answer 2

+6 A:

 $dom = new DOMDocument();
 $dom->loadHTML($htmlstring);
 $x = new DOMXPath($dom);
 foreach($x->query("//img[contains(@style,'float: right')]") as $node) $node->setAttribute('align','right');
 foreach($x->query("//img[contains(@style,'float: left')]") as $node) $node->setAttribute('align','left');

edit:

When the amount of whitespace between 'float:' and 'right' is not certain, there are several options:

Use XPath 1.0 string functions: //img[starts-with(normalize-space(substring-after(@style,'float:')),'right')]
Just do a simple check for float like //img[contains(@style,'float:')], and check with $node->getAttribute() what actually comes afterwards.
Import preg_match into the equation (which was just recently pointed out to me (thanks Gordon), but in this case is imho the least favorite solution):


 $dom = new DOMDocument();
 $dom->loadHTML($htmlstring);
 $x = new DOMXPath($dom);
 $x->registerNamespace("php", "http://php.net/xpath");
 $x->registerPHPFunctions('preg_match');

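 // Compare with 0 so the predicate is an explicit boolean test; preg_match's raw return value is not a boolean
 foreach($x->query("//img[php:functionString('preg_match','/float\s*:\s*right/',@style) > 0]") as $node) $node->setAttribute('align','right');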

Wrikken 2010-08-05 21:12:24
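
A minimal sketch of the second option listed in the answer (a loose XPath check, then inspecting the attribute in PHP), assuming the editor output is available as a string. The sample markup in `$html` and the exact pattern passed to `preg_match` are illustrative assumptions, not taken from the answer.

```php
<?php
// Hypothetical input with inconsistent spacing around "float:".
$html = '<p><img src="a.png" style="float:left ;">'
      . '<img src="b.png" style="border:0; float:  right">'
      . '<img src="c.png" style="border:0"></p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Loose XPath filter: any img whose style attribute mentions "float" at all.
foreach ($xpath->query("//img[contains(@style, 'float')]") as $img) {
    // Decide the direction in PHP, tolerating whitespace around the colon.
    // The regex only ever sees the value of one style attribute, never raw HTML.
    $style = $img->getAttribute('style');
    if (preg_match('/(?:^|;)\s*float\s*:\s*(left|right)\b/i', $style, $m)) {
        $img->setAttribute('align', strtolower($m[1]));
    }
}

echo $dom->saveHTML();
```

The parser locates the elements and attributes, and the regular expression is applied only to a short, already isolated string, which avoids the problems of matching against whole markup.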

Will this work with variations in the syntax of float? (I'm using CKeditor and I don't know how consistent this is, might get 'float:left ;' or 'float: left;'.)

Ghjnut 2010-08-05 21:59:39

Not directly, no. For that there is some trickery involved (unfortunately, XPath 2.0's `matches()` function cannot be used). One could fiddle around with `substring-after()` and the like, I'll edit something in a moment.

Wrikken 2010-08-05 22:08:47

+1 Do this. It is so much faster than simple_html_dom it isn't even funny.

Byron Whitlock 2010-08-05 22:10:03

`echo preg_replace('%(\<img)(.*float:\s*)(right|left)(\s*;.*/\>)%', '\1 align="\3"\2\3\4', $htmlstring);` Looks complicated, but robust (I think). Is there a specific reason it's suggested not to use regexp for parsing?

Ghjnut 2010-08-06 17:10:02

Yes: text containing 'look at this <img ', an `<img..>` _without_ a float followed by an arbitrary element (span, div, another img, etc.) _with_ a float, a mention of `float: ` in the text itself, etc. A lot of things _can_ go wrong, which is why we rely on parsers rather than on _best case scenario_ regexes. The regex doesn't look nearly as complicated as the 'best efforts' I've seen; actually, it is one of the more naïve ones. A better one would at least _try_ to validate that it is still within a tag, which this one utterly lacks.

Wrikken 2010-08-06 21:40:54
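
One of the failure cases described above can be reproduced in a few lines. This is a sketch with a made-up input string (an `<img>` without a float, followed by a `<span>` that has one); the regex is the one posted in the comment above, unchanged.

```php
<?php
// Hypothetical input: the img has no float, only the span after it does.
$html = '<img src="a.png" /><span style="float: left;">text</span><br />';

// Regex from the comment above. The greedy .* runs past the end of the img tag
// and picks up the float that belongs to the span.
$regexResult = preg_replace(
    '%(\<img)(.*float:\s*)(right|left)(\s*;.*/\>)%',
    '\1 align="\3"\2\3\4',
    $html
);
echo $regexResult, "\n"; // the img now carries align="left" although it has no float

// The DOM/XPath check only looks at the img's own style attribute.
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$matches = $xpath->query("//img[contains(@style, 'float')]");
echo $matches->length, "\n"; // 0: nothing to change
```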

Wrikken 2010-08-06 21:45:05

A small hint if you _really_ want to go the regex route (which you shouldn't): the `.*` is utterly, catastrophically wrong: it should match _not >_ (`[^>]`), with the exception that a `>` could appear inside an attribute, in which case it becomes: _if we're still in a tag, a quoting character may have started, but quoting is not guaranteed in some HTML, and we're not even sure we're in a `style` attribute_, etc., etc.

Wrikken 2010-08-06 21:56:01

Believe you me: I've taken the regex road to HTML before in my days, came up with a perfect solution for all available test cases, until I was routed by just the right user brainfart in HTML to make it break. If you utterly and _completely_ control your HTML, there's a chance you will succeed, but it is **not** in any way reliable for all possible situations.

Wrikken 2010-08-06 21:57:04
