I'm looking to write a script in PHP that scans an HTML document and adds new markup to an element based on what it finds. More specifically, I want it to scan the document and, for every element, look for the CSS declaration "float: right/left"; if it finds one, it should add align="right/left" (matching what it found). Example:

<img alt="steve" src="../this/that" style="height: 12px; width: 14px; float: right"/>

becomes

<img alt="steve" src="../this/that" align="right" style="height: 12px; width: 14px; float: right"/>

+2  A: 

Please please, don't use a regexp to parse HTML.

Use simple_html_dom instead.

$dom = new simple_html_dom();
$dom->load($html);
foreach ($dom->find("[style=float: left],[style=float: right]") as $fragment)
{
   // find() yields element objects, so $fragment is used directly (no [0] index)
   if ($fragment->style == 'float: left')
   {
      $fragment->align = 'left';
      $fragment->style = '';
   }
   ...
}
echo $dom;
Byron Whitlock
Wrikken
@wrikken because simple_html_dom is simpler and easier. Have you ever tried it? It will blow you away ;)
Byron Whitlock
Maybe I'm a fool for having learned `DOM` (mainly using javascript earlier on) and `XPath`, but I don't find it a lick easier, and even so, most of those methods could simply be implemented by creating a few helper functions when extending `DOM`.
Wrikken
Suggested third party alternatives that actually use DOM instead of String Parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org).
Gordon
@Wrikken, fair enough. I've worked with tcldom and the expat C lib, and it was not easy. Sounds like I need to add another tool to my PHP box!
Byron Whitlock
But reading this back, I seem to be attacking you. That is not my intention. Please, go ahead and use `simplehtmldom`, it was just some venting / inability to understand its popularity.
Wrikken
@Wrikken, no worries, seeing your example, I think you have a point. Thanks for the heads up!
Byron Whitlock
@Byron: thanks for not taking it the wrong way :) @Gordon: nice ones, will check some of them out / compare them to my own toolbox.
Wrikken
@Wrikken I have to thank you. There is finally someone who shares my thoughts about SimpleHtmlDom. You don't know how lonely I was, blankly staring at answers getting massive upvotes for just suggesting SimpleHtmlDom without even giving examples, like it was the holy grail. Now I know I am not alone. For that I `define('A_TOKEN_OF_APPRECIATION', '♥')` for you.
Gordon
@Wrikken, @Gordon, you just found another convert. I am working on a scraping project, and for the current website I tried using the PHP DOM as Wrikken suggested. Holy smokes, that sucker is fast! AND I can use Firebug's "copy XPath" instead of counting by hand. You just saved me at least an hour, sirs! THANK YOU VERY MUCH! And thank you for the rant, Wrikken. I wish I could buy you a beer!!!!!
Byron Whitlock
Wrikken
+6  A: 
 $dom = new DOMDocument();
 $dom->loadHTML($htmlstring);
 $x = new DOMXPath($dom);
 // Select images whose inline style contains the float declaration and set align to match
 foreach($x->query("//img[contains(@style,'float: right')]") as $node) $node->setAttribute('align','right');
 foreach($x->query("//img[contains(@style,'float: left')]") as $node) $node->setAttribute('align','left');

edit:

When the amount of whitespace between 'float:' and 'right' is uncertain, there are several options:

  1. Use XPath 1.0 string functions: //img[starts-with(normalize-space(substring-after(@style,'float:')),'right')]
  2. Just do a simple check for float like //img[contains(@style,'float:')], and check with $node->getAttribute() what actually comes afterwards.
  3. Bring preg_match into the equation (which was just recently pointed out to me (thanks Gordon), but in this case it is imho the least favorable solution):

 $dom = new DOMDocument();
 $dom->loadHTML($htmlstring);
 $x = new DOMXPath($dom);
 $x->registerNamespace("php", "http://php.net/xpath");
 $x->registerPHPFunctions('preg_match');

 // Compare the result with 1 explicitly rather than relying on how preg_match's return value is converted for the predicate
 foreach($x->query("//img[php:functionString('preg_match','/float\s*:\s*right/',@style) = 1]") as $node) $node->setAttribute('align','right');
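
For comparison, option 2 above could be sketched roughly as follows. This is only an illustration and not part of the original answer: the function name `addAlignFromFloat` is hypothetical, and it assumes the style attribute can be split on `;`, which will not hold for every stylesheet value (for example a `url(...)` containing a semicolon).

```php
<?php
// Hypothetical helper (an assumption for illustration): copies a CSS float
// declaration found on <img> elements into the legacy align attribute.
function addAlignFromFloat(string $html): string
{
    $dom = new DOMDocument();
    // Suppress the warnings libxml emits for imperfect real-world HTML.
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    // Broad XPath filter first; the exact value is checked in PHP.
    foreach ($xpath->query("//img[contains(@style,'float')]") as $node) {
        // Split the style attribute into individual declarations.
        foreach (explode(';', $node->getAttribute('style')) as $declaration) {
            $parts = explode(':', $declaration, 2);
            if (count($parts) !== 2) {
                continue; // not a "property: value" pair
            }
            $property = strtolower(trim($parts[0]));
            $value = strtolower(trim($parts[1]));
            // Only plain "left" / "right" values are copied to align.
            if ($property === 'float' && in_array($value, ['left', 'right'], true)) {
                $node->setAttribute('align', $value);
            }
        }
    }
    // saveHTML() serializes the whole document, including the wrapper
    // elements that loadHTML() adds around a fragment.
    return $dom->saveHTML();
}

echo addAlignFromFloat('<img alt="steve" src="../this/that" style="height: 12px; width: 14px; float:right ;"/>');
```

Because the value is trimmed in PHP, variations such as 'float:left ;' and 'float: left;' are treated the same, and the original style attribute is left untouched.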
Wrikken
Will this work with variations in the syntax of float? (I'm using CKeditor and I don't know how consistent it is; I might get 'float:left ;' or 'float: left;'.)
Ghjnut
Not directly, no. For that there is some trickery involved (unfortunately, XPath 2.0's `matches()` function cannot be used). One could fiddle around with `substring-after()` and the like; I'll edit something in a moment.
Wrikken
+1 Do this. It is so much faster than simple_html_dom it isn't even funny.
Byron Whitlock
`echo preg_replace('%(\<img)(.*float:\s*)(right|left)(\s*;.*/\>)%', '\1 align="\3"\2\3\4', $htmlstring);` Looks complicated, but robust (I think). Is there a specific reason it's suggested not to use regexp for parsing?
Ghjnut
Yes: text containing 'look at this <img ', an `<img..>` _without_ float followed by an arbitrary element (span, div, another img, etc.) _with_ a float, the mention of `float: ` in the text itself, etc. A lot of things _can_ go wrong, which is why we rely on parsers rather than _best case scenario_ regexes. The regex doesn't look nearly as complicated as the 'best efforts' I've seen; actually, it is one of the more naïve ones. A better one would at least _try_ to validate that it is still within a tag, which this one utterly fails to do.
Wrikken
Wrikken
A small hint if you _really_ want to go the regex route (which you shouldn't): the `.*` is utterly, catastrophically wrong: it should match _not >_ (`[^>]`), with the exception that it could match a `>` inside an attribute, in which case it would be: _if we're still in a tag, a quoting character may have started, but quoting is not guaranteed in some HTML, and we're not even sure we're in a `style` attribute_, etc, etc, etc.
Wrikken
Believe you me: I've taken the regex road to HTML before in my days, came up with a perfect solution for all available test cases, until I was routed by just the right user brainfart in HTML to make it break. If you utterly and _completely_ control your HTML, there's a chance you will succeed, but it is **not** in any way reliable for all possible situations.
Wrikken