I have a large quantity of partial HTML stored in a CMS database.

I'm looking for a way to go through the HTML and find any <a></a> tags that don't have a title and add a title to them based on the contents of the tags.

So if I had <a href="somepage">some text</a> I'd like to modify the tag to look like:

<a title="some text" href="somepage">some text</a>

Some tags already have a title and some anchor tags have nothing between them.

So far I've made some progress with PHP and regex, but I can't get the contents of the anchors: my code just displays either a 1 or a 0.

<?php
$file = "test.txt";
$handle = fopen("$file", "r");
$theData = fread($handle, filesize($file));
$line = explode("\r\n", $theData);

$regex = '/^.*<a ((?!title).)*$/'; //finds all lines that don't contain an anchor with a title
$regex2 = '/<a .*><\/a>/'; //finds all lines that have nothing between the anchors
$regex3 = '/<a.*?>(.+?)<\/a>/'; //finds the contents of the anchors

foreach ($line as $lines)
{
  if (!preg_match($regex2, $lines) && preg_match($regex, $lines)){
    $tags = $lines;
    $contents = preg_match($regex3, $tags);
    $replaced = str_replace("<a ", "<a title=\"$contents\" ", $lines);
    echo $replaced ."\r\n";
  }
  else {
  echo $lines. "\r\n";
  }
}
?>

I understand regex is probably not the best way to parse HTML, so any help or alternative suggestions would be greatly appreciated.
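
A note on the 1 or 0: that is the return value of preg_match() itself, which only reports whether the pattern matched. The captured text arrives through the optional third argument. A minimal sketch of that fix, reusing the third pattern from the code above; the sample $line and the htmlspecialchars() escaping are additions for illustration, not part of the original code:

```php
<?php
// Sample input line (an assumption for illustration).
$line = '<a href="somepage">some text</a>';
$regex3 = '/<a.*?>(.+?)<\/a>/'; // same pattern as in the question

// preg_match() returns 1 on a match, 0 on no match, false on error.
// The captured groups are written into the third argument.
if (preg_match($regex3, $line, $matches) === 1) {
    $contents = $matches[1]; // text between <a ...> and </a>
    // Escape quotes and angle brackets so the title attribute stays valid.
    $title = htmlspecialchars($contents, ENT_QUOTES);
    $replaced = str_replace('<a ', '<a title="' . $title . '" ', $line);
    echo $replaced, "\n";
}
```

This still inherits the limits of the line-by-line regex approach, such as anchors that span several lines or several anchors on one line.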

+1  A: 

Never use regex to parse HTML. In PHP, use DOM.

Here's a simpler one: http://simplehtmldom.sourceforge.net/

Ruel
Thanks for your response, Ruel. The problem with using DOM is that the HTML isn't complete. It's just partial HTML that makes up part of a page. I guess I could get around that by wrapping it all in `<html></html>` tags.
Toggo
+1  A: 

If the markup is consistent, you could use a simple regex. But it will fail if your anchors have classes or any other attributes. It also doesn't correctly encode the title attribute:

$html = preg_replace('#<(a\s+href="[^"]+")>([^<>]+)</a>#ims', '<$1 title="$2">$2</a>', $html);

Therefore phpQuery or QueryPath is likely the more robust approach:

$html = phpQuery::newDocument($html);
foreach ($html->find("a") as $a) {
    $a = pq($a); // iteration yields plain DOM nodes; wrap them to get the phpQuery API
    if (empty($a->attr("title"))) {
        $a->attr("title", $a->text());
    }
}
print $html->getDocument();
mario
Thanks for your response, mario. I was hoping not to have to install any additional libraries, but it looks like I may have to!
Toggo
True, that's the disadvantage there. However it can deal with partial HTML files more easily. But btw, QueryPath is the smaller of the two libraries.
mario
+1  A: 

Use PHP's built-in DOM parsing. It's much more reliable than regex. Be aware that loading HTML into the PHP DOM will normalize it: by default a fragment comes back with a doctype and html/body tags wrapped around it.

$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress parsing errors with @

$links = $doc->getElementsByTagName('a');
foreach ($links as $link) {
    if ($link->getAttribute('title') == '') {
        $link->setAttribute('title', $link->nodeValue);
    }
}
$html = $doc->saveHTML();
Brent Baisley
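
One possible way to combine the built-in DOM approach with the wrapper idea from the comments, so that partial HTML comes back without the added doctype, html and body tags. This is only a sketch under stated assumptions: the function name addMissingTitles and the wrapper div are hypothetical, anchors with no text are left without a title, and the fragment is assumed to be ASCII or already entity-encoded (UTF-8 input would need an explicit encoding hint, which is left out here).

```php
<?php
// Hypothetical helper: adds title attributes to anchors in an HTML fragment.
function addMissingTitles(string $fragment): string
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment so its top-level nodes share one known parent.
    $doc->loadHTML('<div>' . $fragment . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $link) {
        $text = trim($link->textContent);
        // Leave existing titles alone and skip anchors with no text.
        if ($link->getAttribute('title') === '' && $text !== '') {
            $link->setAttribute('title', $text);
        }
    }

    // The wrapper is the first div in document order. Serialize only its
    // children so the doctype, html and body added by loadHTML() are dropped.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo addMissingTitles('<p><a href="somepage">some text</a> <a href="empty"></a></p>'), "\n";
```

Because setAttribute() stores the raw text and the serializer escapes it on output, quotes inside the link text should not break the title attribute. The output may still differ from the input in whitespace or tag casing, since the parser normalizes what it reads.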