I have a large quantity of partial HTML stored in a CMS database.
I'm looking for a way to go through the HTML and find any <a></a>
tags that don't have a title and add a title to them based on the contents of the tags.
So if I had <a href="somepage">some text</a>
I'd like to modify the tag to look like:
<a title="some text" href="somepage">some text</a>
Some tags already have a title and some anchor tags have nothing between them.
So far I've made some progress with PHP and regex, but I can't get the contents of the anchors. The title just ends up as either a 1 or a 0.
<?php
$file = "test.txt";
$handle = fopen("$file", "r");
$theData = fread($handle, filesize($file));
$line = explode("\r\n", $theData);
$regex = '/^.*<a ((?!title).)*$/'; //finds all lines that don't contain an anchor with a title
$regex2 = '/<a .*><\/a>/'; //finds all lines that have nothing between the anchors
$regex3 = '/<a.*?>(.+?)<\/a>/'; //finds the contents of the anchors
foreach ($line as $lines)
{
    if (!preg_match($regex2, $lines) && preg_match($regex, $lines)) {
        $tags = $lines;
        $contents = preg_match($regex3, $tags);
        $replaced = str_replace("<a ", "<a title=\"$contents\" ", $lines);
        echo $replaced . "\r\n";
    }
    else {
        echo $lines . "\r\n";
    }
}
?>
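For what it's worth, the 1 or 0 looks like the return value of preg_match() itself: it returns 1 on a match, 0 on no match (or false on error), and writes the captured groups into an optional third argument. Below is a minimal sketch of that idea, not a definitive fix. The function name addTitleToLine is hypothetical, the patterns are simplified versions of the ones above, and it only handles the first anchor on a line.

```php
<?php
// Hypothetical helper, not part of the original script: adds a title to the
// first anchor on a line when that anchor has text and no title attribute.
function addTitleToLine(string $line): string
{
    // preg_match() returns 1, 0 or false; the captures are written to $matches.
    if (preg_match('/<a\s([^>]*)>(.*?)<\/a>/i', $line, $matches) !== 1) {
        return $line; // no anchor found on this line
    }
    [$tag, $attrs, $inner] = $matches;

    // Leave anchors that already have a title untouched.
    if (preg_match('/(^|\s)title\s*=/i', $attrs) === 1) {
        return $line;
    }

    // Drop nested markup and escape the text for use inside an attribute
    // (the last argument avoids double-encoding existing entities).
    $title = htmlspecialchars(trim(strip_tags($inner)), ENT_QUOTES, 'UTF-8', false);
    if ($title === '') {
        return $line; // nothing between the anchor tags
    }

    $newTag = '<a title="' . $title . '" ' . $attrs . '>' . $inner . '</a>';

    return str_replace($tag, $newTag, $line);
}

echo addTitleToLine('<a href="somepage">some text</a>'), "\n";
// prints: <a title="some text" href="somepage">some text</a>
```

Lines with several anchors, or anchors that span more than one line, would still need a different approach.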
I understand regex is probably not the best way to parse HTML, so any help or alternative suggestions would be greatly appreciated.
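One alternative that avoids regex is PHP's bundled DOM extension. The following is a rough sketch under some assumptions: the fragments are UTF-8, the function name addMissingTitles is hypothetical, and the wrapper div plus the XML encoding prefix are common workarounds for loading partial HTML rather than anything required by the CMS.

```php
<?php
// Hypothetical helper: parses an HTML fragment with DOMDocument and adds a
// title to every anchor that has text but no title attribute.
function addMissingTitles(string $fragment): string
{
    $doc = new DOMDocument();

    // Partial or untidy HTML makes libxml emit warnings; collect them quietly.
    $previous = libxml_use_internal_errors(true);
    // The wrapper div gives the fragment a single root element; the prefix
    // is a widely used hint so the input is read as UTF-8 (an assumption).
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $fragment . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $text = trim($anchor->textContent);
        // Skip empty anchors and anchors that already have a title.
        if ($text !== '' && !$anchor->hasAttribute('title')) {
            $anchor->setAttribute('title', $text);
        }
    }

    // Serialize only the children of the wrapper div, i.e. the fragment.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }

    return $out;
}

echo addMissingTitles('<p>See <a href="somepage">some text</a>.</p>'), "\n";
```

With this approach the new title attribute is appended after the existing attributes instead of being placed first, which makes no difference to browsers. The serializer may also normalize the markup slightly, so it is worth comparing a few fragments before and after.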