I've been working on a post editor and want to generate thumbnails for all images inserted in the HTML code. Before doing that, I want to get the basic attributes of each image.

example:

$mydomain = 'mysite.com';
$htmlcode = <<<EOD
<p>sample text</p>
<img src='/path/to/my/image.ext' width='120' height='90'  />
<hr />
<img src='http://www.mysite.com/some/ther/path/image.ext' /> <!-- no attributes -->
<hr />
<p>blah blah <img src="http://www.notmyserver.com/path/lorem-ipsum.ext" width='120' height='90' /></p>
EOD;


function get_all_image_attributes($htmlcode){    
// some code... 
return $images; // array with image src (required), width (if present), height (if present)...
}

// then validate (I really need this part)    
$images   = get_all_image_attributes($htmlcode);

function verify($images,$mydomain){
// code...
return $valid_images;
}

A valid image has one of these extensions (.jpg, .jpeg, .gif, .png) and a src on my own domain, in one of these forms:

src="/path/image.ext"

src="http://www.mysite.com/path/image.ext"

src="http://www.mysite.com/some/path/image.ext"

src="http://mysite.com/some/path/image.ext"

src="www.mysite.com/path/image.ext"

P.S. The part that generates thumbnails is already done, don't worry :)

Update:

// I have done the following
$html = str_get_html($html);
$images = $html->find('img');
$valid_imgs = array();
foreach ($images as $image) {
    // pass the src attribute, not the element object
    $filename = getfilename($image->src);
    // I would like to validate the file if it is located in another path,
    // or if its src contains 'http://[www.]mysite.com/'
    if (file_exists(PUBLICPATH.'post_images/'.$filename)) {
        $valid_imgs[] = BASEURL.'post_images/'.$filename;
    }
}

function getfilename($full_filename){
    $filename = substr( strrchr($full_filename , "/") ,1); 
    if(!$filename)
      $filename = $full_filename; 
    $filename = preg_replace("/^[.]*/","",$filename);
    return $filename;
}
+3  A: 

Use an HTML parser. With PHP Simple HTML DOM Parser, you can do something along the lines of this:

$html = str_get_html($htmlcode);
foreach($html->find('img') as $element) {
    verify_image($element->src);
}
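
The snippet above calls verify_image() without defining it. A minimal sketch of what it might look like, assuming the rules from the question (extension whitelist, local path or own domain with an optional "www."); the function body, the $mydomain default, and the http/https restriction are assumptions, not part of the original answer:

```php
<?php
// Hypothetical helper: the thread never defines verify_image().
// Returns true when $src has an allowed extension and is either a
// local path or points at $mydomain (with or without "www.").
function verify_image(string $src, string $mydomain = 'mysite.com'): bool
{
    $allowed = ['jpg', 'jpeg', 'gif', 'png'];

    // Scheme-less form such as "www.mysite.com/path/image.png":
    // prepend a scheme so parse_url() reports a host.
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $src) && preg_match('#^www\.#i', $src)) {
        $src = 'http://' . $src;
    }

    $parts = parse_url($src);
    if ($parts === false || empty($parts['path'])) {
        return false;
    }

    // Assumption: only http and https URLs are acceptable.
    if (isset($parts['scheme']) && !in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return false;
    }

    $ext = strtolower(pathinfo($parts['path'], PATHINFO_EXTENSION));
    if (!in_array($ext, $allowed, true)) {
        return false;
    }

    // No host means a local path, which is treated as our own.
    if (!isset($parts['host'])) {
        return true;
    }

    // Compare hosts case-insensitively, ignoring a leading "www.".
    $host = preg_replace('/^www\./', '', strtolower($parts['host']));
    return $host === strtolower($mydomain);
}
```

This only checks the shape of the URL; whether the file actually exists on disk would still need a separate file_exists() check like the one in the question's update.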
nicholaides
You could also use a regex, or SimpleXML with XPath.
prodigitalson
Regex is not a good way to parse HTML.
Justin Johnson
And SimpleXML won't tolerate invalid HTML. You would have to run it through HTML Tidy or similar first.
Byron Whitlock
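
For markup that is not well-formed, PHP's bundled DOM extension is one alternative, since DOMDocument::loadHTML() is lenient about sloppy HTML. A rough sketch, assuming the DOM extension is available; it borrows the get_all_image_attributes() name from the question's stub and swaps in DOMDocument for Simple HTML DOM Parser:

```php
<?php
// Sketch only: collects src/width/height from every <img> tag
// using DOMDocument instead of Simple HTML DOM Parser.
function get_all_image_attributes(string $htmlcode): array
{
    $images = [];
    if (trim($htmlcode) === '') {
        return $images; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Collect libxml parse errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($htmlcode);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src === '') {
            continue; // src is required, skip images without it
        }
        $image = ['src' => $src];
        // width and height are optional, so copy them only when present
        foreach (['width', 'height'] as $name) {
            if ($img->hasAttribute($name)) {
                $image[$name] = $img->getAttribute($name);
            }
        }
        $images[] = $image;
    }
    return $images;
}
```

Each entry always has a 'src' key, while 'width' and 'height' appear only when the tag carries them, so a misspelled attribute such as widht is simply ignored.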
General rules aren't always true; regex is fine for this.
rplevy
A: 

Something like this would probably work:

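#!/usr/bin/perl
use strict;
use warnings;

open(my $fh, '<', 'tmp.txt') or die "Cannot open tmp.txt: $!";
while (my $line = <$fh>) {
    # match src anywhere inside an <img> tag, in single or double quotes
    while ($line =~ m/<img\b[^>]*?\ssrc\s*=\s*(["'])(.*?)\1/gi) {
        my $imgurl = $2;
        verify_image($imgurl);    # verify_image() is left to the reader
    }
}
close($fh);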
rplevy
While this should work for many cases, it would not verify any image that doesn't have the src immediately after <img - so if there was something like <img id="x" src="x.gif"> it wouldn't be checked.
InsDel
Just a minor edit to address that (see the updated code).
rplevy