ansaurus

Question

How to extract img src, title and alt from HTML using PHP?

Answer 1

A:

You can also try SimpleXML if the HTML is guaranteed to be XHTML - it will parse the markup for you and you will be able to access the attributes just by their name. (There are DOM libraries as well if it's just HTML and you can't depend on the XML syntax.)
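As a rough sketch of the SimpleXML route, assuming the markup really is well-formed XHTML (the sample string and the $images array are made up for illustration):

```php
<?php
// Assumption: the input is well-formed XHTML; SimpleXML cannot parse tag soup.
$xhtml = '<div><img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" /></div>';

$xml = simplexml_load_string($xhtml);

$images = array();
foreach ($xml->xpath('//img') as $img) {
    // Attributes are read by name; the cast turns each one into a plain string.
    $images[] = array(
        'src'   => (string) $img['src'],
        'title' => (string) $img['title'],
        'alt'   => (string) $img['alt'],
    );
}

print_r($images);
```

Note that a real XHTML page usually declares a default namespace on the html element, in which case the XPath query would probably need that namespace registered first.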

Borek 2008-09-26 08:35:35

I don't think it's XHTML - for the DOM libraries, have you more information about these, and how to use them for this question? Thanks!

Sam 2008-09-26 08:48:18

Well it has been answered by Anonymous...

Borek 2008-09-26 09:57:25

Answer 2

A:

How about using a regular expression to find the img tags (something like "<img[^>]*>"), and then, for each img tag, you could use another regular expression to find each attribute.

Maybe something like " ([a-zA-Z]+)=\"([^"]*)\"" to find the attributes, though you might want to allow for quotes not being there if you're dealing with tag soup... If you went with that, you could get the parameter name and value from the groups within each match.
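A minimal sketch of that two-step idea in PHP, using the two patterns above. The sample $html and the variable names are only for illustration, and it assumes every attribute value is double-quoted:

```php
<?php
// Illustrative input; in practice $html would be the page source.
$html = '<p><img src="/a.png" alt="first"><img alt="second" src="/b.png" title="B"></p>';

// Step 1: find every img tag.
preg_match_all('/<img[^>]*>/i', $html, $tags);

$images = array();
foreach ($tags[0] as $tag) {
    // Step 2: find each name="value" pair inside the tag.
    preg_match_all('/ ([a-zA-Z]+)="([^"]*)"/', $tag, $pairs, PREG_SET_ORDER);

    $attributes = array();
    foreach ($pairs as $pair) {
        // Group 1 is the attribute name, group 2 is its value.
        $attributes[strtolower($pair[1])] = $pair[2];
    }
    $images[] = $attributes;
}

print_r($images);
```

Unquoted or single-quoted values are silently skipped by this pattern, which is the tag soup caveat mentioned above.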

MB 2008-09-26 08:47:22

Uh, yes, I was thinking something along these lines, but I'm looking for an implementation of the idea - I'm not good at regex :(

Sam 2008-09-26 08:50:32

No, me neither! A regexp tester is worth its weight in gold - I use an Eclipse plugin but I'm sure there are many others available. Also the regexp information at www.regular-expressions.info is the best I've seen online. My guess is that regexp would be the simplest way to do what you want to do.

MB 2008-09-26 09:01:59

I use RegexBuddy, it's awesome :) It's a pity though that there's no standard syntax for regex find/replace (so that the regex you put in also does the replace operation simultaneously). Instead, you need program code to do the two parts.

Chris Dennett 2010-03-19 16:01:33

Answer 3

A:

You can write a regexp to get all img tags (<img[^>]*>), and then use a simple explode: $res = explode("\"", $tags). The output will be something like this:

$res[0] = "<img src=";
$res[1] = "/image/fluffybunny.jpg";
$res[2] = "title=";
$res[3] = "Harvey the bunny";
$res[4] = "alt=";
$res[5] = "a cute little fluffy bunny";
$res[6] = "/>";

If you delete the <img part before the explode, then you will get an array in the form of

property=
value

so the order of the properties is irrelevant; you only use the ones you want.
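Here is a small sketch of the explode idea, assuming every value is wrapped in double quotes and contains none itself. The sample tag and the $attributes array are just for illustration:

```php
<?php
// Assumption: all attribute values are double-quoted and contain no double quote.
$tag = '<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />';

// Drop the leading "<img" so the pieces alternate between "name=" and value.
$parts = explode('"', substr($tag, strlen('<img')));

$attributes = array();
for ($i = 0; $i + 1 < count($parts); $i += 2) {
    // Each even piece looks like ' src='; strip the spaces and the '='.
    $name = trim($parts[$i], ' =');
    $attributes[$name] = $parts[$i + 1];
}

print_r($attributes);
```

Because the result is keyed by property name, the order of the properties in the tag does not matter.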

Biri 2008-09-26 08:49:26

Answer 4

+10 A:

Use xpath.

For PHP you can use SimpleXML or domxml.
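A possible XPath sketch, using PHP's DOM extension (DOMDocument and DOMXPath) rather than the older domxml functions. That swap, the sample markup and the variable names are my assumptions, not part of the answer:

```php
<?php
// Illustrative input: two images, the second one has no alt attribute.
$html = '<p><img src="/a.png" alt="first"><img src=\'/b.png\'></p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select the src attribute nodes of all img elements directly.
$sources = array();
foreach ($xpath->query('//img/@src') as $attr) {
    $sources[] = $attr->nodeValue;
}

// XPath predicates can also filter, e.g. images without an alt attribute.
$missingAlt = $xpath->query('//img[not(@alt)]')->length;

print_r($sources);
echo $missingAlt, "\n";
```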

EDIT : now that I know better

Using regexp to solve this kind of problem is a bad idea and will likely lead to unmaintainable and unreliable code. Better use an HTML parser.

Solution With regexp

In that case it's better to split the process in two parts :

get all the img tags
extract their metadata

I will assume your doc is not strict XHTML, so you can't use an XML parser. E.g. with this web page source code :

/* preg_match_all matches the regexp against the whole $html string and outputs
every match as an array in $result. The "i" option makes it case insensitive */

preg_match_all('/<img[^>]+>/i', $html, $result);

print_r($result);
Array
(
    [0] => Array
        (
            [0] => <img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />
            [1] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
            [2] => <img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />
            [3] => <img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&amp;d=identicon&amp;r=PG" height=32 width=32 alt="gravatar image" />
            [4] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />

[...]
        )

)

Then we get all the img tag attributes with a loop :
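The loop itself is not shown in this copy of the answer. A sketch of what it could look like, built from the second regexp explained further down and iterating over $result[0] (the sample $html and the $img array name are assumptions):

```php
<?php
// Stand-in for the page source; two of the tags from the output above.
$html = '<img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />'
      . '<img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />';

preg_match_all('/<img[^>]+>/i', $html, $result);

$img = array();
// $result[0] holds the full matches, one string per img tag.
foreach ($result[0] as $img_tag) {
    // Store the attribute matches under the tag they came from.
    preg_match_all('/(alt|title|src)=("[^"]*")/i', $img_tag, $img[$img_tag]);
}

print_r($img);
```

For each tag, index 1 of the stored array holds the attribute names and index 2 the values, still wrapped in their double quotes. Since the tag string is used as the key, identical tags collapse into one entry.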

Regexps are CPU intensive so you may want to cache this page. If you have no cache system, you can roll your own by using ob_start and loading / saving from a text file.
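One way such a home-made cache might look. The cached_output() helper, the 60 second lifetime and the temp file name are all hypothetical, not something the answer specifies:

```php
<?php
// Hypothetical helper: return the cached text if it is fresh enough,
// otherwise run $render, capture what it prints and save it to $cacheFile.
function cached_output($cacheFile, $maxAge, $render)
{
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $maxAge) {
        return file_get_contents($cacheFile);
    }

    ob_start();                       // start capturing everything echoed
    call_user_func($render);
    $output = ob_get_clean();         // stop capturing and fetch the buffer

    file_put_contents($cacheFile, $output, LOCK_EX);
    return $output;
}

// Usage sketch: the closure stands in for the regexp-heavy page.
$calls = 0;
$render = function () use (&$calls) {
    $calls++;
    echo 'expensive page built with regexps';
};

$cacheFile = tempnam(sys_get_temp_dir(), 'img');
unlink($cacheFile);                   // start without a cached copy

$first  = cached_output($cacheFile, 60, $render);
$second = cached_output($cacheFile, 60, $render);   // served from the file

unlink($cacheFile);
```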

How does this stuff work ?

First, we use preg_match_all, a function that gets every string matching the pattern and outputs it in its third parameter.

The regexps :

<img[^>]+>

We apply it to the whole HTML page. It can be read as every string that starts with "<img", contains one or more non-">" chars and ends with a ">".

(alt|title|src)=("[^"]*")

We apply it successively on each img tag. It can be read as every string starting with "alt", "title" or "src", then a "=", then a ' " ', a run of characters that are not ' " ', and ending with a ' " '. The parentheses isolate the sub-strings they enclose.

Finally, every time you want to deal with regexps, it's handy to have good tools to quickly test them. Check this online regexp tester.

EDIT : answer to the first comment.

It's true that I did not think about the (hopefully few) people using single quotes.

Well, if you use only ', just replace all the " by '.

If you mix both, first you should slap yourself :-), then try to use ("|') instead of " and [^ø] to replace [^"].
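Another option for mixed quotes, different from the [^ø] trick above, is a backreference that forces the closing quote to be the same character as the opening one. A sketch, with a made-up sample tag:

```php
<?php
// Illustrative tag mixing single and double quotes.
$tag = "<img src='picture.jpg' alt=\"it's a picture\" title='say \"hi\"'/>";

// (["']) captures the opening quote; \2 requires the same character to close it.
preg_match_all('/(alt|title|src)=(["\'])(.*?)\2/i', $tag, $matches, PREG_SET_ORDER);

$attributes = array();
foreach ($matches as $m) {
    // $m[1] is the attribute name, $m[3] the value without its quotes.
    $attributes[strtolower($m[1])] = $m[3];
}

print_r($attributes);
```

This still does not cover unquoted values, which is one more reason an HTML parser is the safer choice.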

e-satis 2008-09-27 11:33:36

Only problem is single quotation marks:<img src='picture.jpg'/> will not work, the regex expects " all the time

Sam 2008-10-01 15:18:42

True, my friend. I added a note about that. Thanks.

e-satis 2008-10-04 11:20:11

Answer 8

+2 A:

The script must be edited like this:

foreach( $result[0] as $img_tag)

because preg_match_all returns an array of arrays.

Bakudan 2009-09-27 05:14:26

Answer 9

A:

"]+>]+>/)?>"

This will extract an anchor tag that wraps an image tag.

Muhammad Irfan 2010-04-29 06:31:55

Answer 10

+1 A:

$url = "http://example.com";

$html = file_get_contents($url);

$doc = new DOMDocument();
@$doc->loadHTML($html);

$tags = $doc->getElementsByTagName('img');

foreach ($tags as $tag) {
    echo $tag->getAttribute('src');
}
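The same approach can collect title and alt as well. A sketch that parses an inline string instead of fetching a URL; the sample markup and the $images array are assumptions for illustration:

```php
<?php
// Illustrative input: the second image has neither title nor alt.
$html = '<div><img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny">'
      . '<img src="/image/plain.jpg"></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings quiet without using @
$doc->loadHTML($html);
libxml_clear_errors();

$images = array();
foreach ($doc->getElementsByTagName('img') as $tag) {
    $row = array();
    foreach (array('src', 'title', 'alt') as $name) {
        // hasAttribute distinguishes a missing attribute from an empty one.
        $row[$name] = $tag->hasAttribute($name) ? $tag->getAttribute($name) : null;
    }
    $images[] = $row;
}

print_r($images);
```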

karim 2010-05-30 05:43:50

Very cool code!

Alex Polo 2010-08-16 10:35:40

Answer 11

A:

I used preg_match to do it.

In my case, I had a string containing exactly one <img> tag (and no other markup) that I got from WordPress and I was trying to get the src attribute so I could run it through timthumb.

// get the featured image
$image = get_the_post_thumbnail($photos[$i]->ID);

// get the src for that image
$pattern = '/src="([^"]*)"/';
preg_match($pattern, $image, $matches);
$src = $matches[1];
unset($matches);

To grab the title or the alt instead, you could simply use $pattern = '/title="([^"]*)"/'; for the title or $pattern = '/alt="([^"]*)"/'; for the alt. Sadly, my regex isn't good enough to grab all three (alt/title/src) in one pass though.
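For what it's worth, one pass is possible with preg_match_all and an alternation. A sketch, assuming double-quoted values; the $image string below just stands in for what get_the_post_thumbnail() returns:

```php
<?php
// Stand-in for the single <img> tag returned by WordPress.
$image = '<img width="150" height="150" src="/uploads/photo.jpg" alt="A photo" title="Photo" />';

// \s before the name keeps attributes such as data-src from matching.
preg_match_all('/\s(src|title|alt)="([^"]*)"/', $image, $matches);

// Pair each captured attribute name with its captured value.
$attributes = array_combine($matches[1], $matches[2]);

$src = isset($attributes['src']) ? $attributes['src'] : null;

print_r($attributes);
```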

Jazzerus 2010-09-28 16:59:34
