ansaurus

Question

Answer 1

A:

Remove the ^ and $ and if you want to capture the url for later use surround in ()

preg_match_all('#(http:\/\/(.*)\.(gif|png|jpg))#i', $text, $all_img);

Gary Green 2010-08-01 18:30:42

I hope the html doesn't have anything silly like multiple images in it.

jasonbar 2010-08-01 18:34:07

@jasonbar http://php.net/manual/en/function.preg-match-all.php - preg_match_all "searches *subject* for all matches to the regular expression given in *pattern*"

Ryan Kinal 2010-08-01 18:43:02

Or any non-image URLs before a .gif. @Ryan Kinal: Yes, but there's a greedy pattern in it.

relet 2010-08-01 18:43:47

Your regular expression is what is poor, not your choice of preg_match_all(). Consider the following string: `http://www.example.com/image.gif this is totally not what I want but I like .gif files.` (Hint, it doesn't match what you want it to match.)

jasonbar 2010-08-01 18:44:39

@jasonbar I think that's solved by using a non-greedy match. `(.*?)`

Mike B 2010-08-01 18:53:30

@Mike B, Consider: `http://www.example.com I like .gif a lot.` My point being: regular expressions are not well suited for parsing HTML. You could write a regular expression to get the source from an img tag, sure. It's not a good idea, though.

jasonbar 2010-08-01 19:00:23

And how about "http://www.example.com/My windows appends extensions to every filename I provide.gif.gif"?

relet 2010-08-01 21:04:35

Would anyone care to explain why the correct answer is downvoted here? Regular expressions are the perfect solution for this simple question. Invoking an SGML parser and sniffing for img tags is not only slower, but objectively overkill. @jasonbar: Of course the regex could be made more robust, but those silly counter examples ("text with spaces.gif") don't add anything useful to the discussion.

mario 2010-08-01 21:58:20

@mario, How about https? What about relative images? Maybe images served through a PHP script? Images with multiple extensions (my_image.gif.gif is more common than you'd think). Does the asker want to collect data from the text, or just image src attributes? (the question implies the latter). This is most certainly not the correct answer.

jasonbar 2010-08-01 22:05:12

@jasonbar: https is another fictional example, and *.gif.gif are easily included with a better regex like `#http:[^<\s>]+[.](gif|png|jpe?g)/i`. But the asker didn't ask for special cases. This question was just about fixing the regular expression, not finding the best approach.

mario 2010-08-01 22:17:46

@mario, Yes, why settle for "the best approach" when you can use a terrible approach and just change it for each new use-case you find? Also, spaces are valid in filenames and don't need to be properly encoded to work. Your regular expression further proves why this is a poor solution.

jasonbar 2010-08-01 22:27:26

Answer 2

+1 A:

There are any number of ways to parse the DOM with PHP. Here is an example that does exactly what you want (gets every img src from the html string)

http://simplehtmldom.sourceforge.net/#fragment-11

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach($html->find('img') as $element)
    echo $element->src . '<br>';

Regular expressions are poorly suited for parsing HTML.

jasonbar 2010-08-01 18:40:55

Thanks for the downvote with no comment!

jasonbar 2010-08-01 21:04:59

Another -1. Your answer fails the question by also returning relative img links. While it's technically the better approach, it's clearly out of proportion for the task at hand.

mario 2010-08-01 22:13:37

ansaurus

tags:

views:

answers:

find image links in html using php

related questions