tags:

views:

77

answers:

2

Hi,

Why does this code not find picture links in text?

$text = 'text text <img src = "http://www.singlewheel.com/Scoots/PAV/Martin7PAV/42316.jpg" /> text text';
preg_match_all('#^http:\/\/(.*)\.(gif|png|jpg)$#i', $text, $all_img);
A: 

Remove the ^ and $ and if you want to capture the url for later use surround in ()

preg_match_all('#(http:\/\/(.*)\.(gif|png|jpg))#i', $text, $all_img);
Gary Green
I hope the html doesn't have anything silly like multiple images in it.
jasonbar
@jasonbar http://php.net/manual/en/function.preg-match-all.php - preg_match_all "searches *subject* for all matches to the regular expression given in *pattern*"
Ryan Kinal
Or any non-image URLs before a .gif. @Ryan Kinal: Yes, but there's a greedy pattern in it.
relet
Your regular expression is what is poor, not your choice of preg_match_all(). Consider the following string: `http://www.example.com/image.gif this is totally not what I want but I like .gif files.` (Hint, it doesn't match what you want it to match.)
jasonbar
@jasonbar I think that's solved by using a non-greedy match. `(.*?)`
Mike B
@Mike B, Consider: `http://www.example.com I like .gif a lot.` My point being: regular expressions are not well suited for parsing HTML. You could write a regular expression to get the source from an img tag, sure. It's not a good idea, though.
jasonbar
And how about "http://www.example.com/My windows appends extensions to every filename I provide.gif.gif"?
relet
Would anyone care to explain why the correct answer is downvoted here? Regular expressions are the perfect solution for this simple question. Invoking an SGML parser and sniffing for img tags is not only slower, but objectively overkill. @jasonbar: Of course the regex could be made more robust, but those silly counter examples ("text with spaces.gif") don't add anything useful to the discussion.
mario
@mario, How about https? What about relative images? Maybe images served through a PHP script? Images with multiple extensions (my_image.gif.gif is more common than you'd think). Does the asker want to collect data from the text, or just image src attributes? (the question implies the latter). This is most certainly not the correct answer.
jasonbar
@jasonbar: https is another fictional example, and *.gif.gif are easily included with a better regex like `#http:[^<\s>]+[.](gif|png|jpe?g)/i`. But the asker didn't ask for special cases. This question was just about fixing the regular expression, not finding the best approach.
mario
@mario, Yes, why settle for "the best approach" when you can use a terrible approach and just change it for each new use-case you find? Also, spaces are valid in filenames and don't need to be properly encoded to work. Your regular expression further proves why this is a poor solution.
jasonbar
+1  A: 

There are any number of ways to parse the DOM with PHP. Here is an example that does exactly what you want (gets every img src from the html string)

http://simplehtmldom.sourceforge.net/#fragment-11

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach($html->find('img') as $element)
    echo $element->src . '<br>';

Regular expressions are poorly suited for parsing HTML.

jasonbar
Thanks for the downvote with no comment!
jasonbar
Another -1. Your answer fails the question by also returning relative img links. While it's technically the better approach, it's clearly out of proportion for the task at hand.
mario