I have this regex in PHP:

$regex = '/<img[^>]*'.'src=[\"|\'](.*)[\"|\']/Ui';

It captures all image tag sources in a string, but I only want to capture JPG files. I've tried to mess around with (.*) but I've only proven that I suck at regex... Right now I'm filtering the array afterwards, but that feels too much like a hack when I could do it directly with a proper match.

A: 

You just need to search for the .jpg before the closing quote, I believe:

$regex = '/<img[^>]*'.'src=[\"|\'](.*\.jpg)[\"|\']/Ui';
Brandon G
This doesn't seem to work, print_r shows that the array has the whole tag -- "<img src="blahblah"> -- as values.
Espo
Sorry, I just appended the .jpg to your code without testing it. I made some edits, although there seem to be some better options already ;).
Brandon G
A: 

You have to be careful to escape ' since you are using it as the PHP string delimiter.

Also, matching only sources that end in .jpg or .jpeg would do it.

$regex = '/<img[^>]*src=["\']([^\'"]*)\.(jpg|jpeg)["\'][^>]*>/Ui';
RageZ
Sorry 'bout that. It's my first time here and I forgot to quote it as code.
Espo
@Espo: no problem
RageZ
+5  A: 

Try this:

$regex = '/<img ([^>]* )?src=[\"\']([^\"\']*\.jpe?g)[\"\']/Ui';

I also removed the extra | in the character classes that was not needed.
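
A minimal usage sketch, assuming the markup is already in a string (the sample `$html` and the variable names are made up for illustration). Note that with `preg_match_all`, index 0 of the result holds the full matched text, which is why `print_r` can look like it only returns whole tags; for this pattern the file names end up in capture group 2.

```php
<?php
// Hypothetical sample markup, for illustration only.
$html = '<p><img src="a.jpg"> <img class="x" src=\'b.JPEG\' alt=""> <img src="c.png"></p>';

$regex = '/<img ([^>]* )?src=[\"\']([^\"\']*\.jpe?g)[\"\']/Ui';

// $matches[0] = full matches,
// $matches[1] = the optional attributes before src,
// $matches[2] = the src values we are after.
preg_match_all($regex, $html, $matches);

$jpgSources = $matches[2];
print_r($jpgSources);
```

The PNG is skipped, and the `i` modifier lets the upper-case `.JPEG` through.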

Ether
Thanks, that did the trick.
Espo
@Ether: I didn't see the `|`, good catch!
RageZ
This breaks, e.g. on `<img alt="1 >> epsilon" src="graph.jpeg">` or `<pre> <IMgetJPEG classrc="rc.jpeg"> ...`.
Svante
@Svante: I've fixed the regexp so it also matches "jpeg", and does not match the tag "imgetjpeg" or attribute "classrc" (by adding spaces as appropriate).
Ether
`<img alt="my image"src="a.jpeg">`.
Svante
And this doesn't even touch on the whole "not a regular language" problem, HTML comments, `<pre>` regions, strings in embedded scripts, etc. Leave the parsing of HTML to HTML parsers; that's their job, and there are enough of them.
Svante
@Svante: you're preaching to the choir with "don't use regexes to parse html". :p :)
Ether
A: 

Try:

$regex = '/<img[^>]*'.'src=[\"|\'](.*[.]jpg)[\"|\']/Ui';
brianegge
A: 

You all forgot that tags may have spaces between < and img

So a correct regexp should start with /<\s*img

Patonza
Yes, and this is only one small detail that HTML parsers have already solved.
Svante
Indeed. Using a stable HTML parser is a better approach (even if in some circumstances it may not be an option)
Patonza
+2  A: 

First, get all img tags with an HTML parser. Then, take those whose src attribute's value is matched by the regex \.(jpeg|jpg)$.

For example, using this parser:

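// file_get_html() is provided by the parser linked above
$html = file_get_html('http://example.foo.org/bar.html');
foreach ($html->find('img') as $img) {
    // preg_match() needs pattern delimiters; the i flag also accepts .JPG/.JPEG
    if (preg_match('/\.(jpeg|jpg)$/i', $img->src)) {
        // save $img or $img->src or whatever you need
    }
}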

Edit: I shortened the regular expression. You can also use \.jpe?g$.
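
If the HTML is already in a string (fetched however you like), the same idea can be sketched with PHP's bundled DOM extension. This swaps `DOMDocument` in for the parser used above, so treat it as an alternative rather than the same API; the helper name `jpgSources` and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: returns the src of every <img> whose value ends in .jpg or .jpeg.
function jpgSources(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        // Same test as above, case-insensitive.
        if (preg_match('/\.jpe?g$/i', $src)) {
            $sources[] = $src;
        }
    }
    return $sources;
}

// Sample markup with a ">" inside an attribute value and a non-JPG image.
$html = '<p><img alt="1 >> epsilon" src="graph.jpeg"><img src="logo.png"><img src="photo.JPG"></p>';
print_r(jpgSources($html));
```

Because the parser handles the tag and attribute syntax, the regex only has to look at the attribute value itself.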

Svante
Thanks, I'll look into it. A quick look does show that it uses fopen though -- which I've disabled and have tested to be less reliable than cURL for my use.
Espo