I'm trying to match the highlighted parts of this string:

<iframe maybe something here src="http://some.random.url.com/" and the string continues...

I need to match the src="" only if it's placed inside of an iframe tag. The iframe tag can be placed anywhere in the source.

Thanks in advance! :)

+8  A: 

You should use a DOM parser for that. Here's an example with DOMDocument :

<?php
    $document = new DOMDocument();
    $document->loadHTML(file_get_contents('yourFileNameHere.html'));
    $lst = $document->getElementsByTagName('iframe');

    for ($i = 0; $i < $lst->length; $i++) {
        $iframe = $lst->item($i);
        // Skip iframes that have no src attribute instead of failing on null
        if ($iframe->hasAttribute('src')) {
            echo $iframe->getAttribute('src'), '<br />';
        }
    }
?>
HoLyVieR
Why is using a DOM parser better than just preg_matching out the part that I want? It seems simpler to write one regex for it all. Apparently the parser is better for some reason, because this has already gotten 5 thumbs up, hehe...
Nike
@Nike because [HTML is not regular](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). HTML can be broken, attributes can contain characters that you are expecting to find at the end of a tag, tags can be nested... all of that makes regular expressions a bad tool for parsing HTML.
Daniel Vandersluis
@Nike If you just use a regex, you might match an `<iframe ...` tag inside of a comment, or your regex might not handle some characters that could appear between the `<iframe` and the `src=` attribute, or you might get the delimiters wrong at the end of the `src` attribute (attributes might not be quoted), and you'll have to do HTML entity decoding on the contents of the `src` attribute yourself. By the time you handle all of these cases in a regex, it will be longer, more complicated, and much more likely to be buggy than just using a DOM parser.
Brian Campbell
@Nike Just consider this example: <!--<iframe src="-->NotAPath"> How can a regexp reliably recognize that it's not an iframe?
HoLyVieR
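To make the points in these comments concrete, here is a minimal sketch that runs a naive regex and DOMDocument over the same markup. The input string, the example hostnames, and the variable names are made up for illustration; the regex is just one plausible naive pattern, not the one from any answer here.

```php
<?php
// Hypothetical input: the first iframe is commented out, the second is real
// and has an HTML entity (&amp;) inside its src attribute.
$html = '<p>intro</p>'
      . '<!-- <iframe src="http://commented.example/"></iframe> -->'
      . '<iframe width="300" src="http://real.example/?a=1&amp;b=2"></iframe>';

// Naive regex: knows nothing about comments or entities.
preg_match_all('/<iframe\s[^>]*src="([^"]*)"/i', $html, $m);
$regexSrcs = $m[1];

// DOM parser: comments become comment nodes, entities are decoded.
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$document = new DOMDocument();
$document->loadHTML($html);
libxml_clear_errors();

$domSrcs = [];
foreach ($document->getElementsByTagName('iframe') as $iframe) {
    if ($iframe->hasAttribute('src')) {
        $domSrcs[] = $iframe->getAttribute('src');
    }
}

print_r($regexSrcs);  // includes the commented-out iframe, entity left encoded
print_r($domSrcs);    // only the real iframe, with &amp; decoded to &
```

The regex reports two URLs, one of them from inside the comment, and leaves `&amp;` undecoded; the DOM version reports only the live iframe.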
A: 

You should use a DOM parser, but this regex will get you started if there is a reason you must use regexes:

.*(?<iframeOpening><iframe)\s[^>]*src=['"](?<iframeSrc>[^>'"]+)['"]?.*

It uses named capture groups. Here's how to read them back in PHP:

preg_match('/.*(?<iframeOpening><iframe)\s[^>]*src=[\'"](?<iframeSrc>[^>\'"]+)[\'"]?.*/', $searchText, $groups);
print_r($groups['iframeSrc']);
Chad
Sorry if I was unclear. That matches the entire iframe element, but I only want to match the SRC of the iframe. :)
Nike
@Nike, you weren't unclear. The pattern does match the entire iframe element, but it includes named groups so you can retrieve just the src. See my modified answer.
Chad
Nike
@Nike, try it now, I modified it slightly
Chad
Got an error now: Warning: preg_match() [function.preg-match]: Compilation failed: nothing to repeat at offset 70 in....
Nike
I got it working! :) I removed the second * at the end, and now I only get the SRC. Is there any way to remove the src= and the quotes around the URL? Thanks!
Nike
@Nike, that extra `*` was a typo. I changed it to only return the contents of the src attribute. You did suggest you wanted the src included in your question though, which is why I had it returned.
Chad
Brian Campbell
I'm probably going to change to using a DOM parser later, but right now I know what the URLs are going to be (mostly), and I also know what the source code of the webpage looks like, so it will (hopefully) work as it should for the moment, until something changes. Thanks for the help! :)
Nike
@Brian Campbell, I fully agree, DOM is almost always the best approach... but depending on the situation, it's not always.
Chad
+2  A: 

If your source is well-formed XML, you can also use XPath to find the string.

<?php
  $file = simplexml_load_file("file.html");
  $result = $file->xpath("//iframe[@src]/@src");
?>
adamse
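If the page is not well-formed XML, the same XPath idea can still be tried through DOMDocument's more forgiving HTML parser plus DOMXPath. This is only a sketch: the sample markup and variable names are assumptions, not taken from the question.

```php
<?php
// Hypothetical sloppy HTML: unquoted attribute, unclosed <br>,
// and one iframe with no src at all.
$html = '<div><iframe width=300 src="http://some.random.url.com/"></iframe>'
      . '<br><iframe></iframe></div>';

libxml_use_internal_errors(true);   // tolerate markup the parser complains about
$document = new DOMDocument();
$document->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($document);

// Select the src attribute nodes of all iframe elements.
$srcs = [];
foreach ($xpath->query('//iframe/@src') as $attr) {
    $srcs[] = $attr->value;
}

print_r($srcs);
```

The iframe without a `src` simply contributes nothing to the result, so no extra null check is needed.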
+1  A: 
<?php
$html='<iframe maybe somethin gere src="http://some.random.url.com/" and blablabla';

preg_match('|<iframe [^>]*(src="[^"]+")[^>]*|', $html, $matches);

var_dump($matches);

Output:

array(2) {
  [0]=>
  string(75) "<iframe maybe somethin gere src="http://some.random.url.com/" and blablabla"
  [1]=>
  string(33) "src="http://some.random.url.com/""
}

This is a quick way to do it with a regular expression, but it may break on unclean HTML or cause other problems. Go for a DOM parser if you want a robust solution.

aularon
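If only the URL is wanted, without the `src=` and the quotes, a variant along these lines might do. It is a sketch with the same caveats as any regex over HTML; the function name `extractIframeSrcs` and the second sample string are hypothetical.

```php
<?php
/**
 * Return the src values of iframe tags found in $html (regex-based sketch).
 * Handles single or double quotes; ignores attributes such as data-src.
 */
function extractIframeSrcs(string $html): array
{
    // (["\']) remembers the opening quote, \1 requires the same closing quote.
    // (?<![\w-]) stops "src" from matching inside names like "data-src".
    $pattern = '/<iframe\s[^>]*?(?<![\w-])src\s*=\s*(["\'])(?<iframeSrc>.*?)\1/i';
    preg_match_all($pattern, $html, $matches);
    return $matches['iframeSrc'];
}

// The string from the question.
$fromQuestion = '<iframe maybe something here src="http://some.random.url.com/" and the string continues...';
// Hypothetical trickier input: a data-src attribute and single quotes.
$tricky = "<iframe data-src=\"http://lazy.example/\" src='http://some.random.url.com/' width=\"1\">";

print_r(extractIframeSrcs($fromQuestion));
print_r(extractIframeSrcs($tricky));
```

It still sees iframes inside comments and does no entity decoding, so it is only suitable when the markup is known in advance.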
+1  A: 

see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

That said, your particular situation isn't really parsing... just string matching. Methods for that have already been enumerated before my answer here...

Here Be Wolves
I'm waiting to be disappointed whenever I enter this kind of question expecting a post that links to that answer :)
BoltClock
well, we work hard to spread the Tales of the Regex Dom Parser, and Its Demise :D
Here Be Wolves