My SMF forum contains posts with videos, and I want to extract them to show on the WordPress main page. My current regexp (thanks to SO!) extracts the URLs of the videos, which I embed using AutoEmbed.

Everything works until a post looks like this:

<embed height="600" width="600" allowscriptaccess="never" quality="high" loop="true" play="true" src="http://mmavlog.net/embed/player.swf?file=http://video.ufc.tv/CSG/UFC113/20100507_ufc113_weigh_in_400k.flv" type="application/x-shockwave-flash">

Here is my current regexp:

$regexp = "/\b(?:(?:https?|ftp):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i";

Since the posts can contain <embed> or <object> I realize that looking for the url by the "http" might be inaccurate. How can I use the regexp to look for "src=" for <embed> and "data=" for <object>?

+1  A: 

How not to do it even though it works:

$str = <<<HTML
<object width="550" height="400">
    <param name="movie" value="somefilename.swf">
    <embed src="somefilename.swf" width="550" height="400">
    </embed>
</object>
HTML;

$matches = array();
if (preg_match_all('/(src|value)="([^"]+)"/', $str, $matches)) {
   print_r($matches);
}
// Array
// (
//     [0] => Array
//         (
//             [0] => value="somefilename.swf"
//             [1] => src="somefilename.swf"
//         )
// 
//     [1] => Array
//         (
//             [0] => value
//             [1] => src
//         )
// 
//     [2] => Array
//         (
//             [0] => somefilename.swf
//             [1] => somefilename.swf
//         )
// 
// )

How to really do it:

This is an example of how to parse HTML with simplehtmldom, and this is what you should do instead of using regular expressions (you could use any other HTML parser, not strictly simplehtmldom; most of them have a similar API).

<?php
include('simple_html_dom.php');

$str = <<<HTML
<object width="550" height="400">
    <param name="movie" value="somefilename.swf">
    <embed src="somefilename.swf" width="550" height="400">
    </embed>
</object>
HTML;

$html = str_get_html($str);
$embed = $html->find('embed', 0);
echo $embed->src;
// prints somefilename.swf

// find() returns the <param> element here, so name the variable accordingly
$param = $html->find('object param', 0);
echo $param->value;
// prints somefilename.swf
?>
rebus
This might be a novice question, but how do I handle the quotes? I have $regexp = '(src|data)="([^"]+)"'
Ben
Here, I expanded a bit on the answer, including the advice from gurun8 and Delan Azabani, which I would expect is really the way you want to go.
rebus
Awesome, this looks so much easier and effective than the regex. Thanks for the update!
Ben
+1  A: 

Have you considered parsing the HTML as XML (provided the HTML is well formed) to extract node and attribute data rather than relying on regex?
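
A minimal sketch of that idea, assuming PHP's bundled DOM extension is available. It uses DOMDocument::loadHTML() rather than a strict XML parse, since forum markup is often not well formed. The markup is the sample from the other answer; the data attribute on <object> is added here only for illustration.

```php
<?php
// Sample markup; the data="..." attribute is a hypothetical addition.
$str = <<<HTML
<object width="550" height="400" data="somefilename.swf">
    <param name="movie" value="somefilename.swf">
    <embed src="somefilename.swf" width="550" height="400">
    </embed>
</object>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($str);
libxml_clear_errors();

$urls = array();

// src="..." on every <embed>
foreach ($doc->getElementsByTagName('embed') as $embed) {
    if ($embed->hasAttribute('src')) {
        $urls[] = $embed->getAttribute('src');
    }
}

// data="..." on every <object>
foreach ($doc->getElementsByTagName('object') as $object) {
    if ($object->hasAttribute('data')) {
        $urls[] = $object->getAttribute('data');
    }
}

print_r($urls);
```

Because the lookup goes by tag and attribute name, attribute order and other attributes in the tag should not affect the result.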

gurun8
I'm not really familiar with this process, could you point me in the right direction?
Ben
Or parse it with an SGML/HTML5 parser, which is what is meant to parse HTML.
Delan Azabani
Hey Ben! My apologies, I didn't see your comment before. Here's a PHP XML DOM link: http://www.w3schools.com/php/php_xml_dom.asp and this library looks interesting as well: http://simplehtmldom.sourceforge.net/ Delan's suggestion could be helpful too. Delan, do you have a helpful link you could recommend?
gurun8
Thanks! I will be looking into simplehtmldom for sure!
Ben
A: 

To solve the regexp:

/(?:src|data)="([^"]+)"/
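
A rough sketch of how that pattern might be applied, assuming the post body is already in a string. Only the <embed> comes from the question (shortened to the relevant attributes); the <object> tag and its example.com URL are made up for illustration. The pattern sits in a single-quoted PHP string, so the double quotes inside it need no escaping.

```php
<?php
// <embed> from the question plus a hypothetical <object> for illustration.
$post = <<<HTML
<embed height="600" width="600" src="http://mmavlog.net/embed/player.swf?file=http://video.ufc.tv/CSG/UFC113/20100507_ufc113_weigh_in_400k.flv" type="application/x-shockwave-flash">
<object data="http://example.com/player.swf" width="550" height="400"></object>
HTML;

// (?:...) groups without capturing, so group 1 holds only the attribute value.
// The /i flag (an addition here) also accepts SRC= and DATA=.
$regexp = '/(?:src|data)="([^"]+)"/i';

$urls = array();
if (preg_match_all($regexp, $post, $matches)) {
    $urls = $matches[1];
}

print_r($urls);
```

Note that the pattern does not check which tag the attribute belongs to, so it would also pick up src on an <img> or any attribute whose name merely ends in src or data.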

A hint: avoid embedding video with <embed> and <object>; that's so 2002. Try using the much simpler and more powerful <video> tag (which requires no plugins).

Delan Azabani
I would love to use the video tag, but not all browsers support it yet...
Ben
Would you rather `no IE support` or `buggy, insecure and non-futureproofed technology`? ;)
Delan Azabani
Ha, nice response, but since 60%+ of the users are on IE, I'm stuck until an update.
Ben