So, I'm basically trying to match anything inside (and including) object tags, with this:

<?php preg_match_all('/<object(.*)<\/object>/', $blah, $blahBlah); ?>

It finds a match for this:

<object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" width="400" height="250" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"&gt;&lt;param name="allowfullscreen" value="true" /><param name="allowscriptaccess" value="always" /><param name="src" value="http://vimeo.com/moogaloop.swf?clip_id=9048799&amp;amp;server=vimeo.com&amp;amp;show_title=1&amp;amp;show_byline=1&amp;amp;show_portrait=0&amp;amp;color=&amp;amp;fullscreen=1" /><embed type="application/x-shockwave-flash" width="400" height="250" src="http://vimeo.com/moogaloop.swf?clip_id=9048799&amp;amp;server=vimeo.com&amp;amp;show_title=1&amp;amp;show_byline=1&amp;amp;show_portrait=0&amp;amp;color=&amp;amp;fullscreen=1" allowscriptaccess="always" allowfullscreen="true"></embed></object>

But it won't match this:

<object width="400" height="300"><param name="allowfullscreen" value="true" /><param name="allowscriptaccess" value="always" /><param name="movie" value="http://vimeo.com/moogaloop.swf?clip_id=5630744&amp;amp;server=vimeo.com&amp;amp;show_title=1&amp;amp;show_byline=1&amp;amp;show_portrait=0&amp;amp;color=00ADEF&amp;amp;fullscreen=1" /><embed src="http://vimeo.com/moogaloop.swf?clip_id=5630744&amp;amp;server=vimeo.com&amp;amp;show_title=1&amp;amp;show_byline=1&amp;amp;show_portrait=0&amp;amp;color=00ADEF&amp;amp;fullscreen=1" type="application/x-shockwave-flash" allowfullscreen="true" allowscriptaccess="always" width="400" height="300"></embed></object>

Any idea why? Thanks for any insight.


ETA: Since my approach may have been faulty to begin with, here's some background on what I'm trying to do.

This is for a WordPress site. I am using a plugin that converts a shorttag into a full video embed code. The plugin was recently (thankfully) updated to make the code more valid.

The function I am trying to create is simply to find the first video object in a post, and grab it for use elsewhere on the site.

Here is the entire function (some of it will only make sense if you've worked with WordPress):

<?php
function catch_that_video() {
  global $post, $posts;
  $the_video = '';
  ob_start();
  ob_end_clean();
  $output = preg_match_all('/<object(.*)<\/object>/', $post->post_content, $vid_matches);
  $the_video = $vid_matches [1] [0];
  if(empty($the_video)){ $the_video = 0; }
  return $the_video;
}
?>
+1  A: 

The only thing that comes to mind is single vs multiple lines.

/<object(.*)<\/object>/m

That should match across multiple lines.

This manual page discusses the modifiers:

http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php

Update:

Upon further investigation, m is not the correct modifier (from the manual):

m (PCRE_MULTILINE) By default, PCRE treats the subject string as consisting of a single "line" of characters (even if it actually contains several newlines). The "start of line" metacharacter (^) matches only at the start of the string, while the "end of line" metacharacter ($) matches only at the end of the string, or before a terminating newline (unless D modifier is set). This is the same as Perl. When this modifier is set, the "start of line" and "end of line" constructs match immediately following or immediately before any newline in the subject string, respectively, as well as at the very start and end. This is equivalent to Perl's /m modifier. If there are no "\n" characters in a subject string, or no occurrences of ^ or $ in a pattern, setting this modifier has no effect.

(Emphasis my own.)

The correct modifier would be s, which allows the dot metacharacter (.) to match newlines.
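As a minimal sketch of the difference between the two modifiers (the $html string is a made-up sample, not the actual post content), the same pattern can be run with no modifier, with m, and with s against an <object> block that spans two lines:

```php
<?php
// Hypothetical sample input: one <object> block split across two lines.
$html = "<object width=\"400\">\n<param name=\"movie\" value=\"x\" /></object>";

// No modifier: "." stops at the newline, so the pattern never reaches </object>.
$plain = preg_match('/<object(.*)<\/object>/', $html, $m1);

// m (PCRE_MULTILINE) only changes how ^ and $ behave, so nothing changes here.
$multiline = preg_match('/<object(.*)<\/object>/m', $html, $m2);

// s (PCRE_DOTALL) lets "." match newlines as well, so the block is found.
$dotall = preg_match('/<object(.*)<\/object>/s', $html, $m3);

var_dump($plain, $multiline, $dotall); // int(0), int(0), int(1)
```

This only matters if the subject string really contains newlines between the opening and closing tags.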

Moving on to the updated question, the regex itself matches both of those inputs, if those inputs are simple strings. I don't know what's causing the actual issue.

$input = '<object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" width="400" height="250" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"&gt;&lt;param name="allowfullscreen" value="true" /><param name="allowscriptaccess" value="always" /><param name="src" value="http://vimeo.com/moogaloop.swf?clip_id=9048799&amp;amp;server=vimeo.com&amp;amp;show_title=1&amp;amp;show_byline=1&amp;amp;show_portrait=0&amp;amp;color=&amp;amp;fullscreen=1" /><embed type="application/x-shockwave-flash" width="400" height="250" src="http://vimeo.com/moogaloop.swf?clip_id=9048799&amp;amp;server=vimeo.com&amp;amp;show_title=1&amp;amp;show_byline=1&amp;amp;show_portrait=0&amp;amp;color=&amp;amp;fullscreen=1" allowscriptaccess="always" allowfullscreen="true"></embed></object>';

$input2 = '<object width="400" height="300"><param name="allowfullscreen" value="true" /><param name="allowscriptaccess" value="always" /><param name="movie" value="http://vimeo.com/moogaloop.swf?clip_id=5630744&amp;amp;server=vimeo.com&amp;amp;show_title=1&amp;amp;show_byline=1&amp;amp;show_portrait=0&amp;amp;color=00ADEF&amp;amp;fullscreen=1" /><embed src="http://vimeo.com/moogaloop.swf?clip_id=5630744&amp;amp;server=vimeo.com&amp;amp;show_title=1&amp;amp;show_byline=1&amp;amp;show_portrait=0&amp;amp;color=00ADEF&amp;amp;fullscreen=1" type="application/x-shockwave-flash" allowfullscreen="true" allowscriptaccess="always" width="400" height="300"></embed></object>';

$matches = array();
preg_match_all('/<object(.*)<\/object>/', $input, $matches); 
echo '<br />$input<pre>';
var_dump($matches);
echo '</pre>';

$matches2 = array();
preg_match_all('/<object(.*)<\/object>/', $input2, $matches2); 
echo '<br />$input2<pre>';
var_dump($matches2);
echo '</pre>';

Moving on:

What are you trying to accomplish with these two lines?

ob_start();
ob_end_clean();

This opens a new output buffer and immediately kills it. (See the bit about stacking output buffers in the documentation.)
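A small sketch of why that pair has no lasting effect, using ob_get_level() to watch the buffer stack (the variable names are only for illustration):

```php
<?php
// Record the nesting level of output buffers before doing anything.
$levelBefore = ob_get_level();

// Push a new buffer onto the stack...
ob_start();
$levelInside = ob_get_level();

// ...and immediately discard it without using its (empty) contents.
ob_end_clean();
$levelAfter = ob_get_level();

// The stack is back where it started, and nothing was captured or suppressed.
var_dump($levelInside === $levelBefore + 1); // bool(true)
var_dump($levelAfter === $levelBefore);      // bool(true)
```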

Is there a reason to set this equal to 0 instead of, say, null?

if(empty($the_video)){ $the_video = 0; }

Personally, I would set it to null when declaring it and rely on not clobbering that if there are no matches. This is how I would write that function, assuming that $post is a WordPress global. (Personally, I would just pass that into the function, as I have a disdain for most globals.)

function catch_that_video() 
{
  global $post;

  $the_video = null;
  $vid_matches = array();

  if(preg_match('/<object.*<\/object>/', $post->post_content, $vid_matches))
  {
    $the_video = $vid_matches[0];
  }

  return $the_video;
}

I changed it to use preg_match instead of preg_match_all, since you're using only the first match. This can, of course, be modified to use preg_match_all if necessary, though the appropriate regex will be a pain to create. (Adding the s modifier to the above regex in order to deal with multiple lines would make the greedy .* grab everything from the first opening <object> tag to the last closing </object> tag. I don't even want to think about trying to come up with a regex to cover multiple lines and grab individual <object>...</object> blocks.)
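For what it's worth, one common way to approach the multi-block case is a lazy quantifier (.*?) combined with the s modifier. The sketch below is an assumption-laden illustration, not a robust HTML parser: find_object_blocks() is a hypothetical helper, it takes the content as an argument rather than reading the global $post, and it will break on nested <object> tags.

```php
<?php
// Hypothetical helper: returns every <object>...</object> block in $content.
function find_object_blocks($content)
{
    $matches = array();
    // s: "." also matches newlines. i: tolerate <OBJECT>.
    // .*? is lazy, so each match ends at the nearest </object>
    // instead of running on to the last one.
    preg_match_all('/<object\b.*?<\/object>/is', $content, $matches);
    return $matches[0];
}

// Made-up input: two separate blocks, the first spanning two lines.
$html = "<p>intro</p><object width=\"1\">\n<param /></object>"
      . "<p>between</p><object width=\"2\"><param /></object>";

// For comparison, the greedy version swallows both blocks in one match.
preg_match_all('/<object\b.*<\/object>/is', $html, $greedy);

$blocks = find_object_blocks($html);
echo count($greedy[0]), "\n"; // 1
echo count($blocks), "\n";    // 2
```

In the WordPress function this could be called as find_object_blocks($post->post_content), taking the first element of the result if only the first video is needed.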

However, this doesn't answer the original question of why the second object block isn't being matched. I would focus my investigation on discovering the difference between the two strings. If the issue were a difference in line endings, I would use something like Vim on Linux, as that would display ^M in place of the \r in the line endings. What about HTML encoding of the string? Might that be a possible issue?
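Both checks can also be done from PHP itself. This is a rough debugging sketch with made-up sample strings: reveal_hidden() is a hypothetical helper, and whether the real post content is entity-encoded is an open assumption, not something established here.

```php
<?php
// Hypothetical debugging helper: make invisible characters visible.
function reveal_hidden($text)
{
    // addcslashes() escapes control bytes such as \r and \n in C style
    // and bytes from 127 upward as octal, so they show up when printed.
    return addcslashes($text, "\0..\37\177..\377");
}

// Made-up sample with Windows-style line endings.
$sample = "<object>\r\n</object>";
echo reveal_hidden($sample), "\n"; // <object>\r\n</object>

// If the content is stored HTML-encoded, the literal "<object" is absent
// and the regex has nothing to match until the entities are decoded.
$encoded = '&lt;object width="400"&gt;&lt;/object&gt;';
$foundRaw = strpos($encoded, '<object') !== false;
$foundDecoded = strpos(html_entity_decode($encoded), '<object') !== false;
var_dump($foundRaw, $foundDecoded); // bool(false), bool(true)
```

Running reveal_hidden() on both post contents and comparing the output side by side should expose any stray \r, tab, or non-ASCII byte that differs between them.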

George Marian
But don't *both* of those inputs use multiple lines?
Rob Kennedy
@Rob Kennedy That assumes that the input is formatted as it is presented in this question. (Which isn't a bad assumption.) When I couldn't come up with a reason why the regex isn't working for both, I decided to throw that assumption out the window.
George Marian
Thanks, that modifier didn't work, but I'm checking to see if any of the others might apply. Though, I'm beginning to see that my approach was wrong from the start anyway.
Kerri
@Kerri That's basically the reason that we shouldn't parse HTML with a regex. Regular expressions are powerful, but HTML is too variable for them. That said, I'm still curious what the issue is here.
George Marian
Updated the code, because I realized I went into "auto-prettify" mode when I added it. It is now exactly as it was output (line breaks and all).
Kerri
Yours *was* a good idea, George. And for the last few minutes, I've been thinking of expanding on it by exploring the possibility that PCRE doesn't consider one of the inputs to contain "real" line breaks (i.e., Windows- versus Unix-style). Now that Kerri's edited the question to show that neither input has any line breaks at all, that idea's shot, too.
Rob Kennedy
@Rob Kennedy Thanks, but I apparently failed at RTFM. See my update.
George Marian
Thanks again, George. This is really helpful, and at the very least, your suggestions have made my code cleaner. The code I used was adapted from a very widely used WordPress function, so with my minimal skills, I put little thought into the soundness of the code. That output buffer thing you pointed out does seem bizarre. It's in every single example of that function, but it makes no sense why. Still trying to figure out the whole matching issue. I don't think it's encoding. I'm pulling out the original source code for the plugin that generates it to see what differences there are between versions…
Kerri
@Kerri It's my pleasure. Feel free to update the question with anything you discover. If you figure it out, I'd like to know what the problem was. Or if you simply find anything useful, I may have more ideas. Just make sure to attach a comment to my answer or use @George in a comment attached to the question. Now that I've attached a comment to the question itself, I should get a notification about such a comment being a reply to my comment.
George Marian