OK, I need to scan many HTML / XHTML documents to see if a particular file has been embedded with SWFObject. If so, I need to replace the call with something else.

So far I have extracted the <script> contents where the calls can be made. Now I need to scan this string to check whether the call is there and, if it is, replace it.

I know this is a bit odd, but the content comes from a third party that we have no control over.

Since the call can be written in many different ways, I will need a regular expression to find and replace the calls.

OK imagine the following scenario:

I want to find out whether the file test.swf is embedded with SWFObject in the document.

The <script> content looks like this:

alert('test.swf');
//some other random stuff here
swfobject.embedSWF("test.swf",
"The alternative content can screw the regexp with );", "300", "120",
"9.0.0", false, flashvars, params, attributes);

Now I would like to replace swfobject.embedSWF (and all its parameters) with something else.

Is there a not too horrible way to do this? Don't forget that the call can be on one or many lines, that the parameters can be wrapped with single quotes (') or double quotes ("), that whitespace can be all around...

EDIT: OK, since catching every kind of JS syntax is a bit overkill, I will simplify the requirement:

The regular expression can assume only the following:

  1. The call is always on a single line
  2. It always starts with swfobject.embedSWF (case sensitive)
  3. It is then followed by optional whitespace and then a (
  4. It is then followed by optional whitespace and then a " or a ' (either one, but one of the two is required)
  5. It is then followed by the filename
  6. It is then followed by " or ' (ideally the same character as in 4, but it's not a big deal if we can't enforce that)
  7. It is then followed by optional whitespace and then a ,
  8. It is then followed by anything
  9. It is then followed by ) then optional whitespace, then ; then the end of the line.

It should be much simpler to parse this way (I guess).

EDIT 2: I've cooked up a solution. I think I'm close, but it's not working. Can anyone help? Test case 0 should match, but it doesn't...

<?php

$myFilename = 'test.swf';
$testCases = array();
$testCases[] = 'swfobject.embedSWF("test.swf", "The alternative content can screw the regexp with );", "300", "120", "9.0.0", false, flashvars, params, attributes);';

foreach ($testCases as $i => $currTest)
{
    $currResult = preg_match('/\s*swfobject\.embedSWF\s*\(\s*(["\'])(' . preg_quote($myFilename)  . ')[^"\']+\1\s*,[\s\S]+?\)\s*;\s*$/', $currTest);
    if ($currResult === false || $currResult < 1)
        echo $i, ' Not matching', PHP_EOL;
    else
        echo $i, ' Matching', PHP_EOL;
}

?>
+1  A: 

Use 'grep' or similar on the command line to get a list of files that contain the .swf/script/object strings you need. That'll whittle down the number of files you need to process.

Then, use a PHP script to slurp each of those files into the DOM parser of your choice and do the replacing/fixing-up there.
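As a rough illustration of the DOM step, here is a minimal sketch using PHP's bundled DOMDocument. The helper name `extractInlineScripts` and the sample markup are made up for the example, and it assumes the DOM extension is enabled:

```php
<?php
// Hypothetical helper: returns the text of every inline <script> element.
function extractInlineScripts($html)
{
    $doc = new DOMDocument();

    // Third-party markup is often invalid, so collect parser warnings
    // internally instead of letting them be printed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $scripts = array();
    foreach ($doc->getElementsByTagName('script') as $node) {
        // Skip external scripts; only inline code can contain the call.
        if (!$node->hasAttribute('src')) {
            $scripts[] = $node->textContent;
        }
    }
    return $scripts;
}

// Sample markup, invented for this example.
$html = '<html><head><script type="text/javascript">'
      . 'swfobject.embedSWF("test.swf", "alt", "300", "120", "9.0.0");'
      . '</script></head><body><p>Hi</p></body></html>';

foreach (extractInlineScripts($html) as $js) {
    if (strpos($js, 'swfobject.embedSWF') !== false) {
        echo 'Found a candidate script block', PHP_EOL;
    }
}
```

Each returned string can then be checked and rewritten separately before being written back to the document.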

Marc B
I need an all PHP solution for this as everything is web based.
Activist
Simple enough to replace the command line grep with a few loops and opendir/readdir and/or DirectoryIterator, and doing your own regular expressions. And even if you're limited to remote http-based access, you might still be able to `exec` grep from within php anyways.
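Something along these lines would do as a pure-PHP pre-filter. It is only a sketch: `findFilesContaining` is a made-up helper name, and the demo files are created in a throwaway temp directory just to show the call:

```php
<?php
// Hypothetical helper: a pure-PHP stand-in for `grep -rl`.
function findFilesContaining($dir, $needle)
{
    $matches = array();
    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($files as $file) {
        if (!$file->isFile() || !$file->isReadable()) {
            continue;
        }
        // A plain substring test is enough to shortlist files;
        // the precise regular expression can run on the shortlist later.
        $contents = file_get_contents($file->getPathname());
        if ($contents !== false && strpos($contents, $needle) !== false) {
            $matches[] = $file->getPathname();
        }
    }
    sort($matches);
    return $matches;
}

// Demo on a throwaway directory (the file names are made up).
$dir = sys_get_temp_dir() . DIRECTORY_SEPARATOR . uniqid('swfscan_');
mkdir($dir);
file_put_contents($dir . '/a.html', '<script>swfobject.embedSWF("test.swf", "alt");</script>');
file_put_contents($dir . '/b.html', '<p>No Flash here</p>');

foreach (findFilesContaining($dir, 'swfobject.embedSWF') as $path) {
    echo $path, PHP_EOL;
}
```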
Marc B
+2  A: 

Well, somebody had the time to write a basic JavaScript parser in PHP. I'd give the tokenizer a try (possibly using an HTML parser to first find the <script> nodes).

Wrikken
I only need the regular expression to locate the call; everything else is already done.
Activist
Wrikken
OK, if it's not possible with a simple regexp, I will look for another solution then.
Activist
+1  A: 

Regarding your EDIT 2...

I'm not the best with regular expressions but you can try:

$currResult = preg_match('/\s*swfobject\.embedSWF\s*\(\s*(["\'])(' . preg_quote($myFilename)  . ')\1\s*,[\s\S]+?\)\s*;\s*$/', $currTest);

Seems to work OK for me.
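The only change from your version is that the `[^"\']+` after the filename is gone. It demanded at least one extra non-quote character between the filename and the closing quote, so `"test.swf"` could never match.

For the replacement step, here is a minimal sketch that assumes the simplified one-line rules from your edit. `replaceEmbedCall` and `myEmbed` are names I made up for illustration. I also swapped `[\s\S]+?` for `.+?` with the `m` modifier so a match cannot run across lines, and used `preg_replace_callback` so the replacement text is inserted literally:

```php
<?php
// Hypothetical helper: replaces a one-line swfobject.embedSWF(...) call
// for the given file with $replacement, leaving everything else untouched.
function replaceEmbedCall($script, $filename, $replacement)
{
    $pattern = '/swfobject\.embedSWF\s*\(\s*(["\'])'
             . preg_quote($filename, '/')   // escape the dot and the delimiter
             . '\1\s*,.+?\)\s*;[ \t]*$/m';  // \1 = same quote as the opening one

    // A callback avoids any special meaning of $ or \ in the replacement text.
    $result = preg_replace_callback(
        $pattern,
        function ($m) use ($replacement) {
            return $replacement;
        },
        $script
    );

    // preg_replace_callback returns null on error; fall back to the input.
    return $result === null ? $script : $result;
}

$script = <<<'JS'
alert('test.swf');
//some other random stuff here
swfobject.embedSWF("test.swf", "The alternative content can screw the regexp with );", "300", "120", "9.0.0", false, flashvars, params, attributes);
JS;

echo replaceEmbedCall($script, 'test.swf', 'myEmbed("test.swf");'), PHP_EOL;
```

The `);` inside the alternative content does not end the match early, because the pattern only accepts a `);` that is followed by the end of the line.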

AlexV