I am trying to incorporate a newsletter system into a PHP site. The site loads a content area and also loads scripts into the head of the page. This works fine for the code the site generates itself, but now I need to bring in the newsletter as well.

Originally I was going to use an iframe, but the number of AJAX and jQuery calls makes this quite complex.

So I thought I could use cURL to load the newsletter page as a variable. Then I was going to use RegEx to grab the content between the body tags and place this in the content area. Finally I was going to use RegEx again to search through the head and grab any scripts.

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $config_live_site."lib/alerts/user/[email protected]"); # URL to post to
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1 ); # return into a variable
curl_setopt($ch, CURLOPT_HEADER, 0);
$loaded_result = curl_exec( $ch ); # run!
curl_close($ch);

// Capture the body content and place in $_content
if (preg_match('%<body>([\s\S]*)</body>%', $loaded_result, $regs)) {
 $_content .= $regs[1];
} else {
 $_content .= "<p>No content to display.</p>";
}

// Capture the scripts and place in the head
if (preg_match('%(<script type="text/javascript">[\s\S]*</script>)%', $loaded_result, $regs)) {
 $headDetails .= $regs[0];
}

This works most of the time, but if there is also a script in the body of the document, the match runs all the way down to the last </script>.

My question is two-fold I guess...

A. Is there a better overall approach (My deadline is very short so it needs to be a quick solution without too much editing of the newsletter code)?

B. What RegEx would I need to use to just capture the first script?

+1  A: 

A: I see no issues with using regular expressions to extract the bits you need from HTML pages which are not necessarily valid. In fact some of the spidering solutions I worked with did exactly that.

B: Use preg_match_all() instead of preg_match(). preg_match() only captures the first match while preg_match_all() will continue until the end of the string and return all matches.
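
To make the difference concrete, here is a minimal sketch. The $html string is a made-up stand-in for $loaded_result, and the pattern uses [^<]* only because these sample scripts contain no "<" character; it is not meant as a general-purpose script matcher.

```php
<?php
// Hypothetical sample standing in for $loaded_result from the question.
$html = '<head><script type="text/javascript">var a = 1;</script>'
      . '<script type="text/javascript">var b = 2;</script></head>';

// The sample scripts contain no "<", so [^<]* is enough for this illustration.
$pattern = '%<script type="text/javascript">[^<]*</script>%';

// preg_match() stops after the first match; $first[0] is that match.
preg_match($pattern, $html, $first);

// preg_match_all() keeps scanning; $all[0] is an array of every full match.
$count = preg_match_all($pattern, $html, $all);

// Join all captured script blocks for the page head.
$headDetails = implode("\n", $all[0]);
```

Note that preg_match_all() fills a nested array, so the individual matches are in $all[0][0], $all[0][1] and so on, not in $all[0] directly.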

Rowlf
+2  A: 

I think you'll need to add a ? after the * in the script regex so it's not greedy. Greedy quantifiers match as much as possible (everything between the first opening tag and the last closing tag); non-greedy ones match as little as possible (only what's between an opening tag and the first closing tag after it). Try:

%(<script type="text/javascript">[\s\S]*?</script>)%

As mentioned, change it to preg_match_all, and you should just match the individual script sections instead of everything between the first and last script tags.
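
A small side-by-side sketch of the two behaviours, assuming a made-up page ($html is hypothetical) with one script in the head and one in the body, which is the case that breaks in the question:

```php
<?php
// Hypothetical page: one script in the head, one in the body.
$html = '<html><head><script type="text/javascript">var a = 1;</script></head>'
      . '<body><p>Hello</p><script type="text/javascript">var b = 2;</script></body></html>';

// Greedy: [\s\S]* runs on to the LAST </script>, swallowing the markup in between.
preg_match_all('%<script type="text/javascript">[\s\S]*</script>%', $html, $greedy);

// Lazy: [\s\S]*? stops at the FIRST </script> after each opening tag.
preg_match_all('%<script type="text/javascript">[\s\S]*?</script>%', $html, $lazy);

$greedyCount = count($greedy[0]); // one oversized match
$lazyCount   = count($lazy[0]);   // one match per script block
```

With the greedy pattern the single match also contains the </head>, <body> and <p> markup that sits between the two scripts, which is why the body content ends up duplicated in the head.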

Tim Lytle
This worked perfectly, thank you. :) Funny how adding a single question mark makes such a massive difference.
Das123
A: 
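libxml_use_internal_errors(true); // keep warnings about malformed HTML quiet
$doc = new DOMDocument();
$doc->loadHTML($loaded_result);   // $loaded_result is the page fetched with cURL
$xpath = new DOMXPath($doc);

// Select only the <script> elements that are direct children of <head>
$kod = $xpath->query("//head/script");
$i = 0;
foreach ($kod as $node) {
    echo "I'm script no. " . (++$i) . ' in the head and this is my content: ';
    echo $doc->saveXML($node) . "\n";
}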
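
The same idea can probably cover the body capture from the question as well. What follows is only a sketch under a few assumptions: the sample markup is invented, the variable names $_content and $headDetails are borrowed from the question, and saveHTML() with a node argument needs PHP 5.3.6 or later.

```php
<?php
// Hypothetical, deliberately sloppy markup standing in for $loaded_result.
$loaded_result = '<html><head><title>News</title>'
    . '<script type="text/javascript">var a = 1;</script></head>'
    . '<body class="nl"><p>Hello<script type="text/javascript">var b = 2;</script></body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($loaded_result);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Scripts that are direct children of <head> only.
$headDetails = '';
foreach ($xpath->query('//head/script') as $script) {
    $headDetails .= $doc->saveHTML($script) . "\n";
}

// Inner HTML of <body>: serialize each child node in turn.
$_content = '';
$body = $doc->getElementsByTagName('body')->item(0);
if ($body !== null) {
    foreach ($body->childNodes as $child) {
        $_content .= $doc->saveHTML($child);
    }
}
```

Unlike the '%<body>...%' regex, this does not care whether the body tag carries attributes, and a script inside the body stays with the body content instead of leaking into the head.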
useless
Looks like Eineki just stole my answer.
useless
This code doesn't need well-formed HTML; DOMDocument will correct it the way your browser does.
useless
Dear user267351, may I call you just 351? ;) I'm sorry to hear you think I borrowed, or stole, your answer. The task is so simple (load the XML, apply an XPath query and process the results) that the solutions are bound to be similar. By the way, I didn't know about DOMDocument's loadHTML method; I have always resorted to tidy to fix broken HTML documents. Good to have learned it.
Eineki
A: 

A quick and dirty approach is to delete the body content just after capturing it, so that only the head is left to search. Equivalently, capture the head on its own:

// Capture the head content; the "No content" paragraph does not belong in a head
$_header = '';
if (preg_match('%<head>([\s\S]*)</head>%', $loaded_result, $regs)) {
   $_header = $regs[1];
}

Then apply the script regex to the head only:

if (preg_match('%(<script type="text/javascript">[\s\S]*</script>)%', $_header, $regs)) {
   $headDetails .= $regs[0];
}

If the HTML you get from cURL is well formed, you should use SimpleXML to perform the extraction. As its name suggests, it is very simple to use.

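// $loaded_result is the page fetched with cURL; this returns false if it is not well-formed XML
$xml = simplexml_load_string($loaded_result);

$body = $xml->body->asXML();

$_scripts = '';
$scripts = $xml->xpath('//head/script');
foreach ($scripts as $script) {
  $_scripts .= $script->asXML();
}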

If your HTML is not well formed, then you have to resort to tidy to normalize it (or, better, fix the scripts that output the invalid HTML).

Eineki
I tried the XML approach based on your answer but I suspect the HTML is not completely valid so it threw a number of exceptions. Looked like it would have been a nice solution but the RegEx is doing the job for now. Thanks. :)
Das123