ansaurus

Question

Extracting portions of a loaded page in PHP (RegEx)

Answer 1

+1 A:

A: I see no issues with using regular expressions to extract the bits you need from HTML pages which are not necessarily valid. In fact some of the spidering solutions I worked with did exactly that.

B: Use preg_match_all() instead of preg_match(). preg_match() only captures the first match while preg_match_all() will continue until the end of the string and return all matches.
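To make the difference concrete, here is a minimal sketch. The `$html` string is a made-up stand-in for the fetched page, and the pattern uses a lazy `*?` quantifier so that each script block is matched on its own.

```php
<?php
// Hypothetical sample input standing in for the loaded page.
$html = '<script type="text/javascript">a();</script>'
      . '<p>text</p>'
      . '<script type="text/javascript">b();</script>';

$pattern = '%<script type="text/javascript">[\s\S]*?</script>%';

// preg_match() stops after the first match; $first[0] is that match.
preg_match($pattern, $html, $first);

// preg_match_all() keeps scanning to the end of the string;
// $all[0] is an array holding every full match, and the return
// value is the number of matches found.
$count = preg_match_all($pattern, $html, $all);

echo $first[0], "\n";
echo $count, " matches found\n";
```

Here `$first[0]` holds only the first script block, while `$all[0]` holds both.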

Rowlf 2010-02-07 00:33:22

Answer 2

+2 A:

I think you'll need to add a ? after the * in the script regex so it's not greedy. Greedy regexes match as much as possible (everything between the first opening tag and the last closing tag); non-greedy regexes match as little as possible (only what's between an opening tag and the first closing tag after it). Try:

%(<script type="text/javascript">[\s\S]*?</script>)%

As mentioned, change it to preg_match_all, and you should just match the individual script sections instead of everything between the first and last script tags.
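A small side-by-side sketch of the two quantifiers, using a made-up `$html` string rather than the asker's real page:

```php
<?php
// Hypothetical input with two script blocks and some markup between them.
$html = '<script type="text/javascript">a();</script>'
      . '<p>between</p>'
      . '<script type="text/javascript">b();</script>';

$greedy = '%<script type="text/javascript">[\s\S]*</script>%';
$lazy   = '%<script type="text/javascript">[\s\S]*?</script>%';

preg_match_all($greedy, $html, $g);
preg_match_all($lazy, $html, $l);

// Greedy: a single match running from the first opening tag
// to the last closing tag, including the <p> in between.
echo count($g[0]), "\n";

// Lazy: one match per script block.
echo count($l[0]), "\n";
```

With this input the greedy pattern swallows the whole string as one match, and the lazy pattern returns the two script blocks separately.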

Tim Lytle 2010-02-07 00:38:00

This worked perfectly, thank you. :) Funny how adding a single question mark makes such a massive difference.

Das123 2010-02-07 16:31:42

Answer 3

A:

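$doc = new DOMDocument();
// loadHTML() accepts markup that is not well formed (it may emit warnings for it)
$doc->loadHTML($loaded_result);
$xpath = new DOMXPath($doc);

// select every <script> element inside <head>
$kod = $xpath->query("//head/script");
$i = 0;
foreach ($kod as $node) {
    echo 'im the script nº'.(++$i).' in the head and this is my content: ';
    // serialize the node, tags included
    echo $doc->saveXML($node)."\n";
}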

useless 2010-02-07 01:08:49

Looks like Eineki just stole my answer.

useless 2010-02-07 06:05:11

This code doesn't need well-formed HTML; DOMDocument will correct it the way your browser does.
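A minimal sketch of that behaviour, assuming a deliberately broken, made-up input. `loadHTML()` reports the problems it repairs as PHP warnings, so the sketch switches libxml to internal error handling to keep the output quiet; that step is an addition, not part of the answer's code.

```php
<?php
// Hypothetical malformed input: unclosed <p>, no </body> or </html>.
$loaded_result = '<html><head><title>t</title>'
               . '<script type="text/javascript">a();</script>'
               . '</head><body><p>unclosed';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($loaded_result);

// Discard the collected errors and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
$scripts = [];
foreach ($xpath->query('//head/script') as $node) {
    // saveHTML($node) serializes just this element as HTML.
    $scripts[] = $doc->saveHTML($node);
}

print_r($scripts);
```

Despite the missing closing tags, the query still finds the script element in the head.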

useless 2010-02-07 19:27:53

Dear user267351, may I call you just 351? ;) I'm sorry to hear that you think I borrowed, or stole, your answer. The task is so simple (load the XML, apply an XPath, and process the results) that the solutions are bound to be similar. By the way, I didn't know about the loadHTML method of DOMDocument; I have always resorted to tidy to fix broken HTML documents. Good to have learned it.

Eineki 2010-02-08 10:39:36

Answer 4

A:

A quick and dirty solution: delete the body content right after capturing it, then proceed:

if (preg_match('%<head>([\s\S]*)</head>%', $loaded_result, $regs)) {
   $_header .= $regs[1];
} else {
   $_header .= "<p>No content to display.</p>";
}

then apply the regex just to the header

if (preg_match('%(<script type="text/javascript">[\s\S]*</script>)%', $_header, $regs)) {
   $headDetails .= $regs[0];
}

If the HTML you get from curl is well formed (that is, it parses as XML), you should use SimpleXML to perform your extraction. As its name suggests, it is very simple to use.

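// $loaded_result must be well-formed XML, or parsing fails
$xml = simplexml_load_string($loaded_result);

$body = $xml->body->asXML();

// collect the markup of every <script> element in <head>
$_scripts = '';
$scripts = $xml->xpath('//head/script');
foreach ($scripts as $script) {
  $_scripts .= $script->asXML();
}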

If your HTML is not well formed, then you have to resort to tidy to normalize it (or better, correct the scripts that output the invalid HTML).

Eineki 2010-02-07 01:14:39

I tried the XML approach based on your answer, but I suspect the HTML is not completely valid, so it threw a number of exceptions. It looked like it would have been a nice solution, but the RegEx is doing the job for now. Thanks. :)

Das123 2010-02-07 16:34:47
