I'm experimenting with autoblogging (i.e., RSS-driven blog posting) using WordPress, and all that's missing is a component to automatically fill in the content of the post with the content of the page that the RSS item's URL links to (RSS itself is irrelevant to the solution).

Using standard PHP 5, how could I create a function called fetchHTML([URL]) that returns the HTML content of a webpage that's found between the <body>...</body> tags?

Please let me know if there are any prerequisite "includes". Thanks.

+2  A: 

Assuming that it will always be <body> and not <BODY> or <body style="width:100%"> or anything except <body> and </body>, and with the caveat that you shouldn't use regex to parse HTML, even though I'm about to, here ya go:

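<?php

function fetchHTML( $url )
{
    // Download the page; file_get_contents() returns false on failure.
    $content = file_get_contents( $url );

    if ( $content === false )
    {
        return '';
    }

    // Capture everything between a literal <body> and </body>.
    if ( !preg_match( '/<body>([\s\S]{1,})<\/body>/m', $content, $match ) )
    {
        return ''; // no plain <body>...</body> pair in the page
    }

    return $match[1];

} // fetchHTML
?>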

If you run `echo fetchHTML([some url]);`, you'll get the HTML between the body tags.

Please note original caveats.

hookedonwinter
Very straightforward - good answer. And how would I check for the different ways of writing a <body> tag (as you highlighted above)? Isn't there a regex switch for case insensitivity?
Yaaqov
There is. It's just `i` (right before that `m` at the end of the pattern). But http://stackoverflow.com, for example, won't work, because the opening body tag is `<body class="home-page">`.
hookedonwinter
Got it. Thanks for the pointers.
Yaaqov
Loved the "do as I say, not as I do" caveat. ;)
Alex Zylman
@Alex aka "I have no idea how to do this properly, but I know that. so.. good luck"
hookedonwinter
Since you know regex sucks for this, why not give a DOM parser answer?
Alex JL
DMin
@Alex JL Not aware how to. I'd love to see an answer using that instead though. I realize this isn't the best solution, but it's the only way I know.
hookedonwinter
@DMin then the start of the string would potentially be ` class="whatever">`, which OP might not want.
hookedonwinter
@hookedonwinter okay, I'll post one!
Alex JL
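The two points raised in the comments above (the `i` switch, and opening tags such as `<body class="home-page">`) can be combined into a slightly more tolerant pattern. What follows is only a sketch, not the answerer's code: `extractBodyWithRegex` is a hypothetical helper name, it works on a string that has already been downloaded, and it still carries the original caveat about parsing HTML with regex (for instance, it would trip over an attribute value containing `>`).

```php
<?php

// Hypothetical helper (not from the answer above): extracts the inner HTML of
// the first <body> element from a string using a more tolerant regex.
function extractBodyWithRegex( $content )
{
    // <body\b[^>]*>   matches the opening tag together with any attributes
    // (.*?)           lazily captures up to the first closing tag
    // i = case-insensitive, s = let . match newlines as well
    if ( preg_match( '/<body\b[^>]*>(.*?)<\/body\s*>/is', $content, $match ) )
    {
        return $match[1];
    }

    return ''; // no <body>...</body> pair found
}

echo extractBodyWithRegex( '<html><BODY class="home-page"><p>Hi</p></BODY></html>' );
// prints: <p>Hi</p>
```

Because the attributes are consumed by `[^>]*` before the capture group starts, the returned string does not begin with a stray ` class="whatever">`.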
+2  A: 

I think you're better off using a class like SimpleDom -> http://sourceforge.net/projects/simplehtmldom/ to extract the data, so you don't need to write such complicated regular expressions.

niggles
+2  A: 

Okay, here's a DOM parser code example as requested.

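<?php

function fetchHTML( $url )
  {

  $content = file_get_contents($url);
  if ($content === false || $content === '') { return ''; }

  $html = new DOMDocument();
  // Real-world markup is rarely valid, so @ silences the parser warnings.
  @$html->loadHTML($content);

  // item(0) picks the first <body> directly, with no loop needed.
  $body = $html->getElementsByTagName('body')->item(0);
  if ($body === null) { return ''; }

  // Serialize each child of <body> to get HTML rather than plain text.
  // Passing a node to saveHTML() needs PHP 5.3.6 or later.
  $content = '';
  foreach ($body->childNodes as $child) { $content .= $html->saveHTML($child); }
  return $content;
  }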
Alex JL
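To try the DOM approach without hitting the network, the parsing step can be pulled out so that it runs on a literal string. This is a minimal sketch under stated assumptions: `extractBodyWithDom` is a hypothetical name, the DOM extension is assumed to be enabled, and `libxml_use_internal_errors()` is used here as one way to keep parser warnings from being printed. It also illustrates why a parser copes with the `<BODY>` and `<body class="...">` variants that trouble the regex version.

```php
<?php

// Hypothetical helper (not from the answer above): returns the inner HTML of
// <body> for markup that is already held in a string.
function extractBodyWithDom( $markup )
{
    if ( trim( $markup ) === '' )
    {
        return ''; // loadHTML() does not accept empty input
    }

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors( true );

    $doc = new DOMDocument();
    $doc->loadHTML( $markup );

    libxml_clear_errors();
    libxml_use_internal_errors( $previous );

    $body = $doc->getElementsByTagName( 'body' )->item( 0 );
    if ( $body === null )
    {
        return '';
    }

    // Serialize each child node; saveHTML( $node ) needs PHP 5.3.6 or later.
    $inner = '';
    foreach ( $body->childNodes as $child )
    {
        $inner .= $doc->saveHTML( $child );
    }

    return $inner;
}

// Upper-case tag with an attribute: the parser normalizes both.
echo extractBodyWithDom( '<html><BODY class="home-page"><p>Hi</p></BODY></html>' );
```

Note that the parser re-serializes the markup, so the returned HTML may differ in small ways (whitespace, quoting, entity encoding) from the bytes in the original page.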
never even seen DomDocument() before! I'll have to check it out for sure. It makes me want to just use jQuery for the solution... `$( <?= $content ?> ).find( 'body' ).html();` heh
hookedonwinter
@hookedonwinter ha... that would work, I guess! If you had it open in a browser though, hmm... which reminds me, actually there is something called phpquery http://code.google.com/p/phpquery/ which is pretty cool!
Alex JL
@Alex JL You've now given me enough to learn for the next week. Thanks!
hookedonwinter
http://querypath.org/ is another jQuery-in-PHP implementation.
Scott Reynen