Hi there,

I've been looking around but have yet to find a solution. I'm trying to scrape an HTML document and get the text between two comments, but so far I haven't been able to do this successfully.

I'm using PHP and have tried the PHP Simple HTML DOM Parser, which has been recommended here many times, but I can't seem to get it to do what I want.

Here's (part of) the page that I wish to parse:

<div class="class">
  <!-- blah -->
    text
  <!-- end blah -->

  Text I want

  <!-- blah -->
    text
  <!-- end blah -->
</div>

Thanks

+1  A: 

Maybe you can use regular expressions?

$text = '
<div class="class">
  <!-- blah -->
    text
  <!-- end blah -->

  Text I want

  <!-- blah -->
    text
  <!-- end blah -->
</div>
';

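// Lazily capture the text between an end comment and the next start comment ("s" lets "." span newlines).
$regex = '/(<!-- end blah -->)(.*?)(<!-- blah -->)/ims';
// $match is the number of matches; the captured text is in $matches[2].
$match = preg_match_all($regex, $text, $matches);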
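
The snippet above stops at the match itself, so here is a minimal sketch of reading the wanted text back out of $matches. The names $count and $wanted are only illustrative, and the trim() call is an assumption that the surrounding whitespace is not needed.

```php
<?php
// Same sample markup as in the question.
$text = '
<div class="class">
  <!-- blah -->
    text
  <!-- end blah -->

  Text I want

  <!-- blah -->
    text
  <!-- end blah -->
</div>
';

$regex = '/(<!-- end blah -->)(.*?)(<!-- blah -->)/ims';
$count = preg_match_all($regex, $text, $matches);

// $matches[2] holds what the second group, (.*?), captured for each match.
// $count is 0 when nothing matched and false on error, so guard before indexing.
$wanted = $count > 0 ? trim($matches[2][0]) : null;

echo $wanted, "\n";
```

If the page contains several such gaps, $matches[2] should hold one entry per gap, in document order.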
Deniss Kozlovs
Obligatory "now you have two problems" comment ;)
DisgruntledGoat
"Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins".
Jon Winstanley
A: 

You can use my Pretty Diff tool to minify and then beautify your markup. This will strip out all comments. It is written in JavaScript, not PHP, however. If that is not satisfactory, then analyze my code and rewrite it in PHP. I have already figured out all the problems, so you just have to read what is there. Be sure to read the documentation about markup requirements before using the tool.

http://mailmarkup.org/prettydiff/prettydiff.html

documentation: http://mailmarkup.org/prettydiff/documentation.html

+2  A: 

Assuming that each comment is different (i.e. "blah" is not the same in the first and second sections), you can use a couple of simple strpos calls to grab everything between them. Regular expressions are not necessary.

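// Markers surrounding the wanted text.
$startStr = '<!-- end blah1 -->';
$endStr = '<!-- start blah2 -->';

// Start just after the first marker and stop at the second one.
// Note: strpos returns false if a marker is missing, so check for that in real code.
$startPos = strpos($HTML, $startStr) + strlen($startStr);
$endPos = strpos($HTML, $endStr);

$textYouWant = substr($HTML, $startPos, $endPos - $startPos);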

If the two sets of comments are the same, you'll need to modify this to find the second "blah", using strpos's offset parameter (its third argument) so that the search for the end marker begins after the start marker.
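
For the identical-comments case, a rough sketch of the offset approach could look like the following. The textBetween() helper is hypothetical (it is not part of PHP or of the code above), and it assumes PHP 7.1+ for the nullable return type and that trimming the result is acceptable.

```php
<?php
// Hypothetical helper: returns the text between the first $startStr and the
// next $endStr that follows it, or null if either marker is missing.
function textBetween(string $html, string $startStr, string $endStr): ?string
{
    $startPos = strpos($html, $startStr);
    if ($startPos === false) {
        return null; // start marker not found
    }
    $startPos += strlen($startStr);

    // The third argument (offset) makes strpos begin searching after the
    // start marker, so an earlier identical comment is skipped.
    $endPos = strpos($html, $endStr, $startPos);
    if ($endPos === false) {
        return null; // end marker not found after the start marker
    }

    return trim(substr($html, $startPos, $endPos - $startPos));
}

$html = '<div class="class">
  <!-- blah -->
    text
  <!-- end blah -->

  Text I want

  <!-- blah -->
    text
  <!-- end blah -->
</div>';

$textYouWant = textBetween($html, '<!-- end blah -->', '<!-- blah -->');
echo $textYouWant, "\n";
```

Checking for false explicitly matters here because strpos can legitimately return 0, which a loose comparison would treat as "not found".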

DisgruntledGoat
Thanks, this worked. :)
Pep