I know, I know... regex is not the best way to extract text from HTML. But I need to extract article text from a lot of pages, and I can store a regex in the database for each website. I'm not sure how XML parsers would work with multiple websites; it seems you'd need a separate function for each one.

In any case, I don't know much about regexes, so bear with me.

I've got an HTML page in a format similar to this:

<html>
<head>...</head>
<body>
    <div class=nav>...</div><p id="someshit" />
    <div class=body>....</div>
    <div class=footer>...</div>
</body>

I need to extract the contents of the body class container.

I tried this.

$pattern = "/<div class=\"body\">\(.*?\)<\/div>/sui";
$text = $htmlPageAsIs;
if (preg_match($pattern, $text, $matches))
    echo "MATCHED!";
else
    echo "Sorry gambooka, but your text is in another castle.";

What am I doing wrong? My text ends up in another castle.

EDIT: Ooohh... never mind, I found Readability's code.

A: 

You are matching class="body", but your document has class=body: the quotes are missing. Use "/<div class=\"?body\"?>(.*?)<\/div>/sui", which makes the quotes optional.
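
As a rough illustration of the pattern above, here is a minimal sketch run against a made-up page modeled on the one in the question. The sample markup and the `$articleText` variable are assumptions for the example only. The pattern is written as a single-quoted string so the double quotes need no escaping, and the parentheses are left unescaped so they act as a capture group.

```php
<?php
// Sample page modeled on the question; note the unquoted class attributes.
$html = <<<HTML
<html>
<head><title>t</title></head>
<body>
    <div class=nav>menu</div><p id="x" />
    <div class=body>Article text</div>
    <div class=footer>footer</div>
</body>
</html>
HTML;

// "? makes each quote optional, so class=body and class="body" both match.
// The unescaped (.*?) is a lazy capture group; /s lets . match newlines.
$pattern = '/<div class="?body"?>(.*?)<\/div>/sui';

// $matches[1] holds the captured contents of the div, if there was a match.
$articleText = preg_match($pattern, $html, $matches) ? $matches[1] : null;

echo $articleText, "\n";
```

Because the group is lazy, the match stops at the first closing div tag, so a body container with nested divs would be cut short. That is one of the reasons the parser-based answers are usually recommended.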

Jakub Hampl
And yes: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
Jakub Hampl
That's one thing, but he also forgot to escape the quotes (you as well), since the whole string block is wrapped in the same type of quotes.
treznik
Noooooooo! No more references to that! It's an awesome answer and everything, but the constant references to that question are getting as old as Jon Skeet facts.
karim79
The original HTML does have quotes. I wrote that down as an example. In any case, my bad. I should've been clearer.
gAMBOOKa
Also you probably shouldn't escape the brackets.
Jakub Hampl
@karim79 As long as they keep asking :D
Jakub Hampl
@karim79: Only Jon Skeet is allowed to post links to that question.
Tim Pietzcker
+6  A: 

Use an HTML/XML parser and store a single XPath path per website.
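
A minimal sketch of that idea, assuming PHP's bundled DOM extension is available. The `$xpathBySite` table, the `example.com` key, and the `extractArticleText` helper are hypothetical names for illustration; in practice the XPath strings would come from the database, one row per website.

```php
<?php
// Hypothetical per-site lookup table; these rows would normally live in the database.
$xpathBySite = [
    'example.com' => '//div[@class="body"]',
];

// Hypothetical helper: one generic function serves every site;
// only the XPath string changes from site to site.
function extractArticleText(string $html, string $xpathQuery): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($xpathQuery);
    if ($nodes === false || $nodes->length === 0) {
        return null; // invalid expression or nothing matched
    }
    // textContent drops nested tags and keeps only the text.
    return trim($nodes->item(0)->textContent);
}

// Made-up page modeled on the one in the question.
$html = '<html><head><title>t</title></head><body>'
      . '<div class=nav>menu</div>'
      . '<div class=body>Article <b>text</b></div>'
      . '<div class=footer>footer</div>'
      . '</body></html>';

$articleText = extractArticleText($html, $xpathBySite['example.com']);
echo $articleText, "\n";
```

This addresses the concern in the question about needing a separate function per website: the function stays the same and only the stored expression differs.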

Ignacio Vazquez-Abrams
Also fun: If you know jQuery already, use phpQuery for CSS-like selectors instead of XPath.
christian studer