ansaurus

Question

How to extract the headline and content from a crawled web page / article?

Answer 1


You could try the Simple HTML DOM Parser. It offers jQuery-like selectors for finding specific elements.

They have an example of how to scrape Slashdot:

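// Load the library (path assumed; adjust to where simple_html_dom.php lives)
include_once 'simple_html_dom.php';

// Create DOM from URL
$html = file_get_html('http://slashdot.org/');

// Collect one entry per article block
$articles = array();
foreach ($html->find('div.article') as $article) {
    $item = array();
    // find(selector, 0) returns the first match only
    $item['title']   = $article->find('div.title', 0)->plaintext;
    $item['intro']   = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);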

Pekka 2010-05-08 11:08:07

On second thought, this is most certainly a quadru-tetra-hydro-duplicate. Ah well.

Pekka 2010-05-08 11:09:11

Well, I couldn't find any related articles. I'm not looking for an HTML parser, I'm looking for ways to differentiate headline and text from other garbage.

gAMBOOKa 2010-05-08 12:08:25

`<td><table cellSpacing=0 cellPadding=0 width="100%" border=0><tbody><tr><td align=right width="95%" style="border-color:#3333DD; font-family:Times New Roman, Times, serif; font-weight:bold;color:#003399; font-size:22px; text-align:center; overflow:hidden;"><b>` --- That's a starting tag for a headline on one of our target websites.

gAMBOOKa 2010-05-08 12:10:16

@gAMBOOKa oh my. That is going to be pretty tough, especially since the structure could change daily. In cases like this, I'd recommend talking to the target site to see whether there are better ways of getting the data (e.g. in XML or RSS format).

Pekka 2010-05-08 12:13:24

RSS is unreliable. A lot of the sites don't support RSS at all, and among those that do, many truncate the text.

gAMBOOKa 2010-05-08 12:18:12

@gAMBOOKa RSS is not unreliable - the code that is used to build some RSS feeds is :) Anyway, the point is: Talk to the people running the site. It is next to impossible to scrape anything worthwhile from spaghetti code like that.

Pekka 2010-05-08 12:24:10

Google does it, why can't we?! =D ... in any case, we have over 80 target websites and counting. So talking to them to clean up their code because our code can't understand it is out of the question.

gAMBOOKa 2010-05-08 12:53:25

@gAMBOOKa: Google does it because they can afford to have 50 top-notch people work around the clock on perfect solutions for the issue :D I think you'll still be best off with the DOM parser. You just need to refine the rules to things like "the first paragraph in the first table containing `font-weight: bold` is probably the title, if it is not immediately followed by a `xyz` tag". I don't know of an automatic "important content finder" as such, at least not as an available Open Source solution.

Pekka 2010-05-08 18:44:40
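
A rule like the one described in the last comment might be sketched as follows. This is only an illustration under stated assumptions: it uses PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM, the function name `guessHeadline` is made up, and the rule itself (the first table cell whose inline style contains `font-weight: bold`) is a guess modelled on the headline markup quoted earlier in the thread. Real sites would each need their own tuned rules.

```php
<?php
// Hypothetical helper: guess the headline as the text of the first <td>
// whose inline style declares a bold font weight. The rule is an assumption.
function guessHeadline(string $html): ?string
{
    if (trim($html) === '') {
        return null; // DOMDocument::loadHTML rejects empty input
    }

    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them;
    // scraped markup is rarely well-formed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Strip spaces from the style attribute and lowercase it, so that
    // "Font-Weight: bold" and "font-weight:bold" both match.
    $style = 'translate(translate(@style, " ", ""), '
           . '"ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz")';
    $query = '//td[contains(' . $style . ', "font-weight:bold")]';

    // Nodes come back in document order; take the first non-empty one.
    foreach ($xpath->query($query) as $cell) {
        $text = trim(preg_replace('/\s+/', ' ', $cell->textContent));
        if ($text !== '') {
            return $text;
        }
    }
    return null;
}
```

Calling `guessHeadline($pageSource)` returns the candidate headline, or `null` when no cell matches. With 80+ target sites, one option would be to keep a small list of such rules per site and try them in order, which is a design choice rather than anything the thread prescribes.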
