I need some guidelines on how to detect the headline and content of crawled pages. I've been seeing some very weird front-end code since I started working on this crawler.

+1  A: 

You could try the Simple HTML DOM Parser. It sports a jQuery-like syntax for finding specific elements.

They have an example on how to scrape Slashdot:

// Load the parser (assumes simple_html_dom.php sits next to this script)
include 'simple_html_dom.php';

// Create DOM from URL
$html = file_get_html('http://slashdot.org/');

// Find all article blocks and collect their parts
$articles = array();
foreach ($html->find('div.article') as $article) {
    $item = array();
    $item['title']   = $article->find('div.title', 0)->plaintext;
    $item['intro']   = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);
Pekka
On second thought, this is most certainly a quadru-tetra-hydro-duplicate. Ah well.
Pekka
Well, I couldn't find any related articles. I'm not looking for an HTML parser, I'm looking for ways to differentiate headline and text from other garbage.
gAMBOOKa
`<td><table cellSpacing=0 cellPadding=0 width="100%" border=0><tbody><tr><td align=right width="95%" style="border-color:#3333DD; font-family:Times New Roman, Times, serif; font-weight:bold;color:#003399; font-size:22px; text-align:center; overflow:hidden;"><b>` --- That's a starting tag for a headline on one of our target websites.
gAMBOOKa
@gAMBOO oh my. That is going to be pretty tough, especially seeing as the structure could change daily. In cases like this, I'd recommend talking to the target site and seeing whether there aren't better ways of getting the data (e.g. in XML or RSS format).
Pekka
RSS is unreliable. A lot of the sites don't support RSS at all, and among those that do, many truncate the text.
gAMBOOKa
@gAMBOOK RSS is not unreliable - the code that is used to build some RSS feeds is :) Anyway, the point is: Talk to the people running the site. It is next to impossible to scrape anything worthwhile from spaghetti code like that.
Pekka
Google does it, why can't we?! =D ... in any case, we have over 80 target websites and counting. So talking to them to clean up their code because our code can't understand it is out of the question.
gAMBOOKa
@gAMBOOKa: Google does it because they can afford to have 50 top-notch people work around the clock on perfect solutions for the issue :D I think you'll still be best off with the DOM parser. You just need to refine the rules to things like "the first paragraph in the first table containing `font-weight: bold` is probably the title, if it is not immediately followed by a `xyz` tag". I don't know of an automatic "important content finder" as such, at least not as an available Open Source solution.
Pekka
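
For illustration, a rule like the one described in the last comment could be sketched roughly as follows. This is only a minimal sketch, not a tested solution for any real site: it uses PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM so that it runs without extra files, the function name `guessHeadline` is made up for this example, and the "not immediately followed by a `xyz` tag" exclusion is left out because that tag is only a placeholder. It assumes the headline is the first non-empty element inside a table whose inline `style` contains `font-weight:bold`.

```php
<?php
// Hypothetical helper (not part of any library): guess a page's headline
// with a style-based heuristic.
function guessHeadline(string $html): ?string
{
    // DOMDocument::loadHTML() rejects an empty string, so bail out early.
    if (trim($html) === '') {
        return null;
    }

    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // translate() lowercases the style attribute and drops spaces, so
    // "font-weight: Bold" and "FONT-WEIGHT:bold" both match.
    $query = "//table//*[contains(translate(@style, "
           . "'ABCDEFGHIJKLMNOPQRSTUVWXYZ ', 'abcdefghijklmnopqrstuvwxyz'), "
           . "'font-weight:bold')]";

    // Nodes come back in document order; take the first one with real text.
    foreach ($xpath->query($query) as $node) {
        $text = trim(preg_replace('/\s+/', ' ', $node->textContent));
        if ($text !== '') {
            return $text;
        }
    }

    return null; // no candidate found, the caller should try another rule
}
```

With 80+ target sites, a single rule will not be enough. One possible design is to keep a list of such rules per site (or per layout family), try them in order, and log the pages where every rule returns `null` so they can be reviewed by hand.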