I need some guidelines on how to detect the headline and content of crawled pages. I've been seeing some very weird front-end code since I started working on this crawler.
A:
You could try the Simple HTML DOM Parser. Its syntax for finding specific elements is similar to jQuery's.
They have an example on how to scrape Slashdot:
    // Load the Simple HTML DOM Parser library
    require_once 'simple_html_dom.php';
    // Create DOM from URL
    $html = file_get_html('http://slashdot.org/');
    // Find all article blocks and collect the parts of each one
    $articles = array();
    foreach ($html->find('div.article') as $article) {
        $item = array();
        $item['title']   = $article->find('div.title', 0)->plaintext;
        $item['intro']   = $article->find('div.intro', 0)->plaintext;
        $item['details'] = $article->find('div.details', 0)->plaintext;
        $articles[] = $item;
    }
    // Dump the collected articles
    print_r($articles);
Pekka
2010-05-08 11:08:07
On second thought, this is most certainly a quadru-tetra-hydro-duplicate. Ah well.
Pekka
2010-05-08 11:09:11
Well, I couldn't find any related articles. I'm not looking for an HTML parser, I'm looking for ways to differentiate headline and text from other garbage.
gAMBOOKa
2010-05-08 12:08:25
`<td><table cellSpacing=0 cellPadding=0 width="100%" border=0><tbody><tr><td align=right width="95%" style="border-color:#3333DD; font-family:Times New Roman, Times, serif; font-weight:bold;color:#003399; font-size:22px; text-align:center; overflow:hidden;"><b>` --- That's a starting tag for a headline on one of our target websites.
gAMBOOKa
2010-05-08 12:10:16
@gAMBOOKa oh my. That is going to be pretty tough, especially seeing as the structure could change daily. In cases like this, I'd recommend talking to the target site and seeing whether there are better ways of getting the data (e.g. in XML or RSS format).
Pekka
2010-05-08 12:13:24
RSS is unreliable. A lot of the sites don't support RSS at all, and among those that do, many truncate the text.
gAMBOOKa
2010-05-08 12:18:12
@gAMBOOKa RSS is not unreliable - the code that is used to build some RSS feeds is :) Anyway, the point is: Talk to the people running the site. It is next to impossible to scrape anything worthwhile from spaghetti code like that.
Pekka
2010-05-08 12:24:10
Google does it, why can't we?! =D ... in any case, we have over 80 target websites and counting. So talking to them to clean up their code because our code can't understand it is out of the question.
gAMBOOKa
2010-05-08 12:53:25
@gAMBOOKa: Google does it because they can afford to have 50 top-notch people work around the clock on perfect solutions for the issue :D I think you'll still be best off with the DOM parser. You just need to refine the rules to things like "the first paragraph in the first table containing `font-weight: bold` is probably the title, if it is not immediately followed by an `xyz` tag". I don't know of an automatic "important content finder" as such, at least not as an available open-source solution.
Pekka
2010-05-08 18:44:40
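The kind of rule described in the last comment ("the first bold element in a table is probably the title") could be sketched roughly as below. This is only an illustration under stated assumptions: it uses PHP's built-in `DOMDocument` and `DOMXPath` rather than the Simple HTML DOM Parser so that it runs without extra files, and the function name `guessHeadline`, the score weights, and the 200-character and 18px thresholds are all made up for the example. None of them come from a library or from the target sites.

```php
<?php
// Hypothetical helper: guess a headline from messy, table-based markup.
// The scoring weights and thresholds are illustrative assumptions only.
function guessHeadline(string $html): ?string
{
    if (trim($html) === '') {
        return null;
    }

    $doc = new DOMDocument();
    // Collect parser warnings internally; crawled pages are rarely valid HTML.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $best = null;
    $bestScore = 0;

    // Candidates: heading tags, bold tags, and any element with an inline style.
    $query = '//h1 | //h2 | //h3 | //b | //strong | //*[@style]';
    foreach ($xpath->query($query) as $node) {
        $text = trim(preg_replace('/\s+/', ' ', $node->textContent));
        $length = strlen($text);
        // Headlines tend to be short: skip empty nodes and long body text.
        if ($length === 0 || $length > 200) {
            continue;
        }

        $score = 0;
        $style = strtolower($node->getAttribute('style'));
        if (preg_match('/font-weight\s*:\s*(bold|[6-9]00)/', $style)) {
            $score += 2; // bold inline style
        }
        if (preg_match('/font-size\s*:\s*(\d+)px/', $style, $m) && (int) $m[1] >= 18) {
            $score += 3; // large inline font size
        }
        if (in_array($node->nodeName, array('h1', 'h2', 'h3'), true)) {
            $score += 3; // semantic heading tag
        } elseif (in_array($node->nodeName, array('b', 'strong'), true)) {
            $score += 1; // bold tag without further hints
        }

        // Strictly greater, so the earliest of several equal candidates wins.
        if ($score > $bestScore) {
            $best = $text;
            $bestScore = $score;
        }
    }

    return $best;
}

// Usage with markup shaped like the headline tag quoted in the comments.
$sample = '<table><tr><td style="font-family:Times New Roman, Times, serif; '
        . 'font-weight:bold; font-size:22px;"><b>Example headline</b></td></tr>'
        . '<tr><td style="font-size:12px">Body text of the article.</td></tr></table>';
echo guessHeadline($sample), "\n"; // prints "Example headline"
```

On markup like this, the styled `<td>` scores highest because it is both bold and set in a large font, while the plain `<b>` inside it and the small-font body cell score lower. With 80+ target sites, the weights would most likely need tuning per site, and a similar scoring pass (for example, favouring the longest block of non-link text) would be needed for the article body.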