How would you solve this problem?

You're scraping HTML of blogs. Some of the HTML of a blog is blog posts, some of it is formatting, sidebars, etc. You want to be able to tell what text in the HTML belongs to which post (i.e. a permalink) if any.

I know what you're thinking: You could just look at the RSS and ignore the HTML altogether! However, RSS very often contains only very short excerpts or strips away links that you might be interested in. You want to essentially defeat the excerptedness of the RSS by using the HTML and RSS of the same page together.

An RSS entry looks like:

title
excerpt of post body
permalink

A blog post in HTML looks like:

title (surrounded by permalink, maybe)
...
permalink, maybe
...
post body
...
permalink, maybe

So the HTML page contains the same fields, but the placement of the permalink is not known in advance, and the fields will be separated by noise text that is mostly HTML and whitespace but could also contain additional metadata such as "posted by Johnny" or the date or something like that. The text may also be represented slightly differently in HTML vs. RSS, as described below.

Additional rules/caveats:

  • Titles may not be unique. This happens more often than you might think. Examples I've seen: "Monday roundup", "TGIF", etc.
  • Titles may even be left blank.
  • Excerpts in RSS are also optional, but assume there must be at least either a non-blank excerpt or a non-blank title.
  • The RSS excerpt may contain the full post content, but more likely contains a short excerpt of the start of the post body.
  • Assume that permalinks must be unique and must be the same in both HTML and RSS.
  • The title and the excerpt and post body may be formatted slightly differently in RSS and in HTML. For example:
    • RSS may have HTML inside the title or body stripped; conversely, the HTML page could add more markup (such as wrapping the first letter of the post body in a tag) or format it slightly differently.
    • Text may be encoded slightly differently, such as being UTF-8 in the RSS while non-ASCII characters in the HTML are always encoded as HTML entities ("ampersand encoding"). However, assume that this is English text where non-ASCII characters are rare.
    • There could be badly encoded Windows-1252 horribleness. This happens a lot for symbol characters like curly quotes. However, it is safe to assume that most of the text is ASCII.
    • There could be case-folding in either direction, especially in the title. So, they could all-uppercase the title in the HTML page but not in RSS.
  • The number of entries in the RSS feed and the HTML page is not assumed to be the same. Either could have more or fewer older entries. We can only expect to resolve those posts that appear in both.
  • RSS could be lagged. There may be a new entry in the HTML page that does not appear in the RSS feed yet. This can happen if the RSS is syndicated through Feedburner. Again, we can only expect to resolve those posts that appear in both RSS and HTML.
  • The body of a post can be very short or very long.
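Whatever matching strategy is used, the encoding caveats above suggest normalizing both the RSS and HTML text into a single canonical form before comparing anything. A rough Python sketch (the helper name, the punctuation table, and the exact normalization steps are my own illustrative choices, not from the question):

```python
import html
import re
import unicodedata

# Map both real curly punctuation and the C1 control characters produced by
# mis-decoded Windows-1252 back to plain ASCII. Illustrative, not exhaustive.
PUNCT_FIXES = {
    "\u2018": "'", "\u2019": "'", "\u0091": "'", "\u0092": "'",
    "\u201c": '"', "\u201d": '"', "\u0093": '"', "\u0094": '"',
}

def normalize(text: str) -> str:
    """Reduce RSS or HTML text to one canonical form for comparison."""
    text = html.unescape(text)                  # &amp;, &#8217;, ... -> characters
    for bad, good in PUNCT_FIXES.items():
        text = text.replace(bad, good)
    text = unicodedata.normalize("NFKC", text)  # unify Unicode variants
    text = re.sub(r"<[^>]+>", " ", text)        # drop any remaining tags
    text = re.sub(r"\s+", " ", text)            # collapse whitespace runs
    return text.strip().casefold()              # neutralize case-folding
```

With this, an all-uppercase entity-encoded HTML title and its plain RSS counterpart reduce to the same string, which makes the later title/excerpt matching far more forgiving.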

100% accuracy is not a constraint. However, the more accurate the better.

Well, what would you do?

A: 

RSS is actually quite simple to parse using XPath with any XML parser (or regexes, but that's not recommended): you go through the <item> tags, looking for <title>, <link>, and <description>.
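As an illustration, iterating over the <item> tags with Python's standard library might look like this (a sketch for plain RSS 2.0; Atom and namespaced feeds use different element names):

```python
import xml.etree.ElementTree as ET

def parse_rss(xml_text: str) -> list[dict]:
    """Extract title, link, and description from each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        entries.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return entries
```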

You can then store them as separate fields in a database, or directly merge them into the HTML. In case the <description> is missing, you could scrape the link (one way would be to compare multiple pages to weed out the layout parts of the HTML).
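The "compare multiple pages" idea can be sketched crudely: any chunk of text that repeats verbatim across most pages of the same blog is probably layout, not post content. A hypothetical line-based version (real boilerplate removal would compare DOM subtrees rather than raw lines):

```python
from collections import Counter

def strip_layout(pages: list[list[str]], threshold: float = 0.5) -> list[list[str]]:
    """Drop lines that appear on more than `threshold` of the given pages.

    `pages` is a list of pages, each a list of text lines. Lines shared by
    most pages (headers, sidebars, footers) are assumed to be layout.
    """
    counts = Counter()
    for lines in pages:
        counts.update(set(lines))  # count each distinct line once per page
    cutoff = threshold * len(pages)
    return [[ln for ln in lines if counts[ln] <= cutoff] for lines in pages]
```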

Osama ALASSIRY
+1  A: 

I would create a scraper for each of the major blogging engines. Start with the main text for a single post per page.

If you're lucky then the engine will provide reasonable XHTML, so you can come up with a number of useful XPath expressions to get the node which corresponds to the article. If not, then I'm afraid it's TagSoup or Tidy to coerce it into well-formed XML.
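Once the markup is well-formed XML, those per-engine expressions can be tried in turn; ElementTree supports enough of XPath for this. The container class names below are illustrative guesses in the style of common themes, not real engine specifications:

```python
import xml.etree.ElementTree as ET

# Per-engine selectors; ElementTree supports a useful subset of XPath.
# These class names are illustrative, not taken from actual engine output.
ENGINE_SELECTORS = {
    "wordpress": ".//div[@class='post']",
    "blogger":   ".//div[@class='post-body']",
}

def extract_posts(xhtml: str, engine: str) -> list[str]:
    """Return the flattened text of each post node for a known blog engine."""
    root = ET.fromstring(xhtml)
    return ["".join(node.itertext()).strip()
            for node in root.findall(ENGINE_SELECTORS[engine])]
```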

From there, you can look for the metadata and the full text. This should safely remove the headers/footers/sidebars/widgets/ads, though may leave embedded objects etc.

It should also be fairly easy (TM) to segment the page into article metadata, text, comments, etc., and put it into a fairly sensible RSS/Atom item.

This would be the basis of taking an RSS feed (non-full text) and turning it into a full text one (by following the permalinks given in the official RSS).

Once you have a scraper for a blog engine, you can start looking at writing a detector - something that will be the basis of the "given a page, what blog engine was it published with".
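One cheap signal for such a detector is the <meta name="generator"> tag that many engines emit, with engine-specific URL patterns as a fallback. A sketch with illustrative fingerprints (the patterns below are plausible but not verified against real engine output):

```python
import re

# Ordered (engine, pattern) fingerprints over the raw HTML. A production
# detector would need many more signals, including per-version ones.
FINGERPRINTS = [
    ("wordpress", re.compile(r'<meta name="generator" content="WordPress[^"]*"', re.I)),
    ("blogger",   re.compile(r'<meta name="generator" content="Blogger[^"]*"', re.I)),
    ("wordpress", re.compile(r"/wp-content/", re.I)),  # URL-pattern fallback
]

def detect_engine(page_html: str):
    """Guess which blog engine produced a page, or None if unknown."""
    for engine, pattern in FINGERPRINTS:
        if pattern.search(page_html):
            return engine
    return None
```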

With enough scrapers and detectors, it should be possible to point at a given RSS/Atom feed and convert it into a full-text feed.

However, this approach has a number of issues:

  • while you may be able to target the big five blog engines, there may be some blogs you just have to have that aren't covered by them: e.g. there are 61 engines listed on Wikipedia, and people who write their own blogging engines each need their own scraper.
  • each time a blog engine changes versions, you need to change your detectors and scrapers. More accurately, you need to add a new scraper and detector. The detectors have to become increasingly fussy to distinguish between one version of the same engine and the next (e.g. every time slashcode changes, it usually changes the HTML, but different sites use different versions of slash).

I'm trying to think of a decent fallback, but I'll edit once I have.

jamesh