How would you solve this problem?
You're scraping HTML of blogs. Some of the HTML of a blog is blog posts, some of it is formatting, sidebars, etc. You want to be able to tell what text in the HTML belongs to which post (i.e. a permalink) if any.
I know what you're thinking: You could just look at the RSS and ignore the HTML altogether! However, RSS very often contains only very short excerpts or strips away links that you might be interested in. You want to essentially defeat the excerptedness of the RSS by using the HTML and RSS of the same page together.
An RSS entry looks like:
title ... excerpt of post body ... permalink
A blog post in HTML looks like:
title (surrounded by permalink, maybe) ... permalink, maybe ... post body ... permalink, maybe
So the HTML page contains the same fields, but the placement of the permalink is not known in advance. The fields are separated by noise text that is mostly HTML and whitespace, but could also contain additional metadata such as "posted by Johnny" or the date. The text may also be represented slightly differently in HTML vs. RSS, as described below.
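To make the setup concrete, here is a minimal sketch of one possible first step: walking the HTML in document order and noting where each permalink already known from the RSS shows up as a link. It is not a solution to the attribution problem itself. The names `PermalinkLocator` and `locate_permalinks` are hypothetical, and the sketch assumes permalinks appear as exact `href` values on `<a>` tags.

```python
from html.parser import HTMLParser


class PermalinkLocator(HTMLParser):
    """Collects text chunks in document order and notes where known permalinks occur."""

    def __init__(self, permalinks):
        super().__init__(convert_charrefs=True)
        self.permalinks = set(permalinks)  # permalinks taken from the RSS feed
        self.text_chunks = []              # non-blank text runs, in document order
        self.hits = []                     # (index into text_chunks, permalink)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href in self.permalinks:
                # The link sits just before the chunk with this index.
                self.hits.append((len(self.text_chunks), href))

    def handle_data(self, data):
        # Note: this also picks up <script>/<style> contents; a real
        # scraper would want to skip those.
        stripped = data.strip()
        if stripped:
            self.text_chunks.append(stripped)


def locate_permalinks(html_text, permalinks):
    """Return (text_chunks, hits) for one HTML page."""
    parser = PermalinkLocator(permalinks)
    parser.feed(html_text)
    parser.close()
    return parser.text_chunks, parser.hits
```

The hits only say where a permalink occurs relative to the surrounding text. Deciding which chunks before or after a hit belong to that post is the open part of the question.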
Additional rules/caveats:
- Titles may not be unique. This happens more often than you might think. Examples I've seen: "Monday roundup", "TGIF", etc.
- Titles may even be left blank.
- Excerpts in RSS are also optional, but assume that at least one of the excerpt and the title is non-blank.
- The RSS excerpt may contain the full post content, but more likely contains only a short excerpt from the start of the post body.
- Assume that permalinks must be unique and must be the same in both HTML and RSS.
- The title, the excerpt and the post body may be formatted slightly differently in RSS and in HTML. For example:
  - RSS may have the HTML inside the title or body stripped, or the HTML page may add more markup (such as wrapping the first letter of the post body in something) or format the text slightly differently.
  - Text may be encoded slightly differently, such as being UTF-8 in RSS while non-ASCII characters in the HTML are always encoded as character entities (ampersand encoding). However, assume that this is English text where non-ASCII characters are rare.
  - There could be badly encoded Windows-1252 horribleness. This happens a lot for symbol characters like curly quotes. However, it is safe to assume that most of the text is ASCII.
  - There could be case-folding in either direction, especially in the title. So, the title could be all-uppercase on the HTML page but not in RSS.
- The number of entries in the RSS feed and the HTML page is not assumed to be the same. Either could have more or fewer older entries. We can only expect to resolve those posts that appear in both.
- RSS could be lagged. There may be a new entry in the HTML page that does not appear in the RSS feed yet. This can happen if the RSS is syndicated through Feedburner. Again, we can only expect to resolve those posts that appear in both RSS and HTML.
- The body of a post can be very short or very long.
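The formatting caveats in this list suggest comparing text only after reducing both sides to some common form. A minimal sketch of one such reduction, assuming it is acceptable to throw away everything except letters and digits when comparing; `fix_mojibake`, `normalize_key` and `excerpt_matches` are hypothetical helper names, and this is only one of many possible normalizations:

```python
import html
import re
import unicodedata

_TAG_RE = re.compile(r"<[^>]+>")


def fix_mojibake(text):
    """Undo UTF-8 that was mis-decoded as Windows-1252, when the round trip works."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not mojibake (or not repairable this way): leave as-is


def normalize_key(text):
    """Reduce text to a comparison key: lowercase letters and digits only."""
    text = _TAG_RE.sub("", text)           # drop markup, e.g. a drop-cap <span>
    text = html.unescape(text)             # &rsquo; / &#8217; -> real characters
    text = fix_mojibake(text)
    text = unicodedata.normalize("NFKC", text)
    text = text.casefold()                 # handles all-uppercase titles
    # Dropping whitespace and punctuation sidesteps curly vs. straight quotes,
    # trailing "..." on excerpts, and differences in spacing.
    return "".join(ch for ch in text if ch.isalnum())


def excerpt_matches(rss_excerpt, html_body):
    """True if the RSS excerpt looks like the start of the HTML post body."""
    key = normalize_key(rss_excerpt)
    return bool(key) and normalize_key(html_body).startswith(key)
```

Because titles may be blank or duplicated, a key like this would likely need to be combined with the permalink and the excerpt rather than used on titles alone. Very short excerpts will also match more than one post.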
100% accuracy is not a constraint. However, the more accurate the better.
Well, what would you do?