I'm not talking about HTML tags, but tags used to describe blog posts, or youtube videos or questions on this site.
If I was crawling just a single website, I'd just use an xpath to extract the tag out, or even a regex if it's simple. But I'd like to be able to throw any web page at my extract_tags() function and get the tags listed.
I can imagine using some simple heuristics, like finding all HTML elements with id or class of 'tag', etc. However, this is pretty brittle and will probably fail for a huge number of web pages. What approach do you guys recommend for this problem?
Also, I'm aware of Zemanta and Open Calais, which both have ways to guess the tags for a piece of text, but that's not really the same as extracting tags real humans have already chosen. But I would still love to hear about any other services/APIs to guess the tags in a document.
EDIT: Just to be clear, a solution that already works for this would be great. But I'm guessing there's no open-source software that already does this, so I really just want to hear from people about possible approaches that could work for most cases. It need not be perfect.
EDIT2: For people suggesting a general solution that usually works is impossible, and that I must write custom scrapers for each website/engine, consider the arc90 readability tool. This tool is able to extract the article text for any given article on the web with surprising accuracy, using some sort of heuristic algorithm I believe. I have yet to dig into their approach, but it fits into a bookmarklet and does not seem too involved. I understand that extracting an article is probably simpler than extracting tags, but it should serve as an example of what's possible.