views: 169, answers: 8

I'm not talking about HTML tags, but tags used to describe blog posts, YouTube videos, or questions on this site.

If I were crawling just a single website, I'd just use an XPath expression to extract the tags, or even a regex if the markup is simple. But I'd like to be able to throw any web page at my extract_tags() function and get the tags listed.

I can imagine using some simple heuristics, like finding all HTML elements with id or class of 'tag', etc. However, this is pretty brittle and will probably fail for a huge number of web pages. What approach do you guys recommend for this problem?
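
For illustration, here's a minimal sketch of that kind of heuristic using BeautifulSoup; the "id or class mentions 'tag'" rule is just an assumption about common markup, not a tested solution:

    # Rough sketch of the brittle heuristic described above: collect the text
    # of elements whose id or class attribute mentions "tag" (needs beautifulsoup4).
    from bs4 import BeautifulSoup

    def extract_tags(html):
        soup = BeautifulSoup(html, "html.parser")
        tags = set()
        for el in soup.find_all(True):
            attrs = " ".join([el.get("id") or ""] + (el.get("class") or []))
            if "tag" not in attrs.lower():
                continue
            # The element may itself be the tag link, or a container of tag links.
            links = el.find_all("a") or [el]
            for a in links:
                text = a.get_text(strip=True)
                if text:
                    tags.add(text)
        return sorted(tags)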

Also, I'm aware of Zemanta and Open Calais, which both have ways to guess the tags for a piece of text, but that's not really the same as extracting tags real humans have already chosen. But I would still love to hear about any other services/APIs to guess the tags in a document.

EDIT: Just to be clear, a solution that already works for this would be great. But I'm guessing there's no open-source software that already does this, so I really just want to hear from people about possible approaches that could work for most cases. It need not be perfect.

EDIT2: For people suggesting that a general solution that usually works is impossible, and that I must write custom scrapers for each website/engine, consider the arc90 Readability tool. This tool is able to extract the article text from any given article on the web with surprising accuracy, using some sort of heuristic algorithm I believe. I have yet to dig into their approach, but it fits into a bookmarklet and does not seem too involved. I understand that extracting an article is probably simpler than extracting tags, but it should serve as an example of what's possible.

A: 

Damn, I was just going to suggest Open Calais. There's going to be no "great" way to do this. If you have some target platforms in mind, you could sniff for WordPress and follow its link structure, then do the same for Flickr, and so on...
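
Just as a rough sketch of what that sniffing might look like (the markers below, like the generator meta tag and /wp-content/ paths, are common but by no means guaranteed):

    # Hypothetical platform sniffing: look for markers each engine commonly leaves behind.
    from bs4 import BeautifulSoup

    def detect_platform(html):
        soup = BeautifulSoup(html, "html.parser")
        generator = soup.find("meta", attrs={"name": "generator"})
        if generator and "wordpress" in (generator.get("content") or "").lower():
            return "wordpress"
        if "wp-content" in html:
            return "wordpress"
        if "tumblr.com" in html:
            return "tumblr"
        return "unknown"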

Alex Mcp
Yeah, but this is unlikely to cover even half of the sites I want to crawl. I can't write something for every possible structure :/
ehsanul
+1  A: 

If the sources expose their data as a feed (RSS/Atom) then you may be able to get the tags (or labels/categories/topics etc.) from this structured data.
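
For example, with the feedparser library the categories show up on each entry (the feed URL below is just a placeholder):

    # Minimal sketch: pull tags/categories from an RSS/Atom feed with feedparser.
    import feedparser

    feed = feedparser.parse("http://example.com/blog/feed")  # placeholder URL
    for entry in feed.entries:
        # feedparser exposes <category> elements as entry.tags, each with a .term
        tags = [t.term for t in entry.get("tags", [])]
        print(entry.get("title", ""), tags)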

Another option is to parse each web page and look for tags formatted according to the rel="tag" microformat.
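
A minimal sketch of checking for rel="tag" links with BeautifulSoup, assuming the page actually uses the microformat:

    # Sketch: collect the anchor text of links marked up with rel="tag".
    from bs4 import BeautifulSoup

    def rel_tag_links(html):
        soup = BeautifulSoup(html, "html.parser")
        tags = []
        for a in soup.find_all("a"):
            rel = a.get("rel") or []  # BeautifulSoup returns rel as a list of tokens
            if "tag" in rel:
                tags.append(a.get_text(strip=True))
        return tags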

Kwebble
Thanks, didn't know about the tag microformat. It doesn't seem like too many use it though, which is a shame.
ehsanul
A: 

If you find a generic solution, let us know. I have tested many tools (KapowTech, iMacros, etc.), and each requires that you customize your "script" for each website you need to work with.

SamMeiers
A: 

I think your only option is to write custom scripts for each site. To make things easier, though, you could look at AlchemyAPI. They have similar entity extraction capabilities to OpenCalais, but they also have a "Structured Content Scraping" product that makes this a lot easier than writing XPaths by hand, using simple visual constraints to identify pieces of a web page.

dunelmtech
A: 

How big is the domain?

Sudhanshu Arya
The entire web. Please read the question properly.
ehsanul
Whoa man, why the downvote?
Sudhanshu Arya
A: 

This is impossible because there isn't a well-known, widely followed specification. Even different versions of the same engine can produce different output - hey, using WordPress a user can create his own markup.

If you're really interested in doing something like this, you should know it's going to be a really time-consuming, ongoing project: you're going to create a library that detects which "engine" is being used on a page and parses it accordingly. If you can't detect a page for some reason, you add new rules to parse it and move on.

I know this isn't the answer you're looking for, but I really can't see another option. I'm into Python, so I would use Scrapy for this since it's a complete framework for scraping: well documented and really extensible.
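
To make the idea concrete, here's a rough sketch of how the "detect the engine, then apply its rules" approach might look in Scrapy; the selectors and detection markers are assumptions about typical WordPress/Tumblr markup, not verified rules:

    # Sketch of engine detection plus a per-engine rule table in a Scrapy spider.
    import scrapy

    ENGINE_RULES = {
        # CSS selectors are assumptions about typical markup for each engine.
        "wordpress": "a[rel~=tag]::text",
        "tumblr": "a[href*='/tagged/']::text",
    }

    class TagSpider(scrapy.Spider):
        name = "tags"
        start_urls = ["http://example.com/some-post"]  # placeholder

        def parse(self, response):
            engine = self.detect_engine(response)
            selector = ENGINE_RULES.get(engine)
            tags = response.css(selector).getall() if selector else []
            yield {"url": response.url, "engine": engine, "tags": tags}

        def detect_engine(self, response):
            body = response.text.lower()
            if "wp-content" in body:
                return "wordpress"
            if "tumblr.com" in body:
                return "tumblr"
            return "unknown"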

GmonC
A: 

Try making a Yahoo Pipe and running the source pages through the Term Extractor module. It may or may not give great results, but it's worth a try. Note - enable the V2 engine.

Reinderien
This is equivalent to using Zemanta or Open Calais or Alchemy API or your own tokenizer and tf-idf values to find keywords. The result quality is an important issue for me.
ehsanul
I'm giving it a try, and I'm wondering how to enable the V2 engine. All I see is this (replacing V1 with V2 in the URL just redirects back to V1): http://developer.yahoo.com/search/content/V1/termExtraction.html
ehsanul
Save the pipe, and then go to the page where the results are displayed. The link to enable V2 should be on the left.
Reinderien
A: 

Systems like the arc90 example you give work by looking at things like tag/text ratios and other heuristics. There is sufficient difference between the text content of the pages and the surrounding ads/menus etc. Other examples include tools that scrape emails or addresses; there, patterns can be detected and locations can be recognized. In the case of tags, though, you don't have much to help you uniquely distinguish a tag from normal text: it's just a word or phrase like any other piece of text. A list of tags in a sidebar is very hard to distinguish from a navigation menu.

Some blogs, like Tumblr, do have tag URLs with the word "tagged" in them that you could use. WordPress similarly has ".../tag/..." style URLs for tags. Solutions like this would work for a large number of blogs independent of their individual page layout, but they won't work everywhere.
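
A small sketch of that URL-pattern idea; the patterns below only cover the Tumblr/WordPress conventions mentioned above and will miss other layouts:

    # Sketch: treat links whose URLs match known tag-page patterns as tags.
    import re
    from bs4 import BeautifulSoup

    TAG_URL_PATTERNS = [
        re.compile(r"/tagged/([^/?#]+)"),  # Tumblr-style tag pages
        re.compile(r"/tag/([^/?#]+)"),     # WordPress-style tag pages
    ]

    def tags_from_links(html):
        soup = BeautifulSoup(html, "html.parser")
        tags = set()
        for a in soup.find_all("a", href=True):
            for pattern in TAG_URL_PATTERNS:
                match = pattern.search(a["href"])
                if match:
                    tags.add(a.get_text(strip=True) or match.group(1))
        return sorted(tags)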

dunelmtech