I have a list of X sites that I need to classify in some way. Is the site about cars, health, or products, or is it about everything (wikihow, about.com, etc.)? What are some of the better ways to classify sites like this? Should I get the keywords that bring traffic to the site and use those? Should I read the content of some random pages and judge from that?

+1  A: 

Well, if the site is well designed, there will be meta tags in the header specifically for this.
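As a rough sketch of what reading those tags could look like, here is a small parser built on Python's standard `html.parser` module. It assumes the tags of interest are `<meta name="keywords">` and `<meta name="description">`; the names `MetaTagParser` and `extract_meta` are made up for this example, and fetching the page is left out.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects the content of the keywords and description meta tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = attr.get("content")
        if name in ("keywords", "description") and content:
            self.meta[name] = content.strip()


def extract_meta(html):
    """Return a dict with whichever of 'keywords' / 'description' the page declares."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta
```

A site that declares neither tag simply yields an empty dict, so this can only ever be one signal among several.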

EBGreen
<0.1% of the sites on the Internet are "well designed" ;-)
Ryan Detzel
A: 

You aim big - Google and every other search engine have been working on the same problem for the last couple of decades.

Olexiy
+1  A: 

Yahoo has an API to extract terms: http://developer.yahoo.com/search/content/V2/termExtraction.html

"The Term Extraction Web Service provides a list of significant words or phrases extracted from a larger content. It is one of the technologies used in Y!Q."

THC4k
A: 

This is a tough question to answer. Consider:

  • How granular do you want your classification to be?
  • Do you want to classify sites based on your own criteria or the criteria provided by the sites? In other words, if a site classifies itself as "a premier source for motorcycle maintenance", do you want to create a "motorcycle maintenance" category just for that site? This, of course, will cause your list to become inconsistent. However, if you pigeon-hole the sites to follow your own classification scheme, there is a loss of information, and a risk that the site will not match any of the categories you've defined.
  • Do you allow subcategories? The problem becomes much more complicated if so.
  • Can a site belong to more than one category? If so, is there an ordering or a weight (i.e. Primary Category, Secondary Categories, etc.), or do you follow a scheme similar to SO's tags?

As an initial stab at the problem, I think I'd define a set of categories, and then spider each site, keeping track of the number of occurrences of each category name, or a mutation thereof. Then, you can choose the name that had the greatest number of "hits."

For instance, given the following categories:

{ "Cars", "Motorcycles", "Video Games" }

Spidering the following blocks of text from a site:

The title is an incongruous play on the title of the book Zen in the Art of Archery by Eugen Herrigel. In its introduction, Pirsig explains that, despite its title, "it should in no way be associated with that great body of factual information relating to orthodox Zen Buddhist practice. It's not very factual on motorcycles, either."

and:

Most motorcycles made since 1980 are pretty reliable if properly maintained but that's a big if. To some extent the high reliability of today's motorcycles has worked to the disadvantage of many riders. Some riders have been lulled into believing that motorcycles are like modern cars and require essentially no maintenance. This is not the case (even with cars). Modern bikes require less maintenance than they did in the 60's and 70's but they still need a lot more maintenance than a car. This higher reliability also means that there are a whole bunch of motorcyclists out there who haven't a clue how to work on their bikes or what really needs to be done to ensure reliability.

We get the following scores:

{ "Cars" : 3, "Motorcycles" : 4, "Video Games" : 0 }

And we can thus categorize the site as being related mostly to "Motorcycles".

Note that I said "mutations thereof" with regard to category names, so "motorcycle" and "car" are both detected. We can see from this that you should also perhaps consider using a list of related words. For instance, perhaps we should detect the word "motorcyclists" when searching for instances of "Motorcycles". Perhaps we should've seen "modern bikes", too.
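The counting scheme above could be sketched roughly like this in Python. It treats a "mutation" as nothing more than the singular or plural form of the category name, which is an assumption made for the example; related words such as "motorcyclists" or "bikes" would need an explicit synonym list. The function names are invented for illustration.

```python
import re


def category_patterns(categories):
    """Build one regex per category, matching the name with or without a trailing 's'."""
    patterns = {}
    for name in categories:
        stem = name.lower()
        if stem.endswith("s"):
            stem = stem[:-1]
        # \s+ lets multi-word names such as "Video Games" span any whitespace.
        body = r"\s+".join(re.escape(part) for part in stem.split())
        # \b keeps "car" from matching inside words like "care" or "scar".
        patterns[name] = re.compile(r"\b" + body + r"s?\b", re.IGNORECASE)
    return patterns


def score_text(text, categories):
    """Count how many times each category name occurs in the text."""
    patterns = category_patterns(categories)
    return {name: len(pattern.findall(text)) for name, pattern in patterns.items()}


def best_category(scores):
    """Return the category with the most hits, or None if nothing matched."""
    name, hits = max(scores.items(), key=lambda item: item[1])
    return name if hits > 0 else None
```

Calling `score_text(page_text, ["Cars", "Motorcycles", "Video Games"])` on each spidered page and summing the results per site gives a score table of the kind shown earlier.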

You could also save those hits, perhaps combine them with some other data, and use Bayesian probability to determine which category the site is most likely to fit into.
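One common way to apply Bayesian probability to text is a naive Bayes classifier, so here is a minimal sketch of that, using only the standard library. It is my choice of technique for the example, not the only reading of the suggestion above. It assumes you have some hand-labelled sample text per category to train on; the class name and tokenizer are invented for illustration, and add-one smoothing is used so unseen words don't zero out a category.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of training texts
        self.vocab = set()

    def train(self, category, text):
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocab.update(words)

    def log_scores(self, text):
        """Return log P(category) + sum of log P(word | category) for each category."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for category in self.doc_counts:
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[category] / total_docs)
            for word in words:
                # Add-one (Laplace) smoothing over the training vocabulary.
                score += math.log((counts[word] + 1) / (total_words + len(self.vocab)))
            scores[category] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)
```

You would call `train` once per labelled sample, then `classify` on the text spidered from each site. The quality depends almost entirely on how representative the training text is.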

JoshJordan
+1  A: 

Maybe I'm a bit biased (disclaimer: I have a degree in library science, and this topic is one of the reasons I got the degree), but the easiest answer is that there is no best way.

Think of this the way you would database design: once you have your system populated, what sort of questions are you going to ask of it?

Is the fact that the site is run by the government significant? Or that it uses Flash? Or that the pages are blue? Or that it's a hobbyist site? Or that the intended audience is children?

Then there's the question of whether to use a hierarchy of categories for any of the facets we're concerned with: if a site is about both cars and motorcycles, should we use the term 'vehicles' instead? And if we do that, will we use keyword expansion so that 'motorcycle' matches the broader terms (i.e. 'vehicles') as well?
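To make the keyword expansion idea concrete, here is a small sketch in Python. The `BROADER` table is a hypothetical hand-built hierarchy invented for the example; a real one would come out of whatever category scheme you settle on.

```python
# Hypothetical hierarchy: each term maps to its broader term (None at the top).
BROADER = {
    "motorcycles": "vehicles",
    "cars": "vehicles",
    "vehicles": None,
}


def expand(term, broader=BROADER):
    """Return the term followed by every broader term above it."""
    chain = []
    seen = set()
    while term is not None and term not in seen:  # 'seen' guards against cycles
        chain.append(term)
        seen.add(term)
        term = broader.get(term)
    return chain


def matches(site_terms, query, broader=BROADER):
    """True if any of the site's terms is the query term or narrower than it."""
    return any(query in expand(term, broader) for term in site_terms)
```

With this, a site tagged only with 'motorcycles' is still found by a search for 'vehicles', while a search for 'cars' does not match it.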

So ... the point is ... figure out what your needs are, and work towards that. 'Best' will never come, even with years of refinement (if anything, it gets more difficult, as terms start changing meanings. Remember when 'weblog' was related to web server metrics?)

Joe