I'm looking to categorize a large number of websites (millions). I can use Nutch to crawl them and get the content of each site, but I'm looking for the best (and cheapest, or free) tool to categorize them.
One option is to write regular expressions that look for certain keywords and categorize the sites that way, but there are also high-end LSI-type tools like Autonomy. Are there any open-source or cheaper tools that will take the text from a webpage/site and categorize it for me? I need some customization in the types of categories used.

As part of the categorization, I'd like to be able to recognize "fake" sites that are really just parked pages, or domainers putting ads on the pages, as well as plain old categories, like: is this news, sports, science, health, food, entertainment, etc.?
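For reference, the regex approach I have in mind is roughly the sketch below (the category names and keyword lists are just made-up examples, not a real taxonomy). I'm asking because I suspect there's something smarter than this:

```python
import re

# Hypothetical keyword lists per category -- real lists would need a lot of tuning.
CATEGORY_KEYWORDS = {
    "news": ["breaking news", "headline", "correspondent"],
    "sports": ["final score", "league standings", "championship"],
    "parked": ["this domain is for sale", "buy this domain", "domain parking"],
}

# Precompile one case-insensitive alternation pattern per category.
CATEGORY_PATTERNS = {
    category: re.compile("|".join(re.escape(kw) for kw in keywords), re.IGNORECASE)
    for category, keywords in CATEGORY_KEYWORDS.items()
}

def categorize(page_text):
    """Return every category whose keyword pattern matches the page text."""
    return [category for category, pattern in CATEGORY_PATTERNS.items()
            if pattern.search(page_text)]

print(categorize("This domain is for sale! Contact the owner."))  # ['parked']
```

This works for obvious cases like parked pages, but maintaining keyword lists for millions of sites across dozens of categories seems brittle, which is why I'm hoping for a trainable or customizable tool instead.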