I'm crawling data from the internet, and it isn't classified.
Is there a library you can recommend for this?
EDIT
I'm crawling jobs from other websites, and I need to group them into different industries.
My current employer developed a system to categorize web pages. We could not find any useful libraries, so we had to write our own. We do not license ours out.
I can give you some hints. Spam analyzers classify email into Junk or Not Junk. You can use the same tools, such as Bayesian filters, CRM-114, etc., to do your own classification of any text, including web pages.
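To make the spam-filter idea concrete, here is a minimal sketch of a multinomial naive Bayes classifier in plain Python. It is not the system described above and not CRM-114. The class name, the industry labels, and the training sentences are all made up for illustration, and a real classifier would need far more training data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training docs
        self.vocab = set()

    def train(self, text, label):
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in words:
                count = self.word_counts[label][w]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training examples, two per industry.
nb = NaiveBayes()
nb.train("java developer backend software engineer", "IT")
nb.train("python programmer web software", "IT")
nb.train("registered nurse hospital patient care", "Healthcare")
nb.train("physician clinic patient medical", "Healthcare")
label = nb.classify("senior software engineer java")
```

The same structure works for Junk / Not Junk or for any number of industries; only the labels passed to `train` change.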
You will have to watch the results very carefully and give the classifier a lot of human feedback. You can often find keyword sets that score very well for you. Finding those keyword sets takes time and effort, and the sets will change somewhat over time.
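As a rough illustration of keyword-set scoring, the sketch below rates a piece of text by the fraction of its words that fall in each category's keyword set. The `KEYWORDS` dictionary and the `keyword_scores` function are hypothetical. Real keyword sets would come out of the manual review described above and would need periodic updating.

```python
import re

# Hypothetical keyword sets; real ones come from reviewing classifier results.
KEYWORDS = {
    "IT": {"software", "developer", "java", "python", "server"},
    "Healthcare": {"nurse", "patient", "clinic", "hospital", "medical"},
}


def keyword_scores(text, keyword_sets=KEYWORDS):
    """Return, per category, the fraction of tokens that hit its keyword set."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return {name: 0.0 for name in keyword_sets}
    return {
        name: sum(1 for t in tokens if t in words) / len(tokens)
        for name, words in keyword_sets.items()
    }


scores = keyword_scores("Hospital seeks registered nurse for patient care")
best = max(scores, key=scores.get)
```

Logging the scores next to the human-assigned category is one simple way to spot keyword sets that have stopped working.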
You will have to write code to divide web pages into topic sections because most pages are not all one thing. There are ad frames, navigation and other things.
To sort unlabelled data into groups, you want clustering, not classification. One of the most complete machine learning libraries is the Java-based Weka. You'll probably want to start by extracting the text from the web pages (remove script and style elements completely, strip the other tags), and then run the text through Weka's StringToWordVector filter before clustering.
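The preprocessing steps can be sketched with the Python standard library alone. This is not Weka code: `TextExtractor`, `html_to_text`, and `word_vector` are illustrative names, and `word_vector` is only a rough stand-in for what StringToWordVector produces (plain word counts, with no stop-word removal or weighting).

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping everything inside <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags and collapse whitespace, dropping script and style content."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def word_vector(text):
    """Map each lowercase word to the number of times it occurs."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


page = ("<html><head><style>p {color: red}</style>"
        "<script>var x = 1;</script></head>"
        "<body><p>Java <b>developer</b> wanted</p></body></html>")
text = html_to_text(page)
vector = word_vector(text)
```

One word-count vector per page is the kind of input a clustering algorithm expects. Note that joining text fragments with spaces will split a word that has an inline tag in the middle of it, which is usually acceptable for this purpose.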