I'm working on a process that will perform natural language processing (NLP) on one, and potentially several, of our content-rich sites. Once the NLP is complete, I'd like to automatically organize the output (generally a set of terms that you might think of as tags, given the prevalence of that metaphor) into some kind of standard or generally accepted organizational structure.
In a perfect world, I'd really like this to be crowd sourced under the folksonomy concept (as opposed to a taxonomy) since the ultimate goal is to target/appeal to real people rather than "domain experts", but I'm open to ideas and best practices. For the obvious purpose of scalability, I'd like to automate the population of this tax/folksonomy so that "some guy" in the team/organization isn't responsible for looking at a bunch of words (with or without context) and arbitrarily fleshing out the contextual components of the tree.
I have a few ideas for doing this that require some research to establish viability, but I have exactly zero practical experience with this sort of thing so the ideas really just boil down to stuff I made up that might perform some role in accomplishing the task. Imagining that others have vastly more experience with this sort of thing, I'm hoping that I can stand on your shoulders.
Thanks for your thoughts and insights.
Practical Example
I ran the NLP against an article on my own blog. The NLP returned the following terms with a sufficient level of relevance:
- Rob Wilkerson
- change
- Git
Now I want to put those terms into a tax/folksonomy without human intervention. In this case, "Git" and "Rob Wilkerson" are terms that could be classified; the process has (or will have) an additional stipulation that removes terms from the structure if they don't generate enough traction to be worth tracking. On the other hand, "change" is probably too vague/ambiguous to be worth the trouble.
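To make the filtering and pruning steps concrete, here is a rough sketch of what I have in mind. Everything in it is made up for illustration: the `Folksonomy` class, the relevance scores, the two thresholds, and the idea of measuring "traction" as the number of documents a term appears in are all assumptions, not something I've settled on. It also says nothing about how the surviving terms would be arranged into a tree, which is the part I'm really asking about.

```python
from dataclasses import dataclass, field

# Hypothetical thresholds; none of these values come from a real system.
MIN_RELEVANCE = 0.5  # assumed cutoff for a "sufficient" relevance score
MIN_TRACTION = 2     # assumed minimum number of documents a term must appear in


@dataclass
class Folksonomy:
    # Maps each tracked term to the set of document ids it was extracted from.
    term_docs: dict[str, set[str]] = field(default_factory=dict)

    def add_terms(self, doc_id: str, scored_terms: dict[str, float]) -> None:
        """Record every term whose relevance meets the cutoff."""
        for term, relevance in scored_terms.items():
            if relevance >= MIN_RELEVANCE:
                self.term_docs.setdefault(term, set()).add(doc_id)

    def prune(self) -> list[str]:
        """Remove terms without enough traction; return them sorted."""
        dropped = sorted(
            term for term, docs in self.term_docs.items()
            if len(docs) < MIN_TRACTION
        )
        for term in dropped:
            del self.term_docs[term]
        return dropped


# Invented scores: "change" falls below the relevance cutoff and is never tracked.
folksonomy = Folksonomy()
folksonomy.add_terms("post-1", {"Rob Wilkerson": 0.9, "change": 0.3, "Git": 0.8})
folksonomy.add_terms("post-2", {"Git": 0.7})

# Terms seen in fewer than MIN_TRACTION documents are removed here.
dropped = folksonomy.prune()
```

With these invented inputs, "change" is filtered out on relevance alone, and pruning then removes any term that showed up in only one document. In practice the pruning would presumably run on a schedule rather than immediately, so that new terms get a chance to gain traction.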