I want to analyze answers to a web survey (Git User's Survey 2008, if anyone is interested). Some of the questions were free-form questions, like "How did you hear about Git?". With more than 3,000 replies, analyzing them entirely by hand is out of the question (especially since there are quite a few free-form questions in this survey).

How can I group those replies (probably based on the keywords used in the response) into categories at least semi-automatically (i.e. the program can ask for confirmation), and later, how do I tabularize (count the number of entries in each category) those free-form replies (answers)? One answer can belong to more than one category, although for simplicity one can assume that the categories are orthogonal / exclusive.

What I'd like to know is at least a keyword to search for, or an algorithm (a method) to use. I would prefer solutions in Perl (or C).


Possible solution No. 1 (partial): Bayesian categorization

(added 2009-05-21)

One solution I thought about would be to use something like the algorithm (and the mathematical method behind it) used for Bayesian spam filtering, only instead of one or two categories ("spam" and "ham") there would be more; and the categories themselves would be created adaptively / interactively.
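A minimal sketch of that idea, assuming Algorithm::NaiveBayes from CPAN as the multi-category classifier; the category names, training answers and word-splitting below are made up for illustration:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Algorithm::NaiveBayes;

    # Turn one free-form answer into a bag-of-words attribute hash.
    sub features {
        my ($text) = @_;
        my %count;
        $count{ lc $1 }++ while $text =~ /(\w+)/g;
        return \%count;
    }

    my $nb = Algorithm::NaiveBayes->new;

    # Seed the classifier with a few manually categorized answers
    # (categories and example answers are hypothetical).
    my %training = (
        'mailing list' => [ 'read about it on the linux kernel mailing list' ],
        'friend'       => [ 'a friend at work recommended git to me' ],
        'web'          => [ 'found it while browsing kernel.org' ],
    );
    for my $category (keys %training) {
        $nb->add_instance( attributes => features($_), label => $category )
            for @{ $training{$category} };
    }
    $nb->train;

    # Score a new reply against every category; because scores are
    # per category, one reply can end up in more than one of them.
    my $scores = $nb->predict( attributes => features('heard about git from a friend') );
    for my $category ( sort { $scores->{$b} <=> $scores->{$a} } keys %$scores ) {
        printf "%-15s %.3f\n", $category, $scores->{$category};
    }

Replies whose best score falls below some threshold could be shown to me for manual categorization and then fed back in with add_instance(), which would give the adaptive / interactive training loop described above.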

A: 

Look for common words as keywords, but throw away meaningless ones like "the", "a", etc. After that you get into natural language stuff that is beyond me.
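For the keyword-hunting part, a minimal sketch, assuming Lingua::StopWords from CPAN for the list of meaningless words and one reply per line in a file whose name is made up here:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Lingua::StopWords qw(getStopWords);

    my $stopwords = getStopWords('en');   # hashref: "the", "a", "and", ...
    my %freq;

    open my $fh, '<', 'answers.txt' or die "answers.txt: $!";
    while ( my $reply = <$fh> ) {
        for my $word ( map { lc } $reply =~ /(\w+)/g ) {
            $freq{$word}++ unless $stopwords->{$word};
        }
    }
    close $fh;

    # The most frequent surviving words are candidate category keywords.
    my @keywords = sort { $freq{$b} <=> $freq{$a} } keys %freq;
    splice @keywords, 20 if @keywords > 20;
    printf "%-20s %d\n", $_, $freq{$_} for @keywords;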

It just dawned on me that the perfect solution for this is AAI (Artificial Artificial Intelligence). Use Amazon's Mechanical Turk. The Perl bindings are Net::Amazon::MechanicalTurk. At one penny per reply, with a decent overlap (say three humans per reply), that would come to about $90 USD.

Chas. Owens
I'd settle for something like an AI / expert system which I would have to teach at the beginning, i.e. computer-*assisted* tabularization. Therefore I think that understanding natural language wouldn't be required; something keyword-based like ELIZA or Bayesian spam filters should, I guess, be good enough for me
Jakub Narębski
A: 

You are not going to like this. But: if you do a survey and you include lots of free-form questions, you had better be prepared to categorize them manually. If that is out of the question, why did you include those questions in the first place?

innaM
Free-form answers are interesting... but I did not expect more than 3,000 responses. Up to around 500 replies could be analyzed (tabularized) manually; around 3,000, unfortunately, cannot...
Jakub Narębski
Then use a random sample of 500 replies and analyze those data manually.
innaM
That would take care only of the categorization part (more or less, depending on how well the random sample represented the different categories of interest). There is still the tabularization (how many responses are in each category) to be done. And I'd rather have that computer-assisted.
Jakub Narębski
+1  A: 

Text::Ngrams + Algorithm::Cluster

  1. Generate some vector representation for each answer (e.g. word count) using Text::Ngrams.
  2. Cluster the vectors using Algorithm::Cluster to determine the groupings, and also the keywords which correspond to the groups (a rough sketch follows below).
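A minimal sketch of this recipe, assuming Algorithm::Cluster's kcluster() for step 2; for brevity the word-count vectors of step 1 are built by hand instead of with Text::Ngrams, and the sample answers and cluster count are made up:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Algorithm::Cluster qw(kcluster);

    # Hypothetical free-form answers.
    my @answers = (
        'read about git on the kernel mailing list',
        'a friend told me about git',
        'saw git mentioned on a mailing list',
        'my friend at work uses git',
    );

    # Step 1: build a word-count vector per answer over a shared vocabulary.
    my ( @bags, %vocab );
    for my $answer (@answers) {
        my %count;
        $count{ lc $1 }++ while $answer =~ /(\w+)/g;
        $vocab{$_} = 1 for keys %count;
        push @bags, \%count;
    }
    my @words = sort keys %vocab;
    my @data  = map { my $bag = $_; [ map { $bag->{$_} || 0 } @words ] } @bags;

    # Step 2: k-means clustering of the vectors (k = 2, picked arbitrarily).
    my ( $clusters, $error, $found ) = kcluster(
        nclusters => 2,
        data      => \@data,
        mask      => [ map { [ (1) x @words ] } @data ],
        weight    => [ (1.0) x @words ],
        transpose => 0,
        npass     => 100,
        method    => 'a',    # cluster centre = arithmetic mean
        dist      => 'e',    # Euclidean distance
    );

    printf "cluster %d: %s\n", $clusters->[$_], $answers[$_] for 0 .. $#answers;

The most frequent non-stopword words within each cluster can then be taken as that group's keywords, leaving a human to confirm or rename the resulting categories.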
bubaker
Thanks. I'll try this.
Jakub Narębski
Lingua::Stopwords and Lingua::Stem should be a help. Also, if you can identify some categories before processing (like 'happy' and 'unhappy'), AI::Categorizer::Learner might be of some use.
andymurd
+1  A: 

I've brute-forced stuff like this in the past with quite large corpuses. Lingua::EN::Tagger, Lingua::Stem::En. Also the Net::Calais API is (unfortunately, as Thomson Reuters are not exactly open-source friendly) pretty useful for extracting named entities from text. Of course, once you've cleaned up the raw data with this stuff, the actual data munging is up to you. I'd be inclined to suspect that frequency counts and a bit of Mechanical Turk cross-validation of the output would be sufficient for your needs.
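A minimal sketch of the frequency-count step, assuming Lingua::Stem's object interface so that different word forms get folded together before counting; the sample replies are made up, and the part-of-speech tagging and Calais steps are left out:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Lingua::Stem;

    # Hypothetical replies already read from the survey dump.
    my @replies = (
        'a colleague recommended git to me',
        'several colleagues were recommending it on our mailing list',
    );

    my $stemmer = Lingua::Stem->new( -locale => 'EN' );
    my %freq;
    for my $reply (@replies) {
        my @words = map { lc } $reply =~ /(\w+)/g;
        # Stemming folds "recommended" / "recommending" into the same count.
        $freq{$_}++ for @{ $stemmer->stem(@words) };
    }

    printf "%-15s %d\n", $_, $freq{$_}
        for sort { $freq{$b} <=> $freq{$a} } keys %freq;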

singingfish
By "quite large corpuses" (err probably corpora), I mean I've taken long documents and used the above tools (actually I got lazy and didn't do stemming) to provide a browseable index of up to 1000 documents.
singingfish
It was a difficult choice to select between this and the other answer to mark as the 'accepted answer'.
Jakub Narębski