views:

260

answers:

3

I have a MySQL database table with a couple thousand rows. The table is setup like so:

id | text

The id column is an auto-incrementing integer, and the text column is a 200-character varchar.

Say I have the following rows:

3 | I think I'll have duck tonight

4 | Maybe the chicken will be alright

5 | I have a pet duck now, awesome!

6 | I love duck

Then the list I'm wanting to generate might be something like:

  • 3 occurrences of 'duck'
  • 3 occurrences of 'I'
  • 2 occurrences of 'have'
  • 1 occurrences of 'chicken'
  • .etc .etc

Plus, I'll probably want to maintain a list of substrings to ignore from the list, like 'I', 'will' and 'have. It's important to note that I do not know what people will post.

I do not have a list of words that I want to monitor, I just want to find the most common substrings. I'll then filter out any erroneous substrings that are not interesting from the list manually by editing the query.

Can anyone suggest the best way to do this? Thanks everyone!

A: 

Extract to flat file and then use your favorite quick language, perl, python, ruby, etc to process the flat file.

If you don't have one these languages as part of your skillset, this is a perfect little task to start using one, and it won't take you long.

Some database tasks are just so much easier to do OUTSIDE of the database.

Corey Trager
A: 

You might want to look into the MySQL Full-Text Parser Plugins

TrickyNixon
+2  A: 

MySQL already does this for you.

First make sure your table is a MyISAM table

Define a FULLTEXT index on your column

On a shell command line navigate to the folder where your MySQL data is stored, then type:

myisam_ftdump -c yourtablename 1 >wordfreq.dump

You can then process wordfreq.dump to eliminate the unwanted column and sort by frequency decending.

You could do all the above with a single command line and some sed/awk wizardry no doubt. And you could incorporate it into your program without needing a dump file.

More info on myisam_ftdump here: http://dev.mysql.com/doc/refman/5.0/en/myisam-ftdump.html

Oh... one more thing, the stopwords for MySQL are precompiled into the engine. And words with 3 or less characters are not indexed. The full list is here:

http://dev.mysql.com/doc/refman/5.0/en/fulltext-stopwords.html

If this list isn't adequate for your needs, or you need words with less than 3 characters to count, the only way is to recompile MySQL with different rules for FULLTEXT. I don't recommend that!

rwired
Fantastic, thank you!
rmh