views:

349

answers:

3

I'm looking for a short, simple suffix tree building/usage algorithm in Java. The best I've found so far lies withing the Semantic Discovery Toolkit, but the implementation is several thousand lines long and spans several classes. Ideally, the implementation would be as short as possible and span no more than a few hundred lines.

Does anyone have such an implementation?

+2  A: 

There's a simple Java implementation of a suffix tree on the LiteratePrograms wiki.

Fabian Steeg
I had issues with this before, but this is so-far the best I've found. I'm wondering if there's anything shorter without obfuscation. I could probably rip out a lot of the structure to bring LOC down, but this does seem my best option.
Stefan Kendall
A: 

It's really not hard. I've written suffix trees in a few languages. It's a very simple algorithm. If you're at the stage where you're thinking about them, you're probably able to implement one.

My one comment is, look at your data. A hashtable / binary tree might be a more efficient way to store your data. Pointers / references are as expensive as chars, if not more so!

Joe
+1  A: 

The article "Simple Linear Work Suffix Array Construction", by Karkkainen and Sanders, terminates with 50 lines of C++. You will probably also want something to produce the LCP array. Googling for "Computing the LCP array in linear time, given S and the suffix array POS." should find you that.

mcdowella