views:

126

answers:

2

Background

I am designing an application which will convert text from one language to other. For example, input text hello will be converted to specified language text. This is done by having a lookup table for each language. Conversion algorithm has the following steps.

  1. Looks for exact match. The whole word hello will be the pattern. If any matches found, stop processing and return the match.
  2. Else start from left-right and lookup for each character until all the combinations are done. Ex: Iteration1 : pattern = h, 2nd - he, 3rd - hel and so on. Matched token will be kept in a temporary buffer and returned when there are no more matches. This works like maximal munch rule.
  3. If match found, the matched text will be removed from original text and continue processing with the remaining text.

This algorithm will have to iterate over the input text multiple times and leads into quadratic complexity. I am trying to optimize this by arranging the lookup table in a tree structure (very similar to suffix tree). If there are lookup text for h, hel, lo, it will be organized like

root
--h
----hel
--lo

This will help me to avoid unnecessary lookups. After matching h, I need to goto next character only if h node has got children. This reduces the number of iterations drastically.

Questions

  1. What is the name of this kind of data structure? Is there a ready-made implementation available for C or C++?
  2. Do you see any possible improvements on the above algorithm or data structure?

Any thoughts...?

+1  A: 

Look at 'Trie' data structure.

Why don't use Lucene that indexes text in very fast way, and even more - indexing includes algorithms for

  • steming
  • fuse matching

and so on.

Dewfy
Thanks for the suggestion. But, `Lucene` is in `Java` and I am looking for a C or C++ implementation.
Appu
http://stackoverflow.com/questions/1036504/trie-implementation
Manuel
@Appu: I'm using C++ port: clucene http://sourceforge.net/projects/clucene/
Dewfy
+2  A: 

A ternary search tree might be what you're talking about. It is similar to a trie, but more space efficient from what I understand.

Justin Peel