views:

56

answers:

4

I would like to write a dictionary. What algorithms/structures should I use?

Each word or phrase has a corresponding description (examples, videos, images, and more). It should be possible to easily add/remove words and modify description. Quick access is more relevant than quick adding/removing. It should be possible to filter words on the basis of some information from the description. Some descriptions can be half-empty.

I was thinking of some index with words and positions of those words in dictionary file. How to quickly search for some information from description?

+2  A: 

Dictionaries in general are built on top of trees, most commonly, self-balancing trees. The most commonly used are Red-Black trees and AVL Trees, you should start there.
Going through your demands (I'm considering the case, where the word is a key (index), description is data pointed by that key):
1. It should me possibly to add/remove word - check, you add and remove node from tree.
2. It should be possible to modify description - check, description isn't indexed, so when you find one, you can do whatever you want with it, without changing the tree iteself.
3. Quick access - check, you have log2(N) access, tree stays balanced (hence - it's self-balancing tree).
4. Some descriptions can be half-empty - Description is just a data connected to node, it can be empty, or anything you like, that doesn't change anything inside a structure.
5. Filtering words on the basis of some information - I don't get this one, the filtering stuff can be implemented by copying the tree, but without words you want to get filtered, therefore you will get another tree, that has only those words that you want (and descriptions won't be copied).

Edit: One thing you should know - implementing those trees well, isn't a easy task. It's very easy to get a bug or two, you should be checking correctness of your implementation on every step. Also, if you want to get deeper, and into more structures, you might want to read Knuth's.

Ravadre
+1  A: 

Ravadre has pointed to search tree based data structures.

The big alternative is using a hash table. Its main disadvantage over trees is that the data inside it is not ordered – the ordering of elements is somewhat arbitrary. If you need to access the elements in sorted order, using hash tables is not advisable.

However, if you do not need sorted items, go with a hash table: access time is O(1) on average and although this depends on a lot of factors, it is often still far superior to access time in tree-based structures.

By the way, most programming languages already offer either or both data structures so you don’t need to implement them for yourself.

Konrad Rudolph
Ah yes, hash tables might be better in this case. Dictionary<TKey,TValue> in C# is implemented using hashing tables, I think it says for itself, it's also much easier to implement (much, much easier...).
Ravadre
@Ravadre: but .NET also offers `SortedList` and `SortedDictionary`, both of which are implemented using a binary search tree.
Konrad Rudolph
A: 

For storing a dictionary with words as keys, you might like to use a trie, a datastructure where the keys are usually strings. Its pretty nice.

If you store the dictionary itself in an array, the values to which the word keys map could be just a list of the indices of the dictionary entries in the array in whose description the word appears.

If you don't want to use a trie: You could either use a hash table or some kind of binary tree.

Using a hash table, you have, in theory, extremely fast lookup, but the possibility of collisions, which means that the performance possibly gets worse with time. See also this blog post.

With balanced binary search trees (Red-Black trees are pretty popular), your lookup of keys is possibly a bit slower, but (if you use a balanced tree) a relatively good performance is guaranteed.

mrueg
A: 

If I understand you correctly, you want to built an actual dictionary i.e. A list of words with descriptions, videos and images? And not implement a dictionary data-type?

For the former I would suggest a database is your best option. You don't have to keep the entire structure in memory, a good indexing structure allows for quick access. SQL queries will give you the ability to filter by description or any other field.

The main disadvantage with this approach is the insertion algorithm, as the database increases the time taken to insert a word (assuming you wan't to keep order) will increase. A binary search for the correct position is probably you best start. Obviously this facilitates the need for a binary-tree structure.

For the actual database itself, there are several ways you could go about it. Having the actual word as the index might be worth considering, and has the added advantage of you being able to directly gain a position based on the index (assuming you can convert the string to a number that increase as the words position increases

Hope this helps.

Jamie Lewis