views: 152

answers: 4

Hello. This sounds like a simple question, but I don't know how to search for its answer.

I have a trie implementation in C# that stores about 80K words from a dictionary file. It takes quite a while to load all these words (more than 5 minutes). I was wondering, what is the best way to "persist" this data so I don't have to reload all the words every time I start the application?

Thanks.

+3  A: 

Like all other performance issues, the ideal solution will follow from profiling your current solution and the other candidate solutions you come up with. Where's the bottleneck? The I/O? Lexing the text? Forming the links in the trie? It will be hard to make a concrete suggestion without knowing your performance goals, the nature of the trie usage, and the bottlenecks currently present.

Issues to consider:

  1. Storage format: Text? Binary?
  2. Persisted data: The entire structure of the trie (e.g. as XML) or just a list of words, relying on run-time code to push them into the right location in the data-structure? What's the markup to data ratio? How heavy is it to parse?
  3. Storage location: DB / flat-file / ...?
  4. Incremental loading: Possible?

One possible strategy: Create and persist a 'most common words' dictionary with the 1,000 (or so) most frequently used words. Load these words into the trie on start-up, and spawn the loading of the full dictionary on another thread, incrementally adding to the existing trie as new words are read; a rough sketch follows the pros and cons below.

  • Pros: User will see faster start-up time.
  • Cons: Might require cross-thread synchronization, and the user will see an incomplete trie until loading completes. This may or may not be a showstopper depending on what the trie is being used for.
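
A minimal sketch of that strategy, assuming a Trie class whose Insert(string) method is safe to call from a background thread while the UI thread performs lookups (the file names are placeholders):

using System.IO;
using System.Threading;

var trie = new Trie();

// Phase 1: small 'most common words' list, loaded synchronously at start-up.
foreach (var word in File.ReadLines("common-words.txt"))
    trie.Insert(word);

// Phase 2: the full dictionary, loaded on a background thread.
ThreadPool.QueueUserWorkItem(_ =>
{
    foreach (var word in File.ReadLines("full-dictionary.txt"))
        trie.Insert(word);   // must tolerate concurrent lookups from the UI thread
});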
Ani
+2  A: 

I recently refactored a similar data structure, due to slow performance and slow serialization / deserialization times.

My solution was to scrap the trie altogether and go with native .NET collections - Dictionaries and Lookups.

I'm working with about 400k words. From memory it takes about 5 seconds to build the data structure, which is a list of objects indexed by a number of dictionaries and lookups.

  • The top level of the structure is a Dictionary&lt;int, ILookup&lt;string, string&gt;&gt; where the key is n - the number of letters in the search term.
  • Each value in the dictionary is a Lookup&lt;string, string&gt; where the key is a string with n letters, and the value is all strings that start with that string. E.g. for key 'st', values might be 'start', 'stop' and 'string'.

To create the data structure I simply iterate over the entire list of words for i = 1 to maxLength, creating a Lookup of all distinct 'starts with' strings for each i. Plug those into the top-level dictionary and you're done.
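
A rough sketch of that construction (the names words, index and maxLength are illustrative):

using System.Collections.Generic;
using System.Linq;

// words: the full word list, e.g. from File.ReadAllLines(...)
var index = new Dictionary<int, ILookup<string, string>>();
int maxLength = words.Max(w => w.Length);

for (int i = 1; i <= maxLength; i++)
{
    // Group every word of length >= i by its first i characters.
    index[i] = words.Where(w => w.Length >= i)
                    .ToLookup(w => w.Substring(0, i));
}

// Lookup: all words that start with "st"
var matches = index[2]["st"];   // e.g. "start", "stop", "string"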

This removes the need for a custom-built trie. I found the performance difference (search time) to be negligible, but the loading speed hugely favours my design (not to mention the simplicity and maintainability of using simple .NET types).

Kirk Broadhurst
I realise this doesn't really answer your question of how to serialize, but I strongly suggest you reconsider the custom trie. This solution takes less than an hour to implement and is - in my experience - far more performant.
Kirk Broadhurst
A: 

Why do you want to load all 80K words/nodes at startup? I believe these words are categorized (under parent nodes). Extend the basic tree to load words on demand (e.g. on a parent node's click/expand event). This way you load the groups first, then each group's nodes later.

Try using async calls to load.
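
For example, with a WinForms TreeView you could give each group node a single placeholder child and fill in the real words the first time the node is expanded (LoadWordsForGroup is a hypothetical method):

treeView.BeforeExpand += (sender, e) =>
{
    // Replace the placeholder with the group's real words on first expand.
    if (e.Node.Nodes.Count == 1 && e.Node.Nodes[0].Text == "loading...")
    {
        e.Node.Nodes.Clear();
        foreach (string word in LoadWordsForGroup(e.Node.Text))  // hypothetical loader
            e.Node.Nodes.Add(word);
    }
};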

Bepenfriends
A: 

I would just serialize it in the old MFC binary fashion. Basically the reading/writing should be about as fast as possible, and the only thing you're left with is allocating and initializing the structure on input, which you need to do anyway.

That is, to serialize a node of the trie, you do this:

Read/Write number N of subnodes
For each subnode
  If reading, allocate a subnode in this node
  Read/Write the character for the subnode
  Serialize the subnode
End
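
A hedged C# translation of that recursion, assuming a TrieNode with a Dictionary<char, TrieNode> Children property (a real node would also need to persist an end-of-word flag):

using System.Collections.Generic;
using System.IO;

void Write(TrieNode node, BinaryWriter w)
{
    w.Write(node.Children.Count);      // number N of subnodes
    foreach (var child in node.Children)
    {
        w.Write(child.Key);            // the character for the subnode
        Write(child.Value, w);         // serialize the subnode
    }
}

TrieNode Read(BinaryReader r)
{
    var node = new TrieNode();         // allocate a subnode in this node
    int count = r.ReadInt32();         // number N of subnodes
    for (int i = 0; i < count; i++)
    {
        char c = r.ReadChar();         // the character for the subnode
        node.Children[c] = Read(r);    // deserialize the subnode
    }
    return node;
}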

Edit: Just re-read your question - so you want to build the trie from scratch from the wordlist? As others said, profile, but not just with any old profiler; they don't all find your problem. Here's what I do. The time it takes should not be much more than the time it takes to read the file plus the time it takes to create the structure.

Mike Dunlavey