While making some final tests of a class library that I'm writing for Windows Mobile (using the .NET Compact Framework 2.0), I ran into an OutOfMemoryException (OOM).

Basically, my library first loads a dictionary file (an ordinary text file with a word list) and then another file derived from the dictionary (I call it the KeyMap), whose size is roughly the same as the dictionary's.

Everything worked fine (on both the emulator and my real device) with the above files until I tried to load a Spanish dictionary, which is approximately 2.7 MB in size. The other language dictionaries I have used so far without any OOM exceptions are approximately 1.8 MB each. With the Spanish dictionary, I can load the first file without any problems, but when I try to read the second file, I get the OOM error.

Below is the code I am using. Basically, I read the files and assign their contents to string variables (DictData and TextKeyMap). Then I call Split on each string variable to pass the contents on to a string array (Dict and KeyMap).

'Loading the dictionary works
Dim ReadDictionary As StreamReader = New StreamReader(DictPath, Encoding.UTF8)
DictData = ReadDictionary.ReadToEnd()
ReadDictionary.Close()
Dict = DictData.ToString.ToUpper.Split(mySplitSep.ToCharArray) 'mySplitSep = Chr(10)
DictData = "" 'perhaps "Nothing" is better

'Loading the KeyMap gives me the error
Dim ReadHashKeyMap As StreamReader = New StreamReader(HashKeyMapPath, Encoding.UTF8)
TextKeyMap = ReadHashKeyMap.ReadToEnd() '<-- OOM error
ReadHashKeyMap.Close()
KeyMap = TextKeyMap.ToString.Split(mySplitSep.ToCharArray) 'mySplitSep = Chr(10)
TextKeyMap = "" 'perhaps "Nothing" is better

I am a hobby programmer with no expert knowledge, so the code shown above can probably be improved. Instead of using ReadToEnd, I tried reading each line in a For loop, but I got the same error (and it was also slower).

I presume the error is due to Windows Mobile's 32 MB per-process memory limit, which makes it hard to allocate large contiguous blocks of memory.

Can anyone help me out, perhaps by suggesting some alternative solutions? Maybe the problem is due to my crappy code shown above? What about loading the second file in another thread; could that work?

Any help will be highly appreciated.

Edit: I asked a similar question some time ago (here), but that one was more about receiving bytes and was resolved by reading in chunks. In this case, I am dealing with strings.

Edit 2: This library is a spellchecking library. It works quite well and implements some fairly advanced techniques, such as the Soundex and Double Metaphone algorithms. The only major problem so far is the one mentioned above with the large Spanish text file; the other dictionaries are fine. For more info, please see this link.

+2  A: 

It seems that you simply don't have enough memory to keep all the text from all the files in memory at the same time. You may need a strategy that caches a limited subset of the files and is intelligent enough to go back to a file when something is requested that isn't in the cache.

If the entire point of the exercise is to avoid going back to the files (e.g., when building up some sort of index), you could also try to get "clever" and come up with an alternative representation of the in-memory text that takes advantage of the highly compressible nature of most Western languages.
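For illustration, here is a minimal sketch of one such alternative representation (an editorial sketch, not code from the answer; DictPath is the question's variable): store each word as a UTF-8 byte array instead of a .NET string. Since .NET strings are UTF-16 internally, this roughly halves the footprint for mostly-Latin text.

Imports System.IO
Imports System.Text
Imports System.Collections.Generic

'Sketch: keep words as UTF-8 byte arrays instead of strings.
'.NET strings use 2 bytes per character; for mostly-Latin text
'UTF-8 needs about 1, roughly halving the in-memory footprint.
Dim packed As New List(Of Byte())
Using reader As New StreamReader(DictPath, Encoding.UTF8)
    Dim line As String = reader.ReadLine()
    While line IsNot Nothing
        packed.Add(Encoding.UTF8.GetBytes(line.ToUpper()))
        line = reader.ReadLine()
    End While
End Using
'Decode a word back to a string only when it is actually needed
'(illustration; assumes the list is non-empty):
Dim firstWord As String = Encoding.UTF8.GetString(packed(0), 0, packed(0).Length)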

Greg D
GregD: the KeyMap file is actually an index file for the dictionary. I could split it into smaller parts and load each part only when needed, but that would slow down the spellchecking process that retrieves suggestions, especially since it runs on Windows Mobile with limited memory and hardware resources.
moster67
+2  A: 

I'd say the problematic line is this one:

Dict = DictData.ToString.ToUpper.Split(mySplitSep.ToCharArray)

The GC isn't able to keep up with the creation of temporary objects behind that simple line. ToUpper creates a copy of the original string, and Split creates a new array out of that copy (and probably uses more memory for the splitting algorithm itself). By the way, the call to ToString is useless; DictData is already a string, right?

Personally, I would read from the stream in chunks and do the splitting piece by piece, into a List(Of String). But if you want to keep your code short, try this, you never know:

DictData = ReadDictionary.ReadToEnd()
ReadDictionary.Close()
DictData = DictData.ToUpper() 'replace the original string instead of chaining calls
GC.Collect() 'give the GC a chance to reclaim the original string
Dict = DictData.Split(mySplitSep.ToCharArray)
DictData = Nothing 'release the uppercased copy
GC.Collect()

I never find calling GC.Collect to be a good solution. Calling it generally means "something better should have been done". But memory management under .NET CF is sometimes painful.
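As an editorial sketch of the "piece by piece into a List(Of String)" idea mentioned above (assuming one word per line, Lf-separated, as implied by the question's mySplitSep = Chr(10)):

Imports System.IO
Imports System.Text
Imports System.Collections.Generic

'Sketch: read line by line and accumulate into a List(Of String),
'so no single huge intermediate string is ever created by ReadToEnd,
'ToUpper, or Split.
Dim words As New List(Of String)
Using reader As New StreamReader(DictPath, Encoding.UTF8)
    Dim line As String = reader.ReadLine()
    While line IsNot Nothing
        words.Add(line.ToUpper())
        line = reader.ReadLine()
    End While
End Using
Dim Dict As String() = words.ToArray()

Note that this still keeps every word in memory; it only avoids the large temporary copies, which may be why the asker's own line-by-line attempt hit the same exception.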

Martin Plante
slimCODE: I told you my code was crappy! You're right, DictData is a string, but to get the conversion to uppercase I had to add ".ToString" in order to get ".ToUpper" (using IntelliSense), or perhaps I'm wrong. I can't verify right now.
moster67
slimCODE: Could you elaborate on your idea about "splitting piece by piece into a List(Of String)"? I don't understand what you mean. As for your code suggestions, I will try them. Thanks!
moster67
slimCODE: I tried your code, applying the same change to the 1st and 2nd file, but unfortunately I still get OOM exceptions.
moster67
+3  A: 

As you haven't said what you're using this file for, I'm assuming that you are just searching for a word for some reason.

First of all, it's probably not a good idea to try to load the complete file into memory. Instead, it might be more productive to search the file for the data (word) you need and, perhaps, keep some sort of indexing information in memory to speed things up a bit.

Since the data you are trying to search is just a list of words, it might be a good idea to scan the file and record, in a Dictionary, where the first letter of a word changes, e.g. A's start at line 0, B's start at line 200, C's start at line 300, and so on. Use these two pieces of information to populate your Dictionary: the letter is the key and the line number is the value. In effect, the Dictionary becomes a high-level index into the word-list file. This Dictionary is also very small.
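Here is an editorial sketch of that index-building step (assuming one word per line, Lf-only line endings as implied by mySplitSep = Chr(10), and UTF-8 without a byte-order mark; it records byte offsets rather than line numbers so the lookup can Seek directly):

Imports System.IO
Imports System.Text
Imports System.Collections.Generic

'Sketch: map each starting letter to the byte offset of its first
'word in the file. Byte offsets stand in for line numbers so the
'lookup below can jump straight to the right section.
Dim index As New Dictionary(Of Char, Long)
Dim offset As Long = 0
Using reader As New StreamReader(DictPath, Encoding.UTF8)
    Dim line As String = reader.ReadLine()
    While line IsNot Nothing
        If line.Length > 0 Then
            Dim first As Char = Char.ToUpper(line.Chars(0))
            If Not index.ContainsKey(first) Then
                index.Add(first, offset)
            End If
        End If
        'Advance by the encoded length of the line plus its Lf byte.
        offset += Encoding.UTF8.GetByteCount(line) + 1
        line = reader.ReadLine()
    End While
End Using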

Then, when you start to search for a word, use its first letter to look up the Dictionary. This gives you the line number where words beginning with that letter start in the file. Armed with the line number, (re)open the file, move the stream pointer straight to that line, and search for the target word from there. You can either search sequentially, a line at a time (not recommended; it will be quite slow, but it is easier to code), or search using a binary chop (much quicker, but harder to code). For the latter you'll also need to know where the words starting with the target letter end in the file, as you'll be searching a section of the file. I'd also recommend doing the word search in the file rather than loading all those words into memory, otherwise you might be back where you started with OOM errors.
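Continuing the sketch above, here is a hedged version of the sequential lookup (the easier-to-code variant described in the answer; a binary chop would additionally need the section's end offset; ContainsWord is a hypothetical name):

'Sketch: check whether a word exists without loading the whole file.
'Seeks to the letter's section and scans lines until the section ends.
Function ContainsWord(ByVal path As String, ByVal index As Dictionary(Of Char, Long), ByVal word As String) As Boolean
    Dim target As String = word.ToUpper()
    Dim first As Char = target.Chars(0)
    Dim offset As Long
    If Not index.TryGetValue(first, offset) Then Return False
    Using fs As New FileStream(path, FileMode.Open, FileAccess.Read)
        fs.Seek(offset, SeekOrigin.Begin)
        Using reader As New StreamReader(fs, Encoding.UTF8)
            Dim line As String = reader.ReadLine()
            While line IsNot Nothing
                Dim candidate As String = line.ToUpper()
                'Stop once we leave this letter's section of the sorted list.
                If candidate.Length > 0 AndAlso candidate.Chars(0) <> first Then Return False
                If candidate = target Then Return True
                line = reader.ReadLine()
            End While
        End Using
    End Using
    Return False
End Function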

If you're not sure of anything, stick a comment on here and I'll do my best to answer it.

Good luck

Barry Carr
Good input! BinarySearch is already being used wherever possible. While the 1st file (the word list) must be loaded as shown in my code (it's needed for other algorithms in my library), your idea is still good, especially with regard to the 2nd file: it would mean I could avoid loading the 2nd file and access it only when needed. Of course, this would involve a lot of file access and probably a substantial performance loss, but still... I will give it a shot! Thanks!
moster67
please see my edit2 for some extra information regarding the library
moster67
Thanks for the feedback. Is there any chance you could vote for my answer? ;-) Regarding the second file: you could try searching the file on a background thread; that at least should keep your UI responsive.
Barry Carr