tags:

views:

172

answers:

6

I have a class with a couple static arrays:

an int[] with 17,720 elements
a string[] with 17,720 elements

I noticed when I first access this class it takes almost 2 seconds to initialize, which causes a pause in the GUI that's accessing it.

Specifically, it's a lookup for Unicode character names. The first array is an index into the second array.

static readonly int[] NAME_INDEX = {
0x0000, 0x0001, 0x0005, 0x002C, 0x003B, ...

static readonly string[] NAMES = {
"Exclamation Mark", "Digit Three", "Semicolon", "Question Mark", ...

The following code is how the arrays are used (given a character code). [Note: This code isn't a performance problem]

int nameIndex = Array.BinarySearch<int>(NAME_INDEX, code);
if (nameIndex > 0) { return NAMES[nameIndex]; }

I guess I'm looking at other options on how to structure the data so that 1) The class is quickly loaded, and 2) I can quickly get the "name" for a given character code.

Should I not be storing all these thousands of elements in static arrays?

Update
Thanks for all the suggestions. I've tested out a Dictionary approach and the performance of adding all the entries seems to be really poor.

Here is some code with the Unicode data to test out Arrays vs Dictionaries http://drop.io/fontspace/asset/fontspace-unicodesupport-zip

Solution Update
I tested out my original dual arrays (which are faster than both dictionary options) with a background thread to initialize and that helped performance a bit.

However, the real surprise is how well the binary files in resource streams works. It is the fastest solution discussed in this thread. Thanks everyone for your answers!

+3  A: 
  1. Initialize your arrays in separate thread that will not lock the UI
  2. http://msdn.microsoft.com/en-us/library/hz49h034.aspx
Andrey
That's probably the easiest way to deal with the problem.
Robert Harvey
+7  A: 

So a couple of observations. Binary Search is only going to work if your array is sorted, and from your above code snippet, it doesn't look to be sorted.

Since your primary goal is to find a specific name, your code is begging for a hash table. I would suggest using a Dictionary, it will give you O(1) (on average) lookup, without much more overhead than just having the arrays.

As for the load time, I agree with Andrey that the best way is going to be by using a separate thread. You are going to have some initialization overhead when using the amount of data you are using. Normal practice with GUIs is to use a separate thread for these activites so you don't lock up the UI.

cmptrer
A dictionary initialized from a background thread, started at program start, instead of when needed, is probably the best option here.
Lasse V. Karlsen
The binary search is on the NAME_INDEX array, which is sorted. I just tried to use a static Dictionary, but I'm getting an error "An expression is too long to compile" error. Going to try building the Dictionary at runtime to see how that performs ...
Visualize
+3  A: 

"Should I not be storing all these thousands of elements in static arrays?"

A much better way would be to store your data as binary stream in resources in the assembly and then load from the resources. Will be some more programming overhead but therefore doesn't need any object initialization.

Basic idea would be (no real code):

// Load data (two streams):
indices = ResourceManager.GetStream ("indexData");
strings = ResourceManager.GetStream ("stringData");

// Retrieving an entry:
stringIndex = indices.GetIndexAtPosition (char);
string = strings.GetStringFromPosition (stringIndex);

If you want a really good solution (for even some more work) look into using memmapped data files.

Foxfire
Either way, it's stored in the assembly and still needs to be loaded/initialized. I'd suspect that using embedded resources is *slower* than using static data, though I don't have any benchmarks.
Robert Fraser
No they don't need to be initialized at all. It is raw data that you are accessing there. This has a lot of advantages. Not just that its faster, it also will need less memory AND it will have less runtime overhead (don't forget that the original method will create >17000 objects that have to be managed by the GC and are completely uncollectable).
Foxfire
I added some pseudocode in the answer to show it.
Foxfire
Don't need to be initialized? Wherever your data starts, you have to get it into memory to use it.
Charles
@Charles: No it doesn't it is A) memmapped so only loaded when it is really used B) it doesn't have ANY object creation/initialization overhead. The problem surely isn't loading 100kb from disk because that takes only a millisecond or so.
Foxfire
@Adam: Look at the pseudocode. I'm not trying to save the individual entries as resource entries.I suggest saving the entire data as binary stream. That is WAY faster and needs WAY less memory.
Foxfire
@Adam: That is very similar to how similar Char lookups are done in Mono (although they aren't read from Resources, but that doesn't matter you could take external files, too). And I can tell you (as somebody who partially developed these) they are by far the fastest and memory-efficient way of doing this.
Foxfire
@Adam: DID you even look at the pseudocode? He doesn't need to do ANY search for Unicode character names. His range is unicode. With the method I described he could easily have a complete unicode lookup table. So retrieving an entry is a matter of two index-based lookups which is FAR less than anything else could ever reach.
Foxfire
@Adam Unicode doesn't have any "code pages". But yes, a 65k lookup table. Keep in mind that unless you cry about the few kb of disk space it doesn't matter because it is memmapped.In memory this will still be smaller because it removes the entire object-overhead.
Foxfire
@Adam: If you want to be pedantic they are about 100k. Private use code points are meaningless. But the OP is using .Net and is obviously interested in the Char struct. So its 65k.
Foxfire
@Foxfire Thank you very much! This was the kind of solution that I hadn't thought of. I finally got around to implementing it today, and it is hands down the fastest option. Faster than my original dual arrays or a static or runtime-built dictionary. The resulting "indices" binary is 320KB, and the strings binary is 390KB.
Visualize
+5  A: 

First

A Dictionary<int, string> is going to perform far better than your duelling arrays will. Putting aside how this data gets into the arrays/Dictionary (hardcoded vs. read in from another location, like a resource file), this is still a better and more intuitive storage mechanism

Second

As others have suggested, do your loading in another thread. I'd use a helper function to help you deal with this. You could use an approach like this:

public class YourClass
{
    private static Dictionary<int, string> characterLookup;
    private static ManualResetEvent lookupCreated;

    static YourClass()
    {
        lookupCreated = new ManualResetEvent(false);

        ThreadPool.QueueUserWorkItem(LoadLookup);
    }

    static void LoadLookup(object garbage)
    {
        // add your pairs by calling characterLookup.Add(...)

        lookupCreated.Set();
    }

    public static string GetDescription(int code)
    {
        if (lookupCreated != null)
        {
            lookupCreated.WaitOne();
            lookupCreated.Close();
            lookupCreated = null;
        }

        string output;

        if(!characterLookup.TryGetValue(code, out output)) output = null;

        return output;
    }
}

In your code, call GetDescription in order to translate your integer into the corresponding string. If the UI doesn't call this until later, then you should see a marked decrease in startup time. To be safe, though, I've included a ManualResetEvent that will cause any calls to GetDescription to block until the dictionary has been fully loaded.

Adam Robinson
That will help on the responsiveness of the gui and since you're loading all the elements at first access the delay for the data will be roughly the same.
Rune FS
+1: I simply must upvote this because it really is the best solution. Nevermind that we seem to have come up with the exact same solution :)
Brian Gideon
One tip, TryGetValue will always store a value in its output parameter, but it'll default to `default(TValue)` if it cannot find the key, basically you can shorten that line down to just: `characterLookup.TryGetValue(code, out output);`. For info, see MSDN: http://msdn.microsoft.com/en-us/library/zkw5c9ak(VS.80).aspx
Lasse V. Karlsen
@Lasse: Yes, it will default to `default(T)` so you could simply ignore the sucess or failure. Nonetheless, I find it clearer to set my own default value (yes, I realize it's the same) to a) make changing the default easier later, and b) make my code consistent with similar operations for value types (it's less common that I'd want an unknown int to return 0, for example).
Adam Robinson
Thanks, this seems to be the ideal way to safely load the data. I'm still testing arrays vs. dictionary, I'll post back my results here
Visualize
Ack, I tested this out and the code in LoadLookup that does the dictionary.Add puts the memory usage up to ~300MB and then tapers down under 100MB. I have no clue what's going on with that ...
Visualize
@Visualize: Consider using the constructor that takes an integer. If you know the list ahead of time, you can use this to specify the initial capacity of the `Dictionary` and bypass the repeated resizing that you're seeing now.
Adam Robinson
Yep that's what I'm doing: "new Dictionary<int,string>(21829);"
Visualize
@Visualize: A Dictionary storing an int and a string reference should only add an extra 250KB (assuming an 80% load factor of 21829 items) of overhead. Are you sure the memory increase cannot be attributed to the length of the strings being created?
Brian Gideon
A: 

if you store the arrays in a file you could do a lazy load

public class Class1
    {
        const int CountOfEntries = 17700; //or what ever the count is
        IEnumerable<KeyValuePair<int, string>> load()
        {
            using (var reader = File.OpenText("somefile"))
            {
                while (!reader.EndOfStream)
                {
                    var line = reader.ReadLine();
                    var pair = line.Split(',');
                    yield return new KeyValuePair<int, string>(int.Parse(pair[0]), pair[1]);
                }
            }
        }

        private static Dictionary<int, string> _lookup = new Dictionary<int, string>();
        private static IEnumerator<KeyValuePair<int, string>> _loader = null;
        private string LookUp(int index)
        {

            if (_lookup.Count < CountOfEntries && !_lookup.ContainsKey(index))
            {
                if(_loader == null)
                {
                    _loader = load().GetEnumerator();
                }
                while(_loader.MoveNext())
                {
                    var pair = _loader.Current;
                    _lookup.Add(pair.Key,pair.Value);
                    if (pair.Key == index)
                    {
                        return index;
                    }
                }
            }
            string name;
            if (_lookup.TryGetValue(index,out name))
            {
                return return name;
            }
            throw new KeyNotFoundException("The given index was not found");
        }
    }

the code expectes the file to have one pair on each line like so: index0,name0 index1,name1

If the first index sought is at the end this will perform slower probably (due to IO mainly) but if the access is random the average case woul be reading half of the values the first time if the access is not random make sure to keep the most used in the top of the file

there are a few more issues to considere. The above code is not threadsafe for the load operation and to increase responsiveness of the rest of the code keep the loading in a background thread

hope this helps

Rune FS
Just as a tip, using `TryGetValue` is quite a bit faster than combinding `ContainsKey` and the indexer property.
Adam Robinson
yeah you right will update immediately thanks
Rune FS
A: 

What about using a dictionary instead of two arrays? You could initialize the dictionary asynchronously using a thread or thread pool. The lookup would be O(1) instead of O(log(n)) as well.

public static class Lookup
{
  private static readonly ManualResetEvent m_Initialized = new ManualResetEvent(false);
  private static readonly Dictionary<int, string> m_Dictionary = new Dictionary<int, string>();

  public static Lookup()
  {
    // Start an asynchronous operation to intialize the dictionary.
    // You could use ThreadPool.QueueUserWorkItem instead of creating a new thread.
    Thread thread = new Thread(() => { Initialize(); });
    thread.Start();
  }

  public static string Lookup(int code)
  {
    m_Initialized.WaitOne();
    lock (m_Dictionary)
    {
      return m_Dictionary[code];
    }
  }

  private static void Initialize()
  {
    lock (m_Dictionary)
    {
      m_Dictionary.Add(0x0000, "Exclamation Point");
      // Keep adding items to the dictionary here.
    }
    m_Initialized.Set();
  }
}
Brian Gideon