ansaurus

Question

Large static arrays are slowing down class load, need a better/faster lookup method

Answer 1

+3 A:

Initialize your arrays in separate thread that will not lock the UI
http://msdn.microsoft.com/en-us/library/hz49h034.aspx

Andrey 2010-05-04 18:18:00

That's probably the easiest way to deal with the problem.

Robert Harvey 2010-05-04 18:19:48

Answer 2

+7 A:

So a couple of observations. Binary Search is only going to work if your array is sorted, and from your above code snippet, it doesn't look to be sorted.

Since your primary goal is to find a specific name, your code is begging for a hash table. I would suggest using a Dictionary, it will give you O(1) (on average) lookup, without much more overhead than just having the arrays.

As for the load time, I agree with Andrey that the best way is going to be by using a separate thread. You are going to have some initialization overhead when using the amount of data you are using. Normal practice with GUIs is to use a separate thread for these activites so you don't lock up the UI.

cmptrer 2010-05-04 18:19:56

A dictionary initialized from a background thread, started at program start, instead of when needed, is probably the best option here.

Lasse V. Karlsen 2010-05-04 18:21:10

The binary search is on the NAME_INDEX array, which is sorted. I just tried to use a static Dictionary, but I'm getting an error "An expression is too long to compile" error. Going to try building the Dictionary at runtime to see how that performs ...

Visualize 2010-05-04 20:02:17

Answer 3

+3 A:

"Should I not be storing all these thousands of elements in static arrays?"

A much better way would be to store your data as binary stream in resources in the assembly and then load from the resources. Will be some more programming overhead but therefore doesn't need any object initialization.

Basic idea would be (no real code):

// Load data (two streams):
indices = ResourceManager.GetStream ("indexData");
strings = ResourceManager.GetStream ("stringData");

// Retrieving an entry:
stringIndex = indices.GetIndexAtPosition (char);
string = strings.GetStringFromPosition (stringIndex);

If you want a really good solution (for even some more work) look into using memmapped data files.

Foxfire 2010-05-04 18:21:51

Either way, it's stored in the assembly and still needs to be loaded/initialized. I'd suspect that using embedded resources is *slower* than using static data, though I don't have any benchmarks.

Robert Fraser 2010-05-04 18:24:36

No they don't need to be initialized at all. It is raw data that you are accessing there. This has a lot of advantages. Not just that its faster, it also will need less memory AND it will have less runtime overhead (don't forget that the original method will create >17000 objects that have to be managed by the GC and are completely uncollectable).

Foxfire 2010-05-04 18:27:36

I added some pseudocode in the answer to show it.

Foxfire 2010-05-04 18:37:19

Don't need to be initialized? Wherever your data starts, you have to get it into memory to use it.

Charles 2010-05-04 18:38:19

@Charles: No it doesn't it is A) memmapped so only loaded when it is really used B) it doesn't have ANY object creation/initialization overhead. The problem surely isn't loading 100kb from disk because that takes only a millisecond or so.

Foxfire 2010-05-04 18:40:18

@Adam: Look at the pseudocode. I'm not trying to save the individual entries as resource entries.I suggest saving the entire data as binary stream. That is WAY faster and needs WAY less memory.

Foxfire 2010-05-04 18:42:45

@Adam: That is very similar to how similar Char lookups are done in Mono (although they aren't read from Resources, but that doesn't matter you could take external files, too). And I can tell you (as somebody who partially developed these) they are by far the fastest and memory-efficient way of doing this.

Foxfire 2010-05-04 18:48:36

@Adam: DID you even look at the pseudocode? He doesn't need to do ANY search for Unicode character names. His range is unicode. With the method I described he could easily have a complete unicode lookup table. So retrieving an entry is a matter of two index-based lookups which is FAR less than anything else could ever reach.

Foxfire 2010-05-04 18:55:35

@Adam Unicode doesn't have any "code pages". But yes, a 65k lookup table. Keep in mind that unless you cry about the few kb of disk space it doesn't matter because it is memmapped.In memory this will still be smaller because it removes the entire object-overhead.

Foxfire 2010-05-04 19:05:36

@Adam: If you want to be pedantic they are about 100k. Private use code points are meaningless. But the OP is using .Net and is obviously interested in the Char struct. So its 65k.

Foxfire 2010-05-04 19:16:39

@Foxfire Thank you very much! This was the kind of solution that I hadn't thought of. I finally got around to implementing it today, and it is hands down the fastest option. Faster than my original dual arrays or a static or runtime-built dictionary. The resulting "indices" binary is 320KB, and the strings binary is 390KB.

Visualize 2010-05-08 17:24:53

Answer 4

+5 A:

First

A Dictionary<int, string> is going to perform far better than your duelling arrays will. Putting aside how this data gets into the arrays/Dictionary (hardcoded vs. read in from another location, like a resource file), this is still a better and more intuitive storage mechanism

Second

As others have suggested, do your loading in another thread. I'd use a helper function to help you deal with this. You could use an approach like this:

public class YourClass
{
    private static Dictionary<int, string> characterLookup;
    private static ManualResetEvent lookupCreated;

    static YourClass()
    {
        lookupCreated = new ManualResetEvent(false);

        ThreadPool.QueueUserWorkItem(LoadLookup);
    }

    static void LoadLookup(object garbage)
    {
        // add your pairs by calling characterLookup.Add(...)

        lookupCreated.Set();
    }

    public static string GetDescription(int code)
    {
        if (lookupCreated != null)
        {
            lookupCreated.WaitOne();
            lookupCreated.Close();
            lookupCreated = null;
        }

        string output;

        if(!characterLookup.TryGetValue(code, out output)) output = null;

        return output;
    }
}

In your code, call GetDescription in order to translate your integer into the corresponding string. If the UI doesn't call this until later, then you should see a marked decrease in startup time. To be safe, though, I've included a ManualResetEvent that will cause any calls to GetDescription to block until the dictionary has been fully loaded.

Adam Robinson 2010-05-04 18:29:38

That will help on the responsiveness of the gui and since you're loading all the elements at first access the delay for the data will be roughly the same.

Rune FS 2010-05-04 18:43:30

+1: I simply must upvote this because it really is the best solution. Nevermind that we seem to have come up with the exact same solution :)

Brian Gideon 2010-05-04 18:46:34

One tip, TryGetValue will always store a value in its output parameter, but it'll default to `default(TValue)` if it cannot find the key, basically you can shorten that line down to just: `characterLookup.TryGetValue(code, out output);`. For info, see MSDN: http://msdn.microsoft.com/en-us/library/zkw5c9ak(VS.80).aspx

Lasse V. Karlsen 2010-05-04 18:57:33

@Lasse: Yes, it will default to `default(T)` so you could simply ignore the sucess or failure. Nonetheless, I find it clearer to set my own default value (yes, I realize it's the same) to a) make changing the default easier later, and b) make my code consistent with similar operations for value types (it's less common that I'd want an unknown int to return 0, for example).

Adam Robinson 2010-05-04 19:01:58

Thanks, this seems to be the ideal way to safely load the data. I'm still testing arrays vs. dictionary, I'll post back my results here

Visualize 2010-05-04 20:04:07

Ack, I tested this out and the code in LoadLookup that does the dictionary.Add puts the memory usage up to ~300MB and then tapers down under 100MB. I have no clue what's going on with that ...

Visualize 2010-05-04 20:50:31

@Visualize: Consider using the constructor that takes an integer. If you know the list ahead of time, you can use this to specify the initial capacity of the `Dictionary` and bypass the repeated resizing that you're seeing now.

Adam Robinson 2010-05-04 21:26:32

Yep that's what I'm doing: "new Dictionary<int,string>(21829);"

Visualize 2010-05-04 21:35:29

@Visualize: A Dictionary storing an int and a string reference should only add an extra 250KB (assuming an 80% load factor of 21829 items) of overhead. Are you sure the memory increase cannot be attributed to the length of the strings being created?

Brian Gideon 2010-05-05 13:18:06

Answer 5

A:

if you store the arrays in a file you could do a lazy load

public class Class1
    {
        const int CountOfEntries = 17700; //or what ever the count is
        IEnumerable<KeyValuePair<int, string>> load()
        {
            using (var reader = File.OpenText("somefile"))
            {
                while (!reader.EndOfStream)
                {
                    var line = reader.ReadLine();
                    var pair = line.Split(',');
                    yield return new KeyValuePair<int, string>(int.Parse(pair[0]), pair[1]);
                }
            }
        }

        private static Dictionary<int, string> _lookup = new Dictionary<int, string>();
        private static IEnumerator<KeyValuePair<int, string>> _loader = null;
        private string LookUp(int index)
        {

            if (_lookup.Count < CountOfEntries && !_lookup.ContainsKey(index))
            {
                if(_loader == null)
                {
                    _loader = load().GetEnumerator();
                }
                while(_loader.MoveNext())
                {
                    var pair = _loader.Current;
                    _lookup.Add(pair.Key,pair.Value);
                    if (pair.Key == index)
                    {
                        return index;
                    }
                }
            }
            string name;
            if (_lookup.TryGetValue(index,out name))
            {
                return return name;
            }
            throw new KeyNotFoundException("The given index was not found");
        }
    }

the code expectes the file to have one pair on each line like so: index0,name0 index1,name1

If the first index sought is at the end this will perform slower probably (due to IO mainly) but if the access is random the average case woul be reading half of the values the first time if the access is not random make sure to keep the most used in the top of the file

there are a few more issues to considere. The above code is not threadsafe for the load operation and to increase responsiveness of the rest of the code keep the loading in a background thread

hope this helps

Rune FS 2010-05-04 18:33:15

Just as a tip, using `TryGetValue` is quite a bit faster than combinding `ContainsKey` and the indexer property.

Adam Robinson 2010-05-04 18:35:47

yeah you right will update immediately thanks

Rune FS 2010-05-04 18:36:25

Answer 6

A:

What about using a dictionary instead of two arrays? You could initialize the dictionary asynchronously using a thread or thread pool. The lookup would be O(1) instead of O(log(n)) as well.

public static class Lookup
{
  private static readonly ManualResetEvent m_Initialized = new ManualResetEvent(false);
  private static readonly Dictionary<int, string> m_Dictionary = new Dictionary<int, string>();

  public static Lookup()
  {
    // Start an asynchronous operation to intialize the dictionary.
    // You could use ThreadPool.QueueUserWorkItem instead of creating a new thread.
    Thread thread = new Thread(() => { Initialize(); });
    thread.Start();
  }

  public static string Lookup(int code)
  {
    m_Initialized.WaitOne();
    lock (m_Dictionary)
    {
      return m_Dictionary[code];
    }
  }

  private static void Initialize()
  {
    lock (m_Dictionary)
    {
      m_Dictionary.Add(0x0000, "Exclamation Point");
      // Keep adding items to the dictionary here.
    }
    m_Initialized.Set();
  }
}

Brian Gideon 2010-05-04 18:42:22

ansaurus

tags:

views:

answers:

Large static arrays are slowing down class load, need a better/faster lookup method

related questions