Good morning,

I am writing a C# application that needs to read about 130,000 (String, Int32) pairs at startup into a Dictionary. The pairs are stored in a .txt file and are thus easily modifiable by anyone, which is dangerous in this context. I would like to ask if there is a way to save this dictionary so that the information is reasonably safely stored, without losing performance at startup. I have tried using BinaryFormatter, but the problem is that while the original program takes between 125ms and 250ms at startup to read the information from the txt and build the dictionary, deserializing the resulting binary file takes up to 2s. That is not too much by itself, but compared to the original performance it is an 8-16x decrease in speed.

Note: Encryption is important, but the most important thing is a way to save and read the dictionary from disk - possibly from a binary file - without having to call Convert.ToInt32 on each line, thus improving performance.

Thank you very much.

+1  A: 

Well, using a BinaryFormatter isn't really a safe way to store the pairs, as anyone can write a very simple program to deserialize it (after, say, running Reflector on your code to get the type).

How about encrypting the txt file? With something like this, for example? (For maximum performance, try it without compression.)

ohadsc
Thank you very much for your suggestion. What is the impact of encryption on performance? And, if I understand correctly, that is also unsafe, because any user can unzip it, change the .txt and zip it again, right?
Miguel
I have no idea; you should probably test it for your case. Also note Pieter's answer, which might be a better idea for encryption (I linked to a compression library, which can also encrypt).
ohadsc
@Miguel - Note though that there is a very good chance your performance impact will be lower when you combine compression and encryption because your IO will be lower. As @ohadsc said, just try it out and see what it gives you.
Pieter
@Pieter true, but you could use the "no compression" setting
ohadsc
+2  A: 

If you want the data stored relatively safely, you can encrypt the contents. If you just encrypt it as a string and decrypt it before your current parsing logic, you should be safe, and this should not impact performance that much.

See http://stackoverflow.com/questions/202011/encrypt-decrypt-string-in-net for more information.
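A minimal sketch of the idea, assuming AES with a key/IV that you manage yourself (the method and class names here are illustrative, not from the linked answer; key management is the hard part):

    using System.IO;
    using System.Security.Cryptography;

    static class DictionaryCrypto
    {
        // Encrypt the plain-text dictionary file once, offline.
        public static void EncryptFile(string inputPath, string outputPath, byte[] key, byte[] iv)
        {
            using (var aes = Aes.Create())
            using (var encryptor = aes.CreateEncryptor(key, iv))
            using (var output = File.Create(outputPath))
            using (var crypto = new CryptoStream(output, encryptor, CryptoStreamMode.Write))
            {
                byte[] plain = File.ReadAllBytes(inputPath);
                crypto.Write(plain, 0, plain.Length);
            }
        }

        // At startup, decrypt back to a string and hand it to the existing parsing logic.
        public static string DecryptToString(string inputPath, byte[] key, byte[] iv)
        {
            using (var aes = Aes.Create())
            using (var decryptor = aes.CreateDecryptor(key, iv))
            using (var input = File.OpenRead(inputPath))
            using (var crypto = new CryptoStream(input, decryptor, CryptoStreamMode.Read))
            using (var reader = new StreamReader(crypto))
            {
                return reader.ReadToEnd();
            }
        }
    }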

Pieter
+1  A: 

Encryption comes at the cost of key management. And, of course, even the fastest encryption/decryption algorithms are slower than no encryption at all. Same with compression, which will only help if you are I/O-bound.

If performance is your main concern, start looking at where the bottleneck actually is. If the culprit really is the Convert.ToInt32() call, I imagine you can store the Int32 bits directly and get away with a simple cast, which should be faster than parsing a string value. To obfuscate the strings, you can xor each byte with some fixed value, which is fast but provides nothing more than a roadbump for a determined attacker.
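As a sketch of the xor idea (a deterrent only, not real security; the key value below is an arbitrary choice of mine):

    static class Obfuscator
    {
        const byte XorKey = 0x5A; // arbitrary fixed value; trivially reversible

        // XOR every byte with a fixed value; applying it twice restores the original,
        // so the same method serves for both obfuscating and de-obfuscating.
        public static byte[] Apply(byte[] data)
        {
            var result = new byte[data.Length];
            for (int i = 0; i < data.Length; i++)
                result[i] = (byte)(data[i] ^ XorKey);
            return result;
        }
    }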

Michael Kjörling
+1  A: 

Perhaps something like:

    static void Serialize(string path, IDictionary<string, int> data)
    {
        using (var file = File.Create(path))
        using (var writer = new BinaryWriter(file))
        {
            writer.Write(data.Count);
            writer.Write(data.Count);
            foreach (var pair in data)
            {
                writer.Write(pair.Key);
                writer.Write(pair.Value);
            }
        }
    }
    static IDictionary<string,int> Deserialize(string path)
    {
        using (var file = File.OpenRead(path))
        using (var reader = new BinaryReader(file))
        {
            int count = reader.ReadInt32();
            var data = new Dictionary<string, int>(count);
            while (count-- > 0)
            {
                data.Add(reader.ReadString(), reader.ReadInt32());
            }
            return data;
        }
    }

Note this doesn't do anything regarding encryption; that is a separate concern. You might also find that adding deflate into the mix reduces file IO and improves performance:

    static void Serialize(string path, IDictionary<string, int> data)
    {
        using (var file = File.Create(path))
        using (var deflate = new DeflateStream(file, CompressionMode.Compress))
        using (var writer = new BinaryWriter(deflate))
        {
            writer.Write(data.Count);
            foreach (var pair in data)
            {
                writer.Write(pair.Key);
                writer.Write(pair.Value);
            }
        }
    }
    static IDictionary<string,int> Deserialize(string path)
    {
        using (var file = File.OpenRead(path))
        using (var deflate = new DeflateStream(file, CompressionMode.Decompress))
        using (var reader = new BinaryReader(deflate))
        {
            int count = reader.ReadInt32();
            var data = new Dictionary<string, int>(count);
            while (count-- > 0)
            {
                data.Add(reader.ReadString(), reader.ReadInt32());
            }
            return data;
        }
    }
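A hypothetical round trip with the methods above might look like this (file name is mine):

    var data = new Dictionary<string, int> { { "alpha", 1 }, { "beta", 2 } };
    Serialize("pairs.bin", data);                       // writes count, then key/value pairs
    IDictionary<string, int> restored = Deserialize("pairs.bin");
    // restored contains the same entries as data
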
Marc Gravell
+1  A: 

Is it safe enough to use BinaryFormatter instead of storing the contents directly in the text file? Obviously not, because others can easily "destroy" the file by opening it in Notepad and adding something, even though they will only see strange characters. It would be better to store it in a database. But if you stick with your solution, you can improve performance a lot by using parallel programming in C# 4.0 (you can easily find many useful examples by googling it). Something like this:

//just an example; GroupBy and First require System.Linq
Dictionary<string, int> source = GetTheDict();
var grouped = source.GroupBy(x =>
              {
                  if (x.Key.First() >= 'a' && x.Key.First() <= 'z') return "File1";
                  if (x.Key.First() >= 'A' && x.Key.First() <= 'Z') return "File2";
                  return "File3";
              });
Parallel.ForEach(grouped, g =>
              {
                 ThreeStreamsToWriteToThreeFilesParallelly(g);
              });

An alternative to Parallel is creating several threads yourself; reading from/writing to different files concurrently can be faster.

Danny Chen
+1  A: 

Interesting question. I did some quick tests, and you are right - BinaryFormatter is surprisingly slow:

  • Serialize 130,000 dictionary entries: 547ms
  • Deserialize 130,000 dictionary entries: 1046ms

When I coded it with a StreamReader/StreamWriter with comma separated values I got:

  • Serialize 130,000 dictionary entries: 121ms
  • Deserialize 130,000 dictionary entries: 111ms

But then I tried just using a BinaryWriter/BinaryReader:

  • Serialize 130,000 dictionary entries: 22ms
  • Deserialize 130,000 dictionary entries: 36ms

The code for that looks like this:

public void Serialize(Dictionary<string, int> dictionary, Stream stream)
{
    BinaryWriter writer = new BinaryWriter(stream);
    writer.Write(dictionary.Count);
    foreach (var kvp in dictionary)
    {
        writer.Write(kvp.Key);
        writer.Write(kvp.Value);
    }
    writer.Flush();
}

public Dictionary<string, int> Deserialize(Stream stream)
{
    BinaryReader reader = new BinaryReader(stream);
    int count = reader.ReadInt32();
    var dictionary = new Dictionary<string,int>(count);
    for (int n = 0; n < count; n++)
    {
        var key = reader.ReadString();
        var value = reader.ReadInt32();
        dictionary.Add(key, value);
    }
    return dictionary;                
}
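For reference, the numbers above can be reproduced against a MemoryStream with a harness along these lines (a sketch of my setup; the generated keys and Stopwatch usage are illustrative, not copied from my test file):

    var dictionary = new Dictionary<string, int>();
    for (int i = 0; i < 130000; i++)
        dictionary["key" + i] = i;

    var stream = new MemoryStream();
    var sw = Stopwatch.StartNew();
    Serialize(dictionary, stream);          // BinaryWriter version above
    sw.Stop();
    Console.WriteLine("Serialize: {0}ms", sw.ElapsedMilliseconds);

    stream.Position = 0;
    sw = Stopwatch.StartNew();
    var roundTripped = Deserialize(stream); // BinaryReader version above
    sw.Stop();
    Console.WriteLine("Deserialize: {0}ms", sw.ElapsedMilliseconds);
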

As others have said, though, if you are concerned about users tampering with the file, encryption, rather than binary formatting, is the way forward.

Mark Heath
Thank you very much for your suggestion!
Miguel
How did you get such a difference using BinaryReader/BinaryWriter? I am getting approximately the same times using FileReader/FileWriter and BinaryReader/BinaryWriter...
Miguel
@Miguel - here's my unit test file: http://pastie.org/1249910 - it may be that my StreamReader/StreamWriter code wasn't as efficient as yours
Mark Heath
Thank you very much Mark. But using your code I am getting similar results... What can be happening for it to be this way?
Miguel
strange - I am using Windows XP and .NET 3.5, perhaps your setup is different. Are you running my tests exactly? It may be that pre-sizing the dictionary on my CustomBinarySerializer is contributing a lot to its speed advantage
Mark Heath
Yes, I am running the tests just as you posted them. I am running Windows 7 and .NET 4.0. But perhaps my hard disk is slower than yours?
Miguel
my tests are running against a MemoryStream, so the hard disk isn't involved
Mark Heath
Yes, you are right... I had modified it first to read and write to the disk and only after that I tried your code, giving similar results...
Miguel