views:

378

answers:

2

I need to pull large Unicode textual strings (e.g. 200Mb) from a Database (nvarchar) and store in memory for processing. i.e. I need random access to all parts of the strings.

Looking at this from strictly memory centric point of view, what are the pro’s and con’s of using a System.IO.MemoryStream versus a System.String as my in memory representation.

Some factors I am trying to research are:

  • How these objects act in a [hypothetical] highly fragmented low memory environment
  • Immutability
  • Actual size in memory (if stream is UTF8, have we nearly halved size)
  • Is there another object I have not thought about?

I am looking for clarity and advice on these points, as well as any other memory considerations I have not thought of?

Note: There may be better way of processing these strings, but at this point I am realy just asking about the memory consideration of storing such an object.

+3  A: 

Hi

Looking at this from strictly memory centric point of view, what are the pro’s and con’s of using a System.IO.MemoryStream versus a System.String as my in memory representation.

Some factors I am trying to research are:

  • How these objects act in a [hypothetical] highly fragmented low memory environment

IMO, a MemoryStream is only useful when the encoding is trivial (e.g. ASCII, ISO-8859-X, etc.). If the encoding is UTF-8 and you have non-ASCII characters then processing will become more difficult. Sure, the MemoryStream will almost certainly consume less memory, but otherwise there's not much of a difference. Under the hood, a MemoryStream uses a byte array, which also needs to be allocated in the contiguous chunck of memory.

  • Actual size in memory (if stream is UTF8, have we nearly halved size)

Right, with purely ASCII chars, a MemoryStream will consume half of what the equivalent string consumes.

  • Is there another object I have not thought about?
List<byte> // has a nicer interface for processing

How are the strings stored in the database? varchar or nvarchar?

Regards,

Andreas

Andreas Huber
Thanks Andreas, strings are coming out of database as nvarchar.
nick_alot
Hi Andreas. After some further reading on MSDN, i have read that the MemoryStream created with default constructor is resizable, so this would indicate that it does not require a contiguous block of memory. Is this a fair assumption?
nick_alot
see: http://msdn.microsoft.com/en-us/library/system.io.memorystream.memorystream.aspx
nick_alot
In this case, if I am not mistaken, then there's no way with LINQ or ADO to avoid System.String. So at least when you load the data, you need to fit the whole string into memory. Converting that to a byte array before processing further will - until GC collects the string - need even more memory.
Andreas Huber
I am now trying to go from DB --> IDataReader --> StreamWriter --> MemoryStream.This seems to work avoiding any in-memory System.String
nick_alot
No, resizing the MemoryStream will internally simply allocate a byte array twice as big as the current size and then copy the contents of the old array into the new array, so the block is still contiguous after resizing. You can use .NET Reflector to look at MemoryStream.SetLength.
Andreas Huber
Yep, going through the DataReader is probably best. I guess you read in the DB string in multiple blocks and convert them to UTF-8 before writing them into the MemoryStream, right? If so, then calling MemoryStream.SetSize before will avoid any additional allocations and copying.
Andreas Huber
+3  A: 

The memory of a string vs a stream is fairly inconsequential. Strings are utf-16, so there may be a small multiple involved, but because of the volumes involved, you would probably be best off writing the data to a scratch file.

To read the data out of the database, use streaming techniques; i.e. use IDataReader (ExecuteReader), with it in sequential-mode, and read chunks of bytes/characters. Don't attempt to read the entire column.

Also, with SQL Server 2008 you miht want to look at the file-stream type.

Examples:

Marc Gravell