I have written a program which analyzes a project's source code and reports various issues and metrics based on the code.

To analyze the source code, I load the code files that exist in the project's directory structure and analyze the code from memory. The code goes through extensive processing before it is passed to other methods to be analyzed further.

The code is passed around to several classes when it is processed.

The other day I was running it on one of the larger projects my group has, and my program crapped out on me because there was too much source code loaded into memory. This is a corner case at this point, but I want to be able to handle the issue in the future.

What would be the best way to avoid memory issues?

I'm thinking about loading the code, doing the initial processing of each file, and then serializing the results to disk, so that when I need to access them again I do not have to go through the process of manipulating the raw code again. Does this make sense? Or is the serialization/deserialization more expensive than processing the code again?
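For concreteness, something along these lines is what I have in mind (a rough sketch using BinaryFormatter; the ProcessedFile class and its members are made up for illustration):

    using System;
    using System.IO;
    using System.Runtime.Serialization.Formatters.Binary;

    [Serializable]
    class ProcessedFile   // hypothetical container for the results of the initial processing
    {
        public string Path;
        public string[] Tokens;
    }

    static void SaveResults(ProcessedFile results, string cachePath)
    {
        // Write the processed form to disk so the raw code can be released from memory.
        using (var stream = File.Create(cachePath))
            new BinaryFormatter().Serialize(stream, results);
    }

    static ProcessedFile LoadResults(string cachePath)
    {
        // Read it back only when a later analysis pass actually needs it.
        using (var stream = File.OpenRead(cachePath))
            return (ProcessedFile)new BinaryFormatter().Deserialize(stream);
    }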

I want to keep a reasonable level of performance while addressing this problem. Most of the time, the source code will fit into memory without issue, so is there a way to only "page" my information when I am low on memory? Is there a way to tell when my application is running low on memory?
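One API that looks relevant here is MemoryFailPoint in System.Runtime: it reserves an estimate of the memory an operation will need and throws InsufficientMemoryException if that much is unlikely to be available. A minimal sketch of what I mean by only "paging" when low on memory (the 100 MB figure is just a guess at a per-file budget):

    using System;
    using System.Runtime;

    static void ProcessFile(string path)
    {
        try
        {
            // Check that roughly 100 MB is likely to be available before
            // loading and processing the next file entirely in memory.
            using (new MemoryFailPoint(100))
            {
                // load the file, run the initial processing, keep the results in memory
            }
        }
        catch (InsufficientMemoryException)
        {
            // fall back to the slower path: serialize intermediate results to disk
        }
    }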

Update: The problem is not that a single file fills memory; it's that all of the files in memory at once fill memory. My current idea is to rotate them to and from the disk drive as I process them.

A: 

Use WinDbg with SOS to see what is holding on to the string references (or whatever is causing the extreme memory usage).

leppie
It has to do with the fact that the folder I was analyzing was 1.6GB (including compiled binaries; I am not loading those, but the amount of code is still massive).
phsr
A: 

Serializing/deserializing sounds like a good strategy. I've done a fair amount of this and it is very fast. In fact, I have an app that instantiates objects from a DB and then serializes them to the hard drives of my web nodes. It has been a while since I benchmarked it, but it was serializing several hundred objects a second, and maybe over 1,000, back when I was load testing.

Of course it will depend on the size of your code files. My files were fairly small.

Matt Wrock
+1  A: 

If the problem is that a single copy of your code is causing you to fill the available memory, then there are at least two options:

  • Serialize to disk.
  • Compress files in memory. If you have CPU to spare, it can be faster to zip and unzip the information in memory than to cache it to disk (see the sketch after this list).
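
A rough sketch of the in-memory compression idea, using GZipStream (the helper names are just for illustration):

    using System.IO;
    using System.IO.Compression;
    using System.Text;

    // Compress a source file's text so it takes less space while it sits in memory.
    static byte[] Compress(string sourceText)
    {
        var output = new MemoryStream();
        using (var gzip = new GZipStream(output, CompressionMode.Compress))
        {
            byte[] bytes = Encoding.UTF8.GetBytes(sourceText);
            gzip.Write(bytes, 0, bytes.Length);
        }
        return output.ToArray();
    }

    // Decompress it again when a later analysis pass needs the text.
    static string Decompress(byte[] compressed)
    {
        using (var gzip = new GZipStream(new MemoryStream(compressed), CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
            return reader.ReadToEnd();
    }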

You should also check that you are disposing of objects properly. Are your memory problems due to old copies of objects being kept alive in memory?
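For example, anything that holds a file handle or other unmanaged resource should be wrapped in a using block so it is released as soon as you are done with it (a minimal sketch):

    using System.IO;

    static string LoadSource(string path)
    {
        // The using block guarantees the reader and its file handle are
        // released as soon as the text has been read into memory.
        using (var reader = new StreamReader(path))
            return reader.ReadToEnd();
    }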

Shiraz Bhaiji
+2  A: 

1.6GB is still manageable and by itself should not cause memory problems. Inefficient string operations might do it.

As you parse the source code you probably split it apart into certain substrings - tokens or whatever you call them. If your tokens combined account for the entire source code, that doubles memory consumption right there. Depending on the complexity of the processing you do, the multiplier can be even bigger. My first move here would be to take a closer look at how you use your strings and find a way to optimize it - e.g. discarding the original after the first pass, compressing the whitespace, or using indexes (pointers) into the original strings rather than actual substrings - there are a number of techniques which can be useful here.
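As an illustration of the last point, a token can be stored as an offset and length into the single original source string instead of as a separate substring; the actual text is only materialized when it is needed (the names here are made up):

    // A token that points into the original source text instead of copying it.
    struct Token
    {
        public int Start;     // offset into the original source string
        public int Length;    // length of the token

        // Materialize the substring only when it is actually needed.
        public string GetText(string source)
        {
            return source.Substring(Start, Length);
        }
    }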

If none of this helps, I would resort to swapping them to and from the disk.

mfeingold
This makes sense, because I keep various states of the file available, probably increasing the size threefold.
phsr