I've got an exception log from one of our production releases.

System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Text.RegularExpressions.Match..ctor(Regex regex, Int32 capcount, String text, Int32 begpos, Int32 len, Int32 startpos)
   at System.Text.RegularExpressions.RegexRunner.InitMatch()
   at System.Text.RegularExpressions.RegexRunner.Scan(Regex regex, String text, Int32 textbeg, Int32 textend, Int32 textstart, Int32 prevlen, Boolean quick)
   at System.Text.RegularExpressions.Regex.Run(Boolean quick, Int32 prevlen, String input, Int32 beginning, Int32 length, Int32 startat)
   at System.Text.RegularExpressions.MatchCollection.GetMatch(Int32 i)
   at System.Text.RegularExpressions.MatchEnumerator.MoveNext()

The data it was trying to process was about 800 KB.

In my local tests it works perfectly fine. Have you ever seen similar behaviour? What could be the cause?

Should I split the text before processing it? Obviously, in that case a regex might fail to match, because the original file would be split at an arbitrary place.

My Regexes:

EDIT 2:

I think this particular regex is causing the problem. When I test it in an isolated environment, it eats up memory instantly.

((?:( |\.\.|\.|""|'|=)[\/|\?](?:[\w#!:\.\?\+=&@!$'~*,;\/\(\)\[\]\-]|%[0-9a-f]{2})*)( |\.|\.\.|""|'| ))?

EDIT

My local test was wrong. I was loading a big string and then appending to it, which makes the .NET Framework dizzy and then throw an OOM exception during the regex instead of during the string operations (or at random), so ignore what I said earlier.

This is a .NET Framework 2.0 application.

+1  A: 

The first thing I would try, if it is possible for your application, would be to split up the input.

Would it be possible to read the file (if the input is a file) line-by-line, applying the Regular Expression that way?

You should take a look with CLR Profiler. It can take a little time to learn how to use, but it's worth it. It will help you visualize how much memory your objects use.
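As a rough sketch of the line-by-line idea, assuming none of your patterns need to match across a line break: the helper below reads from any TextReader and applies the regex to one line at a time, so the whole input never has to sit in a single string. The names LineScanner and MatchLines are made up for illustration, and the code sticks to constructs that should compile on .NET 2.0.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class LineScanner
{
    // Hypothetical helper: runs the pattern against each line in turn,
    // so only the current line is handed to the regex engine.
    public static List<string> MatchLines(TextReader reader, Regex pattern)
    {
        List<string> hits = new List<string>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            for (Match m = pattern.Match(line); m.Success; m = m.NextMatch())
            {
                // Skip zero-length matches; they carry no text.
                if (m.Length > 0)
                {
                    hits.Add(m.Value);
                }
            }
        }
        return hits;
    }
}
```

For a file you would pass a StreamReader, e.g. LineScanner.MatchLines(new StreamReader(path), regex), where path is whatever your input file is. A pattern that relies on matching across line breaks will not find those matches this way.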

Terrapin
Two problems with that. First, I'm using the Singleline option in my regexes, so reading the input line by line might break some of them. Second, this would hurt performance on shorter files (obviously I could switch depending on the size, but that sounds dirty :) However, if I can't fix it, it's a nice idea).
dr. evil
I'm betting that the performance difference between line-by-line for small files, and applying the regex all at once is going to be minimal enough that it won't have a noticeable impact.
Terrapin
Yeah I need to test it and see if it works.
dr. evil
I know that StringBuilder is good for that, but what I meant was that using StringBuilder with Regex sounds like it helps the regex runner use memory more efficiently, which is totally new to me.
dr. evil
It's possible, and I'm not sure if StringBuilder somehow allows Regex to work more efficiently, but IMO that's not something you should rely on. If you're running out of memory, that may be indicative of a larger design problem that should be addressed.
Terrapin
Apparently I messed up my local test; please check out my final edit.
dr. evil
+1  A: 

Based on your edit, it sounds like your code may be creating strings which take up large amounts of memory. This would mean that even though the out of memory exception is generated from within the Regex code, it's not actually because the Regex itself is taking up too much memory. Therefore, if using StringBuilder in your own code resolves the issue, then that's what you should do.
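To illustrate the difference, here is a minimal sketch with two made-up helpers (InputBuilder, ConcatNaive and ConcatWithBuilder are names I invented, not anything from your code). .NET strings are immutable, so each += in the first method allocates a brand-new string and copies everything accumulated so far. With inputs in the hundreds of kilobytes, those intermediate copies are large enough to land on the large object heap, which I'd expect to fragment quickly under that pattern.

```csharp
using System;
using System.Text;

public static class InputBuilder
{
    // Repeated concatenation: every += creates a new string and
    // copies the previous contents into it.
    public static string ConcatNaive(string[] chunks)
    {
        string result = "";
        foreach (string chunk in chunks)
        {
            result += chunk;
        }
        return result;
    }

    // StringBuilder appends into a growable buffer and produces
    // one final string when ToString() is called.
    public static string ConcatWithBuilder(string[] chunks)
    {
        StringBuilder sb = new StringBuilder();
        foreach (string chunk in chunks)
        {
            sb.Append(chunk);
        }
        return sb.ToString();
    }
}
```

Both methods return the same text; only the number of intermediate allocations differs.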

kvb
Exactly, you're right. I just figured that out and updated my sample. This wasn't the behaviour of the original code though; this was testing code. Check out my final edit.
dr. evil
+1  A: 

Without seeing your regex I can't say for sure, but sometimes you get problems like this because your quantifiers are greedy instead of lazy.

The Regex engine has to store lots of information internally and Greedy matches can end up causing the Regex to select large sections of your 800k string, many times over.

There's some good information about this over here.
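A small sketch of the difference, using a toy input rather than your data (QuantifierDemo and FirstMatch are just illustrative names): the greedy .* runs to the end of the input and backtracks to the last closing tag, while the lazy .*? stops at the first one.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Returns the text of the first match of the pattern in the input.
    public static string FirstMatch(string pattern, string input)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        string input = "<a>one</a><a>two</a>";
        // Greedy: prints <a>one</a><a>two</a>
        Console.WriteLine(FirstMatch("<a>.*</a>", input));
        // Lazy: prints <a>one</a>
        Console.WriteLine(FirstMatch("<a>.*?</a>", input));
    }
}
```

On an 800 KB input, a greedy match of this kind can span most of the text before it backtracks.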

Russ C
I've added my regexes to the question.
dr. evil
Do you have a small snippet of what you're trying to match? Is it HTML, or is it text that might have URLs in it?
Russ C
On the latest regex, what happens if you change the last '*' to '*?'?
Russ C
It's HTML. Just use Yahoo's page source; it takes ages on anything long, actually. I'll try that change now.
dr. evil
Didn't seem to work for me; that regex maxes out the CPU in my test program with a 30 KB HTML file!
Russ C
It works, but it doesn't match what I want any more. Try this and see the difference: " /cool.htm dontmatch". Actually, the regex itself is rubbish; I'm sure there is a better way to write this.
dr. evil
What is the regex actually meant to match? As you say, I think it's probably doing the wrong thing. The example text you gave me actually generated 19 matches!
Russ C
This runs a lot quicker, but I don't know if it's accurate: ( |\.\.|\.|""|'|=)[\/|\?](?:[\w#!:\.\?\+=\/\(\)\[\]\-]|%[0-9a-f]{2})*( |\.|\.\.|""|'| )
Russ C
This looks better to me, it doesn't match the /css in a link type="text/css" ...(?:="?)([\/|\?](?:[\w#!:\.\?\+=\/\(\)\[\]\-]|%[0-9a-f]{2})*)
Russ C
Hi Slough, any news?
Russ C