ansaurus

Question

VERY slow running regular expression when using large documents

Answer 1

+2 A:

First silly question - are you using RegexOptions.Compiled?

Jon Skeet 2008-10-02 16:02:47

yes - I'm using the compiled option - but I've tried both on/off to compare results

2008-10-02 17:34:56

Answer 2

+7 A:

I believe the problem is that when it finds a span or font tag that has no style attribute defined, it will keep looking for one until the end of the document because of the ".*?". I haven't tested it, but changing it to "[^>]*?" might improve performance, since that cannot scan past the end of the tag.

EDIT: Make sure you apply that change for all of the ".*?" you have; even the one capturing the content between tags (use "[^<]*?" there), because if the file is not well-formed, it will capture up to the next closing tag.
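A minimal sketch of the difference, using Python's re module as a stand-in for the .NET engine (the syntax used here is common to both). The sample HTML and the names loose and tight are made up for illustration; the pattern is a shortened form of the one under discussion.

```python
import re

# Hypothetical sample: the first tag has no style attribute at all.
html = '<span class="x">plain</span> <font style="font-style: italic">slanted</font>'

# ".*?" is free to run past the ">" that closes the tag it started in.
loose = re.compile(r'<(span|font) .*?style=".*?font-style:\s*italic[^>]*>')

# "[^>]*?" cannot cross ">", so the search stays inside a single tag.
tight = re.compile(r'<(span|font) [^>]*?style="[^>]*?font-style:\s*italic[^>]*>')

loose_match = loose.search(html)
tight_match = tight.search(html)

# The loose pattern starts at the <span> and swallows text up to the <font> tag.
print(loose_match.group(1), repr(loose_match.group(0)))
# The tight pattern rejects the <span> quickly and matches only the <font> tag.
print(tight_match.group(1), repr(tight_match.group(0)))
```

On a large document, each tag without a style attribute makes the loose pattern scan forward through the rest of the text before giving up, which is the likely source of the slowdown.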

Santiago Palladino 2008-10-02 16:07:32

Agreed. Any time you're trying to improve on a regex (whether for performance, clarity, or pretty much anything else), .* is the first thing you should look at and try to eliminate.

Dave Sherohman 2008-10-02 16:10:12

I totally agree. I partially inherited updating this code so I was trying to do it using the same pattern of using a regex - I probably wouldn't have used a straight regex otherwise.

2008-10-02 17:37:42

Answer 3

A:

Try to use the StringBuilder class in your CleanFormatting routine, instead of the String class. Speeds up string construction quite nicely.

Treb 2008-10-02 16:09:23

Answer 4

A:

~~.NET regular expressions does not support recursive constructs.~~ PCRE does, but that doesn't matter here.

Consider

<font style="font-weight: bold;"> text1 <font color="blue"> text2 </font> text3 </font>

It would get converted into

<b> text1 <font color="blue"> text2 </b> text3 </font>

My suggestion would be to use a proper markup parser, and maybe use regexp on the values of the style-tags.
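The mis-pairing above can be reproduced with a short sketch. Python's re module stands in for the .NET engine here, and the pattern is a simplified, hypothetical version of a bold-conversion rule, not the original poster's code.

```python
import re

html = ('<font style="font-weight: bold;"> text1 '
        '<font color="blue"> text2 </font> text3 </font>')

# The lazy ".*?" stops at the FIRST </font>, which belongs to the inner tag.
bold = re.compile(r'<font style="font-weight: bold;">(.*?)</font>')

converted = bold.sub(r'<b>\1</b>', html)
print(converted)
```

The outer opening tag gets paired with the inner closing tag, so the result is no longer well nested.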

Edit: Scratch that. It seems .NET has a construct for balanced, recursive patterns, though it is not as powerful as those in PCRE/Perl.

(?<N>content) would push N onto a stack if content matches
(?<-N>content) would pop N from the stack, if content matches.
(?(N)yes|no) would match "yes" if N is on the stack, otherwise "no".

See http://weblogs.asp.net/whaggard/archive/2005/02/20/377025.aspx for details.

MizardX 2008-10-02 16:27:54

It is perfectly possible to handle that in .NET. See balancing groups in regexes, which allow you to match balanced parentheses, for instance. http://oreilly.com/catalog/regex2/chapter/index.html

Santiago Palladino 2008-10-02 16:54:43

Answer 5

A:

Wild guess: I believe the cost comes from the alternation (span|font) and the corresponding backreference. You might want to try to replace:

"(<(span|font) .*?style=\".*?font-style:\\s*italic[^>]*>)(.*?)</\\2>", "$1<i>$3</i></$2>"

with two separate expressions:

"(<span .*?style=\".*?font-style:\\s*italic[^>]*>)(.*?)</span>", "$1<i>$2</i></span>"
"(<font .*?style=\".*?font-style:\\s*italic[^>]*>)(.*?)</font>", "$1<i>$2</i></font>"

Granted, that doubles the parsing of the file, but with simpler regexes and less backtracking, it might be faster in practice. It is not very nice (repeated code), but as long as it works...
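A quick way to check that the split version produces the same output as the combined one, sketched with Python's re module as a stand-in for .NET (Python writes the replacement groups as \1 rather than $1). The sample input and variable names are made up, and the sketch says nothing about which version is faster.

```python
import re

# Hypothetical sample with one italic span and one italic font tag.
html = ('<span class="a" style="color:red; font-style: italic">hello</span> and '
        '<font face="x" style="font-style:italic;">world</font>')

# Combined form: alternation plus a backreference to the tag name.
combined = re.sub(
    r'(<(span|font) .*?style=".*?font-style:\s*italic[^>]*>)(.*?)</\2>',
    r'\1<i>\3</i></\2>', html)

# Split form: one pass per tag name, no alternation and no backreference.
split = html
for tag in ("span", "font"):
    split = re.sub(
        r'(<%s .*?style=".*?font-style:\s*italic[^>]*>)(.*?)</%s>' % (tag, tag),
        r'\1<i>\2</i></%s>' % tag, split)

print(combined == split)
print(split)
```

Timing both forms on a real document is the only way to tell whether the extra pass pays off.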

Funnily, I did something similar (I don't have the code at hand) to clean up HTML generated by a tool, simplifying it so that JavaHelp can understand it... It is one case where using regexes against HTML is OK, because the HTML is not written by a human who makes mistakes or changes little things, but generated by a process with well-defined patterns.

PhiLho 2008-10-07 19:17:56

Answer 6

A:

I have a similar problem when using a regex to split a SQL script into sections from GO to GO. My regex is @"([\s\S]?)^GO(\s|$)" and I'm extracting match group 1 from it. It runs very slowly on large sections because of the "?". I know that something like @"([^GO]+)(\s|$)" would be faster, but it doesn't match the text up to GO; it matches the text up to the first G or O, because [^GO] is a character class that excludes the single characters G and O. I don't know how to correct or improve it.
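One possible alternative, swapped in here and not taken from the thread, is to split on lines that contain only GO instead of capturing everything before each GO. The sketch uses Python's re module as a stand-in for .NET; split_batches, GO_LINE, and the sample script are hypothetical names and data, and the sample assumes "\n" line endings.

```python
import re

# A separator is a line consisting only of GO plus optional trailing blanks.
GO_LINE = re.compile(r"^GO[ \t]*$", re.MULTILINE | re.IGNORECASE)

def split_batches(script):
    """Split a SQL script on GO separator lines and drop empty batches."""
    parts = GO_LINE.split(script)
    return [p.strip() for p in parts if p.strip()]

script = "SELECT 1;\nGO\nSELECT 'GOAL';\nGO\n"
batches = split_batches(script)
print(batches)

# Why "[^GO]+" does not help: it stops at any single G or O.
prefix = re.match(r"[^GO]+", "LOG IN").group(0)
print(repr(prefix))
```

Because the separator pattern is anchored to whole lines, the engine never has to extend a lazy capture one character at a time across a large section.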

Pa0l0 2009-08-06 08:29:52

Answer 7

A:

During testing I found some strange behavior: when the regex runs in a separate thread, it runs a lot faster. I have a SQL script that I was splitting into sections from GO to GO using a regex. Without a separate thread, processing the script takes about 2 minutes. With multithreading, it takes only a few seconds.

Pa0l0 2009-08-06 10:00:31
