





I've been working on a regular expression to parse the output of a series of SQLIO runs. I've gotten pretty far, but I'm not quite there yet. I'm looking for a 100% regex solution with no pre-manipulation of the input. Could anyone offer some guidance on the following regular expression:


Here's a snippet of the output - note the headers, which change during the SQLIO batch run: File


The problem appears to be here:

    using 8KB random IOs
    buffering set to use hardware disk cache (but not file cache)

After capturing the cluster size, you use .*\n to consume the second line before going on to capture the file size, but sometimes there's a third line:

    using 8KB random IOs
    enabling multiple I/Os per thread with 8 outstanding
    buffering set to use hardware disk cache (but not file cache)

I added (?:.*\n)? to the relevant section of the regex, and now it matches all 36 entries.
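
To illustrate what that optional line buys you, here is a minimal Python sketch. It is a cut-down stand-in for the relevant part of the regex, not the original expression. Python's `re` spells named groups `(?P<name>...)` rather than `(?<name>...)`, and the `size:` line in the sample text is an assumed shape for the file-size line.

```python
import re

# The first lines of each sample are quoted in this thread; the "size:" line
# is an assumed shape for the line that carries the file size.
TWO_LINE = (
    "    using 8KB random IOs\n"
    "    buffering set to use hardware disk cache (but not file cache)\n"
    "using specified size: 100 MB for file: testfile.dat\n"
)
THREE_LINE = (
    "    using 8KB random IOs\n"
    "    enabling multiple I/Os per thread with 8 outstanding\n"
    "    buffering set to use hardware disk cache (but not file cache)\n"
    "using specified size: 100 MB for file: testfile.dat\n"
)

# Strict: exactly one line between the cluster-size line and the size line.
STRICT = re.compile(
    r"using (?P<clustersize>\d+)KB.*\n.*\n.*size: (?P<filesize>\d+)"
)
# Tolerant: (?:.*\n)? allows one extra line in between.
TOLERANT = re.compile(
    r"using (?P<clustersize>\d+)KB.*\n(?:.*\n)?.*\n.*size: (?P<filesize>\d+)"
)


def parse_header(text, pattern=TOLERANT):
    """Return the captured groups as a dict, or None if nothing matched."""
    m = pattern.search(text)
    return m.groupdict() if m else None


print(parse_header(TWO_LINE))            # matches
print(parse_header(THREE_LINE))          # matches as well
print(parse_header(THREE_LINE, STRICT))  # None: the extra line breaks it
```

Because `.` does not cross line breaks here, the optional group can only ever swallow one whole line, so the two-line headers keep matching as before.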

I know you want to go 100% regex, but have you considered writing the regex in extended format with comments (i.e., IgnorePatternWhitespace mode)? I would also recommend using more literal text in the regex to make it easier to follow. For example,

    (?<threads>\d+) threads? reading for (?<Seconds>\d+) secs.*\n

instead of


Unreadable code is unmaintainable code, and regexes need all the help they can get. :-/
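
For what it's worth, here is a rough sketch of the extended, commented style, written in Python as a stand-in (`re.VERBOSE` plays the role of IgnorePatternWhitespace, and named groups are spelled `(?P<name>...)`). The sample line is an assumption, shaped only to fit the pattern.

```python
import re

# Verbose-mode version of the "threads ... secs" line. In re.VERBOSE,
# unescaped whitespace is ignored, so literal spaces are written as [ ].
THREADS_LINE = re.compile(
    r"""
    (?P<threads>\d+) [ ] threads? [ ] reading   # thread count, singular or plural
    [ ] for [ ] (?P<Seconds>\d+) [ ] secs       # run length in seconds
    .*\n                                        # ignore the rest of the line
    """,
    re.VERBOSE,
)

# Assumed sample line, shaped to fit the pattern above.
sample = "2 threads reading for 120 secs from file testfile.dat\n"
match = THREADS_LINE.search(sample)
print(match.group("threads"), match.group("Seconds"))  # prints: 2 120
```

The catch with this mode is that every literal space has to be made explicit (`[ ]` or `\ `), which is the price for being able to lay the pattern out and comment it.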

Alan Moore
How could I have missed that one? Thanks! I'll have to look into the more reader-friendly formats. I definitely will not understand my own expression in 2-3 weeks :)

To hell with counting lines: as long as the order doesn't change, you can do the following. Oh, and using /x for big regexes helps. ;)


    (?>                  # atomic group: don't backtrack in here once matched
      .{0,400}?          # lazy and bounded, so we can't run into the next result; needs /s to cross newlines
      \b for\s+(?<Seconds>\d+)\s*sec )
    (?> .{0,400}? \b using\s+(?<clustersize>\d+)\s*KB )
    (?> .{0,400}? \b size:\s+(?<currentfilesize>\d+) )
    (?> .{0,400}? \b IOs/sec\D*(?<IOs>\d+\.\d+) )
    (?> .{0,400}? \b MBs/sec\D*(?<MBs>\d+\.\d+) )
    (?> .{0,400}? \b Min_Latency\D*(?<MinLatency_ms>\d+) )
    (?> .{0,400}? \b Avg_Latency\D*(?<AvgLatency_ms>\d+) )
    (?> .{0,400}? \b Max_Latency\D*(?<MaxLatency_ms>\d+) )



This is PCRE/Perl syntax, with qr§...§ used for quoting.
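
If you want to try the idea outside Perl, here is a rough Python sketch of the same approach: one group per field, each starting with a bounded lazy skip. Several things in it are assumptions: the sample text and all its numbers are made up for illustration, the MBs/sec group is named `MBs`, and named groups use Python's `(?P<name>...)` spelling. Atomic groups only exist in Python 3.11 and later, so the sketch falls back to plain non-capturing groups on older versions.

```python
import re
import sys

# Atomic groups (?>...) need Python 3.11+. Older versions fall back to a plain
# non-capturing group, which gives the same result on well-formed input.
OPEN = "(?>" if sys.version_info >= (3, 11) else "(?:"

# One anchor per field, in the order the fields appear in the output.
FIELDS = [
    r"\bfor\s+(?P<Seconds>\d+)\s*sec",
    r"\busing\s+(?P<clustersize>\d+)\s*KB",
    r"\bsize:\s+(?P<currentfilesize>\d+)",
    r"\bIOs/sec\D*(?P<IOs>\d+\.\d+)",
    r"\bMBs/sec\D*(?P<MBs>\d+\.\d+)",
    r"\bMin_Latency\D*(?P<MinLatency_ms>\d+)",
    r"\bAvg_Latency\D*(?P<AvgLatency_ms>\d+)",
    r"\bMax_Latency\D*(?P<MaxLatency_ms>\d+)",
]

# Each field becomes: group opener, bounded lazy skip, anchor, closing paren.
PATTERN = re.compile(
    "".join(OPEN + r".{0,400}?" + field + ")" for field in FIELDS),
    re.DOTALL,  # let . cross line breaks, like /s in Perl
)

# Made-up sample: two runs, one with a two-line and one with a three-line header.
SAMPLE = """\
2 threads reading for 120 secs from file testfile.dat
    using 8KB random IOs
    buffering set to use hardware disk cache (but not file cache)
using specified size: 100 MB for file: testfile.dat
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec: 11345.34
MBs/sec:    88.63
latency metrics:
Min_Latency(ms): 0
Avg_Latency(ms): 1
Max_Latency(ms): 23
4 threads writing for 30 secs to file testfile.dat
    using 64KB sequential IOs
    enabling multiple I/Os per thread with 8 outstanding
    buffering set to use hardware disk cache (but not file cache)
using specified size: 200 MB for file: testfile.dat
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec: 512.25
MBs/sec:    32.01
latency metrics:
Min_Latency(ms): 2
Avg_Latency(ms): 15
Max_Latency(ms): 140
"""

# One dict of captured fields per run.
runs = [m.groupdict() for m in PATTERN.finditer(SAMPLE)]
for run in runs:
    print(run)
```

Since nothing here depends on line counts, the extra "enabling multiple I/Os" line in the second run makes no difference; only the order of the fields matters.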
