I'm attempting to parse an Oracle trace file using regular expressions. My language of choice is C#, but I chose to use Ruby for this exercise to get some familiarity with it.

The log file is somewhat predictable. Most lines (99.8%, to be specific) match the following pattern:

# [Timestamp]                  [Thread]  [Event]   [Message]
# TIME:2010/08/25-12:00:01:945 TID: a2c  (VERSION) Managed Assembly version: 2.102.2.20
# TIME:2010/08/25-14:00:02:398 TID:1a60  OpsSqlPrepare2(): SELECT * FROM MyTable
line_regex = /^TIME:(\S+)\s+TID:\s*(\S+)\s+(\S+)\s+(.*)$/
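As a quick illustration of what that pattern captures, here is a small sketch that runs it over the two sample lines above. The third input is a made-up continuation line, included only to show that such lines do not match at all.

```ruby
line_regex = /^TIME:(\S+)\s+TID:\s*(\S+)\s+(\S+)\s+(.*)$/

samples = [
  "TIME:2010/08/25-12:00:01:945 TID: a2c  (VERSION) Managed Assembly version: 2.102.2.20",
  "TIME:2010/08/25-14:00:02:398 TID:1a60  OpsSqlPrepare2(): SELECT * FROM MyTable",
  "  FROM MyOtherTable WHERE id = 1" # made-up continuation line, not from the real log
]

parsed = samples.map do |line|
  m = line_regex.match(line)
  # m is nil when the line does not fit the pattern (e.g. a continuation line).
  m && { timestamp: m[1], thread: m[2], event: m[3], message: m[4] }
end

parsed.each { |row| p row }
```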

However, in a few places in the log there are much more complicated queries that, for some reason, span several lines:

Screenshot

Two things to point out about these entries: they appear to cause some sort of corruption in the log file, because they end with unprintable characters, and the next entry then begins on the same line.

Since this obviously rules out capturing data on a per-line basis, I think the next best option is to match everything between the word "TIME:" and either the next instance of "TIME:" or the end of the file. I'm not sure how to express this using regular expressions.

Is there a more efficient approach? The log file I need to parse will be over 1.5GB. My intention is to normalize the lines, and drop unnecessary lines, to eventually insert them as rows in a database for querying.

Thanks!

+1  A: 

It might be better to do this old-school, i.e. read your file in one line at a time... start at the first 'TIME', and concatenate your lines until you hit the next 'TIME'... you can use regular expressions to filter out any lines you don't want.

I can't speak to Ruby; in C# it would be a StreamReader, of course, which helps you deal with the file size.
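For what it's worth, a minimal Ruby sketch of the same idea might look like the following. The helper name `each_record` and the sample text are hypothetical, and the sketch assumes that a new entry can begin in the middle of a physical line (as the question describes), so it splits before every "TIME:" rather than only at line starts.

```ruby
require "stringio"

# Hypothetical helper: groups physical lines into logical records that each
# start with "TIME:". Only one record is held in memory at a time.
def each_record(io)
  buffer = nil
  io.each_line do |line|
    # Split before every "TIME:" so an entry that starts mid-line
    # (right after a corrupted one) still begins a new record.
    line.split(/(?=TIME:)/).each do |piece|
      if piece.start_with?("TIME:")
        yield buffer if buffer
        buffer = piece.dup
      elsif buffer
        buffer << piece
      end
      # Text before the very first "TIME:" is ignored.
    end
  end
  yield buffer if buffer
end

# Made-up sample standing in for the real trace file.
sample = StringIO.new(
  "TIME:2010/08/25-14:00:02:398 TID:1a60  OpsSqlPrepare2(): SELECT *\n" \
  "  FROM MyTable\x01\x02TIME:2010/08/25-14:00:03:001 TID: a2c  (VERSION) Managed Assembly version: 2.102.2.20\n"
)

records = []
each_record(sample) { |record| records << record }
records.each { |record| p record }
```

With a real file, something like `File.open(path, "rb") { |f| each_record(f) { |r| ... } }` would stream it instead of loading it all at once. Opening in binary mode is an assumption here: regex operations on strings containing bytes that are invalid in the default encoding can raise an `ArgumentError`, and the unprintable characters in the log may well be such bytes.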

James B
+2  A: 

The regex to match potentially multi-line data between "TIME:" and either the next "TIME:" string or the end of the file is:

/^TIME:(.+?)(?=TIME:|\z)/im
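A minimal sketch of applying this kind of pattern with `String#scan`, assuming the text being scanned fits in memory (which is unlikely for the whole 1.5GB file in one go). The sample text is made up. The sketch also deliberately uses a variant without the leading `^` and without the `i` flag: in Ruby `^` only matches at the start of a line, so with it an entry that begins mid-line, right after a corrupted one, would not start a new match and would be skipped.

```ruby
# Variant of the pattern without the "^" anchor (an assumption of this
# sketch), so an entry that starts mid-line still begins a new match.
# The /m flag lets "." match newlines, so one match can span several lines.
record_regex = /TIME:(.+?)(?=TIME:|\z)/m

# Made-up sample: the second entry spans two lines and ends in control
# characters, and the third entry starts on that same physical line.
sample = "TIME:2010/08/25-12:00:01:945 TID: a2c  (VERSION) Managed Assembly version: 2.102.2.20\n" \
         "TIME:2010/08/25-14:00:02:398 TID:1a60  OpsSqlPrepare2(): SELECT *\n" \
         "  FROM MyTable\x01\x02TIME:2010/08/25-14:00:03:001 TID: a2c  (VERSION) Managed Assembly version: 2.102.2.20\n"

# scan returns one array of capture groups per match; flatten it and
# re-attach the "TIME:" prefix that sits outside the capture group.
records = sample.scan(record_regex).flatten.map { |body| "TIME:#{body}" }

records.each { |record| p record }
```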

On the other hand, as James mentions, tokenizing on "TIME:" substrings, or looking for substring positions of "\r\nTIME:" (after the first "TIME:" entry, depending on line-break format), may prove a better approach.

Rudu