I am working on a project that requires the parsing of log files. I am looking for an fast algorithm that would take groups messages like this:


The temperature at P1 is 35F.

The temperature at P1 is 40F.

The temperature at P3 is 35F.

Logger stopped.

Logger started.

The temperature at P1 is 40F.

and puts out something in the form of a printf()

Here is the desired output.

"The temperature at P%d is %dF.", Int1, Int2" {(1,35), (1, 40), (3, 35), (1,40)}

The algorithm needs to be generic enough to recognize almost any data load in message groups.

I tried googleing for this kind of technology, but I don't even know the correct terms to search for.

+7  A: 

I think you might be overlooking and missed fscanf() and sscanf(). Which are the opposite of fprintf() and sprintf().

John Downey
+2  A: 

It depends on what you are trying to do, if your goal is to quickly generate sprintf() input, this works. If you are trying to parse data, maybe regular expressions would do too..


@John: I think that the question relates to an algorithm that actually recognises patterns in log files and automatically "guesses" appropriate format strings and data for it. The *scanf family can't do that on its own, it can only be of help once the patterns have been recognised in the first place.

Chris Jester-Young
+1  A: 

You're not going to find a tool that can simply take arbitrary input, guess what data you want from it, and produce the output you want. That sounds like strong AI to me.

Producing something like this, even just to recognize numbers, gets really hairy. For example is "123.456" one number or two? How about this "123,456"? Is "35F" a decimal number and an 'F' or is it the hex value 0x35F? You're going to have to build something that will parse in the way you need. You can do this with regular expressions, or you can do it with sscanf, or you can do it some other way, but you're going to have to write something custom.

However, with basic regular expressions, you can do this yourself. It won't be magic, but it's not that much work. Something like this will parse the lines you're interested in and consolidate them (Perl):

my @vals = ();
while (defined(my $line = <>))
    if ($line =~ /The temperature at P(\d*) is (\d*)F./)
        push(@vals, "($1,$2)");
print "The temperature at P%d is %dF. {";
for (my $i = 0; $i < @vals; $i++)
    print $vals[$i];
    if ($i < @vals - 1)
        print ",";
print "}\n";

The output from this isL

The temperature at P%d is %dF. {(1,35),(1,40),(3,35),(1,40)}

You could do something similar for each type of line you need to parse. You could even read these regular expressions from a file, instead of custom coding each one.

Derek Park
+2  A: 

Thanks for all the great suggestions. Chris, is right. I am looking for a generic solution for normalizing any kind of text. The solution of the problem boils down to dynmamically finding patterns in two or more similar strings. Almost like predicting the next element in a set, based on the previous two:

1: Everest is 30000 feet high

2: K2 is 28000 feet high

=> What is the pattern? => Answer:

[name] is [number] feet high

Now the text file can have millions of lines and thousands of patterns. I would like to parse the files very, very fast, find the patterns and collect the data sets that are associated with each pattern.

I thought about creating some high level semantic hashes to represent the patterns in the message strings. I would use a tokenizer and give each of the tokens types a specific "weight". Then I would group the hashes and rate their similarity. Once the grouping is done I would collect the data sets.

I was hoping, that I didn't have to reinvent the wheel and could reuse something that is already out there.


+1  A: 

I don't know of any specific tool to do that. What I did when I had a similar problem to solve was trying to guess regular expressions to match lines.

I then processed the files and displayed only the unmatched lines. If a line is unmatched, it means that the pattern is wrong and should be tweaked or another pattern should be added.

After around an hour of work, I succeeded in finding the ~20 patterns to match 10000+ lines.

In your case, you can first "guess" that one pattern is "The temperature at P[1-3] is [0-9]{2}F.". If you reprocess the file removing any matched line, it leaves "only":

Logger stopped.

Logger started.

Which you can then match with "Logger (.+).".

You can then refine the patterns and find new ones to match your whole log.

Vincent Robert
+6  A: 


A naïve!! algorithm keeps track of the frequency of words in a per-column manner, where one can assume that each line can be separated into columns with a delimiter.

Example input:

The dog jumped over the moon
The cat jumped over the moon
The moon jumped over the moon
The car jumped over the moon


Column 1: {The: 4}
Column 2: {car: 1, cat: 1, dog: 1, moon: 1}
Column 3: {jumped: 4}
Column 4: {over: 4}
Column 5: {the: 4}
Column 6: {moon: 4}

We could partition these frequency lists further by grouping based on the total number of fields, but in this simple and convenient example, we are only working with a fixed number of fields (6).

The next step is to iterate through lines which generated these frequency lists, so let's take the first example.

  1. The: meets some hand-wavy criteria and the algorithm decides it must be static.
  2. dog: doesn't appear to be static based on the rest of the frequency list, and thus it must be dynamic as opposed to static text. We loop through a few pre-defined regular expressions and come up with /[a-z]+/i.
  3. over: same deal as #1; it's static, so leave as is.
  4. the: same deal as #1; it's static, so leave as is.
  5. moon: same deal as #1; it's static, so leave as is.

Thus, just from going over the first line we can put together the following regular expression:

/The ([a-z]+?) jumps over the moon/


  • Obviously one can choose to scan part or the whole document for the first pass, as long as one is confident the frequency lists will be a sufficient sampling of the entire data.

  • False positives may creep into the results, and it will be up to the filtering algorithm (hand-waving) to provide the best threshold between static and dynamic fields, or some human post-processing.

  • The overall idea is probably a good one, but the actual implementation will definitely weigh in on the speed and efficiency of this algorithm.


@Derek Park: Well, even a strong AI couldn't be sure it had the right answer.

Perhaps some compression-like mechanism could be used:

  1. Find large, frequent substrings
  2. Find large, frequent substring patterns. (i.e. [pattern:1] [junk] [pattern:2])

Another item to consider might be to group lines by edit-distance. Grouping similar lines should split the problem into one-pattern-per-group chunks.

Actually, if you manage to write this, let the whole world know, I think a lot of us would like this tool!

Anders Eurenius


Well, even a strong AI couldn't be sure it had the right answer.

I was thinking that sufficiently strong AI could usually figure out the right answer from the context. e.g. Strong AI could recognize that "35F" in this context is a temperature and not a hex number. There are definitely cases where even strong AI would be unable to answer. Those are the same cases where a human would be unable to answer, though (assuming very strong AI).

Of course, it doesn't really matter, since we don't have strong AI. :)

Derek Park