I'm working with some log files that are very poorly formatted: the column delimiter is a character that often appears inside a field, and it isn't escaped. For example:

sam,male,september,brown,blue,i like cats, and i like dogs

Where:

name,gender,month,hair,eyes,about

So as you can see, the about field contains the column delimiter, which means a single split on the delimiter won't work, because it'll break the about text into two separate columns. Now imagine this with a chat system... you can visualize the issues, I'm sure.

So, theoretically, what's the best approach to solving this? I'm not looking for a language-specific implementation but more of a general pointer in the right direction, or some ideas on how others have solved it... without doing it manually.

Edit:

I should clarify: my actual logs are in a much worse state. There are fields with delimiter characters everywhere, and there is no pattern that I can locate.

+4  A: 

If only the last column has unescaped commas, then most languages' implementations of string split can limit the number of splits made, e.g. in Python s.split(',', 5)
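
For instance, a minimal sketch of that idea in Python, using the example line from the question:

line = "sam,male,september,brown,blue,i like cats, and i like dogs"

# Split on the first 5 commas only; everything after the fifth comma
# is kept intact as the final "about" field.
name, gender, month, hair, eyes, about = line.split(",", 5)

print(about)  # prints: i like cats, and i like dogs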

If you want to run the file through a CSV (comma-separated values) parser, then I think the best approach would be to run a fixer that adds proper escaping before passing the file to the CSV parser.
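
As a rough illustration of that fixer idea (a sketch only, assuming just the last field can contain commas; the function name is mine):

import csv
import io

def fix_line(raw, n_fields=6):
    # Split on the first n_fields-1 commas, then re-emit the row with
    # proper CSV quoting so any standard CSV parser can consume it.
    fields = raw.rstrip("\n").split(",", n_fields - 1)
    buf = io.StringIO()
    csv.writer(buf).writerow(fields)
    return buf.getvalue()

print(fix_line("sam,male,september,brown,blue,i like cats, and i like dogs"), end="")
# sam,male,september,brown,blue,"i like cats, and i like dogs"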

Lie Ryan
I should have mentioned that this isn't my *actual* log; it's *much* worse. This was just a basic example. I can't find any reliable pattern in these.
citricsquid
@citricsquid: can you give a small sample of the actual log? Especially the worst cases.
Lie Ryan
+3  A: 

I suppose you can make certain assumptions about the kind of data. For example, gender, month, hair, and eyes each have a limited domain of values, so you can verify parsed fields against those domains.

It could also be that none of the fields except about (and maybe name) contain a comma, so perhaps you can parse greedily, treating the first 5 commas as delimiters and everything after them as part of about. Verify again if necessary.
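
A quick sketch of that combination in Python (the domain sets here are assumptions based on the example, not the asker's real data):

MONTHS = {"january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"}
GENDERS = {"male", "female"}

def parse_and_verify(line):
    # Greedy parse: treat the first 5 commas as delimiters and keep
    # everything after them together as the 'about' field.
    fields = line.split(",", 5)
    if len(fields) != 6:
        return None  # too few commas: malformed line
    name, gender, month, hair, eyes, about = fields
    # Verify the constrained columns against their known domains.
    if gender not in GENDERS or month not in MONTHS:
        return None  # verification failed; set aside for review
    return fields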

Jeff M
+2  A: 

It might be impossible to parse them perfectly if no escaping is used.

Lie Ryan noted that if only the last column can contain those unescaped delimiters, you have an option there.

If that's not the case, are there any columns that are guaranteed never to contain the unescaped delimiter? And are there any columns that are guaranteed to contain only a certain set of values?

If either of those is true, you may be able to identify those fields first, and separate out everything else to split from there.
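
For example, a sketch of the second idea in Python, under the assumed guarantee that the month column always holds one of twelve known values and never contains the delimiter:

MONTHS = {"january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"}

def split_around_month(line):
    tokens = line.split(",")
    for i, tok in enumerate(tokens):
        if tok.strip().lower() in MONTHS:
            # Everything before the month is name+gender (possibly with
            # embedded commas); everything after is hair, eyes, about.
            return tokens[:i], tokens[i], tokens[i + 1:]
    return None  # no anchor found; handle manually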

I'd have to know more specifics about your info to go further.

Andrew Barber
+1  A: 

Here are two ideas that you could try out:

  • Length/format patterns - you may be able to identify some patterns in the individual columns of the file. For example, values in some columns may be consistently shorter than in others. Values in some columns are typically numbers, come from a limited set of values (e.g. months), or at least often contain some common substring.

    When you can identify these patterns (based on statistics calculated from correctly delimited lines), you could create an algorithm that uses them to guess which of the delimiters should be ignored (e.g. when a column would otherwise be shorter than expected).

  • Grammatical rules - another idea inspired by your example - are the commas that are not escaped usually followed by some specific strings (e.g. the words "and" or "about")? If so, you could use this information to guess which delimiters should be escaped (see the sketch after this list).
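
A small Python sketch of the second idea (the word list is an assumption drawn from the example line, not a rule from the asker's real logs):

import re

def split_with_grammar(line):
    # A comma followed by a connective like " and" is probably part of
    # the text rather than a delimiter, so protect it before splitting.
    protected = re.sub(r",(\s+(?:and|but|or)\b)", r"<COMMA>\1", line)
    return [f.replace("<COMMA>", ",") for f in protected.split(",")]

fields = split_with_grammar(
    "sam,male,september,brown,blue,i like cats, and i like dogs")
# ['sam', 'male', 'september', 'brown', 'blue', 'i like cats, and i like dogs']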

Finally, if none of these ad hoc techniques solves your problem, you can use some heavy statistics to do the estimation. There are machine learning frameworks that can do the heavy statistics for you, but it is still quite a complicated problem. For example, on .NET you could use Infer.NET from Microsoft Research.

Tomas Petricek
A: 

One thing I would suggest doing, if possible, is to keep something in each data record that indicates the assumptions that were made (possibly keeping the original string), so that if something is found to be wrong with a record, the proper data can hopefully be reconstructed (by hand-examining it, if nothing else).
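
A minimal sketch of that record-keeping in Python (the field and heuristic names are illustrative, not prescribed):

from dataclasses import dataclass

@dataclass
class Record:
    raw: str        # the original, untouched log line
    fields: list    # our best-guess parse
    heuristic: str  # which rule produced this parse

line = "sam,male,september,brown,blue,i like cats, and i like dogs"
rec = Record(raw=line, fields=line.split(",", 5), heuristic="maxsplit-5")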

supercat
A: 

If the 6th column is always the last, and always unescaped, this bit of Perl should do the trick:

use strict;
use warnings;

my $file = '/path/to/my/log.txt';
open(my $log, '<', $file) or die "Cannot open $file: $!";

while (my $line = <$log>) {
    chomp($line);
    # The first five fields cannot contain commas; the final greedy
    # capture takes everything that remains, embedded commas included.
    if ($line =~ /^([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),(.*)$/) {
        print "Name:         $1\n";
        print "Gender:       $2\n";
        print "Month:        $3\n";
        print "Color #1:     $4\n";
        print "Color #2:     $5\n";
        print "Random Text:  $6\n";
    }
}

close($log);

Mentalikryst
A: 

Your logs are ambiguous: you can't be sure which of many possible interpretations to make. Dealing with uncertainty is a job for probability theory. A natural tool then is a probabilistic context-free grammar -- there are algorithms for finding the most-probable parse. (I haven't had occasion to use one myself, though I've done simpler jobs with this kind of statistical approach. Peter Norvig's spelling-corrector article goes into practical detail on one such example.)

For this particular simplified problem: you might enumerate all the possible ways to split a line into N parts (where you already know what N to expect), calculate the probability of each according to some model, and pick the best answer.
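
A sketch of that enumeration in Python, with a deliberately crude scoring function standing in for a real probability model (the expected-length table is made up for illustration):

from itertools import combinations

# Assumed per-column length ranges, a stand-in for a learned model:
# (name, gender, month, hair, eyes, about)
EXPECTED = [(2, 15), (4, 6), (3, 9), (3, 10), (3, 10), (1, 1000)]

def candidate_parses(line, n_fields=6):
    # Every way of choosing n_fields-1 of the commas as real delimiters.
    commas = [i for i, c in enumerate(line) if c == ","]
    for chosen in combinations(commas, n_fields - 1):
        fields, prev = [], 0
        for pos in chosen:
            fields.append(line[prev:pos])
            prev = pos + 1
        fields.append(line[prev:])
        yield fields

def score(fields):
    # Count how many columns fall inside their expected length range.
    return sum(lo <= len(f) <= hi for f, (lo, hi) in zip(fields, EXPECTED))

line = "sam,male,september,brown,blue,i like cats, and i like dogs"
best = max(candidate_parses(line), key=score)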

(Another example of dealing with data with distinctions erased: I had a dataset of tags from a half-million Flickr photos. The tags came out of their API with all the wordsruntogether with the spaces squished out. I computed the most likely word boundaries using word frequencies tabulated from Internet photography sites, plus code like this SO answer.)

Darius Bacon