I have three input data files. Each uses a different delimiter for the data contained therein. Data file one looks like this:

apples | bananas | oranges | grapes

Data file two looks like this:

quarter, dime, nickel, penny

Data file three looks like this:

horse cow pig chicken goat

(The change in the number of columns is also intentional.)

The thought I had was to count the number of non-alpha characters, and presume that the highest count was the separator character. However, the files with non-space separators also have spaces before and after the separators, so the spaces win on all three files. Here's my code:

def count_chars(s):
    # Count only the characters that could plausibly be separators.
    valid_seps = [' ', '|', ',', ';', '\t']
    cnt = {}
    for c in s:
        if c in valid_seps:
            cnt[c] = cnt.get(c, 0) + 1
    return cnt

infile = 'pipe.txt'  # or 'comma.txt' or 'space.txt'
with open(infile, 'r') as f:
    records = f.read()
print(count_chars(records))

It will print a dictionary with the counts of all the acceptable characters. In each case, the space always wins, so I can't rely on that to tell me what the separator is.

But I can't think of a better way to do this.

Any suggestions?

+2  A: 

If you're using Python, I'd suggest just calling re.split on the line with all valid expected separators:

>>> import re
>>> l = "big long list of space separated words"
>>> re.split(r'[ ,|;"]+', l)
['big', 'long', 'list', 'of', 'space', 'separated', 'words']

The only issue would be if one of the files used a separator as part of the data.

If you must identify the separator, your best bet is to count every candidate character except the space. If there are almost no occurrences of any of them, the separator is probably the space; otherwise, it's the candidate with the highest count.
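A minimal sketch of that counting idea, assuming Python 3 and the same candidate list as the question's valid_seps. The names guess_separator, split_record and min_per_line are made up for illustration, and the "at least one per line" threshold is an arbitrary assumption, not a tested rule:

```python
from collections import Counter

# Candidate separators other than the space (assumed; taken from the
# question's valid_seps list).
NON_SPACE_SEPS = "|,;\t"

def guess_separator(text, min_per_line=1):
    """Return the most frequent non-space candidate, or a space if none qualifies."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    counts = Counter(c for ln in lines for c in ln if c in NON_SPACE_SEPS)
    if counts:
        sep, n = counts.most_common(1)[0]
        # Hypothetical threshold: the winner must average at least
        # min_per_line occurrences per non-empty line.
        if n >= min_per_line * len(lines):
            return sep
    return " "

def split_record(line, sep):
    """Split one line on sep and strip the padding spaces around each field."""
    if sep == " ":
        return line.split()  # no-argument split collapses runs of whitespace
    return [field.strip() for field in line.split(sep)]
```

Stripping each field after the split is what gets rid of the spaces that pad the pipe and comma separators, so no empty elements end up in the list.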

Unfortunately, there's really no way to be sure. You may have space separated data filled with commas, or you may have | separated data filled with semicolons. It may not always work.

JoshD
That doesn't really solve the problem. What I end up with, in that case, is every single character in the file split into its own list, like: "['a'] ['p'] ['p'] ['l'] ['e'] ['s'] [' '] ['|'] (...and so forth...)". What I'd like, instead, is each line broken into a list like "['apples', 'bananas', 'oranges', 'grapes']".
Greg Gauthier
I assume you're trying to identify the separator so you can separate the data. Why do you want to identify the separator?
JoshD
@Greg Gauthier: I'm terribly sorry. I meant to say re.split. I've changed the answer to reflect the proper method.
JoshD
<pre><code>infile = 'Data/pipe.txt' records = open(infile,'r').read() for line in records: print line.split('|,; \t')</pre></code>
Greg Gauthier
urgh... no formatting in comments?
Greg Gauthier
@Greg: I still see what you have. I've updated my answer. I had used the wrong split, and have corrected it with an example. I hope this clears things up.
JoshD
-1 This has no chance of working. The first arg of str.split is a string representing a single delimiter; it is NOT a string of multiple 1-character delimiters. Unless the input string actually includes '|, \t;', it will be returned unchanged. How the OP got "every single character ... split into its own list", I can't imagine.
John Machin
@JoshD - Dude, THANK YOU! That works! There's a bit of confusion with the spaces around the other seps (it puts null elements into the lists). BUT, this gets me farther than I got before! :)
Greg Gauthier
@John Machin: Yes. I've since corrected it to the split I intended. It seems to have worked for him. I do appreciate you pointing this out, though.
JoshD
@Greg Gauthier, you might try adding a + (see answer) in the regular expression. Then it will match consecutive delimiters and remove most of the empty list items.
JoshD
Thanks again, Josh. My original attempt looked like this: new_line = [i for i in re.split(r'[ ,|;"\t]', line) if i != ''] --- but extending the regex is really a much cleaner approach.
Greg Gauthier
+10  A: 

How about trying the Sniffer class from Python's standard csv module: http://docs.python.org/library/csv.html#csv.Sniffer

import csv

sniffer = csv.Sniffer()
dialect = sniffer.sniff('quarter, dime, nickel, penny')
print(dialect.delimiter)
# prints: ,
eumiro
ooh. That one is interesting! Is it available in version 2.6?
Greg Gauthier
Yes, it is available in 2.6.
eumiro
+1: Definitely use the csv module for this. Parsing delimited files, especially if they might contain escaped delimiters, delimiters within quoted strings, newlines within quoted strings etc. is no job for a regex. A regex solution will fail sooner or later, and the bugs will be subtle and mind-numbing to find.
Tim Pietzcker
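To make the quoted-delimiter point concrete, here is a small comparison, assuming Python 3. The sample line is invented for illustration and does not come from the OP's files:

```python
import csv
import re

# Hypothetical record: the second field contains a comma inside quotes.
line = 'widget,"nuts, bolts",42'

# A character-class split knows nothing about quoting, so it also cuts
# inside the quoted field.
regex_fields = re.split(r'[,|;]+', line)

# csv.reader honours the quotes and keeps the field in one piece.
csv_fields = next(csv.reader([line], delimiter=','))

print(regex_fields)
print(csv_fields)
```

The regex version yields four pieces with stray quote characters, while csv.reader yields the three intended fields.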
Wow, I didn't know this existed.
Matthew Schinckel
This is a great answer -- but it won't work for the OP's first example. An input of `apples | bananas | oranges | grapes` claims that the delimiter is `' '`. If you remove the spaces from around the pipes, it will work as expected.
John Ledbetter
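One possible workaround for the pipe case, sketched under the assumption of Python 3: sniff() accepts an optional delimiters argument that restricts the candidates, so the space can be left out and used only as a fallback. The helper name sniff_delimiter and the candidate string are illustrative choices, not part of the csv module:

```python
import csv

def sniff_delimiter(sample, candidates="|,;\t"):
    """Ask csv.Sniffer for a delimiter among candidates; fall back to a space."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # The sniffer could not settle on any of the candidates, so
        # assume the data is space separated.
        return " "

for sample in ("apples | bananas | oranges | grapes",
               "quarter, dime, nickel, penny",
               "horse cow pig chicken goat"):
    print(repr(sniff_delimiter(sample)))
```

Once the delimiter is known, passing skipinitialspace=True to csv.reader drops the space that follows each delimiter; a space before the delimiter would still need a strip() on each field.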
A: 
Greg Gauthier