ansaurus

Question

How to parse data in a variable length delimited file?

Answer 1

+2 A:

Here's a way to read fixed width fields using regexp

>>> import re
>>> s="Techy Inn Val NJ"
>>> var1,var2,var3,var4 = re.match("(.{5}) (.{3}) (.{3}) (.{2})",s).groups()
>>> var1
'Techy'
>>> var2
'Inn'
>>> var3
'Val'
>>> var4
'NJ'
>>>

gnibbler 2010-08-12 23:59:37

Super elegant! Could you please troubleshoot this: `var1,var2 = re.match("(.{%d}) (.{%d})",line2).groups() % (positions[1],positions[2])` gives AttributeError: 'NoneType' object has no attribute 'groups'

ThinkCode 2010-08-13 00:14:43

The regexp didn't match the line. Also, I'm curious why you consider regexps to be more elegant than slicing.

Aaron Gallagher 2010-08-13 00:17:18

You need to move the parameters next to the string, like this: `re.match("(.{%d}) (.{%d})" % (positions[1],positions[2]), line2).groups()`

gnibbler 2010-08-13 00:19:43
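A minimal sketch of that fix, assuming the field widths are known up front. The `widths` list is made up to match the sample row, and the code is written for Python 3. The point is that the `%` formatting has to be applied to the pattern string before it reaches `re.match`, and that `re.match` returns `None` when the line does not fit.

```python
import re

# Hypothetical field widths, chosen to match the sample row from the answer.
widths = [5, 3, 3, 2]
line = "Techy Inn Val NJ"

# Format the pattern string first, then pass the finished pattern to re.match.
pattern = " ".join("(.{%d})" % w for w in widths)
match = re.match(pattern, line)

# re.match returns None when the line does not fit the pattern,
# so check before calling .groups().
fields = match.groups() if match else None
print(fields)
```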

Answer 2

A:

def parse(your_file):
    first_line = your_file.next().rstrip()
    slices = []
    start = None
    for e, c in enumerate(first_line):
        if c != '#':
            continue

        if start is None:
            start = e
            continue
        slices.append(slice(start, e))
        start = e
    if start is not None:
        slices.append(slice(start, None))

    for line in your_file:
        parsed = [line[s] for s in slices]
        yield parsed

Aaron Gallagher 2010-08-13 00:00:50

Answer 3

+1 A:

Off the top of my head:

f = open(.......)
header = f.next() # get first line
posns = [i for i, c in enumerate(header + "#") if c == '#']
for line in f:
    fields = [line[posns[k]:posns[k+1]] for k in xrange(len(posns) - 1)]

Update with tested, fixed code:

import sys
f = open(sys.argv[1])
header = f.next() # get first line
print repr(header)
posns = [i for i, c in enumerate(header) if c == '#'] + [-1]
print posns
for line in f:
    posns[-1] = len(line)
    fields = [line[posns[k]:posns[k+1]].rstrip() for k in xrange(len(posns) - 1)]
    print fields

Input file:

#      #  #
Foo    BarBaz
123456789abcd

Debug output:

'#      #  #\n'
[0, 7, 10, -1]
['Foo', 'Bar', 'Baz']
['1234567', '89a', 'bcd']

Robustification notes:

This solution caters for any old rubbish (or nothing) after the last # in the header line; it doesn't need the header line to be padded out with spaces or anything else.
The OP needs to consider whether it's an error if the first character of the header is not #.
Each field has trailing whitespace stripped; this automatically removes a trailing newline from the rightmost field (and doesn't run amok if the last line is not terminated by a newline).
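The three notes can be folded into one small function. This is only a sketch in Python 3 syntax: the name `split_by_header` is made up, and treating a header that does not start with `#` as an error is an assumption, since that decision is left to the OP.

```python
def split_by_header(header, line):
    """Slice `line` at the columns where `header` has a '#'."""
    posns = [i for i, c in enumerate(header) if c == '#']
    if not posns or posns[0] != 0:
        # Assumption: a header that does not start with '#' is an error.
        raise ValueError("header must start with '#'")
    # The last field runs to the end of the line, whatever follows the last '#'.
    bounds = zip(posns, posns[1:] + [None])
    # rstrip() drops padding and any trailing newline from each field.
    return [line[lo:hi].rstrip() for lo, hi in bounds]

print(split_by_header("#      #  # junk\n", "Foo    BarBaz\n"))
print(split_by_header("#      #  #", "123456789abcd"))
```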

Final(?) update: Leapfrogging @gnibbler's suggestion to use slice(): set up the slices once before looping.

import sys
f = open(sys.argv[1])
header = f.next() # get first line
print repr(header)
posns = [i for i, c in enumerate(header) if c == '#']
print posns
slices = [slice(lo, hi) for lo, hi in zip(posns, posns[1:] + [None])]
print slices
for line in f:
    fields = [line[sl].rstrip() for sl in slices]
    print fields

John Machin 2010-08-13 00:01:27

how about `fields = [line[slice(*x)] for x in zip(posns, posns[1:])]`

gnibbler 2010-08-13 00:11:46

This answer is spot on sir! Works beautifully. Another challenge, what if the delimiters are 3 letters and not all are same, for instance : <A> <B> <C> <B> <A> I know it is asking for too much! Intelligently parsing the delimiters would be awesome. Thanks again...

ThinkCode 2010-08-13 00:44:56

WoW! I tweaked my program a little, used my code for finding the delimiters and passed the positions to your code and boom, it works! Thank you so much!

ThinkCode 2010-08-13 00:53:25

@gnibbler: Thanks for the suggestion to use slices.

John Machin 2010-08-13 01:15:51

@AnonymousDriveByDownVoter: Care to give reasons so that I/we can benefit from your wisdom?

John Machin 2010-08-13 01:17:57

@ThinkCode: With all due respect, your code for finding the delimiters should be abandoned. New/fat delimiters: Presuming the first character of the "delimiter" corresponds to the first character of the field, you could (a) simply use "<" instead of "#" in my solution (b) if you want some more validation, use the re.finditer approach that you saw in another solution to derive the `posns` list (pattern e.g. "<[A-Z]>" -- complicate as necessary/desirable to match what you have in the header). By the way, who's "designing" these headers?

John Machin 2010-08-13 01:28:16
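A rough sketch of option (b), assuming the markers look like `<A>`, `<B>` and so on, and that each marker starts in the first column of its field. The header and row strings are made up for illustration, and the code uses Python 3 syntax.

```python
import re

# Hypothetical header with three-character markers; each '<' sits
# above the first character of its field.
header = "<A>   <B> <C> <B>"
row = "Techy Inn Val NJ"

# The start of every marker is the start of a field.
posns = [m.start() for m in re.finditer(r"<[A-Z]>", header)]

# Pair each start with the next start; the last field runs to end of line.
fields = [row[lo:hi].rstrip() for lo, hi in zip(posns, posns[1:] + [None])]
print(posns)
print(fields)
```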

Totally lost track of this q! It is weirdly formatted data that we get from clients sometimes! Not sure why they do this to us!!

ThinkCode 2010-09-21 20:34:36

Answer 4

A:

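import re

f = open('sample.txt', 'r')
# each field spans from a '#' up to (not including) the next '#'
pos = [m.span() for m in re.finditer(r'#\s*', f.next())]
# let the last field run to the end of the line
pos[-1] = (pos[-1][0], None)
for line in f:
    print [line[i:j].strip() for i, j in pos]
f.close()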

tux21b 2010-08-13 00:05:47

Answer 5

+1 A:

Adapted from John Machin's answer

>>> header = "#     #   #   #"
>>> row = "Techy Inn Val NJ"
>>> posns = [i for i, c in enumerate(header) if c == '#']
>>> [row[slice(*x)] for x in zip(posns, posns[1:]+[None])]
['Techy ', 'Inn ', 'Val ', 'NJ']

You can also write the last line like this

>>> [row[i:j] for i,j in zip(posns, posns[1:]+[None])]

For the other example you give in the comments, you just need to have the correct header

>>> header = "#       #     #     #"
>>> row    = "Techiyi Iniin Viial NiiJ"
>>> posns = [i for i, c in enumerate(header) if c == '#']
>>> [row[slice(*x)] for x in zip(posns, posns[1:]+[None])]
['Techiyi ', 'Iniin ', 'Viial ', 'NiiJ']
>>>

gnibbler 2010-08-13 00:15:41

it is failing for row = "Techiyi Iniin Viial NiiJ" . I really appreciate your answers. Trying to decipher the code, neat though!

ThinkCode 2010-08-13 00:33:41

@ThinkCode, did you update the value of header to match the row?

gnibbler 2010-08-13 00:45:24

Answer 6

A:

This works well if the delimiter line indicates the fixed width fields.

import re

delim = '#     #   #   # '

# parse the delimiter line once; a list (not a generator) can be reused for every data line
slices = [slice(*m.span()) for m in re.finditer(r'# *', delim)]

line = 'Techy Inn Val NJ'

# read fields
fields = [line[s] for s in slices]

Jeff M 2010-08-13 00:17:59

Answer 7

A:

How about this?

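with open('somefile','r') as source:
    line = source.next()
    # each field is as wide as its '#' plus the padding that follows it
    sizes = [len(part) + 1 for part in line.rstrip().split("#")[1:]]
    positions = [(sum(sizes[:x]), sum(sizes[:x+1])) for x in xrange(len(sizes))]
    # let the last field run to the end of the line
    positions[-1] = (positions[-1][0], None)
    for line in source:
        fields = [line[start:end] for start, end in positions]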

Is this what you're looking for?

S.Lott 2010-08-13 01:03:03

Answer 8

+1 A:

OK, to be a little different, and to give the generalized solution asked for in the comments, I use the header line directly, together with a generator function, instead of slices. Additionally, I allow the leading columns to hold a comment (by not putting a field name at the start of the header), and I allow multi-character field names instead of only '#'.

The drawback is that one-character fields cannot have header names; they can only be marked with '#' in the header line (a '#' is always treated as the beginning of a field, as in the previous solutions, even when it follows letters in the header).

sample="""
            HOTEL     CAT ST DEP ##
Test line   Techy Inn Val NJ FT  FT
"""
data=sample.splitlines()[1:]

def fields(header,line):
    previndex=0
    prevchar=''
    for index,char in enumerate(header):
        if char == '#' or (prevchar != char and prevchar == ' '):
            if previndex or header[0] != ' ':
                yield line[previndex:index]
            previndex=index
        prevchar = char
    yield line[previndex:]

header,dataline = data
print list(fields(header,dataline))

Output

['Techy Inn ', 'Val ', 'NJ ', 'FT  ', 'F', 'T']

One practical use of this is parsing fixed-width data without knowing the field lengths: build the header from a copy of a data line that has every field present and no comment, with spaces inside a field replaced by something else such as '_', and single-character field values replaced by '#'.

Header from sample line:

'            Techy_Inn Val NJ FT  ##'
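As a quick check of that idea, here is a Python 3 rendering of the generator above, run with the header derived from the sample line. The logic is unchanged; only the `print` call and the way the two strings are built differ.

```python
def fields(header, line):
    """Yield slices of `line`, starting a new field at each '#' or at
    each non-space character that follows a space in `header`."""
    previndex = 0
    prevchar = ''
    for index, char in enumerate(header):
        if char == '#' or (prevchar != char and prevchar == ' '):
            # Skip the leading chunk when the header starts with blanks
            # (that part of the line is treated as a comment).
            if previndex or header[0] != ' ':
                yield line[previndex:index]
            previndex = index
        prevchar = char
    yield line[previndex:]

# Header built from the data line: comment blanked out, the space inside
# "Techy Inn" replaced by '_', single-character values replaced by '#'.
header = " " * 12 + "Techy_Inn Val NJ FT  ##"
dataline = "Test line   Techy Inn Val NJ FT  FT"

result = list(fields(header, dataline))
print(result)
```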

Tony Veijalainen 2010-08-13 07:26:13
