I have a text file that does not conform to any standard, but I do know the (end,start) positions of each column value.

Sample text file :

#     #   #   #
Techy Inn Val NJ

Found the position of # using this code :

f = open('sample.txt', 'r')
i = 0
positions = []
for line in f:
    # only the marker line contains '#'
    if '#' in line:
        print line
        for each in line:
            i += 1
            if each == '#':
                positions.append(i)  # 1-based position of each '#'

1 7 11 15 => Positions

So far, so good! Now, how do I fetch the values from each row based on the positions I fetched? I am trying to construct an efficient loop but any pointers are greatly appreciated guys! Thanks (:

+2  A: 

Here's a way to read fixed width fields using regexp

>>> import re
>>> s="Techy Inn Val NJ"
>>> var1,var2,var3,var4 = re.match("(.{5}) (.{3}) (.{3}) (.{2})",s).groups()
>>> var1
'Techy'
>>> var2
'Inn'
>>> var3
'Val'
>>> var4
'NJ'
>>> 
gnibbler
Super elegant! Could you please troubleshoot this: `var1,var2 = re.match("(.{%d}) (.{%d})",line2).groups() % (positions[1],positions[2])` It raises `AttributeError: 'NoneType' object has no attribute 'groups'`
ThinkCode
The regexp didn't match the line. Also, I'm curious why you consider regexps to be more elegant than slicing.
Aaron Gallagher
You need to move the parameters next to the string like this `re.match("(.{%d}) (.{%d})" % (positions[1],positions[2]),line2).groups()`
gnibbler
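For what it's worth, the regexp approach can be driven by the marker positions instead of hard-coded widths. The following is only a sketch in Python 3 syntax: `build_pattern` is a made-up helper name, and it assumes the positions are 0-based and that the fields are packed back to back (no separator character between groups), with the last field taking the rest of the line.

```python
import re

def build_pattern(posns):
    """Build a regex with one group per column from 0-based start positions."""
    widths = [hi - lo for lo, hi in zip(posns, posns[1:])]
    # Every column but the last has a known width; the last takes the rest.
    parts = ["(.{%d})" % w for w in widths] + ["(.*)"]
    return re.compile("".join(parts))

posns = [0, 6, 10, 14]  # 0-based positions of '#' in the sample header
pattern = build_pattern(posns)
match = pattern.match("Techy Inn Val NJ")
# match is None when the row is shorter than the fixed-width part
fields = [g.strip() for g in match.groups()] if match else None
print(fields)
```

Checking for `None` before calling `.groups()` avoids the `AttributeError` from the comment above when a row is too short to match.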
A: 
def parse(your_file):
    first_line = your_file.next().rstrip()
    slices = []
    start = None
    for e, c in enumerate(first_line):
        if c != '#':
            continue

        if start is None:
            start = e
            continue
        slices.append(slice(start, e))
        start = e
    if start is not None:
        slices.append(slice(start, None))

    for line in your_file:
        parsed = [line[s] for s in slices]
        yield parsed
Aaron Gallagher
+1  A: 

Off the top of my head:

f = open(.......)
header = f.next() # get first line
posns = [i for i, c in enumerate(header + "#") if c == '#']
for line in f:
    fields = [line[posns[k]:posns[k+1]] for k in xrange(len(posns) - 1)]

Update with tested, fixed code:

import sys
f = open(sys.argv[1])
header = f.next() # get first line
print repr(header)
posns = [i for i, c in enumerate(header) if c == '#'] + [-1]
print posns
for line in f:
    posns[-1] = len(line)
    fields = [line[posns[k]:posns[k+1]].rstrip() for k in xrange(len(posns) - 1)]
    print fields

Input file:

#      #  #
Foo    BarBaz
123456789abcd

Debug output:

'#      #  #\n'
[0, 7, 10, -1]
['Foo', 'Bar', 'Baz']
['1234567', '89a', 'bcd']

Robustification notes:

  1. This solution caters for any old rubbish (or nothing) after the last # in the header line; it doesn't need the header line to be padded out with spaces or anything else.
  2. The OP needs to consider whether it's an error if the first character of the header is not #.
  3. Each field has trailing whitespace stripped; this automatically removes a trailing newline from the rightmost field (and doesn't run amok if the last line is not terminated by a newline).
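The three notes above could be folded into one function along these lines. This is a rough sketch in Python 3 syntax, not the tested code from this answer; the name `parse_fixed_width` and the `require_leading_marker` flag are invented for illustration.

```python
def parse_fixed_width(lines, marker="#", require_leading_marker=True):
    """Yield a list of stripped fields for each data line after the header."""
    lines = iter(lines)
    header = next(lines, None)
    if header is None:
        return  # empty input: nothing to parse
    posns = [i for i, c in enumerate(header) if c == marker]
    if not posns:
        raise ValueError("header contains no %r marker" % marker)
    # Note 2: optionally treat a header that does not start with the marker as an error.
    if require_leading_marker and posns[0] != 0:
        raise ValueError("header does not start with %r" % marker)
    # Note 1: the last slice is open-ended, so text after the last marker is harmless.
    slices = [slice(lo, hi) for lo, hi in zip(posns, posns[1:] + [None])]
    for line in lines:
        # Note 3: rstrip() also removes the newline from the rightmost field.
        yield [line[sl].rstrip() for sl in slices]

sample_lines = ["#      #  # junk\n", "Foo    BarBaz\n", "123456789abcd"]
rows = list(parse_fixed_width(sample_lines))
print(rows)
```

Because it accepts any iterable of lines, an open file object can be passed in directly.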

Final(?) update: Leapfrogging @gnibbler's suggestion to use slice(): set up the slices once before looping.

import sys
f = open(sys.argv[1])
header = f.next() # get first line
print repr(header)
posns = [i for i, c in enumerate(header) if c == '#']
print posns
slices = [slice(lo, hi) for lo, hi in zip(posns, posns[1:] + [None])]
print slices
for line in f:
    fields = [line[sl].rstrip() for sl in slices]
    print fields
John Machin
how about `fields = [line[slice(*x)] for x in zip(posns, posns[1:])]`
gnibbler
This answer is spot on sir! Works beautifully. Another challenge, what if the delimiters are 3 letters and not all are same, for instance : <A> <B> <C> <B> <A> I know it is asking for too much! Intelligently parsing the delimiters would be awesome. Thanks again...
ThinkCode
WoW! I tweaked my program a little, used my code for finding the delimiters and passed the positions to your code and boom, it works! Thank you so much!
ThinkCode
@gnibbler: Thanks for the suggestion to use slices.
John Machin
@AnonymousDriveByDownVoter: Care to give reasons so that I/we can benefit from your wisdom?
John Machin
@ThinkCode: With all due respect, your code for finding the delimiters should be abandoned. New/fat delimiters: Presuming the first character of the "delimiter" corresponds to the first character of the field, you could (a) simply use "<" instead of "#" in my solution (b) if you want some more validation, use the re.finditer approach that you saw in another solution to derive the `posns` list (pattern e.g. "<[A-Z]>" -- complicate as necessary/desirable to match what you have in the header). By the way, who's "designing" these headers?
John Machin
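A small sketch of option (b) from the comment above, in Python 3 syntax. The header string is a made-up example aligned with the question's sample row, and the pattern `<[A-Z]>` is the one suggested in the comment; a real header may need a more elaborate pattern.

```python
import re

# Hypothetical header using three-character delimiters, aligned with the row.
header = "<A>   <B> <C> <B>"
row = "Techy Inn Val NJ"

matches = list(re.finditer(r"<[A-Z]>", header))
posns = [m.start() for m in matches]   # start of each delimiter = start of field
names = [m.group() for m in matches]   # delimiter text, kept for reference

slices = [slice(lo, hi) for lo, hi in zip(posns, posns[1:] + [None])]
fields = [row[sl].rstrip() for sl in slices]
print(list(zip(names, fields)))
```

The delimiter names are paired with the fields as a list rather than a dict because the same delimiter can appear more than once.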
Totally lost track of this q! It is weirdly formatted data that we get from clients sometimes! Not sure why they do this to us!!
ThinkCode
A: 
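import re

f = open('sample.txt', 'r')
# each span runs from one '#' up to the next one
pos = [m.span() for m in re.finditer(r'#\s*', f.next())]
pos[-1] = (pos[-1][0], None)  # last field runs to the end of the line
for line in f:
    print [line[i:j].strip() for i, j in pos]
f.close()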
tux21b
+1  A: 

Adapted from John Machin's answer

>>> header = "#     #   #   #"
>>> row = "Techy Inn Val NJ"
>>> posns = [i for i, c in enumerate(header) if c == '#']
>>> [row[slice(*x)] for x in zip(posns, posns[1:]+[None])]
['Techy ', 'Inn ', 'Val ', 'NJ']

You can also write the last line like this

>>> [row[i:j] for i,j in zip(posns, posns[1:]+[None])]

For the other example you give in the comments, you just need to have the correct header

>>> header = "#       #     #     #"
>>> row    = "Techiyi Iniin Viial NiiJ"
>>> posns = [i for i, c in enumerate(header) if c == '#']
>>> [row[slice(*x)] for x in zip(posns, posns[1:]+[None])]
['Techiyi ', 'Iniin ', 'Viial ', 'NiiJ']
>>> 
gnibbler
it is failing for row = "Techiyi Iniin Viial NiiJ" . I really appreciate your answers. Trying to decipher the code, neat though!
ThinkCode
@ThinkCode, did you update the value of header to match the row?
gnibbler
A: 

This works well if the delimiter line indicates the fixed width fields.

import re

delim = '#     #   #   # '

# parse the delimiter line; build a list so the slices can be reused for every line
slices = [slice(*m.span()) for m in re.finditer(r'# *', delim)]

line = 'Techy Inn Val NJ'

# read fields
fields = [line[s] for s in slices]
Jeff M
A: 

How about this?

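with open('somefile','r') as source:
    line = source.next().rstrip('\n')
    # width of each field = the '#' itself plus whatever follows it
    sizes = [len(part) + 1 for part in line.split("#")[1:]]
    positions = [(sum(sizes[:x]), sum(sizes[:x+1])) for x in xrange(len(sizes))]
    # let the last field run to the end of the line
    positions[-1] = (positions[-1][0], None)
    for line in source:
        fields = [line[start:end] for start, end in positions]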

Is this what you're looking for?

S.Lott
+1  A: 

OK, to be a little different and to provide the generalized solution asked for in the comments, I use the header line directly in a generator function instead of building slices. Additionally, the leading columns can be treated as a comment by leaving the header blank above them, and multi-character field names can be used instead of only '#'.

The downside is that one-character fields cannot have header names; they can only be marked with '#' in the header line (as in the previous solutions, '#' is always treated as the beginning of a field, even directly after letters in the header).

sample="""
            HOTEL     CAT ST DEP ##
Test line   Techy Inn Val NJ FT  FT
"""
data=sample.splitlines()[1:]

def fields(header,line):
    previndex=0
    prevchar=''
    for index,char in enumerate(header):
        if char == '#' or (prevchar != char and prevchar == ' '):
            if previndex or header[0] != ' ':
                yield line[previndex:index]
            previndex=index
        prevchar = char
    yield line[previndex:]

header,dataline = data
print list(fields(header,dataline))

Output

['Techy Inn ', 'Val ', 'NJ ', 'FT  ', 'F', 'T']

One practical use of this is parsing fixed-width data without knowing the field lengths: take a copy of a data line in which every field is present, blank out the comment part, replace spaces inside field values with something else such as '_', and replace single-character field values with '#'.

Header from sample line:

'            Techy_Inn Val NJ FT  ##'
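
To check that a header built this way splits the data line the same way as the hand-written one, here is a Python 3 restatement of the generator above applied to it. The function is renamed `split_by_header` purely for this illustration; the logic is meant to be unchanged.

```python
def split_by_header(header, line):
    """Yield the pieces of line that start where a field starts in header."""
    previndex = 0
    prevchar = ''
    for index, char in enumerate(header):
        # A field starts at '#' or at the first non-space after a space.
        if char == '#' or (prevchar != char and prevchar == ' '):
            # Skip the leading comment area when the header starts with a space.
            if previndex or header[0] != ' ':
                yield line[previndex:index]
            previndex = index
        prevchar = char
    yield line[previndex:]

header = '            Techy_Inn Val NJ FT  ##'
dataline = 'Test line   Techy Inn Val NJ FT  FT'
result = list(split_by_header(header, dataline))
print(result)
```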
Tony Veijalainen