How can I go through blocks of lines separated by an empty line? The file looks like the following:

ID: 1
Name: X
FamilyN: Y
Age: 20

ID: 2
Name: H
FamilyN: F
Age: 23

ID: 3
Name: S
FamilyN: Y
Age: 13

ID: 4
Name: M
FamilyN: Z
Age: 25

I want to loop through the blocks and grab the fields Name, Family name and Age in a list of 3 columns:

Y X 20
F H 23
Y S 13
Z M 25
A: 

Use a dict, namedtuple, or custom class to store each attribute as you come across it, then append the object to a list when you reach a blank line or EOF.
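
A minimal sketch of that idea, assuming every block carries the Name, FamilyN and Age keys shown in the question. The names `Person` and `read_people` are made up for this illustration, and the sample string stands in for the real file:

```python
from collections import namedtuple

# Hypothetical record type; the field names mirror the keys in the question's file.
Person = namedtuple("Person", ["FamilyN", "Name", "Age"])

def read_people(lines):
    """Collect one Person per blank-line-separated block of 'Key: value' lines."""
    people = []
    fields = {}
    for line in lines:
        line = line.strip()
        if line:
            # Store each attribute as we come across it.
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        elif fields:
            # A blank line closes the current block.
            people.append(Person(fields["FamilyN"], fields["Name"], fields["Age"]))
            fields = {}
    if fields:
        # EOF closes the last block even without a trailing blank line.
        people.append(Person(fields["FamilyN"], fields["Name"], fields["Age"]))
    return people

sample = "ID: 1\nName: X\nFamilyN: Y\nAge: 20\n\nID: 2\nName: H\nFamilyN: F\nAge: 23\n"
people = read_people(sample.splitlines())
for p in people:
    print(p.FamilyN, p.Name, p.Age)
```

With a real file, passing the open file object instead of `sample.splitlines()` should work the same way, since each line is stripped before use.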

Ignacio Vazquez-Abrams
+4  A: 

Use a generator.

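def blocks(iterable):
    """Yield lists of lines; a line matching start_pattern() separates blocks and is discarded."""
    accumulator = []
    for line in iterable:
        if start_pattern(line):  # start_pattern is yours to define, e.g. a blank-line test
            if accumulator:
                yield accumulator
                accumulator = []
        # elif other significant patterns
        else:
            accumulator.append(line)
    # Emit the final block, which has no separator after it.
    if accumulator:
        yield accumulator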
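
The generator leaves `start_pattern` undefined. One possible choice, assumed here rather than taken from the answer, is to treat a whitespace-only line as the boundary. The sketch below repeats the generator so it runs on its own, strips the trailing newline from each line, and uses a small hard-coded list in place of the file:

```python
def start_pattern(line):
    # Assumption: a blank (whitespace-only) line marks the boundary between blocks.
    return not line.strip()

def blocks(iterable):
    accumulator = []
    for line in iterable:
        if start_pattern(line):
            if accumulator:
                yield accumulator
                accumulator = []
        else:
            accumulator.append(line.rstrip("\n"))
    if accumulator:
        yield accumulator

# Stand-in for iterating over the open file.
sample = ["ID: 1\n", "Name: X\n", "FamilyN: Y\n", "Age: 20\n", "\n",
          "ID: 2\n", "Name: H\n", "FamilyN: F\n", "Age: 23\n"]

rows = []
for block in blocks(sample):
    # Turn "Key: value" lines into a dict, then pick the three wanted columns.
    record = dict(line.split(": ", 1) for line in block)
    rows.append((record["FamilyN"], record["Name"], record["Age"]))
print(rows)
```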
S.Lott
just in order to spice it up a tad: say `continue` after re-initializing accumulator and take out the `else`: same control flow, but one less indentation. it is a matter of taste. also, the 'dangling yield' should be conditional: `if accumulator: yield accumulator`; this avoids spurious empty lists to be yielded.
flow
+4  A: 
import re
result = re.findall(
    r"""(?mx)           # multiline, verbose regex
    ^ID:.*\s*           # Match ID: and anything else on that line 
    Name:\s*(.*)\s*     # Match name, capture all characters on this line
    FamilyN:\s*(.*)\s*  # etc. for family name
    Age:\s*(.*)$        # and age""", 
    subject)

Result will then be

[('X', 'Y', '20'), ('H', 'F', '23'), ('S', 'Y', '13'), ('M', 'Z', '25')]

which can be trivially changed into whatever string representation you want.
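
As a sketch of that last step, assuming `subject` holds the file's text as a single string (for example from `open(...).read()`; the sample below is a stand-in), each tuple can be reordered into the "FamilyN Name Age" layout the question asks for:

```python
import re

# Stand-in for the file contents; subject must be a string, not a file object.
subject = "ID: 1\nName: X\nFamilyN: Y\nAge: 20\n\nID: 2\nName: H\nFamilyN: F\nAge: 23\n"

result = re.findall(
    r"""(?mx)
    ^ID:.*\s*
    Name:\s*(.*)\s*
    FamilyN:\s*(.*)\s*
    Age:\s*(.*)$""",
    subject)

# Each tuple is (name, family, age); emit family first, as in the question.
lines = ["%s %s %s" % (family, name, age) for name, family, age in result]
print("\n".join(lines))
```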

Tim Pietzcker
Every time I try re.findall() in my code, it gives me this error message: File "/usr/lib/python2.6/re.py", line 177, in findall return _compile(pattern, flags).findall(string) TypeError: expected string or buffer. What is the reason?
Adia
Well, the error message says that you are not passing a string to it. So what *are* you passing to it?
Tim Pietzcker
Thanks Tim, that one is solved now.
Adia
+1  A: 

If the file is not huge, you can read the whole file with:

content = open(filename).read()

then split the content into blocks using:

blocks = content.split('\n\n')

Now you can write a function to parse a block of text. I would use split('\n') to get the lines of a block and split(':') to get each key and value, possibly with str.strip() or some help from regular expressions.

Without checking whether each block has the required data, the code can look like:

with open('data.txt', 'r') as f:
    content = f.read()
# strip() drops trailing newlines so the last block does not end with an empty line
for block in content.strip().split('\n\n'):
    person = {}
    for l in block.split('\n'):
        k, v = l.split(': ')
        person[k] = v
    print('%s %s %s' % (person['FamilyN'], person['Name'], person['Age']))
Michał Niklas
A: 

simple solution:

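result = []
for record in content.split('\n\n'):  # content holds the whole file as one string
    try:
        # Keep the text after the first space on each of the four lines.
        _id, name, familyn, age = [line.split(' ', 1)[1] for line in record.split('\n')]
    except (ValueError, IndexError):
        # Skip blocks that do not have exactly four "Key: value" lines.
        pass
    else:
        result.append((familyn, name, age))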
Andrey Gubarev
+1  A: 

If your file is too large to read into memory all at once, you can still use a regular-expression-based solution by memory-mapping the file with the mmap module:

import sys
import re
import os
import mmap

# Bytes pattern: an mmap object holds bytes, not text.
block_expr = re.compile(br'ID:.*?\nAge: \d+', re.DOTALL)

filepath = sys.argv[1]
fp = open(filepath, 'rb')
contents = mmap.mmap(fp.fileno(), os.stat(filepath).st_size, access=mmap.ACCESS_READ)

for block_match in block_expr.finditer(contents):
    print(block_match.group().decode())

The mmap trick will provide a "pretend string" to make regular expressions work on the file without having to read it all into one large string. And the finditer() method of the regular expression object will yield matches one at a time without creating an entire list of all matches at once (which findall() does).

I do think this solution is overkill for this use case however (still: it's a nice trick to know...)

Steven
+2  A: 

import itertools

# Assuming input in file input.txt
data = open('input.txt').readlines()

records = (lines for valid, lines in itertools.groupby(data, lambda l : l != '\n') if valid)
output = [tuple(field.split(':')[1].strip() for field in itertools.islice(record, 1, None)) for record in records]

# Alternatively, build output as a generator instead of a list. Use one or the
# other: records is itself a generator and is exhausted after the first pass.
# output = (tuple(field.split(':')[1].strip() for field in itertools.islice(record, 1, None)) for record in records)

# output = [('X', 'Y', '20'), ('H', 'F', '23'), ('S', 'Y', '13'), ('M', 'Z', '25')]
# You can iterate and change the order of elements in the way you want
# [(elem[1], elem[0], elem[2]) for elem in output] as required in your output
Anoop
personally i tend to prefer readable solutions...
flow
Can convert comprehensions to 'for loop' to make it more readable.
Anoop
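
For readers who prefer the loop form mentioned in the comment above, here is a rough sketch of the same groupby logic written with plain for loops. The hard-coded `data` list is a stand-in for `open('input.txt').readlines()`, and the reordering into (FamilyN, Name, Age) is done inline:

```python
import itertools

# Stand-in for open('input.txt').readlines(); the sample lines come from the question.
data = ["ID: 1\n", "Name: X\n", "FamilyN: Y\n", "Age: 20\n", "\n",
        "ID: 2\n", "Name: H\n", "FamilyN: F\n", "Age: 23\n"]

output = []
for valid, lines in itertools.groupby(data, lambda l: l != "\n"):
    if not valid:
        continue  # this group is the run of blank separator lines
    values = []
    for field in itertools.islice(lines, 1, None):  # skip the ID line
        values.append(field.split(":")[1].strip())
    name, family, age = values
    output.append((family, name, age))
print(output)
```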
+2  A: 

Here's another way, using itertools.groupby. The function groupby iterates through the lines of the file and calls isa_group_separator(line) for each line. isa_group_separator returns either True or False (called the key), and itertools.groupby then groups all the consecutive lines that yielded the same True or False result.

This is a very convenient way to collect lines into groups.

import itertools

def isa_group_separator(line):
    return line=='\n'

with open('data_file') as f:
    for key,group in itertools.groupby(f,isa_group_separator):
        # print(key,list(group))  # uncomment to see what itertools.groupby does.
        if not key:
            data={}
            for item in group:
                field,value=item.split(':')
                value=value.strip()
                data[field]=value
            print('{FamilyN} {Name} {Age}'.format(**data))

# Y X 20
# F H 23
# Y S 13
# Z M 25
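
To see the grouping described above in isolation, this small sketch runs itertools.groupby over a hand-written list of lines (a stand-in for the file) and collects each key with its group. Consecutive lines that produce the same key end up together:

```python
import itertools

# Five lines: two data lines, one blank separator, two more data lines.
lines = ["ID: 1\n", "Name: X\n", "\n", "ID: 2\n", "Name: H\n"]

# The key is True for the separator line and False for data lines.
grouped = [(key, list(group))
           for key, group in itertools.groupby(lines, lambda line: line == "\n")]
for key, group in grouped:
    print(key, group)
```

The data groups come back with key False and the separator run with key True, which is why the loop in the answer only processes groups where `not key` holds.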
unutbu
A: 

Along with the half-dozen other solutions I already see here, I'm a bit surprised that no one has been so simple-minded (that is, generator-, regex-, map-, and read-free) as to propose, for example,

fp = open(fn)  # fn is the path to the data file
def get_one_value():
    line = fp.readline()
    if not line:
        return None
    parts = line.split(':')
    if 2 != len(parts):
        return ''
    return parts[1].strip()

# The result is supposed to be a list.
result = []
while True:
    # We don't care about the ID.
    if get_one_value() is None:
        break
    name = get_one_value()
    familyn = get_one_value()
    age = get_one_value()
    result.append((name, familyn, age))
    # We don't care about the block separator.
    if get_one_value() is None:
        break

for item in result:
    print(item)

Re-format to taste.

Cameron Laird
Hi, Cameron. This is the Oneliner Saloon; park your surprise with the barkeep upon entry. You may also notice that few if any answers include any checking that the file being read appears even remotely similar to the enquirer's example.
John Machin
You aren't the John Machin who computed pi to 100 places at the beginning of the eighteenth century, are you? Thanks for the welcome. I get your point; 'least, I think I do ... In the comment-constrained absence of paragraph divisions, I'll summarize this way: "simple" depends on where one stands, and which way one is facing.
Cameron Laird
A: 

This answer isn't necessarily better than what's already been posted, but as an illustration of how I approach problems like this it might be useful, especially if you're not used to working with Python's interactive interpreter.

I've started out knowing two things about this problem. First, I'm going to use itertools.groupby to group the input into lists of data lines, one list for each individual data record. Second, I want to represent those records as dictionaries so that I can easily format the output.

One other thing that this shows is how using generators makes breaking a problem like this down into small parts easy.

>>> # first let's create some useful test data and put it into something 
>>> # we can easily iterate over:
>>> data = """ID: 1
Name: X
FamilyN: Y
Age: 20

ID: 2
Name: H
FamilyN: F
Age: 23

ID: 3
Name: S
FamilyN: Y
Age: 13

ID: 4
Name: M
FamilyN: Z
Age: 25"""
>>> data = data.split("\n")
>>> # now we need a key function for itertools.groupby.
>>> # the key we'll be grouping by is, essentially, whether or not
>>> # the line is empty.
>>> # this will make groupby return groups whose key is True if we
>>> # care about them.
>>> def is_data(line):
        return True if line.strip() else False

>>> # make sure this really works
>>> "\n".join([line for line in data if is_data(line)])
'ID: 1\nName: X\nFamilyN: Y\nAge: 20\nID: 2\nName: H\nFamilyN: F\nAge: 23\nID: 3\nName: S\nFamilyN: Y\nAge: 13\nID: 4\nName: M\nFamilyN: Z\nAge: 25'

>>> # does groupby return what we expect?
>>> import itertools
>>> [list(value) for (key, value) in itertools.groupby(data, is_data) if key]
[['ID: 1', 'Name: X', 'FamilyN: Y', 'Age: 20'], ['ID: 2', 'Name: H', 'FamilyN: F', 'Age: 23'], ['ID: 3', 'Name: S', 'FamilyN: Y', 'Age: 13'], ['ID: 4', 'Name: M', 'FamilyN: Z', 'Age: 25']]
>>> # what we really want is for each item in the group to be a tuple
>>> # that's a key/value pair, so that we can easily create a dictionary
>>> # from each item.
>>> def make_key_value_pair(item):
        items = item.split(":")
        return (items[0].strip(), items[1].strip())

>>> make_key_value_pair("a: b")
('a', 'b')
>>> # let's test this:
>>> dict(make_key_value_pair(item) for item in ["a:1", "b:2", "c:3"])
{'a': '1', 'c': '3', 'b': '2'}
>>> # we could conceivably do all this in one line of code, but this 
>>> # will be much more readable as a function:
>>> def get_data_as_dicts(data):
        for (key, value) in itertools.groupby(data, is_data):
            if key:
                yield dict(make_key_value_pair(item) for item in value)

>>> list(get_data_as_dicts(data))
[{'FamilyN': 'Y', 'Age': '20', 'ID': '1', 'Name': 'X'}, {'FamilyN': 'F', 'Age': '23', 'ID': '2', 'Name': 'H'}, {'FamilyN': 'Y', 'Age': '13', 'ID': '3', 'Name': 'S'}, {'FamilyN': 'Z', 'Age': '25', 'ID': '4', 'Name': 'M'}]
>>> # now for an old trick:  using a list of column names to drive the output.
>>> columns = ["Name", "FamilyN", "Age"]
>>> print("\n".join(" ".join(d[c] for c in columns) for d in get_data_as_dicts(data)))
X Y 20
H F 23
S Y 13
M Z 25
>>> # okay, let's package this all into one function that takes a filename
>>> def get_formatted_data(filename):
        with open(filename, "r") as f:
            columns = ["Name", "FamilyN", "Age"]
            for d in get_data_as_dicts(f):
                yield " ".join(d[c] for c in columns)

>>> print("\n".join(get_formatted_data("c:\\temp\\test_data.txt")))
X Y 20
H F 23
S Y 13
M Z 25
Robert Rossney