I need to parse text lists:

1 List name

1 item
2 item
3 item

2 List name

1 item
2 item
3 item

3 List name

1 item
2 item
3 item

I tried to use a regular expression to split out the first-level lists:

import re
def re_show(pat, s):
 print re.compile(pat, re.S).sub("{\g<0>}", s),'\n'

s = '''
1 List name

1 item
2 item
3 item

2 List name

1 item
2 item
3 item

3 List name

1 item
2 item
3 item
'''

re_show('\n\d+.*?(?=\n\n\d+.*?\n\n)', s)

But it doesn't work. Instead of this:

{
1 List name

1 item
2 item
3 item}
{
2 List name

1 item
2 item
3 item}
{
3 List name

1 item
2 item
3 item}

I get this:

{
1 List name}
{
1 item
2 item
3 item}
{
2 List name}
{
1 item
2 item
3 item}

3 List name

1 item
2 item
3 item

What would you recommend to solve this task?

Thanks for your answers. I've learned many new features of Python.

I think I will use the "state machine" approach as described here
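
For anyone reading later, here is a minimal sketch of what such a state machine might look like. It is not the code from the linked description; the function name `parse_lists` and the three state names are made up for illustration, and it assumes the format shown above: a header line, one blank line, the items, then a blank line before the next header.

```python
def parse_lists(text):
    """Parse 'header / blank line / items' blocks with a small state machine."""
    # Hypothetical state names, chosen for this sketch only.
    HEADER, SEPARATOR, ITEMS = range(3)
    state = HEADER
    result = []  # (header, items) pairs, in input order
    for line in text.splitlines():
        line = line.strip()
        if state == HEADER:
            if line:  # blank lines before a header are skipped
                result.append((line, []))
                state = SEPARATOR
        elif state == SEPARATOR:
            if line:
                raise ValueError("expected blank line after header, got %r" % line)
            state = ITEMS
        else:  # state == ITEMS
            if line:
                result[-1][1].append(line)
            else:  # a blank line ends the current item block
                state = HEADER
    return result


sample = "\n1 List name\n\n1 item\n2 item\n3 item\n\n2 Another name\n\n1 item\n"
parsed = parse_lists(sample)
```

Because the header is recognised by position rather than by content, this should cope with arbitrary list names and with lists of different lengths.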

A: 

I suspect I'm missing the point, but isn't this simply a question of looking for List?

wallyk
No, "List name" could be any string
Vanuan
+1  A: 

Here's one way, using a dictionary:

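d = {}
e = 0
with open("myfile") as f:
    for line in f:
        line = line.rstrip()
        if "List" in line:
            # a header line starts a new group
            e += 1
        # setdefault avoids a KeyError for lines seen before the first header
        d.setdefault(e, []).append(line)
for i, j in d.iteritems():
    print i, j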
Thanks! I'm a newbie in Python. I didn't know about dictionaries. They're very similar to associative arrays.
Vanuan
+1  A: 
class ListParser:

 def __init__(self, s):
  self.str = s.split("\n")
  self.answer = []

 def parse(self):
  self.nextLine()
  self.topList()
  return

 def topList(self):
  while(len(self.str) > 0):
   self.topListItem()

 def topListItem(self):
  l = self.nextLine()
  print "TOP: " + l
  l = self.nextLine()
  if l != '':
   raise Exception("expected blank line but found '%s'" % l)
  sub = self.sublist()

 def nextLine(self):
  return self.str.pop(0)

 def sublist(self):
  while True:
   l = self.nextLine()
   if l == '':
    return # end of sublist marked by blank line
   else:
    print "SUB: " + l

parser = ListParser(s)
parser.parse() 
print "done"

prints

TOP: 1 List name
SUB: 1 item
SUB: 2 item
SUB: 3 item
TOP: 2 List name
SUB: 1 item
SUB: 2 item
SUB: 3 item
TOP: 3 List name
SUB: 1 item
SUB: 2 item
SUB: 3 item
done
Steve Cooper
Your solution gave me the idea of using a "state machine" approach. Thank you.
Vanuan
A: 

Not sure how fast it is, but this works as long as the spacing doesn't change.

>>> items = text.strip('\n').split('\n\n')
>>> dict((x, y.splitlines()) for x, y in zip(items[::2], items[1::2]))
{'3 List three': ['1 item', '2 item', '3 item'], '2 List two': ['1 item', '2 item', '3 item'], '1 List one': ['1 item', '2 item', '3 item']}
chris
Very straightforward approach :) But no, the lists have varying numbers of items.
Vanuan
It doesn't matter how many items are in each list, just that the spacing (newlines) between elements is the same. You could replace text.strip('\n').split('\n\n') with filter(None, text.split('\n\n')), and that might solve the problem (I haven't tested it).
chris
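
To illustrate the point about item counts, here is a small sketch of the split-on-blank-lines idea applied to lists of different lengths. The sample `text` is invented for the example, and a list comprehension stands in for `filter(None, ...)` so that the result is a list on Python 3 as well. It still assumes exactly one blank line between a header and its items and between blocks.

```python
# Invented sample: two lists with different numbers of items.
text = """
1 List one

1 item
2 item

2 List two

1 item
2 item
3 item
"""

# Split on blank lines and drop empty chunks (the list-comprehension
# equivalent of filter(None, ...)), trimming stray newlines at the edges.
chunks = [c.strip("\n") for c in text.split("\n\n") if c.strip()]

# The chunks alternate: header, items, header, items, ...
lists = dict(
    (header, items.splitlines())
    for header, items in zip(chunks[::2], chunks[1::2])
)
```

If a header ever appears without items, the alternation breaks and headers and items get paired up wrongly, so this only suits input where the layout is strictly regular.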
+2  A: 

Do you have control over the list format? With just a little editing, you could turn that into config file format, and use the ConfigParser module.
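
As a rough sketch of the config-file idea: the INI layout below is only one hypothetical way the list could be hand-edited (each list name becomes a `[section]` and each "N item" line becomes "N = item"). The module is called `configparser` on Python 3 and `ConfigParser` on Python 2; `read_string` is the Python 3 spelling.

```python
import configparser  # named ConfigParser on Python 2

# Hypothetical hand-edited version of the list in INI form.
ini_text = """
[1 List name]
1 = item
2 = item
3 = item

[2 List name]
1 = item
2 = item
"""

parser = configparser.ConfigParser()
parser.read_string(ini_text)

lists = {}
for section in parser.sections():
    # items() yields (key, value) pairs for the section, in file order
    lists[section] = ["%s %s" % (key, value) for key, value in parser.items(section)]
```

Note that the parser lower-cases option names by default, which is harmless for numeric keys but would matter if the keys were words.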

Otherwise, how about with a little recursion?

from collections import defaultdict

def fill_data(data, key, sequence, pred):
    """Recursively fill the data dictionary"""
    for item in sequence:
        # if the pred is true, add it to the list
        if pred(item):
            data[key].append(item)
        # otherwise recurse, with item as key
        else:
            return fill_data(data, item, sequence, pred)
    return data

# a key->list dictionary
data = defaultdict(list)
# Get the text as a sequence of non-empty lines
lines = (l for l in s.splitlines() if l.strip())

def is_data_line(line):
    """Is this line a data line (i.e. two items)?"""
    return len(line.split()) == 2

result = fill_data(data, None, lines, is_data_line )

print dict(result)

Output (prettified):

{'2 List name': 
    ['1 item', '2 item', '3 item'], 
 '3 List name': 
    ['1 item', '2 item', '3 item'], 
 '1 List name': 
    ['1 item', '2 item', '3 item']}
Ryan Ginstrom
No, I don't. I need to transform this format into another one (the lists are very long; this is just a small example). Your code works well, but I don't understand it yet (I'm a newbie :)). Thank you!
Vanuan
You're right, the code is a bit cryptic. Thanks for pointing it out. I've added some comments that I hope will make what it does clearer.
Ryan Ginstrom
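
In case it helps with reading the recursive version: the recursion works because every call pulls from the same generator, so each nested call simply carries on from the line where the previous one stopped, with a new key. Below is a sketch of an iterative equivalent that makes this explicit. The name `fill_data_iterative` is invented here, and it assumes the same predicate idea as the answer (a data line has exactly two whitespace-separated fields).

```python
from collections import defaultdict


def fill_data_iterative(sequence, pred):
    """Group data lines under the most recent non-data (header) line."""
    data = defaultdict(list)
    key = None  # no header seen yet
    for item in sequence:
        if pred(item):
            data[key].append(item)  # data line: file it under the current header
        else:
            key = item  # header line: it becomes the key for what follows
    return dict(data)


s = "\n1 List name\n\n1 item\n2 item\n\n2 List name\n\n1 item\n"
lines = (l for l in s.splitlines() if l.strip())
result = fill_data_iterative(lines, lambda l: len(l.split()) == 2)
```

The recursive version goes one call deeper for every header, so with very long input (more headers than Python's default recursion limit of 1000) the loop form is likely the safer choice.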