ansaurus

Question

How can I parse marked up text for further processing?

Answer 1

+1 A:

Since you're dealing with an outline situation, you can simplify things by using a stack. Basically, you want to create a stack that has dicts corresponding to the depth of the outline. When you parse a new line and the depth of the outline has increased, you push a new dict onto the stack that was referenced by the previous dict at the top of the stack. When you parse a line that has a lower depth, you pop the stack to get back to the parent. And when you encounter a line that has the same depth, you add it to the dict at the top of the stack.
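
A minimal sketch of this stack approach, written for Python 3. It assumes one space of indentation per outline level and uses each line's stripped text as the dict key. The function name parse_outline is made up for illustration and does not come from the question.

```python
def parse_outline(lines):
    """Build nested dicts from an indented outline using a stack."""
    root = {}
    stack = [root]   # stack[-1] collects the items at the current depth
    last = None      # dict belonging to the most recently parsed item
    for line in lines:
        text = line.strip()
        if not text:
            continue  # ignore blank lines
        depth = len(line) - len(line.lstrip(' '))
        if depth > len(stack) - 1:
            # Deeper than the current level: descend into the previous item.
            # Assumption: depth grows by at most one level at a time.
            if last is None or depth > len(stack):
                raise ValueError('unexpected indent: %r' % line)
            stack.append(last)
        else:
            # Same depth pops nothing; a shallower line pops back to its ancestor.
            while len(stack) - 1 > depth:
                stack.pop()
        last = {}
        stack[-1][text] = last
    return root
```

With this sketch, parse_outline(['1', ' 1.1', ' 1.2', '  1.2.1', '2']) returns {'1': {'1.1': {}, '1.2': {'1.2.1': {}}}, '2': {}}.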

ealdent 2009-07-07 04:59:28

And to get really fancy you can use the contents of the items and re.match to make sure the next item starts with it plus a dot plus number(s).
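
One way to sketch that check, in Python 3. The helper name is_child_title is hypothetical, and the sketch assumes titles are dotted numbers such as 1.2.3.

```python
import re

def is_child_title(parent, child):
    """Return True if child looks like '<parent>.<digits>'."""
    # re.escape keeps the dots in the parent title literal;
    # '$' rejects deeper descendants such as '1.2.3.4' under '1.2'.
    pattern = re.escape(parent) + r'\.\d+$'
    return re.match(pattern, child) is not None
```

For example, is_child_title('1.2', '1.2.3') is True, while is_child_title('1.2', '1.3.1') is False.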

Kurt 2009-07-07 05:03:30

Answer 2

+6 A:

Edit: thanks to the clarification and change in the spec, I've edited my code. It still uses an explicit Node class as an intermediate step for clarity. The logic is to turn the list of lines into a list of nodes, then turn that list of nodes into a tree (by using their indent attribute appropriately), then print that tree in a readable form. That printing is just a "debug-help" step to check that the tree is well constructed; it can of course be commented out in the final version of the script, which will also take the lines from a file rather than having them hardcoded for debugging. Finally, the code builds the desired Python structure and prints it. Here's the code; as we'll see after it, the result is almost as the OP specifies, with one exception:

import sys

class Node(object):
  def __init__(self, title, indent):
    self.title = title
    self.indent = indent
    self.children = []
    self.notes = []
    self.parent = None
  def __repr__(self):
    return 'Node(%s, %s, %r, %s)' % (
        self.indent, self.parent, self.title, self.notes)
  def aspython(self):
    result = dict(title=self.title, children=topython(self.children))
    if self.notes:
      result['notes'] = self.notes
    return result

def print_tree(node):
  # Debug helper: print each title indented by its depth, then its notes.
  print(' ' * node.indent + ' ' + node.title)
  for subnode in node.children:
    print_tree(subnode)
  for note in node.notes:
    print(' ' * node.indent + ' Note: ' + note)

def topython(nodelist):
  return [node.aspython() for node in nodelist]

def lines_to_tree(lines):
  # Pass 1: turn the lines into a flat list of nodes; a '-' line becomes
  # a note on the most recently seen node.
  nodes = []
  for line in lines:
    indent = len(line) - len(line.lstrip())
    marker, body = line.strip().split(None, 1)
    if marker == '*':
      nodes.append(Node(body, indent))
    elif marker == '-':
      nodes[-1].notes.append(body)
    else:
      sys.stderr.write("Invalid marker %r\n" % marker)

  # Pass 2: link the nodes into a tree under a dummy root with indent -1.
  tree = Node('', -1)
  curr = tree
  for node in nodes:
    # Climb until curr is shallower than node, i.e. until curr is its parent.
    while node.indent <= curr.indent:
      curr = curr.parent
    node.parent = curr
    curr.children.append(node)
    curr = node

  return tree


data = """\
* 1
 * 1.1
 * 1.2
  - Note for 1.2
* 2
* 3
- Note for root
""".splitlines()

def main():
  tree = lines_to_tree(data)
  print_tree(tree)
  print('')
  alist = topython(tree.children)
  print(alist)

if __name__ == '__main__':
  main()

When run, this emits:

 1
  1.1
  1.2
  Note: Note for 1.2
 2
 3
 Note: Note for root

[{'children': [{'children': [], 'title': '1.1'}, {'notes': ['Note for 1.2'], 'children': [], 'title': '1.2'}], 'title': '1'}, {'children': [], 'title': '2'}, {'notes': ['Note for root'], 'children': [], 'title': '3'}]

Apart from the ordering of keys (which is immaterial and not guaranteed in a dict, of course), this is almost as requested -- except that here all notes appear as dict entries with a key of notes and a value that's a list of strings (but the notes entry is omitted if the list would be empty, roughly as done in the example in the question).

In the current version of the question, how to represent the notes is slightly unclear; one note appears as a stand-alone string, others as entries whose value is a string (instead of a list of strings as I'm using). It's not clear what's supposed to imply that the note must appear as a stand-alone string in one case and as a dict entry in all others, so this scheme I'm using is more regular; and if a note (if any) is a single string rather than a list, would that mean it's an error if more than one note appears for a node? In the latter regard, this scheme I'm using is more general (lets a node have any number of notes from 0 up, instead of just 0 or 1 as apparently implied in the question).
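
If a lone note really should be a bare string, one possible tweak is a post-processing pass over the list-of-dicts structure. The following is only a sketch in Python 3 under that assumption: the helper name collapse_single_notes is hypothetical, and the question does not confirm that this is the intended format.

```python
def collapse_single_notes(nodes):
    """Return a copy of the tree in which a one-element 'notes' list
    is replaced by the bare string; longer lists stay as lists."""
    result = []
    for node in nodes:
        new = {'title': node['title'],
               'children': collapse_single_notes(node['children'])}
        notes = node.get('notes', [])
        if len(notes) == 1:
            new['notes'] = notes[0]      # assumed format: single note as a string
        elif notes:
            new['notes'] = list(notes)   # several notes: keep the list
        result.append(new)
    return result
```

Nodes without notes still get no notes key, so the rest of the structure is unchanged.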

Having written so much code (the pre-edit answer was about as long and helped clarify and change the specs) to provide (I hope) 99% of the desired solution, I hope this satisfies the original poster, since the last few tweaks to code and/or specs to make them match each other should be easy for him to do!

Alex Martelli 2009-07-07 05:16:12

I've updated my post to try and clarify things. Now I show that the * and - markers matter, and I fixed up the first output (the {'1.2.3'} should have been just a string, not a dict as I had).

Rigsby 2009-07-07 07:09:12

Answer 3

+1 A:

Stacks are a really useful data structure when parsing trees. You just keep the path from the last added node up to the root on the stack at all times, so you can find the correct parent by the length of the indent. Something like this should work for parsing your last example:

import re
# Groups: leading spaces (one per level), marker (* or -), rest of the line
line_tokens = re.compile(r'( *)(\*|-) (.*)')

def parse_tree(data):
    stack = [{'title': 'Root node', 'children': []}]
    for line in data.split("\n"):
        match = line_tokens.match(line)
        if match is None:
            continue # Skip blank or malformed lines
        indent, symbol, content = match.groups()
        while len(indent) + 1 < len(stack):
            stack.pop() # Remove everything up to current parent
        if symbol == '-':
            stack[-1].setdefault('notes', []).append(content)
        elif symbol == '*':
            node = {'title': content, 'children': []}
            stack[-1]['children'].append(node)
            stack.append(node) # Add as the current deepest node
    return stack[0]

Ants Aasma 2009-07-07 09:47:20

Answer 4

A:

The syntax you're using is very similar to YAML. It has some differences, but it's quite easy to learn; its main focus is to be human readable (and writable).

Take a look at the YAML website. There are some Python bindings, documentation and other stuff there.

http://www.yaml.org

Maciej Łebkowski 2009-07-21 13:45:08
