ansaurus

Question

Answer 1

+1 A:

I think a grammar would be the best option here. I found a link with some information: http://www.onlamp.com/pub/a/python/2006/01/26/pyparsing.html

Gonzalo Quero 2008-12-01 09:20:58

Answer 2

+4 A:

Take an XML parser, make it generate a DOM (Document Object Model) and then build a recursive algorithm that traverses all the nodes, calls "text()" in each node (that should give you the text in the current node and all children) and puts that as a key in the dictionary.

Aaron Digulla 2008-12-01 10:15:25

Answer 3

+4 A:

Use expat or another XML parser; it's more explicit than anything else, considering you're dealing with XML data anyway.

However, note that XML element names can't start with a number as your example has them.

Here's a parser that will do what you need, although you'll need to tweak it to combine duplicate elements into one dict key:

from xml.parsers.expat import ParserCreate

open_elements = {}
result_dict = {}

def start_element(name, attrs):
    open_elements[name] = True

def end_element(name):
    del open_elements[name]

def char_data(data):
    for element in open_elements:
        cur = result_dict.setdefault(element, '')
        result_dict[element] = cur + data

if __name__ == '__main__':
    p = ParserCreate()

    p.StartElementHandler = start_element
    p.EndElementHandler = end_element
    p.CharacterDataHandler = char_data

    p.Parse(u'<_133_3><_135_3><_116_2>The other system worked for about 1 month</_116_2> got some good images <_137_3>on it then it started doing the same thing as the first one</_137_3> so then I quit using either camera now they are just sitting and collecting dust.</_135_3></_133_3>', 1)

    print result_dict

Daniel 2008-12-01 10:31:44

Interesting. Thanks.

JV 2008-12-01 10:48:45

Answer 4

+1 A:

Note that you can't actually solve this by a regular expression, since they don't have the expressive power to enforce proper nesting.

Take the following mini-language:

A certain number of "(" followed by the same number of ")", no matter what the number.

You could make a regular expression very easily to represent a super-language of this mini-language (where you don't enforce the equality of the number of starts parentheses and end parentheses). You could also make a regular expression very easilty to represent any finite sub-language (where you limit yourself to some max depth of nesting). But you can never represent this exact language in a regular expression.

So you'd have to use a grammar, yes.

harms 2008-12-01 10:45:42

Regular expressions in real languages might be more powerful than regular expressions.

J.F. Sebastian 2008-12-01 15:40:32

Answer 5

+2 A:

from cStringIO   import StringIO
from collections import defaultdict
####from xml.etree   import cElementTree as etree
from lxml import etree

xml = "<e133_3><e135_3><e116_2>The other system worked for about 1 month</e116_2> got some good images <e137_3>on it then it started doing the same thing as the first one</e137_3> so then I quit using either camera now they are just sitting and collecting dust. </e135_3></e133_3>"

d = defaultdict(list)
for event, elem in etree.iterparse(StringIO(xml)):
    d[''.join(elem.itertext())].append(int(elem.tag[1:-2]))

print(dict(d.items()))

Output:

{'on it then it started doing the same thing as the first one': [137], 
'The other system worked for about 1 month': [116], 
'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit using \
either camera now they are just sitting and collecting dust. ': [133, 135]}

J.F. Sebastian 2008-12-01 11:40:54

SebastianDid you test it? For me it fails with the following tracebackTraceback (most recent call last): File "recurse_regex.py", line 55, in <module> d[''.join(elem.itertext())].append(int(elem.tag[1:-2]))AttributeError: 'etree._Element' object has no attribute 'itertext'

JV 2008-12-02 05:07:25

@JV: Yes, I've run it. 'xml.etree.cElementTree has no 'itertext', but 'lxml.etree' has. 'lxml' is not a part of Python's standard library, but 'cElementTree' is. I don't know what is better for OP: to install lxml 'easy_install lxml' or to write 'itertext' for cElementTree.

J.F. Sebastian 2008-12-02 14:38:37

Answer 6

A:

Here's an unreliable inefficient recursive regexp solution:

import re

re_tag = re.compile(r'<(?P<tag>[^>]+)>(?P<content>.*?)</(?P=tag)>', re.S)

def iterparse(text, tag=None):
    if tag is not None: yield tag, text
    for m in re_tag.finditer(text):
        for tag, text in iterparse(m.group('content'), m.group('tag')):
            yield tag, text

def strip_tags(content):
    nested = lambda m: re_tag.sub(nested, m.group('content'))
    return re_tag.sub(nested, content)


txt = "<133_3><135_3><116_2>The other system worked for about 1 month</116_2> got some good images <137_3>on it then it started doing the same thing as the first one</137_3> so then I quit using either camera now they are just sitting and collecting dust. </135_3></133_3>"
d = {}
for tag, text in iterparse(txt):
    d.setdefault(strip_tags(text), []).append(int(tag[:-2]))

print(d)

Output:

{'on it then it started doing the same thing as the first one': [137], 
 'The other system worked for about 1 month': [116], 
 'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit using \
 either camera now they are just sitting and collecting dust. ': [133, 135]}

J.F. Sebastian 2008-12-01 14:02:02

ansaurus

tags:

views:

answers:

Parsing a string for nested patterns

related questions