Suppose we have a table:

Key|Val|Flag
01 |AAA| Y
02 |BBB| N
...

wrapped into xml this way:

<Data>
  <R><F>Key</F><F>Val</F><F>Flag</F></R>
  <R><F>01</F><F>AAA</F><F>Y</F></R>
  <R><F>02</F><F>BBB</F><F>N</F></R>
  ...
</Data>

There can be more columns and rows, obviously.

Now I'd like to parse the XML back into a table using a single regex.

I can find all fields with '<F>([\w\d]*)</F>', but I need them to be grouped by rows somehow.

I thought about <R>(<F>([\w\d]*)</F>)*</R>, but the Python implementation finds nothing.

Can someone please help me compose the regex?
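
For reference, a minimal sketch of how that combined pattern behaves in Python's re module on the sample data above (the variable names are illustrative). It may explain part of the problem: re.match only matches at the start of the string, which begins with <Data>, and a repeated capture group keeps only its last repetition, so one pattern cannot return every field of a row.

```python
import re

xml = (
    '<Data>\n'
    '  <R><F>Key</F><F>Val</F><F>Flag</F></R>\n'
    '  <R><F>01</F><F>AAA</F><F>Y</F></R>\n'
    '  <R><F>02</F><F>BBB</F><F>N</F></R>\n'
    '</Data>\n'
)

combined = re.compile(r'<R>(<F>([\w\d]*)</F>)*</R>')

# match() is anchored at position 0, where the text is '<Data>', so it fails.
anchored = combined.match(xml)

# finditer() scans the whole string and finds every row, but group(2) holds
# only the last repetition of the inner group, i.e. the last field of the row.
last_fields = [m.group(2) for m in combined.finditer(xml)]

print(anchored)
print(last_fields)
```

This is why a two-step approach (one pattern for rows, another for the fields inside each row) is the usual workaround with the re module.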

UPDATE Some context of the question.

I'm aware of the many XML parsing libraries, but unfortunately my environment is limited to the standard library. Anyway, thanks to everyone who has warned me not to use regexes for XML parsing.

I needed a quick and dirty solution, so I decided to start with regexes and switch to a parser later.

So far I have the code:

...
row_p = r'<R>(.*?)</R>'
field_p = r'<F>(.*?)</F>'
table = ''

for row in re.finditer(row_p, xml):
    table += '|'.join(re.findall(field_p, row.group(1))) + '\n'

...

It works for small datasets (about 10,000 rows) but fails for tables larger than 500,000 rows.

Maybe I'll investigate why it fails, but my next step is to switch to a standard XML parser. ElementTree is the first candidate.
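
A rough sketch of what the ElementTree version might look like, using only the standard library. It assumes the <Data>/<R>/<F> layout shown above. The function name xml_to_table and the in-memory sample are made up for illustration. iterparse is used so that rows are handled one at a time instead of first building the whole tree, which may help with very large inputs, though that has not been measured here.

```python
import io
import xml.etree.ElementTree as ET

def xml_to_table(source):
    """Build 'a|b|c' lines from <R> rows; source is a file path or file object."""
    lines = []
    # An 'end' event fires once an element is fully parsed, children included.
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == 'R':
            # Empty <F></F> elements have text None, so fall back to ''.
            lines.append('|'.join(f.text or '' for f in elem.findall('F')))
            elem.clear()  # drop the row's children once they are consumed
    return ''.join(line + '\n' for line in lines)

# Hypothetical sample standing in for the real file.
sample = (
    b'<Data>'
    b'<R><F>Key</F><F>Val</F><F>Flag</F></R>'
    b'<R><F>01</F><F>AAA</F><F>Y</F></R>'
    b'<R><F>02</F><F>BBB</F><F>N</F></R>'
    b'</Data>'
)
table = xml_to_table(io.BytesIO(sample))
print(table)
```

For a real file, the path can be passed directly, e.g. xml_to_table('data.xml'), where data.xml is a placeholder name.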

+2  A: 

Mandatory links:

Use an XML parser. lxml is very good and even provides (among other XML-related thingies) XPath. If you have a fetish for one-liners, I'm sure there is an XPath one-liner to extract these elements ;)

delnan
Thanks for the links. It's good to have weighty arguments at my fingertips for staying away from the dark side.
z4y4ts
A: 

If this question were tagged Perl, I could post a solution with code for you, but this is Python.

Anyway, I suggest you load the XML file and read it line by line. Loop over each line until the end of the file and find all fields within that line. As far as I know, matches in Python are stored in a list. There you have it. I wish I could show you real code, but this is just the main idea:

load file
foreach line in <file>
    fields = regex.findall('<F>([\w\d]*)</F>', line)
    if fields is not empty
        print join('|', fields) . "\n"
end loop

DISCLAIMER: The above code is just a scratch
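
A possible Python rendering of that idea, as a sketch only. It assumes every <R> row sits on a single line, as in the sample from the question. The helper name rows_from_lines and the in-memory sample are invented for illustration.

```python
import io
import re

FIELD_RE = re.compile(r'<F>(\w*)</F>')

def rows_from_lines(lines):
    """Return one 'a|b|c' string per input line that contains <F> fields."""
    rows = []
    for line in lines:
        fields = FIELD_RE.findall(line)  # every field on this line, in order
        if fields:                       # skips <Data>, </Data> and blank lines
            rows.append('|'.join(fields))
    return rows

# Hypothetical stand-in for an open file; any iterable of lines works.
sample = io.StringIO(
    '<Data>\n'
    '  <R><F>Key</F><F>Val</F><F>Flag</F></R>\n'
    '  <R><F>01</F><F>AAA</F><F>Y</F></R>\n'
    '  <R><F>02</F><F>BBB</F><F>N</F></R>\n'
    '</Data>\n'
)
rows = rows_from_lines(sample)
for row in rows:
    print(row)
```

If a row is ever split across several lines, this approach silently breaks it into several rows, which is one more reason to prefer a parser.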

Oh by the way, if possible, use an XML parser instead.

Ruel
A: 
import libxml2

txt = '\n<Data>\n  <R><F>Key</F><F>Val</F><F>Flag</F></R>\n  <R><F>01</F><F>AAA</F><F>Y</F></R>\n  <R><F>02</F><F>BBB</F><F>N</F></R>\n</Data>\n'

rows = []
for elem in libxml2.parseDoc(txt):
    if elem.name == 'R':
        curRow = []
        rows.append(curRow)
    elif elem.name == 'F':
        curRow.append(elem.get_content())

returns:

rows = [['Key', 'Val', 'Flag'], ['01', 'AAA', 'Y'], ['02', 'BBB', 'N']] 
eumiro
A: 

lxml is a Pythonic binding for the libxml2 and libxslt libraries. It is unique in that it combines the speed and feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API.

ecounysis
lxml is great, but unfortunately my environment is limited to the standard library. Thanks anyway.
z4y4ts