Hello all, I have an xml file made like this:

<car>Ferrari</car>
<color>red</color>
<speed>300</speed>
<car>Porsche</car>
<color>black</color>
<speed>310</speed>

I need to have it in this form:

<car name="Ferrari">
    <color>red</color>
    <speed>300</speed>
</car>
<car name="Porsche">
    <color>black</color>
    <speed>310</speed>
</car>

How can I do this? I'm struggling because I can't think of a way to create the structure I need from the flat list of tags in the original xml file.

My language of choice is Python, but any suggestion is welcome.

+1  A: 

I don't know about python, but presuming you had an XML parser that gave you hierarchical access to the nodes in an XML document, the semantics you'd want would be something like the following (warning, I tend to use PHP). Basically, store any non-"car" tags, and then when you encounter a new "car" tag treat it as a delimiting field and create the assembled XML node:

// Create an input and output handle
input_handle = parse_xml_document();
output_handle = new_xml_document();

// Assuming the <car>, <color> etc. nodes are
// the children of some parent node, get them as an array
list_of_nodes = input_handle.get_list_child_nodes();

// These are empty variables for storing our data as we parse it
var car = NULL, color = NULL, speed = NULL;

foreach(list_of_nodes as node)
{
  if(node.tag_name() == "color")
  {
    color = node.value();
  }

  if(node.tag_name() == "speed")
  {
    speed = node.value();
    // etc for each type of non-delimiting field
  }

  if(node.tag_name() == "car")
  {
    // If there's already a car specified, take its data
    // and insert it into the output xml structure
    if(car != NULL)
    {
      // Add a new child node to the output document
      // (kept in its own variable so the loop variable is not overwritten)
      out_node = output_handle.append_child_node("car");
      // Set the attribute on this new output node from the stored name
      out_node.set_attribute("name", car);
      // Add the stored child nodes
      out_node.add_child("color", color);
      out_node.add_child("speed", speed);
    }

    // Replace the value of car afterwards. This allows the
    // first iteration to happen when there is no stored value
    // for "car".
    car = node.value();
  }
}

// The last car has no following "car" tag to trigger its output,
// so write it out once the loop has finished
if(car != NULL)
{
  out_node = output_handle.append_child_node("car");
  out_node.set_attribute("name", car);
  out_node.add_child("color", color);
  out_node.add_child("speed", speed);
}
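
Since the question asks for Python, here is a minimal sketch of the same buffer-and-flush idea using the standard library's xml.etree.ElementTree. It assumes the flat tags are wrapped in a <root> element (the file in the question has none), and the names regroup, emit_car and FLAT_XML are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Assumption: the flat tags have been wrapped in a <root> element.
FLAT_XML = """<root>
<car>Ferrari</car>
<color>red</color>
<speed>300</speed>
<car>Porsche</car>
<color>black</color>
<speed>310</speed>
</root>"""


def emit_car(new_root, name, fields):
    """Append one assembled <car name="..."> element to the output tree."""
    car = ET.SubElement(new_root, "car", name=name)
    for tag, text in fields:
        ET.SubElement(car, tag).text = text


def regroup(flat_root):
    """Treat each <car> as a delimiter and nest the tags that follow it."""
    new_root = ET.Element(flat_root.tag)
    car_name = None   # name of the car currently being collected
    fields = []       # (tag, text) pairs seen since the last <car>
    for node in flat_root:
        text = (node.text or "").strip()
        if node.tag == "car":
            # A new <car> closes the previous one, if there was one.
            if car_name is not None:
                emit_car(new_root, car_name, fields)
            car_name, fields = text, []
        else:
            # Tags seen before the first <car> are dropped at the next <car>.
            fields.append((node.tag, text))
    # The last car has no following <car> to trigger it, so flush it here.
    if car_name is not None:
        emit_car(new_root, car_name, fields)
    return new_root


result = regroup(ET.fromstring(FLAT_XML))
print(ET.tostring(result, encoding="unicode"))
```

Buffering generic (tag, text) pairs instead of one variable per tag means extra property tags need no code changes.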
sargant
Very useful hint, my real problem was of course more complex than the one I posted, but starting from your answers I've found a solution. Thanks.
Davide Gualano
+1  A: 

IF your real life data is as simple as your example and there are no errors in it, you can use a regular expression substitution to do it in one hit:

import re

guff = """
<car>Ferrari</car>
<color>red</color>
<speed>300</speed>
<car>Porsche</car>
<color>black</color>
<speed>310</speed>
"""

pattern = r"""
<car>([^<]+)</car>\s*
<color>([^<]+)</color>\s*
<speed>([^<]+)</speed>\s*
"""

repl = r"""<car name="\1">
    <color>\2</color>
    <speed>\3</speed>
</car>
"""

regex = re.compile(pattern, re.VERBOSE)
output = regex.sub(repl, guff)
print(output)

Otherwise you had better read it 3 lines at a time, do some validations, and write it out one "car" element at a time, either using string processing or ElementTree.
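
A rough sketch of that three-lines-at-a-time approach, assuming every record is exactly one <car>, one <color> and one <speed> line in that order, each on its own line. The function name group_in_threes and the EXPECTED_TAGS constant are invented for this example:

```python
import xml.etree.ElementTree as ET

# Assumption: each record is exactly these three tags, in this order.
EXPECTED_TAGS = ("car", "color", "speed")


def group_in_threes(lines):
    """Yield one <car> element per three input lines, validating as we go."""
    rows = [line.strip() for line in lines if line.strip()]
    if len(rows) % 3 != 0:
        raise ValueError("expected a multiple of 3 lines, got %d" % len(rows))
    for start in range(0, len(rows), 3):
        # Each line is a complete element, so it can be parsed on its own
        # even though the file as a whole has no root element.
        elems = [ET.fromstring(row) for row in rows[start:start + 3]]
        tags = tuple(elem.tag for elem in elems)
        if tags != EXPECTED_TAGS:
            raise ValueError(
                "record starting at non-blank line %d has tags %r"
                % (start + 1, tags)
            )
        car = ET.Element("car", name=elems[0].text or "")
        car.extend(elems[1:])
        yield car


flat = """
<car>Ferrari</car>
<color>red</color>
<speed>300</speed>
<car>Porsche</car>
<color>black</color>
<speed>310</speed>
"""

for car in group_in_threes(flat.splitlines()):
    print(ET.tostring(car, encoding="unicode"))
```

A malformed line raises ET.ParseError and a record with the wrong tags raises ValueError, so bad input stops the run instead of being silently regrouped.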

John Machin
WTF?! You recommend regex for handling XML? You should be ashamed. Seriously, this is a lousy approach, on every conceivable level.
Tomalak
Read the first line of my answer again. It was a conditional "you can", not a recommendation. In any case, calling his input "XML" is a very charitable act.
John Machin
@John: Strictly speaking, yes. But on the other hand, it is not really missing much and the existence of a `<root>` element can safely be extrapolated. As for the "you can" - no, you can't. SO as a whole is in the business of fighting off wave after wave of people who think regex and (HT|X)ML go well together, and nobody should reinforce that widespread but misguided belief by posting code samples that do it. Regex simply is a non-option when it comes to parsing non-regular languages.
Tomalak
@Tomalak: It wasn't the missing root element that I was talking about; it was the fact that his "xml" appears to be a collection of utterances of an *extremely* regular language. Your "fighting off wave after wave" of the unbelievers is a bit excessive -- it seems to me that they come as singletons, not in squadrons.
John Machin
@John: Yes, they come in singletons, but I think it might be 100 a day. If that is even enough. If you do many regex questions for a while, you'll see. ;-) Where the XML came from is entirely irrelevant - XML *as a whole* is the wrong target for regex, and this is a point that really can't be argued. (And you should read the definition of "regular language" again, since there is no regular language that could produce XML as an output.)
Tomalak
@Tomalak: Sigh. His "xml" file consists of "sentences" of the form A x B y C z D where the ABCD are constants and the xyz are variable -- very regular. In no way was I saying that a regular language could emit XML in generality; as you said, XML is not a regular language. Simply: his stuff is regular; regular expressions can handle it. In your crusade to vanquish the heathen hordes, don't throw out the baby with the bathwater :-) BTW I only look at Python questions, no squadrons there, you must hang out in a bad neighbourhood :-)
John Machin
@John: I probably do. ;-) I understand your reasoning, and your code works for this particular case. Still, it's one thing to do this in a very controlled environment where you know nothing can possibly go wrong (and when do you ever) and another to publicly recommend it. And let's be real, code samples on the Internet are typically cut-'n-pasted without further thought by many people. As one of the "just use a goddamn parser" evangelists (and as someone who is quite proficient with regex), it pains me to see code samples that seem to imply regex would be "kinda all-right, if you're careful".
Tomalak
+5  A: 

XSLT is the perfect tool for transforming one XML structure into another.

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- copy the root element and handle its <car> children -->
  <xsl:template match="/root">
    <xsl:copy>
      <xsl:apply-templates select="car" />
    </xsl:copy>
  </xsl:template>

  <!-- car elements become a container for their properties -->
  <xsl:template match="car">
    <car name="{normalize-space()}">
      <!-- ** see 1) -->
      <xsl:copy-of select="following-sibling::color[1]" />
      <xsl:copy-of select="following-sibling::speed[1]" />
    </car>
  </xsl:template>
</xsl:stylesheet>

1) For this to work, your XML has to have a <color> and a <speed> for every <car>. If that's not guaranteed, or the number and kind of properties is generally variable, replace the two lines with the generic form of the copy statement:

<!-- any following-sibling element that "belongs" to the same <car> -->
<xsl:copy-of select="following-sibling::*[
  generate-id(preceding-sibling::car[1]) = generate-id(current())
]" />

Applied to your XML (I implied a document element named <root>), this would be the result:

<root>
  <car name="Ferrari">
    <color>red</color>
    <speed>300</speed>
  </car>
  <car name="Porsche">
    <color>black</color>
    <speed>310</speed>
  </car>
</root>

Sample code that applies XSLT to XML in Python should be really easy to find, so I omit that here. It'll be hardly more than four or five lines of Python code.

Tomalak
+1: Most interesting. I had put together a solution that iterated root's children in 3s with a little error checking. Very useful to see the XSLT to get the same result "the right way". Thank you.
MattH
Thanks for the suggestion, unfortunately I'm not skilled in xslt and I am not able to adapt your solution to the real case.
Davide Gualano
@Davide: Ahh the merits of mock-up code samples. You should have posted real code then. If you are dealing with XML rather often, then you should consider learning the basics of XSLT. It's not *that* hard, but, like regex, it's an invaluable weapon to have in one's arsenal.
Tomalak
A: 

Assuming the first element within the root is a car element, and all non-car elements "belong" to the last car:

# cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically
import xml.etree.ElementTree as etree

root = etree.XML('''<root>
<car>Ferrari</car>
<color>red</color>
<speed>300</speed>
<car>Porsche</car>
<color>black</color>
<speed>310</speed>
</root>''')

new_root = etree.Element('root')

for elem in root:
    if elem.tag == 'car':
        car = etree.SubElement(new_root, 'car', name=elem.text)
    else:
        car.append(elem)

new_root would be:

<root><car name="Ferrari"><color>red</color>
<speed>300</speed>
</car><car name="Porsche"><color>black</color>
<speed>310</speed>
</car></root>

(I've assumed that the pretty whitespace was not important)
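
If the pretty whitespace does matter, one option (a sketch, assuming Python 3.9 or later, where ElementTree gained an indent() helper) is to re-indent the tree before serialising it. The new_root below is rebuilt from the output shown above so that the snippet runs on its own:

```python
import xml.etree.ElementTree as ET

# Rebuild the regrouped tree, including the leftover newlines
# that the moved elements carried over from the flat input.
new_root = ET.fromstring(
    '<root><car name="Ferrari"><color>red</color>\n'
    '<speed>300</speed>\n'
    '</car><car name="Porsche"><color>black</color>\n'
    '<speed>310</speed>\n'
    '</car></root>'
)

# ET.indent (Python 3.9+) replaces whitespace-only text and tails in place.
ET.indent(new_root, space="  ")
pretty = ET.tostring(new_root, encoding="unicode")
print(pretty)
```

On older Python versions indent() is not available, so the whitespace would have to be set by hand or with another library.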

Steven