views:

215

answers:

3

I need to process XML documents of varying formats into records in a MySQL database on a daily basis. The data I need from each XML document is interspersed with a good deal of data I don't need, and each document's node names are different. For example:

source #1:

<object id="1">
    <title>URL 1</title>
    <url>http://www.one.com&lt;/url&gt;
    <frequency interval="60" />
    <uselessdata>blah</uselessdata>
</object>
<object id="2">
    <title>URL 2</title>
    <url>http://www.two.com&lt;/url&gt;
    <frequency interval="60" />
    <uselessdata>blah</uselessdata>
</object>

source #2:

<object">
    <objectid>1</objectid>
    <thetitle>URL 1</thetitle>
    <link>http://www.one.com&lt;/link&gt;
    <frequency interval="60" />
   <moreuselessdata>blah</moreuselessdata>
</object>
<object">
    <objectid>2</objectid>
    <thetitle>URL 2</thetitle>
    <link>http://www.two.com&lt;/link&gt;
    <frequency interval="60" />
    <moreuselessdata>blah</moreuselessdata>
</object>

...where I need the object's ID, interval, and URL.

My ideas for approaches are:

1.) Having a separate function to parse each XML document and iteratively create the SQL query from within that function

2.) Having a separate function parse each document and iteratively add each object to my own object class, and have the SQL work done by a class method

3.) Using XSLT to convert all the documents into a common XML format and then writing a parser for that document.

The XML documents themselves aren't all that large, as most will be under 1MB. I don't anticipate their structure changing often (if ever), but there is a strong possibility I will need to add and remove further sources as time goes on. I'm open to all ideas.

Also, sorry if the XML samples above are mangled... they're not terribly important, just a rough idea to show that the node names in each document are different.

+2  A: 

Using XSLT is an overkill. I like approach (2), it makes a lot of sense.

Using Python I'd try to make a class for every document type. The class would inherit from dict and on its __init__ parse the given document and populate itself with the 'id', 'interval' and 'url'.

Then the code in main would be really trivial, just instantiate instances of those classes (which are also dicts) with the appropriate documents and then pass them off as normal dicts.

Amr Mostafa
+1: In your case, the different XML docs are really just different tag names. These, in turn, just alter the XPath string you'd use in the ElementTree find and findall functions. The different XML parsing options are really easy to do.
S.Lott
A: 

I've been successfully using variant the third approach. But documents I've been processing were a lot bigger. If it's a overkill or not, well that really depends how fluent you are with XSLT.

vartec
A: 

If your various input formats are unambiguous, you can do this:

<xsl:template match="object">
  <object>
    <id><xsl:value-of select="@id | objectid" /></id>
    <title><xsl:value-of select="title | thetitle" /></title>
    <url><xsl:value-of select="url | link" /></url>
    <interval><xsl:value-of select="frequency/@interval" /></interval>
  </object>
</xsl:template>

For your sample input, this produces:

<object>
  <id>1</id>
  <title>URL 1</title>
  <url>http://www.one.com&lt;/url&gt;
  <interval>60</interval>
</object>
<object>
  <id>2</id>
  <title>URL 2</title>
  <url>http://www.two.com&lt;/url&gt;
  <interval>60</interval>
</object>
<object>
  <id>1</id>
  <title>URL 1</title>
  <url>http://www.one.com&lt;/url&gt;
  <interval>60</interval>
</object>
<object>
  <id>2</id>
  <title>URL 2</title>
  <url>http://www.two.com&lt;/url&gt;
  <interval>60</interval>
</object>

However, there may be faster methods to achieve a usable result than using XSLT. Just measure how fast each approach is, and how "ugly" if feels for you. I would tend to say that XSLT is the more elegant/maintainable solution to process XML. YMMV.

If your input formats are ambiguous and the above solution produces wrong results, a more explicit aproach is needed, along the lines of:

<xsl:template match="object">
  <object>
    <xsl:choose>
      <xsl:when test="@id and title and url and frequency/@interval">
        <xsl:apply-templates select="." mode="format1" />
      </xsl:when>
      <xsl:when test="objectid and thetitle and link and frequency/@interval">
        <xsl:apply-templates select="." mode="format2" />
      </xsl:when>
    </xsl:choose>
  </object>
</xsl:template>

<xsl:template match="object" mode="format1">
  <id><xsl:value-of select="@id" /></id>
  <title><xsl:value-of select="title" /></title>
  <url><xsl:value-of select="url" /></url>
  <interval><xsl:value-of select="frequency/@interval" /></interval>
</xsl:template>

<xsl:template match="object" mode="format2">
  <id><xsl:value-of select="objectid" /></id>
  <title><xsl:value-of select="thetitle" /></title>
  <url><xsl:value-of select="link" /></url>
  <interval><xsl:value-of select="frequency/@interval" /></interval>
</xsl:template>
Tomalak