I just started digging into SSIS today, so don't hate too much if I am missing something obvious.

I have an XML file (from a third party):

<root>
    <foo>
        <fooId>12345</fooId>
        <name>FOO</name>
        <bars>
            <bar>BAR 1</bar>
            <bar>BAR 2</bar>
            [...]
        </bars>
    </foo>
    [...]
</root>

and corresponding tables in my DB:
Foo with fields (FooID, Name)
Bar with fields (BarID (identity PK), FooID, Name)

So basically Bar is like a set of attributes for Foo.

So I add an XML source that points to that file, and it produces 3 different datasets (foo, bars, bar). The problem is that the bar set contains the bar's value plus some autogenerated ID, which is not very useful on its own. The only way I can see to get a bar set with both the bar value and fooId is to sort and merge-join those sets, which seems rather odd and is probably going to brutally murder performance (we are talking hundreds of thousands of foo elements here).

Question is: how to do this properly?

+1  A: 

I have not had a chance to use any XML data sources yet in SSIS. BizTalk is our tool of choice here. Regardless, I did a little research and found a very helpful article here:

http://blogs.msdn.com/b/mattm/archive/2007/12/11/using-xml-source.aspx

Follow the section on dealing with multiple outputs except do the following:

  1. Replace all references to their element with your element
  2. Replace all references to their element with your element

So, based on this, set up your XML data source per the article. Modify it with the advanced property editor using the points mentioned above. Take the two outputs for bar and bars and route them into a Merge Join. Inner join them on bars_Id. Select bar and foo_Id as your output columns. That output can then feed your Bar table.
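To make the join concrete, here is a minimal sketch in Python (not SSIS) of what those joins compute. The row layouts and generated key values are assumptions made up for illustration; check the real column names in the XML source's advanced editor. It also assumes foo_Id is a generated key rather than the real fooId, in which case a second join to the foo output is needed to recover fooId. The sketch uses a simple hash join for brevity, not the sorted merge join that SSIS performs.

```python
# Hypothetical rows shaped like the three XML source outputs.
# The *_Id values stand in for the autogenerated keys (assumed, not real data).
foo_rows = [{"foo_Id": 0, "fooId": 12345, "name": "FOO"}]
bars_rows = [{"bars_Id": 0, "foo_Id": 0}]
bar_rows = [
    {"bars_Id": 0, "bar": "BAR 1"},
    {"bars_Id": 0, "bar": "BAR 2"},
]


def inner_join(left, right, key):
    """Inner join two lists of dict rows on a shared key column (hash join)."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    # Emit one merged row per matching (left, right) pair.
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# bar -> bars on bars_Id gives each bar value its parent's generated foo_Id.
bar_with_parent = inner_join(bar_rows, bars_rows, "bars_Id")

# Joining to foo on foo_Id swaps the generated key for the real fooId.
bar_with_foo = inner_join(bar_with_parent, foo_rows, "foo_Id")

result = [(row["fooId"], row["bar"]) for row in bar_with_foo]
print(result)
```

Each tuple in `result` pairs a fooId with one bar value, which is the shape the Bar table needs (BarID being an identity column).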

I know this isn't ideal, as you are sorting and merge joining. Hopefully, with the sorting done in the XML data source, the performance impact will not be too great.

One other solution to consider is using an XSLT file to flatten the XML. This is done with an XML Task in the control flow. Here is an article that might be helpful as well:

http://blogs.msdn.com/b/mattm/archive/2007/12/15/xml-source-making-things-easier-with-xslt.aspx
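For a sense of what "flattened" means here, this is a rough Python sketch (not XSLT, and not taken from the article) that walks the sample document from the question and emits one row per bar with its parent's fooId. The function name `flatten` and the trimmed sample string are illustrative assumptions; an XSLT stylesheet would aim to produce the same one-row-per-bar shape so the XML source has a single output.

```python
import xml.etree.ElementTree as ET

# Sample document from the question, with the elided "[...]" parts left out.
SAMPLE = """<root>
    <foo>
        <fooId>12345</fooId>
        <name>FOO</name>
        <bars>
            <bar>BAR 1</bar>
            <bar>BAR 2</bar>
        </bars>
    </foo>
</root>"""


def flatten(xml_text):
    """Return (fooId, bar) pairs, one per <bar> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for foo in root.findall("foo"):
        foo_id = foo.findtext("fooId")
        # Every <bar> under this <foo> inherits the parent's fooId.
        for bar in foo.findall("bars/bar"):
            rows.append((foo_id, bar.text))
    return rows


flat_rows = flatten(SAMPLE)
print(flat_rows)
```

Note that fooId comes back as a string here; it would need a cast before loading into an integer column.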

Good Luck!

ChrisLoris
That's pretty much what I am doing right now... What I was expecting to find is the ability to just add a column from the parent element to the sub-element's set, since the relationship is already given by the structure, but there appears to be no easy way of doing it (or maybe I am blind?)
liho1eye
So, I'd try going with the XSLT approach. It would flatten the XML to the point where you have only a single output from the XML data source. Kind of ironic that we're using XSLT to make XML work like a flat file.
ChrisLoris
+1  A: 

I wouldn't worry about optimising performance yet. Just add another SSIS step to transform the datasets.

When you have the whole thing working, review performance. SSIS transformations are easier to maintain than XSLT. Hundreds of thousands of foo elements shouldn't be an issue, depending on how often you run the module. I haven't used SSIS for ETL for a while, so I'm not quite up to speed on that, but I am using XSLT, and an extra SSIS step is easier to maintain if you keep it simple.

Just my opinion.

MikeAinOz
+1  A: 

@ your comment to Chris: There's an easy way to add a column to a dataset. Inside your data flow task, add a "Derived Column" transformation and use it to add or manipulate the columns you need.

XSLT is a pain.

Shlomo