views:

581

answers:

7

I have at least 100 xml files each about 300 MB with email messages basically in the format listed below.

Now my question is, how do I get this data into say SQL Sever database so that I can perform query on this data. My queries would be along the lines of: Has a certain person sent an email to another certain person on a given period with certain keywords on subject/body etc.

Here is what I have tried:

1) Loading each XML file into XML data type field into SQL Server. With this approach I could not come up with the Xpath(?) queries to do what I need. Is it even possible to do this in Xpath?

2) Loading each file into .NET DataSet using ReadXML and ReadSchema. This seems to Load fine and it seems to create the right number of DataTable with the foreign keys etc but this would mean I will have to create 100 sets of table on the database. Somehow join all into one single table then perform the query.

Let me know if you guys have any other suggestions.

Thanks.

<Message>
<MsgID>4651286700000CAA00EF00010000</MsgID>
<MsgTime>2007-05-21-01.04.39.000000</MsgTime>
<MsgTimeUTC>1179723879</MsgTimeUTC>
<MsgLang>CODE 1252</MsgLang>
<Sender>
 <UserInfo>
  <FirstName>X</FirstName>
  <LastName>Y</LastName>
  <AccountName>121212</AccountName>
  <CorporateEmailAddress>[email protected]</CorporateEmailAddress>
 </UserInfo>
</Sender>
<Recipient DeliveryType = " ">
 <UserInfo>
  <FirstName>A</FirstName>
  <LastName>B</LastName>
  <FirmNumber>7593</FirmNumber>
  <AccountName>STRATEGIC AS</AccountName>
  <AccountNumber>604806</AccountNumber>
  <CorporateEmailAddress>[email protected]</CorporateEmailAddress>
 </UserInfo>
</Recipient>
<Subject>
 Please review the following
</Subject>
<Attachment>
 <FileName>37715772.htm</FileName>
 <FileID>503242486522279_37715772.htm</FileID>
 <FileSize>31175</FileSize>
</Attachment>
<MsgBody>
 This is the message Body
</MsgBody>

+3  A: 

Use the Xml Bulk Load component

http://support.microsoft.com/kb/316005

Chris Ballance
don't know how I missed this. Thanks everyone who replied.
coderguy123
+1  A: 

You could try using the XmlReader.ReadSubtree() to read one message at the time. Loading the whole file into memory is not a really good idea.

Passing the data of one message to the SQL server is then easy, you can also pass the XML and parse it on the SQL server (using OPENXML) in a stored procedure.

Lucero
A: 

If you are looking to do queries, why not just use XML LINQ to do this, instead. It would probably be simpler.

James Black
A: 

I'd use the ReadXML method you mentioned, but do it on one XML file at a time, and then use the DataSet/DataTable Merge method. As long as the schemas are the same it should just keep adding to the same tables (more rows instead of duplicate tables), and make sure you dispose of the XMLDocument or XMLTextReader that you had for that file each time, so you'll get the memory back.

When you're all done, use a SqlDataAdapter and SqlCommandBuilder to auto-generate your INSERT statments and flush it to your SQL tables. This is easier if you use Visual Studio's Data Explorer to generate a strongly typed dataset.

You may even want to do this intermittently and clear your dataset so you keep the memory down.

LoveMeSomeCode
A: 

Use SQL Server Integration Services. There is an XML Source Adapter designed for this task, with a handy designer where you can select the nodes or attributes in your XML that will turn into columns in your database.

See this blog post from the SSIS team.

santiiiii
A: 

If you need to do this once, I would create an xslt to transform the xml into SQL insert statements and execute the resulting sql script to populate the table(s). This won't give memory issues due to the large xml files.

Janco
A: 

It sounds like databasing is not a requirement, just a way of being able to query. If that is the case, try XMLMax, which should load your 300MB xml in about 20 seconds. It has an indexed xpath query, and indexing should take about 15 minutes. Other options would be a native xml database. eXistDB is opensource and I had one user tell me he has used it with gigabyte size xml and is very happy with it. But it does take a lot of time to load, index and get it set up for a given xml.

Bill Conniff