views: 1544
answers: 10

I have 5 large XML files which I am keen to analyse. All of them are too large to open in a text editor and so I do not know their XML schemas.

I have tried to import them into SQL Server, but the import fails with an error, even though I am pretty sure the files are valid, as they came from very reputable programmers.

I have also tried other tools, but each either struggles with the large file sizes (MySQL) or reports that the files contain invalid XML characters (Access & Excel).

How can I read and insert the data programmatically? Can this be done via SQL query?

Thanks a lot!

+1  A: 

You kind of have to know the schema. Try downloading TextPad or something similar to view the files.

Once you know the schema you can do a couple of things to get them into SQL. One approach would be to use OpenXML http://msdn.microsoft.com/en-us/library/ms186918.aspx.
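If the files are too large to open, you can also discover the schema programmatically. A minimal Python sketch (using only the standard library's `xml.etree.ElementTree`; the `sample_tags` helper name is mine, not from any answer) that tallies element names from the start of a huge file without loading it all into memory:

```python
import xml.etree.ElementTree as ET
from collections import Counter

def sample_tags(path, limit=10000):
    """Tally element names from the start of a large XML file
    without loading the whole document into memory."""
    counts = Counter()
    seen = 0
    # iterparse streams the file; "start" events fire as each tag opens
    for _, elem in ET.iterparse(path, events=("start",)):
        counts[elem.tag] += 1
        seen += 1
        if seen >= limit:
            break  # stop early; we only want a sample of the structure
    return counts
```

Running this over the first few thousand elements usually reveals the row element and its attributes, which is enough to write an OpenXML or bulk-load query.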

brendan
I am using Notepad++ already. That is usually pretty solid. However these files vary between 19 MB and 850 MB, and unfortunately the one I really want to see is the big one.
Jon Winstanley
The only editor I know of that can handle files larger than your RAM size is UltraEdit (http://www.ultraedit.com/)
marc_s
I have plenty of RAM, 1.5 GB (1.05 GB available). But I will take a look at UltraEdit anyway. Thanks!
Jon Winstanley
+3  A: 

Try the free LogParser utility from Microsoft: http://www.microsoft.com/DownLoads/details.aspx?FamilyID=890cd06b-abf8-4c25-91b2-f8d975cf8c07&displaylang=en

It's designed to give you SQL-like access to large text files including XML. Something like

Select top 1000 * from myFile.xml

...should work to get you started. Also, be aware that after installation the documentation appears in your Start menu alongside the executable; I don't think there's a good copy online.

steamer25
A: 

For viewing very large files, I've found the V file viewer to be excellent.

I've used it on files as large as 8GB. For files which are fixed record length, it is extremely easy to navigate based on block size, because it is disk-based.

Note that there is no editing capability.

Having said that, one difficulty with XML is that it's not really a good format for large "streams", since it has an overall beginning and end structure, and a parser which cannot hold the entire file in memory may have to do some pretty fancy tricks to ensure that it complies with a DTD or schema.

Cade Roux
+1  A: 

I've tested the SQL Server XML parser extensively; the bcp.exe utility works great for this. The trick is choosing the right row terminator, since it must be a value that cannot occur in your document. For instance you can do this:

create table t1(x xml)

Create a simple text file that contains only your chosen delimiter. For example, place this string in delim.txt:

-++++++++-

Then concatenate that to the end of your document instance, from the command line:

copy myFile.xml + delim.txt out.xml /b

After this you can BCP it into the database like:

bcp.exe test.dbo.t1 in out.xml -T -c -r -++++++++-

If the document is UTF-16, replace the -c switch with -w.
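If `copy /b` isn't convenient (e.g. the file lives on a non-Windows box), the concatenation step can be done with a few lines of Python. This is just a sketch of the same delimiter-append trick; the `append_delimiter` name is mine:

```python
def append_delimiter(src, dst, delim=b"-++++++++-"):
    """Copy src to dst in binary mode and append the chosen row
    terminator, mirroring: copy src + delim.txt dst /b"""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(1 << 20)  # copy in 1 MB chunks
            if not chunk:
                break
            fout.write(chunk)
        fout.write(delim)  # the same string you pass to bcp's -r switch
```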

A: 

Have you tried using OPENROWSET to import your big XML files into a SQL Server table?

CREATE TABLE XmlTable
(
    ID INT IDENTITY,
    XmlData XML
)

INSERT XmlTable(XmlData)
  SELECT * FROM 
    OPENROWSET(BULK '(your path)\xmldata.xml',
    SINGLE_BLOB
) AS X

Since I don't have any 5GB files at hand, I can't really test it myself.

There's another way you might tackle this : streaming Linq-To-Xml. Check out this blog post where James Newton-King shows how to read XElement one-by-one, and a two-part series here and here on the same topic by the Microsoft XML team blog.
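The same streaming idea exists outside .NET. As an illustration only (not what the linked posts use), here is a Python analogue with `xml.etree.ElementTree.iterparse`, yielding one row element at a time so memory stays flat regardless of file size; the `stream_rows` helper name is mine:

```python
import xml.etree.ElementTree as ET

def stream_rows(path, tag):
    """Yield the attributes of each <tag> element one at a time,
    discarding parsed elements so memory use stays bounded."""
    context = ET.iterparse(path, events=("start", "end"))
    _, root = next(context)  # grab the root element as it opens
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield dict(elem.attrib)
            root.clear()  # drop already-parsed children from the tree
```

Each yielded dict can then be inserted into the database in batches.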

Marc

marc_s
A: 

Have you tried SQL Server XML Bulk Load?

Darrel Miller
A: 

You should load your XML into an XML database, e.g. Berkeley DB XML or Xindice.

Also, I'm not sure if it can scale to 850mb, but First Object XML Editor, and the parser library on which it's built, can handle quite large files.

Also, Baretail should display your files without breaking a sweat.

ykaganovich
+1  A: 

The first thing I did was to get the first X bytes (e.g. the first 1 MB) of the XML files so I could take a look at them with the editor of my choice.

If you have Cygwin installed, you already own a nice GNU utility to achieve this: head

head.exe -c1M comments.xml > comments_small.xml

Alternatively, you can find native ports of most GNU utilities here: http://unxutils.sourceforge.net/
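If neither Cygwin nor the native ports are available, the same first-N-bytes trick is a few lines of Python. A minimal sketch (the `head_bytes` name is mine):

```python
def head_bytes(src, dst, n=1 << 20):
    """Copy the first n bytes (default 1 MB) of src to dst,
    equivalent to: head -c1M src > dst"""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        fout.write(fin.read(n))
```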

VVS
Good plan. Although some uncommon tags might be missing from the first few records.
Jon Winstanley
+2  A: 

See this blog post by unofficial StackOverflow team member Brent Ozar:
http://www.brentozar.com/archive/2009/06/how-to-import-the-stackoverflow-xml-into-sql-server/

Joel Coehoorn