views:

281

answers:

5

Other than CouchDB and Native XML Databases (such as eXist), which databases can store and query un- and semi-structured data?

I've been handed lots of data in different formats (XML, JSON, YAML, CSV, some custom formats), which I need to merge and work with, but developing a schema in a relational database will be time-consuming and difficult.

Mainly, I'm looking at open-source solutions. Also, at some point I'd like to make this data available some-how.

I understand there will be some overhead to transforming all this data to a common format to work with. There are some good choices out there on the XML front: eXist-db and BerkeleyDB-XML seem to have good XQuery support.

Also, while still quite young, CouchDB looks promising with its simple "document" format.

Other avenues I've looked at are OODBMSs such as ZODB and db4o; write some simple parsing scripts and then just store the resulting record object "as-is". The problem there is querying them afterwards; OODBMSs don't seem to have a good history of query engines.

What I'm not keen on are "blobs-in-relational-db" solutions. Seems like a hack and doesn't really allow for evolution of the data, FT indexing, etc.

Are there any other solutions I've not come across?

A: 

Microsoft SQL Server has an XML column type starting in 2005.

jezell
A: 

If you're really storing unstructured data -- or you plan not to index based on the document's structure -- then I think a full-text search engine like Lucene will be valuable to you. developerWorks has an older but informative article on handling XML with Lucene.

Steven Huwig
The data is semi-structured, but the structures generally don't match any other document/file.
digitala
+1  A: 

Most modern RDBMS support an xml datatype, think an xml document is a value in a table field, with XPath/XQuery to retrieve data from the value. Similiarly you can use a CLOB datatype to represent a large block of characters (i.e. an unstructured document); in which case Oracle, SQL Server, and others have extensions to perform text searches into those fields.

The cool thing is that these semi/un-structured search facilities are implemented as operators accessible from sql so you can blend the results of those searches with you structed queries; maintianing a consistent relational representing to client apps. On a couple projects we've stored native xml data inside an Oracle DB, and used standard SQL Views to project the data as structured.

A: 

IBM DB2 version 9 has PureXML support.

Jonathan Leffler
+1  A: 

If you want something very generic you could use a RDF store. With RDF, you build a graph, which turns out to be a good abstraction for semi-structured data. You simply give the graph the structure inherent to your data. In this context the paper Querying RDF Data from a Graph Database Perspective (or the slides) could be an interesting read.

An easier way is to work more directly with the graph abstraction by using a graph database like neo4j. It's an open source project (and I'm part of it). A graph database doesn't tie your data to a specific application lika OODBs tend to do. And it also doesn't force you to use static schemas that you have to fit your data into.

nawroth