I have to implement a search feature that can quickly perform arbitrarily complex queries against XML data. When the user makes a query, all XML files must be searched for possible matches. The users will have lots of XML files (tens of thousands or more), each typically a few kilobytes in size. All the XML files have almost the same structure.

I have already benchmarked XPath; it is too slow for my needs.

How can it be done most efficiently? Is it possible to create indexes for the contents of the XML files (preserving content semantics, not just plain full-text search)?

Will it be useful to put the XML data into an (embedded) SQL database and do the queries with SQL?

What other possibilities do I have?

A: 

Don't try to re-invent the wheel!

I would import the XML into a database (e.g. SQLite), along with metadata about each XML file, and query that.

Edit 1:

You could implement a 'drop folder' which is indexed/imported on first run. A folder watcher can then be set up so that ONLY new or changed XML files are re-imported. SQLite can be run in memory for the fastest I/O performance.
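
A minimal sketch of that approach, assuming System.Data.SQLite; the table layout and the element names `name` and `city` are placeholders for whatever fields your real XML structure has:

```csharp
using System;
using System.Data.SQLite;
using System.IO;
using System.Xml.Linq;

class DropFolderIndexer
{
    static void ImportFile(SQLiteConnection conn, string path)
    {
        // Extract the fields you want to query on; "name" and "city"
        // stand in for your real element names.
        var doc = XDocument.Load(path);
        using (var cmd = new SQLiteCommand(
            "INSERT OR REPLACE INTO docs (path, name, city) VALUES (@p, @n, @c)", conn))
        {
            cmd.Parameters.AddWithValue("@p", path);
            cmd.Parameters.AddWithValue("@n", (string)doc.Root.Element("name") ?? "");
            cmd.Parameters.AddWithValue("@c", (string)doc.Root.Element("city") ?? "");
            cmd.ExecuteNonQuery();
        }
    }

    static void Main(string[] args)
    {
        string folder = args[0];

        // In-memory database for the fastest I/O; use a file path instead
        // if the index must survive restarts.
        var conn = new SQLiteConnection("Data Source=:memory:");
        conn.Open();
        new SQLiteCommand(
            "CREATE TABLE docs (path TEXT PRIMARY KEY, name TEXT, city TEXT);" +
            "CREATE INDEX idx_docs_city ON docs (city);", conn).ExecuteNonQuery();

        // First run: import everything already in the drop folder.
        foreach (var file in Directory.GetFiles(folder, "*.xml"))
            ImportFile(conn, file);

        // Afterwards, only new/changed files are re-imported.
        var watcher = new FileSystemWatcher(folder, "*.xml");
        watcher.Created += (s, e) => ImportFile(conn, e.FullPath);
        watcher.Changed += (s, e) => ImportFile(conn, e.FullPath);
        watcher.EnableRaisingEvents = true;

        Console.WriteLine("Watching {0}. Press Enter to quit.", folder);
        Console.ReadLine();
    }
}
```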

Darknight
A: 

The fastest way is to create your own in-memory model of the data in the XML: convert it to simple objects with simple types, and organize it in the structure that best suits your queries. Index it additionally as appropriate for your problem (using Dictionary/SortedDictionary). This approach will be significantly faster than using a SQL database, and a SQL database will in turn be a lot faster than querying each XML file. Depending on the complexity of your queries, this can range from fairly simple to very hard, in which case you should definitely go for an embedded database.
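
A sketch of what such an in-memory model could look like; the `Name`/`Price` fields and the `name`/`price` element names are hypothetical, and the idea is simply one index per frequently queried field:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Linq;

class Record
{
    public string Path;
    public string Name;
    public decimal Price;
}

class InMemoryModel
{
    // Hash index for exact-match lookups.
    readonly Dictionary<string, List<Record>> byName =
        new Dictionary<string, List<Record>>();
    // Sorted index, useful for range queries.
    readonly SortedDictionary<decimal, List<Record>> byPrice =
        new SortedDictionary<decimal, List<Record>>();

    public void Load(string folder)
    {
        foreach (var file in Directory.GetFiles(folder, "*.xml"))
        {
            // "name" and "price" are placeholder element names.
            var doc = XDocument.Load(file);
            var rec = new Record
            {
                Path = file,
                Name = (string)doc.Root.Element("name") ?? "",
                Price = (decimal?)doc.Root.Element("price") ?? 0m
            };
            Index(byName, rec.Name, rec);
            Index(byPrice, rec.Price, rec);
        }
    }

    static void Index<TKey>(IDictionary<TKey, List<Record>> idx, TKey key, Record rec)
    {
        List<Record> bucket;
        if (!idx.TryGetValue(key, out bucket))
            idx[key] = bucket = new List<Record>();
        bucket.Add(rec);
    }

    // O(1) exact match instead of scanning every file.
    public IEnumerable<Record> ByName(string name)
    {
        List<Record> bucket;
        return byName.TryGetValue(name, out bucket) ? bucket : Enumerable.Empty<Record>();
    }

    // Walks the keys in sorted order; good enough for a sketch.
    public IEnumerable<Record> PriceBetween(decimal lo, decimal hi)
    {
        return byPrice.Where(kv => kv.Key >= lo && kv.Key <= hi)
                      .SelectMany(kv => kv.Value);
    }
}
```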

Ivan
Are you seriously suggesting loading *all data* into memory at program load? You're crazy :) I need a solution that doesn't take an hour at every program start to load thousands of files into memory :)
codymanix
Loading 10000 small XML files shouldn't take an hour. It's probably a matter of minutes. After the first load, you can save your data in a binary flat file and monitor the files for changes, updating only the data that has changed afterwards. That will make loading instantaneous :D.
Ivan
In my question I said "tens of thousands or **more**". At some point, loading everything into memory is simply no longer possible.
codymanix
Oh, sorry, I thought it wouldn't be much higher than 10k. Either way you will have to parse all the files. The database approach will give you lower memory usage, but the first load time will be the same (actually it would be worse for the database, considering that database operations are slower than filling an in-memory model). Also take into consideration that data held in memory is smaller than the data in the XML files (no tag/formatting overhead). And note that SQLite is not advised for data sets larger than 1 GB.
Ivan
A: 

SQL Server 2005 and later allows creating XML indexes. The queries can be performed on the SQL Server side, without retrieving the XML data in the application. This feature is available in the free Express edition.
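A sketch of how that could look, assuming a table with an `xml` column; the connection string and the `/order/customer` structure in the XQuery are placeholders, and both the index and the `exist()` predicate are evaluated server-side:

```csharp
using System;
using System.Data.SqlClient;

class SqlServerXmlQuery
{
    static void Main()
    {
        // Placeholder connection string; point it at your Express instance.
        using (var conn = new SqlConnection(
            @"Server=.\SQLEXPRESS;Database=XmlStore;Integrated Security=true"))
        {
            conn.Open();

            // One-time setup: an xml column plus a primary XML index,
            // so queries run on the server without shipping XML back.
            new SqlCommand(
                "CREATE TABLE docs (id INT IDENTITY PRIMARY KEY, data XML);" +
                "CREATE PRIMARY XML INDEX ix_docs_data ON docs (data);",
                conn).ExecuteNonQuery();

            // Find documents matching an XQuery predicate; the path
            // /order[customer="Smith"] stands in for your real structure.
            var cmd = new SqlCommand(
                "SELECT id FROM docs WHERE data.exist('/order[customer=\"Smith\"]') = 1",
                conn);
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    Console.WriteLine(reader.GetInt32(0));
        }
    }
}
```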

lmsasu
But SQL Server is not an embedded database; you have to install it separately. I cannot do that.
codymanix
A: 

For indexing the contents of the XML: use Lucene (via Lucene.Net, a .NET-based implementation). This will allow you to quickly retrieve the XML documents that contain specific values; you can then pay closer attention to just those documents.
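
A sketch of the indexing step, assuming Lucene.Net 3.x. Indexing each leaf element under its own element name gives queries at least field-level semantics (e.g. `name:Smith` rather than a blind full-text match), which partly addresses the concern in the comment below:

```csharp
using System.IO;
using System.Linq;
using System.Xml.Linq;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

class XmlLuceneIndexer
{
    public static void BuildIndex(string xmlFolder, string indexFolder)
    {
        var dir = FSDirectory.Open(new DirectoryInfo(indexFolder));
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        using (var writer = new IndexWriter(
            dir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            foreach (var file in Directory.GetFiles(xmlFolder, "*.xml"))
            {
                var xml = XDocument.Load(file);
                var doc = new Document();
                // Store the path so a hit can be mapped back to its file.
                doc.Add(new Field("path", file,
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
                // Index each leaf element under its element name, so a
                // query can target a specific field instead of all text.
                foreach (var el in xml.Descendants().Where(e => !e.HasElements))
                    doc.Add(new Field(el.Name.LocalName, el.Value,
                        Field.Store.NO, Field.Index.ANALYZED));
                writer.AddDocument(doc);
            }
        }
    }
}
```

A search hit then returns only the stored path, and just those files need to be opened and examined in detail.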

lmsasu
Lucene allows full-text indexing, but doesn't take XML semantics into account.
codymanix