views: 1007
answers: 6

In our new project we have to provide search functionality to retrieve data from hundreds of XML files. A brief outline of our current plan is below; I would like to hear your suggestions/improvements on it.

These XML files contain personal information, and the search is based on 10 elements in them, for example last name, first name, email, etc. Our current plan is to create a master XmlDocument with all the searchable data and a key to the actual file, so that when the user searches, we first look at the master file and get the results. We will also cache the actual XML files from recent searches so that similar searches later can be handled quickly.

Our application is a .NET 2.0 web application.
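
To make the plan concrete, here is a rough sketch of the lookup we have in mind (master.xml, the element names, and the file attribute are placeholders, not a settled schema):

```csharp
using System;
using System.Xml;

class MasterIndexSearch
{
    static void Main()
    {
        // The master document would hold one <person> entry per source
        // file: the ~10 searchable elements plus a "file" attribute that
        // keys back to the actual xml file.
        XmlDocument master = new XmlDocument();
        master.Load("master.xml");

        // XPath lookup against the master index. Real code would escape
        // the search term instead of concatenating it into the expression.
        string term = "Smith";
        XmlNodeList hits = master.SelectNodes(
            "/index/person[lastName='" + term + "']");

        foreach (XmlNode hit in hits)
        {
            // Load (or pull from cache) only the files that matched.
            Console.WriteLine(hit.Attributes["file"].Value);
        }
    }
}
```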

+4  A: 

First: how big are the XML files? XmlDocument doesn't scale to "huge"... but can handle "large" OK.

Second: can you perhaps put the data into a regular database structure (perhaps SQL Server Express Edition), index it, and access it via regular T-SQL? That will usually outperform an XPath search. Equally, if it is structured, SQL Server 2005 and above supports the xml data type, which shreds the data - this allows you to index and query xml data in the database without having the entire DOM in memory (it translates XPath into relational queries).
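
Something like this for the relational route, from .NET 2.0 (a minimal sketch; the PersonIndex table and its columns are invented for the example):

```csharp
using System;
using System.Data.SqlClient;

class RelationalSearch
{
    static void Main()
    {
        // Assumed schema: PersonIndex(LastName, FirstName, Email, ...,
        // FilePath), with ordinary indexes on the searchable columns and
        // FilePath keying back to the source xml file.
        using (SqlConnection conn = new SqlConnection(
            @"Server=.\SQLEXPRESS;Database=People;Integrated Security=true"))
        using (SqlCommand cmd = new SqlCommand(
            "SELECT FilePath FROM PersonIndex WHERE LastName = @lastName",
            conn))
        {
            cmd.Parameters.AddWithValue("@lastName", "Smith");
            conn.Open();
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine(reader.GetString(0));
                }
            }
        }
    }
}
```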

Marc Gravell
I second this. Toss that info in a db. Sure it might take some time, but that's what they're made for.
SnOrfus
#First: File size is 4-5 KB. We are working on a subset of the actual data. Some of the files can grow up to 100+ KB.
gk
#Second: Are you suggesting we shred the data and store it as columns, or store the xml in a column and set indexes on the xml data? We avoided the former because the table could become quite big and would be tough to maintain.
gk
That counts as small, so any approach should work OK. You could scan raw files very quickly at this scale. If the schema makes using regular columns tricky, then sure: an xml table (**correctly indexed**) can work well... search for primary/secondary XML indexing in SQL Server.
Marc Gravell
+1  A: 

If you can store the data in a SQL Server database then you could make use of SQL Server's built-in XPath query functionality.
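
For example, a hedged sketch of querying a typed xml column with XQuery's exist() from .NET (the PersonDocs table and the element names are made up):

```csharp
using System;
using System.Data.SqlClient;

class XmlColumnSearch
{
    static void Main()
    {
        // Assumed setup, created once in T-SQL:
        //   CREATE TABLE PersonDocs (Id INT PRIMARY KEY, Doc XML);
        //   CREATE PRIMARY XML INDEX PXML_Doc ON PersonDocs(Doc);
        // The exist() predicate below is evaluated by SQL Server against
        // the XML index rather than by loading a DOM per row.
        using (SqlConnection conn = new SqlConnection(
            "Server=.;Database=People;Integrated Security=true"))
        using (SqlCommand cmd = new SqlCommand(
            "SELECT Id FROM PersonDocs " +
            "WHERE Doc.exist('/person/lastName[. = sql:variable(\"@ln\")]') = 1",
            conn))
        {
            cmd.Parameters.AddWithValue("@ln", "Smith");
            conn.Open();
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine(reader.GetInt32(0));
                }
            }
        }
    }
}
```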

Dave Barker
+1  A: 

Hmm, sounds like you're building a database on top of XML. For performance I'd read those files into the DB of your choice and let it handle indexing and searching for you. If that's not an option, get really good with XPath, or roll your own exhaustive search using XmlReader (see the sketch below).

XML is not the answer to every problem; however clean it appears to be, performance will suck.
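
If it comes to that, a minimal sketch of the XmlReader approach (element name and folder are invented):

```csharp
using System;
using System.IO;
using System.Xml;

class ExhaustiveSearch
{
    // Streams one file with XmlReader instead of building a DOM; returns
    // true if any <lastName> element matches the term (assumes the
    // element contains plain text, no child elements).
    static bool FileMatches(string path, string term)
    {
        using (XmlReader reader = XmlReader.Create(path))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element &&
                    reader.Name == "lastName")
                {
                    // ReadElementContentAsString returns the text and
                    // advances past the element, so no extra Read() here.
                    if (string.Equals(reader.ReadElementContentAsString(),
                                      term, StringComparison.OrdinalIgnoreCase))
                        return true;
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return false;
    }

    static void Main()
    {
        foreach (string path in Directory.GetFiles(@"C:\data", "*.xml"))
        {
            if (FileMatches(path, "Smith"))
                Console.WriteLine(path);
        }
    }
}
```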

MrTelly
A: 

Why don't you store the searchable data in a database table with a key to the actual file? Your search would then run against the database table rather than the XML files. I suppose this would be faster because you can index the table.

Nahom Tijnam
+1  A: 

Index your XML files. Look into http://www.dotlucene.net/

I recently used it at my previous job to cache our SQL database for fast searching, with very little overhead.

It provides fast searching of content inside XML files (all depending on how you organize your cache).

Very easy and straightforward to use.

Much easier than trying to loop through a bunch of files.
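
Roughly like this, going from memory of the old dotLucene/Lucene.Net API (field names and paths are made up):

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

class LuceneSketch
{
    static void Main()
    {
        // Index: one Lucene document per xml file, storing the file path
        // and indexing the searchable fields extracted from the xml.
        IndexWriter writer = new IndexWriter(
            "index-dir", new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.Add(new Field("path", @"C:\data\person1.xml",
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.Add(new Field("lastName", "Smith",
                          Field.Store.NO, Field.Index.TOKENIZED));
        writer.AddDocument(doc);
        writer.Close();

        // Search, then map hits back to the original files.
        IndexSearcher searcher = new IndexSearcher("index-dir");
        QueryParser parser = new QueryParser("lastName", new StandardAnalyzer());
        Hits hits = searcher.Search(parser.Parse("Smith"));
        for (int i = 0; i < hits.Length(); i++)
        {
            Console.WriteLine(hits.Doc(i).Get("path"));
        }
        searcher.Close();
    }
}
```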

Gautam
A: 

Look at http://www.15seconds.com/issue/010410.htm