views:

136

answers:

5

hello
I'm trying to implement an algorithm to search multiple XML files for a certain record. known that the records are not sorted ( i don't have an indexed id) . what is the fastest algorithm to search for that record ?.
please let me know if anything was unclear
thanks in advance

+2  A: 

Without sorting linear search is your best bet. Think about it.

And as I said in the comments: it matters if you want to search one time or multiple times. Because then you may need to build an index. But if you search only once this would be useless.

galambalazs
A: 

Sequential line-by-line search comes to mind. Use multiple threads to take multiple files at once.

Adam
If they are all on the same disk device then the search will be most likely I/O-bound and the multiple threads will not avail much.
Vilx-
Very true, but you won't know where they are coming from or how big they are. Also, it depends if you stream the file in line-by-line or load it all into memory first and then parse.
Adam
+3  A: 

It really depends how often you want to perform the task on these files. If the records are unsorted, you can only search through them linearly. But if you have to do that more often on the same set of records, you can create an index, or sort them during the first run.

relet
+2  A: 

Everything you need to decide is here Sorting Algorithms

OddCore
+1  A: 

galambalazs is correct: Unsorted data means you have to go through it all to search for what you need. But that's only addressing a small part of the question.

In processing several files, probably most of your processing time will be taken up by file I/O. It takes a long time, by computer standards, to find a file in a directory and open it. But this is a cost you will incur basically regardless of which program you end up using.

Another part of the performance equation is the kind of parser you use. Depending on the structure of your XML, you have a choice of using a hand-written parser, a DOM XML parser or a Sax parser.

If the tags surrounding your sought data always occur on the same line as that data and no ambiguity is possible, then reading the file line-by-line and searching either by string search or regexp is a valid possibility. Many people on SO will protest that regexp matching is a horrible way to process XML and this is generally correct; it is a quick and dirty way to do searches in a very specific and limited set of cases, and is very brittle with respect to the XML structure you end up working with.

A DOM parser "inhales" your entire XML document into an in-memory structure, which your application then can search sequentially for whatever it is. DOMs are great when you want to do a number of complex operations on an XML tree; for a sequential search they are a horrible idea because

  • the amount of memory required is proportional to file size, so a large file could run you out of memory.
  • a large data structure has to be built from the file contents. After one search, it will be immediately discarded. Computing and memory resources will end up wasted.

Therefore, the most recommended approach would be to use a SAX parser. Googling will find you one for your favorite language. A SAX parser scans through your input file once, producing events at every element which you can (and must!) process in an appropriate way. Data is processed sequentially and there's no storage other than what you decide to do with the data you find. SAX parsers are usually dramatically faster than DOM parsers but need a little planning on how to process the events.

Carl Smotricz
Also, XPath can be used. Though, implementation details matters. E.g. default Java XPath implementation is based on DOM parser as far as I remember thus inheriting all its performance implications. But XPath is so expressive as to overweight performance on occasions =)
Rorick
Now that you mention it, a sensible and very "XML-y" way to do this might be to use XSLT to transform an XML input document into an arbitrary output document containing just the search strings. The appeal here is that it's quite possible to hook a Transformer to a SAX source, thus ensuring (probably?) that the input will only be processed sequentially. This would let you combine the expressiveness of XPath expressions for defining the search with the speed of SAX parsing.
Carl Smotricz