fastest search algorithm

+2 A:

Without sorting linear search is your best bet. Think about it.

And as I said in the comments: it matters if you want to search one time or multiple times. Because then you may need to build an index. But if you search only once this would be useless.

galambalazs 2010-07-13 10:28:41

A:

Sequential line-by-line search comes to mind. Use multiple threads to take multiple files at once.

Adam 2010-07-13 10:28:56

If they are all on the same disk device then the search will be most likely I/O-bound and the multiple threads will not avail much.

Vilx- 2010-07-13 10:45:00

Very true, but you won't know where they are coming from or how big they are. Also, it depends if you stream the file in line-by-line or load it all into memory first and then parse.

Adam 2010-07-13 11:02:06

+3 A:

It really depends how often you want to perform the task on these files. If the records are unsorted, you can only search through them linearly. But if you have to do that more often on the same set of records, you can create an index, or sort them during the first run.

relet 2010-07-13 10:31:26

+2 A:

Everything you need to decide is here Sorting Algorithms

OddCore 2010-07-13 10:33:45

+1 A:

galambalazs is correct: Unsorted data means you have to go through it all to search for what you need. But that's only addressing a small part of the question.

In processing several files, probably most of your processing time will be taken up by file I/O. It takes a long time, by computer standards, to find a file in a directory and open it. But this is a cost you will incur basically regardless of which program you end up using.

Another part of the performance equation is the kind of parser you use. Depending on the structure of your XML, you have a choice of using a hand-written parser, a DOM XML parser or a Sax parser.

If the tags surrounding your sought data always occur on the same line as that data and no ambiguity is possible, then reading the file line-by-line and searching either by string search or regexp is a valid possibility. Many people on SO will protest that regexp matching is a horrible way to process XML and this is generally correct; it is a quick and dirty way to do searches in a very specific and limited set of cases, and is very brittle with respect to the XML structure you end up working with.

A DOM parser "inhales" your entire XML document into an in-memory structure, which your application then can search sequentially for whatever it is. DOMs are great when you want to do a number of complex operations on an XML tree; for a sequential search they are a horrible idea because

the amount of memory required is proportional to file size, so a large file could run you out of memory.
a large data structure has to be built from the file contents. After one search, it will be immediately discarded. Computing and memory resources will end up wasted.

Therefore, the most recommended approach would be to use a SAX parser. Googling will find you one for your favorite language. A SAX parser scans through your input file once, producing events at every element which you can (and must!) process in an appropriate way. Data is processed sequentially and there's no storage other than what you decide to do with the data you find. SAX parsers are usually dramatically faster than DOM parsers but need a little planning on how to process the events.

Carl Smotricz 2010-07-13 10:38:18

Also, XPath can be used. Though, implementation details matters. E.g. default Java XPath implementation is based on DOM parser as far as I remember thus inheriting all its performance implications. But XPath is so expressive as to overweight performance on occasions =)

Rorick 2010-07-13 12:29:28

Now that you mention it, a sensible and very "XML-y" way to do this might be to use XSLT to transform an XML input document into an arbitrary output document containing just the search strings. The appeal here is that it's quite possible to hook a Transformer to a SAX source, thus ensuring (probably?) that the input will only be processed sequentially. This would let you combine the expressiveness of XPath expressions for defining the search with the speed of SAX parsing.

Carl Smotricz 2010-07-13 13:15:47

ansaurus

tags:

views:

answers:

fastest search algorithm

related questions