Hello,
I'm trying to implement an algorithm to search multiple XML files for a certain record.
Note that the records are not sorted (I don't have an indexed ID).
What is the fastest algorithm to search for that record?
Please let me know if anything is unclear.
Thanks in advance.
Without sorting, linear search is your best bet. Think about it.
And as I said in the comments: it matters whether you want to search once or many times. If you will search repeatedly, you may want to build an index; if you search only once, building one would be useless.
Sequential line-by-line search comes to mind. Use multiple threads to process multiple files at once.
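A minimal sketch of that idea in Python, assuming the record can be recognized by a substring match on a single line; the file names and the needle are hypothetical placeholders. (In CPython, file I/O releases the GIL, so threads overlap the reads even though the matching itself is serialized.)

    from concurrent.futures import ThreadPoolExecutor

    def search_file(path, needle):
        """Linear scan: return the first matching line in `path`, or None."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                if needle in line:
                    return path, line.strip()
        return None

    files = ["records1.xml", "records2.xml", "records3.xml"]  # hypothetical
    with ThreadPoolExecutor(max_workers=4) as pool:
        for hit in pool.map(lambda p: search_file(p, "<id>42</id>"), files):
            if hit:
                print(hit)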
It really depends on how often you want to perform the task on these files. If the records are unsorted, you can only search through them linearly. But if you have to do that more than once on the same set of records, you can create an index, or sort them during the first run.
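If you do go the indexing route, a minimal sketch of a one-time index built on the first pass, assuming each record carries a unique <id> child element (the element names and file names here are assumptions, not from the question):

    import xml.etree.ElementTree as ET

    def build_index(paths):
        """One O(n) pass: map record id -> (file, element) for O(1) lookups later."""
        index = {}
        for path in paths:
            tree = ET.parse(path)
            for record in tree.iter("record"):   # assumed element name
                rec_id = record.findtext("id")   # assumed child element
                if rec_id is not None:
                    index[rec_id] = (path, record)
        return index

    index = build_index(["records1.xml", "records2.xml"])  # hypothetical files
    print(index.get("42"))  # O(1) after the one-time build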
galambalazs is correct: Unsorted data means you have to go through it all to search for what you need. But that's only addressing a small part of the question.
In processing several files, probably most of your processing time will be taken up by file I/O. It takes a long time, by computer standards, to find a file in a directory and open it. But this is a cost you will incur basically regardless of which program you end up using.
Another part of the performance equation is the kind of parser you use. Depending on the structure of your XML, you have a choice of using a hand-written parser, a DOM XML parser or a SAX parser.
If the tags surrounding your sought data always occur on the same line as that data and no ambiguity is possible, then reading the file line by line and searching with a plain string search or a regexp is a valid possibility. Many people on SO will protest that regexp matching is a horrible way to process XML, and they are generally correct; it is a quick and dirty technique for a very specific and limited set of cases, and it is very brittle with respect to the XML structure you end up working with.
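For illustration, a quick-and-dirty sketch of that line-by-line approach, assuming the sought element and its value always share one line; the tag name, value, and file name are hypothetical:

    import re

    # Assumed record field; adapt the tag and value to the real schema.
    pattern = re.compile(r"<name>\s*John Smith\s*</name>")

    def grep_xml(path, pattern):
        """Linear scan: return (line number, line) of the first match, or None."""
        with open(path, encoding="utf-8") as f:
            for lineno, line in enumerate(f, 1):
                if pattern.search(line):
                    return lineno, line.strip()
        return None

    print(grep_xml("records1.xml", pattern))  # hypothetical file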
A DOM parser "inhales" your entire XML document into an in-memory structure, which your application can then search for whatever you are looking for. DOMs are great when you want to perform a number of complex operations on an XML tree; for a single sequential search they are a horrible idea (a sketch for contrast follows the list below) because
- the amount of memory required is proportional to file size, so a large file could run you out of memory.
- a large data structure has to be built from the file contents. After one search, it will be immediately discarded. Computing and memory resources will end up wasted.
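For contrast, here is what that looks like in practice; this is a minimal sketch of the very technique being advised against for one-off searches, and the element names, id, and file name are assumptions:

    import xml.dom.minidom

    def dom_search(path, wanted_id):
        """Parse the WHOLE file into memory, then walk the tree for one record."""
        dom = xml.dom.minidom.parse(path)                  # memory grows with file size
        for record in dom.getElementsByTagName("record"):  # assumed element name
            ids = record.getElementsByTagName("id")        # assumed child element
            if ids and ids[0].firstChild and ids[0].firstChild.nodeValue == wanted_id:
                return record.toxml()
        return None                                        # entire tree discarded after one search

    print(dom_search("records1.xml", "42"))  # hypothetical file and id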
Therefore, the recommended approach is to use a SAX parser. Googling will find you one for your favorite language. A SAX parser scans through your input file once, producing events at every element, which you can (and must!) process in an appropriate way. Data is processed sequentially, and there is no storage other than what you decide to keep from the data you find. SAX parsers are usually dramatically faster than DOM parsers but require a little planning on how to process the events.
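A minimal SAX sketch in Python, assuming records look like <record><id>42</id>...</record>; the element names and the target id are assumptions, and the early-exit exception is a common idiom since SAX has no built-in way to stop parsing:

    import xml.sax

    class _Found(Exception):
        """Raised to abort parsing as soon as the record is found."""

    class RecordFinder(xml.sax.ContentHandler):
        def __init__(self, wanted_id):
            super().__init__()
            self.wanted_id = wanted_id
            self.in_id = False
            self.text = ""

        def startElement(self, name, attrs):
            if name == "id":              # assumed element name
                self.in_id = True
                self.text = ""

        def characters(self, content):
            if self.in_id:
                self.text += content      # element text may arrive in chunks

        def endElement(self, name):
            if name == "id":
                self.in_id = False
                if self.text.strip() == self.wanted_id:
                    raise _Found          # skip scanning the rest of the file

    def sax_search(path, wanted_id):
        """Single streaming pass; memory use stays constant regardless of file size."""
        try:
            xml.sax.parse(path, RecordFinder(wanted_id))
        except _Found:
            return True
        return False

    print(sax_search("records1.xml", "42"))  # hypothetical file and id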