Quickest way to find substrings in text files

views:

200

answers:

Quickest way to find substrings in text files

What's the fastest way to find strings in text files ? Case scenario : Looking for a particular path in a text file with around 50000 file paths listed (each path has it's own line).

+2 A:

A file of that size should easily fit in memory and you can make it into a std::set (or even better a hashset, if you have a library of that at hand) with the paths as its items. Checking if an exact path is there will then be very fast.

If you need to look for sub-paths as well, a sorted std::vector (if you're looking for prefixes only) may be the only useful approach -- or if you're looking for completely general substrings of paths then you'll need to scan through all the vector anyway, but unless you have to do it a zillion times even that wouldn't be too bad.

Alex Martelli 2009-08-07 04:59:07

I doubt that this is the fastest way - its the simplest. If a particular path is searched the fastest way would be, to read each line, compare it to the searched path and abort as soon as having found the match. Everything else is overhead. Besides that a std::hash_set usually is much faster than std::set.

RED SOFT ADAIR 2009-08-07 11:26:06

Yep, I did recommend a hash set if you have a library with that at hand -- remember it's NOT in the C++ standard (yet) despite the standard-violating `std:` prefix some libraries use. Reading a few 100 KB's in one gulp is empirically faster (at least on multi-tasking systems with good FSs, disk caches, readahead etc) than mixing I/O and CPU work as you suggest -- today the cost of disk I/O is much more in seeks than in linear reads (100KB < 1msec), and the mixing is liable to allow context switches, causing seeks (as other processes will be looking elsewhere on the disk).

Alex Martelli 2009-08-07 16:07:41

I took the time and wrote a benchmark sample. You are wrong: Reading a 5MB file with 80000 Lines takes about 0.60s on a good machine including a strcmp for each line read. If i omit the strcmp and instead build a std::set the runtime increases to 0.75s.

RED SOFT ADAIR 2009-08-10 11:41:15

This is the very field for regular expressions; you should look into grep and awk.

Tordek 2009-08-07 05:49:41

+2 A:

Do you have to find one string once in the file, the same string repeatitly in several files, several strings in the same file?

Depending on the scenario, you have several possible answers.

building a data stucture (like the set proposed by Alex) is usefull if you have to find several strings in the same file
using an algorithm like Boyer-Moore is efficient if you have to search for one string
using a regular expression engine will probably be preferable if you have to search for several strings.

AProgrammer 2009-08-07 06:43:38

I am not sure the extent you would like to use search, but FSM are good options to use.

Here is the discussion: http://stackoverflow.com/questions/525004/short-example-of-regular-expression-converted-to-a-state-machine

2009-08-07 07:14:49

ansaurus

tags:

views:

answers:

Quickest way to find substrings in text files

related questions