Hi,

I am doing a text search in a rather big txt file (100k lines, about 7 MB). The text is not that big, but I need to do a lot of searches. I want to look for a target string and return the line where it appears. My text file is formatted so that the target can only appear on one line.

What is the most efficient way? I do a lot of searches, so I want to improve speed. Here is my code right now:

import os

def lookup_line(target):
    # Returns the line containing the target, or None if it doesn't exist.
    path = os.path.join(os.path.dirname(__file__), 'file.txt')
    f = open(path, 'r')
    line = None
    while line is None:
        l = f.readline()
        l = unicode(l, 'utf-8')
        if target in l:
            break
        if l == '':
            break  # happens at end of file, then stop loop
    line = l
    if line == '':
        line = None  # end of file, nothing has been found
    f.close()
    return line

I use this Python code in a Google App Engine app.

Thanks!

+1  A: 

First, don't explicitly decode the bytes yourself; use a file object that decodes for you:

from io import open

Second, consider things like this.

def lookup_line(path, target):
    with open(path, 'r', encoding='UTF-8') as src:
        found = None
        for line in src:
            if len(line) == 0:
                break  # happens at end of file, then stop loop
            if target in line:
                found = line
                break
        return found

This can be simplified slightly to use return None or return line instead of break. It should run a hair faster, but it's slightly harder to make changes when there are multiple returns.
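Taking that suggestion, the multiple-return variant might look like this (wrapped in a function; path and target are assumed to be supplied by the caller):

```python
from io import open  # on Python 2, provides an open() that accepts encoding=

def lookup_line(path, target):
    # Return the first line containing target, or None if no line matches.
    with open(path, 'r', encoding='UTF-8') as src:
        for line in src:
            if target in line:
                return line
    return None
```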

S.Lott
+6  A: 
  1. Load the whole text in RAM at once. Don't read line by line.
  2. Search for the pattern in the blob. If you find it, use text.count('\n',0,pos) to get the line number.
  3. If you don't need the line number, look for the previous and next EOL to cut the line out of the text.

An explicit line-by-line loop in Python is slow; the built-in string search runs in C and is very fast. If you need to look for several strings at once, use regular expressions.

If that's not fast enough, use an external program like grep.
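A sketch of the three steps above (the file contents are assumed to already be loaded into `text` with a single read()):

```python
def find_line(text, target):
    """Return (line, line_number) for the line containing target, or (None, None)."""
    pos = text.find(target)                 # fast C-level substring search
    if pos == -1:
        return None, None
    lineno = text.count('\n', 0, pos)       # step 2: newlines before the hit
    start = text.rfind('\n', 0, pos) + 1    # step 3: previous EOL (0 on the first line)
    end = text.find('\n', pos)              # step 3: next EOL
    if end == -1:
        end = len(text)                     # target is on the last, unterminated line
    return text[start:end], lineno
```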

Aaron Digulla
+3  A: 

If you are searching the same text file over and over, consider indexing the file. For example, create a dictionary that maps each word to which lines it's on. This will take a while to create, but will then make searches O(1).
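A minimal sketch of such an index, assuming whole-word lookups and that the file's lines are already held in a list:

```python
from collections import defaultdict

def build_index(lines):
    """Map each word to the set of line numbers it appears on."""
    index = defaultdict(set)
    for lineno, line in enumerate(lines):
        for word in line.split():
            index[word].add(lineno)
    return index

lines = ["the quick brown fox", "jumped over", "the lazy dog"]
index = build_index(lines)
```

Building the index is a single pass over the file; after that, each word lookup is a dictionary access, i.e. O(1) on average.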

If you are searching different text files, or can't index the file for some reason, you probably won't get any faster than the KMP algorithm.
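For reference, a from-scratch KMP sketch; note that in CPython the built-in substring operators (in, str.find) are implemented in C and will usually beat a pure-Python KMP in practice:

```python
def kmp_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    # Build the failure table: fail[i] is the length of the longest proper
    # prefix of pattern[:i+1] that is also a suffix of it.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text, reusing already-matched prefix lengths on mismatch,
    # so the text pointer never moves backwards.
    k = 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1
```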

EDIT: The index I described will only work for single word searches, not multi-word searches. If you want to search for multiple words (any string) then you probably won't be able to index it.

Niki Yoshiuchi
Good suggestion; you can write an algorithm that does multi-word searches off a single-word index. A multi-word index would most likely be a waste of time. You can also store the character position of each word boundary in the index. Regexes would make this a trivial task.
marr75
Good point. At the very least it will be easy to determine if a line contains all the words in the sentence. However I don't think that searches on parts of words ("uick brown fo" for example) will be indexable in a meaningful way.
Niki Yoshiuchi