tags:

views:

1115

answers:

12

I have a large text file that I need to search for a specific string. Is there a fast way to do this without reading line by line? This method is extremely slow because of the size of the files (100mb+)

+3  A: 

Given the size of the files would you really want to read them entirely into memory beforehand? Line by line is likely to be the best approach here.

Sychare Jedko
+2  A: 

In all cases, you will have to go over all of the file.

Lookup Rabin-Karp string search or similar.

Pavel Radzivilovsky
Not necessarily every time you search it. If the same file is to be searched very many times, it would probably make sense to build an index of the file, so there would only need to be a single pass over the whole file to enable any number of rapid lookups.
Daniel Earwicker
+1  A: 

If you want to speed up the line-by-line reading you can create a queue-based application:
One thread reads the lines and enqeues them into a threadsafe Queue. A second one can then process the strings

winSharp93
A: 

I have a large text file that I need to search for a specific string. Is there a fast way to do this without reading line by line?

The only way to avoid searching across the entire file is to sort or organize the input beforehand. For example, if this is an XML file and you need to do many of these searches, it would make sense to parse the XML file into a DOM tree. Or if this is a list of words and you're looking for all the words which start with the letters "aero", it might make sense to sort the entire input first if you do a lot of that kind of searching on the same file.

John Feminella
A: 

The speed issue here could well be the speed taken to load the file into memory before performing the search. Try profiling your application to see where the bottleneck is. If it is loading the file you could try "chunking" the file load to that the file is streamed in small chunks and each chunk has the search performed on it.

obviously if the part of the string to be found is at the end of the file there will be no performance gain.

Sheff
+1  A: 

You could buffer a large amount of data from the file into memory at one time, up to whatever constraint you wish, than search it for the string. This would have the effect of reducing the number of reads on the file and would likely be a faster method, but would be more of a memory hog if you set the buffer size too high.

Chris Kannon
A: 

You should be able to read the file character by character matching each character in the search string until you reach the end of the search string in which case you have a match. If at any point the character you've read doesn't match the character you're looking for, reset the matched count to 0 and start again. For example (*pseudocode/not tested*):

byte[] lookingFor = System.Text.Encoding.UTF8.GetBytes("hello world");
int index = 0;
int position = 0;
bool matchFound = false;

using (FileStream fileStream = new FileStream(fileName, FileMode.Open))
{
  while (fileStream.ReadByte() == lookingFor[index])
  {
    index++;

    if (index == lookingFor.length) 
    {
       matchFound = true;
       position = File.position - lookingFor.length;
       break;
    }
  }
}

That is one of many algorithms you could use (although it may be off by one with the length check). It will only find the first match so you probably want to wrap the while loop in another loop to find multiple matches.

Also, one thing to note about reading the file line by line is that if the desired string to match spans lines you're not going to find it. If that's fine then you can search line by line but if you need search strings to span lines you'll want to use an algorithm like I detailed above.

Finally, if you're looking for best speed, which it sounds like you are, you'll want to migrate the code above to use a StreamReader or some other buffered reader.

Brian Hasden
+1  A: 

Is your project needing to search different files for the same or different string every time, or searching the same file for different strings every time?

If it's the latter, you could build an index of the file. But there's no point doing this if the file changes frequently, because building the index will be expensive.

To index a file for full text searching, you could use the Lucene.NET library.

http://incubator.apache.org/lucene.net/

Daniel Earwicker
A: 

If you're only looking for a specific string, I'd say line-by-line is the best and most efficient mechanism. On the other hand, if you're going to be looking for multiple strings, particularly at several different points in the application, you might want to look into Lucene.Net to create an index and then query the index. If this is a one-off run (i.e., you won't need to query the same file again later), you can create the index in a temporary file that will be cleaned up automatically by the system (usually boot time; or you can delete it yourself when your program exits). If you need to search the same file again later, you can save the index in a known location and get much better performance the second time around.

Dathan
A: 

Stick it into SQL Server 2005/2008 and use its full-text search capability.

ebpower
A: 

Here is my solution that uses a stream to read in one character at a time. I created a custom class to search for the value one character at a time until the entire value is found.

I ran some tests with a 100MB file saved on a network drive and the speed was totally dependent on how fast it could read in the file. If the file was buffered in Windows a search of the entire file took less than 3 seconds. Otherwise it could take anywhere from 7 seconds to 60 seconds, depending on network speed.

The search itself took less than a second if run against a String in memory and there were no matching characters. If a lot of the leading characters found matches the search could take a lot longer.

public static int FindInFile(string fileName, string value)
{   // returns complement of number of characters in file if not found
    // else returns index where value found
    int index = 0;
    using (System.IO.StreamReader reader = new System.IO.StreamReader(fileName))
    {
        if (String.IsNullOrEmpty(value))
            return 0;
        StringSearch valueSearch = new StringSearch(value);
        int readChar;
        while ((readChar = reader.Read()) >= 0)
        {
            ++index;
            if (valueSearch.Found(readChar))
                return index - value.Length;
        }
    }
    return ~index;
}
public class StringSearch
{   // Call Found one character at a time until string found
    private readonly string value;
    private readonly List<int> indexList = new List<int>();
    public StringSearch(string value)
    {
        this.value = value;
    }
    public bool Found(int nextChar)
    {
        for (int index = 0; index < indexList.Count; )
        {
            int valueIndex = indexList[index];
            if (value[valueIndex] == nextChar)
            {
                ++valueIndex;
                if (valueIndex == value.Length)
                {
                    indexList[index] = indexList[indexList.Count - 1];
                    indexList.RemoveAt(indexList.Count - 1);
                    return true;
                }
                else
                {
                    indexList[index] = valueIndex;
                    ++index;
                }
            }
            else
            {   // next char does not match
                indexList[index] = indexList[indexList.Count - 1];
                indexList.RemoveAt(indexList.Count - 1);
            }
        }
        if (value[0] == nextChar)
        {
            if (value.Length == 1)
                return true;
            indexList.Add(1);
        }
        return false;
    }
    public void Reset()
    {
        indexList.Clear();
    }
}
DRBlaise
+1  A: 

Fastest method for searching is Boyer-Moore algorithm This method not require to read all bytes from the files, but require random access to bytes. Also, this method is simple in realization.

Vitaly Dyatlov