ansaurus

Question

How to scan through really huge files on disk?

Answer 1

+5 A:

Although you can use memory mapping, you don't have to. If you read the file sequentially in small chunks, say 1 MB each, the file will never be present in memory all at once.

If your search code is actually slower than your hard disk, you can still hand chunks off to worker threads if you like.

Thomas 2010-01-31 12:21:11

I daresay that the only way that searching could be slower than reading from disk would be for truly pathological cases (e.g. searching for 999,999 `A` characters followed by a `B` in a file containing just 1,000,000 `A` characters) when using a naive search method (like that typically implemented for `strstr()`). For any linear-time string search (like Knuth-Morris-Pratt) disk I/O will be at least 100x slower.

j_random_hacker 2010-01-31 12:57:45

Yeah, that's why I wrote the "if" and "actually" in "If your search code is actually slower..." :)

Thomas 2010-01-31 16:10:56

Answer 2

+10 A:

Creating 20 threads, each supposing to handle some 100 MB of the file is likely to only worsen performance since The HD will have to read from several unrelated places at the same time.

HD performance is at its peak when it reads sequential data. So assuming your huge file is not fragmented, the best thing to do would probably be to use just one thread and read from start to end in chunks of a few (say 4) MB.

But what do I know. File systems and caches are complex. Do some testing and see what works best.

shoosh 2010-01-31 12:32:55

Answer 3

A:

Using a memory mapped file has the additional benefit of avoiding a copy from the filesystem cache memory to the (managed) application memory if you use a read-only view (although you have to use byte* pointers then to access the memory). And instead of creating many threads use one thread to sequentially scan through the file using for example 100MB memory mapped views into the file (don't map the entire file into the process address space at once).

Pent Ploompuu 2010-02-04 11:04:01

Answer 4

A:

I would do it with asynchronous reads into a double buffer. So when one buffer has been read from file, start reading the next buffer while scanning the first buffer. This means you do CPU and IO in parallel. Another advantage is that you will always have data around data boundaries. However I don't know if double buffering is possible with memory mapped files.

Hope this helps.

Regards,

Sebastiaan

Sebastiaan Megens 2010-02-04 11:25:39

Answer 5

+1 A:

I'd go with only one thread too, not only for HD performance issues, but because you might have trouble managing side effects when splitting your file : what if there's an occurrence of your pattern right where you split your file ?

Sylvestre Equy 2010-02-04 12:56:38

Answer 6

+2 A:

I would have one thread read the file (possibly as a stream) into an array and have another thread process it. I wouldnt map several at one time because of disk seeks. I would probably have a ManualResetEvent to tell my thread when the next ? bytes are ready to be processed. Assuming your process code is faster then the hdd i would have 2 buffers, one to fill and the other to process and just switch between them each time.

2010-02-04 20:52:42

Answer 7

A:

Tim Bray (and his readers) explored this in depth in his Wide Finder Project and Wide Finder 2. Benchmark results show that multithreaded implementations can outperform a single-threaded solution on a massive Sun multicore server. On usual PC hardware, multithreading won't gain you that much, if at all.

oefe 2010-02-06 17:11:57

Answer 8

+2 A:

Multithreading is only going to make this go slower unless you want to scan multiple files with each on a different hard drive. Otherwise you are just going to seek.

I wrote a simple test function using memory mapped files, with a single thread a 1.4 Gb file took about 20 seconds to scan. With two threads, each taking half the file (even 1MB chunks to one thread, odd to the other), it took more than 80 seconds.

1 thread: 20015 milliseconds
2 threads: 83985 milliseconds

That's right, 2 threads was Four times slower than 1 thread!

Here's the code I used, this is the single threaded version, I used a 1 byte scan pattern, so the code to locate matches that straddle map boundaries is untested.

HRESULT ScanForPattern(LPCTSTR pszFilename, LPBYTE pbPattern, UINT cbPattern, LONGLONG * pcFound)
{
   HRESULT hr = S_OK;

   *pcFound = 0;
   if ( ! pbPattern || ! cbPattern)
      return E_INVALIDARG;

   //  Open the file
   //
   HANDLE hf = CreateFile(pszFilename,
                          GENERIC_READ,
                          FILE_SHARE_READ, NULL,
                          OPEN_EXISTING,
                          FILE_FLAG_SEQUENTIAL_SCAN,
                          NULL);

   if (INVALID_HANDLE_VALUE == hf)
      {
      hr = HRESULT_FROM_WIN32(ERROR_FILE_NOT_FOUND);
      // catch an open file that exists but is in use
      if (ERROR_SHARING_VIOLATION == GetLastError())
         hr = HRESULT_FROM_WIN32(ERROR_SHARING_VIOLATION);
      return hr;
      }

   // get the file length
   //
   ULARGE_INTEGER  uli;
   uli.LowPart = GetFileSize(hf, &uli.HighPart);
   LONGLONG cbFileSize = uli.QuadPart;
   if (0 == cbFileSize)
      {
      CloseHandle (hf);
      return S_OK;
      }

   const LONGLONG cbStride = 1 * 1024 * 1024; // 1 MB stride.
   LONGLONG cFound  = 0;
   LPBYTE   pbGap = (LPBYTE) malloc(cbPattern * 2);

   //  Create a mapping of the file.
   //
   HANDLE hmap = CreateFileMapping(hf, NULL, PAGE_READONLY, 0, 0, NULL);
   if (NULL != hmap)
      {
      for (LONGLONG ix = 0; ix < cbFileSize; ix += cbStride)
         {
         uli.QuadPart = ix;
         UINT cbMap = (UINT) min(cbFileSize - ix, cbStride);
         LPCBYTE pb = (LPCBYTE) MapViewOfFile(hmap, FILE_MAP_READ, uli.HighPart, uli.LowPart, cbMap);
         if ( ! pb)
            {
            hr = HRESULT_FROM_WIN32(GetLastError());
            break;
            }
         // handle pattern scanning over the gap.
         if (cbPattern > 1 && ix > 0)
            {
            CopyMemory(pbGap + cbPattern - 1, &pb[0], cbPattern - 1);
            for (UINT ii = 1; ii < cbPattern; ++ii)
               {
               if (pb[ii] == pbPattern[0] && 0 == memcmp(&pb[ii], pbPattern, cbPattern))
                  {
                  ++cFound; 
                  // advance by cbPattern-1 to avoid detecting overlapping patterns
                  }
               }
            }

         for (UINT ii = 0; ii < cbMap - cbPattern + 1; ++ii)
            {
            if (pb[ii] == pbPattern[0] && 
                ((cbPattern == 1) || 0 == memcmp(&pb[ii], pbPattern, cbPattern)))
               {
               ++cFound; 
               // advance by cbPattern-1 to avoid detecting overlapping patterns
               }
            }
         if (cbPattern > 1 && cbMap >= cbPattern)
            {
            // save end of the view in our gap buffer so we can detect map-straddling patterns
            CopyMemory(pbGap, &pb[cbMap - cbPattern + 1], cbPattern - 1);
            }
         UnmapViewOfFile(pb);
         }

      CloseHandle (hmap);
      }
   CloseHandle (hf);

   *pcFound = cFound;
   return hr;
}

John Knoeller 2010-02-10 05:33:31

ansaurus

tags:

views:

answers:

How to scan through really huge files on disk?

related questions