I am working on a project where I search through a large text file (large is relative; this one is about 1 GB) for a piece of data. I am looking for a token, and I want the dollar value that appears immediately after it. For example,

this is the token 9,999,999.99

So here's how I am approaching this problem. After a little analysis it appears that the token is usually near the end of the file, so I thought I would start searching from the end. Here is the code I have so far (VB.NET):

    Dim sToken As String = "This is a token"
    Dim sr As New StreamReader(sFileName_IN)

    Dim FileSize As Long = GetFileSize(sFileName_IN)
    ' Integer division, so 1000 blocks never overshoot the start of the file.
    Dim BlockSize As Integer = CInt(FileSize \ 1000)
    Dim buffer(BlockSize - 1) As Char   ' VB arrays are declared by upper bound
    Dim Position As Long
    Dim sBuffer As String
    Dim CurrentBlock As Integer = 0
    Dim Value As Double

    Dim i As Integer

    Dim found As Boolean = False
    While Not found And CurrentBlock < 1000
        CurrentBlock += 1
        Position = -CLng(CurrentBlock) * BlockSize

        ' Seek the underlying stream, then flush the reader's stale buffer;
        ' without DiscardBufferedData the StreamReader keeps reading from
        ' wherever it left off, not from the new position.
        sr.BaseStream.Seek(Position, SeekOrigin.End)
        sr.DiscardBufferedData()

        i = sr.ReadBlock(buffer, 0, BlockSize)
        sBuffer = New String(buffer, 0, i)   ' use only the chars actually read

        found = SearchBuffer(sBuffer, sToken, Value)
    End While

GetFileSize is a function that returns the file size. SearchBuffer is a function that searches a string for the token and pulls out the value. I am not familiar with regular expressions but will explore them for that function.
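As a starting point, here is a rough, untested sketch of what SearchBuffer might look like with Regex (the pattern assumes the value is formatted like 9,999,999.99 and comes right after the token):

    ' Needs Imports System.Text.RegularExpressions at the top of the file.
    Function SearchBuffer(ByVal sBuffer As String, ByVal sToken As String, _
                          ByRef Value As Double) As Boolean
        ' Escape the literal token, then capture the number that follows it.
        Dim pattern As String = Regex.Escape(sToken) & "\s*([\d,]+(?:\.\d+)?)"
        Dim m As Match = Regex.Match(sBuffer, pattern)
        If Not m.Success Then Return False

        ' NumberStyles.Number accepts the thousands separators and decimal point.
        Return Double.TryParse(m.Groups(1).Value, _
                               System.Globalization.NumberStyles.Number, _
                               System.Globalization.CultureInfo.InvariantCulture, _
                               Value)
    End Function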

Basically, I read in a small chunk of the file and search it; if I don't find the token, I load another chunk, and so on...

Am I on the right track or is there a better way?

+2  A: 

I think you've got the right idea in chunking the file. You may want to break the chunks at line boundaries rather than at a fixed byte count, though. In your current implementation, if the token straddles a chunk boundary it gets cut in half, preventing you from finding it. The same thing could cut off the dollar value as well.
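Alternatively, you could overlap consecutive chunks by at least the length of the token plus the value. A rough, untested sketch using the names from your question (single-byte text assumed, since seeking a StreamReader mid-file can split multi-byte characters):

    ' Advance by a bit less than a full block so any match that straddles
    ' a boundary appears whole in at least one buffer.
    Dim Overlap As Integer = sToken.Length + 16   ' token plus room for the value
    Dim StepSize As Integer = BlockSize - Overlap

    Dim found As Boolean = False
    Dim CurrentBlock As Integer = 0
    While Not found And CLng(CurrentBlock) * StepSize < FileSize
        Dim Position As Long = -(CLng(CurrentBlock) * StepSize + BlockSize)
        If Position < -FileSize Then Position = -FileSize   ' clamp at start of file

        sr.BaseStream.Seek(Position, SeekOrigin.End)
        sr.DiscardBufferedData()
        Dim n As Integer = sr.ReadBlock(buffer, 0, BlockSize)

        found = SearchBuffer(New String(buffer, 0, n), sToken, Value)
        CurrentBlock += 1
    End While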

Bill the Lizard
A: 

If you're going to use chunks, it would be wise to use blocks that are a multiple of 512 bytes long, and to seek on 512-byte alignments, because that tends to be more efficient in accessing the disk (which is ultimately read in 512-byte blocks).

There may be other granularities even better than that, but 512 would be a good start.
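As a rough illustration with the variables from your question (untested; the alignment has to be computed from the start of the file, since the end of a file is rarely on a sector boundary):

    ' Round the block size up to a multiple of 512, then snap each seek
    ' position down to the nearest 512-byte boundary.
    Const Sector As Integer = 512
    Dim BlockSize As Integer = CInt(((FileSize \ 1000) + Sector - 1) \ Sector) * Sector

    ' Desired position of this chunk, measured from the start of the file.
    Dim Desired As Long = FileSize - CLng(CurrentBlock) * BlockSize
    If Desired < 0 Then Desired = 0
    Dim Aligned As Long = (Desired \ Sector) * Sector
    sr.BaseStream.Seek(Aligned, SeekOrigin.Begin)
    sr.DiscardBufferedData()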

Will Dean
A: 

If you wanted to do something more complicated but possibly faster, then you could look at reading the blocks asynchronously, so that you're searching one while the next is loading.

That way you get to perform the search at the same time as the data is chugging into memory.
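A rough, untested two-buffer sketch of that idea, using FileStream.BeginRead/EndRead and the names from the question (bounds checking omitted, and single-byte ASCII text assumed for brevity):

    ' Two buffers: search one block while the next is being read.
    Dim fs As New FileStream(sFileName_IN, FileMode.Open, FileAccess.Read)
    Dim bufA(BlockSize - 1) As Byte
    Dim bufB(BlockSize - 1) As Byte

    ' Prime the pipeline with the last block of the file.
    fs.Seek(-BlockSize, SeekOrigin.End)
    Dim n As Integer = fs.Read(bufA, 0, BlockSize)

    Dim found As Boolean = False
    Dim CurrentBlock As Integer = 0
    While Not found And CurrentBlock < 999
        CurrentBlock += 1
        fs.Seek(-CLng(CurrentBlock + 1) * BlockSize, SeekOrigin.End)
        Dim ar As IAsyncResult = fs.BeginRead(bufB, 0, BlockSize, Nothing, Nothing)

        ' Search the block we already have while the next one loads.
        found = SearchBuffer(System.Text.Encoding.ASCII.GetString(bufA, 0, n), _
                             sToken, Value)

        n = fs.EndRead(ar)   ' wait for the overlapped read to complete
        Dim tmp As Byte() = bufA : bufA = bufB : bufB = tmp   ' swap buffers
    End While
    fs.Close()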

I have to say though that unless your search is very expensive, disk read time will probably completely dominate this, and so complicated overlapping won't be worth the additional complexity.

Will Dean
A: 

Wait you people...

What if the token is broken between two chunks? Have you considered this?

Andrei Rinea