tags:

views: 133

answers: 5

I need to parse a file that could be many GBs in size. I would like to do this in C. Can anyone suggest any methods to accomplish this?

The file that I need to open and parse is a hard drive dump that I get from my Mac's hard drive. However, I plan on running my program inside 64-bit Ubuntu 10.04. Also, given the large file size, the more optimized the method, the better.

+1  A: 

Depending on the Chomsky level of the grammar, there are several free and commercial toolkits for creating parsers for the file format. I think the real problem you have is how to 'handle' several GBs of data.

Do you want all of the data in memory simultaneously?
One way is to write parts of the file out to disk in temporary files when they are not in use. A simple fread/fwrite of structs, plus some clever ref-counted 'on demand' loading and writing, can do this.
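The simplest version of keeping only part of the data resident is to stream the file in fixed-size chunks, so memory use stays constant regardless of file size. A minimal sketch (the `process_chunk` callback and the 1 MiB chunk size are placeholders for whatever the real format needs):

```c
#include <stdio.h>
#include <stdlib.h>

#define CHUNK_SIZE (1 << 20)  /* 1 MiB resident at a time */

/* Read a stream chunk by chunk, handing each chunk to a caller-supplied
 * callback. Returns the total number of bytes read, or -1 on error. */
long long stream_chunks(FILE *fp,
                        void (*process_chunk)(const char *, size_t))
{
    char *buf = malloc(CHUNK_SIZE);
    if (!buf)
        return -1;

    long long total = 0;
    size_t n;
    while ((n = fread(buf, 1, CHUNK_SIZE, fp)) > 0) {
        if (process_chunk)
            process_chunk(buf, n);
        total += n;
    }
    free(buf);
    return ferror(fp) ? -1 : total;
}
```

This only covers sequential access; the ref-counted on-demand scheme described above would sit on top of something like this, evicting chunks back to temporary files when memory runs low.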

Vardhan Varma
A: 

Assuming you're on a linux/bsd/mac/notwindows 64-bit system (and seriously, who isn't these days?), mmap performs extremely well. It essentially lets you map a whole file into a process's address space and lets the kernel perform caching/paging for you.

And if you MUST use Windows, the same concept exists there too, courtesy of the friendly folks at Redmond. Note that for either of these, you will want to be running on a 64-bit system, as the absolute largest file you can map on a 32-bit system is ~4 GB.
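A minimal sketch of the POSIX mmap approach, assuming Linux/BSD/macOS; the newline-counting loop is just a stand-in for whatever parsing the dump format actually requires:

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map an entire file read-only and scan it; the kernel pages data in on
 * demand. Returns the number of newlines, or -1 on error. */
long long count_newlines_mmap(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {
        close(fd);
        return 0;
    }

    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after the fd is closed */
    if (data == MAP_FAILED)
        return -1;

    long long lines = 0;
    for (off_t i = 0; i < st.st_size; i++)
        if (data[i] == '\n')
            lines++;

    munmap(data, st.st_size);
    return lines;
}
```

On a 64-bit system the whole multi-GB dump can be mapped in one call; the address space is cheap because pages are only faulted in as they are touched.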

Clark Gaebel
-1 for "who isn't these days?" about 64-bit. I would wager 95-99% of real-world (as opposed to gamer-kiddie-world) PCs/workstations are 32-bit.
R..
Memory mapping is nice, but given the OP's simple question, I think he'd be looking for the most fundamental interface; performance hasn't been mentioned either.
Matt Joiner
Also note that it's possible to have files larger than can be mapped even on 64-bit systems, since current-generation 64-bit CPUs actually only have a 48-bit address space. 128TB or thereabouts may seem big now, but there are disk arrays on that order of size...
bdonlan
@R: I doubt he's planning on reading huge files on a common PC. It's most likely a server this is for.
Clark Gaebel
+4  A: 

On both *nix and Windows, there are extensions to the I/O routines that deal with file sizes that will support sizes larger than 2 GB or 4 GB. Naturally, the underlying file system must also support a file that large; on Windows, NTFS does, but FAT doesn't, for instance. This is generally known as "large file support".

The two routines that are most critical for these purposes are fseek() and ftell() so that you can do random access to the whole file. Otherwise, the ordinary fopen() and fread() and friends can do sequential access to any size of file as long as the underlying OS and stdio implementation support large files.
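A sketch of random access with large file support on a POSIX system, using the 64-bit-safe fseeko()/ftello() variants (with _FILE_OFFSET_BITS defined, off_t is 64 bits even on a 32-bit build):

```c
#define _POSIX_C_SOURCE 200112L  /* expose fseeko/ftello */
#define _FILE_OFFSET_BITS 64     /* must precede any #include */
#include <stdio.h>
#include <sys/types.h>

/* Report a file's size by seeking to its end. With large file support
 * enabled, this works past the 2 GB limit of plain long offsets. */
long long file_size(const char *path)
{
    FILE *fp = fopen(path, "rb");
    if (!fp)
        return -1;
    if (fseeko(fp, 0, SEEK_END) != 0) {
        fclose(fp);
        return -1;
    }
    off_t size = ftello(fp);
    fclose(fp);
    return (long long)size;
}
```

The same fseeko() call with a computed off_t offset gives random access anywhere in a multi-GB file; on Windows the analogous pair is _fseeki64()/_ftelli64().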

RBerteig
+1  A: 

In addition to RBerteig's and Matt's answers:

If you enable 64-bit IO support correctly and carefully for all the files in your project (the methods for which are system dependent), you don't have to worry about integer overflow, provided you use the correct types. off_t should then be the right choice for positioning your file pointer.

If all else fails, go with the exact-width C99 types whenever you make assumptions about a type's width. Using int or long is almost always the wrong thing to do; they are too compiler/platform dependent. Use int64_t (or int_fast64_t if you don't have it).
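To illustrate why the type matters: offset arithmetic on a multi-GB file silently overflows a 32-bit long, while int64_t is safe everywhere. The block/offset scheme below is just a hypothetical example:

```c
#include <stdint.h>

/* Compute the byte offset of a numbered fixed-size block in a huge file.
 * With int64_t the multiplication cannot overflow for any realistic file;
 * with a 32-bit long it would wrap past 2 GiB. */
int64_t block_offset(int64_t block_index, int64_t block_size)
{
    return block_index * block_size;  /* 64-bit arithmetic throughout */
}
```

For example, block 5000 of 1 MiB blocks sits at byte 5,242,880,000, which is already beyond what any 32-bit type can represent.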

Jens Gustedt
This answer is fairly misleading. Just using the right types is not enough; you need to use functions that can handle the offsets. On most unix systems (certainly on Linux) the default compilation environment has `off_t` defined as `long`, so on 32-bit machines it will only be 32-bit. You typically need to build your whole program with `-D_FILE_OFFSET_BITS=64` and use the `fseeko` and `ftello` functions to get sane behavior - and of course use the `off_t` type for your offset variables. BTW, even if you never perform any seeks, IO will fail on files larger than 2gb without large file support.
R..
@R: did you read the first words of my answer? Obviously the rest of what I say is not enough, but that was given in the answer of RBerteig. Perhaps my wording is not good enough, I am not a native speaker of English, but how would you have said something like that?
Jens Gustedt
Well, it's misleading because it doesn't mention how to enable 64-bit IO. Just using `off_t` without defining the proper macros won't work.
bdonlan
@bdonlan: as you mention in your comment to Matt's answer, there is no general rule to do this. But I will try to reformulate my answer to make this clearer.
Jens Gustedt
+1  A: 

Pass -D_FILE_OFFSET_BITS=64 to the compiler, or #define _FILE_OFFSET_BITS 64, for all relevant sources (preferably the entire project). This common macro is provided automatically by several common build systems. Then use off_t (which will be 64-bit now) wherever the API requires it.
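Because a single translation unit compiled without the macro silently falls back to a 32-bit off_t on 32-bit POSIX systems, a cheap sanity check can catch a missed definition early. A small sketch:

```c
#define _FILE_OFFSET_BITS 64  /* normally passed project-wide via -D */
#include <sys/types.h>

/* Verify at runtime that large file support is actually in effect:
 * with the macro defined, off_t must be 8 bytes wide. */
int off_t_is_64bit(void)
{
    return sizeof(off_t) == 8;
}
```

Calling this once at startup (or turning it into a static assertion) guards against one source file being built with different flags than the rest of the project.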

Matt Joiner
This applies for POSIX systems only.
bdonlan