views: 407
answers: 8
I am working on a C parser and am wondering how experts manage a large amount of text / strings (> 100 MB) stored in memory. The content is expected to be accessible at all times, at a fast pace. Background: Red Hat / gcc / libc.

A single char array would go out of bounds, causing a segmentation fault... any ideas or experience are welcome to share / discuss...

+9  A: 

mmap(2) the file into the VM, and just use that.

Will Hartung
I will give this a shot, thanks for the suggestion!
nikcname
PLEASE NOTE that if someone truncates the file while you have it mmap'd, you may get a SIGBUS. But this is still by far the best option.
Dietrich Epp
+4  A: 

"a single char array would be out of boundary causing segmentation fault" - I think this isn't right. A segmentation fault is caused by accessing protected memory, not by allocating too big a chunk. In any case, you should be able to allocate up 2-3GB on a 32-bit machine and much more on 64-bit.

You can use a char array, but if you want fast access then perhaps you need some sort of indexing on top of that.
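For example, here is a minimal sketch of that kind of index (the file name data.txt and the assumption that records are newline-delimited are mine, not the answer's): load the whole file into one heap buffer and record the offset of each line start, so finding line k becomes a single array lookup.

#include <stdio.h>
#include <stdlib.h>

/* Sketch: load a whole file into one heap buffer and build an index of
   line-start offsets for O(1) access to any line. Error handling is
   abbreviated. */
int main(void)
{
    FILE *f = fopen("data.txt", "rb");   /* hypothetical file name */
    if (!f) return 1;

    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);

    char *buf = malloc(size + 1);        /* one big heap buffer */
    if (!buf || fread(buf, 1, size, f) != (size_t)size) return 1;
    buf[size] = '\0';
    fclose(f);

    /* Index pass: record the offset of each line start. */
    size_t cap = 1024, nlines = 0;
    long *index = malloc(cap * sizeof *index);
    index[nlines++] = 0;
    for (long i = 0; i < size; i++) {
        if (buf[i] == '\n' && i + 1 < size) {
            if (nlines == cap)
                index = realloc(index, (cap *= 2) * sizeof *index);
            index[nlines++] = i + 1;
        }
    }

    /* Now line k starts at buf + index[k]. */
    printf("loaded %ld bytes, %zu lines\n", size, nlines);
    free(index);
    free(buf);
    return 0;
}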

Could you clarify your use case more? Are you trying to create a parser for the C language? Why do you expect such long input or output? Neither source nor binaries are commonly that big.

abc
You will see a segfault if you try to do `char foo[100000000]; foo[0] = 0` - the OP isn't totally wrong.
Jefromi
I guess he probably means "page fault"?
Tom Smith
@Jefromi: How about `char * foo; foo = malloc(10000000); foo[0] = 0;`, so we're allocating on the heap, not the stack?
David Thornley
That should behave better; I was just trying to explain why I thought the OP was probably saying segfault. Of course, I'm with everyone else on this - hopefully there is some much better way to store this data than a single block.
Jefromi
I was mallocing hundreds of megabytes (char array) in 1996/7 on a Windows 95 machine with 128 MB of RAM with no trouble at all. I had to access it carefully (not randomly) to avoid thrashing, but never got any complaints from the system. Voxel data, if you're curious.
phkahler
A: 

You could save a lot of space by compressing the tokens as you read them from the source stream (presumably a text file). Eliminating excess whitespace and comments as you read the input text could reduce your memory requirements by up to 50%.
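As a rough illustration of that idea (my own sketch, not code from this answer): collapse whitespace runs into a single space and replace each classic C comment with a space while copying the stream. A real lexer would also need to respect string literals and line comments.

#include <ctype.h>
#include <stdio.h>

/* Sketch: copy stdin to stdout, collapsing whitespace runs into one
   space and replacing each C comment with a single space. Only the
   classic comment form is handled; string literals are not. */
int main(void)
{
    int c, prev_space = 0;
    while ((c = getchar()) != EOF) {
        if (c == '/') {
            int d = getchar();
            if (d == '*') {
                /* consume the comment body */
                int last = 0;
                while ((c = getchar()) != EOF) {
                    if (last == '*' && c == '/')
                        break;
                    last = c;
                }
                c = ' ';                 /* a comment reads as one space */
            } else if (d != EOF) {
                ungetc(d, stdin);
            }
        }
        if (isspace(c)) {
            if (!prev_space)
                putchar(' ');
            prev_space = 1;
        } else {
            putchar(c);
            prev_space = 0;
        }
    }
    return 0;
}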

But I'm curious why you need to store so much in memory at once. String literals, identifiers, and symbol table entries can be cached to disk when you're at a point in parsing that makes them inaccessible or out of scope.

Loadmaster
Caching to disk isn't something you do manually on modern systems with VM.
Dietrich Epp
+1  A: 

Such a large amount of data is better stored as:

  1. A global array, if the data is going to be constant.
  2. On the heap (memory allocated dynamically), if globals are not allowed in your case.

But make it a point not to store it on the stack, lest it overflow and cause other problems.

If you are asking about specific data structures that can be used to store/access this data efficiently, then I suggest (a sketch of the first option follows the list):

  1. Hash table
  2. Array
  3. List
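As a minimal sketch of the hash-table option (the table size, hash function, and sample keys are hypothetical choices of mine, not goldenmean's): map strings to offsets within one big buffer using open addressing, so the table stays small and each string is stored only once.

#include <stdio.h>
#include <string.h>

/* Sketch: a tiny fixed-size open-addressing hash table that maps
   strings to offsets within one large buffer. The size and sample
   data are hypothetical; a real table would grow and handle deletion. */

#define TABLE_SIZE 1024              /* must be a power of two */

static const char *table_key[TABLE_SIZE];
static long table_off[TABLE_SIZE];

static unsigned long hash_str(const char *s)
{
    unsigned long h = 5381;          /* djb2 */
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

static void put(const char *key, long off)
{
    unsigned long i = hash_str(key) & (TABLE_SIZE - 1);
    while (table_key[i]) i = (i + 1) & (TABLE_SIZE - 1); /* linear probe */
    table_key[i] = key;
    table_off[i] = off;
}

static long get(const char *key)
{
    unsigned long i = hash_str(key) & (TABLE_SIZE - 1);
    while (table_key[i]) {
        if (strcmp(table_key[i], key) == 0) return table_off[i];
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return -1;                       /* not found */
}

int main(void)
{
    put("alpha", 0);                 /* offsets into some big buffer */
    put("beta", 128);
    printf("beta is at offset %ld\n", get("beta"));
    return 0;
}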
goldenmean
@goldenmean: how can one overflow the stack with data before causing a buffer overflow?
Rom
A: 

Sorry if it's a beginner problem; a segmentation fault appears with the following.

int a = 10000000; char content2[a]; content2[0] = 'a';

The use case: the file is generated daily in a structured plain-text format (similar to XML) before parsing. The data itself is quite static; I want to make it accessible as fast as I can, so I prefer to keep it in memory after it is parsed.

nikcname
I could be wrong, but I think this is caused by Linux's overcommit memory management policy. If you ask for memory it will succeed, and if you use the memory, then it is allocated. If the kernel runs out of memory to give to processes, it kills processes to reclaim the memory; see http://opsmonkey.blogspot.com/2007/01/linux-memory-overcommit.html
Justin Smith
@Justin: Most likely it is just that the array size exceeds the default allocated stack size. Looking in `man ld` on my Pentium 4 Linux box, I see that 4 MB is allocated by default, though this can be changed with the option `--stack`.
j_random_hacker
The max stack size on my Linux box defaults to 10 MiB. This is why this fails. Use malloc instead. Has nothing to do with overcommit, overcommit causes the OOM killer to fire up, it does NOT send SIGSEGV.
Dietrich Epp
Oh, I *was* wrong then, OK.
Justin Smith
+1  A: 

If you are allocating a > 100 MB char array on the stack, you'll most likely overflow the stack. While you could increase the stack size using compiler/linker options, this won't necessarily solve the problem, as some OSes expect approximately linear access to stack pages (google "stack guard pages").

Instead, if you know the size at compile time, try allocating a static char array instead. Better yet, use malloc(). (The code you posted declares an array whose size depends on the variable a -- this is called a "variable-length array", which is a C99 extension that not all compilers support. OTOH every C implementation lets you call malloc() to allocate memory dynamically.)
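To make the contrast concrete, here is a minimal sketch of both safe options; the ~100 MB figure simply mirrors the question, and the buffer names are mine.

#include <stdio.h>
#include <stdlib.h>

#define BIG (100 * 1024 * 1024)     /* ~100 MB, mirroring the question */

/* Option 1: static storage -- not on the stack, size fixed at
   compile time. */
static char static_buf[BIG];

int main(void)
{
    /* Option 2: heap allocation -- works in any C implementation and
       lets you check for failure. */
    char *heap_buf = malloc(BIG);
    if (!heap_buf) {
        perror("malloc");
        return 1;
    }

    static_buf[0] = 'a';            /* fine */
    heap_buf[0] = 'a';              /* fine */

    /* What the question did -- a ~10 MB variable-length array on the
       stack -- overflows a typical 8-10 MB stack limit:
       int a = 10000000; char content2[a];  (don't do this) */

    free(heap_buf);
    return 0;
}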

j_random_hacker
+2  A: 

It's a very unusual C parser that needs the source text (if that is what you are talking about) to be held in memory. Most parsers read the source effectively a token at a time and convert it immediately into some internal representation. And they typically hold the representation for only a single source file (plus #includes), which is highly unlikely to be as big as 100 MB - perhaps you have some design issues here?

anon
It's only the nice C parsers that store the source text -- they're the ones that give the nice error messages.
Dietrich Epp
@Dietrich You can produce pretty good error messages from the internal rep, though admittedly not the exact spacing. This is called re-creation. And even if you need to hang on to the original during a parse, you certainly don't need to hang on to it after you have parsed it.
anon
+3  A: 

mmap is the best way to deal with a large amount of data that is stored in a file, if you want random access to that data.

mmap tells the virtual memory system to map a contiguous portion of address space to contain the data found in the file. The virtual memory system will allocate a range of address space, backed by that file. When you access any location in that address space, it will allocate a page of physical memory, read that section of the file in from the disk, and point that portion of your virtual address space to the physical memory that it used to read the file. When it needs to make more room in physical memory, it will write out any changes to disk (if applicable), and remove the mapping of that section of virtual address space.

You would use it like so:

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h> /* the header where mmap is defined */
#include <fcntl.h>
#include <stdlib.h>   /* for exit */
#include <unistd.h>   /* for close */

int main(void)
{
  int file;
  char *contents;
  struct stat statbuf;
  off_t len;

  file = open("path/to/file", O_RDONLY);
  if (file < 0)
    exit(1); /* or otherwise handle the error */

  if (fstat(file, &statbuf) < 0)
    exit(1);

  len = statbuf.st_size;

  contents = mmap(0, len, PROT_READ, MAP_SHARED, file, 0);
  if (contents == MAP_FAILED)
    exit(1);

  // Now you can use contents as a pointer to the contents of the file

  // When you're done, unmap and close the file.
  munmap(contents, len);
  close(file);
  return 0;
}
Brian Campbell
I believe mmap should be the solution for my case. My situation is like an editor which requires parsing for matching at any time, e.g. vim; it's very usual to open a large file and parse back and forth.
nikcname