ansaurus

Question

Processing huge text files

Answer 1

A:

c based solution?

I think perl was born for this exact purpose.

Oren Mazor 2009-10-26 14:58:47

I would agree. Handling text files like this is realtively natural in Perl.

Martin York 2009-10-27 00:03:55

Then again, coding this solution in C++ is straightforward and easy (notwithstanding multithreading, which will probably pose the same problems in C++ and Perl).

Konrad Rudolph 2009-10-27 10:19:04

the idea that you need to use c++ to count instances of words in a file, however big, is bizarre to me.I mean no offense. I'm sure the solutions presented here are perfectly palatable to some people, but I'm old fashioned. 10 lines of perl will getter done.

Oren Mazor 2009-10-27 14:17:20

I don't know why this post gets down voted. Just because it is not a C solution? Like Oren said, if perl (or awk, ...) was used at beginning, the original problem was already resolved. The speed may not be as fast as a refined C version, but they are basically at the same complexity level.

Codism 2009-10-27 14:54:43

Answer 2

+13 A:

You need to look at 'The Practice of Programming' by Kernighan and Pike, and specifically chapter 3.

In C++, use a map based on the strings and a count (std::map<string,size_t>, IIRC). Read the file (once - it's too big to read more than once), splitting it into words as you go (for some definition of 'word'), and incrementing the count in the map entry for each word you find.

In C, you'll have to create the map yourself. (Or find David Hanson's "C Interfaces and Implementations".)

Or you can use Perl, or Python, or Awk (all of which have associative arrays, equivalent to a map).

Jonathan Leffler 2009-10-26 15:08:10

I wish I could double-upvote this answer.

jprete 2009-10-26 15:14:01

Depending on the contents of the 3gb file and how much memory you have, reading all of it into a map may be be too large to fit into memory when the memory overhead of a map is added in.

jthg 2009-10-26 15:14:49

There are about 100,000 words in the English language. Let's assume that the definition of 'word' doesn't do case-mapping, and catches punctuation, so that there are 5 variants on each word. Let's assume that on average, a word is 10 characters (overkill), and the map overhead is, oh, 22 bytes. Then we have 5 * 100,000 * 32 = 16 MB. What size computer is going to have problems with that?

Jonathan Leffler 2009-10-26 15:26:15

I am not sure it will. In the case where there are only a few thousands of words, the map won't be so big. So it really depends on the number of distinct words, not the file size.

Matthieu M. 2009-10-26 15:28:33

The English word list I have handy (I think it's the Moby list, but don't remember with certainty) shows around three hundred thousand words, but that's still not particularly difficult to deal with. I would not take punctuation into account though -- I'd strip it off before counting the words (except, perhaps, in the case of contractions). Most people don't want `x` and `x,` treated as two different words.

Jerry Coffin 2009-10-26 15:40:22

You guys have a very good point. I'm not sure what I was thinking. Maybe I was a little confused by the OP wanting to search through the entire file every single time - which sort of gave me the impression that the file actually contained on the order of 1gb of unique words.

jthg 2009-10-26 16:38:49

@JonathanI completely agree and I am grateful to you for introducing those two books. The only problem in your answer is how fast i can read through the file. Your answer is probably fine in counting the unique words using the map or the likes.But, the real question how fast i can traverse the file? This is the reason my proposed solution use threads, but that doesn't seems to be fine as per the below answers

Vadi 2009-10-27 07:22:01

I've done a very similar thing, except for one detail. My initial code indeed used a `std::map<std::string, T>` but once I got the bugs out I changed it to `std::map<StringShim, T>`. `StringShim` was a simple 4-byte class wrapping a `char*`; the actual strings were managed by a single `StringPool`. This was significantly more efficient.

MSalters 2009-10-27 09:58:21

@Vadi: threads won’t help you read the file faster *at all*, quite the contrary. Multithreading only makes sense when you can process work in parallel on multiple processors. But there still is only a single disk read head to read data from the hard drive.

Konrad Rudolph 2009-10-27 10:25:36

@Konrad: But I can use threads for traversing through the buffer as mentioned by Jerry Cofin below. Apparently, i need to count the repeated words so threads with more than one dedicated buffers could be more efficient.

Vadi 2009-10-27 10:33:23

FWIW: on a single disk machine, I did 'dd if=/dev/zero of=./junk bs=64k count=49152' in 39 seconds and 'cat ./junk >/dev/null' in 46. Those are about as simple, in terms of 'no processing' as you can get, and show transfer rates of 67-79 MB/s. You can do a lot of processing in that time without running into trouble. Consider memory mapping the whole file if you've got a 64-bit machine (you probably wouldn't get away with that on a 32-bit machine, though it might be worth a try). In the absence of better disk drives, the processing isn't going to be the bottleneck.

Jonathan Leffler 2009-10-27 14:04:38

Answer 3

+6 A:

I don't think using multiple threads that read parts of the file in parallel is going to help much. I would expect that this application is bound to the bandwidth and latency of your harddisk, not the actual word counting. Such a multi-threaded version might actually perform worse because "quasi-random" file access is typically slower than "linear file" access.

In case the CPU is really busy in a single-threaded version there might be a potential speed up. One thread could read the data in big chunks and put them into a queue of limited capacity. A bunch of other worker threads could operate each on their own chunk and count the words. After the counting worker threads finished you have to merge the word counters.

sellibitze 2009-10-26 15:09:14

I would call this a near-certainty. The CPU should be processing the bytes far faster than the disk can pull them off the platter, so there isn't really anything to parallelize.

jprete 2009-10-26 15:13:24

I concur. I might even take it one step further and say that even if the entire file is in memory, the CPU will still process the words faster than they can be read from memory.

jthg 2009-10-26 15:16:38

Disagree with the last statement. Reading the text from memory will trigger the CPU's prefetcher. That's bloody fast. The bottleneck will be the O(log N) random-access search for the word counter. They are unlikely to all fit in L2 cache.

MSalters 2009-10-27 10:01:00

Yes, the prefetcher helps a lot, but I was already assuming perfect prefetching. Just for the sake of seeing whether what I said was preposterous, I did a rough order-of-magnitude calculation: Memory bandwidth is around 10 gigabytes/second. Assuming that the cpu is a core 2 which can do one comparison per cycle (for a single core), that works out to about (8 bytes/cycle) * (2e9 cycles/second) = 16 gigabytes/second. The order of magnitude works out to be about the same. Anyone care to put in some more exact numbers?

jthg 2009-10-27 17:47:55

Answer 4

A:

stream has only one cursor. If you access to the stream with more than one thread at a time, you will not be sure to read where you want. Read is done from cursor position.

What I would do is to have only one thread (maybe the main one) that reads the stream and dispatch reading bytes to other threads.

By example:

Thread #i is ready and ask main thread to give it next part,
Main thread read next 1Mb and provide them to thread 1,
Thread #i read the 1Mb and count words as you want,
Thread #i finishes its work and ask again for the next 1Mb.

By this way you can separate stream reading to stream analysis.

Patrice Bernassola 2009-10-26 15:10:20

I don't think there's any value in messing with threading. This sort of task will absolutely be I/O bound. Your hard drive won't be able to feed data fast enough to load even a since core.

swillden 2009-10-26 15:22:46

Answer 5

A:

What you are looking for is RegEx. This Stackoverflow thread on c++ regex engines should help:

http://stackoverflow.com/questions/181624/c-what-regex-library-should-i-use

ryber 2009-10-26 15:11:12

I can't even imagine the horrors of trying to search a 3gb file via RegEx.

jthg 2009-10-26 15:17:33

Unless... the regex engine is optimized for stream processing.

jthg 2009-10-26 15:18:16

I have a program that regexs a that much data regularly and it's quite zippy.

ryber 2009-10-26 18:55:10

Answer 6

A:

First, I'm pretty sure that C/C++ isn't the best way to handle this. Ideally, you'd use some map/reduce for parallelism, too.

But, assuming your constraints, here's what I'd do.

1) Split the text file into smaller chunks. You don't have to do this by the first-letter of the word. Just break them up into, say, 5000-word chunks. In pseudocode, you'd do something like this:

index = 0

numwords = 0

mysplitfile = openfile(index-split.txt)

while (bigfile >> word)

mysplitfile << word

numwords ++

if (numwords > 5000)

    mysplitfile.close()

    index++

    mysplitfile = openfile(index-split.txt)

2) Use a shared map data structure and pthreads to spawn new threads to read each of the subfiles. Again, pseudocode:

maplock = create_pthread_lock()

sharedmap = std::map()

for every index-split.txt file:

spawn-new-thread(myfunction, filename, sharedmap, lock)

dump_map(sharedmap)

void myfunction(filename, sharedmap) {

localmap = std::map<string, size_t>();

file = openfile(filename)

while (file >> word)

    if !localmap.contains(word)
         localmap[word] = 0

    localmap[word]++

acquire(lock)
for key,value in localmap
    if !sharedmap.contains(key)
         sharedmap[key] = 0

    sharedmap[key] += value
release(lock)

}

Sorry for the syntax. I've been writing a lot of python lately.

spitzanator 2009-10-26 15:16:52

Using a lock is definitely not a good idea. You're killing parallelism. It is much simpler, if you want to go MT, to actually have each thread play with its own map and just merge them at the end.

Matthieu M. 2009-10-26 15:26:09

hay spitzanator, have you read Natural Language processing with python ?

zeroin23 2009-10-26 15:26:36

Can someone shed little light on why this is downvoted? Is this appropriate answer or as mentioned earlier reading disc with multiple thread are not effective? or just because of the pythonicpseudocode?

Vadi 2009-10-27 11:49:44

Answer 7

+1 A:

While you can use a second thread to analyze the data after reading it, you're probably not going to gain a huge amount by doing so. Trying to use more than one thread to read the data will almost certainly hurt speed rather than improving it. Using multiple threads to process the data is pointless -- processing will be many times faster than reading, so even with only one extra thread, the limit is going to be the disk speed.

One (possible) way to gain significant speed is to bypass the usual iostreams -- while some are nearly as fast as using C FILE*'s, I don't know of anything that's really faster, and some are substantially slower. If you're running this on a system (e.g. Windows) that has an I/O model that's noticeably different from C's, you can gain considerably more with a little care.

The problem is fairly simple: the file you're reading is (potentially) larger than the cache space you have available -- but you won't gain anything from caching, because you're not going to reread chunks of the file again (at least if you do things sensibly). As such, you want to tell the system to bypass any caching, and just transfer data as directly as possible from the disk drive to your memory where you can process it. In a Unix-like system, that's probably open() and read() (and won't gain you a whole lot). On Windows, that's CreateFile and ReadFile, passing the FILE_FLAG_NO_BUFFERING flag to CreateFile -- and it'll probably roughly double your speed if you do it right.

You've also gotten some answers advocating doing the processing using various parallel constructs. I think these are fundamentally mistaken. Unless you do something horribly stupid, the time to count the words in the file will be only a few milliseconds longer than it takes to simply read the file.

The structure I'd use would be to have two buffers of, say, a megabyte apiece. Read data into one buffer. Turn that buffer over to your counting thread to count the words in that buffer. While that's happening, read data into the second buffer. When those are done, basically swap buffers and continue. There is a little bit of extra processing you'll need to do in swapping buffers to deal with a word that may cross the boundary from one buffer to the next, but it's pretty trivial (basically, if the buffer doesn't end with white space, you're still in a word when you start operating on the next buffer of data).

As long as you're sure it'll only be used on a multi-processor (multi-core) machine, using real threads is fine. If there's a chance this might ever be done on a single-core machine, you'd be somewhat better off using a single thread with overlapped I/O instead.

Jerry Coffin 2009-10-26 15:34:06

Answer 8

+3 A:

First - decide on the datastructure for saving the words.

The obvious choice is the map. But perhaps a Trie would serve you better. In each node, you save the count for the word. 0 means, that it's only part of a word. You can insert into the trie using a stream and reading your file characterbased.

Second - multithreading yes or no? This one is not easy to answer. Depending on the size the datastructure grows and how you parallelize the answer may differ.

Singlethreaded - straitforward and easy to implement.
Multithreaded with multiple reader threads and one datastructur. Then you have to synchronize the access to the datastructure. In a Trie, you only need to lock the node you are actually in, so multiple readers can access the datastructure without much interference. A self-balancing tree might be different, especially when rebalancing.
Multithreaded with multiple reader threads, each with their own datastructure. Each thread builds it's own datastructure while reading a part of the file. After each one is finished, the results have to be combined (which should be easy).

One thing you have to think about - you have to find a word boundary for each thread to start, but that should not pose a great problem (e.g. each thread walks it's start until the first word boundary and starts there, at the end each thread finishes the word it's working on).

Tobias Langner 2009-10-26 16:10:51

Good summary of possibilities, and +1 for mentioning the trie as a non-obvious solution.

Konrad Rudolph 2009-10-27 10:21:45

Answer 9

+1 A:

As others have indicated, the bottleneck will be the disk I/O. I therefore suggest that you use overlapped I/O. This basically inverts the program logic. Instead of your code tyring to determine when to do I/O, you simply tell the Operating System to call your code whenever it has finished a bit of I/O. If you use I/O completion ports, you can even tell the OS to use multiple threads for processing the file chunks.

MSalters 2009-10-27 10:07:56

ansaurus

tags:

views:

answers:

Processing huge text files

related questions