views:

489

answers:

9

Problem: I have a huge raw text file (assume 3 GB). I need to go through each word in the file and find out how many times each word appears in the file.

My Proposed Solution: Split the huge file into multiple files, with each split file holding words in sorted order. For example, all the words starting with "a" will be stored in an "*a.dic" file. So, at any time, we will never have more than 26 files.

The problem with this approach is:

I can use streams to read the file, but I wanted to use threads to read certain parts of the file. For example, read bytes 0-1024 with a separate thread (at least 4-8 threads, based on the number of processors in the box). Is this possible, or am I dreaming?

Any better approach?

Note: It should be a pure C++ or C based solution. No databases etc. are allowed.

A: 

c based solution?

I think Perl was born for this exact purpose.

Oren Mazor
I would agree. Handling text files like this is relatively natural in Perl.
Martin York
Then again, coding this solution in C++ is straightforward and easy (notwithstanding multithreading, which will probably pose the same problems in C++ and Perl).
Konrad Rudolph
The idea that you need to use C++ to count instances of words in a file, however big, is bizarre to me. I mean no offense. I'm sure the solutions presented here are perfectly palatable to some people, but I'm old fashioned. 10 lines of Perl will get 'er done.
Oren Mazor
I don't know why this post gets downvoted. Just because it is not a C solution? Like Oren said, if Perl (or awk, ...) had been used from the beginning, the original problem would already be resolved. The speed may not be as fast as a refined C version, but they are basically at the same complexity level.
Codism
+13  A: 

You need to look at 'The Practice of Programming' by Kernighan and Pike, and specifically chapter 3.

In C++, use a map based on the strings and a count (std::map<string,size_t>, IIRC). Read the file (once - it's too big to read more than once), splitting it into words as you go (for some definition of 'word'), and incrementing the count in the map entry for each word you find.

In C, you'll have to create the map yourself. (Or find David Hanson's "C Interfaces and Implementations".)

Or you can use Perl, or Python, or Awk (all of which have associative arrays, equivalent to a map).
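A minimal single-pass sketch of that approach, taking whitespace-delimited tokens as the definition of 'word' (no case folding or punctuation stripping):

    #include <cstddef>
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <string>

    int main(int argc, char* argv[])
    {
        if (argc != 2) { std::cerr << "usage: wordfreq <file>\n"; return 1; }

        std::ifstream in(argv[1]);
        std::map<std::string, std::size_t> counts;
        std::string word;

        while (in >> word)        // whitespace-delimited token
            ++counts[word];       // operator[] creates a zero count for new words

        for (std::map<std::string, std::size_t>::const_iterator it = counts.begin();
             it != counts.end(); ++it)
            std::cout << it->second << '\t' << it->first << '\n';
        return 0;
    }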

Jonathan Leffler
I wish I could double-upvote this answer.
jprete
Depending on the contents of the 3 GB file and how much memory you have, the map may be too large to fit into memory once the memory overhead of a map is added in.
jthg
There are about 100,000 words in the English language. Let's assume that the definition of 'word' doesn't do case-mapping, and catches punctuation, so that there are 5 variants on each word. Let's assume that on average, a word is 10 characters (overkill), and the map overhead is, oh, 22 bytes. Then we have 5 * 100,000 * 32 = 16 MB. What size computer is going to have problems with that?
Jonathan Leffler
I am not sure it will. In the case where there are only a few thousand words, the map won't be so big. So it really depends on the number of distinct words, not the file size.
Matthieu M.
The English word list I have handy (I think it's the Moby list, but don't remember with certainty) shows around three hundred thousand words, but that's still not particularly difficult to deal with. I would not take punctuation into account though -- I'd strip it off before counting the words (except, perhaps, in the case of contractions). Most people don't want `x` and `x,` treated as two different words.
Jerry Coffin
You guys have a very good point. I'm not sure what I was thinking. Maybe I was a little confused by the OP wanting to search through the entire file every single time - which sort of gave me the impression that the file actually contained on the order of 1gb of unique words.
jthg
@Jonathan I completely agree, and I am grateful to you for introducing those two books. The only problem left in your answer is how fast I can read through the file. Your answer is probably fine for counting the unique words using the map or the like. But the real question is how fast I can traverse the file. That is the reason my proposed solution uses threads, but that doesn't seem to be the way to go, judging by the answers below.
Vadi
I've done a very similar thing, except for one detail. My initial code indeed used a `std::map<std::string, T>` but once I got the bugs out I changed it to `std::map<StringShim, T>`. `StringShim` was a simple 4-byte class wrapping a `char*`; the actual strings were managed by a single `StringPool`. This was significantly more efficient.
MSalters
@Vadi: threads won’t help you read the file faster *at all*, quite the contrary. Multithreading only makes sense when you can process work in parallel on multiple processors. But there still is only a single disk read head to read data from the hard drive.
Konrad Rudolph
@Konrad: But I can use threads for traversing the buffer, as mentioned by Jerry Coffin below. Apparently, I need to count the repeated words, so threads with more than one dedicated buffer could be more efficient.
Vadi
FWIW: on a single disk machine, I did 'dd if=/dev/zero of=./junk bs=64k count=49152' in 39 seconds and 'cat ./junk >/dev/null' in 46. Those are about as simple, in terms of 'no processing' as you can get, and show transfer rates of 67-79 MB/s. You can do a lot of processing in that time without running into trouble. Consider memory mapping the whole file if you've got a 64-bit machine (you probably wouldn't get away with that on a 32-bit machine, though it might be worth a try). In the absence of better disk drives, the processing isn't going to be the bottleneck.
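For reference, a sketch of the memory-mapping idea on a 64-bit POSIX system (the function name and error handling are illustrative, not something from this thread):

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map an entire file read-only and hand back a pointer to its bytes;
       a 3 GB file needs a 64-bit address space. Caller munmap()s when done. */
    const char *map_whole_file(const char *path, size_t *len)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return NULL;

        struct stat st;
        if (fstat(fd, &st) != 0) { close(fd); return NULL; }

        void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                        /* the mapping stays valid after close */
        if (p == MAP_FAILED) return NULL;

        *len = (size_t)st.st_size;
        return (const char *)p;           /* scan and count words directly in place */
    }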
Jonathan Leffler
+6  A: 

I don't think using multiple threads that read parts of the file in parallel is going to help much. I would expect that this application is bound by the bandwidth and latency of your hard disk, not by the actual word counting. Such a multi-threaded version might actually perform worse, because "quasi-random" file access is typically slower than "linear file" access.

If the CPU really is busy in a single-threaded version, there might be a potential speed-up. One thread could read the data in big chunks and put them into a queue of limited capacity. A bunch of other worker threads could each operate on their own chunk and count the words. After the counting worker threads have finished, you have to merge the word counters.
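A rough sketch of that pipeline, using C++11 threads and a bounded queue for brevity (chunk size, queue capacity, and the whitespace-based word definition are all illustrative; words that straddle a chunk boundary are not handled here):

    #include <condition_variable>
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <mutex>
    #include <queue>
    #include <sstream>
    #include <string>
    #include <thread>
    #include <vector>

    // Bounded queue of text chunks shared between the reader and the workers.
    struct ChunkQueue {
        std::queue<std::string> chunks;
        std::mutex m;
        std::condition_variable cv;
        bool done = false;

        void push(std::string chunk) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [this] { return chunks.size() < 8; });   // limited capacity
            chunks.push(std::move(chunk));
            cv.notify_all();
        }
        bool pop(std::string& out) {            // returns false once drained and done
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [this] { return !chunks.empty() || done; });
            if (chunks.empty()) return false;
            out = std::move(chunks.front());
            chunks.pop();
            cv.notify_all();
            return true;
        }
        void finish() {
            std::lock_guard<std::mutex> lk(m);
            done = true;
            cv.notify_all();
        }
    };

    int main(int argc, char* argv[]) {
        if (argc != 2) return 1;

        ChunkQueue queue;
        unsigned n = std::thread::hardware_concurrency();
        if (n == 0) n = 2;
        std::vector<std::map<std::string, std::size_t>> local(n);  // one map per worker
        std::vector<std::thread> workers;

        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([&queue, &local, i] {
                std::string chunk, word;
                while (queue.pop(chunk)) {
                    std::istringstream in(chunk);
                    while (in >> word) ++local[i][word];
                }
            });

        // A single reader (here: the main thread) feeds 1 MB chunks into the queue.
        std::ifstream in(argv[1], std::ios::binary);
        std::string buf(1 << 20, '\0');
        while (in.read(&buf[0], static_cast<std::streamsize>(buf.size())) || in.gcount() > 0)
            queue.push(buf.substr(0, static_cast<std::size_t>(in.gcount())));
        queue.finish();

        for (std::thread& t : workers) t.join();

        std::map<std::string, std::size_t> total;                  // merge step
        for (const auto& m : local)
            for (const auto& kv : m) total[kv.first] += kv.second;
        for (const auto& kv : total)
            std::cout << kv.second << '\t' << kv.first << '\n';
    }

Each worker keeps its own map, so the only synchronization is on the queue; the merge at the end is cheap compared with the counting.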

sellibitze
I would call this a near-certainty. The CPU should be processing the bytes far faster than the disk can pull them off the platter, so there isn't really anything to parallelize.
jprete
I concur. I might even take it one step further and say that even if the entire file is in memory, the CPU will still process the words faster than they can be read from memory.
jthg
Disagree with the last statement. Reading the text from memory will trigger the CPU's prefetcher. That's bloody fast. The bottleneck will be the O(log N) random-access search for the word counter. They are unlikely to all fit in L2 cache.
MSalters
Yes, the prefetcher helps a lot, but I was already assuming perfect prefetching. Just for the sake of seeing whether what I said was preposterous, I did a rough order-of-magnitude calculation: Memory bandwidth is around 10 gigabytes/second. Assuming that the cpu is a core 2 which can do one comparison per cycle (for a single core), that works out to about (8 bytes/cycle) * (2e9 cycles/second) = 16 gigabytes/second. The order of magnitude works out to be about the same. Anyone care to put in some more exact numbers?
jthg
A: 

A stream has only one cursor. If you access the stream from more than one thread at a time, you cannot be sure you are reading where you want, since reads happen from the current cursor position.

What I would do is have only one thread (maybe the main one) that reads the stream and dispatches the bytes it reads to other threads.

For example:

  • Thread #i is ready and asks the main thread to give it the next part,
  • The main thread reads the next 1 MB and provides it to thread #i,
  • Thread #i reads the 1 MB and counts words as you want,
  • Thread #i finishes its work and asks again for the next 1 MB.

This way you can separate stream reading from stream analysis.

Patrice Bernassola
I don't think there's any value in messing with threading. This sort of task will absolutely be I/O bound. Your hard drive won't be able to feed data fast enough to keep even a single core loaded.
swillden
A: 

What you are looking for is a regex. This Stack Overflow thread on C++ regex engines should help:

http://stackoverflow.com/questions/181624/c-what-regex-library-should-i-use
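A minimal sketch of the idea, shown here with std::regex (Boost.Regex and TR1 regex, covered in that thread, have essentially the same interface); per the comments below, you would apply it a line or a chunk at a time rather than to the whole 3 GB at once:

    #include <cstddef>
    #include <map>
    #include <regex>
    #include <string>

    // Count word-like tokens in one line (or chunk) of text.
    void count_words(const std::string& text,
                     std::map<std::string, std::size_t>& counts)
    {
        static const std::regex word("[A-Za-z]+");   // illustrative word definition
        for (std::sregex_iterator it(text.begin(), text.end(), word), end;
             it != end; ++it)
            ++counts[it->str()];
    }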

ryber
I can't even imagine the horrors of trying to search a 3gb file via RegEx.
jthg
Unless... the regex engine is optimized for stream processing.
jthg
I have a program that regexes that much data regularly and it's quite zippy.
ryber
A: 

First, I'm pretty sure that C/C++ isn't the best way to handle this. Ideally, you'd use some map/reduce for parallelism, too.

But, assuming your constraints, here's what I'd do.

1) Split the text file into smaller chunks. You don't have to do this by the first-letter of the word. Just break them up into, say, 5000-word chunks. In pseudocode, you'd do something like this:

    index = 0
    numwords = 0
    mysplitfile = openfile(index-split.txt)
    while (bigfile >> word)
        mysplitfile << word
        numwords++
        if (numwords > 5000)
            mysplitfile.close()
            numwords = 0
            index++
            mysplitfile = openfile(index-split.txt)

2) Use a shared map data structure and pthreads to spawn new threads to read each of the subfiles. Again, pseudocode:

    maplock = create_pthread_lock()
    sharedmap = std::map()

    for every index-split.txt file:
        spawn-new-thread(myfunction, filename, sharedmap, maplock)

    wait-for-all-threads-to-finish()
    dump_map(sharedmap)

    void myfunction(filename, sharedmap, lock) {
        localmap = std::map<string, size_t>()

        file = openfile(filename)
        while (file >> word)
            if !localmap.contains(word)
                localmap[word] = 0
            localmap[word]++

        acquire(lock)
        for key, value in localmap
            if !sharedmap.contains(key)
                sharedmap[key] = 0
            sharedmap[key] += value
        release(lock)
    }

Sorry for the syntax. I've been writing a lot of python lately.

spitzanator
Using a lock is definitely not a good idea. You're killing parallelism. It is much simpler, if you want to go MT, to actually have each thread play with its own map and just merge them at the end.
Matthieu M.
hey spitzanator, have you read Natural Language Processing with Python?
zeroin23
Can someone shed a little light on why this is downvoted? Is this an appropriate answer, or, as mentioned earlier, is reading the disk with multiple threads simply not effective? Or is it just because of the Pythonic pseudocode?
Vadi
+1  A: 

While you can use a second thread to analyze the data after reading it, you're probably not going to gain a huge amount by doing so. Trying to use more than one thread to read the data will almost certainly hurt speed rather than improving it. Using multiple threads to process the data is pointless -- processing will be many times faster than reading, so even with only one extra thread, the limit is going to be the disk speed.

One (possible) way to gain significant speed is to bypass the usual iostreams -- while some are nearly as fast as using C FILE*'s, I don't know of anything that's really faster, and some are substantially slower. If you're running this on a system (e.g. Windows) that has an I/O model that's noticeably different from C's, you can gain considerably more with a little care.

The problem is fairly simple: the file you're reading is (potentially) larger than the cache space you have available -- but you won't gain anything from caching, because you're not going to reread chunks of the file again (at least if you do things sensibly). As such, you want to tell the system to bypass any caching, and just transfer data as directly as possible from the disk drive to your memory where you can process it. In a Unix-like system, that's probably open() and read() (and won't gain you a whole lot). On Windows, that's CreateFile and ReadFile, passing the FILE_FLAG_NO_BUFFERING flag to CreateFile -- and it'll probably roughly double your speed if you do it right.

You've also gotten some answers advocating doing the processing using various parallel constructs. I think these are fundamentally mistaken. Unless you do something horribly stupid, the time to count the words in the file will be only a few milliseconds longer than it takes to simply read the file.

The structure I'd use would be to have two buffers of, say, a megabyte apiece. Read data into one buffer. Turn that buffer over to your counting thread to count the words in that buffer. While that's happening, read data into the second buffer. When those are done, basically swap buffers and continue. There is a little bit of extra processing you'll need to do in swapping buffers to deal with a word that may cross the boundary from one buffer to the next, but it's pretty trivial (basically, if the buffer doesn't end with white space, you're still in a word when you start operating on the next buffer of data).

As long as you're sure it'll only be used on a multi-processor (multi-core) machine, using real threads is fine. If there's a chance this might ever be done on a single-core machine, you'd be somewhat better off using a single thread with overlapped I/O instead.
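For concreteness, a sketch of that boundary handling: whatever partial word is left at the end of one buffer is carried over and finished when the next buffer is processed (the helper name and the whitespace-based word definition are illustrative):

    #include <cctype>
    #include <cstddef>
    #include <map>
    #include <string>

    void count_buffer(const char* buf, std::size_t len,
                      std::string& carry,               // partial word from the previous buffer
                      std::map<std::string, std::size_t>& counts)
    {
        std::string word = carry;
        for (std::size_t i = 0; i < len; ++i) {
            if (std::isspace(static_cast<unsigned char>(buf[i]))) {
                if (!word.empty()) { ++counts[word]; word.clear(); }
            } else {
                word += buf[i];
            }
        }
        carry = word;          // empty unless the buffer ended mid-word
    }
    // After the last buffer: if (!carry.empty()) ++counts[carry];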

Jerry Coffin
+3  A: 

First - decide on the data structure for storing the words.

The obvious choice is the map. But perhaps a trie would serve you better. In each node, you save the count for the word; a count of 0 means that it's only part of a word. You can insert into the trie using a stream, reading your file character by character.
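A minimal sketch of such a trie node, restricted to lowercase a-z for brevity (real input would need case folding or a wider child table, and the nodes are never freed here):

    #include <cstddef>
    #include <string>

    struct TrieNode {
        std::size_t count;                 // 0 means "prefix only", not a complete word
        TrieNode*   child[26];
        TrieNode() : count(0) { for (int i = 0; i < 26; ++i) child[i] = 0; }
    };

    // Record one occurrence of 'word' (assumed to be lowercase a-z).
    void insert_word(TrieNode* root, const std::string& word)
    {
        TrieNode* node = root;
        for (std::size_t i = 0; i < word.size(); ++i) {
            int c = word[i] - 'a';
            if (!node->child[c]) node->child[c] = new TrieNode();
            node = node->child[c];
        }
        ++node->count;
    }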

Second - multithreading, yes or no? This one is not easy to answer. Depending on how large the data structure grows and how you parallelize, the answer may differ.

  1. Single-threaded - straightforward and easy to implement.
  2. Multithreaded with multiple reader threads and one data structure. Then you have to synchronize access to the data structure. In a trie, you only need to lock the node you are actually in, so multiple readers can access the data structure without much interference. A self-balancing tree might be different, especially when rebalancing.
  3. Multithreaded with multiple reader threads, each with its own data structure. Each thread builds its own data structure while reading a part of the file. After each one is finished, the results have to be combined (which should be easy).

One thing you have to think about - you have to find a word boundary for each thread to start at, but that should not pose a great problem (e.g. each thread walks from its start to the first word boundary and starts there; at the end, each thread finishes the word it's working on).

Tobias Langner
Good summary of possibilities, and +1 for mentioning the trie as a non-obvious solution.
Konrad Rudolph
+1  A: 

As others have indicated, the bottleneck will be the disk I/O. I therefore suggest that you use overlapped I/O. This basically inverts the program logic. Instead of your code trying to determine when to do I/O, you simply tell the operating system to call your code whenever it has finished a bit of I/O. If you use I/O completion ports, you can even tell the OS to use multiple threads for processing the file chunks.
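A rough sketch of a single overlapped read on Windows (error handling, the loop over chunks, and the completion-port variant are all omitted; the file name and chunk size are illustrative):

    #include <windows.h>

    void overlapped_read_sketch()
    {
        HANDLE file = CreateFileA("big.txt", GENERIC_READ, FILE_SHARE_READ, NULL,
                                  OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
        static char buffer[1 << 20];                     // 1 MB chunk
        OVERLAPPED ov = {};
        ov.Offset = 0;                                   // file offset of this chunk
        ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);

        if (!ReadFile(file, buffer, sizeof buffer, NULL, &ov) &&
            GetLastError() != ERROR_IO_PENDING) {
            /* real error */
        }
        // ... count the words of the previously read chunk while this one is in flight ...

        DWORD got = 0;
        GetOverlappedResult(file, &ov, &got, TRUE);      // wait: 'got' bytes are now in 'buffer'

        CloseHandle(ov.hEvent);
        CloseHandle(file);
    }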

MSalters