I need to go over a huge amount of text (> 2 TB, a full Wikipedia dump) and keep two counters for each token I see (each counter is incremented depending on the current event). The only operation I will need on these counters is incrementing. In a second phase, I should compute two floats based on these counters and store them.
The system should perform the following steps:
- Go over the text and increment two counters for each token it finds, depending on the current event.
- Go over all tokens and, for each of them, compute two additional floats based on these counters.
- Allow queries (getting the values for any given token).
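To make the three steps concrete, here is a minimal in-memory sketch in Python. The whitespace tokenization and the formulas for the two floats are placeholders (my real ones differ), and the plain dict obviously will not hold 10^8 tokens comfortably; it only illustrates the operations I need:

```python
from collections import defaultdict

# Phase 1: two integer counters per token; only increments happen here.
counters = defaultdict(lambda: [0, 0])

def process(text, event):
    """Increment counter 0 or 1 for every token, depending on the event."""
    for token in text.split():          # assumed: whitespace tokenization
        counters[token][event] += 1     # event is 0 or 1

process("the cat sat on the mat", 0)
process("the dog", 1)

# Phase 2: derive two floats per token from its counters.
# These formulas are placeholders, not my actual computation.
results = {}
for token, (c0, c1) in counters.items():
    total = c0 + c1
    results[token] = (c0 / total, c1 / total)

# Phase 3: queries are plain lookups on the final result.
def query(token):
    return results.get(token)

print(query("the"))
```

The open question is what to replace the dict with so that phase 1 scales to the full dataset and phase 3 stays fast.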
Requirements and other details:
- It must scale up to O(10^8) unique tokens.
- The final result needs to be queried very fast!
- While going over the texts, the only operation will be incrementing the two counters. This is one-time processing, so there will be no queries during it, only updates.
- No need for dynamic/updateable schema.
I have been trying CouchDB and MongoDB, without much success.
What do you think is the best approach to this problem?
Thank you!
EDIT 1: It has been suggested that I try a Patricia trie and test whether all the keys fit into memory (I suspect they do not). A custom Patricia trie with an extra operator for incrementing the values of a key in one step might be a possible solution.
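For illustration, a minimal sketch of that idea in Python. It uses a plain (uncompressed) character trie rather than a true Patricia/radix trie, so it lacks the path-compression memory savings; the relevant part is the increment-in-one-walk operator:

```python
class TrieNode:
    __slots__ = ("children", "counters")  # keep per-node overhead small

    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.counters = None  # [c0, c1] once the node terminates a token

class CounterTrie:
    def __init__(self):
        self.root = TrieNode()

    def increment(self, token, event):
        """Walk (or create) the path for `token` and bump one counter."""
        node = self.root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        if node.counters is None:
            node.counters = [0, 0]
        node.counters[event] += 1

    def get(self, token):
        """Return [c0, c1] for `token`, or None if it was never seen."""
        node = self.root
        for ch in token:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.counters
```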
EDIT 2: Clarified what I mean by "huge": > 2 TB of text. Added more clarifications.
EDIT 3: Unique token estimation. As suggested by Mike Dunlavey, I tried a quick estimation of the number of unique tokens. In the first 830 MB of the dataset, unique tokens grow roughly linearly to 52,134. Extrapolating linearly to the full 2 TB (about 2,500 times more data) gives 52,134 × 2,500 ≈ 1.3 × 10^8; since the growth probably slows down after processing more data (which is likely), O(10^8) unique tokens should be an upper estimate.
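The estimation itself was done with something along these lines (the file name, sample size, and whitespace tokenization are assumptions of the sketch):

```python
# Rough unique-token estimation over a sample of the dump (assumed: one
# plain-text file, whitespace tokenization, same as the real pipeline).
seen = set()
bytes_read = 0
with open("sample.txt", encoding="utf-8", errors="replace") as f:
    for line in f:
        bytes_read += len(line.encode("utf-8"))
        seen.update(line.split())
        if bytes_read >= 830 * 1024 * 1024:   # stop after ~830 MB
            break

print(len(seen), "unique tokens in", bytes_read, "bytes")
# Linear extrapolation: 2 TB / 830 MB ~= 2500, so
# 52,134 * 2500 ~= 1.3e8 unique tokens, i.e. O(10^8).
```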
EDIT 4: Java and Python solutions are preferred but any other language is ok too.
EDIT 5: Tokens will usually contain only printable ASCII characters, but they may contain any printable Unicode character. I will try the same process twice: once with case left untouched, and once lower-casing everything.
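One detail for the lower-case run: since tokens may contain non-ASCII characters, Python distinguishes between str.lower() and the more aggressive str.casefold(), which differ on some Unicode characters:

```python
token = "Straße"
print(token.lower())     # straße  - simple lower-casing
print(token.casefold())  # strasse - full Unicode case folding
```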