views: 1194 · answers: 5

Has anyone successfully used Tokyo Cabinet / Tokyo Tyrant with large datasets? I am trying to upload a subgraph of the Wikipedia dataset. After hitting about 30 million records, I see an exponential slowdown. This occurs with both the HDB and BDB databases. I adjusted bnum to 2-4x the expected number of records for the HDB case, with only a slight speedup. I also set xmsiz to 1 GB or so, but ultimately I still hit a wall.

It seems that Tokyo Tyrant is basically an in-memory database, and once you exceed xmsiz or your RAM, you get a barely usable database. Has anyone else encountered this problem? Were you able to solve it?
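For reference, Tokyo Tyrant lets you pass tuning parameters inline by appending them to the database filename after `#`. A sketch of the kind of invocation I was using (the port, filename, and values here are illustrative, not my exact settings):

```shell
# Illustrative ttserver launch: bnum, xmsiz and opts are appended to the
# database name after '#'. Values below are examples only.
ttserver -port 1978 "casket.tch#bnum=120000000#xmsiz=1073741824#opts=l"
```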

+2  A: 

I am starting to work on a solution that adds sharding to Tokyo Cabinet, called Shardy.

http://github.com/cardmagic/shardy/tree/master

Lucas Carlson
A: 

I get the same. I have 8 files of biological sequence data, each with around 30 million records. I played around with the parameters, and as you say, as soon as memory fills up, writes to disk become exponentially slow. I see people on forums talking about datasets bigger than mine and how great it is, but I can't get it to work.

+3  A: 

I think I may have cracked this one, and I haven't seen this solution anywhere else. On Linux, there are generally two reasons that Tokyo starts to slow down. Let's go through the usual culprits. First, your bnum may be set too low: you want it to be at least half the number of items in the hash (preferably more). Second, you want to set xmsiz close to the size of the bucket array. To get the size of the bucket array, just create an empty db with the correct bnum, and Tokyo will initialize the file to the appropriate size. (For example, bnum=200000000 gives a file of approximately 1.5 GB for an empty db.)
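That sizing arithmetic can be sketched quickly. Assuming roughly 8 bytes per bucket (an assumption, but one that matches the ~1.5 GB figure for bnum=200000000):

```shell
# Rough bucket-array sizing: ~8 bytes per bucket (assumption; consistent
# with bnum=200000000 producing an ~1.5 GB empty db file).
bnum=200000000
bytes=$((bnum * 8))
mib=$((bytes / 1024 / 1024))
echo "$bytes bytes (~$mib MiB)"   # 1600000000 bytes (~1525 MiB)
```

That ~1525 MiB is the ballpark value to feed into xmsiz.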

But now you'll notice that it still slows down, albeit a bit farther along. We found the trick was to turn off journaling on the filesystem: for some reason journaling (on ext3) spikes as your hash file grows beyond 2-3 GB. (The way we realized this was I/O spikes that didn't correspond to changes of the file on disk, alongside CPU bursts from the kjournald daemon.)

For Linux, just unmount and remount your ext3 partition as ext2. Build your db, then remount as ext3. With journaling disabled we could build 180M-key databases without a problem.
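A sketch of that remount dance (the device name and mount point are placeholders for your own; ext3 volumes can be mounted as ext2 because the on-disk format is compatible, the journal is simply not used):

```shell
# Device and mount point are illustrative -- substitute your own.
umount /data
mount -t ext2 /dev/sdb1 /data   # same on-disk format, journal ignored

# ... build the Tokyo Cabinet database here ...

umount /data
mount -t ext3 /dev/sdb1 /data   # restore journaling when done
```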

Greg Fodor
A: 

Tokyo Cabinet's key-value store is really good. I think people call it slow because they use Tokyo Cabinet's table-like store.

If you want to store document data, use mongodb or another nosql engine.

muyufan
Did you even read the question? He is using the hash database and the B+ tree database.
ionut bizau
+2  A: 

Tokyo scales wonderfully!! But you have to set your bnum and xmsiz appropriately. bnum should be 0.25 to 4 times the number of records you are planning to store. xmsiz should match the size of the bucket array (which is determined by bnum). Also set opts=l if you are planning to store more than 2GB.

See Greg Fodor's post above about getting the right value for xmsiz. Be careful to note that xmsiz is specified in bytes.
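For example, if the sizing exercise above points at roughly 1.5 GiB, the actual value to pass is the raw byte count:

```shell
# xmsiz takes bytes, not MB/GB: 1.5 GiB = 1536 MiB expressed in bytes.
xmsiz=$((1536 * 1024 * 1024))
echo "$xmsiz"   # 1610612736
```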

Finally, if you are using a disk-based hash it is very, very, VERY important to turn off journaling on the filesystem the tokyo data lives on. This is true for Linux and Mac OSX, and probably Windows, though I have not tested it there yet.

If journaling is turned on you will see severe drops in performance as you approach 30+ million rows. With journaling turned off and the other options set appropriately, Tokyo is a great tool.

Anwar Abdus-Samad