Conclusion: HDF5 seems to be the way to go for my purposes. In short, "HDF5 is a data model, library, and file format for storing and managing data," and it is designed to handle very large amounts of data. It has a Python module called PyTables (packaged as python-tables). (The link is in the answer below.)
HDF5 does the job 1000% better when it comes to saving tons and tons of data. Reading and modifying data across 200 million rows is still a pain, though, so that's the next problem to tackle.
I am building a directory tree that has tons of subdirectories and files. There are about 10 million files spread across a hundred thousand directories. Each file is under 32 subdirectories.
I have a Python script that builds this filesystem and reads and writes those files. The problem is that once I reach more than a million files, the reads and writes become extremely slow.
Here's the function I have that reads the contents of a file (the file contains an integer string), adds a certain number to it, then writes it back to the original file.
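    import shutil

    def addInFile(path, scoreToAdd):
        num = scoreToAdd
        try:
            # Copy the file to a scratch location and read the stored integer.
            shutil.copyfile(path, '/tmp/tmp.txt')
            fp = open('/tmp/tmp.txt', 'r')
            num += int(fp.readlines()[0])
            fp.close()
        except Exception:
            # Missing, empty, or non-integer file: start from scoreToAdd alone.
            pass
        # Write the new total to the scratch file, then copy it back.
        fp = open('/tmp/tmp.txt', 'w')
        fp.write(str(num))
        fp.close()
        shutil.copyfile('/tmp/tmp.txt', path)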
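For comparison, here is a minimal sketch of the same read-add-write step done directly on the file, without the temporary copy. The name `add_in_file_direct` is hypothetical, and the sketch assumes the parent directory already exists and that the first line of the file holds the integer. Unlike the function above, it raises `ValueError` on non-integer content instead of silently overwriting it. Whether it is faster on a tree of this size is untested here.

```python
import os


def add_in_file_direct(path, score_to_add):
    """Add score_to_add to the integer stored in path (hypothetical helper).

    Creates the file if it does not exist; an empty file counts as 0.
    """
    # Open for reading and writing without truncating; create if missing.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, 'r+') as fp:
        text = fp.readline().strip()
        total = score_to_add + (int(text) if text else 0)
        # Rewind, overwrite, and cut off any leftover digits from the old value.
        fp.seek(0)
        fp.write(str(total))
        fp.truncate()
    return total
```

This touches each file with a single open/close pair instead of two copies plus two opens, which may or may not matter compared with the cost of resolving the path itself.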
- Relational databases seem too slow for accessing this data, so I opted for a filesystem approach.
- I previously tried running Linux console commands for this, but that was much slower.
- I copy the file to a temporary file first, access and modify it there, and then copy it back, because I found this was faster than accessing the file directly.
- Putting all the files into one directory (on ReiserFS) slowed file access down too much.
I think the slowdown is caused by the sheer number of files. Running this function 1000 times used to take less than a second, but now it takes close to 1 minute.
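The timing above can be reproduced with a small harness along these lines. This is only a sketch: `time_calls` is a hypothetical helper, and it assumes the function under test takes `(path, score_to_add)` like `addInFile` does.

```python
import time


def time_calls(func, paths, score_to_add=1):
    """Return wall-clock seconds spent calling func(path, score_to_add) per path."""
    start = time.perf_counter()
    for path in paths:
        func(path, score_to_add)
    return time.perf_counter() - start
```

Calling it with a list of 1000 existing paths, for example `time_calls(addInFile, sample_paths)`, gives one number per run. Running it at different tree sizes shows where the slowdown sets in.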
How do you suggest I fix this? Do I change my directory tree structure?
All I need is to quickly access each file in this very large pool of files.