After an unfortunate misadventure with MySQL, I finally gave up on using it.
What do I have?
A large set of files in the following format:
ID1: String String String String
ID2: String String String String
ID3: String String String String
ID4: String String String String
What did I do?
I used MySQL on a powerful machine to import everything into a database in the following form:
ID1 String
ID1 String
ID1 String
ID1 String
...
...
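The normalization step above can be sketched as follows; this is a minimal sketch assuming one `ID: String String ...` record per line (the function name is my own):

```python
def normalize(lines):
    """Split "ID: s1 s2 s3" lines into one (ID, string) row per string."""
    for line in lines:
        # Everything before the first colon is the ID, the rest are strings.
        id_part, _, rest = line.partition(":")
        for token in rest.split():
            yield (id_part.strip(), token)

# normalize(["ID1: a b"]) yields ("ID1", "a"), ("ID1", "b")
```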
What happened?
The database import was successful, but indexing is failing: apparently it requires more than 200 GB for 2 billion records. That is a reasonable requirement, but I simply don't have that much space, since the normalized table itself already occupies about 240 GB.
What am I planning to do?
I have a cluster of 20 nodes with about 80 GB of storage available across all of them combined (they all share an NFS mount). I have set up the nodes for distributed computing using Parallel Python, and I plan to rewrite my logic to utilize the power of the cluster.
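Roughly, the plan would fan a lookup out so each worker searches its own subset of files. A minimal local sketch, using `concurrent.futures` threads as a stand-in for Parallel Python workers (`search_file` and `cluster_search` are hypothetical names of mine):

```python
from concurrent.futures import ThreadPoolExecutor

def search_file(path, needle):
    """Scan one 'ID: String String ...' file; return IDs whose strings contain needle."""
    hits = []
    with open(path) as fh:
        for line in fh:
            id_part, _, rest = line.partition(":")
            if needle in rest.split():
                hits.append(id_part.strip())
    return hits

def cluster_search(paths, needle, workers=20):
    """Fan the lookup out over workers, one subset of files per worker."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_file = pool.map(search_file, paths, [needle] * len(paths))
    # Merge the per-file ID lists into one sorted result.
    return sorted(id_ for ids in per_file for id_ in ids)
```

On the real cluster each node would run its own process against its locally assigned files rather than threads against a shared list.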
My Question:
I need to do a lot of lookups of the following type:
Which IDs contain a given string?
For instance, given an arbitrary string "String1", I need to know that, say, ID1 and ID2234 contain it.
I know of two methods for now:
- Using Python to call grep
- Each of the 20 nodes takes ownership of a subset of the files and, upon a search request, searches its own files.
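The first option might look like the sketch below, shelling out to grep from Python and keeping only the ID column of each matching line (the function name is my own; `-w` restricts matches to whole words):

```python
import subprocess

def ids_containing(string, path):
    """Return IDs whose line contains `string`, by shelling out to grep."""
    proc = subprocess.run(
        ["grep", "-w", string, path],
        capture_output=True, text=True,
    )
    # grep exits 1 when nothing matches; only 2 and above signal an error.
    if proc.returncode > 1:
        raise RuntimeError(proc.stderr)
    # Each matching line looks like "ID1: String String ..."; keep the ID.
    return [line.split(":", 1)[0] for line in proc.stdout.splitlines()]
```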
Can someone suggest a good approach to speed up this otherwise inefficient task?