After an unfortunate misadventure with MySQL, I finally gave up on using it.

What do I have?

Large set of files in the following format:

ID1: String String String String
ID2: String String String String
ID3: String String String String
ID4: String String String String

What did I do?

Used MySQL on a powerful machine to import everything into a database in the following form:

ID1 String
ID1 String
ID1 String
ID1 String
...
...
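
The flattening itself is straightforward; here is a minimal Python sketch of that step, assuming the input format above (the filename is a placeholder):

def flatten(path):
    # Yield one (ID, string) pair per string on each "ID: s1 s2 ..." line.
    with open(path) as f:
        for line in f:
            ident, _, rest = line.partition(":")
            for s in rest.split():
                yield ident.strip(), s

for ident, s in flatten("input_part_001.txt"):
    print(ident, s)   # e.g. "ID1 String"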

What happened?

The database import was successful, but indexing is failing: apparently it needs more than 200 GB for 2 billion records. That is a reasonable requirement, but I simply don't have that much space, because the table itself already occupies about 240 GB after normalizing.

What am I planning to do?

I have a cluster of 20 nodes with about 80 GB of storage available to all of them combined (they all share an NFS mount). I have set up the nodes for distributed computing using Parallel Python, and I am planning to rewrite my logic to utilize the power of the cluster.

My Question:

I need to do a lot of the following type of lookups:

What IDs contain a given string?

For instance, given an arbitrary string "String1", I need to learn that, say, "ID1" and "ID2234" contain it.

I know of two methods for now:

  • Using Python to call grep
  • Having each of the 20 nodes take ownership of a subset of the files and, upon a search request, search only its own files (see the sketch after this list)
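
As a rough illustration of the second method, here is a hedged sketch that uses Parallel Python to fan a grep out over per-node file assignments. The node addresses and file paths are placeholders, and pp.Server/submit should be checked against your Parallel Python version (pp is a Python 2-era library):

import pp

def grep_ids(pattern, paths):
    # Run grep over this node's files; collect the ID of each matching line.
    import subprocess
    ids = set()
    for path in paths:
        out = subprocess.Popen(["grep", "-w", pattern, path],
                               stdout=subprocess.PIPE).communicate()[0]
        for line in out.splitlines():
            ids.add(line.split(":", 1)[0])   # "ID1: ..." -> "ID1"
    return ids

ppservers = ("node01:35000", "node02:35000")                  # placeholder nodes
assignments = [["/nfs/part_000.txt"], ["/nfs/part_001.txt"]]  # placeholder files
server = pp.Server(ppservers=ppservers)
jobs = [server.submit(grep_ids, ("String1", paths)) for paths in assignments]
print(set().union(*(job() for job in jobs)))   # e.g. set(['ID1', 'ID2234'])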

Can someone suggest a good approach to speed up this otherwise inefficient task?

A: 

I'd suggest looking at a non-relational database to support this. There are a number of key/value stores you could use to hold your data, which should be more efficient than a relational database for this kind of lookup. The NoSQL article on Wikipedia is a good place to start.
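
To make the key/value idea concrete, here is a minimal sketch using Python's built-in dbm module (Python 3) as a stand-in for BerkeleyDB or whichever store you choose; the index path and sample records are placeholders:

import dbm

# Build: key = string, value = comma-separated IDs that contain it.
with dbm.open("/nfs/string_index", "c") as db:
    for ident, s in [("ID1", "String1"), ("ID2234", "String1")]:
        old = db.get(s, b"")
        if old:
            db[s] = old + b"," + ident.encode()
        else:
            db[s] = ident.encode()

# Query: a single disk-backed hash probe instead of a table scan.
with dbm.open("/nfs/string_index", "r") as db:
    print(db.get("String1", b"").split(b","))   # [b'ID1', b'ID2234']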

EDIT: Are you using the most compact data types possible for the data in your database? Are your IDs stored in the smallest integer type that can hold their range? If your strings are ASCII, are you storing them as ASCII strings rather than Unicode (VARCHAR rather than NVARCHAR)?

SamStephens
@SamStephens: Thanks for the reply. I have been looking at them, but assuming I picked BerkeleyDB for my needs (as it has nice Python bindings), how useful would it be given that my cluster doesn't have much space and relies on an NFS mount? I mean, what would the workflow look like? Can you give me some insight from this perspective?
Legend
@Legend: I'm afraid I don't have experience with BerkeleyDB; my knowledge is theoretical. I simply imagined you'd use the NoSQL database instead of the SQL database. Having said that, now that I think about it further, a NoSQL database may not index your data any more compactly, and so may not solve your issue. I've voted Don's answer up; his point is certainly valuable. I am also editing my answer to ask more.
SamStephens
+1  A: 

For the requirement of looking up which IDs are associated with a given string, I suggest inverting the ID/string relation so that the records are keyed by unique strings, and the data associated with each key is the sequence of IDs containing that string. A string lookup can then be implemented with a binary search if the keys are sorted, or with a hash algorithm. This may shrink your data considerably if the same strings repeat often.
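
A minimal in-memory sketch of this inverted layout, with hypothetical sample data (on 2 billion records the sorted index would live on disk, but the lookup logic is the same):

import bisect

pairs = [("ID1", "String1"), ("ID2", "String2"), ("ID2234", "String1")]

# Invert: one record per unique string, holding every ID that contains it.
inverted = {}
for ident, s in pairs:
    inverted.setdefault(s, []).append(ident)
records = sorted(inverted.items())   # [('String1', ['ID1', 'ID2234']), ...]
keys = [k for k, _ in records]

def lookup(s):
    # Binary-search the sorted keys; return the IDs for string s.
    i = bisect.bisect_left(keys, s)
    if i < len(keys) and keys[i] == s:
        return records[i][1]
    return []

print(lookup("String1"))   # ['ID1', 'ID2234']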

Don O'Donnell