tags:

views:

251

answers:

7

I need to index up to 500,000 entries for the fastest possible reads. The index is stored on disk and needs to be rebuilt periodically. I am trying to decide between a simple file, such as an on-disk hash, and a single table in an embedded database. I have no need for an RDBMS engine.
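As a point of comparison, here is a minimal sketch of the "hash on disk" option using Python's standard-library `dbm` module (the file name and sample entries are made up; a real rebuild would write to a temporary file and swap it into place):

```python
import dbm

# Rebuild the on-disk index from scratch; flag "n" always creates a new file.
# index_path and the sample key/value pairs are purely illustrative.
index_path = "entries.db"
with dbm.open(index_path, "n") as db:
    for i in range(1000):  # stand-in for up to 500,000 entries
        db[f"key{i}".encode()] = f"value{i}".encode()

# Read side: open read-only for the fastest, safest access.
with dbm.open(index_path, "r") as db:
    assert db[b"key42"] == b"value42"
```

Since the index is rebuilt wholesale, writing a fresh file and atomically renaming it over the old one also gives readers a consistent view without locking.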

A: 

This is what MapReduce was invented for. Hadoop is a cool java implementation.

sblundy
MapReduce has nothing to do with reducing index response times in databases.
1800 INFORMATION
No, it doesn't. But as I understand the question, it's about reading from disk.
sblundy
+1  A: 

I'm assuming you're referring to indexing tables on a relational DBMS (like MySQL, Oracle, or Postgres).

Indexes are secondary data stores that keep a record of a subset of fields for a table in a specific order.

If you create an index, any query whose WHERE clause references the indexed fields will perform faster.

However, adding indexes will reduce INSERT performance.

In general, indexes don't need to be rebuilt unless they become corrupted. They should be maintained on the fly by your DBMS.
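The effect described above can be observed with SQLite's query planner (a hypothetical in-memory table; the exact EXPLAIN QUERY PLAN wording varies slightly across SQLite versions):

```python
import sqlite3

# Hypothetical table, in-memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, name TEXT, score REAL)")
conn.executemany("INSERT INTO entries (name, score) VALUES (?, ?)",
                 [(f"item{i}", i * 0.5) for i in range(10_000)])

# Without an index on name, this WHERE clause forces a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM entries WHERE name = 'item42'").fetchall()
assert any("SCAN" in row[-1] for row in plan)

# The secondary index keeps name -> rowid pairs in sorted order, so the
# same query becomes an index search instead of a scan.
conn.execute("CREATE INDEX idx_entries_name ON entries (name)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM entries WHERE name = 'item42'").fetchall()
assert any("USING INDEX idx_entries_name" in row[-1] for row in plan)
```

The INSERT cost mentioned above comes from the same mechanism: every insert must also update each secondary index.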

kbaribeau
A: 

If the data doesn't need to be completely up to date, you might also like to think about using a data warehousing tool for OLAP purposes (such as MSOLAP). These can perform lightning-fast read-only queries based on pre-calculated data.

1800 INFORMATION
+1  A: 

Perhaps Berkeley DB (BDB)? It is a high-performance embedded database that doesn't require a full DBMS.

1800 INFORMATION
+1  A: 

If you're storing state objects by key, how about Berkeley DB?

sblundy
+1  A: 

cdb (constant database) if the data does not change.

/Allan

Allan Wind
A: 

PyTables Pro claims that "for situations that don't require fast updates or deletions, OPSI is probably one of the best indexing engines available". I've not personally used it, but the F/OSS version of PyTables already gives you good performance:

http://www.pytables.org/moin/PyTablesPro

Wheat