views: 1237

answers: 21

Hi All,

I have found relational database options, but I want to see if there are any faster non-relational databases that could do better.

  • I am storing up to 32 GB of these records, which contain 3 ints, a byte[8] array (or binary(8)), and a double.
  • I will only RARELY query the database, but I need to make sure that when I do read from it, it won't take forever. Are there non-relational databases that support indexing? Queries will be range-based (give me records where 0 < x < 100), so I am not looking for a hash indexing scheme.
  • Since I insert large volumes of data at very high rates, insert time MUST be fast.
  • I only need one connection.
  • It does not need to support ACID properties.
  • It must work on both Linux and Windows.
  • Bonus points for working with Java or supporting JDBC.

Please recommend a database or make some suggestions.

Thanks in advance, jbu

By the way, would an XML solution be ok? Does that support fast lookups?

+3  A: 

BDB (Berkeley DB) springs to mind.

Although, if queries really are rare, it could be far better to simply use fixed-length binary records and append them to a file. It will be really fast to write, and much faster to scan than a text-based format (like XML). It's also easy to do an out-of-line indexer: just write a pointer file (called a 'waldo') that says how far you've already read, and pick up any new records from there.

Take a hint from how Snort does it: there used to be several output formats, but most of them had severe performance problems when scaling to high-throughput networks, so the favored scheme is the 'unified' format, which is simply a concatenation of fixed-length records. Another program, Barnyard, does all the post-processing and stores to a 'real' DB, not caring if it lags some time behind.
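A minimal sketch of the append-only, fixed-length-record-plus-waldo idea described above; this is illustrative only (class and file names are made up) and assumes the 23-byte layout the asker gives in the comments (smallint + int + tinyint + byte[8] + double):

    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Illustrative sketch: append fixed-length 23-byte records (smallint + int +
    // tinyint + byte[8] payload + double) to a flat file, and track how far a
    // reader has got in a separate 'waldo' pointer file.
    public class UnifiedLog {
        static final int RECORD_SIZE = 2 + 4 + 1 + 8 + 8; // 23 bytes

        // In a real high-rate writer you would keep this stream open instead of
        // reopening it for every record.
        public static void append(String dataFile, short a, int b, byte c,
                                  byte[] payload8, double key) throws IOException {
            try (DataOutputStream out = new DataOutputStream(
                    new FileOutputStream(dataFile, true))) { // true = append mode
                out.writeShort(a);
                out.writeInt(b);
                out.writeByte(c);
                out.write(payload8, 0, 8);
                out.writeDouble(key);
            }
        }

        // The 'waldo' is just the byte offset the reader has already consumed.
        public static long readWaldo(String waldoFile) throws IOException {
            try (RandomAccessFile f = new RandomAccessFile(waldoFile, "rw")) {
                return f.length() >= 8 ? f.readLong() : 0L;
            }
        }

        public static void writeWaldo(String waldoFile, long offset) throws IOException {
            try (RandomAccessFile f = new RandomAccessFile(waldoFile, "rw")) {
                f.writeLong(offset);
            }
        }
    }

A post-processing job (the Barnyard role) would call readWaldo(), scan any records appended past that offset, and write the new offset back.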

Javier
I'm testing Berkeley DB JE right now; it is very fast, but I have not optimized it enough to get good lookup speeds. It also costs money :(
Berkeley DB doesn't necessarily cost money ( http://www.oracle.com/technology/software/products/berkeley-db/htdocs/oslicense.html ). That license may not be suitable for this project, but it may be suitable for other projects.
Max Lybbert
Based on the comment "I'll need a smallint, int, and tinyint. The records are inserted in the order of the attribute which is a double. ..." I really don't think you'll find anything faster than Berkeley DB. Many of the alternatives are implemented with Berkeley DB.
Max Lybbert
I don't know - it sounds to me like a custom storage solution may well be faster - but we need to know what the queries will be like. (Always one field? If so, which? etc...)
Jon Skeet
queries: they will be on one or two fields (either an int field, a double field, or both together).
I've heard enough horror stories at scale to suggest that BDB is not going to be a good solution for 32GB of data. However, it *is* very easy to test this out...
Rich
Personally I haven't heard those horror stories. The two big open source LDAP servers use Berkeley, and Google uses a high availability version for account signon ( http://docs.huihoo.com/berkeley/cs_google_1005D.pdf ).
Max Lybbert
+1  A: 

You will need a niche, high-demand database, and I'm not familiar with them. XML is totally out of the question, because XML is about portability, which makes you lose performance (speed).

TravisO
Anybody who voted this comment down obviously ignored the part of his question where he says he has 32GB of data. You can't efficiently load that via XML; get real, people!
TravisO
+2  A: 

XML is not the solution; I don't even know what you would be thinking of with that. XML is sometimes used as a transport for data, as you can usually gleam from the tags and attributes what the data describes. It also allows easy parsing with some error correction (plus it is easy to locate an erroneous record). It has an extreme performance loss compared to a DB.

I would stick with a relational DB; they are probably the most optimized, easy-to-use data storage solutions you can find in the real world.

PeteT
It's "glean", not "gleam".
Rich
+3  A: 

XML would be a pretty terrible solution, IMO. I can't see any benefit in using it. However, you could design your own on-disk layout for this in a binary format.

How much memory does your machine have? Could you get all the data in memory? (I know 32GB is quite a lot, but it's possible.)

Will you be querying on multiple fields of the records, or just one?

Jon Skeet
2GB of main memory, so no, the database won't fit in memory. -jbu
Okay. That's a shame. Knowing more about the queries would help. Basically you want to be able to just append to files, to get a quick insert. If you can pick the right file based on field values, that could help a lot. What size ints do you need?
Jon Skeet
I'll need a smallint, int, and tinyint. The records are inserted in the order of the attribute which is a double, i.e. for every record inserted, its double is greater than all the previous records'.
What exactly do you mean by smallint, int and tinyint? Those terms mean different things in different contexts. Numbers of bits would be helpful. Ditto for what you mean by "double". What will you be querying by?
Jon Skeet
smallint = 2 bytes, int = 4 bytes, tinyint = 1 byte, double = 8 bytes. Indexing is on the double attribute; records will be inserted in sorted order by this double. A query will ask for records between two values of the double attribute, and may additionally ask for a certain value of one of the ints.
Hmmm... fetching between two values of the double attribute is really easy, given that it's just increasing - you can just blindly append to a file and then read the file with a binary chop to find the right value. The int makes things trickier though... you'd want a separate index for that.
Jon Skeet
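A rough sketch of the append-then-binary-chop approach, assuming the 23-byte fixed-length layout above with the double stored last (all names here are illustrative):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;

    // Range scan over a flat file of fixed-length 23-byte records that are already
    // sorted by their trailing 8-byte double key.
    public class BinaryChop {
        static final int RECORD_SIZE = 23;
        static final int KEY_OFFSET = RECORD_SIZE - 8; // the double is stored last

        static double keyAt(RandomAccessFile f, long recordIndex) throws IOException {
            f.seek(recordIndex * RECORD_SIZE + KEY_OFFSET);
            return f.readDouble();
        }

        // Index of the first record whose key is >= lo (a standard lower-bound chop).
        static long lowerBound(RandomAccessFile f, double lo) throws IOException {
            long left = 0, right = f.length() / RECORD_SIZE;
            while (left < right) {
                long mid = (left + right) >>> 1;
                if (keyAt(f, mid) < lo) left = mid + 1; else right = mid;
            }
            return left;
        }

        // Visit every record whose key lies in [lo, hi].
        public static void rangeScan(String file, double lo, double hi) throws IOException {
            try (RandomAccessFile f = new RandomAccessFile(file, "r")) {
                long count = f.length() / RECORD_SIZE;
                long start = lowerBound(f, lo);
                f.seek(start * RECORD_SIZE);
                byte[] record = new byte[RECORD_SIZE];
                for (long i = start; i < count; i++) {
                    f.readFully(record);
                    double key = ByteBuffer.wrap(record).getDouble(KEY_OFFSET);
                    if (key > hi) break;
                    // ... decode the int fields here and apply any extra filter ...
                }
            }
        }
    }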
A: 

Extending Javier's answer, here's a thread on free alternatives to Berkeley DB.

sblundy
+4  A: 

XML is a data format, not a database. And XML is slow -- it's a huge number of extra bytes to marshal and unmarshal simple objects.

Only relational databases work with JDBC, so the JDBC requirement along with the non-relational question is confusing.

Why not serialize the objects and write them to a file the cheapest possible way? Your queries are limited by the fact that you have to read through all 32 GB. However, you can also invent your own indexing scheme using a HashMap (or TreeMap) and serialize that, too.

Or get a Java OODBMS. Look at JDOInstruments; this DevX article on OODBMSs lists four commercial products.
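A sketch of the roll-your-own-index idea from this answer, assuming the double attribute is the lookup key and that each record's byte offset in a flat data file is known at insert time (class and method names are hypothetical):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.util.TreeMap;

    // Hand-rolled index: map each record's double key to its byte offset in the
    // data file, answer range queries with TreeMap.subMap(), and serialize the
    // whole map so it can be reloaded later. A real index would need to handle
    // duplicate keys, e.g. TreeMap<Double, List<Long>>.
    public class OffsetIndex {
        private final TreeMap<Double, Long> index = new TreeMap<>();

        public void put(double key, long fileOffset) {
            index.put(key, fileOffset);
        }

        // Byte offsets of all records whose key lies in [lo, hi].
        public Iterable<Long> range(double lo, double hi) {
            return index.subMap(lo, true, hi, true).values();
        }

        public void save(String path) throws IOException {
            try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path))) {
                out.writeObject(index);
            }
        }

        @SuppressWarnings("unchecked")
        public void load(String path) throws IOException, ClassNotFoundException {
            try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(path))) {
                index.putAll((TreeMap<Double, Long>) in.readObject());
            }
        }
    }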

S.Lott
+1  A: 

Codebase runs just about everywhere; it's simple, fast, well supported, and rock-solid.

Quote:

The maximum table size is 2.1 billion records.

In addition, compression support can be used to save disk space and improve performance.

The maximum size of index files, which contain sort information, is 4096 gigabytes. We estimate that a table containing 128 indexes, each built on a 60 character field, could support approximately 2 billion records.

le dorfier
+5  A: 

Another possibility is CouchDB. It is a "document" database accessible over REST-like HTTP, with JSON data and fairly nice queries.
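For illustration, storing one record as a CouchDB document over HTTP might look roughly like the following (host, database name, document ID and field names are all hypothetical; a round trip per record would be far too slow for the stated insert rates, so real use would rely on CouchDB's bulk-document API):

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    // One record stored as a JSON document via CouchDB's HTTP API (PUT /db/docid).
    // Assumes a local CouchDB on port 5984 and that the 'records' database already exists.
    public class CouchDbPut {
        public static void main(String[] args) throws Exception {
            String json = "{\"a\":1,\"b\":2,\"c\":3,\"payload\":\"AAECAwQFBgc=\",\"key\":42.5}";
            URL url = new URL("http://localhost:5984/records/doc-000001");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/json");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(json.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("HTTP " + conn.getResponseCode()); // 201 Created on success
        }
    }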

Ali A
A: 

Take a look at db4o, though it might be overkill here.

Goran
+1  A: 

From what you have described it sounds like you may want to write a custom data storage mechanism.

There are specialty databases such as Citadel. This database was designed to efficiently store large numbers of data points whose values change very little over time. Because the source and type of data is known beforehand, it can use several techniques to keep the actual files on disk small. I list this just as an example, since I don't think you can buy Citadel without buying LabVIEW, a very expensive product. There used to be others in the field similar to this, but I can no longer find them via Google.

You may want to go in the other direction and look at Hypertable and Hadoop. These depend on a cluster of machines to reach their full potential, though, and that seems a little beyond what you described.

grieve
A: 

While I would recommend a second look at Berkeley DB, there's Hadoop.* Yes, I know Hadoop isn't really a database. But, from that page, "The following links provide interesting data samples that are most efficiently manipulated using distributed systems techniques." I.e., Hadoop is used to manipulate data of the size you're looking at.

EDIT: from a comment, the data store used by Hadoop is called HDFS.

Max Lybbert
Hadoop is a map-reduce framework. It is not a data store. You are thinking of HDFS, which is a scalable distributed file system that serves as Hadoop's backend. Sadly, it is write-once -- you cannot insert or append after a file is created. The whole thing is way overkill for the OP's purpose.
SquareCog
32GB is not big enough to warrant a distributed system.
Rich
Re: "Hadoop is a map-reduce framework. It is not a data store" -- as I said, "Yes, I know Hadoop isn't really a database." But thanks for naming HDFS. I haven't had a need to use Hadoop, but I understand it can be used on a single machine for smaller datasets.
Max Lybbert
+1  A: 

You can do much better than with a relational database. They were built with a totally different access profile (and ACID) in mind. On the other hand, avoid premature optimization. Fast enough is fast enough, and you can quickly become limited by engineering (programming) capacity.

For performance, as said before, XML is out of the question.

Start with a simple file-based solution. Keep as much data in RAM as possible, and make sure to use ints and not Integers.

Think about splitting over multiple machines. If your data is random, use a radix to distribute over multiple machines. If you have repeating data (does it zip very well?), normalize it by extracting the repeating part and replacing it with an index.
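As a toy illustration of the radix idea (hypothetical naming; it assumes the keys are spread fairly evenly, which a monotonically increasing double is not, so one of the int fields may be a better choice of radix):

    // Toy radix sharding: route each record to one of N machines (or files) by the
    // leading byte of its key's bit pattern. Purely illustrative.
    public class RadixShard {
        public static int shardFor(double key, int shards) {
            long bits = Double.doubleToLongBits(key);
            int topByte = (int) (bits >>> 56) & 0xFF; // radix = leading byte
            return topByte % shards;
        }

        public static void main(String[] args) {
            System.out.println(shardFor(42.5, 4)); // which shard the record with key 42.5 lands on
        }
    }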

Stephan Eggermont
+4  A: 

Sounds to me like you're getting hung up on calling it a database, and you just want a simple ISAM file. Have a hunt around for an implementation that suits you.

If you're after an actual non-relational product, then Btrieve has a long and honourable history.

Personally, in these circumstances I just opt for SQLite. It's small, fast, and the convenience of a portable implementation that offers ACID if needed is tricky to beat.
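For reference, a minimal sketch of the SQLite-over-JDBC route, which also satisfies the Java/JDBC bonus point; it assumes a SQLite JDBC driver on the classpath (for example the xerial sqlite-jdbc driver) and uses illustrative table and column names:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // SQLite through plain JDBC: batched inserts inside one transaction, an index
    // on the double column, and a range query. Table and column names are illustrative.
    public class SqliteExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:sqlite:records.db")) {
                try (Statement st = conn.createStatement()) {
                    st.execute("CREATE TABLE IF NOT EXISTS records (" +
                               " a SMALLINT, b INTEGER, c TINYINT, payload BLOB, sortkey DOUBLE)");
                    st.execute("CREATE INDEX IF NOT EXISTS idx_sortkey ON records(sortkey)");
                }

                conn.setAutoCommit(false); // commit in batches so inserts stay fast
                try (PreparedStatement ins = conn.prepareStatement(
                        "INSERT INTO records VALUES (?, ?, ?, ?, ?)")) {
                    for (int i = 0; i < 10000; i++) {
                        ins.setInt(1, i % 100);
                        ins.setInt(2, i);
                        ins.setInt(3, i % 10);
                        ins.setBytes(4, new byte[8]);
                        ins.setDouble(5, i * 0.5);
                        ins.addBatch();
                    }
                    ins.executeBatch();
                }
                conn.commit();

                // Range query on the indexed double column.
                try (PreparedStatement q = conn.prepareStatement(
                        "SELECT b, sortkey FROM records WHERE sortkey BETWEEN ? AND ?")) {
                    q.setDouble(1, 0.0);
                    q.setDouble(2, 100.0);
                    try (ResultSet rs = q.executeQuery()) {
                        while (rs.next()) {
                            System.out.println(rs.getInt(1) + " " + rs.getDouble(2));
                        }
                    }
                }
            }
        }
    }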

Cruachan
+3  A: 

SQLite is built on top of a BTree library, which you can use, bypassing the SQL layer.

Celestial M Weasel
A: 

We have a similar application, and Berkeley DB works well for us.

Mark Harrison
A: 

Is the data being written in "real time", all the time?

I would consider a couple of things.

First, I would think long and hard about not updating the indexes during "operations". That is, if you can, say, batch the index update "in the middle of the night", that would be a bonus.

Next, given the volume of your records, I would try to find an implementation that can batch-update the index. Most databases are designed to update their indexes one row at a time, but some have a "special mode" where they can make (or enforce) some base assumptions. For example, if the system "knows" that the data it is indexing is already sorted, it can make some good performance improvements to speed up the load.

If the query doesn't need to be "up to date" (i.e. it's typically a historical query, rather than stuff that happened in the past 5 minutes), you could likely index the data yourself quite easily, simply by sorting it by each key and then doing a binary search on the data. This will save space as well. The MergeSort algorithm on memory-mapped files can be quite performant.

Your indexes are likely going to be as large as, and potentially even larger than, your raw data, since your rows are quite small.

If you're searching by something as simple as "record number", then you likely don't even need an index at all, as the system will already be in record number order, and you can binary search that easily.

Really depends on how much new data you're getting every day, and how fast it's coming in.

Another option is to record the new data in its own file, then simply merge it into the older data (and the indexes). That can be very fast. If you get, say, 1GB of data per day, and run the merge at night, it's straightforward to merge the 1GB into the master 32GB data store and indexes. Lots of I/O either way, but it's almost all bulk, sequential, streaming I/O.
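A sketch of the nightly merge, assuming the 23-byte fixed-length records sorted by the trailing double key (file and class names are illustrative; note that with a strictly increasing key, as the asker describes, the merge degenerates into a simple append):

    import java.io.BufferedInputStream;
    import java.io.BufferedOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.EOFException;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;

    // Streaming merge of a day's new records into the master file, both already
    // sorted by the trailing 8-byte double key of each 23-byte record.
    public class NightlyMerge {
        static final int RECORD_SIZE = 23;
        static final int KEY_OFFSET = RECORD_SIZE - 8;

        static double key(byte[] record) {
            return ByteBuffer.wrap(record).getDouble(KEY_OFFSET);
        }

        // Returns the next record, or null at end of stream.
        static byte[] read(DataInputStream in) throws IOException {
            byte[] r = new byte[RECORD_SIZE];
            try {
                in.readFully(r);
                return r;
            } catch (EOFException end) {
                return null;
            }
        }

        public static void merge(String masterFile, String dailyFile, String outFile) throws IOException {
            try (DataInputStream a = new DataInputStream(new BufferedInputStream(new FileInputStream(masterFile)));
                 DataInputStream b = new DataInputStream(new BufferedInputStream(new FileInputStream(dailyFile)));
                 DataOutputStream out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(outFile)))) {
                byte[] ra = read(a), rb = read(b);
                while (ra != null && rb != null) {
                    if (key(ra) <= key(rb)) { out.write(ra); ra = read(a); }
                    else                    { out.write(rb); rb = read(b); }
                }
                for (byte[] r = ra; r != null; r = read(a)) out.write(r); // drain the master
                for (byte[] r = rb; r != null; r = read(b)) out.write(r); // drain the daily file
            }
        }
    }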

Will Hartung
A: 

XML is definitely not what you are looking for.

What you describe sounds like a job for HDF5: http://www.hdfgroup.org/HDF5/

There are Java interfaces to this library.

Also interesting is the PyTables project: http://www.pytables.org/moin

Scott
A: 

I know this is a little old, but I just ran across the C++ library stxxl, which "implements containers and algorithms that can process huge volumes of data that only fit on disks" (that is, "volumes of data too big to fit in memory").

Bullet points are:

  • Transparent support of parallel disks.
  • The library is able to handle problems of very large size (up to dozens of terabytes).
  • Supports explicitly overlapping between I/O and computation.
  • Small constant factors in I/O volume. ... "pipelining" can save more than half the number of I/Os performed by many applications.
  • Shorter development times due to well known STL-compatible interfaces for external memory algorithms and data structures.
Max Lybbert
A: 

Check out InterSystems Caché (http://www.intersystems.com/). It is the fastest non-relational database and, they claim, the fastest SQL-compliant relational database on the market.

CurtTampa
A: 

Read Monash's site (http://www.dbms2.com/); he writes about all different kinds of databases.

tuinstoel
A: 

Personally I would use Redis. It is an advanced data-structure server that provides structures such as strings, lists, sets, sorted sets and hashes. It is very fast: based on these benchmarks it can do about 110,000 SETs per second and about 81,000 GETs per second.

There is a client library for pretty much every active language in use today, including a few for Java.
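As an illustration of how the sorted-set structure maps onto the range query in the question, here is a sketch assuming the Jedis client library and a local Redis server (key and member encodings are made up); note that Redis keeps its data set in memory, which matters given the 2 GB of RAM mentioned earlier:

    import redis.clients.jedis.Jedis;

    // Redis sorted set: the double attribute is the score, the (encoded) record is
    // the member, and ZRANGEBYSCORE answers the range query directly.
    public class RedisRangeExample {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                // Insert: score = the double attribute, member = some encoding of the record.
                jedis.zadd("records", 42.5, "a=1,b=2,c=3,payload=...");
                jedis.zadd("records", 99.0, "a=4,b=5,c=6,payload=...");

                // Range query: all members whose score lies in [0, 100].
                for (String member : jedis.zrangeByScore("records", 0.0, 100.0)) {
                    System.out.println(member);
                }
            }
        }
    }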

mythz