views: 383
answers: 5

Given that Solid State Disks (SSDs) are decreasing in price and will soon become more prevalent as system drives, and given that their access rates are significantly higher than those of rotating magnetic media, what standard algorithms will gain in performance from the use of SSDs for local storage? For example, the high random read speed of SSDs makes something like a disk-based hashtable viable for large hashtables; 4GB of disk space is readily available, which makes hashing to the entire range of a 32-bit integer viable (more for lookup than population, though, which would still take a long time). While a hashtable of this size would be prohibitive to work with on rotating media due to the access speed, it shouldn't be as much of an issue with SSDs.
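
To make the idea concrete, here is a minimal sketch of the sort of disk-backed table I have in mind (the file name and the one-byte-per-slot layout are illustrative assumptions only):

```python
# Sketch only: a read-mostly table on disk, directly indexed by a 32-bit key.
# With one byte per slot, covering the full 32-bit key range takes
# 2**32 bytes = 4 GB, and each lookup is a single seek plus a tiny read --
# exactly the random-access pattern where SSDs beat rotating media.

TABLE_PATH = "hashtable.bin"   # hypothetical pre-populated 4 GB file

def lookup(key: int) -> int:
    """Return the one-byte value stored for a 32-bit key with one random read."""
    with open(TABLE_PATH, "rb") as f:
        f.seek(key & 0xFFFFFFFF)   # slot offset == key, since slots are one byte
        return f.read(1)[0]
```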

Are there any other areas where the impending transition to SSDs will provide potential gains in algorithmic performance? I'd rather see reasoning as to how one thing will work rather than opinion; I don't want this to turn contentious.

A: 

Don't kid yourself. SSDs are still a whole lot slower than system memory. Any algorithm that chooses to use system memory over the hard disk is still going to be much faster, all other things being equal.

Triptych
The point is, not all other things are equal. Specifically, as an example, 4GB of SSD space is relatively easy to find; 4GB of easily addressable system memory is much harder to find.
McWafflestix
4GB of RAM is pretty standard on any computer that needs to sort 4GB worth of stuff.
Triptych
Price per gigabyte of memory is still lower for RAM versus SSD. 64-bit address space is common in servers and becoming more common on the desktop.
Michael
@Triptych: yes, 4GB of RAM is pretty standard, but when you fill that 4GB up with the hashtable, where is your OS going to sit? Your application?
McWafflestix
@Michael: yes, it's a good point, but typically RAM is at a premium; persistent storage less so.
McWafflestix
@McWafflestix - I messed up my statement; RAM is still more expensive per gigabyte. But 4 GB of RAM is cheaper than a decent-performing SSD.
Michael
@Michael: yes, it's a good point; still, consider the situation where a user application would like to use a large precomputed hashtable. At installation time, you can probably expect to have 4GB of storage space, and sometime in the near future you can possibly expect that to be solid state; I don't know how viable it would be to expect the user to have 4GB of EXTRA RAM to put your hashtable in. I agree, though, that for server type uses, it's usually easier to add more RAM; consider, however, the point Arno Setagaya makes above of a 50GB hashtable.
McWafflestix
@McWafflestix - It makes me feel like we're trying to force situations where SSDs would be beneficial by creating scenarios that are pathological for rotational media. A 50GB hash table that must have better than 50 MB/s read bandwidth would benefit from an SSD - but only because we picked the worst datastore imaginable for a mechanical drive. If the data could be reorganized into a B-tree-like layout, or the indexes separated out and kept in memory with large chunks of the table cached, and so on, we could still achieve decent performance.
Michael
@Michael: it's a good point; I don't particularly want to get into too much fiddling over details; the point is that the hashtable idea is just one idea; I wanted to know if anyone had any others.
McWafflestix
+6  A: 

Your example of hashtables is indeed the key database structure that will benefit. Instead of having to load a whole 4GB-or-more file into memory to probe for values, the SSD can be probed directly. The SSD is still slower than RAM by orders of magnitude, but it's quite reasonable to have a 50GB hash table on disk, whereas keeping it in RAM means paying big money for big iron.

An example is chess position databases. I have over 50GB of hashed positions. There is complex code to try to group related positions near each other in the hash, so I can page in 10MB of the table at a time and hope to reuse some of it for multiple similar position queries. There's a ton of code and complexity to make this efficient.

After replacing the drive with an SSD, I was able to drop all the clustering complexity and just use really dumb randomized hashes. I also got an increase in performance, since I only fetch the data I need from the disk, not big 10MB chunks. The latency is indeed larger, but the net speedup is significant, and the super-clean code (20 lines, not 800+) is perhaps even nicer.
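
Roughly, the simplified probe looks like this (an illustrative sketch rather than the actual code; the record size, file name, and hash function are placeholders):

```python
import hashlib

RECORD_SIZE = 32                        # assumed bytes per stored position entry
NUM_SLOTS = 50 * 10**9 // RECORD_SIZE   # roughly a 50 GB table
DB_PATH = "positions.bin"               # hypothetical on-disk hash table

def probe(position_key: bytes) -> bytes:
    """One small random read per query; no clustering or 10MB paging needed."""
    digest = hashlib.blake2b(position_key, digest_size=8).digest()
    slot = int.from_bytes(digest, "little") % NUM_SLOTS
    with open(DB_PATH, "rb") as f:
        f.seek(slot * RECORD_SIZE)
        return f.read(RECORD_SIZE)      # fetch just the one record, not a 10MB chunk
```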

SPWorley
Excellent example and good point; I hadn't thought about chess positions, but it's a very interesting case.
McWafflestix
+2  A: 

SSDs are only significantly faster for random access; for sequential access they are only about twice as fast as mainstream rotational drives. Many SSDs also have poor performance in a number of scenarios, which can leave them performing worse than rotational drives, as described here.

While SSDs do move the needle considerably, they are still much slower than CPU operations and physical memory. For your 4GB hash table example, you may be able to sustain 250+ MB/s off of an SSD when accessing random hash table buckets. For a rotational drive, you'd be lucky to break single-digit MB/s. If you can keep the 4GB hash table in memory, you could access it on the order of gigabytes per second - much faster than even a very swift SSD.
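
As a rough back-of-the-envelope illustration of those ratios (the 4 KB bucket size, lookup count, and exact bandwidth figures are assumptions for the sake of the arithmetic):

```python
# Approximate time for one million random bucket reads at the ballpark
# bandwidths discussed above (all figures assumed, not measured).
BUCKET_BYTES = 4 * 2**10        # assume ~4 KB read per random bucket access
LOOKUPS = 1_000_000

for name, mb_per_s in [("rotational drive", 5), ("SSD", 250), ("RAM", 5000)]:
    seconds = LOOKUPS * BUCKET_BYTES / (mb_per_s * 2**20)
    print(f"{name:>16}: ~{seconds:8.1f} s for {LOOKUPS:,} random 4 KB reads")
```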

The referenced article lists several changes MS made for Windows 7 when running on SSDs, which can give you an idea of the sort of changes you could consider making. First, SuperFetch, which prefetches data off of disk, is disabled - it's designed to work around slow random access times for disk, which are alleviated by SSDs. Defrag is disabled, because having files scattered across the disk isn't a performance hit for SSDs.

Michael
You're talking more about optimizations for SSDs; I'm considering the types of algorithms that are made possible (or more viable) by SSD performance. I'm less interested in the optimizations that are possible (or necessary) than in the different types of algorithms or applications which were simply not possible with slower persistent storage.
McWafflestix
+1  A: 

Ipso facto: any algorithm you can think of that requires lots of random disk I/O (random being the key word, since it throws the principle of locality to the birds and eliminates the usefulness of a lot of the caching that goes on).

I could see certain database systems gaining from this, though - MySQL, for instance, with the MyISAM storage engine (where data records are basically glorified CSVs). However, I think very large hashtables are going to be your best bet for good examples.

Chris
Actually, part of the point was that algorithms themselves don't use disks; the question was, what standard algorithms can be enabled by the performance increases of SSDs? Much like how managed code was enabled by computers of a certain speed and size...
McWafflestix
Algorithms themselves **don't** use disks - implementations of algorithms do - upon that we can agree. Yes, managed code was made possible by hardware improvements - but it took many orders of magnitude of "better" computer hardware to do so. The jump between HDDs and SSDs isn't (pardon the expression) oodles of magnitudes. The only reliable benefit is random access. Going back to my initial response: "...which requires lots of random disk I/O..."
Chris
A: 

SSDs are a lot faster for random reads, a bit faster for sequential reads, and probably slower for writes (random or not).

So a disk-based hashtable is probably not useful with an SSD, since it now takes significant time to update it, but searching the disk becomes (compared to a normal HDD) very cheap.

tomjen
Note that in the original question, I mentioned that the hashtable is more viable for lookup than population for that precise reason; consider the concept of a "pre-populated" hashtable that ships with software to allow for predefined hash lookups; 4GB of installation space is quite reasonable for modern apps.
McWafflestix