views: 434
answers: 7

I recently spoke to someone who works for Amazon, and he asked me: how would I go about sorting terabytes of data using a programming language?

I'm a C++ guy, so of course we spoke about merge sort. One possible technique is to split the data into smaller chunks, sort each chunk, and finally merge them.
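
A minimal sketch of that split / sort / merge approach (an external merge sort over a file of text lines) in Java; the input file name and the chunk size are illustrative assumptions, not anything Amazon prescribed:

    import java.io.*;
    import java.nio.file.*;
    import java.util.*;

    // Illustrative external merge sort: sort chunks that fit in memory, then k-way merge them.
    public class ExternalSortSketch {
        static final int CHUNK_LINES = 1_000_000;   // assumed to fit comfortably in RAM

        public static void main(String[] args) throws IOException {
            List<Path> chunks = splitAndSortChunks(Paths.get("input.txt"));   // hypothetical input
            mergeChunks(chunks, Paths.get("sorted.txt"));
        }

        // Phase 1: read memory-sized chunks, sort each in RAM, spill them to temporary files.
        static List<Path> splitAndSortChunks(Path input) throws IOException {
            List<Path> chunkFiles = new ArrayList<>();
            try (BufferedReader in = Files.newBufferedReader(input)) {
                List<String> buffer = new ArrayList<>();
                String line;
                while ((line = in.readLine()) != null) {
                    buffer.add(line);
                    if (buffer.size() == CHUNK_LINES) {
                        chunkFiles.add(writeSortedChunk(buffer));
                        buffer.clear();
                    }
                }
                if (!buffer.isEmpty()) chunkFiles.add(writeSortedChunk(buffer));
            }
            return chunkFiles;
        }

        static Path writeSortedChunk(List<String> buffer) throws IOException {
            Collections.sort(buffer);
            Path tmp = Files.createTempFile("chunk", ".txt");
            Files.write(tmp, buffer);
            return tmp;
        }

        // Phase 2: k-way merge of the sorted chunk files using a priority queue of their heads.
        static void mergeChunks(List<Path> chunks, Path output) throws IOException {
            class Head { String line; BufferedReader reader; }
            PriorityQueue<Head> heap =
                    new PriorityQueue<>(Comparator.comparing((Head h) -> h.line));
            for (Path chunk : chunks) {
                BufferedReader r = Files.newBufferedReader(chunk);
                String first = r.readLine();
                if (first != null) { Head h = new Head(); h.line = first; h.reader = r; heap.add(h); }
            }
            try (BufferedWriter out = Files.newBufferedWriter(output)) {
                while (!heap.isEmpty()) {
                    Head smallest = heap.poll();
                    out.write(smallest.line);
                    out.newLine();
                    String next = smallest.reader.readLine();
                    if (next != null) { smallest.line = next; heap.add(smallest); }
                    else smallest.reader.close();
                }
            }
        }
    }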

But in reality, do companies like Amazon or eBay actually sort terabytes of data? I know they store tons of information, but do they sort it?

In a nutshell, my question is: why wouldn't they keep the data sorted in the first place, instead of sorting terabytes of it later?

+9  A: 

But in reality, do companies like Amazon or eBay actually sort terabytes of data? I know they store tons of information, but do they sort it?

Yes. Last time I checked Google processed over 20 petabytes of data daily.

Why wouldn't they keep the data sorted in the first place, instead of sorting terabytes of it later? That is my question in a nutshell.

EDIT: relet makes a very good point; you only need to keep indexes and have those sorted. You can easily and efficiently retrieve sorted data that way. You don't have to sort the entire dataset.

NullUserException
I agree. But my doubt is about sorting so much data in one shot: why would someone do it?
+1. Recently, a team of programmers was able to sort 1 terabyte in 1 minute.
Fosco
Perhaps he wants the existing data sorted according to a new or modified criterion?
Peter G.
Fosco, really? Using what algorithm, and for which firm?
The number of items and the size of their keys is what matters. If I have 13 GB videos and I'm sorting by filename, that terabyte doesn't sound like much. But if I'm sorting strings of roughly 50 characters each...
Steven Sudit
@nsivakr http://scienceblog.com/36957/data-sorting-world-record-falls-computer-scientists-break-terabyte-sort-barrier-in-60-seconds/ and http://sortbenchmark.org/tritonsort_2010_May_15.pdf
Kotti
Steven, that sounds right. Sorting terabytes of string data is too much.
Sorting a terabyte of data (consisting of short strings) is no big task: using 20 $0.085/hour machines on AWS with EMR, you can sort the strings in under an hour. Uploading the dataset to S3 is actually more tedious and time-consuming.
Akshay Bhat
Akshay, under an hour is terrific. Did you write the code on your own? Do you mind sharing how you achieved it?
@nsivakr: Why would you say this is "too much"? It's perfectly doable, people are doing it, and it's useful.
David Thornley
David, I was curious to know under which circumstances someone would sort terabytes of data or more (purely sorting). I'm not talking about maintaining maps/indexes in a database.
Yikes, 10 hours and not 1 hour; check this link: http://developer.yahoo.net/blogs/hadoop/2008/07/apache_hadoop_wins_terabyte_sort_benchmark.html. It's simple: if you understand Hadoop / MapReduce, sorting is a mandatory step after the map operation. By using an identity mapper and no reducer, the output would be sorted. For custom output, you would need to emit the key to be sorted as the output of the map function. http://en.wikipedia.org/wiki/MapReduce
Akshay Bhat
A: 

Umm, how would they get it sorted in the first place?

Oli Charlesworth
I think the idea is that as the data comes in, you insert it into your dataset in sorted order, so at no time do you have a huge amount of unsorted data that you suddenly need to sort.
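
A tiny sketch of that approach in Java, assuming records keyed by a numeric id; an ArrayList stands in for what a real store would do with a B-tree or LSM-tree:

    import java.util.*;

    // Keep the collection sorted as each record arrives, so there is never a big unsorted pile.
    public class InsertInOrder {
        private final List<Long> ids = new ArrayList<>();

        public void onArrival(long id) {
            int pos = Collections.binarySearch(ids, id);
            if (pos < 0) pos = -pos - 1;   // insertion point when the id is not already present
            ids.add(pos, id);              // O(n) shift here; real stores use B-trees / LSM-trees
        }

        public List<Long> inOrder() {
            return ids;                    // always sorted, no separate sort step needed
        }
    }
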
+3  A: 

Every database index is a sorted representation of some part of your data. If you index it, you sort the keys - even if you do not necessarily reorder the entire dataset.
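
A rough sketch of that idea in Java, assuming string keys and byte offsets into a data file that itself stays in arrival order:

    import java.util.TreeMap;

    // A sorted index: keys kept ordered in a TreeMap, each pointing at its record's offset
    // in the data file; the data file itself is never reordered.
    public class KeyIndexSketch {
        private final TreeMap<String, Long> keyToOffset = new TreeMap<>();

        // Record where a newly appended record lives.
        public void index(String key, long offsetInDataFile) {
            keyToOffset.put(key, offsetInDataFile);
        }

        // Walking the index yields offsets in key order, i.e. a sorted view of the data.
        public Iterable<Long> offsetsInKeyOrder() {
            return keyToOffset.values();
        }
    }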

relet
+1  A: 

Scientific datasets can easily run into terabytes. You may sort them and store them in one way (say by date) when you gather the data. However, at some point someone will want the data sorted by another method, e.g. by latitude if you're using data about the Earth.

GreenMatt
+5  A: 

Consider log data from servers; Amazon must have a huge amount of it. Log data is generally stored as it is received, that is, sorted by time. So if you want it sorted by product, you would need to sort the whole data set.

Another issue is that many times the data needs to be sorted according to the processing requirement, which might not be known beforehand.

For example: though not a terabyte, I recently sorted around 24 GB of Twitter follower-network data using merge sort. The implementation I used was by Prof. Daniel Lemire.

http://www.daniel-lemire.com/blog/archives/2010/04/06/external-memory-sorting-in-java-the-first-release/

The data was sorted by userid, and each line contained a userid followed by the userid of the person following him. However, in my case I wanted data about who follows whom, so I had to sort it again by the second userid in each line.
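
For illustration, that re-sort boils down to comparing lines by their second column; here is a plain in-memory sketch in Java (the tab separator is an assumption), with the external-memory version applying the same comparator to chunks on disk:

    import java.util.*;

    // Re-sort "followerId<TAB>followedId" lines by the second userid.
    public class SortBySecondId {
        static final Comparator<String> BY_SECOND_ID = new Comparator<String>() {
            public int compare(String a, String b) {
                long x = Long.parseLong(a.split("\t")[1]);
                long y = Long.parseLong(b.split("\t")[1]);
                return Long.compare(x, y);
            }
        };

        public static void main(String[] args) {
            List<String> edges = new ArrayList<String>(Arrays.asList(
                    "42\t7",     // user 42 follows user 7
                    "13\t99",
                    "8\t7"));
            Collections.sort(edges, BY_SECOND_ID);   // now grouped by who is being followed
            for (String e : edges) System.out.println(e);
        }
    }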

However, for sorting 1 TB I would use MapReduce with Hadoop. Sorting is the default step after the map function, so I would choose the identity function as the map, no reduce function, and set up streaming jobs.

Hadoop uses HDFS, which stores data in large blocks of 64 MB (this value can be changed). By default it runs a single map per block. After the map function runs, its output is sorted, I guess by an algorithm similar to merge sort.

Here is the link to the identity mapper: http://hadoop.apache.org/common/docs/r0.16.4/api/org/apache/hadoop/mapred/lib/IdentityMapper.html

If you want to sort by some element in that data, then I would make that element the key and the whole line the value in the output of the map.
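
A sketch of such a mapper, using the same old mapred API as the IdentityMapper linked above; the tab-separated input and the choice of field are assumptions for illustration:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Emit the field to sort by as the key and the whole line as the value;
    // Hadoop's shuffle then delivers the records ordered by that key.
    // (Text keys sort lexicographically, so numeric ids would need zero-padding
    // or a LongWritable key.)
    public class SortKeyMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        private static final int SORT_FIELD = 1;   // assumed: sort by the second tab-separated field

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            String[] fields = line.toString().split("\t");
            output.collect(new Text(fields[SORT_FIELD]), line);
        }
    }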

Akshay Bhat
+3  A: 

Yes, some companies do. Or maybe even individuals. Take high-frequency traders as an example. Some of them are well known, say Goldman Sachs. They run very sophisticated algorithms against the market, taking into account tick data for the last couple of years: every change in the price offering, real deal prices (trades, also known as prints), and so on. For highly volatile instruments such as stocks, futures, and options, there are gigabytes of data every day, and they have to do scientific research on data for thousands of instruments going back a couple of years. Not to mention news that they correlate with the market, weather conditions, and even moon phases. So yes, there are guys who sort terabytes of data. Maybe not every day, but they do.

Vlad Lazarenko
+4  A: 

Yes, some companies certainly sort at least that much data every day.

Google has a framework called MapReduce that splits work - like a merge sort - onto different boxes, and handles hardware and network failures smoothly.

Hadoop is a similar Apache project you can play with yourself, to enable splitting a sort algorithm over a cluster of computers.

Dean J
Dean, do you work for Google? How do they handle errors and network failures? That sounds like an exciting project to develop.
I meant: if there is an error, does another thread/process take over from where it left off, etc.?
Have a look at Apache Hadoop; they do checkpointing and replication to handle failures.
Akshay Bhat
I do not work for Google. :-) I'm not sure about the implementation, but the MapReduce paper explains that network failures are handled transparently to the algorithm; they've split apart the algorithm you're trying to run from the code required to make it run well on multiple boxes, so developers spend more time on algorithms and don't waste extra time making things reliable across the network. Hadoop is probably what you want.
Dean J