From online discussion groups and blogs, I have seen a lot of interview questions related to handling large-scale datasets. I am wondering whether there is a systematic approach to analyzing this type of question. More specifically, are there any data structures or algorithms that can be used to deal with this? Any suggestions are appreciated.
There is no single data structure or algorithm for "handling" large datasets of every possible nature and for every possible purpose. There is, rather, a vast collection of architectures, data structures, and algorithms for the many varied kinds of data and kinds of required "handling" (and single-task, SMP, and distributed environments may well call for very different approaches in many cases).
When people describe a large data set, they frequently mean one where the entire data set cannot be stored in memory. This creates challenges about which data to load, and when to load and unload it.
One approach is to use a sequential data file and process it from beginning to end. That is effective when the processing is naturally sequential, but it doesn't work well when the processing needs to combine data from different parts of the data set.
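For example, a minimal Python sketch of the sequential approach, assuming a plain-text file with one record per line (`process_record` is just a hypothetical placeholder):

```python
# Stream a large text file one record at a time so the whole file
# never has to fit in memory at once.
def process_sequentially(path):
    with open(path, "r", encoding="utf-8") as f:
        for line in f:                       # file objects iterate lazily, line by line
            process_record(line.rstrip("\n"))

def process_record(record: str) -> None:
    pass  # hypothetical per-record work: parse, filter, aggregate, ...
```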
Another approach is some sort of indexed file, retrieving necessary bits of data as they are needed.
A specialisation of this is the use of memory mapped files, where you let the memory manager handle the loading and caching of data.
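Here is a rough sketch of that idea with Python's `mmap`, assuming fixed-width records so an offset can be computed directly (the record size is made up for illustration):

```python
import mmap

RECORD_SIZE = 128  # assumed fixed-width records, purely for illustration

def read_record(path: str, index: int) -> bytes:
    """Read one record by index; the OS pages in only the parts of the file we touch."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = index * RECORD_SIZE
            return mm[start:start + RECORD_SIZE]
```

In practice you would keep the mapping open across many reads rather than re-mapping the file on every call.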
A DBMS can greatly simplify data access, but does add some system overhead.
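As one concrete illustration (the table and columns here are invented for the example), even SQLite gives you paging, caching, and indexed lookups without ever loading the whole data set:

```python
import sqlite3

conn = sqlite3.connect("large_dataset.db")
conn.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, user TEXT, value REAL)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_events_user ON events(user)")
conn.commit()

# The index lets this query touch only the relevant pages on disk
# instead of scanning the entire table.
rows = conn.execute("SELECT value FROM events WHERE user = ?", ("alice",)).fetchall()
conn.close()
```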
"Large-scale" data sets fall into several categories that I've seen, each of which presents different challenges for you to think about.
- Data that's too big to fit in memory. Here, some key techniques are:
- Caching frequently used data for better performance
- Working on data from a file a chunk at a time instead of trying to read the whole file into memory at once (if you're not working sequentially through the file, this can be particularly challenging!)
- Distributing data across the memory of several machines.
- Data that's too big to fit into a single file because of file system or hardware architecture limits. This is pretty easy to solve – split the file – but there's a practical question in many cases of what a sensible split would be.
- Data that's too big to fit on a single hard disk. Here, the main techniques are to buy bigger disks :-), or to distribute the data across several machines.
- Distributing data across multiple machines poses interesting challenges when you need to do analysis or transformations on the data. This is a deep topic with a lot of different approaches and challenges. Map/reduce frameworks like CouchDB and Hadoop have recently become popular tools for research and application in this area.
- Data that's too big for a single database instance. This can be a problem of disk size (running out of space) or performance (the memory cache keeps getting blown, indices have simply grown too big). Maintaining robustness and performance of data split across multiple DB instances, potentially in multiple data centers, is an area of perennial interest to large enterprises. Here, the choices are:
- Vertical splits (different tables to different DBs)
- Horizontal splits (same table on different DBs, but holding different data)
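As a rough illustration of the horizontal-split idea, one common scheme is to route each row to a database instance by hashing its key. A minimal sketch (the shard names and the `save_to` helper are hypothetical stand-ins for real connections):

```python
import hashlib

SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(key: str) -> str:
    """Pick a shard deterministically from the row's key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

def save_user_row(user_id: str, row: dict) -> None:
    save_to(shard_for(user_id), row)  # hypothetical helper that writes to that instance

def save_to(shard: str, row: dict) -> None:
    pass  # placeholder: open a connection to `shard` and insert `row`
```

Note that simple modulo hashing forces a rebalance whenever the number of shards changes; consistent hashing is a common refinement.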
Other problems often associated with large-scale data sets, though not size-related per se, are:
- Data that is coming in fast. Think of systems that need to scale to millions or even billions of transactions per minute.
- Data that is constantly changing. How do you deal with stale data or data that's being modified while you're working on it?
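One common (though by no means the only) answer to that last point is optimistic concurrency: tag each record with a version and only commit your work if the version hasn't moved since you read it. A minimal sketch, with an in-memory dict standing in for the real store:

```python
# In a real system the version check and write would be a single atomic
# operation (e.g. a conditional UPDATE); the dict here just shows the idea.
store = {"user:42": {"version": 7, "balance": 100.0}}

def add_to_balance(key: str, delta: float, retries: int = 3) -> bool:
    for _ in range(retries):
        snapshot = dict(store[key])                # read a copy of the record
        read_version = snapshot["version"]
        snapshot["balance"] += delta               # do the work on the copy
        if store[key]["version"] == read_version:  # nothing changed underneath us
            snapshot["version"] = read_version + 1
            store[key] = snapshot                  # commit
            return True
        # otherwise the data changed while we were working: retry on fresh data
    return False
```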