Google's MapReduce framework
How do you quickly sort a list of a billion elements? Or multiply two matrices, each with a million rows and a million columns?
While implementing its PageRank algorithm, Google quickly discovered the need for a systematic framework for processing massive datasets. Processing at that scale can be done only by distributing both the data and the computation over many storage units and processors. Implementing a single algorithm, such as PageRank, in that environment is difficult, and maintaining the implementation as the dataset grows is even more challenging.
The answer is to separate the software into two levels: a framework that manages the big-data access and parallel processing at the lower level, and two user-written methods at the upper level. The user who writes the two methods need not be concerned with the details of the big-data management at the lower level; a sketch of those two methods follows.
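To make the division of labor concrete, here is a minimal sketch of the two user-written methods for the classic word-count task. The names (map_fn, reduce_fn) and the Python language are illustrative assumptions, not the framework's actual interface, and the run_mapreduce driver is a toy single-machine stand-in for the framework: in the real system these steps are distributed across many machines.

```python
from collections import defaultdict

# The two user-written methods; everything else is the framework's job.

def map_fn(_key, line):
    """Map: for each word in an input line, emit the pair (word, 1)."""
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum all the counts emitted for the same word."""
    yield word, sum(counts)

# Toy single-machine stand-in for the framework (illustration only):
# it runs the map stage, groups intermediate pairs by key, then reduces.
def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):   # map stage
            groups[out_key].append(out_value)           # group by key
    results = {}
    for key, values in sorted(groups.items()):          # reduce stage
        for out_key, out_value in reduce_fn(key, values):
            results[out_key] = out_value
    return results

if __name__ == "__main__":
    docs = [(0, "the quick brown fox"), (1, "the lazy dog")]
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

Note that neither user method mentions files, machines, or failures; that isolation is exactly what lets the same two methods run unchanged whether the input is one document or a billion.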
Specifically, the data flows through a sequence of stages:
The input stage divides the input into chunks, usually...