Apache Hadoop
Apache Hadoop is an open source software platform for the distributed storage and processing of very large datasets, typically running on clusters of low-cost commodity hardware. It came into existence thanks to Google's paper on the Google File System, published in October 2003. Another Google research paper, "MapReduce: Simplified Data Processing on Large Clusters," described the companion programming model for processing data across such clusters.
Apache Nutch is a highly scalable and extensible open source web crawler project that implemented a MapReduce facility and a distributed file system based on Google's research papers. These facilities were later split off into a sub-project called Apache Hadoop.
Apache Hadoop is designed to scale easily from a few servers to thousands, each offering local computation and storage, so data is processed in parallel where it resides. One of the benefits of Hadoop is that it detects and handles failures at the software level rather than relying on the hardware. The following figure illustrates the overall architecture of the Hadoop...