Spark architecture in a cluster
The Hadoop-based MapReduce framework has been widely used over the last few years; however, it has issues with I/O, algorithmic complexity, low-latency streaming jobs, and its fully disk-based operation. Hadoop uses the Hadoop Distributed File System (HDFS) to store big data cheaply and compute over it efficiently, but with the Hadoop-based MapReduce framework you can only run computations in a high-latency batch model over static data. The main big data paradigm shift that Spark brings is the introduction of in-memory computing and a caching abstraction. This makes Spark ideal for large-scale data processing and lets the computing nodes perform multiple operations against the same input data.
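To make the caching abstraction concrete, the following minimal sketch (in Scala) caches a filtered RDD once and then reuses it in two separate actions without re-reading the input from disk. The application name, the local master setting, and the HDFS path are illustrative assumptions, not taken from the text.

import org.apache.spark.sql.SparkSession

object CachingExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CachingExample")
      .master("local[*]")          // run locally for the sketch; on a cluster the master is set by the launcher
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input path; replace with a real dataset on HDFS or local disk.
    val lines = sc.textFile("hdfs:///data/logs.txt")

    // Keep the filtered records in memory so that several actions can reuse them
    // without re-reading the file from disk each time.
    val errors = lines.filter(_.contains("ERROR")).cache()

    // Both actions below operate on the cached in-memory partitions.
    val total = errors.count()
    val distinctHosts = errors.map(_.split(" ")(0)).distinct().count()

    println(s"error lines: $total, distinct hosts: $distinctHosts")
    spark.stop()
  }
}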
Spark's Resilient Distributed Dataset (RDD) model can do everything the MapReduce paradigm can, and more. In particular, Spark can perform iterative computations on your dataset at scale, which makes it a good fit for machine learning and other general-purpose data processing workloads, as the sketch below illustrates.
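As a hedged sketch of such an iterative workload (the toy data points and the one-dimensional gradient descent are illustrative assumptions, not from the source), each iteration makes another full pass over the same cached RDD, which is exactly the access pattern a disk-based MapReduce job handles poorly.

import org.apache.spark.sql.SparkSession

object IterativeExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("IterativeExample")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Toy dataset of (x, y) points roughly following y = 3x; in practice this
    // would be loaded from HDFS and could be far larger than one machine's memory.
    val points = sc.parallelize(Seq((1.0, 3.1), (2.0, 5.9), (3.0, 9.2), (4.0, 11.8))).cache()

    // Simple one-dimensional gradient descent: every iteration re-scans the
    // cached RDD in memory instead of re-reading it from disk.
    var w = 0.0
    val learningRate = 0.01
    for (_ <- 1 to 50) {
      val gradient = points.map { case (x, y) => (w * x - y) * x }.mean()
      w -= learningRate * gradient
    }

    println(s"estimated slope: $w")
    spark.stop()
  }
}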