Interoperability with streaming platforms (Apache Kafka)
Spark Streaming integrates well with Apache Kafka, currently one of the most widely used messaging platforms. There are several approaches to Kafka integration, and the mechanism has evolved over time to improve performance and reliability.
There are three main approaches for integrating Spark Streaming with Kafka:
- Receiver-based approach
- Direct stream approach
- Structured Streaming
Receiver-based approach
The receiver-based approach was the first integration between Spark and Kafka. In this approach, the driver starts receivers on the executors, and these receivers pull data from the Kafka brokers using Kafka's high-level consumer API. Because the receivers pull events from the Kafka brokers, they also record the consumed offsets in ZooKeeper, which the Kafka cluster itself uses. The key aspect is the use of a WAL (Write Ahead Log), which the receiver keeps writing to as it consumes data from Kafka. So, when there is a problem and executors or receivers are lost or restarted, the WAL can be used...
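To make the role of the WAL concrete, here is a minimal, self-contained sketch in plain Python. It is not Spark or Kafka code: the names `WriteAheadLog`, `ToyReceiver`, and `demo` are hypothetical, and the log is just a local file of JSON lines. It only illustrates the general idea described above, assuming that a receiver writes each event to a durable log before keeping it in memory, so that a restarted receiver can rebuild its state by replaying that log.

```python
import json
import os
import tempfile


class WriteAheadLog:
    """Hypothetical append-only log; each record is stored as one JSON line."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        # Persist the record durably before the caller treats it as received.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        # Return every record logged so far, in write order.
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


class ToyReceiver:
    """Hypothetical receiver: logs each event, then buffers it in memory."""

    def __init__(self, wal):
        self.wal = wal
        self.buffer = []  # in-memory data, lost if the process dies

    def receive(self, offset, value):
        record = {"offset": offset, "value": value}
        self.wal.append(record)     # 1. write to the log first
        self.buffer.append(record)  # 2. only then keep it in memory

    @classmethod
    def recover(cls, wal):
        # A restarted receiver rebuilds its buffer from the log.
        receiver = cls(wal)
        receiver.buffer = wal.replay()
        return receiver


def demo():
    wal_path = os.path.join(tempfile.mkdtemp(), "receiver.wal")
    first = ToyReceiver(WriteAheadLog(wal_path))
    for offset, value in enumerate(["a", "b", "c"]):
        first.receive(offset, value)
    del first  # simulate losing the executor and its in-memory buffer
    restarted = ToyReceiver.recover(WriteAheadLog(wal_path))
    offsets = [r["offset"] for r in restarted.buffer]
    values = [r["value"] for r in restarted.buffer]
    return offsets, values
```

Calling `demo()` writes three events, discards the first receiver, and builds a second one from the same log file; the second receiver ends up holding the same three events. In a real deployment the log would have to live on fault-tolerant storage rather than a local temporary directory, since a local file would be lost along with the machine.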