





















































Apache Flink has long followed the philosophy of a unified approach to batch and streaming data processing. Its core building block is the continuous processing of unbounded data streams; with that same continuous processing, users can also perform offline processing of bounded data sets, as the sketch below illustrates.
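The following is a minimal sketch of this idea, assuming a hypothetical class name and a placeholder input path: a bounded input (a text file) is processed with Flink's streaming DataStream API, using the same operators that would be applied to an unbounded source such as a message queue.

```java
// Hedged sketch: bounded data processed with the streaming DataStream API.
// Class name and input path are placeholders, not from the original post.
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class BoundedStreamWordCount {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // A file is a bounded source, yet it flows through the same
        // streaming operators that an unbounded source would.
        DataStream<String> lines = env.readTextFile("/path/to/input.txt");

        lines.flatMap(new Tokenizer())
             .keyBy(0)   // group by the word field of the tuple
             .sum(1)     // running count per word
             .print();

        env.execute("Bounded data on the streaming API");
    }

    /** Splits lines into (word, 1) tuples. */
    public static final class Tokenizer
            implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    out.collect(new Tuple2<>(word, 1));
                }
            }
        }
    }
}
```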
Treating batch as a special case of streaming is an approach supported by projects such as Flink and Beam. It is a powerful way to build data applications that generalize across real-time and offline processing, and it reduces the complexity of data infrastructures.
However, “batch is just a special case of streaming” does not mean that any stream processor is the right tool for batch processing use cases. Pure stream processing systems are slow at batch workloads: a stream processor that shuffles through message queues to analyze large amounts of already-available data is of little use.
Unified APIs such as Apache Beam delegate to different runtimes depending on whether the data is continuous/unbounded or finite/bounded. For example, Google Cloud Dataflow has different implementations of its batch and streaming runtimes to get the desired performance and resilience in each case. Apache Flink has a streaming API that covers both bounded and unbounded use cases, and it also offers a separate DataSet API and runtime stack that is faster for batch use cases (see the sketch below).
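For comparison, here is a hedged sketch of the same word count on Flink's separate DataSet API, the batch-only stack mentioned above. The input path is again a placeholder, and the Tokenizer function from the previous sketch is reused, since both APIs share the same FlatMapFunction interface.

```java
// Sketch of the word count on Flink's separate DataSet (batch) API.
// Path is a placeholder; Tokenizer is the FlatMapFunction defined above.
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class DataSetWordCount {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<String> lines = env.readTextFile("/path/to/input.txt");

        DataSet<Tuple2<String, Integer>> counts = lines
                .flatMap(new BoundedStreamWordCount.Tokenizer())
                .groupBy(0)   // group by word
                .sum(1);      // count per word

        // print() triggers execution of the batch job.
        counts.print();
    }
}
```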
To make Flink’s experience on bounded data (batch) state-of-the-art, a few enhancements are required.
Currently, bounded and unbounded operators have different network and threading models that do not mix and match. In a unified stack, continuous streaming operators are the foundation; when operating on bounded data without latency constraints, the API or the query optimizer can select from a larger set of operators.
When input data is bounded, it is possible to completely buffer data during shuffles and replay that data after a failure. This makes recovery more fine-grained and much more efficient.
A continuous unbounded streaming application needs all of its operators running at the same time. An application on bounded data can instead schedule operations one after another, depending on how the operators consume data, which increases resource efficiency.
Currently, only the Table API activates these optimizations when working on bounded data.
To be competitive with the best batch engines, Flink needs more coverage and better performance in SQL query execution. While Flink’s core data plane is already high-performance, the speed of SQL execution ultimately depends on optimizer rules, a rich set of operators, and features like code generation. A sketch of running SQL over bounded data follows below.
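As a rough illustration of SQL over bounded data (not the Blink planner itself; the table name, field names, and sample data are made up, and the exact Table API entry points vary across Flink versions), a query can be run against a registered bounded table:

```java
// Hedged sketch: a SQL query over a bounded DataSet via the legacy
// BatchTableEnvironment. Names ("words", fields, sample rows) are placeholders.
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.BatchTableEnvironment;
import org.apache.flink.types.Row;

public class BoundedSqlExample {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        BatchTableEnvironment tEnv = BatchTableEnvironment.create(env);

        DataSet<Tuple2<String, Integer>> words = env.fromElements(
                Tuple2.of("flink", 1), Tuple2.of("batch", 1), Tuple2.of("flink", 1));

        // Register the bounded DataSet as a table with named fields.
        tEnv.registerDataSet("words", words, "word, cnt");

        // Because the input is bounded, the planner can apply batch-style rules.
        Table result = tEnv.sqlQuery(
                "SELECT word, SUM(cnt) AS total FROM words GROUP BY word");

        tEnv.toDataSet(result, Row.class).print();
    }
}
```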
Blink’s code is currently available as a branch in the Apache Flink repository. Merging such a big amount of changes while keeping the merge process as non-disruptive as possible is a challenge.
The merge plan focuses on the bounded/batch processing features first and follows an approach designed to ensure a smooth integration.
To know more, check out the official Apache Flink post.