In today's world, real-time information is continuously generated by applications (business, social, or any other type), and this information needs to be routed reliably and quickly to multiple types of receivers. Most of the time, the applications that produce information and the applications that consume it are far apart and inaccessible to each other. This sometimes forces producers or consumers to be redeveloped just to provide an integration point between them. A mechanism is therefore required for the seamless integration of information producers and consumers, one that avoids any rewriting of the application at either end.
In the present era of big data, the first challenge is to collect the data and the second challenge is to analyze it. As the amount of data is huge, the analysis typically covers user behavior data, application and system logs, web activity events, and operational metrics, among much else.
Message publishing is a mechanism for connecting various applications with the help of messages that are routed between them, for example, by a message broker such as Kafka. Kafka addresses a real-time problem common to any software solution: dealing with real-time volumes of information and routing them to multiple consumers quickly. Kafka provides seamless integration between producers and consumers of information without blocking the producers, and without letting producers know who the final consumers are.
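As a quick illustration of this decoupling, here is a minimal sketch that publishes a message to a Kafka topic using the Kafka Java producer client. The broker address (localhost:9092), the topic name (user-activity), and the message contents are assumptions made purely for the example:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ActivityProducer {
    public static void main(String[] args) {
        // Minimal producer configuration; the broker address is an assumption.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // The producer only knows the topic; it neither blocks on nor
            // knows about the consumers that will eventually read the message.
            producer.send(new ProducerRecord<>("user-activity",
                                               "user-42", "page-visit:/home"));
        }
    }
}

Notice that nothing in this code identifies a receiver; the broker takes over routing once the message is published.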
Apache Kafka is an open source, distributed publish-subscribe messaging system, mainly designed for persistent messaging, high throughput, distribution over commodity machines, support for multiple clients, and real-time consumption.
Kafka provides a real-time publish-subscribe solution that overcomes the challenges of consuming real-time data volumes, which may grow in order of magnitude to be larger than the real data. Kafka also supports parallel data loading into Hadoop systems.
The following diagram shows a typical big data aggregation-and-analysis scenario supported by the Apache Kafka messaging system:
At the production side, there are different kinds of producers, such as frontend applications, services, proxies, and adapters emitting logs and events. At the consumption side, there are different kinds of consumers, such as offline consumers that load the data into Hadoop or a data warehouse for later analysis, and real-time or near real-time consumers that process the data as it arrives (a minimal consumer sketch follows).
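On the consumption side, a minimal consumer sketch using the Kafka Java client might look as follows; the broker address, topic name, and group ID are illustrative assumptions matching the producer example above:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ActivityConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        props.put("group.id", "activity-analytics");      // assumed group name
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user-activity"));
            while (true) {
                // Fetch whatever messages have arrived since the last poll.
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}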
A large amount of data is generated by companies that have any form of web-based presence and activity. Data is one of the newer ingredients in these Internet-based systems and typically includes user-activity events corresponding to logins, page visits, clicks, and social networking activities such as likes, shares, and comments, as well as operational and system metrics. Because of its high throughput (millions of messages per second), this data is typically handled by logging systems and traditional log aggregation solutions. These traditional solutions are viable for providing logging data to an offline analysis system such as Hadoop, but they are very limiting for building real-time processing systems.
According to the new trends in Internet applications, activity data has become a part of production data and is used to run analytics in real time. These analytics can include relevance-based search, recommendations driven by popularity or co-occurrence, protection against spam or unauthorized data scraping, and detection of abnormal user behavior.
Real-time usage of these multiple sets of data collected from production systems has become a challenge because of the volume of data collected and processed.
Apache Kafka aims to unify offline and online processing by providing a mechanism for parallel loading into Hadoop systems as well as the ability to partition real-time consumption over a cluster of machines. Kafka can be compared with Scribe or Flume, as it is useful for processing activity stream data; but from an architecture perspective, it is closer to traditional messaging systems such as ActiveMQ or RabbitMQ.
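The partitioned consumption mentioned above comes from Kafka's consumer-group model: when several consumer processes subscribe with the same group ID, each partition of a topic is assigned to exactly one of them, spreading the load over a cluster of machines. The following sketch, which reuses the same assumed broker, topic, and group as the earlier examples, prints the partitions each running instance receives whenever the group rebalances:

import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class PartitionedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker
        props.put("group.id", "activity-analytics");       // same group on every machine
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        // Every instance started with the same group.id receives a disjoint
        // subset of the topic's partitions; the listener reports the split.
        consumer.subscribe(Collections.singletonList("user-activity"),
                new ConsumerRebalanceListener() {
                    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                        // Partitions being handed off during a rebalance.
                    }
                    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                        System.out.println("Now consuming: " + partitions);
                    }
                });
        while (true) {
            consumer.poll(Duration.ofMillis(500)); // message processing elided
        }
    }
}

Starting a second instance of this program on another machine triggers a rebalance, after which each instance consumes only its own share of the partitions; if an instance fails, its partitions are reassigned to the survivors.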
Some of the companies that are using Apache Kafka in their respective use cases are listed at https://cwiki.apache.org/confluence/display/KAFKA/Powered+By.
In this article, we have seen how companies are evolving their mechanisms for collecting and processing application-generated data, and for harnessing the real power of this data by running analytics over it.