Batch processing versus real-time processing
Before we dive deep into different data ingestion techniques, let's discuss the difference between batch and real-time (stream) processing. The following sections describe how the two approaches differ.
Batch processing
The following points describe the batch processing system:
- Very efficient at processing high volumes of data.
- All data processing steps (that is, data collection, data ingestion, data processing, and results presentation) are done as one single batch job.
- Throughput matters more than latency. Latency is typically measured in minutes or longer.
- Throughput depends directly on the size of the data and the available computational resources.
- Available tools include Apache Sqoop, MapReduce jobs, Spark jobs, Hadoop DistCp utility, and so on.
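To make the batch model concrete, here is a minimal sketch in plain Python, not tied to any of the tools listed above. It runs collection, ingestion, processing, and results presentation as one job and reports throughput as records per second. The function names (collect, ingest, process, run_batch_job) and the sample records are hypothetical and chosen only for illustration.

```python
import time


def collect():
    """Data collection: a hypothetical source of (user, amount) records."""
    return [("alice", 10), ("bob", 5), ("alice", 7), ("carol", 3)]


def ingest(records):
    """Data ingestion: keep only well-formed records (non-negative amounts)."""
    return [(user, amount) for user, amount in records if amount >= 0]


def process(records):
    """Data processing: total amount per user."""
    totals = {}
    for user, amount in records:
        totals[user] = totals.get(user, 0) + amount
    return totals


def run_batch_job():
    """Run every step as one job and measure throughput for the whole batch."""
    start = time.perf_counter()
    records = collect()
    totals = process(ingest(records))
    elapsed = time.perf_counter() - start
    # Throughput is the number of records handled per unit of time.
    throughput = len(records) / elapsed if elapsed > 0 else float("inf")
    return totals, throughput


# Results presentation: available only after the whole batch has finished.
totals, throughput = run_batch_job()
print(totals)
print(f"{throughput:.0f} records/second")
```

Note that no result is visible until the entire batch completes, which is why this style of processing is judged by throughput rather than by per-record latency.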
Real-time processing
The following points describe how real-time processing is different from batch processing:
- Latency is extremely important; for example, results may be expected in less than one second.
- Computation is relatively...