How to build robust ETL pipelines with AWS SQS
Scraping a large quantity of sites and data can be a complicated and slow process. But it is one that can take great advantage of parallel processing, either locally with multiple processor threads, or distributing scraping requests to report scrapers using a message queue system. There may also be the need for multiple steps in a process similar to an Extract, Transform, and Load pipeline (ETL). These pipelines can also be easily built using a message queuing architecture in conjunction with the scraping.
Using a message queuing architecture gives our pipeline two advantages:
- Robustness
- Scalability
The processing becomes robust, as if processing of an individual message fails, then the message can be re-queued for processing again. So if the scraper fails, we can restart it and not lose the request for scraping the page, or the message queue system will deliver the request to another scraper.
It provides scalability, as multiple scrapers on the...