Chapter 8. Creating Datasources from Redshift
In this chapter, we will use the power of SQL queries to address non-linear datasets. Creating datasources in Redshift or RDS gives us the potential for upstream SQL-based feature engineering prior to the datasource creation. We implemented a similar approach in Chapter 4, Loading and Preparing the Dataset, by leveraging the new AWS Athena service to apply preliminary transformations on the data before creating the datasource. This enabled us to expand the Titanic dataset by creating new features such as the Deck number, by replacing the Fare with its log, and by replacing missing values for the Age variable. The SQL transformations were simple, but allowed us to expand the original dataset in a very flexible way. The AWS Athena service is S3 based. It allows us to run SQL queries on datasets hosted on S3 and write the results to S3 buckets. We were still creating Amazon ML datasources from S3, but simply adding an extra data preprocessing layer to massage...
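As a reminder of what this kind of upstream SQL feature engineering looks like, here is a minimal sketch of the three transformations mentioned above. It is not the query from Chapter 4: it runs against an in-memory SQLite database rather than Athena so that it is self-contained, and the table name, the column names, the sample rows, the use of the first letter of the cabin as the deck, the `fare + 1` offset inside the log, and the mean fill for missing ages are all assumptions made for illustration.

```python
import math
import sqlite3

# Hypothetical miniature of the Titanic table; names and values are assumptions.
rows = [
    (1, "C85", 71.2833, 38.0),
    (2, None, 7.25, None),   # missing cabin and missing age
    (3, "E46", 51.8625, 54.0),
]

conn = sqlite3.connect(":memory:")
# SQLite builds do not always ship math functions, so register ln() explicitly.
conn.create_function("ln", 1, lambda x: None if x is None else math.log(x))
conn.execute(
    "CREATE TABLE titanic (passenger_id INTEGER, cabin TEXT, fare REAL, age REAL)"
)
conn.executemany("INSERT INTO titanic VALUES (?, ?, ?, ?)", rows)

QUERY = """
SELECT
    passenger_id,
    substr(cabin, 1, 1)                           AS deck,      -- assumed: deck is the cabin's first letter
    ln(fare + 1)                                  AS log_fare,  -- +1 guards against a zero fare
    COALESCE(age, (SELECT AVG(age) FROM titanic)) AS age        -- assumed: fill missing ages with the mean
FROM titanic
ORDER BY passenger_id
"""

transformed = conn.execute(QUERY).fetchall()
for row in transformed:
    print(row)
```

Each output row carries the derived deck, the log-transformed fare, and an age with no missing values. The same SELECT pattern is what we would want to run upstream of the datasource, whichever SQL engine hosts the data.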