
Introduction to Titanic Datasets


In this article by Alexis Perrier, author of the book Effective Amazon Machine Learning, we see how artificial intelligence and big data have become a ubiquitous part of our everyday lives, and how cloud-based machine learning services have become part of a rising billion-dollar industry. Among the several such services currently available on the market, Amazon Machine Learning stands out for its simplicity. Amazon Machine Learning was launched in April 2015 with the clear goal of lowering the barrier to predictive analytics by offering a service accessible to companies without the need for highly skilled technical resources.


Working with datasets

You cannot do predictive analytics without a dataset. Although we are surrounded by data, finding datasets that are adapted to predictive analytics is not always straightforward. In this section, we present some resources that are freely available. The Titanic dataset is a classic introductory dataset for predictive analytics.

Finding open datasets

There is a multitude of dataset repositories available online, from local and global public institutions to non-profits and data-focused start-ups. Here’s a small list of open dataset resources that are well suited for predictive analytics. This is by no means an exhaustive list.

This thread on Quora points to many other interesting data sources: https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public. You can also ask for specific datasets on Reddit at https://www.reddit.com/r/datasets/.

  • The UCI Machine Learning Repository is a collection of datasets maintained by UC Irvine since 1987, hosting over 300 datasets related to classification, clustering, regression, and other ML tasks
  • Mldata.org from the University of Berlin, the Stanford Large Network Dataset Collection, and other major universities also offer great collections of open datasets
  • Kdnuggets.com has an extensive list of open datasets at http://www.kdnuggets.com/datasets
  • Data.gov and other US government agencies; data.UN.org and other UN agencies
  • AWS offers open datasets via partners at https://aws.amazon.com/government-education/open-data/.

The following startups are data centered and give open access to rich data repositories:

  • Quandl and Quantopian for financial datasets
  • Datahub.io, Enigma.com, and Data.world are dataset-sharing sites
  • Datamarket.com is great for time series datasets
  • Kaggle.com, the data science competition website, hosts over 100 very interesting datasets

AWS public datasets: AWS hosts a variety of public datasets, such as the Million Song Dataset, the mapping of the Human Genome, and the US Census data, as well as many others in Astronomy, Biology, Math, Economics, and so on. These datasets are mostly available via EBS snapshots, although some are directly accessible on S3. The datasets are large, from a few gigabytes to several terabytes, and are not meant to be downloaded to your local machine; they are meant to be accessed via an EC2 instance (take a look at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-public-data-sets.html for further details). AWS public datasets are accessible at https://aws.amazon.com/public-datasets/.

Introducing the Titanic dataset

We will use the classic Titanic dataset. The data consists of demographic and traveling information for 1,309 of the Titanic passengers, and the goal is to predict the survival of these passengers. The full Titanic dataset is available from the Department of Biostatistics at the Vanderbilt University School of Medicine (http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv) in several formats. The Encyclopedia Titanica website (https://www.encyclopedia-titanica.org/) is the website of reference regarding the Titanic. It contains all the facts, history, and data surrounding the Titanic, including a full list of passengers and crew members. The Titanic dataset is also the subject of the introductory competition on Kaggle.com (https://www.kaggle.com/c/titanic, requires opening an account with Kaggle). You can also find a csv version in the GitHub repository at https://github.com/alexperrier/packt-aml/blob/master/ch4.

The Titanic data contains a mix of textual, Boolean, continuous, and categorical variables. It exhibits interesting characteristics such as missing values, outliers, and text variables ripe for text mining: a rich dataset that will allow us to demonstrate data transformations.
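Before going further, it can help to load the csv and confirm these characteristics for yourself. The following is a minimal sketch using pandas; the file name assumes you downloaded titanic3.csv from the Vanderbilt link above.

# Minimal sketch: load the Titanic csv with pandas and inspect it
import pandas as pd

titanic = pd.read_csv("titanic3.csv")

print(titanic.shape)           # number of rows and columns
print(titanic.dtypes)          # mix of numeric and text (object) columns
print(titanic.isnull().sum())  # missing values, notably in age, cabin, boat, and body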

Here’s a brief summary of the 14 attributes:

  • pclass: Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)
  • survival: A Boolean indicating whether the passenger survived or not (0 = No; 1 = Yes); this is our target
  • name: A field rich in information as it contains title and family names
  • sex: male/female
  • age: Age, a significant portion of values are missing
  • sibsp: Number of siblings/spouses aboard
  • parch: Number of parents/children aboard
  • ticket: Ticket number
  • fare: Passenger fare (British pounds)
  • cabin: Cabin number; does the location of the cabin influence chances of survival?
  • embarked: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
  • boat: Lifeboat, many missing values
  • body: Body Identification Number
  • home.dest: Home/destination

Take a look at http://campus.lakeforest.edu/frank/FILES/MLFfiles/Bio150/Titanic/TitanicMETA.pdf for more details on these variables.

We have 1,309 records and 14 attributes, three of which we will discard. The home.dest attribute has too few existing values, the boat attribute is only present for passengers who survived, and the body attribute is only present for passengers who did not survive. We will discard these three columns later on, when we work with the data schema.

Preparing the data

Now that we have the initial raw dataset, we are going to shuffle it, split it into a training and a held-out subset, and load it into an S3 bucket.

Splitting the data

In order to build and select the best model, we need to split the dataset into three parts: training, validation, and test, with the usual ratios being 60%, 20%, and 20%. The training and validation sets are used to build several models and select the best one, while the test, or held-out, set is used for the final performance evaluation on previously unseen data.

Since Amazon ML does the job of splitting the dataset used for model training and model evaluation into training and validation subsets, we only need to split our initial dataset into two parts: the global training/evaluation subset (80%) for model building and selection, and the held-out subset (20%) for predictions and final model performance evaluation.

Shuffle before you split: If you download the original data from the Vanderbilt University website, you will notice that it is ordered by pclass, the class of the passenger, and by alphabetical order of the name column. The first 323 rows correspond to 1st class passengers, followed by 2nd class (277) and 3rd class (709) passengers. It is important to shuffle the data before you split it so that all the different variables have similar distributions in both the training and held-out subsets. You can shuffle the data directly in the spreadsheet by creating a new column, generating a random number for each row, and then ordering by that column.

On GitHub: You will find an already shuffled titanic.csv file at https://github.com/alexperrier/packt-aml/blob/master/ch4/titanic.csv. In addition to shuffling the data, we have removed punctuation in the name column: commas, quotes, and parentheses, which can cause confusion when parsing a csv file.

We end up with two files: titanic_train.csv with 1047 rows and titanic_heldout.csv with 263 rows. These files are also available in the GitHub repo (https://github.com/alexperrier/packt-aml/blob/master/ch4). The next step is to upload these files on S3 so that Amazon ML can access them.
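If you prefer to shuffle and split programmatically rather than in a spreadsheet, here is a minimal sketch with pandas; the random seed and the output file names are our own illustrative choices.

# Sketch: shuffle the full dataset and split it 80/20 into training and held-out files
import pandas as pd

df = pd.read_csv("titanic.csv")

# Shuffle all rows so the original ordering by pclass and name disappears
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# 80% for model building and selection, 20% held out for final evaluation
split = int(len(df) * 0.8)
df.iloc[:split].to_csv("titanic_train.csv", index=False)
df.iloc[split:].to_csv("titanic_heldout.csv", index=False)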

Loading data on S3

AWS S3 is one of the main AWS services dedicated to hosting files and managing their access. Files in S3 can be public and open to the internet or have access restricted to specific users, roles, or services. S3 is also used extensively by AWS for operations such as storing log files or results (predictions, scripts, queries, and so on).

Files in S3 are organized around the notion of buckets. Buckets are placeholders with unique names similar to domain names for websites. A file in S3 will have a unique locator URI: s3://bucket_name/{path_of_folders}/filename. The bucket name is unique across S3. In this section, we will create a bucket for our data, upload the titanic training file, and open its access to Amazon ML.

Go to https://console.aws.amazon.com/s3/home, and open an S3 account if you don’t have one yet.


S3 pricing: S3 charges for the total volume of files you host and for the volume of file transfers, and prices depend on the region where the files are hosted. At the time of writing, for less than 1 TB, AWS S3 charges $0.03/GB per month in the US East region. All S3 prices are available at https://aws.amazon.com/s3/pricing/. See also http://calculator.s3.amazonaws.com/index.html for the AWS cost calculator.

Creating a bucket

Once you have created your S3 account, the next step is to create a bucket for your files. Click on the Create bucket button:

introduction-titanic-datasets-img-0

  • Choose a name and a region. Since bucket names are unique across S3, you must choose a name for your bucket that has not already been taken. We chose the name aml.packt for our bucket, and we will use this bucket throughout. Regarding the region, you should always select a region that is the closest to the person or application accessing the files, in order to reduce latency and prices.
  • Set Versioning, Logging, and Tags. Versioning keeps a copy of every version of your files, which protects against accidental deletions. Since versioning and logging induce extra costs, we chose to disable them.
  • Set permissions.
  • Review and save.

introduction-titanic-datasets-img-1

Loading the data

To upload the data, simply click on the Upload button and select the titanic_train.csv file we created earlier on. You should, at this point, have the training dataset uploaded to your AWS S3 bucket. We added a /data folder in our aml.packt bucket to compartmentalize our objects. This will be useful later on, when the bucket also contains folders created by Amazon ML.
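Alternatively, the upload can be scripted. The following is a minimal sketch with boto3, the AWS SDK for Python; the bucket name and the data/ prefix mirror the layout described above, and your AWS credentials are assumed to be already configured.

# Sketch: upload the training file to S3 with boto3
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="titanic_train.csv",
    Bucket="aml.packt",            # replace with your own bucket name
    Key="data/titanic_train.csv",  # the /data prefix used in this article
)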

At this point, only the owner of the bucket (you) is able to access and modify its contents. We need to grant the Amazon ML service permissions to read the data and add other files to the bucket. When creating the Amazon ML datasource, we will be prompted to grant these permissions in the Amazon ML console. We can also modify the bucket’s policy upfront.

Granting permissions

We need to edit the policy of the aml.packt bucket. To do so, we have to perform the following steps:

Click into your bucket.

Select the Permissions tab.

In the drop down, select Bucket Policy as shown in the following screenshot. This will open an editor:

introduction-titanic-datasets-img-2

Paste in the following JSON. Make sure to replace {YOUR_BUCKET_NAME} with the name of your bucket and save:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AmazonML_s3:ListBucket",
      "Effect": "Allow",
      "Principal": {
        "Service": "machinelearning.amazonaws.com"
      },
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::{YOUR_BUCKET_NAME}",
      "Condition": {
        "StringLike": {
          "s3:prefix": "*"
        }
      }
    },
    {
      "Sid": "AmazonML_s3:GetObject",
      "Effect": "Allow",
      "Principal": {
        "Service": "machinelearning.amazonaws.com"
      },
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::{YOUR_BUCKET_NAME}/*"
    },
    {
      "Sid": "AmazonML_s3:PutObject",
      "Effect": "Allow",
      "Principal": {
        "Service": "machinelearning.amazonaws.com"
      },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::{YOUR_BUCKET_NAME}/*"
    }
  ]
}

Further details on this policy are available at http://docs.aws.amazon.com/machine-learning/latest/dg/granting-amazon-ml-permissions-to-read-your-data-from-amazon-s3.html. Once again, this step is optional since Amazon ML will prompt you for access to the bucket when you create the datasource.
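If you prefer to set the policy from code rather than from the console, here is a minimal sketch using boto3; it assumes the JSON above has been saved to a local policy.json file with {YOUR_BUCKET_NAME} already replaced.

# Sketch: apply the bucket policy with boto3 instead of the S3 console
import boto3

with open("policy.json") as f:
    policy_document = f.read()

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket="aml.packt", Policy=policy_document)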

Formatting the data

Amazon ML works on comma-separated values files (.csv), a very simple format where each row is an observation and each column is a variable or attribute. There are, however, a few conditions that should be met (a short sketch illustrating them follows the list):

  • The data must be encoded in plain text using a character set such as ASCII, Unicode, or EBCDIC
  • All values must be separated by commas; if a value contains a comma, it should be enclosed by double quotes
  • Each observation (row) must be smaller than 100 KB
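As a quick illustration of the separator and quoting rules, here is a minimal sketch using Python's csv module; the rows are made-up examples.

# Sketch: write a csv where values containing commas are wrapped in double quotes
import csv

rows = [
    ["name", "sex", "age"],
    ["Allen, Miss. Elisabeth Walton", "female", 29],  # the comma forces double quotes
    ["Anderson, Mr. Harry", "male", 48],
]

with open("example.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
    writer.writerows(rows)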

There are also conditions regarding the end-of-line characters that separate rows. Special care must be taken when using Excel on OS X (Mac), as explained on this page:

http://docs.aws.amazon.com/machine-learning/latest/dg/understanding-the-data-format-for-amazon-ml.html

What about other data file formats?

Unfortunately, Amazon ML datasources are only compatible with csv files and Redshift databases, and do not accept formats such as JSON, TSV, or XML. However, other services such as Athena, a serverless database service, do accept a wider range of formats.
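If your data starts out in one of these other formats, one pragmatic option is to convert it to csv before creating the datasource. Here is a minimal sketch with pandas, using illustrative file names.

# Sketch: convert TSV or JSON data to csv for Amazon ML
import pandas as pd

# Tab-separated values to csv
pd.read_csv("data.tsv", sep="\t").to_csv("data.csv", index=False)

# JSON records (a list of objects) to csv
pd.read_json("data.json", orient="records").to_csv("data_from_json.csv", index=False)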

Summary

In this article, we learned how to find open datasets and work with the Titanic dataset, how to prepare and split the data, and how to load it into Amazon S3 for use with Amazon Machine Learning.
