Getting Spark running on Amazon EC2
The Spark project provides scripts to run a Spark cluster in the cloud on Amazon's EC2 service. These scripts are located in the ec2 directory. You can run the spark-ec2 script contained in this directory with the following command:
$ ./ec2/spark-ec2
Running it this way, without any arguments, will show the help output:
Usage: spark-ec2 [options] <action> <cluster_name>
<action> can be: launch, destroy, login, stop, start, get-master
Options:
...
Before creating a Spark EC2 cluster, you will need to ensure that you have an Amazon account.
Note
If you don't have an Amazon Web Services account, you can sign up at http://aws.amazon.com/. The AWS console is available at http://aws.amazon.com/console/.
You will also need to create an Amazon EC2 key pair and retrieve the relevant security credentials. The Spark documentation for EC2 (available at http://spark.apache.org/docs/latest/ec2-scripts.html) explains the requirements:
Create an Amazon EC2 key pair for yourself. This can be done by logging into your Amazon Web Services account through the AWS console, clicking on Key Pairs on the left sidebar, and creating and downloading a key. Make sure that you set the permissions for the private key file to 600 (that is, only you can read and write it) so that ssh will work.
Whenever you want to use the spark-ec2 script, set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to your Amazon EC2 access key ID and secret access key, respectively. These can be obtained from the AWS homepage by clicking Account | Security Credentials | Access Credentials.
When creating a key pair, choose a name that is easy to remember. We will simply use the name spark for the key pair; the key pair file itself will then be called spark.pem. As mentioned earlier, ensure that the key pair file permissions are set appropriately and that the environment variables for the AWS credentials are exported using the following commands:
$ chmod 600 spark.pem
$ export AWS_ACCESS_KEY_ID="..."
$ export AWS_SECRET_ACCESS_KEY="..."
You should also keep your downloaded key pair file safe and not lose it, as it can only be downloaded once, at the time it is created!
Note that launching an Amazon EC2 cluster in the following section will incur costs to your AWS account.
Launching an EC2 Spark cluster
We're now ready to launch a small Spark cluster by changing into the ec2 directory and then running the cluster launch command:

$ cd ec2
$ ./spark-ec2 --key-pair=spark --identity-file=spark.pem --region=us-east-1 --zone=us-east-1a launch my-spark-cluster
This will launch a new Spark cluster called my-spark-cluster with one master and one slave node of instance type m1.large. This cluster will be launched with a Spark version built for Hadoop 2. The key pair name we used is spark, and the key pair file is spark.pem (if you gave the files different names or have an existing AWS key pair, use that name instead).
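By default, the script launches a single slave. If you want a larger cluster, a different instance type, or a specific Spark version, the script's launch options can be used. The following is a minimal sketch; the exact flags supported can vary between Spark releases, so check ./spark-ec2 --help for your version:

# Launch two slaves of a chosen instance type with a specific Spark version
$ ./spark-ec2 --key-pair=spark --identity-file=spark.pem \
  --region=us-east-1 --zone=us-east-1a \
  --slaves=2 --instance-type=m3.medium --spark-version=1.6.0 \
  launch my-spark-cluster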
It might take quite a while for the cluster to fully launch and initialize. You should see something like the following immediately after running the launch command:
Setting up security groups...
Creating security group my-spark-cluster-master
Creating security group my-spark-cluster-slaves
Searching for existing cluster my-spark-cluster in region us-east-1...
Spark AMI: ami-5bb18832
Launching instances...
Launched 1 slave in us-east-1a, regid = r-5a893af2
Launched master in us-east-1a, regid = r-39883b91
Waiting for AWS to propagate instance metadata...
Waiting for cluster to enter 'ssh-ready' state...........
Warning: SSH connection error. (This could be temporary.)
Host: ec2-52-90-110-128.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-52-90-110-128.compute-1.amazonaws.com port 22: Connection refused
Warning: SSH connection error. (This could be temporary.)
Host: ec2-52-90-110-128.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-52-90-110-128.compute-1.amazonaws.com port 22: Connection refused
Warning: SSH connection error. (This could be temporary.)
Host: ec2-52-90-110-128.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-52-90-110-128.compute-1.amazonaws.com port 22: Connection refused
Cluster is now in 'ssh-ready' state. Waited 510 seconds.
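The SSH connection warnings in the preceding output are normal while the new instances boot, and the script retries automatically. If the launch does fail partway through setup, you often don't need to destroy everything and start over: the script has a resume mechanism for an already-launched cluster. The following is a sketch, assuming your version of the script supports the --resume flag (check ./spark-ec2 --help):

# Re-run the setup steps on the existing, partially set-up instances
$ ./spark-ec2 --key-pair=spark --identity-file=spark.pem \
  --region=us-east-1 --zone=us-east-1a \
  launch my-spark-cluster --resume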
If the cluster has launched successfully, you should eventually see a console output similar to the following listing:
./tachyon/setup.sh: line 5: /root/tachyon/bin/tachyon: No such file or directory
./tachyon/setup.sh: line 9: /root/tachyon/bin/tachyon-start.sh: No such file or directory
[timing] tachyon setup: 00h 00m 01s
Setting up rstudio
spark-ec2/setup.sh: line 110: ./rstudio/setup.sh: No such file or directory
[timing] rstudio setup: 00h 00m 00s
Setting up ganglia
RSYNC'ing /etc/ganglia to slaves...
ec2-52-91-214-206.compute-1.amazonaws.com
Shutting down GANGLIA gmond: [FAILED]
Starting GANGLIA gmond: [ OK ]
Shutting down GANGLIA gmond: [FAILED]
Starting GANGLIA gmond: [ OK ]
Connection to ec2-52-91-214-206.compute-1.amazonaws.com closed.
Shutting down GANGLIA gmetad: [FAILED]
Starting GANGLIA gmetad: [ OK ]
Stopping httpd: [FAILED]
Starting httpd: httpd: Syntax error on line 154 of /etc/httpd/conf/httpd.conf: Cannot load /etc/httpd/modules/mod_authz_core.so into server: /etc/httpd/modules/mod_authz_core.so: cannot open shared object file: No such file or directory [FAILED]
[timing] ganglia setup: 00h 00m 03s
Connection to ec2-52-90-110-128.compute-1.amazonaws.com closed.
Spark standalone cluster started at http://ec2-52-90-110-128.compute-1.amazonaws.com:8080
Ganglia started at http://ec2-52-90-110-128.compute-1.amazonaws.com:5080/ganglia
Done!
ubuntu@ubuntu:~/work/spark-1.6.0-bin-hadoop2.6/ec2$
This will create two VMs, a Spark master and a Spark slave, of instance type m1.large, as shown in the following screenshot:

To test whether we can connect to our new cluster, we can run the following command:
$ ssh -i spark.pem root@ec2-52-90-110-128.compute-1.amazonaws.com
Remember to replace the public domain name of the master node (the address after root@ in the preceding command) with the correct Amazon EC2 public domain name shown in your console output after launching the cluster.
You can also retrieve your cluster's master public domain name by running this line of code:
$ ./spark-ec2 -i spark.pem get-master my-spark-cluster
After successfully running the ssh command, you will be connected to your Spark master node in EC2, and your terminal output should match the following screenshot:

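Alternatively, the login action listed in the script's help output earlier combines the lookup and the SSH connection into one step, so you don't need to copy the master's domain name at all:

# Look up the master's address and open an SSH session in one step
$ ./spark-ec2 -k spark -i spark.pem login my-spark-cluster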
We can test whether our cluster is correctly set up with Spark by changing into the spark directory and running an example in the local mode:

$ cd spark
$ MASTER=local[2] ./bin/run-example SparkPi
You should see output similar to what you would get by running the same command on your local computer:
...
14/01/30 20:20:21 INFO SparkContext: Job finished: reduce at SparkPi.scala:35, took 0.864044012 s
Pi is roughly 3.14032
...
Now that we have an actual cluster with multiple nodes, we can test Spark in the cluster mode. We can run the same example on the cluster, using our one slave node by passing in the master URL instead of the local version:
$ MASTER=spark://ec2-52-90-110-128.compute-1.amazonaws.com:7077 ./bin/run-example SparkPi
Note
Note that you will need to replace the preceding master domain name with the correct domain name for your specific cluster.
Again, the output should be similar to running the example locally; however, the log messages will show that your driver program has connected to the Spark master:
...
14/01/30 20:26:17 INFO client.Client$ClientActor: Connecting to master spark://ec2-54-220-189-136.eu-west-1.compute.amazonaws.com:7077
14/01/30 20:26:17 INFO cluster.SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140130202617-0001
14/01/30 20:26:17 INFO client.Client$ClientActor: Executor added: app-20140130202617-0001/0 on worker-20140130201049-ip-10-34-137-45.eu-west-1.compute.internal-57119 (ip-10-34-137-45.eu-west-1.compute.internal:57119) with 1 cores
14/01/30 20:26:17 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20140130202617-0001/0 on hostPort ip-10-34-137-45.eu-west-1.compute.internal:57119 with 1 cores, 2.4 GB RAM
14/01/30 20:26:17 INFO client.Client$ClientActor: Executor updated: app-20140130202617-0001/0 is now RUNNING
14/01/30 20:26:18 INFO spark.SparkContext: Starting job: reduce at SparkPi.scala:39
...
Feel free to experiment with your cluster. Try out the interactive console in Scala, for example:
$ ./bin/spark-shell --master spark://ec2-52-90-110-128.compute-1.amazonaws.com:7077
Once you've finished, type exit to leave the console. You can also try the PySpark console by running the following command:

$ ./bin/pyspark --master spark://ec2-52-90-110-128.compute-1.amazonaws.com:7077
You can use the Spark Master web interface to see the applications registered with the master. To load the Master web UI, navigate to ec2-52-90-110-128.compute-1.amazonaws.com:8080 (again, remember to replace this domain name with your own master domain name).
Remember that you will be charged by Amazon for usage of the cluster. Don't forget to stop or terminate this test cluster once you're done with it. To do this, first exit the ssh session by typing exit to return to your own local system, and then run the following command:

$ ./ec2/spark-ec2 -k spark -i spark.pem destroy my-spark-cluster
You should see the following output:
Are you sure you want to destroy the cluster my-spark-cluster?
The following instances will be terminated:
Searching for existing cluster my-spark-cluster...
Found 1 master(s), 1 slaves
> ec2-54-227-127-14.compute-1.amazonaws.com
> ec2-54-91-61-225.compute-1.amazonaws.com
ALL DATA ON ALL NODES WILL BE LOST!!
Destroy cluster my-spark-cluster (y/N): y
Terminating master...
Terminating slaves...
Type y and then press Enter to destroy the cluster.
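If you expect to return to the cluster later, the stop and start actions listed in the help output earlier shut down and restart the instances without terminating them. Note that stopped instances can still incur EBS storage charges, and anything on ephemeral disks is lost; treat the following as a sketch and verify the behavior for your setup:

# Stop the cluster's instances without terminating them
$ ./ec2/spark-ec2 -k spark -i spark.pem stop my-spark-cluster

# Later, bring the same cluster back up
$ ./ec2/spark-ec2 -k spark -i spark.pem start my-spark-cluster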
Congratulations! You've just set up a Spark cluster in the cloud, run a fully parallel example program on this cluster, and terminated it. If you would like to try out any of the example code in the subsequent chapters (or your own Spark programs) on a cluster, feel free to experiment with the Spark EC2 scripts and launch a cluster of your chosen size and instance profile. (Just be mindful of the costs and remember to shut it down when you're done!)