34. Introducing Amazon Elastic MapReduce
RUNNING ON A CLUSTER
Distributed Spark
This is the layout of a Spark job:

Spark Driver
    ↓
Cluster Manager
    ↓
Cluster Workers / Executors (one set per worker node, running in parallel)
Other Spark-Submit Parameters
--master
yarn - for running on a YARN / Hadoop cluster
spark://hostname:port - for connecting to a master on a Spark standalone cluster
mesos://masternode:port - for connecting to a Mesos master
A master set in your SparkConf will override this!
--num-executors
You must set this explicitly with YARN; the default is only 2
--executor-memory
Make sure you don't try to use more memory than you have
--total-executor-cores
Caps the total number of cores used across all executors (standalone and Mesos only); see the example command below
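As a rough sketch, here is how these flags might fit together; the script name and the resource numbers are placeholder values, not ones from the course:

# Running on a YARN cluster (e.g., EMR); script name and sizes are examples
spark-submit \
    --master yarn \
    --num-executors 4 \
    --executor-memory 1g \
    MovieRecommendationsALS.py

# On a standalone cluster, point --master at the Spark master and cap the cores
spark-submit \
    --master spark://hostname:7077 \
    --total-executor-cores 8 \
    MovieRecommendationsALS.py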
Amazon Elastic MapReduce
A quick way to create a cluster with Spark, Hadoop, and YARN pre-installed
You pay per instance-hour, plus network and storage I/O
Let's run our one-million-ratings movie recommender on a cluster
Let's Use Elastic MapReduce
Very quick and easy way to rent time on a cluster of your own
Sets up a default Spark configuration for you on top of Hadoop's YARN cluster manager
Buzzword alert! We're using Hadoop! Well, a part of it anyhow.
Spark also has a built-in standalone cluster manager, and scripts to set up its own EC2-based cluster.
But the AWS console is even easier.
Spark on EMR isn't really expensive, but it's not cheap either.
Unlike the earlier MapReduce / MRJob exercises, you'll be using m3.xlarge instances.
I racked up about $30 running Spark jobs over a few hours preparing this course.
You also have to remember to shut down your clusters when you're done, or else...
So you might just want to watch, and not follow along.
Make sure things run locally on a subset of your data first.
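If you'd rather script the cluster than click through the console, the AWS CLI can create a comparable cluster. A minimal sketch, assuming the AWS CLI is installed and configured; the cluster name, key pair, release label, and cluster ID below are example values:

# Create an EMR cluster with Spark pre-installed (all values are examples)
aws emr create-cluster \
    --name "SparkCluster" \
    --release-label emr-5.30.0 \
    --applications Name=Spark \
    --ec2-attributes KeyName=mykeypair \
    --instance-type m3.xlarge \
    --instance-count 4 \
    --use-default-roles

# Remember to shut it down when you're done (cluster ID is a placeholder)
aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXX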
Getting Set Up On EMR
Make an Amazon Web Services account
Create an EC2 key pair and download the .pem file
On Windows, you'll need a terminal like PuTTY
For PuTTY, you'll need to convert the .pem to a .ppk private key file
I'll walk you through this now.
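For reference, connecting looks roughly like this; the key file name and master hostname are placeholders, and EMR's master node accepts the user name hadoop:

# Linux/macOS: lock down the key's permissions and ssh in with the .pem directly
chmod 400 mykeypair.pem
ssh -i mykeypair.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

# Windows: load the .pem in PuTTYgen and choose "Save private key" to get a .ppk,
# or use the command-line puttygen where available
puttygen mykeypair.pem -O private -o mykeypair.ppk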