Apache 2.0 Spark with Scala

35. Creating Similar Movies from One Million Ratings on EMR


Last updated 6 years ago

SETTING UP AND RUNNING FROM YOUR AMAZON EMR CLUSTER

Setting Up an Amazon EMR Cluster

  • Download the MovieLens 1M Dataset, which contains one million ratings, and copy the files into the Amazon S3 service

  • You can create a new Bucket and copy your files into that Bucket

  • The Bucket on Amazon S3 should contain MovieSimilarities1M.jar (for my setup the JAR name is MovieSimilarities1M-assembly-1.0.jar), which is the self-contained Spark driver script bundled up into a JAR file
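
  • Such an assembly JAR is produced with sbt and the sbt-assembly plugin. A minimal sketch of what the build file might look like — the exact names and versions here are assumptions, chosen to match the JAR name above:

```scala
// build.sbt -- hypothetical sketch; adjust names and versions to your setup.
// With the sbt-assembly plugin enabled, running `sbt assembly` bundles the
// driver and its dependencies into
// target/scala-2.10/MovieSimilarities1M-assembly-1.0.jar.
name := "MovieSimilarities1M"
version := "1.0"
scalaVersion := "2.10.6"

// Mark Spark as "provided": EMR already has Spark installed on the cluster,
// so it should not be baked into the fat JAR.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.2" % "provided"
```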

  • The Bucket should also contain an ml-1m folder, which holds the data set files for one million ratings

  • Open the ml-1m folder; it should contain the following files: README, movies.dat, ratings.dat, and users.dat
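
  • The .dat files use :: as a field separator; ratings.dat lines follow the pattern UserID::MovieID::Rating::Timestamp. A quick sketch of how such a line can be parsed in Scala (the sample line is illustrative of the format):

```scala
// Parse one ratings.dat line of the form UserID::MovieID::Rating::Timestamp.
def parseRating(line: String): (Int, Int, Double) = {
  val fields = line.split("::")
  (fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}

// Illustrative line in the MovieLens 1M format:
val (userID, movieID, rating) = parseRating("1::1193::5::978300760")
```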

  • Go to Amazon AWS and create an account if you don't have one. This account is going to use real money, so don't do this unless you're comfortable with that.

  • Once you get into the console, select EMR under Analytics

  • Click on Create Cluster

  • Name the cluster

      MovieSims1MTest
  • Ensure that Logging is checked, as you can retrieve those logs from the S3 folder in the event of a failure

  • Select Spark: Spark 1.6.2 on Hadoop 2.7.2 YARN with Ganglia 3.7.2

  • Stick with the default configuration unless you want to adjust the instance types and capacity, which come at different costs

  • You can also choose to create an EC2 Key Pair, which gives you a way of actually connecting to your nodes once you create them

  • Just follow the instructions to Create an Amazon EC2 Key Pair and PEM file for your Windows or Mac/Linux setup

  • Once you create that EC2 Key Pair, save that PPK file somewhere safe, because there's no way to get it back after that initial download

  • Once everything is set up properly, just click on Create Cluster

  • It will take about 10 to 15 minutes for Amazon to actually go out there, find some available hardware, and get it all spun up, configured, and booted for you

  • After your cluster is set up, it is ready and waiting to run your job

  • To do that, you first need to connect to the master node

  • Under Master public DNS, you'll find the externally available address of the master node that you'll run your script from; clicking the SSH link shows you exactly how to Connect to the Master Node Using SSH

  • So for Windows, you can use a terminal program like PuTTY, which is what I am using.

  • There is a download link there for you to download PuTTY and instructions for you to follow

  • So I am going to copy the address from the Host Name field

  • Open up PuTTY, and paste that address under Host Name (or IP address)

  • Next, click on SSH, then Auth, to specify your private key for authentication

  • Click on Browse and point it to the EC2 private key .ppk file that you want to use

  • Now just hit open, and we should be able to log into our master node.

  • Depending on your security settings, you might actually get a timeout at this point

  • If you can't connect no matter what you try with your firewall, odds are there is a block on the server side

  • A quick tip if you do run into that: click on the security group for the master node in the console

  • Once inside, click Inbound and make sure you have an SSH port open

  • So in this case, I had to actually manually add an SSH rule for TCP port 22 from the IP address that I'm connecting from.

  • So if you are having trouble that's probably what you need to do.

  • Once you are finished, you can go back to the EMR console

Running Commands From The PuTTY Instance Connected To EMR

  • Once you are connected to the EMR cluster through that PuTTY instance, input the following commands to run the JAR script

      ls
    
      pwd
      /home/hadoop
    
      aws s3 cp s3://sundog-spark/MovieSimilarities1M.jar ./
  • This is copying the JAR file from S3 to /home/hadoop

    ls

    aws s3 cp s3://sundog-spark/ml-1m/movies.dat ./

    ls
  • This is copying the movies.dat file from S3 to /home/hadoop
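
  • movies.dat maps movie IDs to titles, with lines of the form MovieID::Title::Genres. A small sketch of how a driver script might build an ID-to-name lookup from it (the sample line is illustrative of the format):

```scala
// Build a MovieID -> Title map from movies.dat-style lines
// (format: MovieID::Title::Genres).
def loadMovieNames(lines: Seq[String]): Map[Int, String] = {
  lines.map { line =>
    val fields = line.split("::")
    (fields(0).toInt, fields(1))
  }.toMap
}

// Illustrative entry; split("::") is safe because titles never contain "::".
val names = loadMovieNames(Seq(
  "260::Star Wars: Episode IV - A New Hope (1977)::Action|Adventure|Fantasy|Sci-Fi"
))
```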

  • All I need to do now is run spark-submit with the name of the JAR file and its argument

  • Star Wars' movie ID is 260 in the one million ratings dataset

      spark-submit MovieSimilarities1M.jar 260
  • The spark job should now execute on your EMR cluster
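
  • The 260 passed to spark-submit arrives in the driver's main method as args(0). A minimal sketch of how a script might read it — the helper name and the default value here are made-up assumptions:

```scala
// Read the target movie ID from the command line;
// `spark-submit MovieSimilarities1M.jar 260` makes args = Array("260").
def targetMovieID(args: Array[String], default: Int = 50): Int =
  if (args.nonEmpty) args.head.toInt else default
```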

  • The output should now show the top similar movies for Star Wars: Episode IV - A New Hope, based on one million real movie ratings

  • The results look pretty similar to what we got on the smaller dataset

  • Go back to the EMR console and press Terminate to terminate the cluster

  • It is important to do that, as you will still be billed for time on that cluster even if you aren't running any Spark jobs

MovieLens 1M Dataset
Amazon AWS