Apache 2.0 Spark with Scala

3. [Activity] Create a Histogram of Real Movie Ratings with Spark!

Last updated 6 years ago

Download the required files for SparkScala

  • Create a folder called SparkScala on your C: drive (C:/SparkScala) to store the course files

  • Go to grouplens.org

  • Click on the datasets tab, choose the MovieLens 100K Dataset, and download ml-100k.zip

  • Extract the archive and copy the ml-100k folder, with all the files inside, into the SparkScala folder you created

  • Download SparkScala.zip from the Sundog website

  • Extract the files from SparkScala.zip and remember where you stored them, as they contain all the source files for this course
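
Before moving on, it can help to confirm that the dataset ended up where the later steps expect it. The snippet below is a minimal sketch, not part of the course files: the object name DatasetCheck and its helpers are hypothetical, and it assumes the layout described above (C:/SparkScala/ml-100k) with the ratings stored in the u.data file that ships with the MovieLens 100K dataset.

```scala
import java.nio.file.{Files, Path, Paths}

// Hypothetical helper, not part of the course sources.
object DatasetCheck {
  // Assumed layout from the steps above: <root>/ml-100k/u.data
  def ratingsFile(root: Path): Path = root.resolve("ml-100k").resolve("u.data")

  // True only if the ratings file exists and is a regular file.
  def isReady(root: Path): Boolean = Files.isRegularFile(ratingsFile(root))

  def main(args: Array[String]): Unit = {
    // Default to the folder created in the first step; pass another path as an argument if needed.
    val root = Paths.get(if (args.nonEmpty) args(0) else "C:/SparkScala")
    if (isReady(root)) println(s"Found ${ratingsFile(root)}")
    else println(s"Missing ${ratingsFile(root)}: re-check the extract and copy steps")
  }
}
```

If the file is reported missing, the most likely cause is that the archive was extracted into a nested folder (for example ml-100k inside another ml-100k).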

Importing SparkScala files into Scala Eclipse IDE

  • Open Scala IDE and enter C:/SparkScala when prompted to select a workspace

  • Create a new Scala project for the course

  • Name it SparkScalaCourse and click Finish

  • Right-click the project and select New, then Package

  • Name the package

      com.sundogsoftware.spark
  • Click the Finish button

  • Right-click the package and click Import

  • Navigate to General/File System/

  • Choose the folder where you put the SparkScala sources (in my case, this directory)

      C:\Users\User\Desktop\Alvin Programming Files\Data Science Courses\Apache 2.0 Spark with Scala\SparkScala
  • Check the box next to RatingsCounter.scala

  • Double-click RatingsCounter.scala to open it

  • You can now see the code in that file, along with many errors caused by missing dependencies

  • To resolve them, right-click the SparkScalaCourse project and select Properties

  • Go to Java Build Path and select Add External JARs

  • Navigate to C:/spark/jars, press Ctrl+A to select all the Spark JARs, and add them to the project

  • You might see errors saying that the Spark JARs were built for Scala 2.11, which differs from the project's current Scala version (2.13 in my case)

  • To fix that, right-click the project, go to Properties, and open Scala Compiler

  • Check Use Project Settings, and select the fixed, built-in Scala installation of your Scala IDE that matches the Spark JARs (for me, it is Scala 2.11.11)

  • After that, all Scala version related errors should disappear
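
To double-check which Scala version the project actually runs on after changing the compiler setting, a small program can print it. This is a minimal sketch: the object name ScalaVersionCheck and the binaryVersion helper are hypothetical, while scala.util.Properties.versionNumberString is part of the Scala standard library.

```scala
import scala.util.Properties

// Hypothetical helper, not part of the course sources.
object ScalaVersionCheck {
  // Reduce a full version such as "2.11.11" to its binary version "2.11".
  def binaryVersion(full: String): String = full.split('.').take(2).mkString(".")

  def main(args: Array[String]): Unit = {
    // Version of the Scala library on the project's classpath.
    val running = Properties.versionNumberString
    println(s"Running on Scala $running (binary version ${binaryVersion(running)})")
  }
}
```

The binary version printed here should match the Scala version the Spark JARs were built for (2.11 in the setup described above); if it does not, the compiler setting probably did not take effect.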

Creating and Running a Scala Spark Application

  • Click Run, then go to Run Configurations

  • Click on Scala Application, and enter this as the main class

      com.sundogsoftware.spark.RatingsCounter
  • Click Run and the application should start

  • The console should now show the count for each rating from 1 to 5
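
For orientation, a ratings histogram of this kind can be written in a few lines of Spark. The following is a rough sketch under stated assumptions, not the course's actual RatingsCounter.scala: the object name RatingsHistogramSketch and the ratingOf helper are hypothetical, the relative path assumes the project lives inside C:/SparkScala next to the ml-100k folder, and it relies on u.data being tab-separated with the rating in the third column.

```scala
import org.apache.spark.SparkContext

// Hypothetical sketch, not the course's RatingsCounter.
object RatingsHistogramSketch {
  // u.data lines are tab-separated: userID, movieID, rating, timestamp.
  def ratingOf(line: String): String = line.split("\t")(2)

  def main(args: Array[String]): Unit = {
    // Run locally, using every core on this machine.
    val sc = new SparkContext("local[*]", "RatingsHistogramSketch")

    // Assumed path, relative to the project folder inside C:/SparkScala.
    val lines = sc.textFile("../ml-100k/u.data")

    // Count how many times each rating value occurs.
    val counts = lines.map(ratingOf).countByValue()

    // Print the (rating, count) pairs in rating order.
    counts.toSeq.sortBy(_._1).foreach(println)

    sc.stop()
  }
}
```

Because countByValue() is an action that returns an ordinary Scala map to the driver, the sorting and printing at the end happen locally rather than on the cluster.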
