3. [Activity] Create a Histogram of Real Movie Ratings with Spark!
Download the required files for SparkScala
Create a folder called SparkScala on your C: drive (C:/SparkScala) to store the course files
Go to grouplens.org
Click on the datasets tab, choose the MovieLens 100K Dataset, and download ml-100k.zip
Extract the archive and copy the ml-100k folder, with all the files inside, into the SparkScala folder you created
Download SparkScala.zip from the Sundog website
Extract the files from SparkScala.zip and remember where you put them, as they are the source files for this course
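Before moving on, it can help to confirm that the dataset ended up where the later steps expect it. The following is a minimal sketch, not part of the course files: the object name CheckDataset and the case class Rating are made up for illustration, and the path assumes the C:/SparkScala/ml-100k layout described above. It relies on u.data being tab-separated with the fields user ID, movie ID, rating, and timestamp.

```scala
import java.nio.file.{Files, Paths}
import scala.io.Source

// Hypothetical helper, not part of the course sources.
object CheckDataset {
  // One parsed row of u.data: user ID, movie ID, rating, timestamp.
  final case class Rating(userId: Int, movieId: Int, rating: Int, timestamp: Long)

  // u.data is tab-separated: userID, movieID, rating, timestamp.
  def parseLine(line: String): Rating = {
    val fields = line.split("\t")
    Rating(fields(0).toInt, fields(1).toInt, fields(2).toInt, fields(3).toLong)
  }

  def main(args: Array[String]): Unit = {
    // Assumed location, following the folder layout described above.
    val dataFile = Paths.get("C:/SparkScala/ml-100k/u.data")
    if (Files.exists(dataFile)) {
      val source = Source.fromFile(dataFile.toFile, "UTF-8")
      // Print the first three rows to confirm the format looks right.
      try source.getLines().take(3).map(parseLine).foreach(println)
      finally source.close()
    } else {
      println(s"Not found: $dataFile (check where ml-100k was extracted)")
    }
  }
}
```

If the file is reported as not found, the most likely cause is that ml-100k was extracted into a nested folder (for example ml-100k/ml-100k) rather than directly under SparkScala.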
Importing SparkScala files into Scala Eclipse IDE
Open Scala IDE and enter C:/SparkScala when asked to select a workspace
Create a new Scala project for the course
Name it SparkScalaCourse and click Finish
Right-click the project and choose New, then Package
Name the package
Click the finish button
Right-click the package and click Import
Navigate to General, then File System
Browse to the folder where you extracted the SparkScala.zip sources
Check the box next to RatingsCounter.scala
Double-click RatingsCounter.scala
You can now see all the code in that file, along with many errors caused by missing dependencies
To resolve that, right-click the SparkScalaCourse project and select Properties
Go to Java Build Path and, on the Libraries tab, select Add External JARs
Navigate to C:/spark/jars, press Ctrl+A to select all the Spark JARs, and add them to the project
You may then see errors saying that the Spark JARs were built for Scala 2.11, which differs from the project's current Scala version (2.13 in my case)
To fix that, right-click the project, go to Properties, and open Scala Compiler
Check Use Project Settings, and select the fixed, built-in Scala installation of your Scala IDE that matches the Spark JARs (for me it is Scala 2.11.11)
After that, all Scala version related errors should disappear
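If it is unclear which Scala version the project actually uses after changing the compiler setting, a small check like the one below can help. This is a sketch only: the object name ScalaVersionCheck and the helper binaryVersion are made up for illustration. It prints the version of the Scala library on the project's classpath, which can then be compared with the suffix in the Spark JAR file names (Spark artifacts carry the Scala binary version in their name, such as the _2.11 in spark-core_2.11).

```scala
// Hypothetical helper, not part of the course sources.
object ScalaVersionCheck {
  // Major.minor part of a version string, e.g. "2.11.11" -> "2.11".
  def binaryVersion(version: String): String =
    version.split('.').take(2).mkString(".")

  def main(args: Array[String]): Unit = {
    // Version of the Scala library that the project is running against.
    val running = scala.util.Properties.versionNumberString
    println(s"Scala library version: $running (binary version ${binaryVersion(running)})")
  }
}
```

The binary version printed here should match the suffix on the Spark JARs; if it does not, the Scala Compiler setting above is the place to look.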
Creating and Running a Scala Spark Application
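Open the Run menu and go to Run Configurations
Click on Scala Application, and for the main class enter the package name you chose followed by the name of the object to run (here, presumably RatingsCounter)
Click Run; the application should start
The console should now show how many times each rating from 1 to 5 occurs in the dataset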
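For readers who want to see the kind of logic that produces this output, here is a minimal sketch of a ratings histogram in Spark. It is not a copy of the course's RatingsCounter.scala: the object name RatingsHistogramSketch and the helper ratingOf are made up for illustration, the data path assumes the C:/SparkScala/ml-100k layout used earlier, and no package line is included, so add one that matches the package you created. It needs the Spark JARs on the build path, as set up above.

```scala
import org.apache.spark.SparkContext

// Hypothetical sketch, not the course's RatingsCounter.scala.
object RatingsHistogramSketch {
  // Extract the rating (third tab-separated field) from one line of u.data.
  def ratingOf(line: String): String = line.split("\t")(2)

  def main(args: Array[String]): Unit = {
    // Run Spark locally, using all available cores.
    val sc = new SparkContext("local[*]", "RatingsHistogramSketch")
    try {
      // Assumed path, following the folder layout used in this walkthrough.
      val lines = sc.textFile("C:/SparkScala/ml-100k/u.data")
      // Count how many times each distinct rating value occurs.
      val counts = lines.map(ratingOf).countByValue()
      // Sort by rating so the output reads from 1 through 5.
      counts.toSeq.sortBy(_._1).foreach { case (rating, n) => println(s"$rating: $n") }
    } finally {
      sc.stop()
    }
  }
}
```

countByValue returns the result to the driver as an ordinary Scala Map, which is fine here because there are only five distinct ratings; for a column with many distinct values, an aggregation that keeps the result distributed would be a safer choice.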