30. [Activity] Running the Similar Movies Script using Spark's Cluster Manager
Activity
Import MoviesSimilarities.scala from the resource folder into the SparkScala folder in the Eclipse Scala IDE
Open and take a look at MoviesSimilarities.scala
Looking At The Code
Start with the main method, where execution begins
The loadMovieNames() method maps the data from u.item into a map with movie IDs as keys and movie names as values
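The mapping step can be sketched in plain Scala without Spark. This is a minimal sketch, not the course's actual loadMovieNames(): the object name MovieNamesSketch and the helper parseMovieNames are hypothetical, and the sample lines are a shortened illustration of the pipe-delimited u.item layout (movie ID first, title second).

```scala
object MovieNamesSketch {
  /** Parse u.item-style lines ("movieID|title|...") into a movieID -> title map. */
  def parseMovieNames(lines: Iterator[String]): Map[Int, String] =
    lines.flatMap { line =>
      val fields = line.split('|')
      // Skip malformed lines that lack either an ID or a title.
      if (fields.length > 1) Some(fields(0).toInt -> fields(1)) else None
    }.toMap

  def main(args: Array[String]): Unit = {
    // Shortened sample lines; the real script reads them from the u.item file.
    val sample = Iterator(
      "1|Toy Story (1995)|01-Jan-1995",
      "50|Star Wars (1977)|01-Jan-1977"
    )
    val names = parseMovieNames(sample)
    println(names(50))
  }
}
```

Keeping the names in an ordinary Map on the driver works here because the list of movie titles is small compared with the ratings data.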
It then reads u.data and maps each line to a key/value pair, with the user ID as the key and a (movie ID, rating) tuple as the value
A self-join on the user ID combines every pair of movies rated by the same user into an RDD
The filterDuplicates function examines each userRatingPair and keeps it only when the first movie ID is lower than the second (true when movie1 < movie2), so each pair of movies appears once per user and a movie is never paired with itself
The makePairs function then re-keys each record, giving a key of (movie1, movie2) and a value of (rating1, rating2)
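The self-join, filterDuplicates and makePairs steps can be illustrated on ordinary Scala collections. This is a sketch under assumptions: it uses a Seq in place of an RDD, the object name MoviePairsSketch, the selfJoin helper and the type aliases are hypothetical, and the signatures in the course file may differ.

```scala
object MoviePairsSketch {
  type MovieRating = (Int, Double)                        // (movieID, rating)
  type UserRatingPair = (Int, (MovieRating, MovieRating)) // (userID, (first, second))

  /** Self-join on user ID: every ordered pair of ratings made by the same user. */
  def selfJoin(ratings: Seq[(Int, MovieRating)]): Seq[UserRatingPair] =
    for {
      a <- ratings
      b <- ratings
      if a._1 == b._1
    } yield (a._1, (a._2, b._2))

  /** Keep a pair only when the first movie ID is lower than the second. */
  def filterDuplicates(pair: UserRatingPair): Boolean = {
    val ((movie1, _), (movie2, _)) = pair._2
    movie1 < movie2
  }

  /** Re-key by movie pair: ((movie1, movie2), (rating1, rating2)). */
  def makePairs(pair: UserRatingPair): ((Int, Int), (Double, Double)) = {
    val ((movie1, rating1), (movie2, rating2)) = pair._2
    ((movie1, movie2), (rating1, rating2))
  }
}
```

With made-up ratings where user 1 rated movies 50 and 172, the self-join yields four combinations for that user, and the filter keeps only the one ordered as (50, 172).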
The groupByKey function groups together all the ratings associated with a given movie pair
The computeCosineSimilarity function computes a similarity score for each movie pair
This is just one way of measuring how similar the ratings for the two movies in a pair are to one another
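To make the score concrete, here is a minimal sketch of a cosine similarity over the grouped rating pairs of one movie pair. It treats the first ratings as one vector and the second ratings as another, and divides their dot product by the product of their lengths. The object name CosineSketch and the (score, number of pairs) return shape are assumptions; the function in the course file may be written differently.

```scala
object CosineSketch {
  /** Returns (cosine similarity, number of rating pairs) for one movie pair. */
  def computeCosineSimilarity(ratingPairs: Iterable[(Double, Double)]): (Double, Int) = {
    val pairs = ratingPairs.toSeq
    val sumXX = pairs.map { case (x, _) => x * x }.sum // squared length of first vector
    val sumYY = pairs.map { case (_, y) => y * y }.sum // squared length of second vector
    val sumXY = pairs.map { case (x, y) => x * y }.sum // dot product

    val denominator = math.sqrt(sumXX) * math.sqrt(sumYY)
    // Guard against division by zero when there are no pairs or a vector is all zeros.
    val score = if (denominator != 0.0) sumXY / denominator else 0.0
    (score, pairs.size)
  }
}
```

Returning the number of pairs alongside the score lets a caller judge how much evidence a score rests on, since a pair rated by only a few users can score high by chance.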
We call .cache() on moviePairSimilarities because we use the RDD more than once
We filter the movie pairs, keeping those whose similarity score is above a threshold we specify
We map the results and flip them around so the score comes first, sort in descending order, and take the top 10 similar movies with the highest similarity scores
Then we use a condition to check whether similarMovieID is the movieID we are looking for, and then display the similarity score for the movie pair
In short, we are finding out whether the two movies in a given pair were given similar ratings
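The filter, flip, sort and top-10 steps can be sketched on a plain Seq. This is an illustration under assumptions: TopSimilarSketch, topSimilar and the scoreThreshold parameter are hypothetical names, the sketch sorts by score alone, and it resolves the "other" movie of each pair by checking which side matches the requested movieID.

```scala
object TopSimilarSketch {
  // ((movie1, movie2), (similarity score, number of rating pairs))
  type PairScore = ((Int, Int), (Double, Int))

  /** Top-n movies most similar to movieID, as (similarMovieID, score). */
  def topSimilar(sims: Seq[PairScore], movieID: Int,
                 scoreThreshold: Double, n: Int = 10): Seq[(Int, Double)] =
    sims
      // Keep pairs that involve our movie and clear the threshold.
      .filter { case ((m1, m2), (score, _)) =>
        (m1 == movieID || m2 == movieID) && score > scoreThreshold
      }
      // Flip each record so the similarity comes first.
      .map { case (pair, sim) => (sim, pair) }
      // Highest score first.
      .sortBy { case ((score, _), _) => -score }
      .take(n)
      // Report the movie in the pair that is not the one we asked about.
      .map { case ((score, _), (m1, m2)) =>
        val similarMovieID = if (m1 == movieID) m2 else m1
        (similarMovieID, score)
      }
}
```

For example, topSimilar(sims, 50, 0.97) returns at most ten (movie ID, score) entries for movie 50, which can then be looked up in the movie-name map for display.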
To Run The Code And Pass In An Argument
We need to pass in the movieID we are looking for as an argument on the command line, using the spark-submit command
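Inside the program, the value arrives in the args array of the main method. A minimal sketch of reading it safely is below; MovieArgSketch and parseMovieID are hypothetical names, and the course script may simply read args(0) directly.

```scala
import scala.util.Try

object MovieArgSketch {
  /** Returns the movie ID from the first argument, if it is present and numeric. */
  def parseMovieID(args: Array[String]): Option[Int] =
    args.headOption.flatMap(arg => Try(arg.trim.toInt).toOption)

  def main(args: Array[String]): Unit =
    parseMovieID(args) match {
      case Some(movieID) => println(s"Looking up movies similar to movie ID $movieID")
      case None          => println("Usage: pass a numeric movieID as the first argument")
    }
}
```

Returning an Option avoids an exception when the argument is missing or is not a number.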
Right-click the package and click Export
Select Java, then JAR file
We can leave the settings at their defaults
Export the JAR to our SparkScalaCourse folder with the file name MovieSims.jar
Press Finish to export MovieSims.jar to the destination folder
Open cmd as Administrator and cd to the folder containing MovieSims.jar
Next, use spark-submit to have Spark run the JAR with the movieID 50, which stands for Star Wars
If cmd reports that the spark-submit command is not found, open Control Panel and navigate to Environment Variables. Under System variables, add SPARK_HOME as a variable and add its bin folder to Path so that the spark-submit command can be executed.
On my Windows setup, the Spark job ran but threw an exception.
This exception occurs when running a Spark job in a simulated Hadoop environment on Windows
You should see this output in cmd, listing the top 10 similar movies for Star Wars (1977)