30. [Activity] Running the Similar Movies Script using Spark's Cluster Manager

Activity

  • Import MoviesSimilarities.scala from the resource folder into the SparkScala folder in the Eclipse Scala IDE

  • Open and take a look at MoviesSimilarities.scala

Looking At The Code

  • The main method starts by loading the movie names:

      import java.nio.charset.CodingErrorAction
      import scala.io.{Codec, Source}

      // In the main method:
      println("\nLoading movie names...")
      val nameDict = loadMovieNames()

      /** Load up a Map of movie IDs to movie names. */
      def loadMovieNames(): Map[Int, String] = {

        // Handle character encoding issues by replacing bad characters
        // instead of throwing an exception:
        implicit val codec: Codec = Codec("UTF-8")
        codec.onMalformedInput(CodingErrorAction.REPLACE)
        codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

        // Create a Map of Ints to Strings, and populate it from u.item.
        var movieNames: Map[Int, String] = Map()

        val lines = Source.fromFile("../ml-100k/u.item").getLines()
        for (line <- lines) {
          val fields = line.split('|')
          if (fields.length > 1) {
            movieNames += (fields(0).toInt -> fields(1))
          }
        }
        movieNames
      }

  • The loadMovieNames() method reads u.item into a Map whose keys are movie IDs and whose values are movie names

  • Next, the script reads u.data and maps each line to a key/value pair, with the user ID as the key and a (movie ID, rating) tuple as the value

  • A self-join then produces an RDD containing every pair of movies rated by the same user

  • The filterDuplicates function removes duplicate pairings: of the two orderings of the same pair of movies, it keeps only the userRatingPair where movie1 < movie2. The comparison is on movie IDs, not on ratings

  • The makePairs function then re-keys each entry, so we have a key of (movie1, movie2) with a value of (rating1, rating2)

  • The groupByKey function groups together all the ratings associated with a given movie pair

  • The computeCosineSimilarity function computes a similarity score for each movie pair

  • Cosine similarity is just one way of measuring how similar the ratings for the two movies in a pair are to one another

  • We call .cache on moviePairSimilarities because the RDD is used more than once

  • We filter the movie pairs, keeping those whose similarity score passes a threshold we specify

  • We map the results to flip each one around so the score comes first, sort by score in descending order, and take the top 10 most similar movies

  • For each result, we check whether the first movie ID in the pair is the movieID we are looking for; if it is, the similar movie is the other one in the pair. We then display that similar movie with the similarity score for the movie pair

  • In short, for a given movie pair, we are finding out whether users gave the two movies similar ratings
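The steps above can be sketched without a cluster. The following is a minimal illustration that uses plain Scala collections as a stand-in for Spark RDDs, so it is not the code in MoviesSimilarities.scala. The function names filterDuplicates, makePairs and computeCosineSimilarity follow the walkthrough, but their bodies, the type aliases, the topSimilar helper and the sample ratings are all assumptions made up for this example.

```scala
object MovieSimilaritySketch {
  type MovieRating = (Int, Double)                        // (movieID, rating)
  type UserRatingPair = (Int, (MovieRating, MovieRating)) // (userID, (rating1, rating2))

  // Keep only the ordering of each pair where movie1's ID is lower than movie2's.
  def filterDuplicates(pair: UserRatingPair): Boolean = {
    val ((movie1, _), (movie2, _)) = pair._2
    movie1 < movie2
  }

  // Re-key by the movie pair, keeping the two ratings as the value.
  def makePairs(pair: UserRatingPair): ((Int, Int), (Double, Double)) = {
    val ((movie1, rating1), (movie2, rating2)) = pair._2
    ((movie1, movie2), (rating1, rating2))
  }

  // Cosine similarity between the two rating vectors of a movie pair.
  def computeCosineSimilarity(ratingPairs: Iterable[(Double, Double)]): Double = {
    var sumXX = 0.0
    var sumYY = 0.0
    var sumXY = 0.0
    for ((x, y) <- ratingPairs) {
      sumXX += x * x
      sumYY += y * y
      sumXY += x * y
    }
    val denominator = math.sqrt(sumXX) * math.sqrt(sumYY)
    if (denominator != 0.0) sumXY / denominator else 0.0
  }

  // Made-up sample data in the shape (userID, (movieID, rating)).
  val ratings: Seq[(Int, MovieRating)] = Seq(
    (1, (50, 5.0)), (1, (172, 5.0)), (1, (1, 3.0)),
    (2, (50, 4.0)), (2, (172, 4.0)), (2, (1, 5.0)),
    (3, (50, 5.0)), (3, (172, 4.0))
  )

  // Self-join on user ID: every combination of two ratings by the same user.
  val joined: Seq[UserRatingPair] =
    for {
      (user1, rating1) <- ratings
      (user2, rating2) <- ratings
      if user1 == user2
    } yield (user1, (rating1, rating2))

  // Group the rating pairs by movie pair and score each group.
  val moviePairSimilarities: Map[(Int, Int), Double] =
    joined.filter(filterDuplicates).map(makePairs)
      .groupBy(_._1)
      .map { case (pair, rows) => pair -> computeCosineSimilarity(rows.map(_._2)) }

  // Hypothetical helper: top 10 movies similar to movieID, best score first.
  def topSimilar(movieID: Int, scoreThreshold: Double): Seq[(Int, Double)] =
    moviePairSimilarities.toSeq
      .filter { case ((movie1, movie2), score) =>
        (movie1 == movieID || movie2 == movieID) && score > scoreThreshold
      }
      .map { case (pair, score) => (score, pair) } // flip so the score comes first
      .sortBy { case (score, _) => -score }        // descending by score
      .take(10)
      .map { case (score, (movie1, movie2)) =>
        // Report whichever movie in the pair is not the one we searched for.
        (if (movie1 == movieID) movie2 else movie1, score)
      }

  def main(args: Array[String]): Unit =
    topSimilar(50, 0.9).foreach { case (similarMovieID, score) =>
      println(s"movie $similarMovieID\tscore: $score")
    }
}
```

In the Spark version, each collection operation here would correspond to an RDD operation such as join, filter, map and groupByKey, and the cached moviePairSimilarities RDD would play the role of the Map.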

To Run The Code And Pass In An Argument

  • We need to pass the movie ID we are looking for as a command-line argument to the spark-submit command

  • Right-click the package and click Export

  • Select Java, then JAR file

  • The default settings are fine

  • Export the JAR to the SparkScalaCourse folder with the file name MovieSims.jar

  • Press Finish to export MovieSims.jar to the destination folder

  • Open cmd as Administrator and cd to the folder containing MovieSims.jar

  • Next, use spark-submit to run the JAR, passing 50 as the movie ID argument (50 is the ID for Star Wars). The command takes the form: spark-submit --class <main class> MovieSims.jar 50

  • If cmd reports that the spark-submit command is not found, open Control Panel and navigate to Environment Variables. Under System variables, add SPARK_HOME pointing to your Spark installation folder, then add %SPARK_HOME%\bin to Path so that cmd can find spark-submit.

  • On my Windows setup, the Spark job runs but throws an exception.

  • This exception appears when running a Spark job in a simulated Hadoop environment on Windows

  • The output in cmd should list the top 10 similar movies for Star Wars (1977)
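Going back to the movie ID passed on the command line: the following is a minimal sketch of how a main method might read that argument. The object name MovieIdArgSketch and the parseMovieId helper are hypothetical and are not taken from MoviesSimilarities.scala; the sketch only assumes that the movie ID arrives as the first program argument, as with the 50 passed to spark-submit.

```scala
import scala.util.Try

object MovieIdArgSketch {
  // Hypothetical helper: the first argument as an Int, or None if it is missing or not numeric.
  def parseMovieId(args: Array[String]): Option[Int] =
    args.headOption.flatMap(arg => Try(arg.trim.toInt).toOption)

  def main(args: Array[String]): Unit =
    parseMovieId(args) match {
      case Some(movieID) =>
        println(s"Looking up movies similar to movie ID $movieID")
      case None =>
        println("Please pass a numeric movie ID as the first argument, for example 50")
    }
}
```

Returning an Option keeps a missing or malformed argument from crashing the job with an exception before any Spark work starts.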
