Apache 2.0 Spark with Scala
Section 7: Machine Learning with MLLib

44. [Activity] Using MLLib to Produce Movie Recommendations

Activity

  • So the first thing I have done is to modify the data set a little bit

  • So I have gone into the u.data file in the ml-100k folder and added three entries for a fictitious user.

    0    50    5    881250949
    0    172    5    881250949
    0    133    1    881250949
  • What these ratings mean, basically, is that I've created a new user ID, zero, which is going to represent me

  • This user zero loves Star Wars, which happens to be movie ID 50, and loves The Empire Strikes Back, which is ID 172, with five-star ratings on both, but hates Gone With the Wind (ID 133), which gets a rating of 1

  • Import MovieRecommendationsALS.scala from the course's source folder into the SparkScalaCourse project in the Eclipse IDE

  • Open MovieRecommendationsALS.scala and look at the code

Looking At The Code

    import org.apache.spark.mllib.recommendation._
  • This imports Spark's MLLib recommendation package into our Scala file
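
  • The walkthrough below also refers to a ratings RDD and a nameDict map (movie ID to title) that the script builds earlier from the ml-100k files. The course's exact setup code isn't reproduced here, but a minimal sketch of it, assuming the standard u.item / u.data layouts and illustrative file paths, looks roughly like this:

    import scala.io.Source
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.recommendation.Rating

    // Sketch only -- paths and details may differ from the course script.
    val sc = new SparkContext(new SparkConf().setAppName("MovieRecsSketch").setMaster("local[*]"))

    // u.item is pipe-delimited: movieID|movieTitle|...
    def loadMovieNames(): Map[Int, String] = {
      Source.fromFile("../ml-100k/u.item", "ISO-8859-1").getLines()
        .map(_.split('|'))
        .filter(_.length > 1)
        .map(fields => (fields(0).toInt, fields(1)))
        .toMap
    }

    val nameDict = loadMovieNames()

    // u.data is tab-delimited: userID, movieID, rating, timestamp.
    val ratings = sc.textFile("../ml-100k/u.data")
      .map(_.split('\t'))
      .map(fields => Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble))
      .cache()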

    // Build the recommendation model using Alternating Least Squares
    println("\nTraining recommendation model...")

    val rank = 8
    val numIterations = 20

    val model = ALS.train(ratings, rank, numIterations)
  • The recommendation model is built using ALS (Alternating Least Squares), with rank and numIterations set to the values we specified
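
  • For reference, the spark.mllib ALS object also has train() overloads that expose the regularization parameter lambda explicitly; a quick sketch with illustrative values (the course code sticks to rank and iterations):

    import org.apache.spark.mllib.recommendation.ALS

    // Same ratings RDD as above; 0.01 is an illustrative regularization strength (lambda).
    val tunedModel = ALS.train(ratings, 8, 20, 0.01)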

    val userID = args(0).toInt

    println("\nRatings for user ID " + userID + ":")

    val userRatings = ratings.filter(x => x.user == userID)

    val myRatings = userRatings.collect()

    for (rating <- myRatings) {
        println(nameDict(rating.product.toInt) + ": " + rating.rating.toString)
    }
  • We take the user ID we want from the first command-line argument

  • The ratings for all the movies that this user has rated are then printed out

    println("\nTop 10 recommendations:")

    val recommendations = model.recommendProducts(userID, 10)
    for (recommendation <- recommendations) {
        println( nameDict(recommendation.product.toInt) + " score " + recommendation.rating )
    }
  • Next, the model will recommend the top ten movies for the userID based on ALS

  • The only difference here is that you need to pass an argument of 0 (our new user ID) when you run the Scala code

  • Now, run it and see the output

  • However, there's an issue: each time you train the model, it will display different results
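
  • This is expected: ALS starts from randomly initialized factor matrices, so each training run can converge to a slightly different model. If you want repeatable results, spark.mllib's ALS has an overload that takes an explicit seed; a sketch with illustrative values:

    import org.apache.spark.mllib.recommendation.ALS

    // rank = 8, iterations = 20, lambda = 0.01, blocks = -1 (let Spark choose), fixed seed
    val reproducibleModel = ALS.train(ratings, 8, 20, 0.01, -1, 12345L)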

But The Results Aren't Really That Great.

  • Very sensitive to the parameters chosen. It takes more work to find the optimal parameters for a data set than to run the recommendations themselves.

    • Can use "train/test" to evaluate various permutations of parameters (see the evaluation sketch after this list).

    • But what is a "good recommendation," anyway?

  • I'm not convinced it's even working properly.

    • Putting your faith in a black box is dodgy.

    • We'd get better results using our movie similarity results instead, to find movies similar to the movies each user liked.

    • Complicated isn't always better.

  • Never blindly trust results when analyzing big data

    • Small problems in algorithms become big ones

    • Very often, quality of your input data is the real issue.
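
  • Expanding on the "train/test" point above: a minimal sketch of the usual spark.mllib evaluation loop, assuming the same ratings RDD and illustrative hyperparameters. It holds out a portion of the ratings, trains on the rest, and measures mean squared error on the held-out set; you would repeat this for different rank / iteration / lambda combinations:

    import org.apache.spark.mllib.recommendation.ALS

    // Split the existing ratings RDD into training and held-out test sets.
    val Array(trainSet, testSet) = ratings.randomSplit(Array(0.8, 0.2), seed = 1L)

    // Illustrative hyperparameters: rank = 8, iterations = 20, lambda = 0.01.
    val evalModel = ALS.train(trainSet, 8, 20, 0.01)

    // Predict the held-out (user, movie) pairs and compare against the true ratings.
    val testPairs = testSet.map(r => (r.user, r.product))
    val predictions = evalModel.predict(testPairs).map(r => ((r.user, r.product), r.rating))
    val actuals = testSet.map(r => ((r.user, r.product), r.rating))

    val mse = actuals.join(predictions).map {
      case (_, (actual, predicted)) => math.pow(actual - predicted, 2)
    }.mean()

    println(s"Mean squared error on held-out ratings: $mse")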

MLLIB Is Still Really Useful, Though.
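
  • Even if ALS recommendations need care, MLLib gives you distributed, well-tested implementations of many other techniques out of the box, such as linear regression, logistic regression, Naive Bayes, decision trees, K-Means clustering, and dimensionality reduction (PCA/SVD).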
