Apache 2.0 Spark with Scala

25. [Activity] Find the Most Popular Superhero in a Social Graph

MOST POPULAR SUPERHERO

Superhero Social Networks

  • If a superhero appears with another superhero, the two might be friends, so we can represent that relationship as a connection in a graph

Input Data Format

  • Each line of Marvel-graph.txt contains a superhero ID followed by the IDs of the other superheroes it appears with

  • Marvel-graph.txt:

      4397 2237 1767 472 4997 5931 6235 1478 1369 806 3994 6232
      3519 4704 2460 763 1602 5306 5358 6121 6160 2459 3173 4963 6166
      3518 5409
  • A hero's connections may span multiple lines, so the same hero ID can begin more than one line

  • Each line of Marvel-names.txt contains the superhero ID, a space, and the hero's name in quotation marks; some names combine the superhero name and the real-life name, separated by a slash

  • Marvel-names.txt:

      5300 "SPENCERTRACY"
      5301 "SPERZEL, ANTON"
      5302 "SPETSBURO, GEN.YURI"
      5303 "SPHINX"
      5304 "SPHINX II"
      5305 "SPHINX III"
      5306 "SPIDER-MAN/PETER PAR"
      5307 "SPIDER-MAN III/MARTH"
      5308 "SPIDER-MAN CLONE/BEN"
      5309 "SPIDER-WOMAN/JESSICA"

Most Popular Superhero: Strategy

  • Map input data to (heroID, number of co-occurrences) per line

  • Add up co-occurrences by heroID using reduceByKey()

  • Flip (map) RDD to (number, heroID) so we can...

  • Use max() on the RDD to find the hero with the most co-occurrences

  • Look up the name of the winner and display the result
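
The steps above can be sketched on plain Scala collections to show the data flow without needing a Spark context. This is only an illustration, not the course's MostPopularSuperhero.scala: the object name, the function names, the hero IDs, the counts and the hero names are all made up, and the input is assumed to be the per-line (heroID, count) pairs that the first step would produce.

```scala
object MostPopularSketch {

  // Step 2: add up the per-line counts for each heroID.
  // This local groupBy-and-sum stands in for reduceByKey() on an RDD.
  def totalsByHero(perLine: Seq[(Int, Int)]): Map[Int, Int] =
    perLine.groupBy(_._1).map { case (id, pairs) => id -> pairs.map(_._2).sum }

  // Steps 3 and 4: flip each pair to (count, heroID) and take the maximum.
  // Tuples compare on their first element first, so the largest count wins.
  def mostPopular(totals: Map[Int, Int]): (Int, Int) =
    totals.toSeq.map(_.swap).max

  def main(args: Array[String]): Unit = {
    // Hypothetical data: hero 1 appears on two lines, as a hero may in Marvel-graph.txt.
    val perLine = Seq((1, 3), (2, 1), (1, 2), (3, 4))
    val names = Map(1 -> "HERO A", 2 -> "HERO B", 3 -> "HERO C")

    val (count, id) = mostPopular(totalsByHero(perLine))

    // Step 5: look up the winner's name and display the result.
    println(s"${names(id)} is the most popular superhero with $count co-appearances.")
  }
}
```

Flipping the pairs matters because max() compares the first element of each tuple, so the count has to come first. In the Spark version, the sum per hero would be done with reduceByKey() and the maximum with max() on the flipped RDD, as the list describes.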

Off To The Code...

  • Import MostPopularSuperhero.scala from the resource folder into the SparkScala folder in the Eclipse Scala IDE

  • Import Marvel-graph.txt and Marvel-names.txt from the resource folder into the SparkScala folder; they are the input files for MostPopularSuperhero.scala

  • The parseNames function splits each line on '\"', the quotation mark character

  • The countCoOccurences function maps each line to a key of the superhero ID and a value of the number of other superheroes that appear on that line

  • Run the file and you should see the superhero with the most friends
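
For reference, here is one way the two parsing functions described above could look. It is a sketch based only on the descriptions and the sample data on this page, so the code in the resource folder may differ. The wrapper object MarvelParsers and the choice to return an Option for lines without a quoted name are assumptions.

```scala
object MarvelParsers {

  // Parses a Marvel-names.txt line such as: 5303 "SPHINX"
  // Splitting on the quotation mark leaves the ID (plus a trailing space)
  // in the first field and the name in the second.
  // Returns None when the line has no quoted name (an assumed convention).
  def parseNames(line: String): Option[(Int, String)] = {
    val fields = line.split('\"')
    if (fields.length > 1) Some((fields(0).trim.toInt, fields(1))) else None
  }

  // Parses a Marvel-graph.txt line such as: 3518 5409
  // The first ID is the hero; every remaining ID is one co-occurrence.
  def countCoOccurences(line: String): (Int, Int) = {
    val ids = line.trim.split("\\s+")
    (ids(0).toInt, ids.length - 1)
  }

  def main(args: Array[String]): Unit = {
    println(parseNames("5301 \"SPERZEL, ANTON\""))
    println(countCoOccurences("3518 5409"))
  }
}
```

Splitting on the quotation mark rather than on spaces keeps names that contain spaces or commas, such as "SPERZEL, ANTON", in one piece.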

And The Winner Is...

  • CAPTAIN AMERICA is the most popular superhero with 1933 co-appearances.

Exercise

  • Now you can experiment with this code to find the top ten most popular and the top ten least popular superheroes

  • I have included my code, topTenSuperHeroes.scala, in this folder; it finds the top ten superheroes with the most and the fewest friends
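
If you want a starting point before opening that file, here is a minimal sketch of the idea on a plain Scala collection. It is not the contents of topTenSuperHeroes.scala: the object name, the function names and the sample totals are made up, and the input is assumed to be the (heroID, total co-occurrences) pairs computed earlier.

```scala
object TopTenSketch {

  // The n heroes with the most co-occurrences, highest first.
  def mostPopular(totals: Seq[(Int, Int)], n: Int): Seq[(Int, Int)] =
    totals.sortBy { case (_, count) => -count }.take(n)

  // The n heroes with the fewest co-occurrences, lowest first.
  def leastPopular(totals: Seq[(Int, Int)], n: Int): Seq[(Int, Int)] =
    totals.sortBy { case (_, count) => count }.take(n)

  def main(args: Array[String]): Unit = {
    // Hypothetical (heroID, total) pairs, not taken from Marvel-graph.txt.
    val totals = Seq((1, 5), (2, 1), (3, 4), (4, 9))
    println(mostPopular(totals, 2))
    println(leastPopular(totals, 2))
  }
}
```

On an RDD of flipped (count, heroID) pairs, top(10) and takeOrdered(10) play the same two roles, returning the largest and the smallest elements respectively.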


Last updated 6 years ago