25. [Activity] Find the Most Popular Superhero in a Social Graph
MOST POPULAR SUPERHERO
Superhero Social Networks
If a superhero appears with another superhero, the two might be friends, so we can represent that relationship as a graph
Input Data Format
Each line of Marvel-graph.txt contains a superhero ID followed by the IDs of the superheroes associated with it
Marvel-graph.txt:
A hero's associations may span multiple lines, so the same hero ID can start more than one line
Marvel-names.txt contains each superhero ID followed by the superhero name and real-life name, delimited by a space
Marvel-names.txt:
Most Popular Superhero: Strategy
Map each line of input data to (heroID, number of co-occurrences)
Add up the co-occurrences by heroID using reduceByKey()
Flip (map) the RDD to (number, heroID) so we can...
Use max() on the RDD to find the hero with the most co-occurrences
Look up the name of the winner and display the result
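The counting steps above can be sketched on a small in-memory sample. This is only an illustration, not the course's MostPopularSuperhero.scala: plain Scala collections stand in for the RDD so it runs without Spark, the object name PopularitySketch and the sample lines are hypothetical, and whitespace-separated IDs are an assumption about the file layout. The comments name the RDD call each step corresponds to.

```scala
object PopularitySketch {
  // Hypothetical sample in the Marvel-graph.txt layout: a hero ID followed by
  // the IDs of the heroes it appears with. Hero 1 spans two lines.
  val sampleGraph: Seq[String] = Seq(
    "1 2 3",
    "2 1",
    "1 4 5 6",
    "3 1"
  )

  // Step 1: one line -> (heroID, number of co-occurrences on that line).
  // The name keeps the spelling used by the course code.
  def countCoOccurences(line: String): (Int, Int) = {
    val fields = line.trim.split("\\s+")
    (fields(0).toInt, fields.length - 1)
  }

  // Steps 2 to 4 on a local Seq; returns (total co-occurrences, heroID).
  def mostPopular(lines: Seq[String]): (Int, Int) = {
    // On an RDD: lines.map(countCoOccurences)
    val pairs = lines.map(countCoOccurences)
    // On an RDD: pairs.reduceByKey(_ + _)
    val totals = pairs.groupBy(_._1).toSeq.map {
      case (id, perLine) => (id, perLine.map(_._2).sum)
    }
    // Flip to (count, heroID) so that max compares counts first.
    val flipped = totals.map { case (id, count) => (count, id) }
    flipped.max
  }

  def main(args: Array[String]): Unit = {
    val (count, id) = mostPopular(sampleGraph)
    println(s"Hero $id has $count co-appearances")
  }
}
```

Flipping the pair works because tuples are ordered by their first element, so max() picks the largest count without a custom comparator.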
Off To The Code...
Import MostPopularSuperhero.scala from the resource folder into the SparkScala folder in the Eclipse Scala IDE
Import Marvel-graph.txt and Marvel-names.txt from the resource folder into the SparkScala folder; they are the source files for MostPopularSuperhero.scala
The parseNames function splits each line on '\"', the double quotation mark character
The countCoOccurences function maps each line to a key of the superhero ID and a value of the number of other superheroes on that line
Run the file and you should see the superhero with the most friends
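As a rough sketch of the name-parsing step described above, the following splits a line on the quotation mark and builds an ID-to-name table. It is not the course's code: the object name NamesSketch, the helper nameTable and the sample lines are hypothetical, and the layout (ID, a space, then a quoted name) is an assumption. Returning an Option lets blank or malformed lines be dropped with flatMap.

```scala
object NamesSketch {
  // Hypothetical lines in the assumed layout: heroID, a space, a quoted name.
  val sampleNames: Seq[String] = Seq(
    "1 \"HERO ONE/ALICE\"",
    "", // blank line, skipped by parseNames
    "2 \"HERO TWO/BOB\""
  )

  // Split on the double quotation mark; the ID is before the first quote
  // and the name is between the quotes. Lines without a quote yield None.
  def parseNames(line: String): Option[(Int, String)] = {
    val fields = line.split('\"')
    if (fields.length > 1) Some((fields(0).trim.toInt, fields(1)))
    else None
  }

  // Build a lookup table, silently dropping lines that did not parse.
  def nameTable(lines: Seq[String]): Map[Int, String] =
    lines.flatMap(line => parseNames(line)).toMap

  def main(args: Array[String]): Unit = {
    val names = nameTable(sampleNames)
    println(names.getOrElse(1, "unknown"))
  }
}
```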
And The Winner Is...
CAPTAIN AMERICA is the most popular superhero with 1933 co-appearances.
Exercise
Now you can experiment with this code to find the top ten most popular and the top ten least popular superheroes
I have included my code, topTenSuperHeroes.scala, in this folder; it finds the ten superheroes with the most friends and the ten with the fewest
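One possible starting point for the exercise, shown on plain Scala collections rather than an RDD, is to sort the (heroID, total) pairs and take the first n. This is not the contents of topTenSuperHeroes.scala; the object name TopNSketch and the sample totals are hypothetical. On an RDD, top() and takeOrdered() with a suitable ordering serve a similar purpose.

```scala
object TopNSketch {
  // Hypothetical (heroID, total co-occurrences) pairs, as produced after
  // summing per-line counts by hero ID.
  val sampleTotals: Seq[(Int, Int)] =
    Seq((1, 5), (2, 1), (3, 9), (4, 0), (5, 3))

  // The n heroes with the highest totals, largest first.
  def mostPopular(totals: Seq[(Int, Int)], n: Int): Seq[(Int, Int)] =
    totals.sortBy { case (_, count) => -count }.take(n)

  // The n heroes with the lowest totals, smallest first.
  def leastPopular(totals: Seq[(Int, Int)], n: Int): Seq[(Int, Int)] =
    totals.sortBy { case (_, count) => count }.take(n)

  def main(args: Array[String]): Unit = {
    println(mostPopular(sampleTotals, 2))
    println(leastPopular(sampleTotals, 2))
  }
}
```

For the real exercise, n would be 10 and each hero ID would still need to be looked up in the names table before printing.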