Apache Spark 2.0 with Scala
Section 3: Spark Basics and Simple Examples

13. Key/Value RDD's, and the Average Friends by Age example

KEY/VALUE RDD'S

And the "friends by age" example

RDD's Can Hold Key/Value Pairs

  • For example: number of friends by age

  • Key is age, value is number of friends

  • Instead of just a list of ages or a list of friend counts, we can store (age, # of friends) pairs.

Creating A Key/Value RDD

  • Nothing special in Scala, really

  • Just map pairs of data into the RDD using tuples. For example:

      val totalsByAge = rdd.map(x => (x, 1))
  • Voilà, you now have a key/value RDD.

  • OK to have tuples or other objects as values as well (see the sketch below).
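
A minimal sketch of both cases, assuming sc is an existing SparkContext; the data here is made up purely for illustration:

    // Map each age into a tuple -- now it's a key/value RDD of (age, 1) pairs
    val ages = sc.parallelize(List(33, 33, 55, 40))
    val totalsByAge = ages.map(x => (x, 1))

    // Values may themselves be tuples (or other objects)
    val people = sc.parallelize(List((33, 385), (33, 2)))   // (age, numFriends)
    val paired = people.map(x => (x._1, (x._2, 1)))         // value is the tuple (numFriends, 1)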

Spark Can Do Special Stuff With Key/Value Data

  • reduceByKey(): combine values with the same key using some function. rdd.reduceByKey((x, y) => x + y) adds them up.

  • groupByKey(): Group values with the same key

  • sortByKey(): Sort RDD by key values

  • keys(), values() - Create an RDD of just the keys, or just the values (a quick sketch of these operations follows below)
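
A minimal sketch of these operations on a toy key/value RDD, assuming sc is an existing SparkContext:

    // Toy (age, numFriends) pairs, made up for illustration
    val pairs = sc.parallelize(List((33, 385), (33, 2), (55, 221)))

    val sums    = pairs.reduceByKey((x, y) => x + y)   // (33, 387), (55, 221)
    val grouped = pairs.groupByKey()                   // (33, [385, 2]), (55, [221])
    val sorted  = pairs.sortByKey()                    // sorted by age, ascending
    val ages    = pairs.keys                           // 33, 33, 55
    val friends = pairs.values                         // 385, 2, 221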

You Can Do SQL-Style Joins On Two Key/Value RDD's

  • join, rightOuterJoin, leftOuterJoin, cogroup, subtractByKey

  • We'll look at a full example of this later; a minimal sketch follows below.
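
A minimal sketch of the join flavors, assuming sc is an existing SparkContext; the movie data is made up purely for illustration:

    val ratings = sc.parallelize(List((1, 5.0), (2, 3.5)))            // (movieID, rating)
    val names   = sc.parallelize(List((1, "Star Wars"), (3, "Tron"))) // (movieID, name)

    val inner = ratings.join(names)           // keeps only keys in both: (1, (5.0, "Star Wars"))
    val left  = ratings.leftOuterJoin(names)  // keeps every ratings key: includes (2, (3.5, None))
    val right = ratings.rightOuterJoin(names) // keeps every names key: includes (3, (None, "Tron"))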

Mapping Just The Values Of A Key/Value RDD?

  • With key/value data, use mapValues() and flatMapValues() if your transformation doesn't affect the keys.

  • It's more efficient: Spark can keep the existing partitioning, since it knows the keys haven't changed (see the sketch below).
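
For example, a minimal sketch that scales the friend counts without touching the ages (rdd here is the (age, numFriends) RDD built in the example that follows):

    // mapValues() leaves the keys alone, so Spark preserves the RDD's partitioner,
    // which can avoid an unnecessary shuffle in later stages
    val doubled = rdd.mapValues(numFriends => numFriends * 2)

    // The equivalent map() works too, but Spark can no longer assume the keys are unchanged
    val doubledWithMap = rdd.map(x => (x._1, x._2 * 2))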

Friends By Age Example

  • Input Data: ID, name, age, number of friends

    • 0, Will, 33, 385

    • 1, Jean-Luc, 33, 2

    • 2, Hugh, 55, 221

    • 3, Deanna, 40, 465

    • 4, Quark, 68, 21

Parsing (Mapping) The Input Data

    // Split each CSV line into its fields, and extract (age, numFriends)
    def parseLine(line: String): (Int, Int) = {
        val fields = line.split(",")
        val age = fields(2).toInt
        val numFriends = fields(3).toInt
        (age, numFriends)
    }

    // Load the raw text file into an RDD, then map each line to an (age, numFriends) pair
    val lines = sc.textFile("../fakefriends.csv")
    val rdd = lines.map(parseLine)
  • Output is key/value pairs of (age, numFriends):

    • 33, 385

    • 33, 2

    • 55, 221

    • 40, 465

    • ...

Count Up Sum Of Friends And Number Of Entries Per Age

    val totalsByAge = rdd.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))

    rdd.mapValues(x => (x, 1))

    (33, 385) => (33, (385, 1))
    (33, 2) => (33, (2, 1))
    (55, 221) => (55, (221, 1))

    reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))

    Adds up all values for each unique key!

    (33, (387, 2))
    (55, (221, 1))

Compute Averages

    val averagesByAge = totalsByAge.mapValues(x => x._1.toDouble / x._2)
    (33, (387, 2)) => (33, 193.5)

Collect And Display The Results

    val results = averagesByAge.collect()
    results.sorted.foreach(println)
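
Putting it all together, here is a minimal end-to-end sketch as a standalone program. It assumes Spark is on the classpath and that fakefriends.csv sits in the parent directory, matching the path used above; the object name is illustrative only.

    import org.apache.spark.SparkContext
    import org.apache.log4j.{Level, Logger}

    object FriendsByAge {

      // Extract (age, numFriends) from one CSV line: ID, name, age, numFriends
      def parseLine(line: String): (Int, Int) = {
        val fields = line.split(",")
        (fields(2).toInt, fields(3).toInt)
      }

      def main(args: Array[String]): Unit = {
        // Cut down on log spam, then create a local SparkContext using every core
        Logger.getLogger("org").setLevel(Level.ERROR)
        val sc = new SparkContext("local[*]", "FriendsByAge")

        val rdd = sc.textFile("../fakefriends.csv").map(parseLine)

        // (age, (sumOfFriends, count)) per age, then divide to get the average
        val totalsByAge = rdd.mapValues(x => (x, 1))
          .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
        val averagesByAge = totalsByAge.mapValues(x => x._1.toDouble / x._2)

        // Bring the results back to the driver, sort by age, and print
        averagesByAge.collect().sorted.foreach(println)

        sc.stop()
      }
    }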