Apache Spark 2.0 with Scala
  • Introduction
  • Section 1: Getting Started
    • 1. Warning about Java 9 and Spark 2.3!
    • 2. Introduction, and Getting Set Up
    • 3. [Activity] Create a Histogram of Real Movie Ratings with Spark!
  • Section 2: Scala Crash Course
    • 4. [Activity] Scala Basics, Part 1
    • 5. [Exercise] Scala Basics, Part 2
    • 6. [Exercise] Flow Control in Scala
    • 7. [Exercise] Functions in Scala
    • 8. [Exercise] Data Structures in Scala
  • Section 3: Spark Basics and Simple Examples
    • 9. Introduction to Spark
    • 10. Introducing RDD's
    • 11. Ratings Histogram Walkthrough
    • 12. Spark Internals
    • 13. Key/Value RDD's, and the Average Friends by Age example
    • 14. [Activity] Running the Average Friends by Age Example
    • 15. Filtering RDD's, and the Minimum Temperature by Location Example
    • 16. [Activity] Running the Minimum Temperature Example, and Modifying it for Maximum
    • 17. [Activity] Counting Word Occurrences using flatMap()
    • 18. [Activity] Improving the Word Count Script with Regular Expressions
    • 19. [Activity] Sorting the Word Count Results
    • 20. [Exercise] Find the Total Amount Spent by Customer
    • 21. [Exercise] Check your Results, and Sort Them by Total Amount Spent
    • 22. Check Your Results and Implementation Against Mine
  • Section 4: Advanced Examples of Spark Programs
    • 23. [Activity] Find the Most Popular Movie
    • 24. [Activity] Use Broadcast Variables to Display Movie Names
    • 25. [Activity] Find the Most Popular Superhero in a Social Graph
    • 26. Superhero Degrees of Separation: Introducing Breadth-First Search
    • 27. Superhero Degrees of Separation: Accumulators, and Implementing BFS in Spark
    • 28. Superhero Degrees of Separation: Review the code, and run it!
    • 29. Item-Based Collaborative Filtering in Spark, cache(), and persist()
    • 30. [Activity] Running the Similar Movies Script using Spark's Cluster Manager
    • 31. [Exercise] Improve the Quality of Similar Movies
  • Section 5: Running Spark on a Cluster
    • 32. [Activity] Using spark-submit to run Spark driver scripts
    • 33. [Activity] Packaging driver scripts with SBT
    • 34. Introducing Amazon Elastic MapReduce
    • 35. Creating Similar Movies from One Million Ratings on EMR
    • 36. Partitioning
    • 37. Best Practices for Running on a Cluster
    • 38. Troubleshooting, and Managing Dependencies
  • Section 6: SparkSQL, DataFrames, and DataSets
    • 39. Introduction to SparkSQL
    • 40. [Activity] Using SparkSQL
    • 41. [Activity] Using DataFrames and DataSets
    • 42. [Activity] Using DataSets instead of RDD's
  • Section 7: Machine Learning with MLLib
    • 43. Introducing MLLib
    • 44. [Activity] Using MLLib to Produce Movie Recommendations
    • 45. [Activity] Linear Regression with MLLib
    • 46. [Activity] Using DataFrames with MLLib
  • Section 8: Intro to Spark Streaming
    • 47. Spark Streaming Overview
    • 48. [Activity] Set up a Twitter Developer Account, and Stream Tweets
    • 49. Structured Streaming
  • Section 9: Intro to GraphX
    • 50. GraphX, Pregel, and Breadth-First Search with Pregel
    • 51. [Activity] Superhero Degrees of Separation using GraphX
  • Section 10: You Made It! Where to Go from Here.
    • 52. Learning More, and Career Tips
    • 53. Bonus Lecture: Discounts on my other "Big Data" / Data Science Courses.
Section 8: Intro to Spark Streaming

47. Spark Streaming Overview

Spark Streaming

  • Analyzes continual streams of data

    • Common example: processing log data from a website or server

  • Data is aggregated and analyzed at some given interval

  • Can take in data fed to a TCP port, or from Amazon Kinesis, HDFS, Kafka, Flume, and other sources

  • "Checkpointing" stores state to disk periodically for fault tolerance

  • A "Dstream" object breaks up the stream into distinct RDD's

  • Simple example:

      import org.apache.spark.streaming.{Seconds, StreamingContext}

      // conf is your SparkConf; process the stream in one-second batches
      val stream = new StreamingContext(conf, Seconds(1))
      val lines = stream.socketTextStream("localhost", 8888)
      val errors = lines.filter(_.contains("error"))
      errors.print()
  • This listens to log data sent into port 8888, one second at a time, and prints out the lines containing "error".

  • You must kick off the job explicitly:

      stream.start()
      stream.awaitTermination()
  • Remember each RDD only contains one small chunk of the incoming data (one batch interval's worth).

  • "Windowed operations" can combine results from multiple batches over some sliding time window

    • See window(), reduceByWindow(), reduceByKeyAndWindow()

  • updateStateByKey()

    • Lets you maintain state across many batches as time goes by

    • For example, running counts of some event (see the sketch below)
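
  • A minimal sketch of updateStateByKey(), assuming a DStream[(String, Int)] named pairs and a checkpoint directory already set on the streaming context (both hypothetical here):

      // Hypothetical example: a running count per key, maintained across
      // all batches. Assumes pairs is a DStream[(String, Int)] and that
      // ssc.checkpoint() has been called.
      def updateCount(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
        Some(runningCount.getOrElse(0) + newValues.sum)
      }

      val runningCounts = pairs.updateStateByKey(updateCount)
      runningCounts.print()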

Let's Stream

  • We'll run a Spark Streaming script that monitors live Tweets from Twitter, and keeps track of the most popular hashtags as Tweets are received!

Need A Twitter Developer Account

  • Specifically, we need to get an API key and access token; do this at the Twitter developer website

  • This allows us to stream Twitter data in real time!

Create A Twitter.txt File In Your Workspace

  • On each line, specify a setting name followed by your own consumer key & access token values

  • For example (substitute in your own keys & tokens!)

      consumerKey AXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXSQ
      consumerSecret 9EwXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
      accessToken 384953089-3DUtu13XXXXXXXXXXXXXXXXXXXX
      accessTokenSecret 7XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
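
  • A minimal sketch of loading these credentials at startup (the setupTwitter helper is illustrative; the twitter4j.oauth.* system properties are what the Twitter4J library actually reads):

      import scala.io.Source

      // Hypothetical helper: set Twitter4J's OAuth system properties
      // from each "name value" line of the Twitter.txt file above.
      def setupTwitter(): Unit = {
        for (line <- Source.fromFile("Twitter.txt").getLines) {
          val fields = line.split(" ")
          if (fields.length == 2) {
            System.setProperty("twitter4j.oauth." + fields(0), fields(1))
          }
        }
      }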

Step One

  • Get a Twitter stream and extract just the messages themselves

    import org.apache.spark.streaming.twitter.TwitterUtils

    val tweets = TwitterUtils.createStream(ssc, None)
    val statuses = tweets.map(status => status.getText())

    Vote for #BoatyMcBoatFace!
    Vote for "I Like Big Boats and I Cannot Lie!"
    No! Vote for #WhatIceberg ?
    Are you crazy? #BoatyMcBoatFace all the way.
    ...

Step Two

  • Create a new DStream that has every individual word as its own entry.

  • We use flatMap() for this

    val tweetwords = statuses.flatMap(tweetText => tweetText.split(" "))

    Vote
    for
    #BoatyMcBoatFace
    Vote
    For
    I
    ...

Step Three

  • Eliminate anything that's not a hashtag

  • We use the filter() function for this

    val hashtags = tweetwords.filter(word => word.startsWith("#"))

    #BoatyMcBoatFace
    #WhatIceberg
    #BoatyMcBoatFace
    ...

Step Four

  • Convert our DStream of hashtags to key/value pairs

  • This is so we can count them up with a reduce function

    val hashtagKeyValues = hashtags.map(hashtag => (hashtag, 1))

    (#BoatyMcBoatFace, 1)
    (#WhatIceberg, 1)
    (#BoatyMcBoatFace, 1)
    ...

Step Five

  • Count up the results over a sliding window

  • A reduce operation adds up all of the values for a given key

    • Here, a key is a unique hashtag, and each value is 1

    • So by adding up all the "1"s associated with each instance of a hashtag, we get the count of that hashtag

  • reduceByKeyAndWindow() performs this reduce operation over a given window and slide interval

    val hashtagCounts = hashtagKeyValues.reduceByKeyAndWindow((x,y) => x + y, (x,y) => x - y, Seconds(300), Seconds(1))

    (#BoatyMcBoatFace, 2)
    (#WhatIceberg, 1)
  • Besides the function that adds new values as they enter the window, we also pass in an inverse function that subtracts values as they leave it, plus a window length of 5 minutes (300 seconds) and a slide interval of 1 second

  • So every second, it updates the hashtag counts computed over the past five minutes
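
  • Note that this inverse-function form of reduceByKeyAndWindow() requires checkpointing to be enabled first; a minimal sketch (the directory path is illustrative):

      // Store intermediate state to disk for fault tolerance; required
      // when using reduceByKeyAndWindow() with an inverse function.
      ssc.checkpoint("C:/checkpoint/")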

Step Six

  • Sort and output the results.

  • The counts are in the second element of each tuple, so we sort by the ._2 element

    • sortBy() is an RDD operation, not a DStream operation, so we wrap it in transform(), which applies a function to every RDD in the stream

    val sortedResults = hashtagCounts.transform(rdd => rdd.sortBy(x => x._2, false))
    sortedResults.print()

    (#BoatyMcBoatFace, 2)
    (#WhatIceberg, 1)

Let's See It In Action
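
  • Putting all six steps together, a minimal sketch of the complete script (the object name, file paths, and local-mode config are illustrative, not the course's exact code):

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.twitter.TwitterUtils

      object PopularHashtags {
        def main(args: Array[String]): Unit = {
          // Load OAuth credentials into Twitter4J's system properties
          for (line <- scala.io.Source.fromFile("Twitter.txt").getLines) {
            val fields = line.split(" ")
            if (fields.length == 2) {
              System.setProperty("twitter4j.oauth." + fields(0), fields(1))
            }
          }

          val conf = new SparkConf().setMaster("local[*]").setAppName("PopularHashtags")
          val ssc = new StreamingContext(conf, Seconds(1))
          ssc.checkpoint("C:/checkpoint/")  // required by the windowed reduce below

          val tweets = TwitterUtils.createStream(ssc, None)
          val statuses = tweets.map(status => status.getText())
          val tweetwords = statuses.flatMap(tweetText => tweetText.split(" "))
          val hashtags = tweetwords.filter(word => word.startsWith("#"))
          val hashtagKeyValues = hashtags.map(hashtag => (hashtag, 1))
          val hashtagCounts = hashtagKeyValues.reduceByKeyAndWindow(
            (x, y) => x + y, (x, y) => x - y, Seconds(300), Seconds(1))

          // Sort each batch's counts in descending order and print the top results
          val sortedResults = hashtagCounts.transform(rdd => rdd.sortBy(x => x._2, false))
          sortedResults.print()

          ssc.start()
          ssc.awaitTermination()
        }
      }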
