
48. [Activity] Set up a Twitter Developer Account, and Stream Tweets


SPARK STREAMING IN ACTION

Activity

  • Now, to do this, we first need to set up a developer account with Twitter so that we can actually connect our Spark Streaming application to the Twitter API.

  • Go to the Twitter Apps website (apps.twitter.com, at the time of this course) and you should see a screen like this.

  • Go ahead and sign in with your Twitter account

  • If you don't have a Twitter account, you can sign up for one now!

  • You can press Create New App and input the following details

    Name: SparkScalaCourse AlvinToh
    Description: Messing around with Spark Streaming
    Website: http://www.sundog-soft.com (you can put whichever website you have; the sundog-soft site is just the default used in the course)
    Callback URL: (can be left empty)
  • Check the box to agree to the Developer Agreement

  • Click on Create your Twitter application and your Twitter Application will be created

  • Once you have access to the application console, we need to click on Keys and Access Tokens.

  • These are the credentials that we need in order to connect.

  • So let's click on that and we want to create an access token as well as our consumer key.

  • Now you have the Consumer Key (API Key), Consumer Secret (API Secret), Access Token, and Access Token Secret.

  • So leave that information up where you can get it easily

  • Next, go to the SparkScala source folder in the course materials; you should find a twitter.txt file there

  • Open the twitter.txt file, and copy and paste the keys you just got into this file.

  • Make sure there is exactly one space between each credential's name and the key itself, and no extra spaces at the end of anything (see the sketch at the end of this list)

  • You want to keep on doing this for the other credentials.

  • Of course, yours will be different; don't try using the default keys in twitter.txt, because that account is going to be deleted

  • After you are done, make sure you have a return at the end of the last line, and that there are no extra spaces or extra returns anywhere, as these might mess up the file

  • Once you are happy with it, save it

  • Now copy twitter.txt and paste it into your course folder, the SparkScala folder.
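
  • For reference, the filled-in twitter.txt should look something like this (placeholder values; the names on the left are the twitter4j.oauth property names that the setup code shown later expects):

    consumerKey YOUR_CONSUMER_KEY
    consumerSecret YOUR_CONSUMER_SECRET
    accessToken YOUR_ACCESS_TOKEN
    accessTokenSecret YOUR_ACCESS_TOKEN_SECRET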

Setting Up And Running The File

  • Next, open up the Spark-Eclipse IDE; we need to import some libraries for Scala and Spark that will let it actually talk to Twitter

  • Now if you're using something before Spark 2.0, that capability is just built into Spark itself.

  • But they actually removed it in Spark 2.0.0

  • So if you are using Spark 2.0.0 or newer, you have to do this next step first: go over to the SparkScala project

  • Right click on the course project in the IDE, and click on Properties.

  • Click on Java Build Path and press Add External JARs.

  • Browse to the SparkScala source folder and add all 3 of these JARs (an sbt alternative is sketched at the end of this section)

    dstream-twitter.jar
    twitter4j-core.jar
    twitter4j-stream.jar
  • Hit Apply and Close once you are done

  • Import PopularHashtags.scala from the source folder into the SparkScalaCourse project in the Spark-Eclipse IDE

  • Open PopularHashtags.scala and look at the code
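
  • By the way, if you would rather manage those three JARs as sbt dependencies instead of adding them by hand, roughly equivalent coordinates look like this (a sketch; the versions are assumptions, so match them to your Spark and Scala versions; for Spark 2.x the Twitter DStream connector is published under Apache Bahir):

    // build.sbt (sketch; versions are assumptions)
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-streaming" % "2.3.0" % "provided",
      "org.apache.bahir" %% "spark-streaming-twitter" % "2.2.0",
      "org.twitter4j" % "twitter4j-core" % "4.0.6",
      "org.twitter4j" % "twitter4j-stream" % "4.0.6"
    )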

Looking At The Code

    import org.apache.spark.streaming._
    import org.apache.spark.streaming.twitter._
    import org.apache.spark.streaming.StreamingContext._
  • Basically, we're importing all the stuff we need, including a bunch of Spark Streaming classes and the Spark Streaming Twitter package

  • This allows us to connect to Twitter and use Twitter as a Spark Streaming receiver
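
  • Note that the org.apache.spark.streaming.twitter package is no longer part of Spark itself as of Spark 2.0.0; it comes from the external dstream-twitter JAR (or the Apache Bahir connector) that you added to the build path earlier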

    /** Configures Twitter service credentials using twitter.txt in the main workspace directory */
    def setupTwitter() = {
        import scala.io.Source

        for (line <- Source.fromFile("../twitter.txt").getLines) {
            val fields = line.split(" ")
            if (fields.length == 2) {
                System.setProperty("twitter4j.oauth." + fields(0), fields(1))
            }
        }
    }
  • Here is where we actually load the twitter.txt file you just created; we parse it one line at a time and split each line on the space character

  • And we just set system properties based on those settings.

  • So assuming you format that file correctly, that should set up all the system properties needed to actually connect to Twitter successfully.

  • So that's the first thing we do: in our main function we set up those credentials for Twitter
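
  • Note that the snippet below uses a StreamingContext called ssc that these notes never show being created; in the course script it is set up at the top of main(), something like this (a sketch: one-second batches, running locally on all cores):

    // Assumed from the course script: a local StreamingContext with one-second batch intervals
    val ssc = new StreamingContext("local[*]", "PopularHashtags", Seconds(1))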

    // Create a DStream from Twitter using our streaming context
    val tweets = TwitterUtils.createStream(ssc, None)

    // Now extract the text of each status update into DStreams using map()
    val statuses = tweets.map(status => status.getText())

    // Blow out each word into a new DStream
    val tweetwords = statuses.flatMap(tweetText => tweetText.split(" "))
  • We'll call the TwitterUtils class to actually create a receiver that listens to a stream of tweets.

  • And we now have a stream called tweets that we can work with.

  • So we're not dealing with individual little RDDs here; we're dealing with the DStream as a whole, which is kind of cool. Keeping on with that, we're going to apply a flatMap to bust those tweets out into individual words, broken up by spaces.
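
  • For intuition, here is the same map/flatMap/filter idea on a plain Scala list, using a made-up tweet (this is not the course code, just an illustration):

    val sample = List("Learning #Spark with #Scala")
    val words = sample.flatMap(tweetText => tweetText.split(" "))  // List(Learning, #Spark, with, #Scala)
    val onlyTags = words.filter(word => word.startsWith("#"))      // List(#Spark, #Scala)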

    // Now eliminate anything that's not a hashtag
    val hashtags = tweetwords.filter(word => word.startsWith("#"))

    // Map each hashtag to a key/value pair of (hashtag, 1) so we can count them up by adding up the values
    val hashtagKeyValues = hashtags.map(hashtag => (hashtag, 1))

    // Now count them up over a 5 minute window sliding every one second
    val hashtagCounts = hashtagKeyValues.reduceByKeyAndWindow( (x,y) => x + y, (x,y) => x - y, Seconds(300), Seconds(1))
    //  You will often see this written in the following shorthand:
    //val hashtagCounts = hashtagKeyValues.reduceByKeyAndWindow( _ + _, _ - _, Seconds(300), Seconds(1))
  • The next thing we want to do is filter out anything that's not a hashtag

  • Next, we map each individual hashtag to a key/value pair with a count of 1, and reduce by key and window in the next line of code

  • Since the sliding window moves forward one second at a time, values that fall out of the window need to be subtracted back out, so we just say x minus y as the inverse function

  • Because with that sliding window, sometimes we have to take stuff out as well as add stuff in.
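
  • As a worked example with made-up numbers: if #spark had a count of 42 over the current five-minute window, the newest one-second batch adds 3, and the one-second batch that just slid out of the window had contributed 5, then the updated count is 42 + 3 - 5 = 40, with no need to re-sum all five minutes of data.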

    // Sort the results by the count values
    val sortedResults = hashtagCounts.transform(rdd => rdd.sortBy(x => x._2, false))

    // Print the top 10
    sortedResults.print

    // Set a checkpoint directory, and kick it all off
    // I could watch this all day!
    ssc.checkpoint("C:/checkpoint/")
    ssc.start()
    ssc.awaitTermination()
  • Next, we call transform on hashtagCounts to get a DStream whose RDDs are sorted by count value, and we print the top 10 results

  • The last step sets a checkpoint directory (windowed operations need one so their state can be recovered), starts the streaming context, and then blocks until the stream terminates or fails.

  • For me, I could see the results, but some errors occurred, which may be related to the Spark Twitter drivers being out of date.

  • The top 10 results were mostly Korean hashtags, which may be surprising if you expected the results to be in English.
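
  • Putting it all together, here is a compact sketch of the full script as walked through above (the StreamingContext setup is an assumption, as noted earlier; pick a checkpoint path that exists and is writable on your OS):

    import org.apache.spark.streaming._
    import org.apache.spark.streaming.twitter._

    object PopularHashtags {

      /** Configures Twitter service credentials using twitter.txt in the main workspace directory */
      def setupTwitter() = {
        import scala.io.Source
        for (line <- Source.fromFile("../twitter.txt").getLines) {
          val fields = line.split(" ")
          if (fields.length == 2) {
            System.setProperty("twitter4j.oauth." + fields(0), fields(1))
          }
        }
      }

      def main(args: Array[String]): Unit = {
        setupTwitter()

        // Assumed setup: a local StreamingContext with one-second batch intervals
        val ssc = new StreamingContext("local[*]", "PopularHashtags", Seconds(1))

        // Stream of tweets -> status text -> individual words -> hashtags only
        val tweets = TwitterUtils.createStream(ssc, None)
        val statuses = tweets.map(status => status.getText())
        val tweetwords = statuses.flatMap(tweetText => tweetText.split(" "))
        val hashtags = tweetwords.filter(word => word.startsWith("#"))

        // Count hashtags over a 5-minute window that slides every second
        val hashtagKeyValues = hashtags.map(hashtag => (hashtag, 1))
        val hashtagCounts = hashtagKeyValues.reduceByKeyAndWindow(
          (x, y) => x + y, (x, y) => x - y, Seconds(300), Seconds(1))

        // Sort each batch by count and print the top 10
        val sortedResults = hashtagCounts.transform(rdd => rdd.sortBy(x => x._2, false))
        sortedResults.print

        // Pick a checkpoint path that exists and is writable on your OS
        ssc.checkpoint("C:/checkpoint/")
        ssc.start()
        ssc.awaitTermination()
      }
    }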
