48. [Activity] Set up a Twitter Developer Account, and Stream Tweets
SPARK STREAMING IN ACTION
Activity
To do this, we first need to set up a developer account with Twitter so that we can actually connect our Spark Streaming application to the Twitter API.
Go to the Twitter Apps website, and you should see a screen like this.
Go ahead and sign in with your Twitter account
If you don't have a Twitter Account, you can Sign up now!
You can press Create New App and input the following details
Check and agree to the Developer Agreement
Click on Create your Twitter application and your Twitter Application will be created
After you have access to the application console, we need to click on Keys and Access Tokens.
These are the credentials that we need in order to connect.
So let's click on that and we want to create an access token as well as our consumer key.
Now, you have the Consumer Key (API Key), Consumer Secret(API Secret), Access Token and an Access Token Secret.
So leave that information up where you can get it easily
Next, go to the SparkScala source folder for the course materials; you should find a twitter.txt file there
Open twitter.txt file, and copy and paste the keys that you just got into this file.
Make sure there is exactly one space between the credential name and the key itself, and no extra spaces at the end of anything
You want to keep on doing this for the other credentials.
Of course yours will be different, and don't try using the default keys already in twitter.txt, because that account is going to be deleted
After you are done, make sure there is a return at the end of each line, and no extra spaces or extra returns or anything like that
As this might mess up the file; once you are happy with it, save it.
Now copy twitter.txt and paste it into your course folder, the SparkScala folder.
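For reference, a correctly formatted twitter.txt might look roughly like the one below, one credential per line with a single space between the name and the value. The names shown assume the twitter4j OAuth property names the course code expects; the values are just placeholders for your own keys.

```
consumerKey YOUR_CONSUMER_KEY_HERE
consumerSecret YOUR_CONSUMER_SECRET_HERE
accessToken YOUR_ACCESS_TOKEN_HERE
accessTokenSecret YOUR_ACCESS_TOKEN_SECRET_HERE
```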
Setting Up And Running The File
Next, open up the Spark-Eclipse IDE; we need to import some libraries that let Scala and Spark actually talk to Twitter
Now if you're using something before Spark 2.0, that capability is just built into Spark itself.
But they actually removed it in Spark 2.0.0
So if you are using Spark 2.0.0 or newer, you have to do this next step first.
Right-click the SparkScala course project in the IDE, and click on Properties.
Click on Java Build Path and press Add External JARs.
Browse to the SparkScala source folder and add all three of these JARs
Hit Apply and Close once you are done
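If you would rather manage dependencies with sbt instead of attaching JARs by hand, a roughly equivalent dependency block might look like the sketch below. The Apache Bahir artifact (which is where the Twitter connector lives for Spark 2.x) and the version numbers are assumptions, so match them to your own Spark and Scala versions.

```scala
// build.sbt (sketch) - versions are examples only
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming"          % "2.4.0" % "provided",
  "org.apache.bahir" %% "spark-streaming-twitter"  % "2.4.0",
  "org.twitter4j"    %  "twitter4j-core"           % "4.0.7",
  "org.twitter4j"    %  "twitter4j-stream"         % "4.0.7"
)
```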
Import PopularHashtags.scala from the source folder into the SparkScalaCourse project in the Spark-Eclipse IDE
Open PopularHashtags.scala and look at the code
Looking At The Code
Basically, we're importing all the stuff we need, including a bunch of Spark Streaming classes and the Spark Streaming Twitter package
This allows us to connect to Twitter and use Twitter as a Spark Streaming receiver
Here is where we actually load that twitter.txt file that you just created; we parse it one line at a time, splitting each line on that space character
And we just set system properties based on those settings.
So assuming you format that file correctly, that should set up all the system properties needed to actually connect to Twitter successfully.
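Here is a rough sketch of what that credential-loading code looks like; the "../twitter.txt" path is an assumption, so point it at wherever you copied the file.

```scala
import scala.io.Source

// Read twitter.txt, split each line on its single space, and turn each pair
// into a twitter4j.oauth.* system property so the Twitter receiver can authenticate.
def setupTwitter(): Unit = {
  for (line <- Source.fromFile("../twitter.txt").getLines()) {
    val fields = line.split(" ")
    if (fields.length == 2) {
      System.setProperty("twitter4j.oauth." + fields(0), fields(1))
    }
  }
}
```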
So that's the first thing we do in our main function: set up those credentials for Twitter
We'll call the TwitterUtils class to actually create a receiver that listens to a stream of tweets.
And we now have a stream called tweets that we can work with.
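Putting those pieces together, the setup might look roughly like this sketch; the app name, the local[*] master, the one-second batch interval, and pulling out the tweet text with getText are assumptions.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

// Run locally on all cores with one-second batches of data.
val ssc = new StreamingContext("local[*]", "PopularHashtags", Seconds(1))

// Create a DStream of twitter4j Status objects; authentication comes from the
// twitter4j.oauth.* system properties we set from twitter.txt.
val tweets = TwitterUtils.createStream(ssc, None)

// Pull out just the text of each tweet so the later steps can work with strings.
val statuses = tweets.map(status => status.getText())
```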
So we're not dealing with individual little RDDs here; we're dealing with the DStream as a whole, which is kind of cool. Keeping on with that, we're going to apply a flatMap to bust those tweets out into individual words, broken up by spaces.
Next thing we want to do is to filter out anything that's not a hashtag
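As a sketch, assuming statuses is a DStream of tweet text as in the snippet above, those two steps might look like this hypothetical helper:

```scala
import org.apache.spark.streaming.dstream.DStream

// Break each tweet into words on spaces, then keep only the words that are hashtags.
def extractHashtags(statuses: DStream[String]): DStream[String] =
  statuses
    .flatMap(tweetText => tweetText.split(" "))
    .filter(word => word.startsWith("#"))
```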
Next, we map each individual hashtag to a count of 1, and then reduce by key and window in the next line of code
Since the sliding window moves forward one second at a time, we also need to remove old counts as data ages out of the window, so we provide an inverse function that just says x minus y (see the sketch below)
Because with a sliding window, sometimes we have to take stuff out as well as put stuff in.
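Here is a sketch of that counting step as a hypothetical helper; the 5-minute window length is an assumption, while the one-second slide matches what was just described.

```scala
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// Map each hashtag to (hashtag, 1), then count over a sliding window.
def countHashtags(hashtags: DStream[String]): DStream[(String, Int)] =
  hashtags
    .map(hashtag => (hashtag, 1))
    .reduceByKeyAndWindow(
      (x, y) => x + y,   // add counts for new data entering the window
      (x, y) => x - y,   // subtract counts for old data leaving the window
      Seconds(300),      // window length (assumed: 5 minutes)
      Seconds(1))        // slide interval
```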
Next, we call transform on hashtagCounts to get a DStream that is sorted by count, and we print the top 10 results
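A sketch of that sorting step is below; print() happens to show the first 10 elements of each batch, which is where the top 10 comes from.

```scala
import org.apache.spark.streaming.dstream.DStream

// Sort each batch of (hashtag, count) pairs by count, descending.
def sortByCount(hashtagCounts: DStream[(String, Int)]): DStream[(String, Int)] =
  hashtagCounts.transform(rdd => rdd.sortBy(pair => pair._2, ascending = false))

// Usage sketch: sortByCount(countHashtags(hashtags)).print()
```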
The last step is to explicitly start the Spark Streaming process and let it run until it ends or fails.
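A sketch of that last step is below; note that the inverse-reduce form of reduceByKeyAndWindow also requires a checkpoint directory, and the path used here is just an example.

```scala
// Set a checkpoint directory (needed for the windowed reduce), then kick it all off.
ssc.checkpoint("/tmp/checkpoint/")
ssc.start()
ssc.awaitTermination()
```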
When I ran it, I could see the results, but some errors occurred, which may be related to the Spark drivers being out of date.
The top 10 results were Korean hashtags, which may be surprising if you expected the results to be in English.