33. [Activity] Packaging driver scripts with SBT
PACKAGING WITH SBT
What is SBT
Like Maven for Scala
Manages your library dependency tree for you
Can package up all of your dependencies into a self-contained JAR
If you have many dependencies (or depend on a library that in turn has lots of dependencies), it makes life a lot easier than passing a ton of --jars options
Get it from scala-sbt.org
Using SBT
Set up a directory structure like this:
(project root)
    src/
        main/
            scala/
    project/
Your Scala source files go in the src/main/scala folder
In your project folder, create an assembly.sbt file that contains one line, sketched below (the plugin version shown must match your sbt release):
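```scala
// project/assembly.sbt: registers the sbt-assembly plugin.
// 0.14.3 is an example that pairs with sbt 0.13.11; check the
// sbt-assembly docs for the release matching your sbt version.
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
```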
Check the latest sbt documentation as this will change over time. This works with sbt 0.13.11
Creating An SBT Build File
At the root (alongside the src and project directories) create a build.sbt file
Example (a minimal sketch; adapt the names and version numbers to your own project and installation):
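```scala
// build.sbt: a minimal sketch. The name and version match the JAR
// produced later in this activity; scalaVersion and the Spark
// version are assumptions to adjust to your own setup.
name := "MovieSimilarities1M"

version := "1.0"

scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  // "provided" keeps Spark itself out of the fat JAR, since the
  // cluster already supplies Spark at runtime
  "org.apache.spark" %% "spark-core" % "1.6.1" % "provided"
)
```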
Adding Dependencies
Say, for example, you need to depend on Kafka, which isn't built into Spark. You could add a line like the following (the artifact and version are illustrative; match them to your Spark build)
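```scala
// No "provided" here, so sbt-assembly bundles Kafka support into the JAR
"org.apache.spark" %% "spark-streaming-kafka" % "1.6.1"
```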
to your library dependencies, and sbt will automatically fetch it and everything it needs and bundle it into your JAR file.
Make sure you use the correct Spark version number, and note that I did NOT use "provided" on the line.
Bundling It Up
Just run:
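```
sbt assembly
```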
...from the root folder, and it does its magic
You'll find the JAR in target/scala-2.10 (or whatever Scala version you're building against)
Here's The Cool Thing
This JAR is self-contained! Just run:
```
spark-submit <jar file>
```
and it'll run, even without specifying a class! Let's try it out.
Activity
Go to the SBT site (scala-sbt.org), navigate to the Download tab, download the Windows version, and install it
You can also go to MovieLens, click on the 1M Dataset, and read README.TXT to understand that dataset
Import MovieSimilarities1M.scala from the course materials into your SparkScalaCourse project in the Scala Eclipse IDE
Open it and note the differences in the delimiters and in the way the data is parsed
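For instance, here is a sketch of parsing one line of ratings.dat (the helper name is hypothetical; the 1M dataset separates fields with "::" rather than the tabs used by the 100K dataset):

```scala
// Hypothetical helper: ratings.dat lines look like
// UserID::MovieID::Rating::Timestamp, so we split on "::"
def parseRating(line: String): (Int, (Int, Double)) = {
  val fields = line.split("::")
  (fields(0).toInt, (fields(1).toInt, fields(2).toDouble))
}
```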
The ratings.dat file is instead loaded from Amazon S3 cloud storage
Amazon S3 is a distributed cloud storage service, and it is available to every node on an Amazon Elastic MapReduce cluster
Keeping the data there means the storage can handle its size and stores it redundantly
So when we set up an Amazon Elastic MapReduce (EMR) cluster, it comes preconfigured to take the best advantage of the cluster it runs on
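A minimal sketch of loading from S3 in the driver script, assuming Spark 1.x (the bucket name "my-bucket" is hypothetical; substitute your own):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LoadFromS3Sketch {
  def main(args: Array[String]): Unit = {
    // On EMR, every node in the cluster can read this S3 path
    val sc = new SparkContext(new SparkConf().setAppName("MovieSimilarities1M"))
    val lines = sc.textFile("s3n://my-bucket/ml-1m/ratings.dat")
    println("Loaded " + lines.count() + " ratings")
  }
}
```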
In the SparkScala course materials folder, navigate to the sbt folder, and there will be a build.sbt inside it
Change the version inside build.sbt according to your setup
If you navigate to the sbt/project folder, you will see an assembly.sbt file inside it
The sbt-assembly plugin version it references may need to change depending on your sbt version
This line adds access to the sbt assembly command that I can use to actually make my final JAR file
The sbt folder also contains a src/main/scala folder, which holds the MovieSimilarities1M.scala file.
To run sbt, open a command prompt as Administrator and navigate to the sbt folder inside the SparkScala course materials
Run dir to see the files under the sbt directory
Then run the sbt assembly command to build your JAR
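A sketch of the session (the install path is hypothetical; use wherever you put the course materials):

```
cd C:\SparkScala\sbt
dir
sbt assembly
```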
If you are unsure of your sbt version, type sbt about in the main cmd window to check it, then put the matching plugin version into assembly.sbt
For my setup, my sbt version is 1.1.6
So consult the sbt-assembly version reference page to see which plugin version your assembly.sbt needs
For sbt version 1.1.6, I would need to change the sbt-assembly plugin version to 0.14.6
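In that case, the single line in project/assembly.sbt becomes:

```scala
// assembly.sbt for sbt 1.1.6, per the sbt-assembly compatibility table
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")
```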
After the build runs, there should now be a target folder containing a scala-2.10 folder (matching the Scala version specified in build.sbt), which in turn contains MovieSimilarities1M-assembly-1.0.jar
That is the actual JAR file that I need to distribute to my cluster.
And that is my Spark driver script that I can run on a real cluster.