Apache 2.0 Spark with Scala
  • Introduction
  • Introduction
    • Introduction
  • Section 1: Getting Started
    • 1. Warning about Java 9 and Spark 2.3!
    • 2. Introduction, and Getting Set Up
    • 3. [Activity] Create a Histogram of Real Movie Ratings with Spark!
  • Section 2: Scala Crash Course
    • 4. [Activity] Scala Basics, Part 1
    • 5. [Exercise] Scala Basics, Part 2
    • 6. [Exercise] Flow Control in Scala
    • 7. [Exercise] Functions in Scala
    • 8. [Exercise] Data Structures in Scala
  • Section 3: Spark Basics and Simple Examples
    • 9. Introduction to Spark
    • 10. Introducing RDD's
    • 11. Ratings Histogram Walkthrough
    • 12. Spark Internals
    • 13. Key/Value RDD's, and the Average Friends by Age example
    • 14. [Activity] Running the Average Friends by Age Example
    • 15. Filtering RDD's, and the Minimum Temperature by Location Example
    • 16. [Activity] Running the Minimum Temperature Example, and Modifying it for Maximum
    • 17. [Activity] Counting Word Occurrences using Flatmap()
    • 18. [Activity] Improving the Word Count Script with Regular Expressions
    • 19. [Activity] Sorting the Word Count Results
    • 20. [Exercise] Find the Total Amount Spent by Customer
    • 21. [Exercise] Check your Results, and Sort Them by Total Amount Spent
    • 22. Check Your Results and Implementation Against Mine
  • Section 4: Advanced Examples of Spark Programs
    • 23. [Activity] Find the Most Popular Movie
    • 24. [Activity] Use Broadcast Variables to Display Movie Names
    • 25. [Activity] Find the Most Popular Superhero in a Social Graph
    • 26. Superhero Degrees of Separation: Introducing Breadth-First Search
    • 27. Superhero Degrees of Separation: Accumulators, and Implementing BFS in Spark
    • 28. Superhero Degrees of Separation: Review the code, and run it!
    • 29. Item-Based Collaborative Filtering in Spark, cache(), and persist()
    • 30. [Activity] Running the Similar Movies Script using Spark's Cluster Manager
    • 31. [Exercise] Improve the Quality of Similar Movies
  • Section 5: Running Spark on a Cluster
    • 32. [Activity] Using spark-submit to run Spark driver scripts
    • 33. [Activity] Packaging driver scripts with SBT
    • 34. Introducing Amazon Elastic MapReduce
    • 35. Creating Similar Movies from One Million Ratings on EMR
    • 36. Partitioning
    • 37. Best Practices for Running on a Cluster
    • 38. Troubleshooting, and Managing Dependencies
  • Section 6: SparkSQL, DataFrames, and DataSets
    • 39. Introduction to SparkSQL
    • 40. [Activity] Using SparkSQL
    • 41. [Activity] Using DataFrames and DataSets
    • 42. [Activity] Using DataSets instead of RDD's
  • Section 7: Machine Learning with MLLib
    • 43. Introducing MLLib
    • 44. [Activity] Using MLLib to Produce Movie Recommendations
    • 45. [Activity] Using DataFrames with MLLib
    • 46. [Activity] Using DataFrames with MLLib
  • Section 8: Intro to Spark Streaming
    • 47. Spark Streaming Overview
    • 48. [Activity] Set up a Twitter Developer Account, and Stream Tweets
    • 49. Structured Streaming
  • Section 9: Intro to GraphX
    • 50. GraphX, Pregel, and Breadth-First-Search with Pregel.
    • 51. [Activity] Superhero Degrees of Separation using GraphX
  • Section 10: You Made It! Where to Go from Here.
    • 52. Learning More, and Career Tips
    • 53. Bonus Lecture: Discounts on my other "Big Data" / Data Science Courses.

33. [Activity] Packaging driver scripts with SBT

PACKAGING WITH SBT

What is SBT

  • Like Maven for Scala

  • Manages your library dependency tree for you

  • Can package up all of your dependencies into a self-contained JAR

  • If you have many dependencies (or depend on a library that in turn has lots of dependencies), it makes life a lot easier than passing a ton of -jars options

  • Get it from scala-sbt.org
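
The -jars pain point can be made concrete; a hedged sketch (the JAR names here are illustrative, not real artifacts):

```shell
# Without a self-contained JAR, every dependency must be passed by hand:
spark-submit --jars dep-one.jar,dep-two.jar,dep-three.jar MyDriver.jar

# With an sbt-assembly "fat" JAR, everything travels in one artifact:
spark-submit MyDriver-assembly-1.0.jar
```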

Using SBT

  • Set up a directory structure like this:

      (project root)
      ├── project/
      └── src/
          └── main/
              └── scala/
  • Your Scala source files go in the src/main/scala folder

  • In your project folder, create an assembly.sbt file that contains one line:

      addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
  • Check the latest sbt-assembly documentation, as this version will change over time. This one works with sbt 0.13.11

Creating An SBT Build File

  • At the root (alongside the src and project directories) create a build.sbt file

  • Example:

      name := "PopularMovies"
    
      version := "1.0"
    
      organization := "com.sundogsoftware"
    
      scalaVersion := "2.10.6"
    
      libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.6.1" % "provided"
      )
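
Two details in that file are standard sbt behavior worth knowing: the %% operator appends your Scala version to the artifact name, and "provided" keeps spark-core out of the assembled JAR, because spark-submit supplies Spark at runtime. In comment form:

```
// "org.apache.spark" %% "spark-core" % "1.6.1" with scalaVersion 2.10.6
//   resolves to the artifact spark-core_2.10, version 1.6.1
// "provided" = available at compile time, but excluded from the fat JAR,
//   since the Spark runtime on the cluster already includes it
```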

Adding Dependencies

  • Say for example you need to depend on Kafka, which isn't built into Spark. You could add

      "org.apache.spark" %% "spark-streaming-kafka" % "1.6.1",
  • to your library dependencies, and sbt will automatically fetch it and everything it needs and bundle it into your JAR file.

  • Make sure you use the correct Spark version number, and note that I did NOT use "provided" on that line, since the Kafka support does need to be bundled into the JAR.

Bundling It Up

  • Just run:

      sbt assembly
  • ...from the root folder, and it does its magic

  • You'll find the JAR in target/scala-2.10 (or whatever Scala version you're building against)

Here's The Cool Thing

  • This JAR is self-contained! Just use spark-submit <jar file> and it'll run, even without specifying a class!

  • Let's try it out.

Activity

  • You can import MovieSimilarities1M.scala from the source folder into your SparkScalaCourse project in the Scala IDE for Eclipse

  • Open it and note the differences in the delimiters and in the way the data is parsed

    val data = sc.textFile("s3n://sundog-spark/ml-1m/ratings.dat")
  • The ratings.dat file is loaded from Amazon S3 cloud storage instead
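
The main parsing difference is the delimiter: ml-100k's u.data is tab-separated, while ml-1m's ratings.dat separates fields with "::" (per the MovieLens README). A minimal sketch of the parsing, with my own illustrative function name:

```scala
// ratings.dat lines look like: UserID::MovieID::Rating::Timestamp
def parseRating(line: String): (Int, (Int, Double)) = {
  val fields = line.split("::")
  // Key by user ID; the value pairs the movie ID with the rating
  (fields(0).toInt, (fields(1).toInt, fields(2).toDouble))
}

println(parseRating("1::1193::5::978300760"))  // (1,(1193,5.0))
```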

  • Amazon S3 is a distributed cloud storage service, and it is available to every node on my Amazon Elastic MapReduce cluster

  • That puts the data somewhere that can handle its size and store it redundantly

  • So when we set up an Amazon Elastic MapReduce (EMR) cluster, Spark comes preconfigured to take best advantage of the cluster it's running on
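
One reason the same JAR runs well on EMR is that the driver does not hard-code a master or memory settings, so spark-submit can inject the cluster's configuration. A minimal sketch of that pattern (assuming Spark is on the classpath; this is not the complete course script):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MovieSimilarities1M {
  def main(args: Array[String]): Unit = {
    // No setMaster("local[*]") here: on EMR, spark-submit supplies the
    // master, executor memory, and core counts from the cluster's config.
    val conf = new SparkConf().setAppName("MovieSimilarities1M")
    val sc = new SparkContext(conf)
    val data = sc.textFile("s3n://sundog-spark/ml-1m/ratings.dat")
    // ... similarity computation goes here ...
    sc.stop()
  }
}
```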

  • In the SparkScala source folder, navigate to the sbt folder; there will be a build.sbt file inside it

  • Change the version inside build.sbt according to your setup

    name := "MovieSimilarities1M"

    version := "1.0"

    organization := "com.sundogsoftware"

    scalaVersion := "2.11.8"

    libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "2.3.0" % "provided"
    )
  • If you navigate to the sbt/project folder, you will see an assembly.sbt file inside it

    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")
  • This plugin version may change depending on your sbt version

  • This line adds the sbt assembly command, which I can use to build my final JAR file

  • The sbt folder also contains a src/main/scala folder, with the MovieSimilarities1M.scala file inside it.

  • To run sbt, open cmd as Administrator and navigate to the sbt folder inside the SparkScala source folder

      cd C:\Users\User\Desktop\Alvin Programming Files\Data Science Courses\Apache 2.0 Spark with Scala\SparkScala\sbt
    dir
  • dir lists the files under the sbt directory

    sbt assembly
  • This runs the SBT assembly command to build your JAR

    sbt about
  • If you are unsure of your sbt version, type sbt about at the command prompt to check it, then put the matching sbt-assembly version into assembly.sbt

  • For my setup, my sbt version is 1.1.6

  • For sbt version 1.1.6, I need to set the sbt-assembly plugin version to 0.14.6

      addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")
  • After the operation runs, there should now be a target folder containing a scala-2.11 folder (matching the scalaVersion I specified in build.sbt), which contains MovieSimilarities1M-assembly-1.0.jar

  • So that is the actual JAR file that I need to distribute to my cluster.

  • And that is my Spark driver script that I can run on a real cluster.


Last updated 6 years ago

Go to SBT at scala-sbt.org and navigate to the Download tab to download the Windows version and install it

You can also go to the MovieLens site and click on the 1M Dataset, then click README.TXT to understand that dataset

Check the SBT version reference page for sbt-assembly to change your assembly.sbt