Apache 2.0 Spark with Scala
  • Introduction
  • Section 1: Getting Started
    • 1. Warning about Java 9 and Spark 2.3!
    • 2. Introduction, and Getting Set Up
    • 3. [Activity] Create a Histogram of Real Movie Ratings with Spark!
  • Section 2: Scala Crash Course
    • 4. [Activity] Scala Basics, Part 1
    • 5. [Exercise] Scala Basics, Part 2
    • 6. [Exercise] Flow Control in Scala
    • 7. [Exercise] Functions in Scala
    • 8. [Exercise] Data Structures in Scala
  • Section 3: Spark Basics and Simple Examples
    • 9. Introduction to Spark
    • 10. Introducing RDD's
    • 11. Ratings Histogram Walkthrough
    • 12. Spark Internals
    • 13. Key/Value RDD's, and the Average Friends by Age example
    • 14. [Activity] Running the Average Friends by Age Example
    • 15. Filtering RDD's, and the Minimum Temperature by Location Example
    • 16. [Activity] Running the Minimum Temperature Example, and Modifying it for Maximum
    • 17. [Activity] Counting Word Occurrences using Flatmap()
    • 18. [Activity] Improving the Word Count Script with Regular Expressions
    • 19. [Activity] Sorting the Word Count Results
    • 20. [Exercise] Find the Total Amount Spent by Customer
    • 21. [Exercise] Check your Results, and Sort Them by Total Amount Spent
    • 22. Check Your Results and Implementation Against Mine
  • Section 4: Advanced Examples of Spark Programs
    • 23. [Activity] Find the Most Popular Movie
    • 24. [Activity] Use Broadcast Variables to Display Movie Names
    • 25. [Activity] Find the Most Popular Superhero in a Social Graph
    • 26. Superhero Degrees of Separation: Introducing Breadth-First Search
    • 27. Superhero Degrees of Separation: Accumulators, and Implementing BFS in Spark
    • 28. Superhero Degrees of Separation: Review the code, and run it!
    • 29. Item-Based Collaborative Filtering in Spark, cache(), and persist()
    • 30. [Activity] Running the Similar Movies Script using Spark's Cluster Manager
    • 31. [Exercise] Improve the Quality of Similar Movies
  • Section 5: Running Spark on a Cluster
    • 32. [Activity] Using spark-submit to run Spark driver scripts
    • 33. [Activity] Packaging driver scripts with SBT
    • 34. Introducing Amazon Elastic MapReduce
    • 35. Creating Similar Movies from One Million Ratings on EMR
    • 36. Partitioning
    • 37. Best Practices for Running on a Cluster
    • 38. Troubleshooting, and Managing Dependencies
  • Section 6: SparkSQL, DataFrames, and DataSets
    • 39. Introduction to SparkSQL
    • 40. [Activity] Using SparkSQL
    • 41. [Activity] Using DataFrames and DataSets
    • 42. [Activity] Using DataSets instead of RDD's
  • Section 7: Machine Learning with MLLib
    • 43. Introducing MLLib
    • 44. [Activity] Using MLLib to Produce Movie Recommendations
    • 45. [Activity] Using DataFrames with MLLib
    • 46. [Activity] Using DataFrames with MLLib
  • Section 8: Intro to Spark Streaming
    • 47. Spark Streaming Overview
    • 48. [Activity] Set up a Twitter Developer Account, and Stream Tweets
    • 49. Structured Streaming
  • Section 9: Intro to GraphX
    • 50. GraphX, Pregel, and Breadth-First Search with Pregel
    • 51. [Activity] Superhero Degrees of Separation using GraphX
  • Section 10: You Made It! Where to Go from Here.
    • 52. Learning More, and Career Tips
    • 53. Bonus Lecture: Discounts on my other "Big Data" / Data Science Courses.

2. Introduction, and Getting Set Up



GETTING SET UP (Install Java, Spark, Scala, Eclipse)

Install a Java Development Kit (JDK)

  • Download the JDK from oracle.com

  • Accept the default configurations

Install Spark (pre-built)

  • Download from spark.apache.org

  • Choose a Spark release of 2.0.0 or later

  • Choose the package type: Pre-built for Hadoop 2.7 and later

  • Download the Spark tgz file (tgz is a Unix compression format)

  • On Windows, you may need a third-party program to uncompress the tgz file

  • Extract the Spark tgz file, then copy all the files inside to a new folder named spark on the (C:) drive

  • Change Spark's logging configuration: go to spark/conf and rename log4j.properties.template to log4j.properties

  • Open log4j.properties with whatever text editor you have (I use Sublime Text 3)

  • Change the following setting, so that only errors are logged to the console

    # Set everything to be logged to the console
    log4j.rootCategory=ERROR, console
  • Install winutils.exe and set HADOOP_HOME

  • Create a new folder called winutils on the (C:) drive, and inside it a folder called bin

  • Copy winutils.exe into C:\winutils\bin

Set Up SPARK_HOME, JAVA_HOME, and PATH Environment Variables

  • To set up the Windows environment variables, right-click the Windows icon in the bottom-left corner and open Control Panel

  • Click System and Security, then System, then Advanced system settings, and click Environment Variables

  • Click New under User variables

  • Input in the following details for New User Variable

      Variable name: SPARK_HOME
      Variable value: C:\spark
    
      Variable name: JAVA_HOME
      Variable value: C:\Program Files\Java\jdk1.8.0_172
    
      Variable name: HADOOP_HOME
      Variable value: C:\winutils
  • Now select Path under User variables, click Edit, then click New and add the following entries

      %SPARK_HOME%\bin
    
      %JAVA_HOME%\bin
  • Press OK to close all of the settings dialogs
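Once the variables are saved, a quick sanity check can be done from plain Scala, since Scala exposes environment variables through sys.env. This is only an illustrative sketch: the EnvCheck object and its missing helper are hypothetical names, not part of the course material; only the variable names SPARK_HOME, JAVA_HOME, and HADOOP_HOME come from the steps above.

```scala
object EnvCheck {
  // The three variables configured in the steps above
  val required = Seq("SPARK_HOME", "JAVA_HOME", "HADOOP_HOME")

  // Returns the names that are missing from the given environment map
  def missing(env: Map[String, String]): Seq[String] =
    required.filterNot(env.contains)

  def main(args: Array[String]): Unit = {
    missing(sys.env) match {
      case Nil => println("All Spark environment variables are set")
      case m   => println(s"Missing: ${m.mkString(", ")}")
    }
  }
}
```

Note that a command prompt opened before you saved the variables will not see them; open a fresh one before checking.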

Install Scala IDE (bundled with Eclipse)

  • Download the Scala IDE (bundled with Eclipse) from scala-ide.org

  • Extract and copy the eclipse folder to a new folder eclipse on the (C:) drive

  • Create a desktop shortcut to C:\eclipse\eclipse.exe

  • Open up a Windows Command Prompt as Administrator

      # Go to your Spark directory and list its contents
      cd C:\spark
      dir
    
      # spark-shell.cmd is inside the bin folder of the Spark directory
      cd bin
    
      # Launch the interactive Spark shell
      spark-shell
    
      # Inside the shell (at the scala> prompt), check that Spark works
      # by creating an RDD from the README file
      val rdd = sc.textFile("../README.md")
    
      # Count the number of lines in that file
      rdd.count()
  • To exit just hit Ctrl-D
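In the shell session above, sc.textFile builds an RDD whose elements are the lines of the file, and count() returns how many elements there are. As a rough mental model only (a real RDD is distributed and lazily evaluated, unlike this sketch), the line count is equivalent to splitting the text on newlines. The LineCountSketch object and its sample string below are illustrative, not taken from the course:

```scala
object LineCountSketch {
  // What rdd.count() reports for a text file: the number of lines
  def countLines(text: String): Long =
    text.split("\n").length.toLong

  def main(args: Array[String]): Unit = {
    // Illustrative stand-in for the contents of README.md
    val sample = "Apache Spark\nis a general engine\nfor large-scale data processing"
    println(countLines(sample)) // prints 3
  }
}
```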

Detailed, Written Steps At SunDog Website

  • Detailed, written versions of these setup steps are available at the SunDog website