9. Introduction to Spark
"A fast and general engine for large-scale data processing"
"Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk."
DAG Engine (directed acyclic graph) optimizes workflows
Who uses Spark?
Amazon
eBay: log analysis and aggregation
NASA JPL: Deep Space Network
Groupon
TripAdvisor
Yahoo
Code in Python, Java, or Scala
Built around one main concept: the Resilient Distributed Dataset (RDD)
Why Scala?
Spark itself is written in Scala
Scala's functional programming model is a good fit for distributed processing
Gives you fast performance (Scala compiles to Java bytecode)
Less code & boilerplate stuff than Java
Python is slow in comparison
But...
You probably don't know Scala
So we'll have to learn the basics first.
It's not as hard as you think!
Scala code in Spark looks a LOT like Python code.
Python code to square numbers in a data set:
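A minimal sketch of what that Python snippet might look like (the local master setting, app name, and sample values are placeholders, not from the original):

```python
from pyspark import SparkConf, SparkContext

# Placeholder setup: run locally with an arbitrary app name
conf = SparkConf().setMaster("local").setAppName("SquareNumbers")
sc = SparkContext(conf=conf)

# Turn a small Python list into an RDD and square every element
nums = sc.parallelize([1, 2, 3, 4])
squared = nums.map(lambda x: x * x).collect()

print(squared)  # [1, 4, 9, 16]
sc.stop()
```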
Scala code to square numbers in a data set:
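And a sketch of the Scala equivalent; apart from `val` declarations and the `x => x * x` lambda syntax, the structure is the same:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SquareNumbers {
  def main(args: Array[String]): Unit = {
    // Placeholder setup: run locally with an arbitrary app name
    val conf = new SparkConf().setMaster("local").setAppName("SquareNumbers")
    val sc = new SparkContext(conf)

    // Turn a small list into an RDD and square every element
    val nums = sc.parallelize(List(1, 2, 3, 4))
    val squared = nums.map(x => x * x).collect()

    println(squared.mkString(", "))  // 1, 4, 9, 16
    sc.stop()
  }
}
```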
Process Flow for Spark
Diagram: the Driver Program (which owns the Spark Context) talks to a Cluster Manager (Spark's built-in standalone manager or YARN), which distributes work across Executors; each Executor runs tasks and keeps a cache.
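A sketch of how that flow shows up in code (the master URLs, host name, and sample job below are illustrative placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ProcessFlowDemo {
  def main(args: Array[String]): Unit = {
    // The driver program owns the SparkContext; the master URL picks the cluster manager:
    //   "local[*]"                 - run executors as threads on this machine
    //   "spark://master-host:7077" - Spark's built-in standalone cluster manager (placeholder host)
    //   "yarn"                     - let Hadoop YARN allocate the executors
    val conf = new SparkConf().setMaster("local[*]").setAppName("ProcessFlowDemo")
    val sc = new SparkContext(conf)

    // Work is split into tasks that run on the executors; cache() asks the
    // executors to keep the resulting partitions in memory for reuse.
    val doubled = sc.parallelize(1 to 100).map(_ * 2).cache()
    println(doubled.count())

    sc.stop()
  }
}
```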
Spark Core
Libraries built on top of Spark Core:
Spark Streaming
Spark SQL
MLlib
GraphX