9. Introduction to Spark

WHAT IS SPARK?

"A fast and general engine for large-scale data processing"

It's Scalable

Process Flow for Spark

  • Driver Program (Spark Context) coordinates the job

  • Cluster Manager (Spark's built-in manager or YARN) allocates resources across the cluster

  • Executors (Cache, Tasks) run the tasks and cache data on the worker nodes
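
In code, the driver side of this flow is just a program that creates a SparkContext. A minimal sketch, assuming Spark's Scala API and using a local master purely for illustration:

      import org.apache.spark.{SparkConf, SparkContext}

      object DriverExample {
        def main(args: Array[String]): Unit = {
          // The driver builds the SparkContext; setMaster tells it which
          // cluster manager to talk to ("local[*]" runs executors in-process
          // here, while "yarn" would hand the work to a YARN cluster).
          val conf = new SparkConf().setAppName("DriverExample").setMaster("local[*]")
          val sc = new SparkContext(conf)

          // Work defined on the driver is split into tasks and shipped to executors.
          val data = sc.parallelize(1 to 100)
          println(data.reduce(_ + _))

          sc.stop()
        }
      }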

It's Fast

  • "Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk."

  • DAG Engine (directed acyclic graph) optimizes workflows, as sketched below
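
Here is a minimal sketch of why that matters, assuming a SparkContext named sc and a hypothetical input file data.txt: transformations only describe the workflow, and Spark plans the whole DAG when an action finally runs.

      // Transformations are lazy: these lines only build up the DAG.
      val lines    = sc.textFile("data.txt")              // hypothetical input file
      val words    = lines.flatMap(line => line.split(" "))
      val nonEmpty = words.filter(word => word.nonEmpty)

      // Nothing has run yet. The count() action triggers Spark to plan the
      // whole graph as stages and execute it across the executors.
      val total = nonEmpty.count()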

It's Hot

It's Not That Hard

  • Code in Python, Java, or Scala

  • Built around one main concept: the Resilient Distributed Dataset (RDD), sketched below
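
As a rough sketch of the idea, assuming a SparkContext named sc (the Spark shell gives you one): an RDD is a distributed collection that you transform and then pull results from with an action.

      // Build an RDD from a local collection; Spark partitions it across executors.
      val ratings = sc.parallelize(List(3, 5, 4, 1, 5))

      // Transformations return new RDDs; actions like count() bring results back.
      val goodRatings = ratings.filter(rating => rating >= 4)
      println(goodRatings.count())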

Components Of Spark

  • Spark Core

  • Spark Streaming

  • Spark SQL

  • MLlib

  • GraphX

This Course Uses Scala

  • Why Scala?

    • Spark itself is written in Scala

    • Scala's functional programming model is a good fit for distributed processing

    • Gives you fast performance (Scala compiles to Java bytecode)

    • Less code & boilerplate stuff than Java

    • Python is slow in comparison

  • But...

    • You probably don't know Scala

    • So we'll have to learn the basics first.

    • It's not as hard as you think!

Fear Not

  • Scala code in Spark looks a LOT like Python code.

  • Python code to square numbers in a data set:

      nums = sc.parallelize([1, 2, 3, 4])
      squared = nums.map(lambda x: x * x).collect()

  • Scala code to square numbers in a data set:

      val nums = sc.parallelize(List(1, 2, 3, 4))
      val squared = nums.map(x => x * x).collect()
