Apache Spark 2.0 with Scala
  • Introduction
  • Section 1: Getting Started
    • 1. Warning about Java 9 and Spark 2.3!
    • 2. Introduction, and Getting Set Up
    • 3. [Activity] Create a Histogram of Real Movie Ratings with Spark!
  • Section 2: Scala Crash Course
    • 4. [Activity] Scala Basics, Part 1
    • 5. [Exercise] Scala Basics, Part 2
    • 6. [Exercise] Flow Control in Scala
    • 7. [Exercise] Functions in Scala
    • 8. [Exercise] Data Structures in Scala
  • Section 3: Spark Basics and Simple Examples
    • 9. Introduction to Spark
    • 10. Introducing RDD's
    • 11. Ratings Histogram Walkthrough
    • 12. Spark Internals
    • 13. Key/Value RDD's, and the Average Friends by Age Example
    • 14. [Activity] Running the Average Friends by Age Example
    • 15. Filtering RDD's, and the Minimum Temperature by Location Example
    • 16. [Activity] Running the Minimum Temperature Example, and Modifying it for Maximum
    • 17. [Activity] Counting Word Occurrences using flatMap()
    • 18. [Activity] Improving the Word Count Script with Regular Expressions
    • 19. [Activity] Sorting the Word Count Results
    • 20. [Exercise] Find the Total Amount Spent by Customer
    • 21. [Exercise] Check your Results, and Sort Them by Total Amount Spent
    • 22. Check Your Results and Implementation Against Mine
  • Section 4: Advanced Examples of Spark Programs
    • 23. [Activity] Find the Most Popular Movie
    • 24. [Activity] Use Broadcast Variables to Display Movie Names
    • 25. [Activity] Find the Most Popular Superhero in a Social Graph
    • 26. Superhero Degrees of Separation: Introducing Breadth-First Search
    • 27. Superhero Degrees of Separation: Accumulators, and Implementing BFS in Spark
    • 28. Superhero Degrees of Separation: Review the code, and run it!
    • 29. Item-Based Collaborative Filtering in Spark, cache(), and persist()
    • 30. [Activity] Running the Similar Movies Script using Spark's Cluster Manager
    • 31. [Exercise] Improve the Quality of Similar Movies
  • Section 5: Running Spark on a Cluster
    • 32. [Activity] Using spark-submit to run Spark driver scripts
    • 33. [Activity] Packaging driver scripts with SBT
    • 34. Introducing Amazon Elastic MapReduce
    • 35. Creating Similar Movies from One Million Ratings on EMR
    • 36. Partitioning
    • 37. Best Practices for Running on a Cluster
    • 38. Troubleshooting, and Managing Dependencies
  • Section 6: SparkSQL, DataFrames, and DataSets
    • 39. Introduction to SparkSQL
    • 40. [Activity] Using SparkSQL
    • 41. [Activity] Using DataFrames and DataSets
    • 42. [Activity] Using DataSets instead of RDD's
  • Section 7: Machine Learning with MLLib
    • 43. Introducing MLLib
    • 44. [Activity] Using MLLib to Produce Movie Recommendations
    • 45. [Activity] Using DataFrames with MLLib
    • 46. [Activity] Using DataFrames with MLLib
  • Section 8: Intro to Spark Streaming
    • 47. Spark Streaming Overview
    • 48. [Activity] Set up a Twitter Developer Account, and Stream Tweets
    • 49. Structured Streaming
  • Section 9: Intro to GraphX
    • 50. GraphX, Pregel, and Breadth-First Search with Pregel
    • 51. [Activity] Superhero Degrees of Separation using GraphX
  • Section 10: You Made It! Where to Go from Here.
    • 52. Learning More, and Career Tips
    • 53. Bonus Lecture: Discounts on my other "Big Data" / Data Science Courses.
Section 6: SparkSQL, DataFrames, and DataSets

39. Introduction to SparkSQL

SPARK SQL

DataFrames and DataSets

Working With Structured Data

  • Extends RDD to a "DataFrame" object

  • DataFrames:

    • Contain Row objects

    • Can run SQL queries

    • Have a schema (leading to more efficient storage)

    • Read and write to JSON, Hive, and Parquet

    • Communicate with JDBC/ODBC and Tableau
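
As a minimal sketch of those capabilities, here is a small self-contained program that loads JSON into a DataFrame and queries it with SQL. The file name people.json and its name/age fields are hypothetical:

    import org.apache.spark.sql.SparkSession

    object DataFrameDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("DataFrameDemo")
          .master("local[*]")
          .getOrCreate()

        // Each line of people.json is assumed to be an object like {"name": "Fred", "age": 33}
        val people = spark.read.json("people.json")

        people.printSchema()   // the schema was inferred from the JSON

        // Register the DataFrame as a view so we can query it with SQL
        people.createOrReplaceTempView("people")
        spark.sql("SELECT name FROM people WHERE age >= 21").show()

        spark.stop()
      }
    }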

DataSets

  • A DataFrame is really just a DataSet of Row objects (DataSet[Row])

  • DataSets can explicitly wrap a given struct or type (DataSet[Person], DataSet[(String, Double)])

    • It knows what its columns are from the get-go

  • A DataFrame's schema is inferred at runtime, but a DataSet's schema is known at compile time

    • Faster detection of errors, and better optimization

  • RDD's can be converted to DataSets with .toDS()
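
A sketch of that conversion, assuming a hypothetical Person case class and a local session:

    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Int)

    object DataSetDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("DataSetDemo")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._   // brings .toDS() and the case-class encoders into scope

        // Wrap structured rows in a strongly typed DataSet[Person]
        val peopleRDD = spark.sparkContext.parallelize(Seq(Person("Fred", 33), Person("Wilma", 31)))
        val peopleDS = peopleRDD.toDS()

        // Columns and types are known at compile time
        peopleDS.filter(person => person.age > 32).show()

        spark.stop()
      }
    }

Because Person's fields are known to the compiler, a typo in a field name fails the build instead of blowing up at runtime.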

DataSets Are The New Hotness

  • The trend in Spark is to use RDD's less, and DataSets more

  • DataSets are more efficient

    • They can be serialized very efficiently - even better than Kryo

    • Optimal execution plans can be determined at compile time

  • DataSets allow for better interoperability

    • MLLib and Spark Streaming are moving toward using DataSets instead of RDD's for their primary API

  • DataSets simplify development

    • You can perform most SQL operations on a DataSet with one line

In Spark 2.0.0, you create a SparkSession object instead of a SparkContext when using SparkSQL/DataSets

  • You can issue SQL queries on your DataSets through this session (and still get a SparkContext from it when you need one)

  • Stop the session when you're done.
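
A minimal sketch of that lifecycle (the app name is illustrative):

    import org.apache.spark.sql.SparkSession

    // Build (or reuse) a session; the entry point for SparkSQL in Spark 2.0+
    val spark = SparkSession.builder
      .appName("SparkSQLDemo")
      .master("local[*]")
      .getOrCreate()

    // The underlying SparkContext is still available for RDD work
    val sc = spark.sparkContext

    // ... issue queries with spark.sql(...) and work with DataSets here ...

    spark.stop()   // stop the session when you're done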

Other Stuff You Can Do With DataFrames

  • myResultDataFrame.show()

  • myResultDataFrame.select("someFieldName")

  • myResultDataFrame.filter(myResultDataFrame("someFieldName") > 200)

  • myResultDataFrame.groupBy(myResultDataFrame("someFieldName")).mean()

  • myResultDataFrame.rdd.map(mapperFunction)
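
Here's a toy sketch exercising those calls; the DataFrame contents and field names are made up, and it assumes an active SparkSession named spark:

    import spark.implicits._

    // A toy DataFrame, just to exercise the calls listed above
    val myResultDataFrame = Seq(("apple", 150), ("banana", 250), ("banana", 350))
      .toDF("name", "someFieldName")

    myResultDataFrame.show()
    myResultDataFrame.select("someFieldName").show()
    myResultDataFrame.filter(myResultDataFrame("someFieldName") > 200).show()
    myResultDataFrame.groupBy(myResultDataFrame("name")).mean().show()

    // .rdd drops back to a plain RDD[Row] for low-level transformations
    val doubled = myResultDataFrame.rdd.map(row => row.getAs[Int]("someFieldName") * 2)
    doubled.collect().foreach(println)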

Shell Access

  • Spark SQL exposes a JDBC/ODBC server (if you built Spark with Hive support)

  • Start it with sbin/start-thriftserver.sh

  • Listens on port 10000 by default

  • Connect using bin/beeline -u jdbc:hive2://localhost:10000

  • Voilà, you have a SQL shell to Spark SQL

  • You can create new tables, or query existing ones that were cached using hiveCtx.cacheTable("tableName")

User-Defined Functions (UDFs)

    import org.apache.spark.sql.functions.{udf, col}

    // Wrap a plain Scala function as a UDF (the parameter type must be explicit)
    val square = udf((x: Int) => x * x)
    // Apply it to a column; assumes df has an integer column named "value"
    val squaredDF = df.withColumn("square", square(col("value")))
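
A UDF can also be registered by name for use inside SQL strings. This sketch assumes the same df with an integer value column; the view name myTable is just illustrative:

    spark.udf.register("square", (x: Int) => x * x)
    df.createOrReplaceTempView("myTable")
    spark.sql("SELECT value, square(value) AS squared FROM myTable").show()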

Let's Play With Spark SQL And DataFrames

  • Use our fake social network data from earlier

  • Query it with SQL, and then use DataSets without SQL

  • Finally, we'll redo our popular movies example with DataSets and see how much simpler it is.
