Section 7: Machine Learning with MLLib

46. [Activity] Using DataFrames with MLLib

MLLIB WITH DATASETS

Using DataSets With MLLib Is Actually Preferred

  • But it's not always practical. Not everything is a SQL problem...

  • Use spark.ml instead of spark.mllib for the preferred DataSet-based API (instead of RDDs)

    • Performs better

    • Will interoperate better with Spark Streaming, Spark SQL, etc.

  • Available in Spark 2.0.0+

  • APIs are different (see the short sketch after this list)
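
  • As a rough, hedged sketch of that package split (this snippet is illustrative and not from the course code): the older spark.mllib classes work on RDDs, while the spark.ml classes used below expect a DataFrame with "label" and "features" columns.

    // Older, RDD-based API: lives under org.apache.spark.mllib and works on RDD[LabeledPoint]
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.linalg.{Vectors => MLlibVectors}

    // Newer, DataFrame/DataSet-based API: lives under org.apache.spark.ml and works on
    // DataFrames with "label" and "features" columns
    import org.apache.spark.ml.regression.LinearRegression
    import org.apache.spark.ml.linalg.{Vectors => MLVectors}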

Let's Look at an Example

  • We'll do our linear regression example using DataSets this time.

Activity

  • Import LinearRegressionDataFrame.scala from the source folder into the SparkScalaCourse project in the Eclipse IDE

  • Open LinearRegressionDataFrame.scala and look at the code

Looking at the Code

    import org.apache.spark.ml.regression.LinearRegression
    import org.apache.spark.sql.types._
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.SparkSession
  • Now we are importing separate spark.ml packages, which are different from the spark.mllib packages we used before.

    // Use new SparkSession interface in Spark 2.0
    val spark = SparkSession
        .builder
        .appName("LinearRegressionDF")
        .master("local[*]")
        .config("spark.sql.warehouse.dir", "file:///C:/temp") // Necessary to work around a Windows bug in Spark 2.0.0; omit if you're not on Windows.
        .getOrCreate()
  • Now, instead of using a SparkContext, we're going to use a SparkSession.

  • This is the API we use in Spark 2.0 for working with DataSets and Spark SQL.

    // Load up our page speed / amount spent data in the format required by MLLib
    // (which is label, vector of features)

    // In machine learning lingo, "label" is just the value you're trying to predict, and
    // "feature" is the data you are given to make a prediction with. So in this example
    // the "labels" are the first column of our data, and "features" are the second column.
    // You can have more than one "feature" which is why a vector is required.
    val inputLines = spark.sparkContext.textFile("../regression.txt")
    val data = inputLines.map(_.split(",")).map(x => (x(0).toDouble, Vectors.dense(x(1).toDouble)))
  • We parse each line by splitting it on commas.

  • The underscore (_) is basically a wildcard, a shortcut: instead of writing x => x.split(","), you can just write _.split(","), and it means exactly the same thing.

  • So the underscore represents each individual element coming into your map function (see the short sketch after this list).

  • Remember, a DataFrame is just a DataSet of Row objects.
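
  • As a tiny illustration of the underscore shorthand (separate from the course script), these two map calls produce exactly the same result:

    // Assuming the same "spark" session as above; a throwaway two-line RDD just for illustration
    val sampleLines = spark.sparkContext.parallelize(List("1.0,2.0", "3.0,4.0"))

    // Explicit parameter name...
    val explicit = sampleLines.map(x => x.split(","))

    // ...and the underscore shorthand, where _ stands for each incoming element
    val shorthand = sampleLines.map(_.split(","))

  • With that shorthand in mind, back to converting the RDD into a DataFrame: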

    // Convert this RDD to a DataFrame
    import spark.implicits._
    val colNames = Seq("label", "features")
    val df = data.toDF(colNames: _*)
  • Here, we are actually giving names to those columns

  • So in a DataFrame, you want names associated with those columns so we can run SQL queries on them and refer to them by name (a short sketch of this appears after these notes).

  • Now let's do something a little fancier, just to make life more interesting. If you're not familiar with machine learning, you get a little free lesson here.

  • One way you can evaluate the performance of a machine learning model is a technique called train/test: the idea is that you have a set of data with a set of known results.

  • So in this case, we know the actual order amount and page speed for a set of data.

  • What we're going to do is split the data in half, randomly.

  • One half is used for building our model, and the other half is reserved for testing that model.

  • Since the model had no knowledge of the test data when it was created, this is a good way to test how effective the model is at predicting data it hasn't seen before.
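
  • As a short, hedged sketch of what those column names buy you (the temporary view name below is made up for illustration), once the DataFrame has named columns you can refer to them directly or query them with plain SQL:

    // Refer to columns by name with the DataFrame API
    df.printSchema()
    df.select("label", "features").show(5)

    // Or register a temporary view and query it with SQL (the view name is arbitrary)
    df.createOrReplaceTempView("regression_data")
    spark.sql("SELECT label FROM regression_data WHERE label > 0").show(5)

  • Now, on to the train/test split itself: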

    // Note, there are lots of cases where you can avoid going from an RDD to a DataFrame.
    // Perhaps you're importing data from a real database. Or you are using structured streaming
    // to get your data.

    // Let's split our data into training data and testing data
    val trainTest = df.randomSplit(Array(0.5, 0.5))
    val trainingDF = trainTest(0)
    val testDF = trainTest(1)

    // Now create our linear regression model
    val lir = new LinearRegression()
        .setRegParam(0.3) // regularization 
        .setElasticNetParam(0.8) // elastic net mixing
        .setMaxIter(100) // max iterations
        .setTol(1E-6) // convergence tolerance
  • Split the data 50/50, at random, into trainingDF and testDF

  • Create a linear regression model

    // Train the model using our training data
    val model = lir.fit(trainingDF)

    // Now see if we can predict values in our test data.
    // Generate predictions using our linear regression model for all features in our 
    // test dataframe:
    val fullPredictions = model.transform(testDF).cache()

    // This basically adds a "prediction" column to our testDF dataframe.

    // Extract the predictions and the "known" correct labels.
    val predictionAndLabel = fullPredictions.select("prediction", "label").rdd.map(x => (x.getDouble(0), x.getDouble(1)))

    // Print out the predicted and actual values for each point
    for (prediction <- predictionAndLabel) {
        println(prediction)
    }

    // Stop the session
    spark.stop()
  • Train the model using trainingDF, then generate predictions on testDF as fullPredictions

  • Extract the predictions and the "known" correct labels

  • Print them out and see how accurate they are

  • Stop the Spark session once you have finished running the Spark job

  • The predicted and actual values are pretty similar, which tells us our model is actually pretty good
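
  • Eyeballing the printed pairs is as far as the course script goes. As a hedged extra (not part of the original code, and it would need to run before spark.stop()), spark.ml's RegressionEvaluator can condense the same predictions into a single error metric such as RMSE:

    import org.apache.spark.ml.evaluation.RegressionEvaluator

    // Root-mean-square error between the "prediction" and "label" columns of fullPredictions
    val evaluator = new RegressionEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("rmse")

    val rmse = evaluator.evaluate(fullPredictions)
    println(s"Root-mean-square error on the test set: $rmse")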

