46. [Activity] Using DataFrames with MLLib

MLLIB WITH DATASETS

Using DataSets With MLLib Is Actually Preferred

  • But it's not always practical. Not everything is a SQL problem...

  • Use spark.ml instead of spark.mllib for the preferred Dataset-based API (instead of RDDs)

    • Performs better

    • Will interoperate better with Spark Streaming, Spark SQL, etc.

  • Available in Spark 2.0.0+

  • The APIs are different
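
The differences start right at the imports. As a quick illustration (assuming Spark 2.x, where both packages coexist, and using LinearRegressionWithSGD, the RDD-based class from our earlier MLLib example):

```scala
// The old RDD-based API lives under org.apache.spark.mllib:
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// The newer DataFrame/Dataset-based API lives under org.apache.spark.ml:
import org.apache.spark.ml.regression.LinearRegression
```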

Let's Look at an Example

  • We'll do our linear regression example using Datasets this time.

Activity

  • Import LinearRegressionDataFrame.scala from the source folder into the SparkScalaCourse project in the Scala Eclipse IDE

  • Open LinearRegressionDataFrame.scala and look at the code

Looking At the Code

  • Now we are importing several separate spark.ml packages that are different from the MLLib packages we used before

  • Now, instead of using a SparkContext, we're going to use a SparkSession.

  • This is basically the API we use in Spark 2.0 for doing Dataset stuff and Spark SQL stuff; a minimal sketch follows below.
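
As a rough sketch, setting up a SparkSession looks like this (the app name and local master setting are illustrative, not necessarily what the course code uses):

```scala
import org.apache.spark.sql.SparkSession

// Build (or reuse) a SparkSession -- the single entry point for
// Dataset/DataFrame and Spark SQL work in Spark 2.0+.
val spark = SparkSession
  .builder
  .appName("LinearRegressionDF") // illustrative app name
  .master("local[*]")            // run locally using all CPU cores
  .getOrCreate()
```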

  • We are parsing the input by splitting each line on commas

  • The underscore (_) is basically a wildcard, a sort of shortcut: instead of writing x => x.split(","), you can just write _.split(",") and it means the same exact thing.

  • The underscore stands for each individual input coming into your map function.
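
Here's a tiny, self-contained illustration of that shorthand (the sample strings are made up):

```scala
val lines = Seq("1.0,7.5", "2.0,8.9")

// Explicit anonymous function:
val explicit = lines.map(x => x.split(","))

// Underscore shorthand -- the _ stands for the single input parameter,
// so this means the same exact thing:
val shorthand = lines.map(_.split(","))
```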

  • Remember: a DataFrame is just a Dataset of Row objects.

  • Here, we are actually giving names to those columns

  • So in a DataFrame, you want names associated with those columns so we can actually run SQL queries on them and refer to them by name; a sketch follows below.
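
A sketch of how that might look, assuming a comma-separated input file of (label, feature) pairs; the file name and column names are illustrative and may not match the course code exactly:

```scala
import org.apache.spark.ml.linalg.Vectors

// Needed for the .toDF() conversion below
import spark.implicits._

// Parse each line into a (label, feature vector) tuple...
val data = spark.sparkContext.textFile("regression.txt") // assumed file name
  .map(_.split(","))
  .map(fields => (fields(0).toDouble, Vectors.dense(fields(1).toDouble)))

// ...then name the columns so we can refer to them by name later
val df = data.toDF("label", "features")
```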

  • So let's do something a little bit fancier just to make life more interesting; if you're not familiar with machine learning, you get a little free lesson here.

  • So one way you can evaluate the performance of a machine learning model is a technique called train/test: the idea is that you have a set of data with a set of known results.

  • So in this case, we know the actual order amount and page speed for a set of data.

  • What we're going to do is split the data in half randomly

  • One half is used for building our model, and the other half is reserved for testing that model

  • Since the model had no knowledge of this other data when it was created, it's a good way to test how effective the model is at predicting data it hasn't seen before.

  • Split the data 50/50 randomly into trainingDF and testDF (see the sketch after this list)

  • Create a linear regression model

  • Train the model using trainingDF, then predict values for testDF as fullPredictions

  • Extract the predictions and the "known" correct labels

  • Print them out side by side to see how accurate the predictions are

  • Stop the Spark session once you have finished running the Spark job
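
Putting those steps together, here is a sketch of the whole train/test pipeline, continuing from the spark session and df defined in the sketches above (the hyperparameters are illustrative, not necessarily the course's exact values):

```scala
import org.apache.spark.ml.regression.LinearRegression

// Split the data 50/50 into a training half and a held-out test half
val Array(trainingDF, testDF) = df.randomSplit(Array(0.5, 0.5))

// Create a linear regression model (illustrative hyperparameters)
val lir = new LinearRegression()
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
  .setMaxIter(100)

// Train on the training half only
val model = lir.fit(trainingDF)

// Predict values for the test half the model has never seen
val fullPredictions = model.transform(testDF).cache()

// Extract the predictions and the "known" correct labels, side by side
val predictionAndLabel = fullPredictions.select("prediction", "label").collect()

// Print each (prediction, label) pair to eyeball the accuracy
for (row <- predictionAndLabel) {
  println(row)
}

// Stop the session once the job is done
spark.stop()
```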

  • The predicted and actual values are pretty similar, and this tells us our model is actually pretty good
