46. [Activity] Using DataFrames with MLLib
MLLIB WITH DATASETS
Using DataSets With MLLib Is Actually Preferred
But it's not always practical. Not everything is a SQL problem...
Use spark.ml instead of spark.mllib to get the preferred DataSet-based API (rather than the older RDD-based one)
Performs better
Will interoperate better with Spark Streaming, Spark SQL, etc.
Available in Spark 2.0.0+
APIs are different
Let's Look at an Example
We'll do our linear regression example using DataSets this time.
Activity
Import LinearRegressionDataFrame.scala from the source folder into the SparkScalaCourse project in the Eclipse-based Scala IDE
Open LinearRegressionDataFrame.scala and look at the code
Looking at the Code
First we import several separate spark.ml packages, which are different from the spark.mllib packages we used before.
Now, instead of using a SparkContext, we're going to use a SparkSession.
This is the entry point we use in Spark 2.0 for doing DataSet and Spark SQL work.
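As a minimal sketch (the app name and the local master setting are illustrative choices, not taken from the original file), the imports and session setup might look like this:

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

// A SparkSession is the Spark 2.0+ entry point for DataSet / Spark SQL work,
// replacing the bare SparkContext used in the RDD-based examples.
val spark = SparkSession
  .builder
  .appName("LinearRegressionDF") // illustrative name
  .master("local[*]")            // run locally using all cores
  .getOrCreate()
```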
We parse the input lines, splitting each one on commas.
The underscore is a wildcard shortcut: instead of writing x => x.split(","), you can just write _.split(","), and it means the same exact thing.
In other words, the underscore stands in for each individual input coming into your map function.
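Continuing from the session created above, the parsing step might look like this. The data file name (regression.txt) and the field order (label first, then the single feature) are assumptions for illustration, not confirmed by the original:

```scala
// Read the raw text lines; the file name is a placeholder.
val inputLines = spark.sparkContext.textFile("regression.txt")

// _.split(",") is shorthand for line => line.split(",").
// Assumed layout: field 0 is the label, field 1 is the lone feature.
val data = inputLines
  .map(_.split(","))
  .map(fields => (fields(0).toDouble, Vectors.dense(fields(1).toDouble)))
```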
Remember, a DataFrame is a DataSet.
Here, we actually give names to those columns.
In a DataFrame, you want names associated with the columns so we can refer to them by name and run SQL queries on them.
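Converting the parsed RDD into a named DataFrame could then look like this; note that spark.ml estimators such as LinearRegression look for columns named "label" and "features" by default:

```scala
// The implicits enable .toDF() on RDDs of tuples.
import spark.implicits._

// Name the columns so we can refer to them by name in SQL queries,
// and so spark.ml finds its default "label" and "features" columns.
val df = data.toDF("label", "features")
```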
Now let's do something a little fancier, just to make life more interesting; if you're not familiar with machine learning, you get a little free lesson here.
One way you can evaluate the performance of a machine learning model is a technique called train/test.
The idea is that you have a set of data with known results.
In this case, we know the actual order amount and page speed for a set of data.
What we're going to do is split the data in half randomly.
One half is used for building our model, and the other half is reserved for testing that model.
Since the model had no knowledge of this other data when it was created, this is a good way to test how effective the model is at predicting data it hasn't seen before.
Split the data 50 50 randomly into trainingDF and testDF
Create a linear regression model
Train the model using trainingDF, then use it to predict values for testDF as fullPredictions
Extract the predictions and the "known" correct labels
Print them out and see how accurate the predictions are
Stop the Spark session once the job has finished (a sketch of these steps follows below)
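Putting those steps together, a minimal sketch of the train/test flow might look like the following; the regression hyperparameters shown are illustrative, not values prescribed by the original file:

```scala
// Split the DataFrame 50/50 at random into training and test sets.
val Array(trainingDF, testDF) = df.randomSplit(Array(0.5, 0.5))

// Create a linear regression model; these hyperparameters are illustrative.
val lir = new LinearRegression()
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
  .setMaxIter(100)
  .setTol(1e-6)

// Train the model on the training half only.
val model = lir.fit(trainingDF)

// Predict on the held-out test half; transform() appends a "prediction" column.
val fullPredictions = model.transform(testDF).cache()

// Pair each prediction with its "known" correct label...
val predictionAndLabel = fullPredictions
  .select("prediction", "label")
  .collect()
  .map(row => (row.getDouble(0), row.getDouble(1)))

// ...and print them side by side to eyeball the accuracy.
predictionAndLabel.foreach(println)

// Stop the session once the job is done.
spark.stop()
```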
The predicted and actual values come out pretty similar, which tells us our model is actually pretty good.