41. [Activity] Using DataFrames and DataSets

Activity

  • You don't have to use SQL with SparkSQL necessarily

  • You can also call functions directly without actually relying on the sql syntax

  • That can be little bit more efficient

  • Import DataFrames.scala from sourcefolder into SparkScalaCourse in Spark-Eclipse IDE

  • Open DataFrames.scala and look at the code

Looking At The Code

  • You are still importing the SQL package from Saprk and alot of the codes looks the same as SparkSQL.scala

    println("Let's select the name column:")
    people.select("name").show()
  • Instead of calling spark.sql select name, I am going to operate on the dataset directly by calling .select("name").show()

  • This will quickly show the top 20 results for that dataset

    println("Filter out anyone over 21:")
    people.filter(people("age") < 21).show()
  • We can also call people on our dataset, by calling the filter functio to filter out people who are over the age of 21

  • This will show the top 20 results for that dataset

    println("Group by age:")
    people.groupBy("age").count().show()
  • This will group the people by their age and count the total number for that age

  • This reduces the need for SQL like query

    println("Make everyone 10 years older:")
    people.select(people("name"), people("age") + 10).show()
  • We are selecting the name of people, and selecting the age column adding 10 as we go and show these results as we go.

  • So now let's go ahead and run these results

  • You should now see the top 20 results for each of the respective functions you specified

Last updated