
38. Troubleshooting, and Managing Dependencies


TROUBLESHOOTING SPARK

Troubleshooting Cluster Jobs Part I

  • It is a dark art.

  • Your master will run a console on port 4040

    • But in EMR, it's next to impossible to actually connect to it from outside

    • If you have your own cluster running on your network, life's a little easier in that respect

    • Let's take a look on our local machine though

Activity

  • For this activity, I'll run the MovieSimilarities script on my desktop, processing the 100k ratings dataset locally

  • In this case, I'm running on my local machine; the IP address of localhost is always 127.0.0.1

  • So to run the Spark script in the SparkScalaCourse project, I'll type in this command:

      spark-submit --class com.sundogsoftware.spark.MovieSimilarities MovieSims.jar 50
  • To access the Spark UI, open http://127.0.0.1:4040 in a browser while the spark-submit command is running

  • You can open all the other tabs like Stages, Storage, Environment and Executors to see the status and progress of the Spark Job

  • You can also see how Spark broke the job up into individual stages, and you can visualize those stages independently

  • Remember, stages represent points at which Spark needs to shuffle data.

  • So the more stages you have, the more data is being shuffled around, and the less efficient the job is

  • While your job is running, there may be opportunities to explicitly partition things to avoid shuffling and reduce the number of stages, and studying what's going on here can be a useful way of figuring that out.

  • The Environment tab gives you some general troubleshooting information about the Spark job and the environment itself

  • It also shows useful things like the path for Java. So if you're having trouble with dependencies, and trying to figure out why certain libraries aren't loading, this might tell you why, along with more general information about the configuration of the job on your machine

  • The Executors tab tells you how many executors are actually running; here, it's just running on my local desktop.

  • It may be surprising that there's only 1 executor, but Spark allocated just one because it decided you didn't actually need more than one executor to complete this job.

  • But if you were on a cluster and you saw only 1 executor, that would be a sign of trouble

  • That would probably mean that things aren't configured right on your cluster.

  • Maybe you left something in the configuration, or in the script itself, set to run locally or to restrict the number of executors (see the sketch after this list)

  • So that's the Spark UI in a nutshell. The job has already completed, so let's kick it off again.
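
One common cause of that last problem, sketched below as a hedged illustration (the object name is hypothetical, not taken from the course code): hard-coding setMaster("local[*]") in your SparkConf. Settings made explicitly in code take precedence over spark-submit's --master flag, so a script left this way can end up running in a single local executor even on a real cluster.

      import org.apache.spark.{SparkConf, SparkContext}

      object MyDriverScript {
        def main(args: Array[String]): Unit = {
          // Problematic on a cluster: an explicit master in code overrides
          // whatever --master you pass to spark-submit, forcing local mode.
          // val conf = new SparkConf().setAppName("MovieSimilarities").setMaster("local[*]")

          // Better: leave the master out of the code and let spark-submit decide,
          // e.g. --master yarn on EMR, or --master local[*] on your desktop.
          val conf = new SparkConf().setAppName("MovieSimilarities")
          val sc = new SparkContext(conf)

          // ... job logic goes here ...

          sc.stop()
        }
      }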

Troubleshooting Cluster Jobs Part II

  • Logs

    • In standalone mode, they're in the web UI

    • In YARN though, the logs are distributed. You need to collect them after the fact using yarn logs -applicationId <app ID>

  • While your driver script runs, it will log errors like executors failing to issue heartbeats

    • This generally means you are asking too much of each executor.

    • You may need more of them, i.e., more machines in your cluster

    • Or each executor may need more memory

    • Or use partitionBy() to demand less work from individual executors by using smaller partitions (see the sketch below)
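
A minimal sketch of that last point, with toy data standing in for something large (the object name and the numbers are illustrative; you can also raise per-executor memory with spark-submit's --executor-memory flag):

      import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

      object PartitionSketch {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("PartitionSketch"))

          // A toy key/value RDD standing in for something big, like (userID, rating) pairs.
          val pairs = sc.parallelize(1 to 1000000).map(x => (x % 1000, x))

          // Spreading the data over 100 partitions by key makes each task smaller,
          // so individual executors do less work at once. Wide operations that
          // follow (reduceByKey, join, ...) reuse this partitioning instead of
          // shuffling the data again.
          val partitioned = pairs.partitionBy(new HashPartitioner(100))
          val totals = partitioned.reduceByKey(_ + _)

          println(totals.count())
          sc.stop()
        }
      }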

Managing Dependencies

  • Remember your executors aren't necessarily on the same box as your driver script

  • Use broadcast variables to share data outside of RDD's

  • Need some Java or Scala package that's not pre-loaded on EMR?

    • Bundle them into your JAR with sbt assembly

    • Or use --jars with spark-submit to add individual libraries that are on the master

    • Try to avoid using obscure packages you don't need in the first place. Time is money on your cluster, and you're better off not fiddling with it.
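
As a minimal build.sbt sketch of the assembly approach (the library versions and the example dependency are illustrative assumptions, not from the course): mark Spark itself as "provided" so it isn't bundled, since the cluster already supplies it, and let everything else ride along in the fat JAR.

      // project/assembly.sbt:
      //   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")

      // build.sbt:
      name := "MovieSimilarities"
      version := "1.0"
      scalaVersion := "2.11.8"

      libraryDependencies ++= Seq(
        // "provided": compiled against, but excluded from the assembly JAR.
        "org.apache.spark" %% "spark-core" % "2.3.0" % "provided",
        // A library that isn't pre-loaded on the cluster gets bundled in:
        "org.scalaj" %% "scalaj-http" % "2.3.0"
      )

Running sbt assembly then produces a single self-contained JAR under target/scala-2.11/ that you can hand to spark-submit on the cluster.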
