38. Troubleshooting and Managing Dependencies
TROUBLESHOOTING SPARK
Troubleshooting Cluster Jobs Part I
It is a dark art.
Your driver will run a web console (the Spark UI) on port 4040
But in EMR, it's next to impossible to actually connect to it from outside
If you have your own cluster running on your network, life's a little easier in that respect
Let's take a look at it on our local machine, though
Activity
I'll run the MovieSimilarities script on my desktop, which processes the 100k ratings dataset locally
In this case, I'm running on my local host; the IP address of your own machine is always 127.0.0.1
So to run the Spark script from the SparkScalaCourse project, I'll type in a spark-submit command along the lines of the sketch below
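Here's a minimal sketch of what that command might look like; the class name, JAR name, and the trailing argument are placeholders that depend on how you packaged the project:

```
# Placeholders only -- substitute the class and JAR from your own build
spark-submit --class com.example.MovieSimilarities MovieSimilarities.jar 50
```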
While the spark-submit command is running, you can access the Spark UI at http://127.0.0.1:4040
You can open the other tabs, like Stages, Storage, Environment, and Executors, to see the status and progress of the Spark job
You can also see how Spark actually broke the job up into individual stages, and you can visualize those stages independently
Remember, stages represent points at which Spark needs to shuffle data.
So the more stages you have, the more data is being shuffled around, and the less efficiently your job is running
There may be opportunities to explicitly partition things to avoid shuffling and reduce the number of stages, and studying what's going on here can be a useful way of spotting them
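For example, here's a minimal sketch of using partitionBy() to shuffle an RDD once, up front, so later key-based operations can reuse that partitioning; the file name and field layout are made up for illustration, and an existing SparkContext named sc is assumed:

```scala
import org.apache.spark.HashPartitioner

// Hypothetical (userID, rating) pairs -- the file and field positions
// are illustrative only; sc is an existing SparkContext.
val ratings = sc.textFile("ratings.data").map { line =>
  val fields = line.split("\t")
  (fields(0).toInt, fields(2).toDouble)
}

// Shuffle once, up front. Operations that preserve the partitioner,
// like reduceByKey on the same keys, then avoid a fresh shuffle stage.
val partitioned = ratings.partitionBy(new HashPartitioner(100)).persist()

val totalsPerUser = partitioned.reduceByKey(_ + _) // no additional shuffle
```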
The Environment tab gives you some general troubleshooting information about the Spark job and the environment itself
Also useful: things like the path for Java. So if you're having trouble with dependencies and trying to figure out why certain libraries aren't loading, that might tell you why, along with more general information about the configuration of the job on your machine
The Executors tab tells you how many executors are actually running; here, I'm just running on my local desktop.
It may be surprising that you only have one executor, but Spark allocated just one because it decided you didn't actually need more than that to complete this job.
But if you were on a cluster and saw only one executor, that would be a sign of trouble
It would probably mean that things aren't configured right on your cluster
Maybe you left something in the configuration, or in the script itself, that makes it run locally or restricts the number of executors
So that's the Spark UI in a nutshell. The job has already completed, so let's kick it off again.
Troubleshooting Cluster Jobs Part II
Logs
In standalone mode, they're in the web UI
In YARN, though, the logs are distributed. You need to collect them after the fact using yarn logs -applicationId <app ID>
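For example (the application ID below is made up; use the one reported when your job was submitted, or find it with yarn application -list):

```
# Hypothetical application ID -- substitute your own
yarn logs -applicationId application_1449548415350_0001 > app_logs.txt
```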
While your driver script runs, it will log errors like executors failing to issue heartbeats
This generally means you are asking too much of each executor.
You may need more of them, i.e., more machines in your cluster (see the spark-submit sketch after this list)
Each executor may need more memory
Or use partitionBy() to demand less work from individual executors by using smaller partitions.
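As a rough sketch, you can ask for more executors and more memory per executor at submit time; the numbers below are arbitrary placeholders, and --num-executors only applies when running on YARN:

```
# Placeholder values -- tune these for your cluster and workload
spark-submit --num-executors 10 \
  --executor-memory 4g \
  --class com.example.MyJob myjob.jar
```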
Managing Dependencies
Remember your executors aren't necessarily on the same box as your driver script
Use broadcast variables to share data outside of RDDs
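Here's a minimal sketch of broadcasting a small lookup table to every executor; the map contents and the movieIDs RDD are made up for illustration, and an existing SparkContext named sc is assumed:

```scala
// Hypothetical lookup table and RDD -- illustrative only
val movieNames = Map(1 -> "Toy Story (1995)", 2 -> "GoldenEye (1995)")
val movieIDs = sc.parallelize(Seq(1, 2, 3))

// Each executor receives one copy of the broadcast value, rather than
// the map being serialized into every individual task's closure.
val nameDict = sc.broadcast(movieNames)
val named = movieIDs.map(id => nameDict.value.getOrElse(id, "Unknown"))
```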
Need some Java or Scala package that's not pre-loaded on EMR?
Bundle them into your JAR with sbt assembly
Or use --jars with spark-submit to add individual libraries that are on the master
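A minimal sketch of the sbt assembly setup; the plugin and Spark versions below are placeholders, so check the current ones for your build:

```scala
// project/plugins.sbt -- the version here is a placeholder
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")

// build.sbt -- mark Spark itself as "provided" so it isn't bundled;
// only your extra dependencies end up in the fat JAR
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.3.0" % "provided"
```

Running sbt assembly then produces a single fat JAR under target/ that you can hand to spark-submit.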
Try to avoid using obscure packages you don't need in the first place. Time is money on your cluster, and you're better off not fiddling with it.