36. Partitioning

PARTITIONING RDDs

Let's Look At The Code

  • If you look at the MovieSimilarities1M.scala script in your Spark Eclipse IDE, you'll see a few things we've changed.

  • There is one line of code that sets up partitioning for running on a fairly large cluster:

      // Now key by (movie1, movie2) pairs. (HashPartitioner lives in org.apache.spark.)
      val moviePairs = uniqueJoinedRatings.map(makePairs).partitionBy(new HashPartitioner(100))

Optimizing For Running On A Cluster: Partitioning

  • Spark isn't totally magic - you need to think about how your data is partitioned

  • Running our movie similarity script as-is might not work at all.

    • That self-join is expensive, and Spark won't distribute it on its own.

  • Use .partitionBy() on an RDD before running a large operation that benefits from partitioning

    • cogroup(), groupWith(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), combineByKey(), and lookup()

    • Those operations will preserve your partitioning in their result, too (see the sketch below).
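
  • For instance, here's a minimal, self-contained sketch of the pattern. (The toy data and the PartitionByDemo object name are our own, for illustration; they aren't part of the course script.)

      import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

      object PartitionByDemo {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("PartitionByDemo").setMaster("local[*]"))

          // Two small pair RDDs keyed by user ID (toy data for illustration).
          val ratings = sc.parallelize(Seq((1, 5.0), (2, 3.0), (1, 4.0)))
          val names   = sc.parallelize(Seq((1, "Alice"), (2, "Bob")))

          // Partition up front so the join can reuse this layout
          // instead of shuffling everything from scratch.
          val partitioned = ratings.partitionBy(new HashPartitioner(8))
          val joined = partitioned.join(names)

          // join() preserves its input's partitioner.
          println(joined.partitioner) // prints Some(...HashPartitioner...)

          sc.stop()
        }
      }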

Choosing A Partition Size

  • Too few partitions won't take full advantage of your cluster

  • Too many results in too much overhead from shuffling data

  • Use at least as many partitions as you have cores, or executors that fit within your available memory

  • partitionBy(100) is usually a reasonable place to start for large operations, as in the excerpt below.

      // Filter out duplicate pairs
      val uniqueJoinedRatings = joinedRatings.filter(filterDuplicates)
    
      // Now key by (movie1, movie2) pairs.
      val moviePairs = uniqueJoinedRatings.map(makePairs).partitionBy(new HashPartitioner(100))
    
      // We now have (movie1, movie2) => (rating1, rating2)
      // Now collect all ratings for each movie pair and compute similarity.
      // groupByKey() reuses the HashPartitioner set above, so no extra shuffle here.
      val moviePairRatings = moviePairs.groupByKey()
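
  • If you'd rather derive the partition count from the cluster than hard-code 100, something like the sketch below works. (PartitionSizing, partitionCount, and the two-per-core multiplier are our own illustrative choices, not part of the course script.)

      import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

      object PartitionSizing {
        // Roughly 2 partitions per core; defaultParallelism reflects the
        // total cores Spark sees for this job.
        def partitionCount(sc: SparkContext, perCore: Int = 2): Int =
          sc.defaultParallelism * perCore

        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("PartitionSizing").setMaster("local[*]"))
          val data = sc.parallelize(1 to 1000).map(x => (x % 10, x))

          // Size the partitioner from the cluster instead of hard-coding 100.
          val partitioned = data.partitionBy(new HashPartitioner(partitionCount(sc)))
          println(s"partitions = ${partitioned.getNumPartitions}")

          sc.stop()
        }
      }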
