36. Partitioning
PARTITIONING RDDs
Let's Look At The Code
// Now key by (movie1, movie2) pairs.
val moviePairs = uniqueJoinedRatings.map(makePairs).partitionBy(new HashPartitioner(100))
Optimizing For Running On A Cluster: Partitioning
Choosing A Partition Size
// Filter out duplicate pairs
val uniqueJoinedRatings = joinedRatings.filter(filterDuplicates)

// Now key by (movie1, movie2) pairs.
val moviePairs = uniqueJoinedRatings.map(makePairs).partitionBy(new HashPartitioner(100))

// We now have (movie1, movie2) => (rating1, rating2)
// Now collect all ratings for each movie pair and compute similarity
val moviePairRatings = moviePairs.groupByKey()
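To make the partitionBy(new HashPartitioner(100)) call more concrete, here is a minimal sketch in plain Scala (no Spark dependency) of how a hash partitioner can map keys to partition indices: it takes the key's hash code modulo the partition count and keeps the result non-negative. The helper name partitionFor and the sample (movie1, movie2) records are hypothetical and exist only for illustration; this approximates the idea and is not Spark's actual implementation.

```scala
// Sketch only: mimics how a hash partitioner maps keys to partition indices.
// `partitionFor` is a hypothetical helper, not part of Spark's API.
object HashPartitionSketch {

  // Non-negative modulo, so negative hash codes still land in [0, numPartitions).
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  def main(args: Array[String]): Unit = {
    val numPartitions = 100

    // Made-up (movie1, movie2) => (rating1, rating2) records.
    val moviePairs = Seq(
      ((1, 2), (4.0, 3.5)),
      ((1, 2), (5.0, 4.0)),
      ((50, 172), (3.0, 3.0)),
      ((50, 181), (4.5, 5.0))
    )

    // Bucket each record by the partition index its key hashes to.
    val byPartition = moviePairs.groupBy { case (key, _) =>
      partitionFor(key, numPartitions)
    }

    byPartition.toSeq.sortBy(_._1).foreach { case (partition, records) =>
      println(s"partition $partition: ${records.mkString(", ")}")
    }
  }
}
```

Because the partition index depends only on the key, every record for a given movie pair ends up in the same partition, which is what a later per-key step such as groupByKey() relies on.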