50. GraphX, Pregel, and Breadth-First-Search with Pregel.

Analyzing and transversing graphs

Using Spark GraphX

GRAPHX

  • Not that kind of graph...

  • Graphs like oursocial network of superheroes - graphs in the computer science / network sense.

  • However, it's only useful for specific things.

    • It can measure things like "connectedness", degree distribution, average path length, triangle counts - high level measures of a graph

    • It can count triangles in the graph, and apply the PageRank algorithm to it.

    • It can also join graphs together and transform graphs quickly

    • For things like our "degrees of seperation" example, you won't find built-in support. But it does support the Pregel API for transversing a graph...

  • Introduces VertexRDD and EdgeRDD, and the Edge data type

  • Otherwise, GraphX code looks like any other Spark code for the most part

Creating Vertex RDD's

    // Function to extract hero ID -> hero name tuples (or None in case of failure)
    def parseNames(line: String) : Option[(VertexId, String)] = {
        var fields = line.split('\'")
        if (fields.length > 1){
            val heroID: Long = fields(0).trim().toLong
            if (heroID < 6487) { // ID's above 6486 aren't real characters
                return Some(fields(0).trim().toLong, fields(1))
            }
        }

        return None // flatmap will just discard None results, and extract data from Some results.
    }

Creating Edge RDD's

Creating A Graph

Doing Stuff

  • Top 10 most-connected heroes:

Breadth-First Search With Pregel

  • Before, we actually implemented BFS search in Spark the hard and manual way.

  • Now with Pregel, that gives us an general algorithm for actually doing iterations through a graph in a general manner and we can actually implement breadth first search using the Pregel algorithm, in a little bit of a more simple way than we did it before

How Pregel Works

  • Vertices send messages to other vertices (their neighbours)

  • The graph is processed in iterations called supersteps

  • Each superstep does the following:

    • Messages from the previous iteration are received by each vertex

    • Each vertex runs a program to transform itself

    • Each vertex sends messages to other vertices

BFS: Initialization

Sending Messages

Preserving The Minimum Distance At Each Step

  • Pregel's vertex program will preserve the minimum distance between the one it receives and what it has:

    • (id, attr, msg) => math.min(attr, msg)

  • Its reduce operation preserves the minimum distance if multiple messages are received for the same vertex:

    • (a,b) => math.min(a,b)

Putting It All Together

Let's Do It

Last updated