Graph Algorithms

Graph Algorithms


AStormOfSwords

In this guide we’ll learn how to use the Neo4j Graph Algorithms package using a Game of Thrones dataset.

The dataset used in this guide is based on work done by Dr. Andrew Beveridge, based on his Network of Thrones research.

Graph of Character Interactions


Andrew built a graph where nodes are characters in the TV series and relationships represent character interactions.

Characters have a relationship for each time their names (or nicknames) appear within 15 words of one another in the scripts. Each link corresponds to an interaction between the two characters. Note that this interaction could be direct or indirect. Here are some of the types of interactions that our method picks up.

  • Two characters appearing together in the same location

  • Two characters in conversation

  • One character talking about another character

  • One character listening to a third character talk about a second character

  • A third character talking about two other characters

The data extraction process is described in detail here: https://networkofthrones.wordpress.com/from-book-to-network/

The data model


Run the following query to see the schema of our graph:

call db.schema.visualization()

Summary statistics


Let’s do some simple exploration of the dataset before running some more complex algorithms. The following query will compute the minimum, maximum, average, and standard deviation of unique interactions for characters grouped by season:

MATCH (c:Character)-[r]-()
WITH r.season as season, c, count(*) AS degree
RETURN season, min(degree) AS min, max(degree) AS max, avg(degree) AS average, stdev(degree) AS stdev
ORDER BY season

There’s a lot of conversations in season 1, but then it falls off in seasons 2 and 3, before a slight increase after that.

Characters can interact with each other multiple times, which means we can compute these metrics based on the total number of interactions as well:

MATCH (c:Character)-[r]-()
WITH r.season as season, c, sum(r.weight) AS degree
RETURN season, min(degree) AS min, max(degree) AS max, avg(degree) AS average, stdev(degree) AS stdev
ORDER BY season

Degree Centrality


The summary statistics from the previous section are based on the degree of our nodes. The Degree Centrality algorithm measures the number of incoming and outgoing relationships from a node, and helps us find the most popular nodes in a graph.

The following query finds the most popular characters in the 1st season, based on the number of character interactions:

CALL algo.degree.stream("Character", "INTERACTS_SEASON1", {
  direction: "BOTH"
})
YIELD nodeId, score
RETURN algo.asNode(nodeId).name AS character, score
ORDER BY score DESC
LIMIT 50

We can also specify weightProperty if we want to find the most popular character based on total character interactions:

CALL algo.degree.stream("Character", "INTERACTS_SEASON1", {
  direction: "BOTH",
  weightProperty: "weight"
})
YIELD nodeId, score
RETURN algo.asNode(nodeId).name AS character, score
ORDER BY score DESC
LIMIT 50

Diameter of the network


The diameter (or geodesic) of a network is defined as the longest shortest path in the network.

We can write the following query to find it in our graph for the 2nd season:

// Find maximum diameter of network
// maximum shortest path between two nodes
MATCH (a:Character), (b:Character) WHERE id(a) > id(b)
MATCH path = shortestPath((a)-[:INTERACTS_SEASON2*]-(b))

RETURN path, length(path) AS len
ORDER BY len DESC
LIMIT 5

This query creates a cartesian product combining all characters so we need to be careful when running this on larger graphs or we’ll get an OutOfMemoryException.

Pivotal nodes


A node is said to be pivotal if it lies on all shortest paths between two other nodes in the network. We can find all pivotal nodes in the network.

The following query will find all the pivotal nodes in the network for the first season:

MATCH (a:Character), (b:Character) WHERE id(a) > id(b)
MATCH p = allShortestPaths((a)-[:INTERACTS_SEASON1*]-(b))

WITH collect(p) AS paths, a, b
UNWIND nodes(paths[0]) as c // first path

WITH *
WHERE NOT c IN [a,b]
AND all(path IN paths[1..] WHERE c IN nodes(path))

RETURN a.name, b.name, c.name AS pivotalNode, length(paths[0]) AS pathLength, length(paths) AS pathCount
SKIP 490
LIMIT 10

Betweenness Centrality


Betweenness centrality identifies nodes that are strategically positioned in the network, meaning that information will often travel through that person. Such an intermediary position gives that person power and influence.

Betweenness centrality is a raw count of the number of short paths that go through a given node. For example, if a node is located on a bottleneck between two large communities, then it will have high betweenness.

betweenness centrality

The red nodes have a high betweenness centrality and are connectors of clusters.

Betweenness Centrality


We’ll start by calculating the betweenness centrality for the characters who interacted in the first season. We can do this by calling the algo.betweenness.stream procedure with the label Character and relationship type INTERACTS1.

Run the following query to learn who the most influential characters are:

CALL algo.betweenness.stream("Character", "INTERACTS_SEASON1", {
  direction: "BOTH"
})
YIELD nodeId, centrality
RETURN algo.asNode(nodeId).name, centrality
ORDER BY centrality DESC
LIMIT 10

If you’ve watched the TV series hopefully the results aren’t too surprising!

Betweenness Centrality vs Biggest Communicators


We can add to our query and see how the betweenness scores compare to the total interactions a character had:

CALL algo.betweenness.stream("Character", "INTERACTS_SEASON1", {direction: "BOTH"})
YIELD nodeId, centrality
WITH algo.asNode(nodeId) AS c, centrality
WITH c, centrality, [(c)-[r:INTERACTS_SEASON1]-(other) | {character: other.name, weight: r.weight}] AS interactions
RETURN c.name, centrality,
       apoc.coll.sum([i in interactions | i.weight]) AS totalInteractions,
       [i in apoc.coll.sortMaps(interactions, 'weight')[..5] | i.character] as charactersInteractedWith
ORDER BY centrality DESC
LIMIT 10

From this query we can see that it’s not necessarily the most talkative characters that have the most influence on the network.

Try changing the query to order by totalInteractions instead of centrality to see this more clearly.

Storing Betweenness Centrality


Although the betweenness centrality algorithm runs very quickly on this dataset we wouldn’t usually be running this types of algorithms in the normal request/response flow of a web/mobile app. Instead of that we can store the result of the calculation as a property on the node and then refer to it in future queries.

Each of the algorithms has a variant that saves its output to the database rather than returning a stream. Let’s run the betweenness centrality algorithm and store the result as a property named season1BetweennessCentrality:

CALL algo.betweenness("Character", "INTERACTS_SEASON1", {direction: "BOTH", writeProperty: "season1BetweennessCentrality"})

Querying Betweenness Centrality


We can write the following query to find the most influential characters:

MATCH (c:Character)
RETURN c.name, c.season1BetweennessCentrality AS centrality
ORDER BY centrality DESC
LIMIT 10

Exercise: Betweenness Centrality for season 7


Now we want to calculate the betweenness centrality for other seasons and store the results in the database.

  • Write a query that calls algo.betweenness for INTERACTS_SEASON7 relationship types.

After you’ve done that see if you can write queries to answer the following questions:

  • Which character had the biggest increase in influence from season 1 to 7?

  • Which character had the biggest decrease?

Bonus question:

  • Which characters who were in the top 10 influencers in season 1 are also in the top 10 influencers in season 7?

Answer: Betweenness Centrality for season 7


CALL algo.betweenness("Character", "INTERACTS_SEASON7", {direction: "BOTH", writeProperty: "season7BetweennessCentrality"})

Answer: Increase in influence


MATCH (c:Character)
RETURN c.name, c.season1BetweennessCentrality, c.season7BetweennessCentrality, c.season7BetweennessCentrality - c.season1BetweennessCentrality AS difference
ORDER BY difference DESC
LIMIT 10

Answer: Decrease in influence


MATCH (c:Character)
RETURN c.name, c.season1BetweennessCentrality, c.season7BetweennessCentrality, c.season7BetweennessCentrality - c.season1BetweennessCentrality AS difference
ORDER BY difference
LIMIT 10

Answer: Consistent influencers


MATCH (c:Character)

WITH c
ORDER BY c.season1BetweennessCentrality DESC
LIMIT 10

WITH collect(c.name) AS characters
MATCH (c:Character)

WITH c, c.season7BetweennessCentrality AS season7BetweennessCentrality, characters
ORDER BY season7BetweennessCentrality DESC
LIMIT 10

WITH c WHERE c.name IN characters
RETURN c.name, c.season1BetweennessCentrality, c.season7BetweennessCentrality
LIMIT 10

Closeness Centrality


Closeness centrality is a way of detecting nodes that are able to spread information very efficiently through a graph. The closeness centrality of a node measures its average farness (inverse distance) to all other nodes. Nodes with a high closeness score have the shortest distances to all other nodes.

We can run this algorithm over the interactions in season 2 of Game of Thrones:

CALL algo.closeness.stream("Character", "INTERACTS_SEASON2", {
  direction: "BOTH"
})
YIELD nodeId, centrality
RETURN algo.asNode(nodeId).name, centrality
ORDER BY centrality DESC
LIMIT 10

Daenerys has a score of 1, which means that she’s interacted directly with all other characters. This doesn’t seem likely, so let’s try and work out what’s happened.

Why is Daenerys so well connected?


By default, the Closeness Centrality algorithm works out the connectedness of a node to all nodes that it can reach. We can run the connected components algorithm to find sets of nodes that have paths between each other.

CALL algo.unionFind.stream("Character", "INTERACTS_SEASON2", {
  direction: "BOTH"
})
YIELD nodeId, setId
WITH setId, collect(algo.asNode(nodeId).name) AS people
RETURN setId, people, size(people) AS size
ORDER BY size(people) DESC
LIMIT 10

Aha! In season 2 Daenerys was away from the majority of other characters, and she did indeed interact with all the people within her connected component.

Closeness Centrality: Wasserman and Faust / Harmonic


So the closeness centrality algorithm actually measures the farness of a node to all other nodes in the same connected component. If we want to find the farness to all other nodes in the graph, we can use the Wasserman and Faust or Harmonic variants of the algorithm.

Wasserman and Faust

CALL algo.closeness.stream("Character", "INTERACTS_SEASON2", {
  direction: "BOTH", improved: true
})
YIELD nodeId, centrality
RETURN algo.asNode(nodeId).name, centrality
ORDER BY centrality DESC
LIMIT 10

Harmonic

CALL algo.closeness.harmonic.stream("Character", "INTERACTS_SEASON2", {
  direction: "BOTH"
})
YIELD nodeId, centrality
RETURN algo.asNode(nodeId).name, centrality
ORDER BY centrality DESC
LIMIT 10

We can learn more about these variants of the closeness centrality algorithm in Chapter 6 of the Graph Algorithms Book.

PageRank


This is another version of weighted degree centrality with a feedback loop. This time, you only get your “fair share” of your neighbor’s importance.

i.e. your neighbor’s importance is split between their neighbors, proportional to the number of interactions with that neighbor.

Intuitively, PageRank captures how effectively you are taking advantage of your network contacts. In our context, PageRank centrality nicely captures narrative tension. Indeed, major developments occur when two important characters interact.

758px PageRanks Example.svg

Calculating PageRank


This time lets skip straight to the version of this procedure that stores results straight into the database.

Run the following queries to calculate page rank scores for seasons 1 and 7:

CALL algo.pageRank("Character", "INTERACTS_SEASON1", {direction: "BOTH", writeProperty:'season1PageRank'})
CALL algo.pageRank("Character", "INTERACTS_SEASON7", {direction: "BOTH", writeProperty:'season7PageRank'})

Querying PageRank


We can now write a query to see how influential the characters are across a variety of different metrics:

MATCH (c:Character)
WITH c, [(c)-[r:INTERACTS_SEASON1]-(other) | {character: other.name, weight: r.weight}] AS interactions
RETURN c.name, c.season1PageRank, c.season1BetweennessCentrality,
       apoc.coll.sum([i in interactions | i.weight]) AS totalInteractions,
       [i in apoc.coll.reverse(apoc.coll.sortMaps(interactions, 'weight'))[..5] | i.character] as charactersInteractedWith
ORDER BY c.season1PageRank DESC
LIMIT 20

You’ll notice that there are some characters who have a high page rank but a very low betweenness centrality score.

This suggests that they aren’t necessarily influential in their own right, but are friends with important people. Varys is a good example of a character that fits this profile.

Community Detection


We can detect communities in our data by running an algorithm which traverses the graph structure to find highly connected subgraphs with fewer connections other other subgraphs.

Run the following query to calculate the communities that exist based on interactions across all the seasons.

CALL algo.labelPropagation(
  'MATCH (c:Character) RETURN id(c) as id',
  'MATCH (c:Character)-[rel]-(c2) RETURN id(c) as source, id(c2) as target, SUM(rel.weight) as weight',
  {graph:'cypher', partitionProperty: 'community'})

Querying Communities


We can then write a query to see what communities we have and how many members they have:

MATCH (c:Character)
WHERE exists(c.community)
RETURN c.community, count(*) AS count
ORDER BY count DESC

There seem to be 2 or 3 large clusters of people and then a lot of smaller ones.

Querying Communities


It’d be good to know who are the influential people in each community. To do that we’ll need to calculate a PageRank score for each character across all the seasons:

CALL algo.pageRank(
  'MATCH (c:Character) RETURN id(c) as id',
  'MATCH (c:Character)-[rel]-(c2) RETURN id(c) as source,id(c2) as target, SUM(rel.weight) as weight',
  {graph:'cypher', writeProperty: 'pageRank'})
MATCH (c:Character)
WHERE exists(c.community)
WITH c ORDER BY c.pageRank DESC
RETURN c.community as cluster, count(*) AS count, collect(c.name)[..10]
ORDER BY count DESC

Most people are in a big community containing Tyrion, but we also have smaller communities which contain Jon and Daenerys who are another important characters.

Visualising Communities


We can write the following community to see the interactions between people in one of the communities:

MATCH (c:Character) WHERE exists(c.community)
WITH c.community AS community, COUNT(*) AS count
ORDER BY count DESC
SKIP 1 LIMIT 1
MATCH path = (c:Character {community: community})--(c2:Character {community: community})
RETURN path

Intra community PageRank


We can also calculate the PageRank within communities.

Run the following query to calculate the page rank for the 2nd largest community:

MATCH (c:Character) WHERE EXISTS(c.community)
WITH c.community AS communityId, COUNT(*) AS count
ORDER BY count DESC
SKIP 1 LIMIT 1
CALL apoc.cypher.doIt(
  "CALL algo.pageRank(
    'MATCH (c:Character) WHERE c.community =" + communityId + " RETURN id(c) as id',
    'MATCH (c:Character)-[rel]->(c2) WHERE c.community =" + communityId + " AND c2.community =" + communityId + " RETURN id(c) as source,id(c2) as target, sum(rel.weight) as weight',
    {graph:'cypher', writeProperty: 'communityPageRank'}) YIELD nodes RETURN count(*)", {})
YIELD value
RETURN value

Intra community PageRank


We can run the following query to find the most influential character within that cluster:

MATCH (c:Character) WHERE exists(c.community)
WITH c.community AS communityId, COUNT(*) AS count
ORDER BY count DESC
SKIP 1 LIMIT 1
MATCH (c:Character) WHERE c.community = communityId
RETURN c.name, c.communityPageRank
ORDER BY c.communityPageRank DESC
LIMIT 10

Learn more


OReilly Graph Algorithms v2 ol1

You can learn more about Neo4j Graph Algorithms by browsing the User Guide or by watching the online meetup presented by Michael Hunger.

You can also get your free copy of O’Reilly’s Graph Algorithms: Practical Examples in Apache Spark and Neo4j book.