Importing the Buzzfeed TrumpWorld Dataset into Neo4j


A few weeks ago Sanchez Castro (@SCHZCAS) tweeted the following:

As part of the Buzzfeed article Help Us Map TrumpWorld, the four investigative journalists, John Templon, Alex Campbell, Anthony Cormier, and Jeremy Singer-Vine asked the public to help them map and analyze the data that they investigated, confirmed and published:

Now we are asking the public to use our data to find connections we may have missed, and to give us context we don’t currently understand. We hope you will help us — and the public — learn more about TrumpWorld and how this unprecedented array of businesses might affect public policy.

sub buzz 31493 1484333437 1

Looking at the Data


The Buzzfeed article linked in the tweet points to a Google Spreadsheet with collected and verified/researched data of relationships of Trump organizations to other organizations, in total we have 770 organizations (as of today) with 611 relationships.

Let’s take a look:

Looking at the Data


We can actually access this data in CSV format as well and use Neo4j LOAD CSV command to explore (and then load) the data:

Let’s have a quick look at the first few rows from the Google spreadsheet. Run the following query:

WITH 'https://docs.google.com/spreadsheets/d/1Z5Vo5pbvxKJ5XpfALZXvCzW26Cl4we3OaN73K9Ae5Ss/export?format=csv&id=1Z5Vo5pbvxKJ5XpfALZXvCzW26Cl4we3OaN73K9Ae5Ss&gid=1996904412' AS url
LOAD CSV WITH HEADERS FROM url AS row
RETURN row
LIMIT 5

This query returns the first five rows of the CSV file.

Looking at the Data


We can find the most frequently mentioned organizations by running this query:

WITH 'https://docs.google.com/spreadsheets/d/1Z5Vo5pbvxKJ5XpfALZXvCzW26Cl4we3OaN73K9Ae5Ss/export?format=csv&id=1Z5Vo5pbvxKJ5XpfALZXvCzW26Cl4we3OaN73K9Ae5Ss&gid=1996904412' AS url
LOAD CSV WITH HEADERS FROM url AS row
WITH row WHERE row.`Entity B Type` = "Organization" and row.`Entity A Type` = "Organization"
UNWIND  [row.`Entity A`, row.`Entity B`] AS org
RETURN org, count(*)
ORDER BY count(*) DESC
LIMIT 10

We should see the usual suspects on top.

Try changing the number of rows returned and see if you can find any other interesting organisations.

Looking at the Data


What kind of relationships exist in the data?

WITH 'https://docs.google.com/spreadsheets/d/1Z5Vo5pbvxKJ5XpfALZXvCzW26Cl4we3OaN73K9Ae5Ss/export?format=csv&id=1Z5Vo5pbvxKJ5XpfALZXvCzW26Cl4we3OaN73K9Ae5Ss&gid=1996904412' AS url
LOAD CSV WITH HEADERS FROM url AS row
WITH row WHERE row.`Entity B Type` = "Organization" and row.`Entity A Type` = "Organization"
RETURN row.Connection AS type, count(*)
ORDER BY count(*) DESC

While some of them like Ownership, or Investor are straightforward others (Anbang would become one of the equity partners in the [666 Fifth Avenue] redevelopment if an agreement is finalized, per New York Times) are very specific, and probably not a good choice for relationship-type to query on.

So we have two options, one could be to use a generic relationship and put all the Connection information into a property, or alternatively we do some cleanup/unification and have a richer set of relationships.

Simple, Direct Data Import


We look at the simpler variant here, to quickly get results. THe unification approach is demonstrated here, for those of you that want to have a bit more interesting graph model.

After creating the two needed constraints, we can directly import the data with Neo4j’s LOAD CSV command.

CREATE CONSTRAINT ON (o:Organization) ASSERT o.name IS UNIQUE;
CREATE CONSTRAINT ON (p:Person) ASSERT p.name IS UNIQUE;

Connect organizations with other organizations from the 1st tab.

WITH 'https://docs.google.com/spreadsheets/d/1Z5Vo5pbvxKJ5XpfALZXvCzW26Cl4we3OaN73K9Ae5Ss/export?format=csv&id=1Z5Vo5pbvxKJ5XpfALZXvCzW26Cl4we3OaN73K9Ae5Ss&gid=1996904412' AS url
LOAD CSV WITH HEADERS FROM url AS row
WITH row WHERE row.`Entity B Type` = "Organization" and row.`Entity A Type` = "Organization"
MERGE (o1:Organization {name:row.`Entity A`})
MERGE (o2:Organization {name:row.`Entity B`})
CREATE (o1)-[con:CONNECTED_TO]->(o2)
SET con.connection=row.Connection, con.source=row.`Source(s)`

Connect people with organizations.

WITH 'https://docs.google.com/spreadsheets/d/1Z5Vo5pbvxKJ5XpfALZXvCzW26Cl4we3OaN73K9Ae5Ss/export?format=csv&id=1Z5Vo5pbvxKJ5XpfALZXvCzW26Cl4we3OaN73K9Ae5Ss&gid=1996904412' AS url
LOAD CSV WITH HEADERS FROM url AS row
WITH row WHERE row.`Entity A Type` = "Organization" and row.`Entity B Type` = "Person"
MERGE (o:Organization {name:row.`Entity A`})
MERGE (p:Person {name:row.`Entity B`})
CREATE (p)-[con:INVOLVED_WITH]->(o)
SET con.connection=row.Connection, con.source=row.`Source(s)`
WITH 'https://docs.google.com/spreadsheets/d/1Z5Vo5pbvxKJ5XpfALZXvCzW26Cl4we3OaN73K9Ae5Ss/export?format=csv&id=1Z5Vo5pbvxKJ5XpfALZXvCzW26Cl4we3OaN73K9Ae5Ss&gid=1996904412' AS url
LOAD CSV WITH HEADERS FROM url AS row
WITH row WHERE row.`Entity A Type` = "Person" and row.`Entity B Type` = "Organization"
MERGE (p:Person {name:row.`Entity A`})
MERGE (o:Organization {name:row.`Entity B`})
CREATE (p)-[con:INVOLVED_WITH]->(o)
SET con.connection=row.Connection, con.source=row.`Source(s)`

Connect people with other people

WITH 'https://docs.google.com/spreadsheets/d/1Z5Vo5pbvxKJ5XpfALZXvCzW26Cl4we3OaN73K9Ae5Ss/export?format=csv&id=1Z5Vo5pbvxKJ5XpfALZXvCzW26Cl4we3OaN73K9Ae5Ss&gid=1996904412' AS url
LOAD CSV WITH HEADERS FROM url AS row
WITH row WHERE row.`Entity A Type` = "Person" and row.`Entity B Type` = "Person"
MERGE (p1:Person {name:row.`Entity A`})
MERGE (p2:Person {name:row.`Entity B`})
CREATE (p2)-[con:RELATED_TO]->(p1)
SET con.connection=row.Connection, con.source=row.`Source(s)`

Now we have all data of the Buzzfeed spreadsheet imported and can start asking some interesting questions.

This is what TrumpWorld looks like.

trumpworld simple

Data Enrichment


We can enrich this graph, eg. by labeling organizations:

MATCH (o:Organization)
WHERE o.name CONTAINS "BANK" SET o:Bank

We find 17 banks (there are certainly more).

MATCH (o:Organization)
WHERE o.name CONTAINS "HOTEL" SET o:Hotel

We find 34 hotels (there are certainly more).

MATCH (o:Organization)
WHERE any(term in ["TRUMP","DT","DJT"] WHERE o.name CONTAINS (term + " "))
SET o:Trump

We find 405 "Trump" organizations (there are certainly more).

There should be more labeling, but we leave that for later.

Banks in the Graph


Now we can start running some queries:

E.g. what relationships do banks have to which other organizations in our dataset:

MATCH (n:Bank)--(o)
RETURN *

Which YIELDs this interesting graph:

trumpworld simple banks

Connections between Organizations


Looking at organisations, TRUMP TOWER COMMERCIAL LLC and 40 WALL STREET LLC are some of Trump’s most prized possessions.

Can you write a query to see if they’re related to each other?

Connections between Organizations


The following query finds the two organisations and looks for a relationship between them:

MATCH (o1:Organization {name:"TRUMP TOWER COMMERCIAL LLC"})--(o2:Organization {name:"40 WALL STREET LLC"})
return *

It doesn’t seem like they are directly connected!

Connections between Organizations


Are they indirectly connected? How would you write a query to find out ?

Connections between Organizations


First let’s deconstruct our previous query. The following query is the same as the one in the previous section. [] represent the relationship.

MATCH (o1:Organization {name:"TRUMP TOWER COMMERCIAL LLC"})-[]-(o2:Organization {name:"40 WALL STREET LLC"})
return *

We could also indicate how many 'hops' we want to traverse between the nodes. e.g. to find up to one hop:

MATCH (o1:Organization {name:"TRUMP TOWER COMMERCIAL LLC"})-[*..1]-(o2:Organization {name:"40 WALL STREET LLC"})
return *

This is equivalent to our initial query so it still doesn’t return anything.

Connections between Organizations


What if we try changing the '1' to a '2'?

MATCH p = (o1:Organization {name:"TRUMP TOWER COMMERCIAL LLC"})-[*..2]-(o2:Organization {name:"40 WALL STREET LLC"})
return p

We’ve also named the path as 'p'. A path is a combination of nodes and relationships.

Kushner Network


We can now look for instance at the 2nd degree network of "Jared Kushner":

MATCH network = (:Person {name:"JARED KUSHNER"})-[*..2]-()
RETURN network

Most Connected People


We can run the following query to find the most connected people in the graph:

MATCH (p:Person)-[r]-()
RETURN p.name, type(r) AS type, count(*) AS degree
ORDER BY degree DESC
LIMIT 10

Connection from Putin to Trump


As expected via Rex Tillerson, but interestingly the connection between Tillerson and Trump as Nominee for Secretary of State seems to be missing from the data.

MATCH (vp:Person {name:"VLADIMIR PUTIN"}),(dt:Person {name:"DONALD J. TRUMP"})
MATCH path = allShortestPaths( (vp)-[*]-(dt) )
RETURN path

So let’s check the data.

Nominees of Trumps Cabinet


And Tillerson really is missing - there are only 19 in the list.

MATCH (p:Person)-[con:RELATED_TO]->()
WHERE con.connection CONTAINS "Nominee"
RETURN p.name, con.connection
ORDER BY p.name ASC

Visualizing TrumpWorld


Our friends from Linkurious used part of this work in a blog post demonstrating how to visualize this data with their tool.