The Data Connecting Politicians, Criminals and a Rogue Industry that hides their cash

Panama Papers: Secrets of the Global Elite

About this data

The ICIJ Offshore Leaks database, which you are working with in Neo4j, contains information on almost 785,000 offshore entities that are part of the Paradise and Panama Papers and the Offshore Leaks investigations. The data covers a long time of activities – and links to people and companies in more than 200 countries and territories.

The real value of the database is that it strips away the secrecy that cloaks companies and trusts incorporated in tax havens and exposes the people behind them. This includes, when available, the names of the real owners of those opaque structures. In all, it reveals more than 430,000 names of people and companies behind secret offshore structures. They come from leaked records and not a standardized corporate registry, so there may be duplicates. We suggest you confirm the identities of any individuals or entities located in the database based on addresses or other identifiable information.

There are legitimate uses for offshore companies and trusts. We do not intend to suggest or imply that any persons, companies or other entities included in the database have broken the law or otherwise acted improperly. If you find an error in the database please get in touch with ICIJ.

The Shape of the Data

data model

The Offshore Leaks Database was imported into Neo4j to be used by journalists and researchers to take advantage of the connections in the data. To the left is the basic "property graph" data model. Each data record is called a "node" representing an entity, intermediary, officer or address. They're connected to form a "graph" that reveals a complex web of relationships. To the left you can see a simplified diagram how the nodes connect to each other.

These are the types of nodes that you will encounter in the data:

  • Entity - The offshore legal entity. This could be a company, trust, foundation, or other legal entity created in a low-tax jurisdiction.

  • Officer - A person or company who plays a role in an offshore entity, such as beneficiary, director, or shareholder. The relationships shown in the diagram are just a sample of all the existing ones.

  • Intermediary - A go-between for someone seeking an offshore corporation and an offshore service provider — usually a law-firm or a middleman that asks an offshore service provider to create an offshore firm.

  • Address - The registered address as it appears in the original databases obtained by ICIJ.

  • Other - Other entities found in the data.

How Graphs Helped our Investigation

The Offshore Leaks data exposes a set of connections between people and offshore entities. Graph databases are the best way to explore the relationships between these people and entities — it’s much more intuitive to use for this purpose than a SQL database or other types of NoSQL databases.

For example, let's say we want to discover the shortest paths between two entity officers through a set of Entity or Address nodes. This is quite easy with Cypher, Neo4j's graph query language.

MATCH (a:Officer),(b:Officer)
          WHERE a.name CONTAINS 'Smith' AND a.sourceID = "Panama Papers"
            AND b.name CONTAINS 'Grant'
          MATCH p=allShortestPaths((a)-[:officer_of|intermediary_of|registered_address*..10]-(b))
          RETURN p
          LIMIT 50

To execute this query, please click on the statement above to put the query in the query editor above.
Hit the triangular button or press Ctrl+Enter to run it and see the resulting visualization.

The resulting graph allows us to explore how these people are connected:

How to ask questions using a query language

Graph Patterns

Neo4j’s query language, Cypher, is centered around graph patterns which represents entities with parentheses, for example, (e:Entity) and connections with arrows, for example -[:intermediary_of]->. :Entity and :intermediary_of are the types of the entity and the connection, respectively.

Here is an example pattern: (:Intermediary)-[:intermediary_of]->(:Entity). These patterns may be found with the MATCH clause.

Other Clauses

The following clauses may follow a MATCH clause. They work with the properties stored at the nodes and relationships found in the graph matching that pattern.

filter

WHERE intermediary.name CONTAINS 'MOSSACK'

aggregate

WITH e.jurisdiction AS country, COUNT(*) AS frequency

return

RETURN country, frequency

order

ORDER BY frequency DESC

limit

LIMIT 20;

Jurisdiction distribution of intermediaries in the ICIJ offshore leaks DB
MATCH (intermediary:Intermediary)-[:intermediary_of]->(e:Entity)
          WHERE intermediary.name CONTAINS 'MOSSFON' AND intermediary.sourceID = "Panama Papers"
          RETURN e.jurisdiction AS country, COUNT(*) AS frequency
          ORDER BY frequency DESC LIMIT 20;

Click on the block to put the query in the topmost window on the query editor. Hit the triangular button or press Ctrl+Enter to run it and see the resulting visualization.

Querying the Data

investigating

This guide covers:

  • Statistics

  • Visual vs Tabular Results

  • Investigating invidual people and entities

  • Finding x-degree relationships

  • Finding shortest paths and connections between entities

  • Frequently recurring pairs

This interactive guide will help you explore the Paradise and Panama Papers and the Offshore Leaks data using Cypher, Neo4j’s graph query language.

You’ll have the same investigative power we had in order to discover additional stories behind the data.

Be sure to check out the shape of the data section to understand the basics of Cypher and the data model used in the graph database.

Overview


Let’s see what data is in this database.

Run the embedded query to examine graph structure that we have just created. We see nodes for each type of entity, connected by the relationships they have.

Meta Graph

CALL db.schema()

To execute this query, please click on the statement above to put the query in the query editor above.
Hit the triangular button or press Ctrl+Enter to run it and see the resulting visualization.

Counts per entity type

MATCH (node)
RETURN labels(node) AS type,count(*)

We can also check how many entities of each type are in our database.

Which intermediaries have the most connections to which entities

MATCH (i:Intermediary) 
WHERE count {  (i)--()  } > 100 AND i.sourceID = "Panama Papers"
MATCH (i)-[connection]-(entity)
RETURN i.name as intermediary, type(connection) as relationship, head(labels(entity)) as type, count(*) as count
ORDER BY count DESC LIMIT 20

Filter datasource

Note that the above query is searching the Offshore leaks, Paradise- and Panama Papers datasets. To filter for just one source, use a WHERE clause. For example:

MATCH (i:Intermediary) 
WHERE i.sourceID = "Panama Papers" AND count {  (i)--()  } > 100 
MATCH (i)-[connection]-(entity)
RETURN i.name as intermediary, type(connection) as relationship, head(labels(entity)) as type, count(*) as count
ORDER BY count DESC LIMIT 20

Query for visual results

Querying the graph with Cypher is all about graph pattern matching. To query the graph we define graph patterns to be searched. For example, we can define the pattern "Intermediary connected to an Entity node through the intermediary_of relationship" as:

(i:Intermediary)-[r:intermediary_of]->(e:Entity)

We use variables i, e and r respectively for later filtering with WHERE and RETURNing results, these are aliases that can be reused within a single Cypher statement.

Entities registered by an Intermediary

MATCH (i:Intermediary)-[r:intermediary_of]->(e:Entity)
WHERE i.name CONTAINS "MOSSACK" AND i.sourceID = "Panama Papers"
RETURN i, r, e LIMIT 100

The results of this query are visualized as a graph, you can switch between the graph visualization and tabular results with the icons on the left side of the results view.

Query for tabular results


Let's query for the pattern "Officer node connected to an Entity node through officer_of relationship:

(o:Officer)-[:officer_of]->(:Entity)

We then count the entities per officer in an aggregation and ORDER BY that count DESCending and return the top-10 results.

Officers with most entities

MATCH (o:Officer)-[:BENEFICIARY_OF]->(:Entity)
WHERE o.sourceID = "Panama Papers"
RETURN o.name, count(*) as entities
ORDER BY entities DESC LIMIT 10

Search for Officer nodes by name


Enter any name (e.g. from our published investigations) into the form then click on the query to execute to see if that person appears in the data. Note that this search is case sensitive and searches exact matches only. We're setting a parameter for the officer which we can reuse later, just click and run the :param block.

:param officer=>"A Name"
MATCH (o:Officer)
WHERE o.name CONTAINS $officer AND o.sourceID = "Panama Papers"
RETURN o
LIMIT 100
After running the query, you can double-click on any node to expand connections in the graph around that node.

Search for an Officer and find the connections


Let’s see with which entities our officer was involved with, including first and second degree connections.

1st degree

MATCH (o:Officer)
WHERE o.name CONTAINS $officer AND o.sourceID = "Panama Papers"
MATCH path = (o)-[r]->(:Entity)
RETURN path LIMIT 100

2nd degree entities

MATCH (o:Officer) 
WHERE o.name CONTAINS $officer AND o.sourceID = "Panama Papers"
MATCH path = (o)-[]->(:Entity)
      <-[]-(:Officer)-[]->(:Entity)
RETURN path LIMIT 100

Find who is behind an Entity and the roles that they play


:param entity=>"An Entity"
MATCH (e:Entity)-[r]-(o:Officer)
WHERE e.name CONTAINS $entity AND e.sourceID = "Panama Papers"
RETURN *
LIMIT 100

Joint involvement


When investigating, it is very important to identify people that appear to operate together. You can try to find if two officers appear connected to the same entities recurrently by using this query. Note that in this case, the first results show companies because companies can also be officers of entities.

MATCH (o1:Officer)-[r1]->(e:Entity)<-[r2]-(o2:Officer)
USING JOIN ON e
WHERE id(o1) < id(o2) AND count {  (o1)-->() } > 10 AND count { (o2)-->() } > 10 AND o1.sourceID = "Panama Papers"
WITH o1,o2,count(*) as freq, collect(e)[0..10] as entities
WHERE freq > 10
WITH * ORDER BY freq DESC LIMIT 10
RETURN o1.name, o2.name, freq, [e IN entities | e.name]

We can attempt to remove companies from the results by filtering our Officer names that contain strings commonly found in company names such as "LLC", "Limited", "Ltd", etc:

MATCH (o1:Officer)-[r1]->(e:Entity)<-[r2]-(o2:Officer)
USING JOIN ON e
WHERE id(o1) < id(o2) AND NOT o1.name CONTAINS "LIMITED" AND NOT o1.name CONTAINS "Limited" AND o1.sourceID = "Panama Papers"
AND NOT o2.name CONTAINS "Limited" AND NOT o2.name CONTAINS "LIMITED"
AND count {  (o1)-->() } > 10 AND count { (o2)-->() } > 10
WITH o1,o2,count(*) as freq, collect(e)[0..10] as entities
WHERE freq > 10
WITH * ORDER BY freq DESC LIMIT 10
RETURN o1.name, o2.name, freq, [e IN entities | e.name]

Lets visualize this graph with virtual relationships

MATCH (o1:Officer)-[r1]->(e:Entity)<-[r2]-(o2:Officer)
USING JOIN ON e
WHERE id(o1) < id(o2) AND NOT o1.name CONTAINS "LIMITED" AND NOT o1.name CONTAINS "Limited" AND o1.sourceID = "Panama Papers"
AND NOT o2.name CONTAINS "Limited" AND NOT o2.name CONTAINS "LIMITED"
AND count {  (o1)-->() ) > 10 AND size( (o2)-->()  } > 10
WITH o1,o2,count(*) as freq, collect(e)[0..10] as entities
WHERE freq > 200
RETURN o1,o2,apoc.create.vRelationship(o1,'JOINT',{freq:freq,entities:[e IN entities| e.name]},o2)

Shortest path between two people


MATCH (a:Officer),(b:Officer)
WHERE a.name CONTAINS 'Smith' AND b.name CONTAINS 'Grant'
AND a.sourceID = "Panama Papers" AND b.sourceID = "Panama Papers"
WITH a,b LIMIT 20000
MATCH p=allShortestPaths((a)-[:officer_of|intermediary_of|registered_address*..10]-(b))
RETURN p
LIMIT 50

Query by address


We can use Neo4j's string comparison functions to search for addresses that contain cities and countries in which we are interested:

MATCH (a:Address)<-[:registered_address]-(other)
WHERE a.address CONTAINS 'Barcelona' AND a.address CONTAINS 'Spain' AND a.sourceID = "Panama Papers"
RETURN a, other
LIMIT 100

Offshore entity juridictions by intermediary


What are the most popular offshore jurisdictions for certain intermediaries?

MATCH (i:Intermediary)-[:intermediary_of]->(e:Entity)
WHERE i.name CONTAINS 'MOSSACK' AND i.sourceID = "Panama Papers"
RETURN e.jurisdiction_description AS jurisdiction, count(*) AS number
ORDER BY number DESC LIMIT 10

Most popular offshore jurisdiction for people connected to a country


What are the common offshore jurisdictions for officers with addresses in a certain country?

MATCH (o:Officer)-->(e:Entity)<-[:intermediary_of]-(i:Intermediary)
WHERE o.country_codes CONTAINS 'USA' and i.sourceID = 'Panama Papers'
RETURN e.jurisdiction_description AS jurisdiction, count(*) AS number
ORDER BY number DESC LIMIT 10

Most popular offshore jurisdictions for people with addresses in a certain city


What are the common offshore jurisdictions used by people connected to certain cities?

MATCH (a:Address)<-[:registered_address]-(o:Officer),
(o)-->(e:Entity)<-[:intermediary_of]-(i:Intermediary)
WHERE a.address CONTAINS 'London' AND a.address CONTAINS 'United Kingdom'
AND a.sourceID = "Panama Papers"
RETURN e.jurisdiction_description AS jurisdiction, count(*) AS number 
ORDER BY number DESC LIMIT 10

Most popular intermediaries for people with an address in a certain city?


Who are the most common intermediaries for officers connected to addresses in certain places?

MATCH (a:Address)<-[:registered_address]-(o:Officer),
(o)-->(e:Entity)<-[:intermediary_of]-(i:Intermediary)
WHERE a.address CONTAINS 'Bahamas' AND a.sourceID = "Panama Papers"
RETURN i.name AS intermediary, count(DISTINCT e) AS number 
ORDER BY number DESC LIMIT 10

Full Text Search


All the previous queries we run were searching the exact text of a node. However, you can also build an index to search the text of properties inside a node. Just like in a book where you look at the index to find a section that interest you, and then start reading from there.

For that, you need to build the index first. It will take a little while since the procedure has to read through the entire database to create it.

We used Solr as a full text search engine to search across all data. But you can do the same in Neo4j directly, both use Apache Lucene under the hood for full text search.

In order to use the full text search feature, we indexed our data by specifying all the properties we want to index. Here we created a full text index called search (we will use this name when searching in the index) with all our data. You don't have to run this, it's just here as an example

create fulltext index search for
        (n:Officer|Intermediary|Address|Entity)
        on each [n.name,n.address]

We can now use this index to search for any text contained in the properties. As a result of the query you'll get the nodes with matching text.

The most simple case would be to search across all data for an exact match of a particular word.

If you enter a word into the form, all occurrences will be found (but limited to 10 results).

CALL db.index.fulltext.queryNodes('search','Nike',{limit:10})

Full text search - advanced


You can futher restrict the full text search to only searching in a particular property. In order to seach for an Entity incorporated by offshore service provider Appleby, use the following:

CALL db.index.fulltext.queryNodes('search', 'name:"Appleby"',{limit:10})

Test it yourself:

CALL db.index.fulltext.queryNodes('search', 'name:"Appleby"', {limit:10})

Now we can search for nodes with a specific property value, and then explore their neighbourhoods visually by double-clicking to expand relationships.

Full text search with graph patterns


Previously we searched for nodes by matching against property values. However, integrating text search with an graph query is so much more powerful.

We could for instance search for addresses in the database that contain the word "New York", and then find all entities registered at those addresses:

CALL db.index.fulltext.queryNodes("search", 'address:"New York"')
            YIELD node AS addr WHERE addr.sourceID = "Panama Papers"
            MATCH (addr:Address)<-[:registered_address]-(entity)
            RETURN addr, entity LIMIT 50

There may be typos in the data so we can use fuzzy matching to find addresses that have inconsistent spellings.

Add a tilde (~) to instruct the index search procedure to do a fuzzy match, allowing you to find “Malta” even if the spelling is slightly off.

CALL db.index.fulltext.queryNodes("search", 'address:Malta~')
            YIELD node AS addr WHERE addr.sourceID = "Panama Papers"
            MATCH (addr:Address)<-[:registered_address]-(entity)
            RETURN addr, entity LIMIT 50

You might notice that there are addresses that contain the word “Barcelona” that are not in Barcelona, Spain.

We can further specify that we want the text to contain both the word Barcelona, AND the word Spain with:

CALL db.index.fulltext.queryNodes("search", '+name:Barcelona +name:Spain')
            YIELD node AS addr WHERE addr.sourceID = "Panama Papers"
            MATCH (addr:Address)<-[:registered_address]-(entity)
            RETURN addr, entity LIMIT 50

For more details on the query syntax used in the second parameter of the search procedure, please see this Lucene query tutorial

Graph Analytics


We start by projecting the graph into an in-memory represenation suitable for heavy computation

call gds.graph.project("offshoreleaks","*","*");
                            

PageRank is an algorithm used for instance by Google to rank websites in their search engine results. It assumes that more important entities are likely to have more connections pointing to them, and not just directly but also indirectly.

The algorithm can also be used as way of measuring the importance of entities in a network. Let’s use PageRank to find the top 20 ranked entities in the dataset (this runs for about 20s).

CALL gds.pageRank.stream("offshoreleaks") YIELD nodeId, score
        WITH gds.util.asNode(nodeId) as node, score
        WHERE node:Entity AND node.sourceID = "Panama Papers"
        RETURN node.name AS entity, score
        ORDER BY score DESC
        LIMIT 20

Now let’s try a similar query but investigating different patterns. The following detects the 10 top ranked addresses. It then finds how many entities are registered at those addresses. Note that the address nodes contain text that has not been standardized, therefore there may be inconsistencies in the address data.

CALL gds.pageRank.stream("offshoreleaks") YIELD nodeId, score
        WITH gds.util.asNode(nodeId) as node, score
        WHERE node:Address AND node.sourceID = "Panama Papers"
WITH * ORDER BY score DESC  LIMIT 10
MATCH (address)<-[:registered_address]-(e:Entity)
RETURN address.address, count(e) AS count
ORDER BY count DESC

Explore the Pandora, Panama, and Paradise Papers Yourself

Shape of the Data

Understand the data model.

  • What are the nodes?
  • What are the relationships?
  • What are the properties?

Pandora Papers

Explore the latest leak yourself.

  • Cypher query intro
  • Finding companies and individuals
  • Power Players

Paradise Papers

Explore the Appleby data yourself.

  • Cypher query intro
  • Finding companies and individuals
  • Path finding

Panama Papers

Explore the Mossack Fonseca data yourself.

  • Cypher query intro
  • Explore the Panama Law Firm files
  • Connect the dots

Send ICIJ a tip

Help us investigate.

  • Interesting connections
  • Entities that matter to you