Tutorial: Import Relational Data Into Neo4j

Goals

This guide will teach you the process for exporting data from a relational database (PostgreSQL) and importing into a graph database (Neo4j). You will learn how to take data from the relational system and to the graph by translating the schema and using import tools. This example uses a specific data set, but the principles in this guide can be applied and reused with any data domain.

Prerequisites

You should have a basic understanding of the property graph model and know how to model data as a graph. If you haven’t already, download and install Neo4j Desktop, so you can follow along with the examples.

Beginner

About the Data Domain

In this guide, we will be using the NorthWind dataset, an often-used SQL dataset. This data depicts a product sale system - storing and tracking customers, products, customer orders, warehouse stock, shipping, suppliers, and even employees and their sales territories. Although the NorthWind dataset is often used to demonstrate SQL and relational databases, the data also can be structured as a graph.

An entity-relationship diagram (ERD) of the Northwind dataset is shown below.

First, this is a rather large and detailed model. We can scale this down a bit for our example and choose the entities that are most critical for our graph - in other words, those that might benefit most from seeing the connections. For our use case, we really want to optimize the relationships with orders - what products were involved (with the categories and suppliers for those products), which employees worked on them and those employees' managers.

Using these business requirements, we can narrow our model down to these essential entities.

Developing a Graph Model

The first thing we will need to do to get data from a relational database into a graph is to translate the relational data model to a graph data model. Determining how we want to structure tables and rows as nodes and relationships may vary depending on what is most important to your business needs.

For more information on adapting your graph model to different scenarios, check out our modeling designs guide.

When deriving a graph model from a relational model, we should keep a couple of general guidelines in mind.

A row is a node
A table name is a label name
A join or foreign key is a relationship

With these principles in mind, we can map our relational model to a graph with the following steps:

Rows to Nodes, Table names to labels

Each row on our Orders table becomes a node in our graph with Order as the label.
Each row on our Products table becomes a node with Product as the label.
Each row on our Suppliers table becomes a node with Supplier as the label.
Each row on our Categories table becomes a node with Category as the label.
Each row on our Employees table becomes a node with Employee as the label.

Joins to relationships

Join between Suppliers and Products becomes a relationship named SUPPLIES (where supplier supplies product).
Join between Products and Categories becomes a relationship named PART_OF (where product is part of a category).
Join between Employees and Orders becomes a relationship named SOLD (where employee sold an order).
Join between Employees and itself (unary relationship) becomes a relationship named REPORTS_TO (where employees have a manager).
Join with join table (Order Details) between Orders and Products becomes a relationship named CONTAINS with properties of unitPrice, quantity, and discount (where order contains a product).

If we draw our translation out on the whiteboard, we have this graph data model.

Now, we can, of course, decide that we want to include the rest of the entities from our relational model, but for now, we will keep to this smaller graph model.

How does the Graph Model Differ from the Relational Model?

There are no nulls. Non-existing value entries (properties) are just not present.
It describes the relationships in more detail. For example, we know that an employee SOLD an order rather than having a foreign key relationship between the Orders and Employees tables. We could also choose to add more metadata about that relationship, should we wish.
Either model can be more normalized. For example, addresses have been denormalized in several of the tables, but could have been in a separate table. In a future version of our graph model, we might also choose to separate addresses from the Order (or Supplier or Employee) entities and create separate Address nodes.

Exporting Relational Tables to CSV

Thankfully, this step has already been done for us with the Northwind data. However, if you are working with another data domain, you will need to take the data from the relational tables and put it in another format for loading to the graph. A common format that many systems can handle a flat file of comma-separated values (CSV), so let’s see how to export relational tables from a PostgreSQL database to CSV files for us to create our graph.

The PostgreSQL 'copy' command lets us execute a SQL query and write the result to a CSV file. We can assemble a short .sql script of these copy commands, as shown below.

export_csv.sql

COPY (SELECT * FROM customers) TO '/tmp/customers.csv' WITH CSV header;
COPY (SELECT * FROM suppliers) TO '/tmp/suppliers.csv' WITH CSV header;
COPY (SELECT * FROM products)  TO '/tmp/products.csv' WITH CSV header;
COPY (SELECT * FROM employees) TO '/tmp/employees.csv' WITH CSV header;
COPY (SELECT * FROM categories) TO '/tmp/categories.csv' WITH CSV header;

COPY (SELECT * FROM orders
      LEFT OUTER JOIN order_details ON order_details.OrderID = orders.OrderID) TO '/tmp/orders.csv' WITH CSV header;

Then, we can run that script against our northwind database with the command psql -d northwind < export_csv.sql, and it will create the individual CSV files listed in our script.

Importing the Data using Cypher

After we have exported our data from PostgreSQL, we will use Cypher’s {cyphermanual}/clauses/load-csv/[LOAD CSV^] command to transform the contents of the CSV file into a graph structure. First, we will likely want to place our CSV files in an easily-accessed directory. With Neo4j Desktop, we can place them in the local database import directory (detailed instructions found in our desktop import guide). This way, we can use the file:/// prefix in our Cypher statements to locate the files. We can also place the files in another local or remote directory (supports HTTPS, HTTP, and FTP) and specify the full path in our Cypher statements. Since we are using Neo4j Desktop in this example, we will use the import folder for the database and the path for our CSV files can start with the file:/// prefix.

Now that we have our files where we can access them easily, we can use Cypher’s LOAD CSV command to read each file and add Cypher statements after it to take the row/column data and transform it to the graph.

The entire Cypher script is available on Github for you to copy and run, but we will step through each section below to explain what each piece of the script is doing.

// Create orders
LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS row
MERGE (order:Order {orderID: row.OrderID}) 
  ON CREATE SET order.shipName = row.ShipName;

// Create products
LOAD CSV WITH HEADERS FROM 'file:///products.csv' AS row
MERGE (product:Product {productID: row.ProductID}) 
  ON CREATE SET product.productName = row.ProductName, product.unitPrice = toFloat(row.UnitPrice);

// Create suppliers
LOAD CSV WITH HEADERS FROM 'file:///suppliers.csv' AS row
MERGE (supplier:Supplier {supplierID: row.SupplierID}) 
  ON CREATE SET supplier.companyName = row.CompanyName;

// Create employees
LOAD CSV WITH HEADERS FROM 'file:///employees.csv' AS row
MERGE (e:Employee {employeeID:row.EmployeeID}) 
  ON CREATE SET e.firstName = row.FirstName, e.lastName = row.LastName, e.title = row.Title;

// Create categories
LOAD CSV WITH HEADERS FROM 'file:///categories.csv' AS row
MERGE (c:Category {categoryID: row.CategoryID}) 
  ON CREATE SET c.categoryName = row.CategoryName, c.description = row.Description;

You might notice that we have not imported all of the field columns in our CSV file. With our statements, we can choose which properties are needed on a node, which can be left out, and which might need imported to another node type or relationship. You might also notice that we used the {cyphermanual}/clauses/merge/[MERGE keyword^], instead of {cyphermanual}/clauses/create/[CREATE^]. Though we feel pretty confident there are no duplicates in our CSV files, we can use MERGE as good practice for ensuring unique entities in our database.

For very large commercial or enterprise datasets, you may find out-of-memory errors, especially on smaller machines. To avoid these situations, you can prefix the statement with the USING PERIODIC COMMIT query hint to commit data in batches. This practice is not standard recommendation for smaller datasets, but is only recommended when memory issues are threatened. More information on this query hint can be found in the {cyphermanual}/query-tuning/using/#query-using-periodic-commit-hint[manual^].

After the nodes are created, we need to create the relationships between them. Importing the relationships will mean looking up the nodes we just created and adding a relationship between those existing entities. To ensure the lookup of nodes is optimized, we will want to create indexes for any node properties we want to use in the lookups (often the id or another unique value).

We also want to create a constraint (also creates an index with it) that will disallow orders with the same id from getting created, preventing duplicates. Finally, as the indexes are created after the nodes are inserted, their population happens asynchronously, so we use the schema await (a shell command) to block until they are populated.

CREATE INDEX product_id FOR (p:Product) ON (p.productID);
CREATE INDEX product_name FOR (p:Product) ON (p.productName);
CREATE INDEX supplier_id FOR (s:Supplier) ON (s.supplierID);
CREATE INDEX employee_id FOR (e:Employee) ON (e.employeeID);
CREATE INDEX category_id FOR (c:Category) ON (c.categoryID);
CREATE CONSTRAINT order_id ON (o:Order) ASSERT o.orderID IS UNIQUE;
schema await

Initial nodes and indexes in place, we can now create the relationships for orders to products and orders to employees.

// Create relationships between orders and products
LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS row
MATCH (order:Order {orderID: row.OrderID})
MATCH (product:Product {productID: row.ProductID})
MERGE (order)-[op:CONTAINS]->(product)
  ON CREATE SET op.unitPrice = toFloat(row.UnitPrice), op.quantity = toFloat(row.Quantity);

// Create relationships between orders and employees
LOAD CSV WITH HEADERS FROM "file:///orders.csv" AS row
MATCH (order:Order {orderID: row.OrderID})
MATCH (employee:Employee {employeeID: row.EmployeeID})
MERGE (employee)-[:SOLD]->(order);

Next, create relationships between products, suppliers, and categories:

// Create relationships between products and suppliers
LOAD CSV WITH HEADERS FROM "file:///products.csv" AS row
MATCH (product:Product {productID: row.ProductID})
MATCH (supplier:Supplier {supplierID: row.SupplierID})
MERGE (supplier)-[:SUPPLIES]->(product);

// Create relationships between products and categories
LOAD CSV WITH HEADERS FROM "file:///products.csv" AS row
MATCH (product:Product {productID: row.ProductID})
MATCH (category:Category {categoryID: row.CategoryID})
MERGE (product)-[:PART_OF]->(category);

Lastly, we will create the 'REPORTS_TO' relationship between employees to represent the reporting structure:

// Create relationships between employees (reporting hierarchy)
LOAD CSV WITH HEADERS FROM "file:///employees.csv" AS row
MATCH (employee:Employee {employeeID: row.EmployeeID})
MATCH (manager:Employee {employeeID: row.ReportsTo})
MERGE (employee)-[:REPORTS_TO]->(manager);

You can also run the whole script at once using bin/neo4j-shell -path northwind.db -file import_csv.cypher.

We can now query the resulting graph to find out what it can tell us about our newly-imported data.

Querying the Graph

We might start with a couple of general queries to verify that our data matches the model we designed earlier in the guide. Here are some example queries:

//find a sample of employees who sold orders with their ordered products
MATCH (e:Employee)-[rel:SOLD]->(o:Order)-[rel2:CONTAINS]->(p:Product)
RETURN e, rel, o, rel2, p LIMIT 25;

//find the supplier and category for a specific product
MATCH (s:Supplier)-[r1:SUPPLIES]->(p:Product {productName: 'Chocolade'})-[r2:PART_OF]->(c:Category)
RETURN s, r1, p, r2, c;

Once we are comfortable that the data aligns with our data model and everything looks correct, we can start querying to gather information and insight for business decisions. One question we might be interested in is the following:

Which Employee had the Highest Cross-Selling Count of 'Chocolade' and Another Product?

MATCH (choc:Product {productName:'Chocolade'})<-[:CONTAINS]-(:Order)<-[:SOLD]-(employee),
      (employee)-[:SOLD]->(o2)-[:CONTAINS]->(other:Product)
RETURN employee.employeeID as employee, other.productName as otherProduct, count(distinct o2) as count
ORDER BY count DESC
LIMIT 5;

Looks like employee No. 4 was busy, though employee No. 1 also did well!

employee	otherProduct	count
4	Gnocchi di nonna Alice	14
4	Pâté chinois	12
1	Flotemysost	12
3	Gumbär Gummibärchen	12
1	Pavlova	11

We might also like to answer the following question:

How are Employees Organized? Who Reports to Whom?

MATCH (e:Employee)<-[:REPORTS_TO]-(sub)
RETURN e.employeeID AS manager, sub.employeeID AS employee;

manager	employee
2	3
2	4
2	5
2	1
2	8
5	9
5	7
5	6

Notice that employee No. 5 has people reporting to them but also reports to employee No. 2.

Let’s investigate that a bit more:

Which Employees Report to Each Other Indirectly?

MATCH path = (e:Employee)<-[:REPORTS_TO*]-(sub)
WITH e, sub, [person in NODES(path) | person.employeeID][1..-1] AS path
RETURN e.employeeID AS manager, path as middleManager, sub.employeeID AS employee
ORDER BY size(path);

manager	middleManager	employee
2	[]	3
2	[]	4
2	[]	5
2	[]	1
2	[]	8
5	[]	9
5	[]	7
5	[]	6
2	[5]	9
2	[5]	7
2	[5]	6

How Many Orders were Made by Each Part of the Hierarchy?

MATCH (e:Employee)
OPTIONAL MATCH (e)<-[:REPORTS_TO*0..]-(sub)-[:SOLD]->(order)
RETURN e.employeeID as employee, [x IN COLLECT(DISTINCT sub.employeeID) WHERE x <> e.employeeID] AS reportsTo, COUNT(distinct order) AS totalOrders
ORDER BY totalOrders DESC;

employee	reportsTo	totalOrders
2	[8,1,5,6,7,9,4,3]	830
5	[6,7,9]	224
4	[]	156
3	[]	127
1	[]	123
8	[]	104
7	[]	72
6	[]	67
9	[]	43

What’s Next?

If you followed along with each step through this tutorial as a guide, then you might want to explore the data set with more queries and try to answer additional questions you came up with for the data. You may also want to apply these same principles to your own or another data set for analysis.

If you used this as a process flow to apply to a different data set or you would like to do that next, feel free to start at the top and work through this guide again with another domain. The steps and processes still apply (though, of course, the data model, queries, and business questions will need adjusted).

If you have data that needs additional cleansing and manipulation than what is covered in this guide, the APOC library may be able to help. It contains hundreds of procedures and functions for handling large amounts of data, translating values, cleaning messy data sources, and more!

If you are interested in doing a one-time initial dump of relational data to Neo4j, then the Neo4j ETL Tool might be what you are looking for. The application is designed with a point-and-click user interface with the goal of fast, simple relational-to-graph loads that help new and existing users gain faster value from seeing their data as a graph without Cypher, import procedures, or other code.

Resources

Northwind SQL, CSV and Cypher data files, also as zip file
{cyphermanual}/clauses/load-csv/[LOAD CSV^]: Cypher’s command for importing CSV files
APOC library: Neo4j’s utility library
Neo4j ETL Tool: Loading relational data without code
Importing Data with Neo4j
Graph Data Modeling