Graph Data Modeling


This section will cover how to model data as a graph as well as how to import data into Neo4j using LOAD CSV.

Let’s first identify the tools and concepts we’ll be using:

Neo4j Tools
  • Neo4j Browser - Neo4j Browser (this web application!) is a query workbench for querying Neo4j and visualizing the results.

  • Cypher - Cypher is the query language for graphs used to interact with Neo4j. Cypher is an expressive query language designed for graphs that allows the user to define graph queries defining a traversal.

  • LOAD CSV - functionality built into Cypher that allows for importing data into Neo4j from CSV / flat files.

Next, let’s introduce the dataset we’ll be using today.

Short term housing rental data


Inside Airbnb

Inside Airbnb

Inside Airbnb has collected data on listings, hosts and reviews. We will use this data as an example to see how we can model and import data into Neo4j. We will learn the Cypher query language

The Datamodel


Listings, reviews, hosts

Listings datamodel

This is the full data model that we will build out using data from Inside Airbnb.

The Labeled Property Graph Model


Neo4j implements the labeled property graph data model. The components of the labeled property graph model are:

The Labeled Property Graph Model

lpg

  • Node - The node is the entity or object. Each node can have one or more labels.

  • Relationship - Relationships connect Nodes in the graph. Each relationship has a single type.

  • Properties - Key-values pairs stored on either Nodes or Relationships

An Introduction to Cypher


When using Cypher, we will be defining graph patterns, expressed as ASCII-art like strings. For example:

A simple graph traversal
MATCH (l:Listing)-[:HAS]->(a:Amenity)
RETURN l, COLLECT(a) AS amenities;
Creating data with Cypher
CREATE (l:Listing {listing_id: "1234"})
SET l.name  = "Spacious 2 bedroom",
    l.price = 195
CREATE (a1:Amenity {name: "wifi"})
CREATE (a2:Amenity {name: "hot tub"})
CREATE (l)-[:HAS]->(a1)
CREATE (l)-[:HAS]->(a2)
RETURN *;
Deleting data
MATCH (a)
DETACH DELETE a;

Cypher keywords


In the previous example, we used the following Cypher keywords:

  • MATCH - Search the graph for a certain graph pattern

  • RETURN - Specify results to be returned from the query

  • COLLECT - An aggregation function, collect elements into a list.

  • SET - Set property values

  • CREATE - Create data in the graph. Takes a graph pattern.

  • DELETE - Delete data from the graph

Next, we will learn how to import data into Neo4j using LOAD CSV;

LOAD CSV


LOAD CSV allows us to import data from CSV files, using Cypher to define the model we want to create.

Examine CSV file data, but don’t create any data
LOAD CSV WITH HEADERS FROM "file:///listings.csv" AS row
RETURN row LIMIT 10

We see that by specifying WITH HEADERS, the row object becomes a map (or dictionary). LOAD CSV will allow us to iterate over each row in the file, creating data for each row. Let’s start by importing listings, amenities, and neighborhoods:

datamodel1

LOAD CSV WITH HEADERS FROM "file:///listings.csv" AS row
WITH row WHERE row.id IS NOT NULL
CREATE (l:Listing {listing_id: row.id})
SET l.name = row.name,
    l.price = toFloat(substring(row.price, 1)),
    l.weekly_price = toFloat(substring(row.weekly_price, 1)),
    l.cleaning_fee = toFloat(substring(row.cleaning_fee, 1)),
    l.property_type = row.property_type,
    l.accomodates = toInt(row.accomodates),
    l.bedrooms = toInt(row.bedrooms),
    l.bathrooms = toInt(row.bathrooms),
    l.availabiliity_365 = toInt(row.availability_365)

We can verify that we’ve imported data with this query:

Count number of Listing nodes
MATCH (l:Listing) RETURN COUNT(l);
Create index on :Listing(listing_id)
CREATE INDEX ON :Listing(listing_id);

Adding Neighborhood and Amenity nodes


Remember that this is the model we want to create initially:

datamodel1

We’re missing amenities and neighborhoods so let’s add those.

MERGE and constraints

Neo4j is "schema optional" - we can choose to define a schema to enfore some data integrity constraints. In this case we want to avoid creating Amenity nodes so we will define a constraint to enfore uniqueness:

CREATE CONSTRAINT ON (a:Amenity) ASSERT a.name IS UNIQUE;

Now that we’ve created a uniqueness constraint, we can use the MERGE keyword when creating data to check to see if data exists in the graph before creating. Think of MEGE as "get-or-create".

Add Amenity nodes, and relationships connecting them to the listings
LOAD CSV WITH HEADERS FROM "file:///listings.csv" AS row
WITH row WHERE row.id IS NOT NULL
MATCH (l:Listing {listing_id: row.id})
WITH l, split(replace(replace(replace(row.amenities, "{", ""), "}", ""), "\"", ""), ",") AS amenities
UNWIND amenities AS amenity
MERGE (a:Amenity {name: amenity})
MERGE (l)-[:HAS]->(a);

Neighborhood

Let’s do the same with Neighborhood nodes now:

CREATE CONSTRAINT ON (n:Neighborhood) ASSERT n.neighborhood_id IS UNIQUE;

Some neighborhoods lack an id so we’ll use a default value for these with the coalesce function.

Add Neighborhood nodes
LOAD CSV WITH HEADERS FROM "file:///listings.csv" AS row
WITH row WHERE row.id IS NOT NULL
MATCH (l:Listing {listing_id: row.id})
MERGE (n:Neighborhood {neighborhood_id: coalesce(row.neighbourhood_cleansed, "NA")})
ON CREATE SET n.name = row.neighbourhood
MERGE (l)-[:IN_NEIGHBORHOOD]->(n);

Host


datamodel2

CREATE CONSTRAINT ON (h:Host) ASSERT h.host_id IS UNIQUE;
LOAD CSV WITH HEADERS FROM "file:///listings.csv" AS row
WITH row WHERE row.host_id IS NOT NULL
MERGE (h:Host {host_id: row.host_id})
ON CREATE SET h.name      = row.host_name,
              h.about     = row.host_abot,
              h.superhost = CASE WHEN row.host_is_super_host = "t" THEN True ELSE False END,
              h.location  = row.host_location,
              h.image     = row.host_picture_url
WITH row, h
MATCH (l:Listing {listing_id: row.id})
MERGE (h)-[:HOSTS]->(l);

Reviews


datamodel full

Create constraint on User
CREATE CONSTRAINT ON (u:User) ASSERT u.user_id IS UNIQUE;
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:///reviews.csv" AS row

// User
MERGE (u:User {user_id: row.reviewer_id})
SET u.name = row.reviewer_name

// Review
CREATE (r:Review {review_id: row.id})
SET r.date     = row.date,
    r.comments = row.comments
WITH row, u, r
MATCH (l:Listing {listing_id: row.listing_id})
MERGE (u)-[:WROTE]->(r)
MERGE (r)-[:REVIEWS]->(l);

Verify the data model


CALL db.schema();