About this Course


Motivations

This 60-minute online Cypher class was created to provide you with a quick way of learning the open Cypher graph query language. Please give us feedback on how we can improve it.

Feel free to pause and resume the course whenever you wish. You can log out and log back in by entering your email address in the login form.

Additional Resources

This course gives you a basic understanding of Cypher. It’s a broad language with many cool features. When in doubt, please use the resources below to find out more:

Running the Course with Neo4j Browser

You can also run this course with an installed and running Neo4j instance.

After starting the database, open the Neo4j Browser on http://localhost:7474/. Log in.

The browser interface has a command line on top in which you enter Cypher statements or Browser commands (starting with a colon :).

You can run this course at any time with this command: :play http://guides.neo4j.com/cypher

Import the movie dataset by running :play movies and clicking on the big CREATE statement and then the triangular run button right of the command line.

You can also use Ctrl+Enter to run any command. Shift+Enter switches to multi-line mode, and / puts the cursor focus into the command line.

Note
You should pin this frame with the little pin icon on the top-right.

What is Cypher


Cypher is a declarative query language that allows for expressive and efficient querying and updating of the graph data. Cypher is a relatively simple but still very powerful language. Very complicated database queries can easily be expressed through Cypher. This allows you to focus on your domain instead of getting lost in database access.

Cypher is designed to be a humane query language, suitable for both developers and (importantly, we think) operations professionals. Our guiding goal is to make the simple things easy, and the complex things possible. Its constructs are based on English prose and neat iconography which helps to make queries more self-explanatory. We have tried to optimize the language for reading and not for writing.

Being a declarative language, Cypher focuses on the clarity of expressing what to retrieve from a graph, not on how to retrieve it. This is in contrast to imperative, programmatic APIs for database access. This approach makes query optimization an implementation detail instead of burdening the user with it and requiring her to update all traversals just because the physical database structure has changed.

Cypher is inspired by a number of different approaches and builds upon established practices for expressive querying. Most of the keywords like WHERE and ORDER BY are inspired by SQL. Pattern matching borrows expression approaches from SPARQL. Some of the collection semantics have been borrowed from languages such as Haskell and Python.

Movie Database


The data model in this tutorial includes nodes with three different labels (each with their own properties), and six different types of relationships (one of which has its own property). The underlying structure of the database is visualized in the image below.

In brief, the graph is made up of Person, Movie, and Genre nodes that are related to each other in various ways. This tutorial will use the Cypher widget and the model below to introduce and explain the Cypher concepts you need to build and query the model relevant to your use case.

Getting Started


Nodes

Cypher uses a pair of parentheses (usually containing a text string) like (), (foo) to represent a node, i.e. an entity of your domain. This is reminiscent of a circle or a rectangle with rounded corners. Here are some ASCII-art representations of nodes, providing varying types and amounts of detail:

    ()
    (matrix)
    (:Movie)
    (matrix:Movie)
    (matrix:Movie {title: "The Matrix"})
    (matrix:Movie {title: "The Matrix", released: 1999})

The simplest form, (), represents an anonymous, uncharacterized node. If we want to refer to the node elsewhere, we can add an identifier, eg: (matrix). Identifiers are restricted (ie, scoped) to a single statement: an identifier may have different (or no) meaning in another statement.

The Movie label (prefixed in use with a colon) declares the node’s type. This restricts the pattern, keeping it from matching (say) a structure with an Actor node in this position. Neo4j’s node indexes also use labels: each index is specific to the combination of a label and a property.

The node’s properties (eg, title) are represented as a list of key/value pairs, enclosed within a pair of braces, eg: {…​}. Properties can be used to store information and/or restrict patterns. For example, we could match nodes whose title is "The Matrix". These attributes look similar to JSON structures.

Labels

Labels allow us to group our nodes. For example, we might want to distinguish movies from people or animals (both act in films). Matching (actor:Person)-[:ACTED_IN]->(movie) will return Clint Eastwood, but not Clyde - his pet orangutan in Every Which Way but Loose. Let’s leave arguments over which of the two should get the most credit as an actor to film buffs!

Labels are usually used like this:

    MATCH (node:Label) RETURN node
    MATCH (node1:Label1)-[:REL_TYPE]->(node2:Label2)
    RETURN node1, node2

With labels, Cypher can make much better decisions about how to optimize your query, so you should try to label the relevant nodes in your pattern.

Relationships

What the Cypher snippets above leave out is the relationships between the nodes, which add the contextual information to our data. Relationships let us view people in their relation to a Movie as Actor, Director, or Producer, so we need to be able to describe the types of relationships in our Cypher queries.

First and foremost, relationships are arrows pointing from one node to another, much like --> or <--. We can add detail about them as needed within a pair of square brackets.

If we wanted to retrieve everyone who had acted in a movie, we would describe the pattern (actor)-[:ACTED_IN]->(movie). It retrieves only nodes that have a relationship of type ACTED_IN to other nodes (movies); those nodes would then be actors, as implied by the ACTED_IN relationship.

Or generally:

    MATCH (node1)-[:REL_TYPE]->(node2)

Sometimes we need access to information about a relationship (e.g. its type or properties). For example, we might want to output the roles that an Actor played in a Movie; roles would probably be a property of the ACTED_IN relationship. As with nodes, we can use identifiers for relationships (in front of the :TYPE). If we matched (actor)-[role:ACTED_IN]->(movie), we would be able to output role.roles for each of the actors in all of the movies that they acted in.

    MATCH (node1)-[rel:TYPE]->(node2)
    RETURN rel.property

Patterns

Combining the syntax for nodes and relationships, we can express patterns. The following could be a simple pattern (or fact) in this domain:

    (matrix:Movie {title:"The Matrix"} )<-[role:ACTED_IN {roles:["Neo"]}]-
    (keanu:Person:Actor {name:"Keanu Reeves"})

As with node labels, the relationship type ACTED_IN is added as a symbol, prefixed with a colon: :ACTED_IN. Identifiers (eg, role) can be used elsewhere in the statement to refer to the relationship. Node and relationship properties use the same notation. In this case, we used an array property for the roles, allowing multiple roles to be specified.

To increase modularity and reduce repetition, Cypher allows patterns to be assigned to identifiers. This allows the matching paths to be inspected, used in other expressions, etc.

    cast = (:Person)-[:ACTED_IN]->(:Movie)

The cast variable would contain two nodes and the connecting relationship for each path that was found or created. There are a number of functions to access details of a path, including nodes(path), rels(path) (same as relationships(path)), and length(path).
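As a rough sketch of how such a named path might be inspected against the movie dataset, the query below binds each path in a MATCH clause and passes it to the path functions. The aliases (pathNodes, pathRels, hops) and the LIMIT are illustrative choices, not part of the course material, and relationships() is used as the long form of rels().

```cypher
// Bind each matching path to the identifier `cast`, then inspect it.
MATCH cast = (:Person)-[:ACTED_IN]->(:Movie)
RETURN nodes(cast)         AS pathNodes,  // the Person and the Movie node
       relationships(cast) AS pathRels,   // the single ACTED_IN relationship
       length(cast)        AS hops        // number of relationships in the path
LIMIT 3;
```

Because this pattern contains exactly one relationship, length(cast) is 1 for every row.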

Resources

Nodes and Properties


Creating Nodes

Let’s start by adding a node to the graph.

Let’s follow along with the video.

Run the following query, replacing My Name with your name in quotes (if you happen to have the same name as a famous actor you might want to change what name you put in):

CREATE (me:Person {name: "My Name"})
RETURN me;

You will see the new node returned and also as part of the visualization. You can also easily check for its existence with the following query.

MATCH (me:Person)
WHERE me.name="My Name"
RETURN me.name;

or

MATCH (me:Person {name:"My Name"})
RETURN me.name;

The All Nodes Query

If we wanted to return all nodes in the graph, we can use the following query:

MATCH (n)
RETURN n; //note the semicolon

The query is doing a full graph search. It visits every single node to see whether it matches the pattern of (n). In this concrete case, the pattern is simply a node that may or may not have relationships, so it will match every single node in the graph. The RETURN clause then returns all of the information about each of those nodes - including all of their properties.

NOTE: In a larger graph this will return A LOT of data, so run it only when you know what you’re doing.

Note the semicolon after the RETURN clause. It tells the Neo4j-Shell that you’ve finished writing your query. In the command-line Neo4j-Shell, if you don’t use a semicolon, Neo4j will assume you still have more to write and will sit patiently waiting for the rest of your input. In both the Cypher gadget in this course and the Neo4j Browser it is not needed and is silently ignored.

Adding Properties

To begin, let’s add the movie Mystic River to the dataset.

CREATE (movie:Movie {title: "Mystic River", released:1993})
RETURN movie;

Let’s say we wanted to add a tagline to the Mystic River :Movie node that we added. First we have to locate the single movie again by its title, then SET the property. Here is the query:

MATCH (m:Movie)
WHERE m.title = "Mystic River"
SET m.tagline = "We bury our sins here, Dave. We wash them clean."
RETURN m;

Because Neo4j is schema-free, you can add any property you want to any node or relationship.

What if you want to update a property? Mystic River was actually released in 2003, not 1993. We can fix that with the following query:

MATCH (m:Movie)
WHERE m.title = "Mystic River"
SET m.released = 2003
RETURN m;

So you can see that the syntax is the same for updating or adding a property. You SET a property. If it exists, it’ll update it. If not, it’ll add it.

Try it yourself

Updating a relationship property

Let’s try to change the role of Kevin Bacon in Mystic River from ["Sean"] to ["Sean Devine"].

First we should find the ACTED_IN relationship between the two using MATCH, and then use SET to update the property, just as we did when updating the :Movie node.

Solution:

MATCH (kevin)-[r:ACTED_IN]->(mystic)
WHERE kevin.name="Kevin Bacon"
AND mystic.title="Mystic River"
SET r.roles = ["Sean Devine"]
RETURN r;


Resources

Relationships


Adding Relationships

In the following video we query reviews to see the format they should take and then add a review of "The Matrix".

Adding a relationship is similar to adding a node, but we CREATE the relationship with the relationship syntax (n)-[:REL_TYPE {prop: value}]->(m).

Let’s create ourselves first in this new database:

CREATE (me:Person {name:"My Name"})
RETURN me.name

And then let’s rate the movie Mystic River (or any other movie that you want to rate).

MATCH (me:Person), (movie:Movie)
WHERE me.name="My Name" AND movie.title="Mystic River"
CREATE (me)-[r:REVIEWED {rating:80, summary:"tragic character movie"}]->(movie)
RETURN me, r, movie

or

MATCH (me:Person {name:"My Name"}),(movie:Movie {title:"Mystic River"})
CREATE (me)-[r:REVIEWED {rating:80, summary:"tragic character movie"}]->(movie)
RETURN me, r, movie

Try it yourself

Two Nodes, One Relationship

Let’s say we wanted to return all of the nodes that have relationships to another node. This is still going to return every single node that has a relationship to another node - along with the other node. But it’s moving us in an important direction, so stay with us for a little longer!

To describe this query, we’d write:

MATCH (n)-->(m)
RETURN n, m;

This will return every pair of nodes with a relationship going from n to m.

Exercise: Connect Kevin Bacon as an actor to Mystic River

Connect Kevin Bacon as an actor to the movie Mystic River with an ACTED_IN relationship whose roles property is ["Sean"] (it is an array, but the same property syntax applies).

Solution:

MATCH (m:Movie),(kevin:Person)
WHERE m.title="Mystic River" AND kevin.name="Kevin Bacon"
CREATE (kevin)-[:ACTED_IN {roles:["Sean"]}]->(m)

What about adding properties to nodes or relationships that already exist?

Exercise: Add Clint Eastwood as the director of Mystic River

Can you add a director for the Mystic River Movie? Clint Eastwood directed the movie. The relationship type is DIRECTED, and no properties are required.

First of all we have to find both again, like in Kevin Bacon’s case. That should be no problem. Try it. Now we have to create that simple DIRECTED relationship like we did before.

Solution

MATCH (clint:Person),(mystic:Movie)
WHERE clint.name="Clint Eastwood" AND mystic.title="Mystic River"
CREATE (clint)-[:DIRECTED]->(mystic)
RETURN clint, mystic;

If you want to make sure that only one relationship is created, no matter how often you run this statement, use MERGE instead. MERGE has get-or-create semantics: it tries to find the pattern you specify. If it finds the pattern, it returns the existing data; otherwise it creates the structure in the graph.

MATCH (clint:Person),(mystic:Movie)
WHERE clint.name="Clint Eastwood" AND mystic.title="Mystic River"
MERGE (clint)-[:DIRECTED]->(mystic)
RETURN clint, mystic;

Resources

Set

Deleting Nodes


Previously, we’d added ourselves to the graph. If you didn’t do that, add yourself now using the following query (replacing "My Name" with your name):

CREATE (me:Person {name:"My Name"});

Let’s then run the following query to make sure you have been added successfully to the graph.

MATCH (p:Person {name:"My Name"})
RETURN p.name;

Great, now let’s see in the following video how to remove a node from the graph.

So, to remove both yourself and any relationships you might or might not have, you need to run:

MATCH (me:Person {name:"My Name"})
OPTIONAL MATCH (me)-[r]-()
DELETE me,r;

It turns out there is another node in the graph that also needs to be deleted. Run the following query:

MATCH (matrix:Movie {title:"The Matrix"})<-[r:ACTED_IN]-(actor)
RETURN actor.name, r.roles;

It’s looking for and returning actors who played in the Matrix.

But wait, take a look at the results! Who is Emil? There is nobody (character or actor) named Emil in the Matrix. We need to delete this person.

Go ahead and delete Emil.

Exercise

Did it work?

No?

Then go ahead, check out the next section.

Deleting Nodes and Relationships

This statement deletes both the node and its relationships, and it also works when the node has no relationships.

MATCH (emil:Person {name:"Emil Eifrem"})
OPTIONAL MATCH (emil)-[r]-()
DELETE emil,r;

The first MATCH is straightforward: it finds the node we’re looking for. If you add a WHERE clause directly after it, that clause belongs to the first MATCH.

The second is an OPTIONAL MATCH. It tries to find relationships matching the pattern; if it doesn’t find any, it returns a single row with null values, so there is always at least one row for each node found by the first MATCH. You can also filter the optional match with a WHERE clause.
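To illustrate how a WHERE clause attaches to an OPTIONAL MATCH, here is a small sketch against the movie dataset. The choice of the DIRECTED relationship and the year 2000 is only an example, not something prescribed by the course.

```cypher
// Every person is kept in the result. For people who did not direct a
// movie released after 2000, `movie` is null and movie.title is null too.
MATCH (p:Person)
OPTIONAL MATCH (p)-[:DIRECTED]->(movie:Movie)
WHERE movie.released > 2000   // filters the optional part only
RETURN p.name, movie.title
LIMIT 10;
```

Had the same condition followed a plain MATCH instead, people without a matching movie would be dropped from the result rather than returned with null.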

As this is a frequent task, DETACH DELETE was added to Cypher; it deletes a node together with all of its relationships.

MATCH (emil:Person {name:"Emil Eifrem"})
DETACH DELETE emil;

Try them in the graph below.

Exercise: Delete Emil

Resources

Lab: All Characters in the Matrix


Using the syntax we’ve learnt so far, RETURN a list of all of the characters in the movie The Matrix.

Hint: Movies have the label Movie and a title property you want to compare to.

Hint: We’re looking for the characters (the roles which are a property of the ACTED_IN relationships) - not the names of the actors.

Solution:

MATCH (actor)-[r:ACTED_IN]->(movie:Movie)
WHERE movie.title = "The Matrix"
RETURN r.roles;

If you see all the usual suspects, you’re good.

Order, Limit, and Skip


Order

In Cypher it’s easy to order results using an ORDER BY clause. Let’s say we wanted to display the oldest people; we could use the following query:

MATCH (p:Person)
RETURN p.name, p.born
ORDER BY p.born

The query returns every person ordered by their year of birth, so it’ll display the oldest (smallest p.born) first.

Limit and Skip

Cypher supports easy pagination of record sets. It uses the SKIP and LIMIT clauses to reduce the number of records returned and to allow for paginating through the results.

So if we wanted to display the second page of actors and movies they played in, we might use the following query:

MATCH (a)-[:ACTED_IN]->(m)
RETURN a.name, m.title
SKIP 10 LIMIT 10;

We could also just use LIMIT if we just want the top-n elements within the result. Want to see the 5 oldest people in the system?

MATCH (p:Person)
RETURN p
ORDER BY p.born
LIMIT 5;

Distinct Results

Often you find yourself wanting to return only the distinct results for a query. For example, let’s look at the list of the oldest 5 actors. Initially we might try the following:

MATCH (a:Person)-[:ACTED_IN]->()
RETURN a
ORDER BY a.born
LIMIT 5

But if any of the five oldest actors played in more than one movie we’ll get them multiple times. So the query we really want to run is:

MATCH (a:Person)-[:ACTED_IN]->()
RETURN DISTINCT a
ORDER BY a.born
LIMIT 5

Actually we would write it like this:

MATCH (a:Person)
WHERE (a)-[:ACTED_IN]->()
RETURN a
ORDER BY a.born
LIMIT 5

Try it yourself:

Resources

Lab: Exploring an Unknown Graph


In this video, we’ll learn how you can explore an unknown graph easily using Cypher to gather some insight about the structure of the data:

In general, labels often give good insight into the types of nodes in a graph. From there you can return sample datasets to learn about properties and relationships usually attached to these nodes.

Learn about "People" in our Graph

Let’s find out a little more about the people in the system by querying the various relationships of the nodes with a Person label attached to them.
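One possible way to start such an exploration is sketched below as two separate queries, to be run one at a time. The aliases relType, targetLabels, and occurrences are illustrative names, not part of the dataset.

```cypher
// 1. Which label combinations exist in the graph?
MATCH (n)
RETURN DISTINCT labels(n);

// 2. Which relationship types leave Person nodes, what kind of node
//    do they point to, and how often does each combination occur?
MATCH (p:Person)-[r]->(target)
RETURN type(r) AS relType, labels(target) AS targetLabels, count(*) AS occurrences
ORDER BY occurrences DESC;
```

The first query visits every node, so on a large graph it is worth adding a LIMIT or restricting the pattern.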

Filters


Using WHERE

Cypher provides us with a number of mechanisms for reducing the number of matching patterns returned in a result set.

Let’s start with a simple query:

MATCH (p:Person)
WHERE p.name = "Tom Hanks"
RETURN p;

This will look through all of the nodes in the graph with a label of Person and if one of them has the name "Tom Hanks", it’ll RETURN that node.

There is a shorter version of this query, which adds the properties to filter by to the MATCH clause.

MATCH (p:Person {name:"Tom Hanks"})
RETURN p;

Filtering using Comparisons

We can also filter by comparing properties of different nodes. For example, we could RETURN all of the actors who acted with Tom Hanks and are older than him:

MATCH (tom:Person)-[:ACTED_IN]->()<-[:ACTED_IN]-(a:Person)
WHERE tom.name="Tom Hanks"
AND a.born < tom.born
RETURN a.name;

Note that we didn’t bother to put (movie) in the middle - just () as we don’t need to know anything about the movie they worked together in.

We can even add a little math to the RETURN clause along with an alias to show us the difference in ages:

MATCH (tom:Person {name:"Tom Hanks"})-[:ACTED_IN]->(movie),
(movie)<-[:ACTED_IN]-(a:Person)
WHERE a.born < tom.born
RETURN DISTINCT a.name, (tom.born - a.born) AS diff;

Math or more general expressions can be used almost everywhere in Cypher.

Filtering using Patterns: A Few Examples

So far, we used paths as part of a MATCH clause, but it is also possible to use paths as filter expressions in the WHERE clause.

How would we find all the actors who worked with Gene Hackman?

MATCH (gene:Person)-[:ACTED_IN]->()<-[:ACTED_IN]-(other)
WHERE gene.name="Gene Hackman"
RETURN DISTINCT other;

Now, how do we filter those actors to only actor-directors?

MATCH (gene:Person)-[:ACTED_IN]->()<-[:ACTED_IN]-(other)
WHERE gene.name="Gene Hackman"
AND (other)-[:DIRECTED]->()
RETURN DISTINCT other;

Here’s a more complex example: actors who worked with Gene Hackman, but not in a movie in which he was also working with Robin Williams.

MATCH (gene:Person {name:"Gene Hackman"})-[:ACTED_IN]->(movie),
  (other)-[:ACTED_IN]->(movie),
  (robin:Person {name:"Robin Williams"})
WHERE NOT (robin)-[:ACTED_IN]->(movie)
RETURN DISTINCT other;

Resources

Lab: Exploring the Movie Database


Tom Hanks' Filmography

Now let’s try to answer some more interesting questions. If you wanted to find all of the movies that Tom Hanks acted in, how might you do that?

Solution:

MATCH (tom:Person)-[:ACTED_IN]->(movie)
WHERE tom.name="Tom Hanks"
RETURN movie.title;

Great. What if you wanted to limit that to movies which were released after 2000? (There is a released property on movies):

Solution:

MATCH (tom:Person)-[:ACTED_IN]->(movie)
WHERE tom.name="Tom Hanks"
AND movie.released > 2000
RETURN movie.title;

Find all movies Keanu Reeves has acted in

Solution:

MATCH (keanu:Person)-[:ACTED_IN]->(movie)
WHERE keanu.name = "Keanu Reeves"
RETURN movie.title;

Great. And what about movies he acted in where he played the role (Neo)?

  • Hint: you need an identifier for the relationship

  • Hint: the ACTED_IN relationship has a roles property (which is an array)

  • Hint: the syntax for seeing whether an element is in an array is {element} IN r.roles

Syntax Guide

{expression} IN {collection}

Checks for the existence of the value of {expression} in the {collection}

Solution:

MATCH (keanu:Person)-[r:ACTED_IN]->(movie)
WHERE keanu.name="Keanu Reeves"
AND "Neo" IN r.roles
RETURN movie.title;

Resources

Lab: Friends of Friends WORK IN PROGRESS


TODO: Review the KNOWS Relationship

  1. Try to RETURN all of Keanu Reeves' friends of friends (using the KNOWS relationships that were created earlier).

  2. How would you refine this to get friends of friends who are not his immediate friends?

Solution 1

MATCH (keanu:Person)-[:KNOWS*2]->(fof)
WHERE keanu.name = "Keanu Reeves"
RETURN DISTINCT fof.name;

Solution 2

MATCH (keanu:Person)-[:KNOWS*2]->(fof)
WHERE keanu.name = "Keanu Reeves"
AND NOT((keanu)-[:KNOWS]-(fof))
RETURN DISTINCT fof.name;

Resources

Matching Paths


What is a Path?

A path is a series of connected nodes and relationships. Paths can be matched by a pattern.

What can we do with Paths?

So, what could we do with these paths? Let’s say we wanted to display all of the directors that every actor has worked with, along with the titles of the movies they worked together on. We could do that with the following query:

MATCH (actor:Person)-[:ACTED_IN]->(movie:Movie),
(movie)<-[:DIRECTED]-(director:Person)
RETURN actor.name, movie.title, director.name;

It is looking for an (actor) that has an ACTED_IN relationship to a (movie). At the same time it is looking for a (director) who DIRECTED that same (movie). So for every combination of actor and director in each movie it’ll return a result. If there were 10 actors and one director in one movie, that’d be 10 results. If the movie had two directors there would be 20 results - each actor with the first director and each actor with the second director.

While we’re talking about actors being represented by (actor) and directors by (director), strictly speaking the query doesn’t know anything about the nodes. It’s just asking for all nodes (that we’ll refer to as actor) with an ACTED_IN relationship to a movie and nodes (that we’ll refer to as director) that have a DIRECTED relationship with the same movie.

Variable Length Paths

In Cypher, we can describe variable length paths using a star: *

MATCH (node1)-[*]-(node2)

If you want to traverse relationships up to four levels deep it’d be (a)-[*1..4]->(b)

If you want to traverse any depth it’s simply (a)-[*]->(b)

For a specific depth of relationships it’s (a)-[*3]->(b) to find all connections exactly three relationships away.
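As a hedged sketch of a bounded variable-length pattern on the movie dataset, the query below looks for people reachable from Keanu Reeves within one to four relationships of any type, in either direction. The starting person, the bounds, and the LIMIT are illustrative assumptions.

```cypher
// People connected to Keanu Reeves by 1 to 4 relationships of any type.
// The upper bound keeps the search from fanning out over the whole graph.
MATCH (keanu:Person {name:"Keanu Reeves"})-[*1..4]-(other:Person)
WHERE other <> keanu
RETURN DISTINCT other.name
LIMIT 10;
```

An unbounded pattern such as (a)-[*]->(b) can become very expensive on a well-connected graph, so an explicit upper bound is usually the safer choice.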

Alternative Notations


There are a number of different ways we could write the query we just examined:

MATCH (actor:Person)-[:ACTED_IN]->(movie:Movie)<-[:DIRECTED]-(director:Person)
RETURN actor.name, movie.title, director.name;

That’s nice, but especially for long paths the match might not fit into a single path expression, so we can break it down into two separate segments using a comma:

MATCH (actor:Person)-[:ACTED_IN]->(movie:Movie),
(movie)<-[:DIRECTED]-(director:Person)
RETURN actor.name, movie.title, director.name;

In this form, we’re "taking a breath" with the comma, but we still want to return all of the actors who acted in a movie together with the directors of those movies.

This expresses exactly the same query and will return the same record set.

Note that we have repeated the identifier movie in both segments of the MATCH clause.

This is critical.

If we didn’t do this we’d get a very different record set as it is that shared identifier that connects the two segments of the match clause.
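To make the difference concrete, here is a sketch of what the query might look like if the second segment used a different identifier. The name otherMovie is hypothetical and chosen only for this illustration.

```cypher
// `movie` and `otherMovie` are unrelated identifiers, so the two segments
// are matched independently: every actor/movie pair is combined with every
// director/movie pair (a Cartesian product), whether or not they share a movie.
MATCH (actor:Person)-[:ACTED_IN]->(movie:Movie),
      (otherMovie:Movie)<-[:DIRECTED]-(director:Person)
RETURN actor.name, movie.title, director.name
LIMIT 10;
```

The rows returned here pair actors with directors they may never have worked with, which is why the shared identifier matters.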

There is yet another way we could express exactly the same query:

MATCH (actor:Person)-[:ACTED_IN]->(movie:Movie),(director:Person)-[:DIRECTED]->(movie)
RETURN actor.name, movie.title, director.name;

Notice that the director element of the match clause is the other way round. However, the directionality (shown by the arrow) is still the same. So the following two snippets are identical as far as Cypher (and common sense) are concerned.

Identical Cypher snippets:

(movie)<-[:DIRECTED]-(director)
(director)-[:DIRECTED]->(movie)

Resources

Returning Paths


Returning Paths

In addition to being able to match paths, we can name paths and RETURN them as part of the result. So we could take a version of the last query we used and name the entire path, returning that:

MATCH p=(a)-[:ACTED_IN]->(m)<-[:DIRECTED]-(d)
RETURN p;

This will RETURN all of the nodes and relationships for each path - including all of their properties. That’s interesting, but can be too much data, so we might use the nodes() function just to RETURN the nodes in the path:

MATCH p=(a)-[:ACTED_IN]->(m)<-[:DIRECTED]-(d)
RETURN nodes(p);

There is a similar function for relationships:

MATCH p=(a)-[:ACTED_IN]->(m)<-[:DIRECTED]-(d)
RETURN rels(p);

Note that only connected patterns can be used to create named paths. If you have two patterns in your MATCH clause with a comma between them, you’d have to RETURN the results as two separate named paths:

MATCH p1=(a)-[:ACTED_IN]->(m), p2=(d)-[:DIRECTED]->(m)
RETURN p1, p2;

Exercise: Directors acting in their movies

We’ve already seen how to RETURN all of the actors and directors in all of the movies:

MATCH (a)-[:ACTED_IN]->(m)<-[:DIRECTED]-(d)
RETURN a.name, m.title, d.name;

How would you change this query to RETURN only the directors who acted in their own movies? Return people who both acted and directed in the same movie and display their name.

Hint: If you’re having trouble, what would happen if you replaced the (d) and d.name with (a) and a.name? Does that work? Why? How could you simplify that query?

Solution:

MATCH (a)-[:ACTED_IN]->(m)<-[:DIRECTED]-(a)
RETURN a.name;

Resources

Indexing and Labels


Create an Index

Unlike other databases, Neo4j doesn’t use indexes to speed up JOIN operations. However, indexes perform well for finding your starting points by value, textual prefix, or range. You create a label-specific index, as indexes are bound to a concrete label-property combination.

So, if you want to be able to search efficiently for Movies based on their title, you might run the following Cypher command:

CREATE INDEX ON :Movie(title);

How would you create an index for searching people by name?

Solution

CREATE INDEX ON :Person(name);

And you don’t need to do anything to your queries to use these indexes. If you run the commands above, and you run a query like:

MATCH (gene:Person)-[:ACTED_IN]->(m),(other)-[:ACTED_IN]->(m)
WHERE gene.name="Gene Hackman"
RETURN DISTINCT other;

The lookup of Gene Hackman will now be much faster - although with a small test data set the difference may not be noticeable.

Exercise: Using Indexes

Try running the query above once with an index and once without.

Create a Label-Specific Index

CREATE INDEX ON :Person(name);
  • nodes labeled as Person, indexed by name property

CREATE INDEX ON :Movie(title);
  • nodes labeled as Movie, indexed by title property

Anchor Pattern Nodes in the Graph

Return movies featuring both Tom Hanks and Kevin Bacon

MATCH (tom:Person)-[:ACTED_IN]->(movie),(kevin:Person)-[:ACTED_IN]->(movie)
WHERE tom.name="Tom Hanks" AND kevin.name="Kevin Bacon"
RETURN DISTINCT movie;

You can anchor one or more nodes of your pattern in the graph, for example by constraining their properties so that they match a single node. Pattern matching then works much faster, as Cypher doesn’t have to scan the whole graph to apply the pattern. If there is an index, Cypher will automatically use it.

You can see that in Neo4j if you prefix your query with PROFILE or EXPLAIN.
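For instance, the query above could be prefixed as sketched here. EXPLAIN only shows the plan without running the query, while PROFILE runs it and reports what each step did.

```cypher
// Show the execution plan without executing the query.
EXPLAIN
MATCH (tom:Person)-[:ACTED_IN]->(movie),(kevin:Person)-[:ACTED_IN]->(movie)
WHERE tom.name="Tom Hanks" AND kevin.name="Kevin Bacon"
RETURN DISTINCT movie;
```

Replacing EXPLAIN with PROFILE executes the query and adds row counts and database hits per operator. In many Neo4j versions an index-backed lookup appears in the plan as an operator such as NodeIndexSeek, while a scan over all nodes with a label appears as NodeByLabelScan; the exact operator names can differ between versions.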

Create the Index and Run the Query

Resources

Aggregation


Cypher provides support for a number of aggregate functions:

count(x) Count the number of occurrences

min(x) Get the lowest value

max(x) Get the highest value

avg(x) Get the average of a numeric value

sum(x) Sum up values

collect(x) Collect all the values into a collection

More on aggregate functions can be found in the Neo4j Manual.
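As a small sketch of the numeric aggregates, the query below summarizes the born property of all Person nodes in the movie dataset. The aliases are illustrative names only.

```cypher
// One summary row over all Person nodes.
MATCH (p:Person)
RETURN count(p)    AS people,
       min(p.born) AS earliestBirthYear,
       max(p.born) AS latestBirthYear,
       avg(p.born) AS averageBirthYear;
```

Aggregate functions skip null values, so people without a born property are counted by count(p) but do not affect min, max, or avg.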

Here’s an example of using aggregates to get a list of all of the movies that an actor acted in:

Collect

Let’s say we wanted to display all movie titles that an actor participated in, aka the filmography. We could use the following query:

MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
RETURN p.name, collect(m.title);

For every Person who has acted in at least one movie, the query will RETURN their name and an array of strings containing the movie titles.

Let’s look closer at the graph and at Tom Hanks' movies.

Here are some more examples.

How would you RETURN all director names that each actor has ever worked with?

Solution:

MATCH (p:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person)
RETURN p.name, collect(d.name);

But if the actor worked several times with the same director, they would appear repeatedly, so we can use DISTINCT here as well to collect only the distinct set of director names.

MATCH (p:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person)
RETURN p.name, collect(DISTINCT d.name) as directors;

Count

Let’s say we wanted to RETURN the number of movies that an actor had acted in. What would that query look like?

Solution:

MATCH (a:Person)-[:ACTED_IN]->(m)
RETURN a.name, count(m);

What about the number of films that an actor and director have jointly worked in?

Solution:

MATCH (a:Person)-[:ACTED_IN]->(m)<-[:DIRECTED]-(d)
RETURN a.name, d.name, count(m);

Top n

Oftentimes you’re interested in the top-n results of a count aggregation. This is achieved by counting first, then ordering the results in DESCending order, and then LIMITing the results to the top n. If we were interested in the top ten actors who acted in the most movies, the query would look like this:

MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
RETURN p.name, count(m)
ORDER BY count(m) DESC
LIMIT 10;

Exercise: Aggregation

Write some aggregation queries like the ones above on our dataset.

Resources

Lab: Who are the five busiest actors?


Try to come up with a query that will display the five busiest actors, i.e., the ones who have been in the most movies.

Hint: Use aggregation and ordering

Solution:

MATCH (a:Person)-[:ACTED_IN]->(m)
RETURN a.name, count(m)
ORDER BY count(m) DESC
LIMIT 5;

Lab: Recommendation Engine


Exercise: Recommend three actors that Keanu Reeves should work with (but hasn’t).

This is a kind of friend-of-a-friend query, except that we don’t have FRIEND relationships here but co-acting in a movie (ACTED_IN), so the query is a bit more verbose. There are different approaches to the recommendation. One reasonable heuristic is that the three people who appear most frequently in the network of Keanu's co-actors' co-actors are good candidates.

Solution:

MATCH (keanu:Person {name:"Keanu Reeves"})-[:ACTED_IN]->()<-[:ACTED_IN]-(c),
      (c)-[:ACTED_IN]->()<-[:ACTED_IN]-(coc)
WHERE coc <> keanu  AND NOT((keanu)-[:ACTED_IN]->()<-[:ACTED_IN]-(coc))
RETURN coc.name, count(coc)
ORDER BY count(coc) DESC
LIMIT 3;
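To see why a candidate is recommended, one option is to also collect the co-actors who connect Keanu Reeves to that candidate. This is only a sketch of one possible variation; the aliases strength and via are illustrative names, not part of the dataset.

```cypher
// For each candidate, count the connecting paths and list the
// intermediate co-actors through whom the candidate was reached.
MATCH (keanu:Person {name:"Keanu Reeves"})-[:ACTED_IN]->()<-[:ACTED_IN]-(c),
      (c)-[:ACTED_IN]->()<-[:ACTED_IN]-(coc)
WHERE coc <> keanu AND NOT((keanu)-[:ACTED_IN]->()<-[:ACTED_IN]-(coc))
RETURN coc.name, count(coc) AS strength, collect(DISTINCT c.name) AS via
ORDER BY strength DESC
LIMIT 3;
```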

Importing Data


Developing a Graph Model

Throughout this course, we have assumed that the data already exists in the database or is small enough to enter by hand. However, what if you want to explore an existing external dataset? How would you import data from a spreadsheet or a relational database?

If you want to import data from, say, a CSV, first you will need to develop a graph model describing what piece of source data goes where in your graph.
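One way to write such a model down is as a small Cypher pattern. The sketch below assumes the model used in the rest of this section (Person and Movie nodes connected by ACTED_IN) and uses the first rows of the sample files as values. It illustrates the target shape and is not meant to be run before the import, since it would create a duplicate movie.

```cypher
// Target shape for one person, one movie and the role connecting them.
// The ids are strings because LOAD CSV reads every field as a string.
CREATE (:Person { id:"1", name:"Charlie Sheen" })
       -[:ACTED_IN { roles:["Bud Fox"] }]->
       (:Movie { id:"1", title:"Wall Street", released:1987 });
```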

Importing Normalized Data using LOAD CSV

Cypher provides an elegant built-in way to import tabular CSV data into graph structures.

The LOAD CSV clause parses a local or remote file into a stream of rows which represent maps (with headers) or lists. Then you can use whatever Cypher operations you want to apply to either CREATE nodes or relationships or to MERGE with existing graph structures.

As CSV files usually represent either node- or relationship-lists, you run multiple passes to create nodes and relationships separately.

The movies.csv file (sample below) holds the data that will populate the Movie nodes.

id,title,country,year
1,Wall Street,USA,1987
2,The American President,USA,1995
3,The Shawshank Redemption,USA,1994

The following query CREATEs the Movie nodes using the data from movies.csv as properties.

LOAD CSV WITH HEADERS
FROM "http://neo4j.com/docs/stable/csv/intro/movies.csv"
AS line
CREATE (m:Movie { id:line.id, title:line.title, released:toInt(line.year) });
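Before creating anything, it can help to preview how LOAD CSV presents the rows. The following sketch reads the same movies.csv file without writing to the graph and shows both forms: maps when WITH HEADERS is used, lists otherwise. It uses toInt to match the queries in this course; newer Neo4j releases name this function toInteger. Run the two statements separately.

```cypher
// With headers: each row is a map keyed by column name.
LOAD CSV WITH HEADERS
FROM "http://neo4j.com/docs/stable/csv/intro/movies.csv"
AS line
RETURN line.title, toInt(line.year) AS released
LIMIT 3;

// Without headers: each row is a list, addressed by position.
// The header line itself is then returned as the first row.
LOAD CSV
FROM "http://neo4j.com/docs/stable/csv/intro/movies.csv"
AS row
RETURN row[1] AS title, row[3] AS year
LIMIT 3;
```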

The persons.csv file (sample below) holds the data that will populate the :Person nodes.

id,name
1,Charlie Sheen
2,Oliver Stone
3,Michael Douglas
4,Martin Sheen
5,Morgan Freeman

In case you already have people in your database, you want to make sure not to create duplicates. That’s why, instead of just creating them, we use MERGE to ensure unique entries after the import. We only need to set a person's name when the node is newly created, so we use the ON CREATE feature.

LOAD CSV WITH HEADERS
FROM "http://neo4j.com/docs/stable/csv/intro/persons.csv"
AS line
MERGE (a:Person { id:line.id })
ON CREATE SET a.name=line.name;

The roles.csv file (sample below) holds the data that will populate the relationships between the nodes.

personId,movieId,role
1,1,Bud Fox
4,1,Carl Fox
3,1,Gordon Gekko
4,2,A.J. MacInerney
3,2,President Andrew Shepherd
5,3,Ellis Boyd 'Red' Redding

The query below matches line.personId and line.movieId to their respective :Person and :Movie nodes via their key property, id, and CREATEs an ACTED_IN relationship between the person and the movie. The relationship carries a roles property, a list that holds the value of line.role.

LOAD CSV WITH HEADERS
FROM "http://neo4j.com/docs/stable/csv/intro/roles.csv"
AS line
MATCH (m:Movie { id:line.movieId })
MATCH (p:Person { id:line.personId })
CREATE (p)-[:ACTED_IN { roles: [line.role]}]->(m);
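After the three passes, a quick read-only query can confirm that the relationships landed where expected. This is a minimal check, assuming the three imports above ran against the same database:

```cypher
// List every imported person, their roles and the movie they acted in.
MATCH (p:Person)-[r:ACTED_IN]->(m:Movie)
RETURN p.name, r.roles, m.title
ORDER BY m.title, p.name;
```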

Importing Denormalized Data

If your file contains denormalized data, you can either process the same file in multiple passes with simple operations as shown above, or use MERGE to create nodes and relationships uniquely, only where they do not exist yet.

For our use-case we can import the data using a CSV structure like this:

movie_actor_roles.csv
title;released;actor;born;characters
Back to the Future;1985;Michael J. Fox;1961;Marty McFly
Back to the Future;1985;Christopher Lloyd;1938;Dr. Emmet Brown

LOAD CSV WITH HEADERS
FROM "http://neo4j.com/docs/stable/csv/intro/movie_actor_roles.csv"
AS line FIELDTERMINATOR ";"
MERGE (m:Movie { title:line.title })
ON CREATE SET m.released = toInt(line.released)
MERGE (a:Person { name:line.actor })
ON CREATE SET a.born = toInt(line.born)
MERGE (a)-[r:ACTED_IN]->(m)
ON CREATE SET r.roles = split(line.characters,",");

For large denormalized files, it might still make sense to create nodes and relationships separately in multiple passes. That would depend on the complexity of the operations and the experienced performance.
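A multi-pass import of the same denormalized file could look like the sketch below. It reuses the file and properties from the single-pass query above and only splits the work into three statements, to be run one after the other. Whether this is faster depends on your data and setup.

```cypher
// Pass 1: create each movie once.
LOAD CSV WITH HEADERS
FROM "http://neo4j.com/docs/stable/csv/intro/movie_actor_roles.csv"
AS line FIELDTERMINATOR ";"
MERGE (m:Movie { title:line.title })
ON CREATE SET m.released = toInt(line.released);

// Pass 2: create each actor once.
LOAD CSV WITH HEADERS
FROM "http://neo4j.com/docs/stable/csv/intro/movie_actor_roles.csv"
AS line FIELDTERMINATOR ";"
MERGE (a:Person { name:line.actor })
ON CREATE SET a.born = toInt(line.born);

// Pass 3: look up the existing nodes and connect them.
LOAD CSV WITH HEADERS
FROM "http://neo4j.com/docs/stable/csv/intro/movie_actor_roles.csv"
AS line FIELDTERMINATOR ";"
MATCH (m:Movie { title:line.title })
MATCH (a:Person { name:line.actor })
MERGE (a)-[r:ACTED_IN]->(m)
ON CREATE SET r.roles = split(line.characters,",");
```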

Importing a Large Dataset

If you import a larger amount of data (more than 10000 rows), it is recommended to prefix your LOAD CSV clause with a USING PERIODIC COMMIT hint. This lets the database commit the import in regular batches, which avoids the memory pressure of one large transaction state.
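As a sketch, the roles import from above with periodic commits might look like this. The batch size of 1000 is an illustrative value, not a recommendation from this course. Newer Neo4j releases have replaced this hint with a different batching syntax, so check the manual for your version.

```cypher
// Commit after every 1000 processed rows instead of once at the end.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS
FROM "http://neo4j.com/docs/stable/csv/intro/roles.csv"
AS line
MATCH (m:Movie { id:line.movieId })
MATCH (p:Person { id:line.personId })
CREATE (p)-[:ACTED_IN { roles: [line.role]}]->(m);
```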

Community Resources and Next Steps


In-Person Training

If you enjoyed the online version of our training and want to continue to learn about Cypher, you might want to check out our in-person, hands-on training classes that are delivered around the world by experienced instructors.

Cypher Community

No course can answer all of the questions you might have about a technology, but fortunately Neo4j, the creator of Cypher, has a really active community.

If you’d like to learn more about using Neo4j, start with the Neo4j developer pages. They cover a range of topics for learning more about Neo4j.

If you have specific questions or problems, please ask on Stack Overflow, where the most knowledgeable people can help you quickly. Come to the Neo4j Google Group if you want to discuss graph modeling questions, the Neo4j ecosystem or product features. You can also join our public Slack channel to get quick answers.

If you’d like to learn more about the company behind Cypher, its product, and the enterprise features and support packages it provides, check out [neo4j.com](http://neo4j.com/product). There you can also find a lot of information about use-cases and solutions.

Follow Neo4j on Twitter @Neo4j, and find local Neo4j meetups to connect with other developers interested in graph databases.

Links