Intro to the dataset

Welcome to the first of a set of interactive guides. In these guides we’ll evolve a dataset containing the connections between US airports in 2008.

Let’s get started!

Start with a question

When modeling a dataset it always helps to have a use case in mind. We’ll start with the following question:

As an air travel enthusiast
I want to know how airports are connected
So that I can find the busiest ones

This is the model that we’re going to create:

Manually creating the model

Before we start working with a large dataset let’s create some nodes and relationships manually. First we’ll create some airports:

CREATE (:Airport {code: "LAX"})
CREATE (:Airport {code: "LAS"})
CREATE (:Airport {code: "ABQ"})

We can find LAX by changing the CREATE to a MATCH and returning the matched node:

MATCH (lax:Airport {code: "LAX"})
RETURN lax

Create relationships

Now let’s create some connections between those airports.

MATCH (las:Airport {code: "LAS"})
MATCH (lax:Airport {code: "LAX"})
CREATE (las)-[connection:CONNECTED_TO {
  airline: "WN",
  flightNumber: "82",
  date: "2008-1-3",
  departure: 1715,
  arrival: 1820}]->(lax)

We can check that the relationship was created correctly by executing the following query:

MATCH connection = (las:Airport {code: "LAS"})-[:CONNECTED_TO]->(lax:Airport {code: "LAX"})
RETURN connection

Create relationships idempotently

If we were to run our relationship creating query again we’d end up creating the same relationship twice. To avoid this we can use Cypher’s MERGE keyword instead of CREATE.

Create relationships idempotently

When using the MERGE command, we only need to inline the properties that make the CONNECTED_TO relationship unique. In this case it’s the combination of airline, flightNumber, and date. To idempotently create a specific connection between airports we can run the following query:

MATCH (las:Airport {code: "LAS"})
MATCH (lax:Airport {code: "LAX"})
MERGE (las)-[connection:CONNECTED_TO { airline: "WN", flightNumber: "82", date: "2008-1-3"}]->(lax)
ON CREATE SET connection.departure = 1715, connection.arrival = 1820

Let’s try it with another connection to get the hang of it:

MATCH (las:Airport {code: "LAS"})
MATCH (lax:Airport {code: "ABQ"})
MERGE (las)-[connection:CONNECTED_TO { airline: "WN", flightNumber: "500", date: "2008-1-3"}]->(lax)
ON CREATE SET connection.departure = 1445, connection.arrival = 1710

Try running the query multiple times. The relationship will only be created once.

All the connections leaving `LAS`

We can now find any connections leaving LAS:

MATCH connection = (las:Airport {code: "LAS"})-[:CONNECTED_TO]->(:Airport)
RETURN connection

The modeling workflow

Let’s quickly have a look at the workflow that we’ve just followed and introduce the LOAD CSV that we’ll be using in the next section:

Exploring data with `LOAD CSV`

While we’re working out the appropriate model for our dataset it’s much easier to work with a subset of the data so that we can iterate quickly. A smaller dataset containing the first 1,000 connections lives in flights_1k.csv.

We can run the following query to see what data we’ve got to work with:

LOAD CSV WITH HEADERS FROM "file:///flights_1k.csv" AS row
RETURN row
LIMIT 5

This query:

loads the file flights_1k.csv
iterates over the file, referring to each line as the variable row
and returns the first 5 lines in the file

We’ve got lots of different fields but the ones that will be helpful for answering our question are: Origin, Dest, and FlightNum.

Importing connections and airports

Run the following query to create nodes and relationships for these connections:

LOAD CSV WITH HEADERS FROM "file:///flights_1k.csv" AS row
MERGE (origin:Airport {code: row.Origin})
MERGE (destination:Airport {code: row.Dest})
MERGE (origin)-[connection:CONNECTED_TO {
  airline: row.UniqueCarrier,
  flightNumber: row.FlightNum,
  date: toInteger(row.Year) + "-" + toInteger(row.Month) + "-" + toInteger(row.DayofMonth)}]->(destination)
ON CREATE SET connection.departure = toInteger(row.CRSDepTime), connection.arrival = toInteger(row.CRSArrTime)

This query:

iterates through each row in the file
creates nodes with the Airport label for the origin and destination airports if they don’t already exist
creates a connection relationship between origin and destination airports for each row in the file

By default properties will be stored as strings. We know that year, month, and day are actually numeric values so we’ll coerce them using the toInteger function.

Now we’re ready to start querying the data.

Finding the most popular airports

We can see some of what we’ve imported by writing the following query, which finds the airports with the most outgoing connections.

MATCH (a:Airport)-[:CONNECTED_TO]->()
RETURN a, COUNT(*) AS outgoing
ORDER BY outgoing DESC
LIMIT 10

This query:

finds every node with the Airport label
finds all outgoing CONNECTED_TO relationships
counts them up grouped by airport
returns the Airport nodes and the outgoing count in descending order by outgoing
limits the number of airports returned to 10

Exercise: Finding connections

Now it’s your turn! Try and write queries to answer the following questions:

Find the airports that have the most incoming connections
Find all the connections into Las Vegas (LAS)
Find all the connections from Las Vegas (LAS) to Los Angeles (LAX)

Hint You’ll want to refer to the Cypher refcard for the syntax for the second question.

Click through for the answers

If you really want to see them…

Answer: Find the airports that have the most incoming connections

MATCH (a:Airport)<-[:CONNECTED_TO]-()
RETURN a, COUNT(*) AS incoming
ORDER BY incoming DESC
LIMIT 10

Answer: Find all the connections into Las Vegas (`LAS`)

MATCH (origin:Airport)-[connection:CONNECTED_TO]->(destination:Airport)
WHERE destination.code = "LAS"
RETURN origin, destination, connection
LIMIT 10

MATCH (origin:Airport)-[connection:CONNECTED_TO]->(destination:Airport {code: "LAS"})
RETURN origin, destination, connection
LIMIT 10

Answer: Find all the connections from Las Vegas (`LAS`) to Los Angeles (`LAX`)

MATCH (origin:Airport {code: "LAS"})-[connection:CONNECTED_TO]->(destination:Airport {code: "LAX"})
RETURN origin, destination, connection
LIMIT 10

Finding specific connections

The model has worked well so far. We’ve been able to find the popular airports and find the connections between pairs of airports without much trouble.

What about if we want to find all the occurrences of a specific connection?

As an air travel enthusiast
I want to know the schedule for <flight number>
So that I know when I’ll be able to spot those planes taking off and landing

Finding connection `WN 1016`

Our next query finds all the instances of connection WN 1016:

MATCH  (origin:Airport)-[connection:CONNECTED_TO]->(destination:Airport)
WHERE connection.airline = "WN" AND connection.flightNumber = "1016"
RETURN origin.code, destination.code, connection.date, connection.departure, connection.arrival

It’s still reasonably quick because we only have 1,000 rows, but under the covers we’re actually doing a lot of unnecessary work.

Next Step

In the next section we’re going to learn how to profile queries so that we can see what work’s actually being done. We’ll then refactor the model to introduce Flight as a first class concept.

Flight as a first class citizen

Intro to the dataset

Intro to the dataset

Start with a question

Manually creating the model

Create relationships

Create relationships idempotently

Create relationships idempotently

All the connections leaving LAS

The modeling workflow

Exploring data with LOAD CSV

Importing connections and airports

Finding the most popular airports

Exercise: Finding connections

Click through for the answers

Answer: Find the airports that have the most incoming connections

Answer: Find all the connections into Las Vegas (LAS)

Answer: Find all the connections from Las Vegas (LAS) to Los Angeles (LAX)

Finding specific connections

Finding connection WN 1016

Next Step

All the connections leaving `LAS`

Exploring data with `LOAD CSV`

Answer: Find all the connections into Las Vegas (`LAS`)

Answer: Find all the connections from Las Vegas (`LAS`) to Los Angeles (`LAX`)

Finding connection `WN 1016`