Intro to the dataset

Intro to the dataset


Welcome to the first of a set of interactive guides. In these guides we’ll evolve a dataset containing the connections between US airports in 2008.

Let’s get started!

Start with a question


When modeling a dataset it always helps to have a use case in mind. We’ll start with the following question:

As an air travel enthusiast
I want to know how airports are connected
So that I can find the busiest ones

This is the model that we’re going to create:

initial

Manually creating the model


Before we start working with a large dataset let’s create some nodes and relationships manually. First we’ll create some airports:

CREATE (:Airport {code: "LAX"})
CREATE (:Airport {code: "LAS"})
CREATE (:Airport {code: "ABQ"})

We can find LAX by changing the CREATE to a MATCH and returning the matched node:

MATCH (lax:Airport {code: "LAX"})
RETURN lax

Create relationships


Now let’s create some connections between those airports.

MATCH (las:Airport {code: "LAS"})
MATCH (lax:Airport {code: "LAX"})
CREATE (las)-[connection:CONNECTED_TO {
  airline: "WN",
  flightNumber: "82",
  date: "2008-1-3",
  departure: 1715,
  arrival: 1820}]->(lax)

We can check that the relationship was created correctly by executing the following query:

MATCH connection = (las:Airport {code: "LAS"})-[:CONNECTED_TO]->(lax:Airport {code: "LAX"})
RETURN connection

Create relationships idempotently


If we were to run our relationship creating query again we’d end up creating the same relationship twice. To avoid this we can use Cypher’s MERGE keyword instead of CREATE.

slides

Create relationships idempotently


When using the MERGE command, we only need to inline the properties that make the CONNECTED_TO relationship unique. In this case it’s the combination of airline, flightNumber, and date. To idempotently create a specific connection between airports we can run the following query:

MATCH (las:Airport {code: "LAS"})
MATCH (lax:Airport {code: "LAX"})
MERGE (las)-[connection:CONNECTED_TO { airline: "WN", flightNumber: "82", date: "2008-1-3"}]->(lax)
ON CREATE SET connection.departure = 1715, connection.arrival = 1820

Let’s try it with another connection to get the hang of it:

MATCH (las:Airport {code: "LAS"})
MATCH (lax:Airport {code: "ABQ"})
MERGE (las)-[connection:CONNECTED_TO { airline: "WN", flightNumber: "500", date: "2008-1-3"}]->(lax)
ON CREATE SET connection.departure = 1445, connection.arrival = 1710

Try running the query multiple times. The relationship will only be created once.

All the connections leaving LAS


We can now find any connections leaving LAS:

MATCH connection = (las:Airport {code: "LAS"})-[:CONNECTED_TO]->(:Airport)
RETURN connection

The modeling workflow


Let’s quickly have a look at the workflow that we’ve just followed and introduce the LOAD CSV that we’ll be using in the next section:

slides

Exploring data with LOAD CSV


While we’re working out the appropriate model for our dataset it’s much easier to work with a subset of the data so that we can iterate quickly. A smaller dataset containing the first 1,000 connections lives in flights_1k.csv.

We can run the following query to see what data we’ve got to work with:

LOAD CSV WITH HEADERS FROM "file:///flights_1k.csv" AS row
RETURN row
LIMIT 5

This query:

  • loads the file flights_1k.csv

  • iterates over the file, referring to each line as the variable row

  • and returns the first 5 lines in the file

We’ve got lots of different fields but the ones that will be helpful for answering our question are: Origin, Dest, and FlightNum.

Importing connections and airports


Run the following query to create nodes and relationships for these connections:

LOAD CSV WITH HEADERS FROM "file:///flights_1k.csv" AS row
MERGE (origin:Airport {code: row.Origin})
MERGE (destination:Airport {code: row.Dest})
MERGE (origin)-[connection:CONNECTED_TO {
  airline: row.UniqueCarrier,
  flightNumber: row.FlightNum,
  date: toInteger(row.Year) + "-" + toInteger(row.Month) + "-" + toInteger(row.DayofMonth)}]->(destination)
ON CREATE SET connection.departure = toInteger(row.CRSDepTime), connection.arrival = toInteger(row.CRSArrTime)

This query:

  • iterates through each row in the file

  • creates nodes with the Airport label for the origin and destination airports if they don’t already exist

  • creates a connection relationship between origin and destination airports for each row in the file

By default properties will be stored as strings. We know that year, month, and day are actually numeric values so we’ll coerce them using the toInteger function.

Now we’re ready to start querying the data.

Finding the most popular airports


We can see some of what we’ve imported by writing the following query, which finds the airports with the most outgoing connections.

MATCH (a:Airport)-[:CONNECTED_TO]->()
RETURN a, COUNT(*) AS outgoing
ORDER BY outgoing DESC
LIMIT 10

This query:

  • finds every node with the Airport label

  • finds all outgoing CONNECTED_TO relationships

  • counts them up grouped by airport

  • returns the Airport nodes and the outgoing count in descending order by outgoing

  • limits the number of airports returned to 10

Exercise: Finding connections


Now it’s your turn! Try and write queries to answer the following questions:

  • Find the airports that have the most incoming connections

  • Find all the connections into Las Vegas (LAS)

  • Find all the connections from Las Vegas (LAS) to Los Angeles (LAX)

Hint You’ll want to refer to the Cypher refcard for the syntax for the second question.

Click through for the answers


If you really want to see them…​

Answer: Find the airports that have the most incoming connections


MATCH (a:Airport)<-[:CONNECTED_TO]-()
RETURN a, COUNT(*) AS incoming
ORDER BY incoming DESC
LIMIT 10

Answer: Find all the connections into Las Vegas (LAS)


MATCH (origin:Airport)-[connection:CONNECTED_TO]->(destination:Airport)
WHERE destination.code = "LAS"
RETURN origin, destination, connection
LIMIT 10

or

MATCH (origin:Airport)-[connection:CONNECTED_TO]->(destination:Airport {code: "LAS"})
RETURN origin, destination, connection
LIMIT 10

Answer: Find all the connections from Las Vegas (LAS) to Los Angeles (LAX)


MATCH (origin:Airport {code: "LAS"})-[connection:CONNECTED_TO]->(destination:Airport {code: "LAX"})
RETURN origin, destination, connection
LIMIT 10

Finding specific connections


The model has worked well so far. We’ve been able to find the popular airports and find the connections between pairs of airports without much trouble.

What about if we want to find all the occurrences of a specific connection?

As an air travel enthusiast
I want to know the schedule for <flight number>
So that I know when I’ll be able to spot those planes taking off and landing

Finding connection WN 1016


Our next query finds all the instances of connection WN 1016:

MATCH  (origin:Airport)-[connection:CONNECTED_TO]->(destination:Airport)
WHERE connection.airline = "WN" AND connection.flightNumber = "1016"
RETURN origin.code, destination.code, connection.date, connection.departure, connection.arrival

It’s still reasonably quick because we only have 1,000 rows, but under the covers we’re actually doing a lot of unnecessary work.

Next Step


In the next section we’re going to learn how to profile queries so that we can see what work’s actually being done. We’ll then refactor the model to introduce Flight as a first class concept.