Data lives in a triple store that offers a SPARQL endpoint


A public SPARQL endpoint (DBpedia): http://dbpedia.org/sparql

A SPARQL query that returns Gene Hackman’s movies:

prefix dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT ?movie ?title ?dir ?name
WHERE {
  ?movie dbpedia-owl:starring [ rdfs:label "Gene Hackman"@en ];
         rdfs:label ?title;
         dbpedia-owl:director ?dir .
  ?dir rdfs:label ?name .
  FILTER LANGMATCHES(LANG(?title), "EN")
  FILTER LANGMATCHES(LANG(?name),  "EN")
}

Let’s have a look at the dataset first. The long URL below is just the SPARQL query above, URL-encoded and sent to the endpoint with format=text/csv so the results can be consumed with LOAD CSV:

WITH "http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=prefix+dbpedia-owl%3A+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2F%3E+%0D%0A%0D%0ASELECT+%3Fmovie+%3Ftitle+%3Fdir+%3Fname%0D%0AWHERE+%7B%0D%0A++%3Fmovie+dbpedia-owl%3Astarring+%5B+rdfs%3Alabel+%22Gene+Hackman%22%40en+%5D%3B%0D%0A+++++++++rdfs%3Alabel+%3Ftitle%3B%0D%0A+++++++++dbpedia-owl%3Adirector+%3Fdir+.%0D%0A++%3Fdir+rdfs%3Alabel+%3Fname+.%0D%0A++FILTER+LANGMATCHES%28LANG%28%3Ftitle%29%2C+%22EN%22%29%0D%0A++FILTER+LANGMATCHES%28LANG%28%3Fname%29%2C++%22EN%22%29%0D%0A%7D&format=text%2Fcsv&CXML_redir_for_subjs=121&CXML_redir_for_hrefs=&timeout=30000&debug=on" AS url

LOAD CSV WITH HEADERS FROM url AS row
RETURN row

The parsed data looks good, so we can extend the query to create nodes and relationships in Neo4j:

WITH "http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=prefix+dbpedia-owl%3A+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2F%3E+%0D%0A%0D%0ASELECT+%3Fmovie+%3Ftitle+%3Fdir+%3Fname%0D%0AWHERE+%7B%0D%0A++%3Fmovie+dbpedia-owl%3Astarring+%5B+rdfs%3Alabel+%22Gene+Hackman%22%40en+%5D%3B%0D%0A+++++++++rdfs%3Alabel+%3Ftitle%3B%0D%0A+++++++++dbpedia-owl%3Adirector+%3Fdir+.%0D%0A++%3Fdir+rdfs%3Alabel+%3Fname+.%0D%0A++FILTER+LANGMATCHES%28LANG%28%3Ftitle%29%2C+%22EN%22%29%0D%0A++FILTER+LANGMATCHES%28LANG%28%3Fname%29%2C++%22EN%22%29%0D%0A%7D&format=text%2Fcsv&CXML_redir_for_subjs=121&CXML_redir_for_hrefs=&timeout=30000&debug=on" AS url

LOAD CSV WITH HEADERS FROM url AS row
MERGE (m:Movie { id: row.movie, title: row.title })
MERGE (d:Director { id: row.dir, name : row.name })
MERGE (m)-[db:DIRECTED_BY]->(d)
RETURN m, db, d
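
As a side note: if this were more than a handful of rows, uniqueness constraints on the id properties would keep the MERGEs fast. A minimal sketch, not in the original guide, using the same Neo4j 3.x syntax as the index created later on:

CREATE CONSTRAINT ON (m:Movie) ASSERT m.id IS UNIQUE
CREATE CONSTRAINT ON (d:Director) ASSERT d.id IS UNIQUE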

Maybe empty the graph?

MATCH (n) DETACH DELETE n

Data is RDF-shaped and is ingested in small snippets


From a CONSTRUCT query against a SPARQL endpoint:

prefix dbpedia-owl: <http://dbpedia.org/ontology/>
CONSTRUCT {
  ?movie a dbpedia-owl:Movie ; dbpedia-owl:starring ?hf .
  ?hf rdfs:label "Harrison Ford"@en .
  ?movie rdfs:label ?title;
         dbpedia-owl:director ?dir .
  ?dir  a dbpedia-owl:Director; rdfs:label ?dirname .
}
WHERE {
  ?movie dbpedia-owl:starring ?hf .
  ?hf rdfs:label "Harrison Ford"@en .
  ?movie rdfs:label ?title;
         dbpedia-owl:director ?dir .
  ?dir  rdfs:label ?dirname .
  FILTER LANGMATCHES(LANG(?title), "EN")
  FILTER LANGMATCHES(LANG(?dirname),  "EN")
}

We can build a similar URL, this time requesting format=text/plain (N-Triples), and call the importRDF stored procedure:

WITH 'http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=prefix+dbpedia-owl%3A+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2F%3E+%0D%0ACONSTRUCT+%7B+%0D%0A++%3Fmovie+a+dbpedia-owl%3AMovie+%3B+dbpedia-owl%3Astarring+%3Fhf+.%0D%0A++%3Fhf+rdfs%3Alabel+%22Harrison+Ford%22%40en+.%0D%0A++%3Fmovie+rdfs%3Alabel+%3Ftitle%3B%0D%0A+++++++++dbpedia-owl%3Adirector+%3Fdir+.%0D%0A++%3Fdir++a+dbpedia-owl%3ADirector%3B+rdfs%3Alabel+%3Fdirname+.%0D%0A%7D%0D%0AWHERE+%7B%0D%0A++%3Fmovie+dbpedia-owl%3Astarring+%3Fhf+.%0D%0A++%3Fhf+rdfs%3Alabel+%22Harrison+Ford%22%40en+.%0D%0A++%3Fmovie+rdfs%3Alabel+%3Ftitle%3B%0D%0A+++++++++dbpedia-owl%3Adirector+%3Fdir+.%0D%0A++%3Fdir++rdfs%3Alabel+%3Fdirname+.%0D%0A++FILTER+LANGMATCHES%28LANG%28%3Ftitle%29%2C+%22EN%22%29%0D%0A++FILTER+LANGMATCHES%28LANG%28%3Fdirname%29%2C++%22EN%22%29%0D%0A%7D&format=text%2Fplain&CXML_redir_for_subjs=121&CXML_redir_for_hrefs=&timeout=30000&debug=on' AS url
CALL semantics.importRDF(url,'N-Triples',true,true,50000) YIELD terminationStatus, triplesLoaded, namespaces, extraInfo
RETURN terminationStatus, triplesLoaded, namespaces, extraInfo

Check that the Harrison Ford resource is now in the graph:

MATCH (n:Resource { uri: 'http://dbpedia.org/resource/Harrison_Ford'}) RETURN n LIMIT 25

And explore his movies and their directors:

MATCH path = (n:Resource { uri: 'http://dbpedia.org/resource/Harrison_Ford'})<-[:ns1_starring]-(movie)-[:ns1_director]->(director)
RETURN path LIMIT 5

Empty the graph?

MATCH (n) DETACH DELETE n

From a resource (linked data) page:

Google 'dbpedia Microsoft' to find the resource page for Microsoft, then load it:

WITH 'http://dbpedia.org/data/Microsoft.ntriples' AS url
CALL semantics.importRDF(url,'N-Triples',true,true,50000) YIELD terminationStatus, triplesLoaded,namespaces,extraInfo
RETURN terminationStatus, triplesLoaded,namespaces,extraInfo
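
A quick sanity check, not in the original guide, that the Microsoft resource made it in:

MATCH (n:Resource { uri: 'http://dbpedia.org/resource/Microsoft'}) RETURN n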

This does not look very nice. How about not using types as labels? Let’s empty the graph first:

MATCH (n) DETACH DELETE n

And let’s load the data again, but this time turning off the 'types to labels' parameter:

WITH 'http://dbpedia.org/data/Microsoft.ntriples' AS url
CALL semantics.importRDF(url,'N-Triples',true,false,50000) YIELD terminationStatus, triplesLoaded,namespaces,extraInfo
RETURN terminationStatus, triplesLoaded,namespaces,extraInfo

And now we can explore it. Notice that the ns6_ prefix might need to be changed in the following query, as namespace ids are assigned randomly on load: when you run this guide on your own instance, the type relationship may use a different prefix. Hint: check the relationship types in the left sidebar of the browser.

MATCH p = (ms:Resource { uri: 'http://dbpedia.org/resource/Microsoft'})-[:ns6_type]-(type) RETURN p
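
If ns6_type does not match on your instance, a small helper (not in the original guide) lists the candidate relationship types:

CALL db.relationshipTypes() YIELD relationshipType
WITH relationshipType
WHERE relationshipType ENDS WITH '_type'
RETURN relationshipType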

We will want to extend the graph by adding more RDF fragments.

MATCH path = (n:Resource { uri: 'http://dbpedia.org/resource/Microsoft'})-[:ns2_keyPerson]-(p)
RETURN path LIMIT 5

Google 'dbpedia Bill Gates' and load his resource page too:

WITH 'http://dbpedia.org/data/Bill_Gates.ntriples' AS url
CALL semantics.importRDF(url,'N-Triples',true,false,50000) YIELD terminationStatus, triplesLoaded,namespaces,extraInfo
RETURN terminationStatus, triplesLoaded,namespaces,extraInfo

And we can run the 'key person' query again to see the results:
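
MATCH path = (n:Resource { uri: 'http://dbpedia.org/resource/Microsoft'})-[:ns2_keyPerson]-(p)
RETURN path LIMIT 5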

Let’s empty the graph once more

MATCH (n) DETACH DELETE n

From a full dataset dump:

The Thomson Reuters OpenPermID …​

ls -arlt ~/Downloads/OpenPermID-bulk-*.ttl

Preview your RDF dataset as an LPG in Neo4j


We’ll typically start by previewing what the RDF data looks like when transformed into an LPG in Neo4j.

WITH '...' AS rdf // replace '...' with a small fragment of the Turtle file
CALL semantics.previewRDFSnippet(rdf,"Turtle",true,false) YIELD nodes, relationships
RETURN nodes, relationships
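
For instance, with a toy Turtle fragment (hypothetical example.org data, just to show the shape of the call):

WITH '<http://example.org/jb> a <http://example.org/Person> ; <http://example.org/name> "JB" .' AS rdf
CALL semantics.previewRDFSnippet(rdf,"Turtle",true,false) YIELD nodes, relationships
RETURN nodes, relationships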

Load the RDF graph


The 'Organization' file contains approximately 29 million triples:

CALL semantics.importRDF('file:///Users/jbarrasa/Downloads/OpenPermID-bulk-organization-20161005_120255.ttl','Turtle',true,true,50000)
CALL semantics.importRDF('file:///Users/jbarrasa/Downloads/OpenPermID-bulk-industry-20161005_120254.ttl','Turtle',false,true,50000)

Maybe skip the quotes?

CALL semantics.importRDF('file:///Users/jbarrasa/Downloads/OpenPermID-bulk-quote-20161005_120255.ttl','Turtle',false,true,50000)
CALL semantics.importRDF('file:///Users/jbarrasa/Downloads/OpenPermID-bulk-instrument-20161005_120255.ttl','Turtle',false,true,5000)
CALL semantics.importRDF('file:///Users/jbarrasa/Downloads/OpenPermID-bulk-assetClass-20161005_120255.ttl','Turtle',false,true,5000)

Query by URI:

MATCH (n:Resource { uri: 'https://permid.org/1-4295904482'}) RETURN n LIMIT 25
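
And, going one step further (not in the original guide), its immediate neighbourhood:

MATCH (n:Resource { uri: 'https://permid.org/1-4295904482'})--(m)
RETURN n, m LIMIT 50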

Some basic analytics on the imported graph


MATCH (n:Resource) RETURN COUNT(n)

MATCH (n:Resource)
WITH n.uri AS uri, size((n)--()) AS degree
RETURN AVG(degree)
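
A quick way to spot the densest nodes, reusing the same degree expression (an extra check, not in the original guide):

MATCH (n:Resource)
RETURN n.uri AS uri, size((n)--()) AS degree
ORDER BY degree DESC LIMIT 10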

Enriching the graph


MATCH (o:ns0_Organization)-[iii:ns0_isIncorporatedIn]->(loc)
RETURN * LIMIT 10

There is no information about countries in the dataset: the incorporation location is a unique ID (a GeoNames URI), but not very usable on its own. Let’s enrich it from a CSV file keyed on the GeoNames id:

LOAD CSV WITH HEADERS FROM "file:///countries-geonames.csv" AS row FIELDTERMINATOR "\t"
MATCH (r:Resource { uri: "http://sws.geonames.org/" + row.geonameid + "/" } )
SET r+= row, r:Country
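
To confirm the enrichment worked, we can re-run the incorporation query; the locations should now carry the Country label and the properties from the CSV:

MATCH (o:ns0_Organization)-[:ns0_isIncorporatedIn]->(c:Country)
RETURN o, c LIMIT 10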

If we are going to look up organizations by name, let’s create an index

CREATE INDEX ON :ns0_Organization(`ns4_organization-name`)
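
With the index in place, lookups by name are fast. A sketch of such a lookup; the name value below is just an illustrative guess at what the dataset might contain:

MATCH (o:ns0_Organization)
WHERE o.`ns4_organization-name` STARTS WITH 'Thomson'
RETURN o LIMIT 10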

We can add labels for untyped entities

MATCH ()-[:ns4_hasURL]->(url)
SET url:URL

Make it more LPG-like


Identify dense nodes or RDF-ish modelling patterns:

MATCH (o)-[:ns0_hasActivityStatus]->(as) RETURN as.uri, count(o)

We want to fix it by running something like this:

MATCH (o)-[has:ns0_hasActivityStatus]->(:Resource { uri: 'http://permid.org/ontology/organization/statusActive'})
SET o.activityStatus = 'ACTIVE'
DELETE has

…and the equivalent for inactive organizations.

We will do it in batches

MATCH (o:ns0_Organization) SET o:Process
CALL apoc.periodic.commit("
      MATCH (o:Process)-[has:ns0_hasActivityStatus]->(:Resource { uri: 'http://permid.org/ontology/organization/statusInActive'})
      WITH o, has LIMIT {limit}
      SET o.activityStatus = 'INACTIVE'
      DELETE has
      REMOVE o:Process
      RETURN COUNT(o)
",{limit:75000})

Publishing your Neo4j Graph as RDF


//Get data as RDF
:GET /rdf/nodebyuri?nodeuri=https://permid.org/1-4295904482

Or from the CLI, to produce different serializations (text/turtle, text/plain):

curl -i http://localhost:7474/rdf/nodebyuri?nodeuri=https://permid.org/1-4295904482 -H accept:application/rdf+xml

//Get data as RDF (this one requires the Harrison Ford data from the CONSTRUCT-query section to be loaded)
:GET /rdf/nodebyuri?nodeuri=http://dbpedia.org/resource/Indiana_Jones_and_the_Temple_of_Doom