Building a Knowledge Graph from Wikipedia in Neo4j


Categories in Wikipedia pages
You may have noticed at the bottom of every Wikipedia article a section listing the categories it’s classified under.
If you open the Wikipedia page on Neo4j, you will find the following list of categories:

[Image: the list of categories at the bottom of the Neo4j Wikipedia page]

Categories are organised in multiple overlapping hierarchies
Every Wikipedia article has at least one category, and categories branch into subcategories, forming overlapping trees. A category can be a subcategory of more than one parent category (the Wikipedia hierarchy itself is an example of this), so the hierarchy is effectively a graph.

[Image: overlapping category hierarchies]

Approach #1 : Loading a reduced subset through the MediaWiki API


The MediaWiki API
Wikipedia data can be accessed through the MediaWiki action API. For a given category, we can request its subcategories or the pages it contains. The API returns results serialized as JSON, which we can easily consume with the apoc.load.json procedure in the APOC library.
The JSON output of such a request, returning the subcategories of the 'Databases' category, is shown below.
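As a minimal standalone sketch (assuming the APOC library is installed), the same request can be issued directly from Cypher:

// fetch the subcategories of the 'Databases' category from the MediaWiki API
CALL apoc.load.json('https://en.wikipedia.org/w/api.php?format=json&action=query&list=categorymembers&cmtype=subcat&cmtitle=Category:Databases&cmprop=ids%7Ctitle&cmlimit=500')
YIELD value
UNWIND value.query.categorymembers AS subcat
RETURN subcat.pageid, subcat.title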

Prepare the DB and create the root category

// indexes to speed up category and page lookups during the incremental load
CREATE INDEX ON :Category(catId)
CREATE INDEX ON :Category(catName)
CREATE INDEX ON :Page(pageTitle)
// root of the hierarchy we will build: the 'Databases' category
CREATE (c:Category:RootCategory {catId: 0, catName: 'Databases', subcatsFetched: false, pagesFetched: false, level: 0})
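One small addition worth considering (it is not part of the original setup): the page-loading query further down merges on pageId, so an index on that property will speed up the load considerably.

// optional: index to support MERGE (p:Page {pageId: ...}) below
CREATE INDEX ON :Page(pageId)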

[Image: JSON response from the MediaWiki API listing the subcategories of 'Databases']

Approach #1 : Loading a reduced subset incrementally through the MediaWiki API


Loading a three-level-deep hierarchy of categories

We can load an arbitrarily deep hierarchy of Wikipedia categories by iteratively generating requests to the MediaWiki API starting from the root category.

[Image: the loaded category hierarchy]

UNWIND range(0,3) AS level
CALL apoc.cypher.doIt("
MATCH (c:Category { subcatsFetched: false, level: $level})
CALL apoc.load.json('https://en.wikipedia.org/w/api.php?format=json&action=query&list=categorymembers&cmtype=subcat&cmtitle=Category:' + apoc.text.urlencode(c.catName) + '&cmprop=ids%7Ctitle&cmlimit=500')
YIELD value as results
UNWIND results.query.categorymembers AS subcat
MERGE (sc:Category {catId: subcat.pageid})
ON CREATE SET sc.catName = substring(subcat.title,9),
              sc.subcatsFetched = false,
              sc.pagesFetched = false,
              sc.level = $level + 1
WITH sc,c
CALL apoc.create.addLabels(sc,['Level' +  ($level + 1) + 'Category']) YIELD node
MERGE (sc)-[:SUBCAT_OF]->(c)
WITH DISTINCT c
SET c.subcatsFetched = true", { level: level }) YIELD value
RETURN value
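As a quick sanity check (our addition, not part of the original guide), we can count how many categories were created at each level:

// categories per level after the incremental load
MATCH (c:Category)
RETURN c.level AS level, count(c) AS categories
ORDER BY level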

Approach #1 : Loading a reduced subset incrementally through the MediaWiki API


Loading pages for each category in the graph by level

Similarly, we can load the Wikipedia pages for each of the categories in the graph. Notice that at this point we are only loading the page id, its URL and its title.

[Image: pages linked to their categories]

UNWIND range(0,4) AS level
CALL apoc.cypher.doIt("
MATCH (c:Category { pagesFetched: false, level: $level })
CALL apoc.load.json('https://en.wikipedia.org/w/api.php?format=json&action=query&list=categorymembers&cmtype=page&cmtitle=Category:' + apoc.text.urlencode(c.catName) + '&cmprop=ids%7Ctitle&cmlimit=500')
YIELD value as results
UNWIND results.query.categorymembers AS page
MERGE (p:Page {pageId: page.pageid})
ON CREATE SET p.pageTitle = page.title, p.pageUrl = 'http://en.wikipedia.org/wiki/' + apoc.text.urlencode(replace(page.title, ' ', '_'))
WITH p,c
MERGE (p)-[:IN_CATEGORY]->(c)
WITH DISTINCT c
SET c.pagesFetched = true", { level: level }) YIELD value
RETURN value
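A similar check (again our addition) shows how many pages ended up attached to categories at each level:

// pages per category level
MATCH (p:Page)-[:IN_CATEGORY]->(c:Category)
RETURN c.level AS level, count(DISTINCT p) AS pages
ORDER BY level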

Enriching the knowledge graph with DBPedia data


The MediaWiki API does not expose the content of the pages, but DBpedia does, to a certain extent. DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make it available on the Web.
There is a public instance of DBpedia that exposes a SPARQL endpoint, which we can query to get an abstract for a given Wikipedia page. Here is the SPARQL query that we will use:

SELECT ?label
WHERE  {
	?x  <http://xmlns.com/foaf/0.1/isPrimaryTopicOf> <...wiki page url...> ;
		<http://dbpedia.org/ontology/abstract> ?label .
	FILTER(LANGMATCHES(LANG(?label), 'en'))
	} LIMIT 1

[Image: results of the SPARQL query against the DBpedia endpoint]

We’ll submit the query to the endpoint and get the results as JSON, once again using APOC’s apoc.load.json procedure. The results will enrich our knowledge graph by adding the abstract as a property of the Page nodes.

WITH "SELECT ?label
 WHERE  {
         ?x <http://xmlns.com/foaf/0.1/isPrimaryTopicOf> <@wikiurl@> ;
            <http://dbpedia.org/ontology/abstract> ?label .
         FILTER(LANG(?label) = '' || LANGMATCHES(LANG(?label), 'en')) } LIMIT 1
         " AS sparqlPattern
UNWIND range(0,3) AS level
CALL apoc.cypher.doIt("
MATCH (c:Category { level: $level })<-[:IN_CATEGORY]-(p:Page)
WHERE NOT exists(p.abstract)
WITH DISTINCT p, apoc.text.replace(sparqlPattern,'@wikiurl@',p.pageUrl) as runnableSparql LIMIT 100
CALL apoc.load.json('http://dbpedia.org/sparql/?query=' + apoc.text.urlencode(runnableSparql) + '&format=application%2Fsparql-results%2Bjson') YIELD value
SET p.abstract = value.results.bindings[0].label.value
", { level: level, sparqlPattern: sparqlPattern }) yield value
return value

Notice that the number of pages is limited to 100 per level, because we are generating an HTTP request to the DBpedia endpoint for each Page node in our graph. Feel free to remove this limit, but keep in mind that the full run can take a while.
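To track progress between runs, a simple count (our addition, relying on the abstract property set above) shows how many pages still lack an abstract:

// pages not yet enriched with a DBpedia abstract
MATCH (p:Page)
WHERE NOT exists(p.abstract)
RETURN count(p) AS pagesWithoutAbstract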

Exploring the dataset


We can list categories by number of sub/super categories or by number of pages, or even compute custom indexes like ABS(toFloat(superCatCount - subCatCount)/(superCatCount + subCatCount)), which tells us how 'balanced' a category is (the ratio between supercategories and subcategories): values closer to zero indicate more balanced categories, and values closer to one more unbalanced ones.

MATCH (c:Category)
WITH c.catName AS category,
	 size((c)<-[:SUBCAT_OF]-()) AS subCatCount,
     size((c)-[:SUBCAT_OF]->()) AS superCatCount,
	 size((c)<-[:IN_CATEGORY]-()) AS pageCount
WHERE  subCatCount > 0 AND superCatCount > 0
RETURN category,
	   pageCount,
	   subCatCount,
       superCatCount,
	   ABS(toFloat(superCatCount - subCatCount)/(superCatCount + subCatCount)) as balanceIndex
ORDER BY subCatCount DESC
LIMIT 500

We can also aggregate these values to produce stats.

MATCH (c:Category)
WITH c.catName AS category,
	 size((c)<-[:SUBCAT_OF]-()) AS subCatCount,
     size((c)-[:SUBCAT_OF]->()) AS superCatCount,
     size((c)<-[:IN_CATEGORY]-()) AS pageCount
RETURN AVG(subCatCount) AS `AVG #subcats`,
	   MIN(subCatCount) AS `MIN #subcats`,
       MAX(subCatCount)  AS `MAX #subcats`,
       percentileCont(subCatCount,0.9)  AS `.9p #subcats`,
       AVG(pageCount) AS `AVG #pages`,
       MIN(pageCount) AS `MIN #pages`,
       MAX(pageCount) AS `MAX #pages`,
       percentileCont(pageCount,0.95) AS `.95p #pages`,
       AVG(superCatCount) AS `AVG #supercats`,
       MIN(superCatCount) AS `MIN #supercats`,
       MAX(superCatCount) AS `MAX #supercats`,
       percentileCont(superCatCount,0.95) AS `.95p #supercats`

Exploring the dataset


[Image: shortest path between the Quantone and Monster.com pages]

How are Quantone and Monster.com connected in the Wikipedia hierarchy?

MATCH shortestHierarchyConnection = shortestPath((q:Page { pageTitle : 'Quantone'})-[:IN_CATEGORY|SUBCAT_OF*]-(m:Page { pageTitle : 'Monster.com'}))
RETURN shortestHierarchyConnection

Some unexpectedly(?) long hierarchies. How can a path of length 7 exist if we’ve only loaded a 4-level-deep hierarchy? Here’s how:

MATCH path = ()-[r:SUBCAT_OF*7..]->() WITH path LIMIT 1
RETURN path

Loops!

MATCH loop = (cat)-[r:SUBCAT_OF*]->(cat)
RETURN loop LIMIT 1

Approach #2 : Batch loading the data with LOAD CSV from an SQL dump


There is a snapshot of the Wikipedia categories and their hierarchical relationships (as of mid-April 2017) here. It contains 1.4 million categories and 4 million hierarchical relationships. Both can be loaded into Neo4j using LOAD CSV. You can run the queries as they are, or download the files to your Neo4j instance’s import directory and use LOAD CSV FROM "file:///…​" instead.
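For example, assuming you keep the file name cats.csv from the snapshot and copy it into the import directory, this quick dry run (our addition) counts the rows to verify the file is readable:

// read-only dry run: nothing is written, so no PERIODIC COMMIT needed
LOAD CSV FROM "file:///cats.csv" AS row
RETURN count(row) AS rows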

First the categories. Notice that we are loading a couple of extra properties on the Category nodes: pageCount and subcatCount. These numbers are precomputed in the data dump and not always accurate.

USING PERIODIC COMMIT 10000
LOAD CSV FROM "https://github.com/jbarrasa/datasets/blob/master/wikipedia/data/cats.csv?raw=true" AS row
CREATE (c:Category { catId: row[0]})
SET c.catName = row[2], c.pageCount = toInt(row[3]), c.subcatCount = toInt(row[4])

And then the subcategory relationships:

USING PERIODIC COMMIT 10000
LOAD CSV FROM "https://github.com/jbarrasa/datasets/blob/master/wikipedia/data/rels.csv?raw=true" AS row
MATCH (from:Category { catId: row[0]})
MATCH (to:Category { catId: row[1]})
CREATE (from)-[:SUBCAT_OF]->(to)
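As a quick sanity check (our addition), the counts should roughly match the 1.4 million categories and 4 million relationships mentioned above:

// verify the batch load
MATCH (c:Category)
WITH count(c) AS categoryCount
MATCH ()-[:SUBCAT_OF]->()
RETURN categoryCount, count(*) AS subcatRelCount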

Approach #2 : Batch loading the data with LOAD CSV from an SQL dump


If you’re interested in regenerating fresh CSV files, here’s how:

  • Start by downloading the latest DB dumps from the Wikipedia downloads page. For the category hierarchy, you’ll only need the following tables: category, categorylinks and page.

  • Load them into your DBMS.

  • Generate the categories CSV file by running the following SQL:

select p.page_id as PAGE_ID, c.cat_id as CAT_ID, cast(c.cat_title as NCHAR) as CAT_TITLE, c.cat_pages as CAT_PAGES_COUNT, c.cat_subcats as CAT_SUBCAT_COUNT
into outfile '/Users/jbarrasa/Applications/neo4j-enterprise-3.1.2/import/wiki/cats.csv' fields terminated by ',' enclosed by '"' escaped by '\\' lines terminated by '\n'
from test.category c, test.page p
where c.cat_title = p.page_title
and p.page_namespace = 14

  • Generate the relationships file by running the following SQL:

select p.page_id as FROM_PAGE_ID, p2.page_id as TO_PAGE_ID
into outfile '/Users/jbarrasa/Applications/neo4j-enterprise-3.1.2/import/wiki/rels.csv' fields terminated by ',' enclosed by '"' escaped by '\\' lines terminated by '\n'
from test.category c, test.page p, test.categorylinks l, test.category c2, test.page p2
where l.cl_type = 'subcat'
and c.cat_title = p.page_title
and p.page_namespace = 14
and l.cl_from = p.page_id
and l.cl_to = c2.cat_title
and c2.cat_title = p2.page_title
and p2.page_namespace = 14

Exploring the dataset


The same analysis we ran in the first approach can be applied to the full hierarchy. However, there are a couple of interesting queries that only make sense on the full hierarchy. The first one computes stats on the number of pages per category, based on the precomputed counts available in the Wikipedia dump.

MATCH (c:Category)
RETURN SUM(c.pageCount) AS `#pages categorised (with duplicates)`,
	   AVG(c.pageCount) AS `average #pages per cat`,
       percentileCont(c.pageCount, 0.75) AS `.75p #pages in a cat`,
	   MIN(c.pageCount) AS `min #pages in a cat`,
       MAX(c.pageCount) AS `max #pages in a cat`

Also, the full export contains some orphan nodes. Probably unmaintained/migrated categories? Here is how to get the number of orphan nodes:

MATCH (c:Category)
WHERE NOT (c)-[:SUBCAT_OF]-()
RETURN COUNT(c)
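And to eyeball a few of them (a small addition of ours):

// sample orphan categories by name
MATCH (c:Category)
WHERE NOT (c)-[:SUBCAT_OF]-()
RETURN c.catName AS orphanCategory
LIMIT 25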