Knowledge Graph from Wikipedia Category

A knowledge graph, or ontology, makes it easier to grasp information. When we search Google for anything, we see well-structured information on the right side of the page. For example, if we search Google for Albert Einstein, it shows:

 

Albert Einstein, Google search


This box contains facts, or axioms, such as:

  • Albert Einstein ---- Born ---- March 14, 1879, Ulm, Germany
  • Albert Einstein ---- Education ---- University of Zurich (1905), ETH Zürich (1896–1900)

These facts are the outcome of using a knowledge graph, or more technically an ontology. Google has developed its knowledge graph by scraping information from the web, much of it from Wikipedia.

Wikipedia is the largest hub of open information. Wikipedia's articles are assigned to various categories (https://en.wikipedia.org/wiki/Category:Main_topic_classifications) according to their relatedness. For example, the article on Albert Einstein is categorized under German inventors (among other categories). One of the parent categories of German inventors is Inventors by nationality.

Wikipedia articles category hierarchy

I was looking for an open-source knowledge graph of the Wikipedia category hierarchy that I could use off the shelf. I found that DBpedia provides a Wikipedia category hierarchy using skos:broader terms, but not exactly the one I was looking for. So I had to build the Wikipedia hierarchy knowledge graph from scratch.

I found two approaches to solve this: scrape the Wikipedia category hierarchy, or use the Wikipedia data dump.

I first tried scraping Wikipedia, starting from the main category and then walking through its subcategories and pages until no pages were left. But this was a time-consuming process, as the program had to visit each category page to find its children and subsequently their children.
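
For illustration, here is a minimal sketch of such a crawler in Java. It is a hypothetical reconstruction, not my original script: it assumes the MediaWiki API's list=categorymembers query, uses a throwaway regex instead of a proper JSON parser, only follows subcategories (cmtype=subcat), and ignores the API's cmcontinue pagination, all of which a real crawler would handle properly.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.*;
import java.util.regex.*;

// Breadth-first walk over Wikipedia's category tree via the MediaWiki API.
public class CategoryCrawler {
    static final String API = "https://en.wikipedia.org/w/api.php"
            + "?action=query&list=categorymembers&cmtype=subcat"
            + "&cmlimit=500&format=json&cmtitle=";

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        Deque<String> queue = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        queue.add("Category:Main_topic_classifications");

        while (!queue.isEmpty()) {
            String parent = queue.poll();
            if (!visited.add(parent)) continue;          // the hierarchy has cycles
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create(API + parent.replace(" ", "_"))).build();
            String body = client.send(request,
                    HttpResponse.BodyHandlers.ofString()).body();
            // crude extraction of "title":"Category:..." entries from the JSON
            Matcher m = Pattern.compile("\"title\":\"(Category:[^\"]+)\"").matcher(body);
            while (m.find()) {
                String child = m.group(1);
                System.out.println(child + " childCategoryOf " + parent);
                queue.add(child);
            }
        }
    }
}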

As the scraping process was time-consuming and I needed to make it reproducible, I opted for the data dump: http://dumps.wikimedia.org/enwiki/latest/. It has all the information we need in SQL format. Among the dumps, two tables are of interest to me: page information and category information, stored in the page table and the categorylinks table respectively.

 

Page/Articles:
Download: http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.sql.gz 
Information on page table: https://www.mediawiki.org/wiki/Manual:Page_table.
This table has around 49 million entries, as of January 20, 2020. 

CategoryLinks:
Download: http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-categorylinks.sql.gz 
Information on the category table: https://www.mediawiki.org/wiki/Manual:Categorylinks_table
This table has around 140 million entries, as of January 20, 2020. 

The page table gives us the page_id and title of each page. It also provides the page type via the page_namespace column: for a category page, page_namespace is 14, and for an article page, page_namespace is 0.

The categorylinks table provides the actual hierarchical information. Its columns cl_from and cl_to hold the relation, where cl_from is the page_id of the article page or subcategory and cl_to is the title of its category/parent category.

Let’s see some examples:
If we want to get the page_id of the Albert Einstein article, we can get it by executing this SQL command:

select page_id, page_title, page_namespace from page where page_title='Albert_Einstein' and page_namespace=0;

Then, if we want to get the categories of this page (page_id 736), we can get those by executing:

select cl_from, cl_to from categorylinks where cl_from=736;

This returns around 148 rows. A snippet of it:

We can see that German_inventors is among the categories.

To get the parent categories of German_inventors, we first look up the page_id of the German_inventors category page:

select page_id, page_title, page_namespace from page where page_title='German_inventors' and page_namespace=14;

After getting the page_id, we can look up its parent categories:

select cl_from, cl_to from categorylinks where cl_from=1033282;

We need to continue this back-and-forth computation until we have walked the whole category hierarchy.
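
A minimal sketch of that loop, assuming the two dump files have been imported into a local MySQL database (the JDBC URL, credentials, and class name are placeholders), might look like this:

import java.sql.*;
import java.util.*;

// Walks upward from one article through the category hierarchy, using the page
// and categorylinks tables. Each step maps a category title back to its page_id
// and then fetches that page's own parent categories.
public class HierarchyWalker {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/enwiki", "user", "password")) {
            Deque<Long> queue = new ArrayDeque<>(List.of(736L));   // Albert Einstein
            Set<Long> visited = new HashSet<>();
            PreparedStatement parents = conn.prepareStatement(
                    "SELECT cl_to FROM categorylinks WHERE cl_from = ?");
            PreparedStatement lookup = conn.prepareStatement(
                    "SELECT page_id FROM page WHERE page_title = ? AND page_namespace = 14");

            while (!queue.isEmpty()) {
                long pageId = queue.poll();
                if (!visited.add(pageId)) continue;                // guard against cycles
                parents.setLong(1, pageId);
                try (ResultSet rs = parents.executeQuery()) {
                    while (rs.next()) {
                        String category = rs.getString("cl_to");
                        lookup.setString(1, category);
                        try (ResultSet cat = lookup.executeQuery()) {
                            if (cat.next()) {
                                System.out.println(pageId + " -> " + category);
                                queue.add(cat.getLong("page_id"));
                            }
                        }
                    }
                }
            }
        }
    }
}

To build the full graph, one would iterate over the whole categorylinks table rather than starting from a single article, but the title-to-page_id lookup step stays the same.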

 

Making a concrete knowledge graph

Using the OWLAPI, Apache Jena, or Owlready2 libraries, we can easily make a concrete knowledge graph. The Wikipedia hierarchy has cycles. For example, it contains information such as:

1949_establishments_in_Asia childCategoryOf 1949_establishments_in_India
and 
1949_establishments_in_India childCategoryOf 1949_establishments_in_Asia

which creates a cyclic relation. The Owlready2 library treats a concept as a Python class, and Python class inheritance does not allow cycles, so Owlready2 cannot handle this (as of January 20, 2020). The OWLAPI and Jena libraries can. Here is the code to create a single fact/axiom using the OWLAPI:

// Adds one subclass-of axiom (child is a subclass of parent) to the ontology.
// onto_prefix, beautifyName(), owlDataFactory, owlOntologyManager, and
// owlOntology are fields/helpers defined elsewhere in the class.
void createRelation(String childName, String parentName) {
    IRI cIRI = IRI.create(onto_prefix + beautifyName(childName));
    IRI pIRI = IRI.create(onto_prefix + beautifyName(parentName));
    OWLClass cClass = owlDataFactory.getOWLClass(cIRI);
    OWLClass pClass = owlDataFactory.getOWLClass(pIRI);
    OWLAxiom owlAxiom = owlDataFactory.getOWLSubClassOfAxiom(cClass, pClass);
    owlOntologyManager.addAxiom(owlOntology, owlAxiom);
}
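
This method relies on a manager, data factory, ontology, and IRI prefix created elsewhere. A minimal sketch of that setup, with an illustrative namespace and output file name, might look like:

import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;
import java.io.File;

// Hypothetical setup for the fields createRelation() relies on,
// plus saving the finished ontology.
OWLOntologyManager owlOntologyManager;
OWLDataFactory owlDataFactory;
OWLOntology owlOntology;
String onto_prefix = "http://example.org/wikipedia-hierarchy#";

void buildOntology() throws OWLOntologyCreationException, OWLOntologyStorageException {
    owlOntologyManager = OWLManager.createOWLOntologyManager();
    owlDataFactory = owlOntologyManager.getOWLDataFactory();
    owlOntology = owlOntologyManager.createOntology(
            IRI.create("http://example.org/wikipedia-hierarchy"));

    // call createRelation(child, parent) for every cl_from/cl_to pair, e.g.:
    createRelation("German_inventors", "Inventors_by_nationality");

    // serialize the knowledge graph to disk
    owlOntologyManager.saveOntology(owlOntology,
            IRI.create(new File("wikipedia-hierarchy.owl").toURI()));
}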

After building the knowledge graph, it has:

  • Total axioms: 7864012 
  • Total classes: 1901708
  • Total subclassOf axioms: 5962304 

Here is a screenshot of the knowledge graph:

Knowledge Graph from Wikipedia Hierarchy

The complete knowledge graph can be downloaded from here.
Making a knowledge graph is fun!!!

Thanks:

  1. https://databus.dbpedia.org/dbpedia/generic/categories/2019.08.30
  2. https://kodingnotes.wordpress.com/2014/12/03/parsing-wikipedia-page-hierarchy/