This blog is dedicated to the study of emerging Web technology, in particular, ongoing research and development aimed at building software tools that will underlie the emerging Semantic Web. In this posting, we look at a little-known website that has the potential of setting the pace for the developers of the Semantic Web.
It’s called DBpedia. A former graduate student at my university, Greg Ziebold, pointed me toward it. The goal of the DBpedia is to transform data from the Wikipedia into a chunk of the Semantic Web. To do this, DBpedia is using RDF technology, something we have discussed in past postings of this blog. Behind RDF is an extremely simple concept, but one that has proven extremely powerful and versatile.
The general idea is to break knowledge up into “triples” that describe relationships between pieces of information. These triples can be chained together to discover new relationships. And, importantly, triples must make use of widely shared sets of terminology, called namespaces, in order for knowledge from different places on the Web to be properly chained together.
RDF, triples, assertions, and inferences.
A thorough example can be found in a previous posting of this blog.
Here is a very simple example of triples (also known as “assertions”) and how they can be put together into “inferences”.
Assertion 1: Joe is tall.
Assertion 2: Tall People should try out for Basketball.
A new inference: Joe should try out for Basketball.
Keep in mind that we would want to make sure that the words used in these assertions have precise, global meanings. We might take the terms in these two assertions from a basketball namespace, one that would carefully dictate exactly what “tall” means in the basketball world. Certainly, it would be quite different from the meaning of “tall” in a kindergarten namespace.
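For readers who like to see this in code, here is a minimal Python sketch of the Joe example. It is only an illustration, not how RDF tools are actually implemented. The two namespace URLs and the terms "hasHeightClass" and "shouldTryOutFor" are made up for this posting, and the second person, Sam, is added only to show that “tall” in a kindergarten namespace does not trigger the basketball rule.

```python
# Hypothetical namespaces, invented for this illustration only.
BB = "http://example.org/basketball#"
KG = "http://example.org/kindergarten#"

# Each triple is (subject, predicate, object).
triples = {
    (BB + "Joe", BB + "hasHeightClass", BB + "Tall"),  # Assertion 1: Joe is tall
    (KG + "Sam", KG + "hasHeightClass", KG + "Tall"),  # tall, but only for a kindergartner
}


def infer_tryouts(facts):
    """Apply Assertion 2 as a rule: basketball-tall people should try out."""
    inferred = set()
    for subject, predicate, obj in facts:
        if predicate == BB + "hasHeightClass" and obj == BB + "Tall":
            inferred.add((subject, BB + "shouldTryOutFor", BB + "Basketball"))
    return inferred


new_facts = infer_tryouts(triples)
for fact in sorted(new_facts):
    print(fact)
```

Because every term carries its namespace, the rule cannot confuse the two meanings of “tall”: only Joe ends up in the inferred triples.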
More on DBpedia.
There’s a fancy word for sets of triples that use namespaces and represent various areas of knowledge. They are called “ontologies”, a term borrowed from philosophy, where it refers to the study of what exists (including, famously, arguments about the existence of God). The DBpedia is essentially a vast ontology, formed from triples and namespaces. Most of the knowledge defined by this ontology comes from the Wikipedia. The folks behind the DBpedia have been given direct access to the flow of information into the Wikipedia, so that the DBpedia can stay current.
One way to look at the DBpedia is that it takes the Wikipedia and reshapes it into something that can be searched far more effectively. Right now, to search the Wikipedia, most of us simply type in terms (either into Google/Yahoo or into the Wikipedia search page). We try various terms and follow links inside the Wikipedia until we find what we think we are looking for. With the DBpedia, users can search with SPARQL, a query language whose syntax is modeled on SQL and which is engineered specifically for searching large bases of triples. SPARQL allows us to traverse networks that consist of triples linked by shared terms, and so to follow the chains that lead to inferences.
That way, if we were a coach looking for promising candidates for our team, we could use SPARQL to make the connection between Joe being tall and the fact that tall people should try out for basketball. This is clearly much faster and more accurate than googling things like “tall”, “basketball”, etc., until we happened to find Joe in one of the web pages that pop up.
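To give a feel for what such a query does, here is a rough Python sketch of the coach's search. It is a toy pattern matcher over a handful of made-up triples, not a SPARQL engine, and it does not talk to the DBpedia. The names "heightClass" and "shouldTryOutFor", and the "ex:" prefix in the comment, are assumptions invented for this example. Terms beginning with "?" are variables, just as in SPARQL.

```python
# The query below corresponds roughly to this SPARQL (ex: is a hypothetical prefix):
#   SELECT ?person WHERE {
#     ?person ex:heightClass ?class .
#     ?class  ex:shouldTryOutFor ex:Basketball .
#   }

facts = [
    ("Joe", "heightClass", "Tall"),
    ("Ann", "heightClass", "Short"),
    ("Tall", "shouldTryOutFor", "Basketball"),
]


def match(pattern, triple, bindings):
    """Extend bindings so that pattern fits triple; return None if it cannot."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the same value everywhere it appears.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def query(patterns, facts):
    """Return every set of variable bindings that satisfies all patterns."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in facts:
                extended = match(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


patterns = [
    ("?person", "heightClass", "?class"),
    ("?class", "shouldTryOutFor", "Basketball"),
]
candidates = [solution["?person"] for solution in query(patterns, facts)]
print(candidates)
```

The shared variable "?class" is what chains the two triples together: it has to take the same value in both patterns, so Joe is found and Ann is not.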
The DBpedia website, by the way, claims to have a triple base that consists of 274 million RDF triples.
More on this in the next posting.