Posted by: Roger King
automating Web searches, namespaces, RDF, SPARQL, the Semantic Web, triples, XML
This blog is dedicated to advanced and emerging Web technology. Each posting is meant to be understandable and informative on its own, but the blog as a whole tells a continuing story.
The Semantic Web.
In this posting, we will focus on the Semantic Web, which is a global effort at radically improving our ability to search the Web.
Currently, to search the web, we type in keywords into a search engine like Google, which then searches its vast index of webpages for pages that have these keywords in them. Because this sort of search is very low-level, and not at all tied to the true meaning or purpose of the information stored in webpages, searching is painfully iterative and interactive. A user must chase down countless URLs returned by a search engine to see if any of them are relevant. Quite frequently, they are not. And so, the user must refine the set of keywords and tries again. It might take many attempts before a satisfactory result is obtained.
One of the primary goals of the Semantic Web is to automate the process of searching the Web. There are two stages to this. First, people who post information on the Web must capture knowledge about the meaning of their information; this knowledge is commonly called “metadata”. The metadata is then store with the posted information.
The second stage happens when users search the Web. Rather than using the low level keyword search approach, the search is at least partly automated. The iterative process is sharply reduced by employing a smart search engine that knows how to find relevant information by searching for metadata that pertains precisely to whatever it is that the user is seeking.
The bottom line.
The Semantic Web would be able to ease the burden of searching for information, as well as find vast stores of “hidden data” that reside in databases that are accessible via webpages, but whose contents right now are not seen by search engines.
Ultimately, we would want the Web to be entirely searchable by software, without any humans guiding the process. This would be the true Semantic Web.
Namespaces and triples.
In past postings of this blog, we have discussed a handful of key approaches to implement the Semantic Web. One idea is to tag information with standardized sets of terminology called “namespaces“.
We have also looked at the idea of embedding these tags in things called “triples“. In this posting, we look at this concept more closely and consider an existing language that would allow people to specify these triples.
RDF and SPARQL.
The most well-known standard for specifying triples is RDF, which stands for the Research Description Framework. SPARQL is a query language, heavily influenced by SQL, that can be used to search data that has been structured using RDF.
This is the first of a series of blog postings in which we will first look at RDF, and then at SPARQL. Then, we’ll consider the big issue: will RDF and SPARQL enable the development of the true Semantic Web?
So, what is RDF? At its highest level, RDF is used to describe anything that can be found on the Web. RDF has an XML syntax; in other words, RDF can be written as an XML program, using a set of predefined “element” and “attribute” tags. (XML and XML languages were discussed in an earlier posting of this blog, as was XML and declarative languages.)
We might remember that on its own, XML is impotent. It is not in itself a programming language. It is simply a language standard for taking a set of tags and using them as “elements” and “attributes” in a declarative, data-intensive languages. A good example is SMIL, which is used to define multimedia presentations.
Here is a fragment in RDF, using its XML syntax. Note that XML languages are embedded languages, with opening tags beginning with <> and closing ones ending in </>
This looks complicated, but it’s not. This simple example illustrates the power of RDF. It uses a set of standardized RDF-specific tags, and the second line of code tells us where these tags come from: the w3.org site, which contains a vast store of information about advanced web technology. In other words, we can go to w3.org to find the precise definition of RDF specific tags.
RDF is engineered to also use other sets of tags, in particular, domain-specific tags. In this example, these tags come from a (non-existing) url called someurl.org. The tags themselves are prefaced with “zx:” in the rest of the code, so we know which tags are native RDF and which come from a domain-specific set of tags (called a namespace).
The xml “element” called Description is an RDF-specific tag that tells us we are giving the description of some resource on the Web, namely one at a (non-existing) website called awebsite.org.
The whole piece of code is one triple: It says that the topic of the resource at www.awebsite.org/index.html is funstuff. Here it is as a triple, with all the xml syntax and the namespace information removed:
www.awebsite.org/index.html <topic> funstuff.
Let’s overview this again. RDF is an XML language, so it uses the syntax of XML. One of the primary concepts in XML is that of an “element”, and Description is an XML element, one defined in the RDF standard. The piece of code begins with two namespace statements, one telling us which RDF specification we are using, and the second telling us that we will also be using some tags from another, domain-specific specification, which includes the tag “topic”. Then there is the guts of the triple, telling us that we are listing the topic of a Web-resident resource.
More on this in the next posting…