This is the third in a continuing series of blogs about the Semantic Web and Web 2.0/3.0. Our focus here is on the Semantic Web.
Let’s look carefully at that word. What do we mean by “semantic”?
Even though the Semantic Web is still very far from fully existing, the effort behind it is a number of years old now. But the heavy use of this word in computing is far older, dating back to at least the late 1970s.
So, what do we mean when we use this word, in particular, with regard to the Semantic Web?
Like a human or “natural” language, a programming language has two key aspects: syntax and semantics. The syntax of a language refers to the structural rules that tell us what constitutes a legal program, just as the syntax of English tells us how to speak correctly. But syntax ignores the meaning of the program or English statement. The semantic rules of a language are what tell us the meaning.
Interestingly, a human statement can be syntactically correct, while its semantics might be ambiguous. If “Time flies”, does it mean that time goes by quickly, or that your buddy, Freddy Time, likes to fly his plane on weekends? But in general, a computer program must have only one set of semantics; otherwise, the computer doesn’t know what to do with it.
There is a broader – and far more ill-defined – use of the word “semantics” in computing. It’s used heavily, especially by researchers writing academic papers, as a sort of bragging term. We like to claim that our way of representing data captures more of the “semantics” of the data. In other words, the more expressive our way of representing data, the more semantics can be deduced from its structure, and this is clearly a good thing.
Very important: when we look at the structure of the data, it includes all the terms used to describe the data. If I have a relational table called “Insurance Claims”, with a character attribute called “Subscriber Name”, and an integer attribute called “Amount Charged”, can a human with a modest knowledge of insurance deduce what it means?
Yes, in fact.
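To make this concrete, here is a minimal sketch in Python of the table described above, using the standard sqlite3 module. The choice of SQLite, the in-memory database, and the helper names build_schema and describe are my own assumptions for illustration; the point is only that the names and types in the schema can be read back without looking at a single row of data.

```python
import sqlite3

def build_schema():
    """Create an in-memory database holding only the table definition."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE "Insurance Claims" ('
        '"Subscriber Name" TEXT, '      # a character attribute
        '"Amount Charged" INTEGER)'     # an integer attribute
    )
    return conn

def describe(conn, table):
    """Return (attribute name, declared type) pairs for a table."""
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    # Each row is (cid, name, type, notnull, default, pk); keep name and type.
    return [(row[1], row[2]) for row in rows]

conn = build_schema()
schema = describe(conn, "Insurance Claims")
print(schema)
```

Everything a reader needs in order to guess at the meaning is in those two names and two types; the table itself is still empty.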
In the computing world, we are constantly creating new and more powerful ways of representing information in computers. Java and C# and C++ use object structures to represent data. MySQL and Oracle and Microsoft SQL Server use relational schemas to represent data; these consist of “relations” (also known as “tables”), along with “attributes” (also known as “columns”), along with other properties, like “primary keys”. With XML, we use things called “elements” and “attributes”, and other constructs, to model data.
It’s not really accurate for me to say “more powerful”; really, we just mean different. So, more precisely, our claim is that our way of representing data, given the sorts of data we are manipulating, makes it easier for us to deduce its meaning from its structure, i.e., its semantics from its syntax. XML documents are inherently very different from relational tables; they are used to model very different stuff. Neither is really more powerful than the other.
Note that we do not include the data itself when we talk about the ability of the syntax to imply the semantics of the data. The rows in a relational table are irrelevant when we are judging the power of the relational model to represent data. And often, we don’t include whatever code or logic is used to manipulate the data. When I described the relational table above, I didn’t say what SQL queries are used to manipulate the Insurance Claims table. But certainly, we could have, and it would have made perfect sense to consider this part of its structure. In fact, we include the methods of an object-oriented class in its structural definition, and of course, the syntax of Java specifies how to write legal methods. And so, the methods of a Java class are part of what we use to deduce the semantics of the data represented by that class.
So here’s one way to look at the Semantic Web: we try to use ways of structuring data that are so powerful, so rich in the way they can be used to imply the semantics of the data, that this interpretation can be done largely automatically. This would make the web far more powerful.
Let’s step back for a moment and consider the terms that are used to specify the name of a relational table (“Insurance Claims”), the names of the attributes (“Subscriber Name” and “Amount Charged”), and the names of the domains of those attributes (characters and integers). In the previous blog in this series, we looked at namespaces. We could consider these terms from our relational schema to form a namespace.
Importantly, namespaces are a major aspect of the Semantic Web, and are aimed at giving us web-wide standards for the terms we use to describe part of the structure of data. In my relational database, I might use terms that tend to be common across insurance companies, but are not necessarily universal. And sometimes, the terms might have conflicting meanings from one insurance company to another.
But on the Semantic Web, we would specify a namespace and ask that all insurance companies use these same terms with the same meanings.
What about the rest of the definition of data on the Semantic Web? How do we put terms together in a way that is analogous to putting terms together to form a relational schema? One large research community thinks we should all use “triples”. Here’s one: <Tolstoy> <author> <War and Peace>. We’ve taken three terms and put them into a triple.
Here’s the exciting part: The left node could consist of a URL that points to a website dedicated to Tolstoy. The middle part could consist of a URL that contains a set of agreed-upon terms for describing books, in other words, a namespace. The right part could consist of a URL that has the text of War and Peace on it.
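As a rough sketch of that idea in Python, a triple can be modeled as three strings, each of which is a URL. All of the URLs below are made up for illustration (they live under example.org), and the field names subject, predicate, and object are just conventional labels I am assuming for the left, middle, and right parts.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str    # the left part: who or what the statement is about
    predicate: str  # the middle part: a term taken from a shared namespace
    object: str     # the right part: what the left part is related to

# Hypothetical namespace of agreed-upon terms for describing books.
BOOKS_NS = "http://example.org/book-terms#"

triple = Triple(
    subject="http://example.org/people/tolstoy",      # a site about Tolstoy
    predicate=BOOKS_NS + "author",                    # the shared term
    object="http://example.org/texts/war-and-peace",  # the text of the book
)
print(triple)
```

The middle part is the interesting one: it is not just the word “author”, but a pointer to the place where that word is defined.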
In other words, we can use namespaces, combined with triples, to glue together data on the World Wide Web. Then, we could imagine that a program could go out on this new “Semantic Web” and find the authors of a large set of books. One critical subtlety is that we would be guaranteed that “author” means the same thing in each case, because it has been taken from a shared namespace that is used by any site that represents books and their authors on the web.
This is a key aspect of why the semantic web could be so powerful: shared namespaces guarantee common usage of terms, and triples can be used to glue information together into pieces that could be located automatically, i.e., without a human having to interactively verify and interpret every piece of data returned.
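Here is a small, hedged sketch of what such a program might do once it has collected some triples. The data, the URLs, and the function name authors_of are all hypothetical; a real system would gather triples from many sites rather than from a hard-coded list. Notice that a triple using a different namespace’s “author” term is ignored, because nothing guarantees it means the same thing.

```python
# Hypothetical shared term, plus a look-alike term from another namespace.
AUTHOR = "http://example.org/book-terms#author"
OTHER_AUTHOR = "http://example.com/other-terms#author"

WAR_AND_PEACE = "http://example.org/texts/war-and-peace"
ANNA_KARENINA = "http://example.org/texts/anna-karenina"
EMMA = "http://example.org/texts/emma"

# Each triple is (left, middle, right), following <Tolstoy> <author> <War and Peace>.
triples = [
    ("http://example.org/people/tolstoy", AUTHOR, WAR_AND_PEACE),
    ("http://example.org/people/tolstoy", AUTHOR, ANNA_KARENINA),
    ("http://example.org/people/austen", AUTHOR, EMMA),
    ("http://example.org/people/smith", OTHER_AUTHOR, EMMA),
]

def authors_of(triples, books, predicate=AUTHOR):
    """Map each requested book to the authors linked by the shared term."""
    found = {}
    for left, middle, right in triples:
        # Match on the full namespace URL, not on the bare word "author".
        if middle == predicate and right in books:
            found.setdefault(right, []).append(left)
    return found

result = authors_of(triples, {WAR_AND_PEACE, EMMA})
print(result)
```

No human had to look at the results and decide which “author” was meant; the namespace did that job.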
The Semantic Web. We introduced it in the entry before this one, which was the first entry in this blog.
This is a very aggressive goal, but if it were to ever exist, the Web would become something far more powerful than it is now. Today, we can only search the Web manually, interactively. We pull up Google, type in some keywords, and see what comes back. Then we begin to iterate. There is one obvious problem and two that might not be quite so obvious – and all three of them would be fixed if the Semantic Web really existed.
The key is that word “semantic”. It means that programs that search the Web, i.e., search engines of the future, would be able to search by the meaning or semantic content of the information we are looking for, and not simply by looking for keywords in the text of pages indexed by the search engine.
What are the three problems?
First, obviously, we would be able to perform a search with little or no iterating. This would radically reduce the need for a human to be in the loop, constantly guiding the search engine with more and more refined keyword searches. On the Semantic Web, a search engine would simply go out there and find whatever it is we need, and then deliver it up. If we search the Web by keywords and are looking for a treatment for tapeworms, we might not get the right results, because we don’t know enough medical terminology to realize that what we are really looking for is a treatment for a disease caused by tapeworms, a disease called taeniasis.
Second, not so obviously, the Semantic Web could come a lot closer to assuring us that our search was complete, in that the information returned was not only relevant, but that there wasn’t anything important that had been missed. If we are searching for treatments for tapeworms, and we find four possible treatments, how do we know there isn’t a fifth one out there that is more effective, quicker acting, and safer than the other four?
Third, and even more subtly, the Semantic Web would largely solve the huge problem of heterogeneity of data, of mixing information that isn’t truly comparable – essentially of mixing apples and oranges. Right now, when we search the Web interactively with Google, we might find one site that says that a “high end” notebook computer costs $3000, while another site says that a high end notebook computer costs $2300. Wouldn’t it be nice if the search engine could automatically ensure that we are comparing two computers that are truly similar in all significant ways?
So what’s behind the Semantic Web, what will power it if it ever emerges? A keystone technology will be that of “namespaces”. The idea is simple, and while it is only very much a partial solution, it is surprisingly powerful, given its simplicity. Essentially, a namespace is a collection of terms that multiple people agree to share, and furthermore, they agree on specific meanings for those terms. The Web, as it turns out, provides a powerful way of sharing namespaces: we can plant them on websites and anyone who wants to use those terms knows where to find them, along with their meanings.
One of the first namespaces to explode on the Web is called the Dublin Core. (Sorry, but the name refers not to the Dublin in Ireland, but to the Dublin in Ohio, where a group of people met to establish this namespace.) It is a collection of terms that can be used to describe resources that can be found on the Web, or in paper libraries, or in any other place where we store information. These terms include Contributor, Date, Publisher, Subject, and many more. And if you want to find the Dublin Core, it is publicly available at:
We’ll look a lot more at namespaces in future entries in this blog. We’ll also consider such technologies as XML – the standard for specifying namespaces.
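As a small preview, here is a sketch in Python of what tagging a resource with a few Dublin Core terms might look like in XML, using the standard xml.etree.ElementTree module. The namespace identifier in the code is the one commonly used for the Dublin Core element set; the record element and every value in it are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace identifier commonly used for the Dublin Core element set.
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

# A made-up resource description; "record" and all values are hypothetical.
record = ET.Element("record")
for term, value in [
    ("publisher", "Example Press"),
    ("date", "2009-01-01"),
    ("subject", "Semantic Web"),
]:
    # ElementTree writes namespaced tags as "{namespace}localname".
    child = ET.SubElement(record, f"{{{DC}}}{term}")
    child.text = value

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)

# Anyone who knows the namespace can pull the term back out by its full name.
parsed = ET.fromstring(xml_text)
subject = parsed.find(f"{{{DC}}}subject").text
print(subject)
```

The tag is not just “subject”; it is “subject” as defined by the shared namespace, which is exactly what makes it safe for a program to rely on.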
The purpose of this blog is to discuss cutting edge technology that relates to Web 2.0 and the Semantic Web. What do these terms mean?
Let’s start with the definition of a third term. A Web Application is a website that provides some sort of substantive functionality other than simply filtering and presenting information. Evernote is a fantastic web app that stores your notes on a server, and allows you to create, group, and annotate your notes. Some folks say that a web app makes it clear that there is an application at the other end of your browser, and not just a bunch of static data. This is admittedly a pretty soft definition, but it’s reasonable. Another way to look at it is that a web app provides what would otherwise be a desktop application, but makes it accessible from a server so that users do not have to install and maintain an application.
So what’s Web 2.0? It refers to web development frameworks and tools that can be used to create highly responsive websites and web applications. AJAX does this, and the canonical example people give is Google Maps. AJAX allows data to be retrieved asynchronously while a prior page is being displayed and manipulated by a user, and minimizes the amount of a web page that must be replaced with the next refresh.
A somewhat newer approach is embodied in Adobe Flex and Microsoft Silverlight technologies; in these cases, a web app is sped up by running more of the application’s logic inside a browser plugin (Adobe Flash or Microsoft Silverlight), rather than making the client machine (which runs the user’s browser) continuously talk to the web server. The overall challenge is to make web pages highly dynamic (meaning the data comes from a database and is not hard-coded in the web page) while giving the user response times that approach those of a desktop application running on a dedicated or near-dedicated machine. While this is intractable at this point, it’s a good thing to hold up as a goal.
The term Semantic Web does not narrowly refer to technology that speeds up response rates. Rather, it refers to a still emerging body of software tools whose overall goal is to automate the collection and integration of information gleaned from websites. The idea is to free the Google/Yahoo user from painfully interactive, highly repetitive keyword searches where we continue to hone our queries until we seem to be finding the right stuff.
Semantic Web technology includes namespaces, which try to put more smarts in websites by having data tagged with widely shared, standardized sets of tags. And things like XML Schema and XQuery can be employed to leverage namespace technology to support high-volume, set-oriented queries of data stored on web servers. These are very similar to the sorts of queries that can today be coded in SQL and run on single database servers running database management systems like Oracle, SQL Server, DB2, MySQL, and PostgreSQL. Essentially, XML-based technology takes the ability of a relational database schema to help us interpret data, and extends it to the entire web.
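To give a feel for what a set-oriented query over namespace-tagged data might look like, here is a hedged sketch in Python. It is not XQuery; it uses the limited path support in the standard xml.etree.ElementTree module as a stand-in. The insurance namespace, the element names, and the sample claims are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical shared namespace for insurance terms.
NS = {"ins": "http://example.org/insurance-terms#"}

# Made-up sample data tagged with terms from that namespace.
DOCUMENT = """
<ins:claims xmlns:ins="http://example.org/insurance-terms#">
  <ins:claim>
    <ins:subscriberName>Ann Lee</ins:subscriberName>
    <ins:amountCharged>1200</ins:amountCharged>
  </ins:claim>
  <ins:claim>
    <ins:subscriberName>Bo Chen</ins:subscriberName>
    <ins:amountCharged>300</ins:amountCharged>
  </ins:claim>
  <ins:claim>
    <ins:subscriberName>Raj Patel</ins:subscriberName>
    <ins:amountCharged>5400</ins:amountCharged>
  </ins:claim>
</ins:claims>
"""

def large_claims(xml_text, minimum):
    """Return (subscriber, amount) for every claim at or above minimum."""
    root = ET.fromstring(xml_text.strip())
    results = []
    # Roughly: SELECT name, amount FROM claims WHERE amount >= minimum
    for claim in root.findall("ins:claim", NS):
        amount = int(claim.findtext("ins:amountCharged", namespaces=NS))
        if amount >= minimum:
            name = claim.findtext("ins:subscriberName", namespaces=NS)
            results.append((name, amount))
    return results

matches = large_claims(DOCUMENT, 1000)
print(matches)
```

The shape of the query is the same as what we would write in SQL against a single database; the difference is that the tags come from a namespace any site could share.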
We will look at XQuery and XML Schema in future entries of this blog.
By the way, some folks are already talking about Web 3.0, which in many ways draws from both Web 2.0 and Semantic Web technology. We’ll look at this in a future blog, but a key focus is on making web apps highly multimedia.