The Semantic Web archives - Buzz’s Blog: On Web 3.0 and the Semantic Web


Aug 16 2009   4:16AM GMT

Dangers of the Semantic Web: Assertions, Inferences, and Surrogates



Posted by: Roger “Buzz” King
the Semantic Web, assertions, inferences, namespaces, next generation search engines, smart search engines, surrogates

This blog deals with advanced Web technology. Each posting should be quite understandable on its own, but the blog as a whole is a continuing story. We’ve been looking at the Semantic Web, which is a global effort to automate the searching of the Web, so that applications (we might call them smart search engines) can find, interpret, interrelate, and aggregate information stored in multiple, independent websites.

Assertions and Inferences.

A key concept is that of an “inference”, a fact that is created by putting together two or more pieces of information that we might call “assertions”. We used the following example in a previous posting. The two assertions might be posted on the Web somewhere.

Assertion 1: THE BALL is ORANGE.
Assertion 2: ORANGE is an UGLY COLOR.
An inference created by putting the two assertions together: THE BALL is an UGLY COLOR.
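To make the chaining concrete, here is a minimal sketch in Python. It assumes assertions are stored as simple (subject, predicate, object) tuples and that only the word "is" gets chained; the function name `infer` is invented for this illustration and is not part of any Semantic Web standard.

```python
# Assertions as (subject, predicate, object) triples; the layout is an assumption.
assertions = [
    ("THE BALL", "is", "ORANGE"),
    ("ORANGE", "is", "UGLY COLOR"),
]

def infer(triples):
    """Chain 'A is B' and 'B is C' into the new assertion 'A is C'."""
    inferred = []
    for s1, p1, o1 in triples:
        for s2, p2, o2 in triples:
            # The object of the first assertion must be the subject of the second.
            if p1 == p2 == "is" and o1 == s2:
                inferred.append((s1, "is", o2))
    return inferred

inferences = infer(assertions)
print(inferences)
```

Notice that the program only matches strings. It has no idea what a ball or a color is, which is why the terms need to be carefully defined and shared.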

We have also discussed the fact that terminology used in inferences must be very carefully defined and widely shared.

What is a Surrogate?

The word surrogate, in the programming world, refers to a measure or model that is being used to approximate the “real” measure or model. If I am trying to estimate the depth of the ocean at some point, but don’t have a direct way of measuring the distance to the ocean floor, I might judge the depth by using a table that associates the distance from the shore to the depth of the ocean. The assumption is that all points that are a particular distance from the shore will have the same depth more or less.

Here’s the important point for us: The Semantic Web will make very heavy use of surrogates. Let’s be precise about this. We’re not talking about approximations. We might search the Web for all banks that provide accounts that earn 5%, and our smart search engine might point us to banks that on the average, over the past two years, have paid at least 5.0% on their accounts. A surrogate is something different. Suppose we wanted to find all banks that never cheated their customers. This might be impossible to answer precisely, so we might look for banks that are in the bottom 10% when it comes to the number of formal complaints filed against them. That would be a surrogate.
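Here is a rough sketch of that bank surrogate in Python. The bank names and complaint counts are made up for illustration, and `lowest_complaint_banks` is a hypothetical helper, not something a real search engine provides.

```python
# Hypothetical complaint counts per bank; names and numbers are invented.
complaints = {
    "Bank A": 12, "Bank B": 340, "Bank C": 57, "Bank D": 3, "Bank E": 96,
    "Bank F": 210, "Bank G": 41, "Bank H": 8, "Bank I": 150, "Bank J": 75,
}

def lowest_complaint_banks(counts, fraction=0.10):
    """Return the banks in the lowest `fraction` by complaint count.

    This is the surrogate: it answers "who has few complaints?",
    not the real question, "who never cheats customers?".
    """
    ranked = sorted(counts, key=counts.get)        # fewest complaints first
    keep = max(1, int(len(ranked) * fraction))     # always keep at least one bank
    return ranked[:keep]

trusted = lowest_complaint_banks(complaints)
print(trusted)
```

The code is trivial, and that is the point. The hard question has been quietly swapped for an easy one that a program can answer.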

Surrogates on the New Web.

Now, let’s consider the Web. It doesn’t matter if we are talking about the Web today or the emerging Semantic Web.

In fact, what we are concerned with here is global to computing in general: when we take a chore normally performed by a human using an interactive interface and turn that chore over to a computer program, we often turn a real world decision into a decision based on very simplified surrogates. A human can look at a bunch of information and, although it may take a very, very long time, make a “perfect” decision based on that data. But computer programs cannot think like a human. We can only crudely simulate with software the process of thinking that goes on in the mind of a real person.

Now, back to the Web, the new Semantic Web. Suppose we build a next generation website and use an official namespace (which is a structured set of terms) to specify assertions using terms from this namespace. What we’re doing is providing a surrogate for the smart search engine to use so that it can do the filtering of URLs and the integrating of information from multiple sites.

Consider our two assertions from above, along with the inference derived from them:

Assertion 1: THE BALL is ORANGE.
Assertion 2: ORANGE is an UGLY COLOR.
An inference created by putting the two assertions together: THE BALL is an UGLY COLOR.

Maybe we are shopping for a ball online. We might have to follow hundreds of URLs and search hundreds of websites to find just the right ball. But who said the ball is orange? It’s an approximation made by the vendor of the ball in question. It has been labeled orange. But maybe it’s a shade of orange that we would actually have liked if we had looked at the picture of the ball ourselves instead of leaving it to the search engine.

Well, we might argue that the word orange, if it is precisely defined, won’t be confused with some other color. We can be confident that our notion of orange is the same as the vendor’s notion of orange. We do know how to express colors very precisely by using numbers.

So, let’s change the assertions and the inference a bit:

Assertion 1: DOROTHY THE DOLL is PRETTY.
Assertion 2: WE want a PRETTY DOLL.
An inference created by putting the two assertions together: WE might want DOROTHY THE DOLL.

Now, how could the notion of pretty ever be globally and uniformly defined?

It cannot.

Maybe we should shop for our own dolls and not leave it to a next generation search engine.

The Lesson.

The Semantic Web will trade accuracy for speed. No way around it.


Aug 5 2009   9:39PM GMT

The Semantic Web: RDF and SPARQL, part 5



Posted by: Roger “Buzz” King
the Semantic Web, ontologies, RDF, triples, knowledge, information, data

This posting is a continuation of the previous posting. We are discussing RDF, the “triples” language that is serving as a cornerstone of the Semantic Web effort. The goal of the Semantic Web is to partly automate the searching of the Web, by using RDF to capture deeper semantics of information and SPARQL to query that information. This is in comparison to today’s search engine technology, which does not allow us to do much more than search for individual words in the text of webpages.

Let’s step back for a moment.

Just how universal is this notion of RDF-style triples? Will we ever have something substantially more useful, more powerful in the semantics it can express?

Data, Information, Knowledge, and Ontologies.

Academic and industrial researchers in computing like to trivialize big words. Let’s briefly look at the problem. “Data” is an old word, and most of us have a sense that virtually anything stored digitally can be considered data. This includes applications and other pieces of software, too. If you back up some applications to free up space on your hard drive, you’ve just turned applications into data, right?

“Information” is a word that came into play when researchers wanted something that was smarter than data. The word was broader, and vaguer, but information was essentially data that was ready to be used by interactive users. If I pull down a page from the Encyclopedia Britannica site, it’s filled with information.

Then, there were demands for an even richer word, one that suggests data that is beyond information, stuff that is rich in semantics that can be easily extracted: “knowledge”. Often, knowledge was data or information that had been interconnected, turned into trees or graphs. Traversing the links in the structure told us how various things were interrelated, thereby exposing powerful semantics. The Web in a sense is knowledge. I can follow links between pages to discover how various pages on the Web are interrelated. I can follow connections on the Britannica site to connect a scientific discovery to the story of the discoverer’s life.

Here’s something significant. This blog and all its postings are related to new web technology, such as the Semantic Web. Our central concern has been the partial automation of the searching of the Web, so that users aren’t limited to typing words into Google and getting back stuff no richer than pages that happen to have these words in them. As it turns out, the term “knowledge” dates way back before the days of the Web, but back then, our notion of what it meant to be knowledge and not just data or information was pretty much the same as it is now. Knowledge can be processed by programs, thereby automating the task of finding the right knowledge and applying it to our problem domain.

Then came “ontology”. This is a relatively new word, but it’s perhaps the most embarrassing. The word, until recently, was reserved for philosophers to use. An ontological argument is an argument about the existence of something. Over the centuries, one common subject of ontological discussions has been the existence of God.

Hmm.

The same old, same old.

Flash forward to the Internet age: Computer researchers use the term to refer to a precise specification of the objects and properties (of these objects) in some well studied domain. I guess the idea is to suggest that we can capture the true nature of the existence of some domain.

These domains could be large, like banking, health insurance, or the stock market. Laying out all of the objects involved in one of these is a daunting task. Consider an insurance claim and all of its properties: type of claim, provider of medical service, patient name, etc., and then imagine laying this all out for insurance policies, underwriting tables, actuarial data, etc. To include all of the objects and properties involved in building software for an insurance company would lead us to thousands of interconnected terms. Triples, in other words.

Or our ontology could be the specification of a pencil object, which has properties like being made of wood and graphite and metal, of having yellow paint and a little pink eraser. Triples like this:

The pencil has a pink eraser.
The pencil is painted yellow.
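A small sketch in Python shows how little machinery such an “ontology” needs. The tuple layout and the `describe` helper are assumptions made for this illustration; the property values come from the pencil description above.

```python
from collections import defaultdict

# The pencil "ontology" as plain (subject, predicate, object) triples.
pencil_triples = [
    ("pencil", "has", "pink eraser"),
    ("pencil", "is painted", "yellow"),
    ("pencil", "is made of", "wood"),
    ("pencil", "is made of", "graphite"),
]

def describe(subject, triples):
    """Group every triple about `subject` by its predicate."""
    props = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            props[p].append(o)
    return dict(props)

pencil = describe("pencil", pencil_triples)
print(pencil)
```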

This characterizes the nature of the challenge we have taken on in our efforts to build ontologies. We take on the problems of scale, not the problems involved in really capturing, in some formal fashion, the nature of the world around us. We build gigantic, but very simple, models of the things that concern us in the software world.

We have trivialized this term, ontology. In fact, for the most part, we’re simply referring to the same old, same old modeling construct: triples. Yes, that simple tool called RDF can be used to build a vast “ontology”.

There is something about the nature of triples that has conquered computing. It is a concept that, as we have seen in previous postings of this blog, underlies object-oriented data structures. It predates object-oriented languages, going back to the early days of AI and the attempts to model the real world.

So, what is an ontology?

An ontology is supposed to be the end of the Semantic Web rainbow: our ability to fully automate the specification and searching of the real world. But the next time some computer person tries to impress you by tossing this term at you, remember to just shake your head and say “Quit being a puff toad. You’re just talking about triples.”



Jul 29 2009   2:17AM GMT

The Semantic Web: RDF and SPARQL, part 4



Posted by: Roger “Buzz” King
the Semantic Web, RDF, SPARQL, SQL, triples, XML

This posting is a continuation of the previous posting. We are discussing RDF, the “triples” language that is serving as a cornerstone of the Semantic Web effort. In this posting, we will look at SPARQL, the web language designed to search data that has been specified as RDF triples. The goal of the Semantic Web is to partly automate the searching of the Web, by using RDF to capture deeper semantics of information and SPARQL to query that information. This is in comparison to today’s technology, which does not allow us to do much more than search for individual words in the text of webpages.

From the last posting.

Here is a piece of the RDF code from the previous posting:

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:zx="http://www.someurl.org/zx/">

  <rdf:Description
      rdf:about="http://www.awebsite.org/index.html">

    <zx:created-by>http://www.anotherurl.org/buzz</zx:created-by>

  </rdf:Description>

</rdf:RDF>

This can be interpreted as: the webpage at awebsite.org was created by Buzz.

Another representation of RDF-based information: 3 triples.

We see from the above that RDF simply represents triples. We could simplify it even more as:

 http://awebsite.org/index.html was created by Buzz

Part of the reason that the original RDF code above is so much more complex is that the full syntax lets us specify that we are using terms that are defined at specific web addresses. This allows people to use standardized terms and greatly enhances the specificity of an RDF specification. The full syntax also allows us to reference pieces of information that reside on the Web. (See the previous three postings.)

Before we launch into a SPARQL example, we need to make an important distinction between syntax and semantics. The code above is written in a particular syntax for RDF, one that uses XML. We note that because syntax needs to be very precise, it tends to be verbose. This can cause syntax to obscure the conceptual simplicity of the underlying semantics, or meaning.

But this isn’t the only way to specify RDF triples. Let’s look at some information that is much simpler, and at the same time, let’s look at using a different syntax for specifying RDF-like triples. Here are three triples:

<http://awebsite.org/> was-created-by "Buzz"

<http://awebsite.org/> was-created-by "Suzy"

<http://anotherwebsite.org/> was-created-by "Alice"

This is a very simple program. It consists of two triples that say that a website named awebsite was created by Buzz and Suzy, and another triple that says that Alice created a website called anotherwebsite. We are not saying that was-created-by is a widely used term; it may have been invented only for this particular RDF specification, and its meaning would therefore not be precise. We can only interpret it from our general understanding of English words. We also have no idea who these people Buzz and Suzy and Alice are, and we have no other information about them.

SPARQL: searching triples distributed across the Web.

Now, here is a piece of code:

prefix website1: <http://awebsite.org/>
SELECT ?x
WHERE
{ website1: was-created-by ?x }

We’re getting very close to real SPARQL, by the way, and if you know SQL, you can see the extreme similarity. But syntax is not our issue here. We’re trying to look at concepts.

This code will find the creators of http://awebsite.org. You could imagine that there are actually many thousands of these triples, and that they tell us who built a large number of different websites. Now, we see the power of this query. It will search through all of these triples and find the two of interest to us, and then pluck off the names of the creators.
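Conceptually, the query is just pattern matching over triples. Here is a sketch of that idea in plain Python. It is not a SPARQL engine; the `select` function is invented for this illustration, and `None` plays the role of the variable ?x.

```python
# The three triples from above, as (subject, predicate, object) tuples.
triples = [
    ("http://awebsite.org/", "was-created-by", "Buzz"),
    ("http://awebsite.org/", "was-created-by", "Suzy"),
    ("http://anotherwebsite.org/", "was-created-by", "Alice"),
]

def select(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching the pattern; None matches anything."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]

# "Pluck off" the object of every matching triple, as the query does with ?x.
creators = [o for _, _, o in select(triples, "http://awebsite.org/", "was-created-by")]
print(creators)
```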

In fact, these triples could be distributed all around the Web, and we could imagine a search engine taking this query and running it everywhere on the Web where was-created-by triples are stored, and then having it bring back all the creators of awebsite, even if there are a hundred developers, and even if these names are spread around the Internet.
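We can imitate the distributed case with a sketch that treats each site as its own little store of triples. The host names below are hypothetical, and a real engine would fetch the triples over the network rather than read them from a dictionary.

```python
# Hypothetical triple stores, keyed by the (made-up) host that publishes them.
stores = {
    "site-one.example": [
        ("http://awebsite.org/", "was-created-by", "Buzz"),
    ],
    "site-two.example": [
        ("http://awebsite.org/", "was-created-by", "Suzy"),
        ("http://anotherwebsite.org/", "was-created-by", "Alice"),
    ],
    "site-three.example": [
        ("http://awebsite.org/", "was-created-by", "Buzz"),  # same fact, repeated
    ],
}

def creators_of(site, stores):
    """Run the same pattern against every store and merge the answers."""
    found = set()
    for host, triples in stores.items():
        for s, p, o in triples:
            if s == site and p == "was-created-by":
                found.add(o)  # the set removes facts that appear on several hosts
    return sorted(found)

all_creators = creators_of("http://awebsite.org/", stores)
print(all_creators)
```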

Next, the bigger issue.

In the next posting, we’ll look more closely at SPARQL. One thing we will consider is why it does look so much like SQL. There is a powerful reason for this that has to do with searching information in general.



Jul 18 2009   9:57PM GMT

The Semantic Web: RDF and SPARQL, part 3



Posted by: Roger “Buzz” King
RDF, the Semantic Web, SPARQL, triples

This posting is a continuation of the previous posting. We are discussing RDF, the “triples” language that is serving as a cornerstone of the Semantic Web effort.

In the previous two postings, we looked at RDF, which is an excellent example of solid software technology: It serves an important purpose. It is easy to use. And, even if you don’t write any RDF yourself, it is easy to understand what it does, and therefore, how it will impact your life.

RDF, in its simple, quiet way, allows us to interconnect any resources that exist on the Web, and at the same time, make use of standardized terminologies. This provides a highly flexible and semantically expressive way of building the new Semantic Web.

SPARQL: what is it?

RDF is great stuff, but it’s only half the story. If knowledge on the emerging Semantic Web is going to be glued together into RDF triples, how will that information be searched? It doesn’t do any good to have a book that will solve all your problems if you can’t read it or search through it.

SPARQL is a recursive acronym: it stands for SPARQL Protocol and RDF Query Language, and we say it as “sparkle”. Interestingly, when something is called a “query” language, we start thinking in terms of SQL, that largely declarative relational language that is the core of almost all successful relational database management systems. Indeed, SQL has served as the model for SPARQL, as we will see in a later blog posting about XQuery, the language for searching XML-based data.

A blast from the past.

There’s something about triples that we should look at before moving on. It has to do with the fact that triples are also known as “assertions”, and that assertions can be chained together to make “inferences”. Here are two triples/assertions, specified very informally: THE BALL is ORANGE. ORANGE is an UGLY COLOR. The inference we can make is THE BALL is an UGLY COLOR.

Or, getting back to the Web and RDF, below are two triples specified in RDF; the first one comes from the previous posting of this blog.

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:zx="http://www.someurl.org/zx/">

  <rdf:Description
      rdf:about="http://www.awebsite.org/index.html">

    <zx:created-by>http://www.anotherurl.org/buzz</zx:created-by>

  </rdf:Description>

</rdf:RDF>

This first one can be interpreted as: the webpage at awebsite.org was created by Buzz.

Here is the second RDF triple:

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:zx="http://www.someurl.org/zx/">

  <rdf:Description
      rdf:about="http://www.anotherurl.org/buzz">

    <zx:is>http://www.yetanotherurl.org/professor</zx:is>

  </rdf:Description>

</rdf:RDF>

This one can be interpreted as: Buzz is the guy described at yetanotherurl.org.

We can chain them together to deduce that the guy who built the page at awebsite.org/index.html is Buzz the professor.

This is an inference.

The point is that if you take a bunch of RDF statements and chain them together, you get what looks a lot like an object-oriented graph of related objects, somewhat like you see in Java.  In a sense, RDF takes an object representation and breaks it down into triples.  There’s really nothing new in RDF, other than the fact that any part of an RDF assertion (triple) can be something found on the Web.
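The chaining can be sketched in a few lines of Python. This assumes the two triples have already been pulled out of their RDF/XML wrappers and shortened to bare predicate names; the `follow` helper is invented for this illustration.

```python
# The two triples from this posting, with the namespace prefix dropped.
triples = [
    ("http://www.awebsite.org/index.html", "created-by", "http://www.anotherurl.org/buzz"),
    ("http://www.anotherurl.org/buzz", "is", "http://www.yetanotherurl.org/professor"),
]

def follow(start, predicates, triples):
    """Walk from `start` along the given predicates, one hop per predicate."""
    current = start
    for predicate in predicates:
        matches = [o for s, p, o in triples if s == current and p == predicate]
        if not matches:
            return None        # the chain is broken, so nothing can be inferred
        current = matches[0]   # a sketch: take the first match only
    return current

result = follow("http://www.awebsite.org/index.html", ["created-by", "is"], triples)
print(result)
```

Following “created-by” and then “is” is much like writing page.createdBy.is in an object-oriented language, except that each hop may land on a different website.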

Back to SPARQL.

So, what is SPARQL?  It is a language that can be used to traverse graphs that consist of RDF triples that are chained together into an object network.

We will look at some SPARQL code in the next posting.


Jul 10 2009   2:35AM GMT

The Semantic Web: RDF and SPARQL, part 2



Posted by: Roger “Buzz” King
the Semantic Web, RDF, triples, XML, URI's

This posting is a continuation of the previous posting. We are discussing RDF, the “triples” language that is serving as a cornerstone of the Semantic Web effort. In the previous posting, we looked at a simple RDF program, which creates a relationship between a web-based resource and the term “funstuff”; the relationship is called “topic”, thus telling us that the resource located at the given URL is something fun.

RDF and URI’s.

One interesting fact is that, although we only used URI’s for two parts of the RDF triple embedded in this RDF program, we could have used URI’s for all three pieces of the triple. Thus, the program from the previous blog posting (immediately below) might be changed to look like the second program below, which now has two triples in it:

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:zx="http://www.someurl.org/zx/">

  <rdf:Description
      rdf:about="http://www.awebsite.org/index.html">
    <zx:topic>funstuff</zx:topic>

  </rdf:Description>

</rdf:RDF>

————-

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:zx="http://www.someurl.org/zx/">

  <rdf:Description
      rdf:about="http://www.awebsite.org/index.html">
    <zx:topic>funstuff</zx:topic>

    <zx:created-by>http://www.anotherurl.org/buzz</zx:created-by>

  </rdf:Description>

</rdf:RDF>

RDF and decentralized information.

As a reminder, the triple expressed in the first program can be stated as:

www.awebsite.org/index.html <topic> funstuff

So, what did we add in the second program?  There is a new triple that has been added.  It can be roughly stated as:

www.awebsite.org/index.html <created-by> http://www.anotherurl.org/buzz

In other words, our vocabulary defined at http://www.someurl.org/zx apparently has another standardized term called “created-by”.  The added triple in our second program says that the resource found at www.awebsite.org/index.html was created by someone who is identified by the url http://www.anotherurl.org/buzz.

We see that the value in the first triple, which concerns the “topic” of our resource, consists of a character string, but the value in the second triple, which concerns the “created-by” of our resource, is actually a URL.

This is big.  It shows us that all three parts of a triple in RDF can be URI’s, and they can be distributed around the Internet.  This means that the information embedded in the triple is highly decentralized.
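One way to see that all three parts are URIs is to expand the “zx:” shorthand by hand. The sketch below does that in Python. The `expand` function is a hypothetical helper, and it assumes the namespace declarations are spelled the way the XML namespace standard expects, with the prefix mapped to the someurl.org address.

```python
# Prefix-to-namespace table, mirroring the declarations at the top of the RDF.
prefixes = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "zx": "http://www.someurl.org/zx/",
}

def expand(name, prefixes):
    """Turn 'zx:created-by' into a full URI; leave other strings unchanged."""
    prefix, sep, local = name.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return name  # e.g. "http://..." has the prefix "http", which is not declared

triple = tuple(expand(part, prefixes) for part in (
    "http://www.awebsite.org/index.html",
    "zx:created-by",
    "http://www.anotherurl.org/buzz",
))
print(triple)
```

After expansion, the subject lives at awebsite.org, the relationship at someurl.org, and the value at anotherurl.org.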

The bottom line

This illustrates the power of RDF.  It can be used to express information which is not controlled in any centralized fashion.  RDF is thus the glue that can be used to bring diverse pieces of information together.  And it can use standardized, shared terminologies to precisely dictate the semantics of the triples in RDF programs.  In our example, the resource is defined by one URI, the kind of relationship is defined by another URI, and the value of that relationship is defined by yet another URI.

We will continue this in the next posting.


Jul 3 2009   4:28AM GMT

The Semantic Web: RDF and SPARQL, part 1



Posted by: Roger “Buzz” King
the Semantic Web, namespaces, RDF, SPARQL, XML, triples, automating Web searches

This blog is dedicated to advanced and emerging Web technology.  Each posting is meant to be understandable and informative on its own, but the blog as a whole tells a continuing story.

The Semantic Web.

In this posting, we will focus on the Semantic Web, which is a global effort at radically improving our ability to search the Web.

Currently, to search the web, we type keywords into a search engine like Google, which then searches its vast index of webpages for pages that have these keywords in them. Because this sort of search is very low-level, and not at all tied to the true meaning or purpose of the information stored in webpages, searching is painfully iterative and interactive.  A user must chase down countless URLs returned by a search engine to see if any of them are relevant.  Quite frequently, they are not.  And so, the user must refine the set of keywords and try again.  It might take many attempts before a satisfactory result is obtained.

One of the primary goals of the Semantic Web is to automate the process of searching the Web.  There are two stages to this.  First, people who post information on the Web must capture knowledge about the meaning of their information; this knowledge is commonly called “metadata”.  The metadata is then stored with the posted information.

The second stage happens when users search the Web.  Rather than using the low level keyword search approach, the search is at least partly automated.  The iterative process is sharply reduced by employing a smart search engine that knows how to find relevant information by searching for metadata that pertains precisely to whatever it is that the user is seeking.

The bottom line.

The goal?

The Semantic Web would be able to ease the burden of searching for information, as well as find vast stores of “hidden data” that reside in databases that are accessible via webpages, but whose contents right now are not seen by search engines.

Ultimately, we would want the Web to be entirely searchable by software, without any humans guiding the process.  This would be the true Semantic Web.

Namespaces and triples.

In past postings of this blog, we have discussed a handful of key approaches to implement the Semantic Web.  One idea is to tag information with standardized sets of terminology called “namespaces”.

We have also looked at the idea of embedding these tags in things called “triples”.  In this posting, we look at this concept more closely and consider an existing language that would allow people to specify these triples.

RDF and SPARQL.

The most well-known standard for specifying triples is RDF, which stands for the Resource Description Framework.  SPARQL is a query language, heavily influenced by SQL, that can be used to search data that has been structured using RDF.

This is the first of a series of blog postings in which we will first look at RDF, and then at SPARQL.  Then, we’ll consider the big issue: will RDF and SPARQL enable the development of the true Semantic Web?

RDF.

So, what is RDF?  At its highest level, RDF is used to describe anything that can be found on the Web.  RDF has an XML syntax; in other words, RDF can be written as an XML program, using a set of predefined “element” and “attribute” tags.   (XML and XML languages were discussed in an earlier posting of this blog, as were XML and declarative languages.)

We might remember that on its own, XML is impotent.  It is not in itself a programming language.  It is simply a language standard for taking a set of tags and using them as “elements” and “attributes” in declarative, data-intensive languages.  A good example is SMIL, which is used to define multimedia presentations.

Here is a fragment in RDF, using its XML syntax.  Note that XML languages are embedded languages, with opening tags written as <tag> and closing tags written as </tag>.

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:zx="http://www.someurl.org/zx/">

  <rdf:Description
      rdf:about="http://www.awebsite.org/index.html">
    <zx:topic>funstuff</zx:topic>

  </rdf:Description>

</rdf:RDF>

This looks complicated, but it’s not.  This simple example illustrates the power of RDF.  It uses a set of standardized RDF-specific tags, and the second line of code tells us where these tags come from: the w3.org site, which contains a vast store of information about advanced web technology.  In other words, we can go to w3.org to find the precise definition of RDF specific tags.

RDF is engineered to also use other sets of tags, in particular, domain-specific tags.  In this example, these tags come from a (non-existing) url called someurl.org.  The tags themselves are prefaced with “zx:” in the rest of the code, so we know which tags are native RDF and which come from a domain-specific set of tags  (called a namespace).

The xml “element” called Description is an RDF-specific tag that tells us we are giving the description of some resource on the Web, namely one at a (non-existing) website called awebsite.org.

The whole piece of code is one triple: It says that the topic of the resource at www.awebsite.org is funstuff.  Here it is as a triple, with all the xml syntax and the namespace information removed:

www.awebsite.org/index.html <topic> funstuff.
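For the curious, here is a sketch of how a program might strip away the XML and recover that triple, using only Python’s standard library. It assumes a well-formed version of the fragment, with the namespace attributes spelled xmlns and plain straight quotes. The `extract_triples` function is invented for this illustration and handles only this simple shape of RDF/XML, not the full standard.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# A well-formed copy of the fragment (assumed spelling: xmlns, straight quotes).
rdf_xml = """
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:zx="http://www.someurl.org/zx/">
  <rdf:Description rdf:about="http://www.awebsite.org/index.html">
    <zx:topic>funstuff</zx:topic>
  </rdf:Description>
</rdf:RDF>
"""

def extract_triples(xml_text):
    """Return one (subject, predicate, value) triple per property element."""
    root = ET.fromstring(xml_text)
    triples = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree writes a qualified tag as "{namespace-uri}localname";
            # dropping the braces gives the full URI of the term.
            predicate = prop.tag.replace("{", "").replace("}", "")
            triples.append((subject, predicate, (prop.text or "").strip()))
    return triples

triples = extract_triples(rdf_xml)
print(triples)
```

The predicate comes back as the full address of the term “topic” in the someurl.org namespace, which is exactly the information the shorthand “zx:” was standing in for.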

Let’s overview this again.  RDF is an XML language, so it uses the syntax of XML.  One of the primary concepts in XML is that of an “element”, and Description is an XML element, one defined in the RDF standard.  The piece of code begins with two namespace statements, one telling us which RDF specification we are using, and the second telling us that we will also be using some tags from another, domain-specific specification, which includes the tag “topic”.  Then there is the guts of the triple, telling us that we are listing the topic of a Web-resident resource.

More on this in the next posting…


Jun 11 2009   11:44AM GMT

The two duct tapes of computing: Excel and Firefox, and the New Web



Posted by: Roger “Buzz” King
Web 3.0, Web 2.0, the Semantic Web, Multimedia, Excel, browsers, models of computing, smart browsers

This blog concerns advanced Web technologies. Each posting should be readable on its own, but the series of blogs as a whole tells a continuous story.

In this posting, we look at the Duct Tape Phenomenon.

Excel.

As a researcher, I have worked with biologists in the past. Big biologists, not microbiologists, the folks who tinker with DNA. The folks I worked with study macroscopic things mostly, species, in particular. They search for as-yet undocumented species. They tend to have appointments at major universities around the world, and then take extended field trips to study life. Most of them go to rain forests because that’s where biodiversity is its greatest.

Each scientist has a chunk of the world and a kind of animal they specialize in. I know the butterfly man of Costa Rica, a fellow who has documented several thousand varieties of butterflies, some of which have wing spans of several inches. I know the bug man of the Amazon, who builds long tunnel-like things from the floor of the forest up to the canopy, fills the tunnels with bug killer, and then looks among the dead for bugs that are yet unheard-of.

Here’s the interesting part, at least from a computing perspective: a lot of the scientists I came into contact with store their data in Excel. This is a phenomenon that crosscuts the entire spectrum of computer users. They had to learn Excel at some point, maybe in school or at some workplace, and the next time they needed an application to do something, they found a way to make Excel do the job. For most people, learning the “right” application to use is far too much work, even if it’s hard to query Excel the way we would a database, even if Excel spreadsheets get way out of control size-wise, given the large amount of data many of us collect.

Excel, in many ways, is the duct tape of desktop and notebook computing.

Firefox (or your favorite browser).

But what about developers of desktop apps? What do they use as a design paradigm when building the interface to an app, even if it’s not meant for the Web?

Browsers.

Indeed, there is a merging of desktop GUI and web app interface technologies, and now you could sit down in front of a running app and not be sure which of the two you are seeing. In fact, the design impact is not the end of it. We actually use browsers now to interface with some desktop apps, but not often, not yet. However, at least as a user interface paradigm, the browser is becoming the duct tape of GUI design.

For developers of interfaces, Firefox has become a sort of duct tape.

The new Web.

These are the two things that underlie much of computing: the need to store and compute (as with Excel) and the need to interface (as with Firefox). But when the new Web (in the form of the Semantic Web and truly advanced Web 3.0 apps) begins to arrive, will a new paradigm emerge?

Perhaps they will be extra smart browsers that can process code written with xml and namespace and other semantic technology, so they can do more than just look for pages according to the English keywords on them.

In other words, we could imagine them as extensions of what our browsers do for us now. They’re very stupid now, really. They’re not at all smart like Excel.

How does it work now? Crawlers commissioned by search engines like Google constantly search the Web and “invert” every static page they find by building an index on every word in them. And then later, we can search this gigantic index store according to the words that appear on the pages that the crawler has found. Once we find URLs of interest, we click on them and go visit the actual pages. These searches are far, far less than “semantic” in nature.
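A toy version of this “inverting” fits in a few lines of Python. The page URLs and text below are made up, and a real search engine does far more (ranking, stemming, and so on); this sketch only shows the word-to-pages index itself.

```python
from collections import defaultdict

# Hypothetical crawled pages: URL -> text found on the page.
pages = {
    "http://example.org/a": "orange ball for sale",
    "http://example.org/b": "pretty doll for sale",
    "http://example.org/c": "orange is a color",
}

def build_index(pages):
    """Invert the pages: map every word to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, *words):
    """Return the URLs that contain every one of the given words."""
    results = [index.get(w.lower(), set()) for w in words]
    return sorted(set.intersection(*results)) if results else []

index = build_index(pages)
hits = search(index, "orange", "sale")
print(hits)
```

The index knows which words sit on which pages and nothing more. It cannot tell whether “orange” on a page is a color, a fruit, or a county.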

Our smart browsers will also have to let us build up organized libraries of specialized web content we have found, including documents, images, video, sound, animation, and such specialized data as medical treatment advice. We might maintain these in virtual space, or we might download frozen copies of pages to store on our machines. Our smart browsers could constantly look for updated versions of pages we have copied and downloaded.

These smart browsers will also have to interrelate data of a wide variety of sorts, so that a description of certain symptoms can be accurately hooked up with the specifics of a diagnosis and a medical treatment plan. Our browsers will have to isolate conflicting information, as well.

So, in the future, we’ll need browsers with smarts. We’ll look at this much more carefully in a future posting of this blog, but for now, here’s the lesson: these are the two things that applications do for us. They let us store and search things, and they let us compute things.

And what about viewing all this information? How will so much complex, multimedia information be presented? Not as simple webpages with images, text, and things you can click on. Perhaps the new browsers will lay out multimedia presentations of complex, integrated information that has been synthesized from many, many different sources.

The point.

So, what does this imply? That these two things underlie computing apps of almost all sorts: 1, storing and searching, and 2, viewing and manipulating.

And they will underlie the most complex and sophisticated end-user applications of the future.

In a vague, somewhat analogous fashion, most apps are a blend of Excel and Firefox.

Things change radically over time. And things never really change at all.



May 17 2009   3:45AM GMT

The Internet of Things Meets the Internet of Web Apps.



Posted by: Roger “Buzz” King
the Semantic Web, Web 2.0, Web 3.0, The Internet of Things, Ubiquitous Computing, advanced Web apps, RFID tags, online retail shopping

Injecting Smarts into the Semantic Web and Web 2.0/3.0.

In our continuing series on advanced web technology, we’ve looked at the difference between the Semantic Web and Web 2.0/3.0. We’ve also looked closely at the Semantic Web, and in particular, we’ve discussed what we mean by that word “semantic”. And with respect to Web 2.0/3.0, we’ve considered just what constitutes an advanced web app. And we’ve looked at some specific advanced apps.

But one thing has stood out above all else: the new world of web applications depends on our ability to make web apps smarter. At the core of this are a handful of key technological advances: namespaces, XML languages, full text searching, and web services. Still, as we have seen, we can only crudely mimic intelligence, which we do largely by using a complex mixture of standards, heuristics, and pre-made components.

Importantly, this issue of being smart is very old, and has been a far off goal of the folks who build software development tools since the very early days of computing. In truth, some of the things that seem new and exciting to us have actually been around for a long time, and have existed under multiple names.

But this base of intelligence-injecting technology, could it be used to give the Semantic Web and Web 2.0/3.0 a shot in the arm? Can we leverage the greater world of smart technology to make the new web even more powerful?

Let’s focus on just one technology that has been around a while, but is still vibrant and rapidly growing.

The Internet of Things.

This concept is centered around the idea that the objects in our world would serve us a lot better if computers could coordinate their use. Of particular interest are mobile objects. One of the key components behind this idea is the RFID tag. RFID stands for “radio frequency identification”. A tag can be attached to almost anything. After tags are deployed, an RFID reader can send out a signal, which is picked up by the RFID tags, which then respond. As things move around, and as things are used in concert to perform tasks, they can be carefully tracked and managed.

Other technologies for tracking objects can be employed, too, and RFID is just one example of something that is fairly cheap and very dependable.

It’s also true that objects can respond with more than a “Yo, I’m here.” In particular, they are likely to tell us exactly where they are, and whether they are in use. But for the most part, these things tend to be fairly inert when it comes to intelligence. They might be warehouse items or objects in retail stores. Volume is a key factor. RFID tags are cheap enough that an organization can tag tens of thousands or hundreds of thousands of items.

Immobile Things, but Mobile Users.

We can use the Internet of Things concept in another mode. The objects might be immobile, but the users might be highly mobile, and they might be carrying the tags. The objects might have computing capabilities in them, as well. If I work in a secure facility, and if I use a variety of computing devices in the course of the workday, I can be carefully tracked. And every machine could be engineered to allow me to perform only those functions for which I have been authorized. The computers could also track suspicious trends that involve multiple machines and multiple users over a period of time.

The Internet of Things and the Internet of Web Apps.

What does this all have to do with the Internet we are concerned with in this blog, the one that hosts next generation web apps? The two worlds could be blended together.

Consider this. When we buy things on the web, we normally use one of two retail models. If the object is software or data or in any downloadable electronic form, the website can ensure that by the end of the shopping session, our credit card has been charged and we have received the goods. This makes both the seller and the buyer happy.

Or, if the object is physical, like a printed book, the website will ensure that by the end of the session, our credit card has been charged, and we have been given a shipping number, a shipping date, or some other piece of information that gives us some assurance that we will get what we paid for. In this mode, the seller is likely to be quite happy, and the buyer might not be quite so happy.

But there’s another way. At the end of a retail session, the buyer of a physical product could be given the ID of the particular object being purchased, and then, via the retail website, track that object nonstop from the moment the session ends until the moment it arrives. The buyer could even track the construction of a purchased object out of many subcomponents.

The Bigger Picture.

Here’s something to think about, something else that can be used in concert with advanced web technology and the Internet of Things concept. It’s called “ubiquitous computing”, and it is a concept that has been around for many years. It refers to the expansion of computing technology into every aspect of our lives.

Putting all of this technology together means that the new web is working its way into law enforcement, supply chains, manufacturing processes, retail shopping, education, etc., etc., etc.

This will have a huge impact over the next decade.



May 11 2009   3:07AM GMT

The Semantic Web: revealing hidden data.



Posted by: Roger “Buzz” King
the Semantic Web, Oracle, SQL Server, DB2, PostgreSQL, MySQL, triples, namespaces, static pages, dynamic pages, databases, indexing, hidden web content

The Hidden Web.

The Semantic Web - a primary topic of this continuing blog series - will help us search the web with greater ease. One of the things it will (hopefully) do is expose a vast sea of information that is currently invisible to our web browsers. In fact, some say that right now, we can see less than 1% of what’s out there. I cannot vouch for this number, but I can say that what we cannot see right now includes large volumes of extremely valuable data.

Perhaps you have heard of the mysterious “Hidden Web”? So, what is this stuff and where is it?

Forms, Databases, and Interactive Interfaces.

The Hidden Web refers to data that is out there on the web, publicly accessible - but only via webpage interfaces that are opaque to the indexing software of search engines like Google.

Let’s step back for a moment.

The way search engines work, in case you don’t know, is by constantly searching the web, looking for new webpages. When a new page is found, it is added to the search engine’s index, meaning that now, when people search the web with Google, they might get the URL for that page in their search results.

The important thing to note is that the primary source of information that Google uses when it indexes a page is the page itself. What words are on it?

This sounds great for static webpages that are stored as-is on websites and delivered as-is to the Google user.

But suppose we want Google to find dynamic pages? A typical dynamic page has content that isn’t known until an interactive user types some words into a “web form”. A web form is a page where the browser user fills in blanks and then lets the browser send the completed page back to the server. There, the information in the form is used to select other information, which is plugged into a “dynamically” created page that is sent to the client machine and viewed by the browser user.

So, I might visit Amazon. I navigate to their search page, which is a form, and I type in the title of the book I want. That information goes back to the server. A description of this book, including its cost, is plugged into a dynamically created page, which is then downloaded to my machine so that I can read the material with my browser.
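Here is a rough sketch, in Python, of what the server side of such a form might look like. Everything in it is made up for illustration: the table, the book title, and the price are invented, and a small in-memory SQLite database stands in for whatever a real retailer uses. The point is that the result page does not exist anywhere until someone submits a title, so a crawler that only reads static pages has nothing to index.

```python
import sqlite3
from html import escape

def make_catalog():
    """Create a tiny in-memory database with one invented book in it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE books (title TEXT, price TEXT)")
    conn.execute("INSERT INTO books VALUES (?, ?)", ("An Invented Title", "19.99"))
    conn.commit()
    return conn

def render_result_page(conn, form_title):
    """Build the dynamic page for the title the user typed into the form."""
    row = conn.execute(
        "SELECT title, price FROM books WHERE title = ?", (form_title,)
    ).fetchone()
    if row is None:
        body = "<p>No book found.</p>"
    else:
        body = "<p>{}: ${}</p>".format(escape(row[0]), escape(row[1]))
    return "<html><body>{}</body></html>".format(body)
```

Calling render_result_page(make_catalog(), "An Invented Title") produces a page containing the title and its price. Any other title produces the “No book found.” page.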

Indexing Dynamic Pages.

So, if I have information that is not sitting in static pages, how can I get Google to index this information? There are multiple ways. For example, if the primary job of your website is to create large volumes of dynamically created pages, you might want to create a special directory page for your site - a static page - loaded with all the right words, and that contains links to the pages and forms you want the user to discover.

On the future Semantic Web, you might want to make sure that those magic words come at least in part from globally accessible namespaces, so that people who are using next-generation browsers, and who will be using these namespaces as a source of search keywords, will find your static page. As we have discussed, namespaces will provide us with detailed sets of terms, which will be tied to specific domains. This will make the search for static pages far more efficient than it is now.

As an example, a namespace concerning books might have words like ISBN-10 and ISBN-13. If the web designer uses these terms to describe static pages about books, and if the user of the browser can specify that they are looking for ISBN numbers, the browser will have a much more detailed idea of what is meant by those 10 and 13 digit numbers the user types in.

Here’s the critical part. Right now, Amazon lets you search by these numbers on their specialized web form page, but imagine if you could at any time tell your browser to look for ISBN numbers on whatever webpages it searches.
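As a small illustration of what it means for software to “know” that a number is an ISBN, here is a sketch in Python that uses the standard check digit rules for ISBN-10 and ISBN-13. The function name is hypothetical, and this is only one narrow piece of what a smart browser would need. Still, it shows the difference between treating a string as plain digits and treating it as a term with a definition.

```python
DIGITS = set("0123456789")

def classify_isbn(text):
    """Return "ISBN-10", "ISBN-13", or None for a string typed by a user."""
    chars = text.replace("-", "").replace(" ", "").upper()
    if len(chars) == 10 and set(chars[:9]) <= DIGITS and (
        chars[9] in DIGITS or chars[9] == "X"
    ):
        # ISBN-10: weights run from 10 down to 1, and X stands for 10.
        values = [int(c) for c in chars[:9]]
        values.append(10 if chars[9] == "X" else int(chars[9]))
        total = sum(v * w for v, w in zip(values, range(10, 0, -1)))
        return "ISBN-10" if total % 11 == 0 else None
    if len(chars) == 13 and set(chars) <= DIGITS:
        # ISBN-13: weights alternate 1, 3, 1, 3, and so on.
        total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(chars))
        return "ISBN-13" if total % 10 == 0 else None
    return None
```

With this in hand, a string like 0-306-40615-2 is recognized as an ISBN-10, while the same string with its last digit changed is rejected.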

An example of a namespace that is used to describe documents on the web is the Dublin Core, by the way.

So, that’s one way to make your dynamic pages somewhat visible. Create a web page that is static and leads to the pages you want users to see, and to make it all the more powerful, use terms from a globally accepted namespace like the Dublin Core. This is something that is already partly doable. The Dublin Core, along with other namespaces, is in wide use.
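Here is a sketch of such a static directory page, generated with a few lines of Python. It follows the common convention of putting Dublin Core terms into HTML meta tags with a “DC.” prefix and declaring the namespace with a link tag. The function name, the sample title, and the link are all invented for illustration, and a real site would likely use more of the Dublin Core elements.

```python
from html import escape

# The Dublin Core element set namespace.
DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"

def directory_page(title, description, subjects, links):
    """Build a static HTML directory page described with Dublin Core terms.

    links is a list of (url, label) pairs pointing at the forms and pages
    we want people to discover.
    """
    head = [
        '<link rel="schema.DC" href="{}">'.format(DC_NAMESPACE),
        '<meta name="DC.title" content="{}">'.format(escape(title)),
        '<meta name="DC.description" content="{}">'.format(escape(description)),
    ]
    for subject in subjects:
        head.append('<meta name="DC.subject" content="{}">'.format(escape(subject)))
    items = [
        '<li><a href="{}">{}</a></li>'.format(escape(url), escape(label))
        for url, label in links
    ]
    return (
        "<html><head><title>{}</title>\n{}\n</head>\n"
        "<body><ul>\n{}\n</ul></body></html>"
    ).format(escape(title), "\n".join(head), "\n".join(items))

# A made-up example: one directory page leading to one search form.
PAGE = directory_page(
    "Book catalog directory",
    "Entry points into our searchable book catalog",
    ["books", "ISBN"],
    [("http://example.com/search", "Search by title or ISBN")],
)
```

The resulting page is ordinary static HTML, so today's crawlers can index it, and the Dublin Core terms are there for any software that knows how to read them.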

Where Does that Information Come From?

Is there a better way, though? This technique will only point users to our static web directory, which will then enable interactive users to find our web forms. The users must then use our forms to get detailed data. Could the searching for dynamic pages be made more automatic?

Well, where does data in dynamic pages come from? Often from large databases built with such database management systems as Oracle, SQL Server, MySQL, PostgreSQL, and DB2. This is why some folks conjecture that the amount of information in the Hidden Web is vastly bigger than the web we see today. Databases can be BIG.

Imagine that all the information on the ancient Pharaohs, genetic diseases, investments, philosophy, and countless other topics is sitting inside databases that right now are only accessible via web forms. Right now, we Google keywords like “pharaoh” and the first things we see are static, highly condensed Wikipedia pages, and perhaps some static pages posted by museums and academics.

What Will the Semantic Web Do?

A primary challenge for the Semantic Web will be to let us ask for information and know that the search space includes information tucked away in databases dotted all around the globe.

This is a very complex problem. Right now, we need a human sitting at the keyboard of the client machine to navigate to the correct URL and then type terms into a web form. In the future, web designers will need ways of capturing information about what is contained in databases, and of specifying that information in a fashion that browsers can access. And this information will have to be very detailed, sometimes very intricate.

The browser will also have to take information specified by the user and match it up with the information that describes databases on the web. This means that we will need some automatic way to search databases without a user interactively and incrementally screening tens or hundreds or thousands of URLs. In an earlier blog posting in this series we described one possible technique called “triples” that might, combined with namespaces, provide a partial solution to this problem.
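To give a feel for how triples and namespaces might fit together, here is a minimal sketch in Python. Each triple is a (subject, predicate, object) tuple. The predicates come from the Dublin Core namespace, while the catalog namespace, the book identifiers, and the sample values are invented for illustration. A real system would use a proper triple store and query language, so treat this as a toy.

```python
DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set
EX = "http://example.org/catalog#"       # hypothetical namespace for one database

# Invented facts that a database owner might publish about its contents.
TRIPLES = [
    (EX + "book1", DC + "title", "A History of the Pharaohs"),
    (EX + "book1", DC + "creator", "A. N. Author"),
    (EX + "book2", DC + "title", "Genetic Diseases Explained"),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every part that is not None."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]

def titles(triples):
    """List every title, found by predicate rather than by keyword."""
    return [t[2] for t in match(triples, predicate=DC + "title")]
```

Because the predicate is a term from a shared namespace, software can ask for “everything that is a title” without a human typing into a form, and without guessing which words on a page happen to be titles.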

We will look at this again, more closely, in a future blog posting.



May 3 2009   3:00AM GMT

Email addresses, the new Web, and NASCAR.



Posted by: Roger “Buzz” King
NASCAR-like web ads, Web 2.0, Web 3.0, the Semantic Web, XML, namespaces, free email accounts, web services, web-based ads

The Semantic Web.

This blog concerns advanced Web technology, in particular, Web 2.0/3.0 and the Semantic Web. Each blog entry should be fully understandable on its own, but the blog as a whole tells a continuing story.

Very roughly, we’ve defined the Web 2.0/3.0 as the class of emerging web applications that are highly responsive, to the point of being competitive with desktop apps. Another characteristic is that they can manage large volumes of very complex media, like images, sound, and animation, as well as interconnected forms of media. We’ve looked at some specific advanced web applications.

Our concern here, in this blog entry, is the Semantic Web, which we have also roughly defined. The Semantic Web is something that does not yet exist, but would meet the very aggressive goal of supporting largely automatic web searches, freeing us from excruciatingly interactive, manual Google and Yahoo sessions. And we’ve seen that we would use such things as shared namespaces, intelligent full text searching, and XML-based markup languages to embed information in websites that could be used by smart browsers to perform far more accurate searches.

Web services would help a lot, too, by taking humans out of the loop when providing powerful web-based capabilities; one website can now provide a vast amount of information, for example, by silently using web services to collect information from many other web-based sources.

(By the way, we have also looked at precisely what we mean by “semantic” in the Semantic Web.)

The way we pay.

This all sounds very good. The Web would be far more useful, with automatically searchable Semantic Web-sites. But there’s a bad side to all of this, and it has to do with how we often pay for Web use.

The problem is that we often do not pay at all. At least not directly, with money. We pay by putting up with ads. Free email services, such as those hustled by Yahoo, Hotmail, AOL, and Mail.com, are generally accessed via web browsers, and we find the main pages of these email accounts stuffed with ads.

Some free email accounts even stick ads in your outgoing mail!

Often, the only way to get the ads stripped from a web mail interface is to pay a fee. We might also get more than just ad-free web mail pages; paying sometimes allows users to access their email with POP or IMAP protocols, via desktop clients (like Outlook and Apple Mail), thus avoiding ads in another way.

(As an aside, there are free email sites that either have no ads in them, or only very subtle ones. Try Gmail.com and Inbox.com. My favorite, with its clean interface and growing set of accompanying capabilities, is GMX.com.)

As it turns out, folks looking to buy ad space online find that they have a vast array of choices, and this drives down the cost of ad space. But these two things, an ever-growing list of free online services and cheap ad space, are related. This is because it is all too easy to build useful web applications. Like browsers, bulletin boards, calendar apps, blogging services, and stickies applications, email servers are cheap to build and maintain. Vendors can use canned, largely free software components.

And, transmission costs on the Internet are effectively free, and the bandwidth is huge. Free email accounts often offer a gigabyte or several gigabytes of storage, because disk space is dirt cheap, too.

There is a lot of rebranding going on, too, where someone seems to be offering free email (or some other service), but it is actually being provided by a large email provider.

So, the way things have shaken out is that free web apps like email servers look like NASCAR racing cars, covered with colorful ads. Many of these ads consist of video, and so we have to battle distracting, flashing colors so we can focus on our mail.

The trick behind online ads.

There is something happening in the online ad world: folks who provide these free, pay-for-it-with-ads services are learning to carefully target ads. There is specialized software available for this, and by plugging in some smarts, folks can make the ads that appear on your screen far more likely to be of interest to you.

How is this done? By watching what you type into search engines, by taking advantage of personal information you supply when you sign up for free email accounts and other services, and by carefully examining the content of the messages you send and receive, that’s how it’s done.

It’s important to point out that this works. The “click through” rate on ads can be radically improved, just by using some simple heuristics in choosing your ads. Folks who pay for ads love this, and it has allowed individuals who don’t even provide free web applications to turn themselves into ad space sellers. Your blog, your specialized website, can now host ads carefully targeted toward the visitors to your blog or your website.
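Just how simple can such a heuristic be? Here is a toy sketch in Python that picks the ad whose keywords overlap most with the words in a message. The ads and keywords are invented, and real ad targeting software is certainly more elaborate than this, so take it only as an illustration of the general idea.

```python
import re

# Invented ads, each with a set of keywords chosen by the advertiser.
ADS = {
    "running shoes": {"marathon", "running", "shoes"},
    "cookbook": {"recipe", "dinner", "cooking"},
}

def pick_ad(text, ads):
    """Return the ad sharing the most words with the text, or None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    best, best_score = None, 0
    for name, keywords in ads.items():
        score = len(words & keywords)
        if score > best_score:
            best, best_score = name, score
    return best
```

A message about training for a marathon gets the running shoes ad, and a message that matches no keywords gets no targeted ad at all.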

But just wait for the Semantic Web.

But it will really kick in when the Semantic Web is here. The same technology that would make browsers far, far smarter about finding good URLs for you will make the targeting of ads at you extremely precise.

This slowly-emerging technology is badly needed by the folks who sell ad space and by the people who buy that ad space. That’s because you and I are starting to get used to this world of NASCAR websites. We are looking through or past or around the ads. They need to be made a lot smarter, in order to get our attention back.

But by using Semantic Web technology to radically increase click-through rates, by getting us interested in ads again, impulse shopping on the Web might skyrocket. It’s very easy to go from seeing an ad for a product you have never heard of before to having bought it.

Like little kids watching commercials for sugar-heavy cereals on Saturday cartoon shows, we will be manipulated like we have never imagined before. That’s the bad side to the Semantic Web.