Inferences archives - Buzz’s Blog: On Web 3.0 and the Semantic Web

Buzz’s Blog: On Web 3.0 and the Semantic Web:

inferences

Sep 25 2009   11:31PM GMT

Semantics and the new Web: Built out of very old ideas.



Posted by: Roger “Buzz” King
automating Web searches, inferences, information, knowledge, Semantic Web, Web development

Describing the real world in computers.

The word “semantic” has been a buzzword in computer science for decades. The youthful Artificial Intelligence world invented these things called Semantic Networks or Semantic Nets a half century ago. The idea was to come up with a crisp, formal language for representing real world things inside a computer. This took the form of a small set of constructs that would be general purpose, in that they could be applied to almost any sort of information. Further, these constructs would somehow be intuitive and natural, in that they would get to the heart of what it means to describe everything from horses to insurance claims to marriages to the contents of the Bill of Rights.

Basic, long-standing, core concepts.

What emerged has certainly stood the test of time. Big time. Opinions differ widely on just what constitutes the core constructs. Different people have used different names for these terms, and, although the idea was to specify something formal, the definitions of these constructs were generally sloppy. But here is a reasonable specification, in its most rudimentary form:

There are objects (which might also be called entities, things, or concepts). Objects have unique names.

Objects are interrelated by attributes (which might also be called relationships or properties). Attributes are directional, and they have names.

In other words, things in the world can be represented as a simple directed graph. We could say that there are objects called Chickens that have an attribute called Are. The value of this might be an object called Birds. Birds might have an attribute called Lives-In, which links Birds to the object Barnyard. There might be an object called Mr. Fried, which has an attribute called IS, which connects Mr. Fried to the object Chickens.

There are many popular various of this basic idea that have emerged, and they tend to be of the following nature:

One idea is to make a sharp distinction between the notion of a subtype (or sub-kind or subset) and other attributes. So, our attribute Are might become a core concept itself, and we might name it Is-A. Chickens IS-A Birds, People IS-A Biped, etc. Other attributes like Lives-In would be considered inherently different from Is-A.

We could introduce another generalization. A general term for attributes Lives-In and other similar attributes might be Has-A. In fact, we could stop using special words for attributes in general, and just use the terms Is-A and Has-A. We would then say that Marriages Has-A Wife, as well as a Husband, as well as a Date.

These general ideas are actually old, and actually significantly predate computing. We have been struggling with the problem of describing real world objects (like Cows), real world concepts (like Marriages and Respects), and their interrelationships and categories since the emergence of the earliest philosophers. Aristotle distinguished between objects and their attributes, and carefully studied and described many animals and plants.

What does it all mean for the new Web?

So, what does all this mean to us, today, and what does it have to do with modern Web technology? Well, first of all, these concepts of objects and attributes have spread throughout all of computer science.

There have been some significant extensions, like distinguishing between an attribute that we might call a relationship, which interconnects complex objects or notions (like a driver owning a car) and attributes that interconnect complex objects and notions with atomic or simple things (like a car having a color or a driver having a name). Generally, these latter, simple kinds of attributes are now what we call attributes, and are considered inherently different from (and simpler than) relationships.

Another extension that has become a core concept in programming languages is something we might call an object identifier, which is a unique number or other identifier for individual objects; this allows us to carefully distinguish between two people who have the same mother, and two people who have mothers who just happen to have the same name.

Programing languages also introduced the concept of methods, or little programs that can give life to objects. You might be able to tell a marriage object to tell us the names of the husband and wife.

But basic concepts have not changed. There seems to be something natural and fundamental about them.

Building a new world out of old concepts.

And the Web? A revolution is happening today. We are developing languages that allow Web designers to embed machine-readable specifications in Web-resident information. This will largely automate the process of searching the Web, as well as the integration of information at multiple sites. This will in turn lead to the discovery of knowledge by putting together diverse information from across the Web. We have discussed these emerging technologies in the previous postings of this blog; they are heavily and deliberately built on top of ideas that date back to the 1950’s, and in fact can trace their roots to ancient Greece.

Sep 18 2009   3:02AM GMT

Dynamic pages, hidden data, and infered information: the danger of scale.



Posted by: Roger “Buzz” King
assertions, databases, dynamic pages, hidden web content, inferences, next generation search engines, Semantic Web, smart search engines, static pages, triples

The good and bad sides of the powerful Semantic Web.

So what happens when the Semantic Web is here? It’s supposed to largely automate the process of searching the Web by allowing us to attach machine-readable assertions (perhaps by using RDF) to information posted on the Web. Then, instead of us poor flailing humans having to painstakingly chase down countless URLs until we get what we want, smart search engines would be able to find precisely what we want in a single shot.

There is an obvious danger to all of this. The new Web will scale, in both good ways and bad. I am certainly not the first person to point out that the smarter the Web, the easier it will be for software to peruse the Web and dig up personal information about us. There will be software that carefully crafts ads in Spam mail that will target our vulnerabilities and our preferences. Websites will dynamically create webpages that target us individually, as well. When we shop online, when we read news, when we make social connections online, the Web will be disarmingly efficient and effective, and this leaves lots of room for fraud and manipulation.

This is already happening to a significant degree, and most of us are aware of it.

The no-longer-hidden database factor.

There is something more subtle about all of this, however. One of the most difficult things to do with traditional Web technology is to expose the content of databases to Web visitors. That’s because the pages that deliver up content pulled from databases are highly dynamic in nature, and so it is very hard for web designers to make search engines (like Google) find and index the content of these databases. There are simple and somewhat effective things web designers can do, like creating static pages that contain terms that are meant to draw web visitors to their sites. These pages are not “destination” pages; rather, they exist only as a way of advertising the information contained in databases.

In the future, RDF assertions (and other machine-readable content) will be added to websites, and they will server as far more effective draws.

But what about privacy? Will web designers inadvertently facilitate fraud and identity theft by enabling the automatic cross-referencing of detailed information existing in databases that have been built and deployed on the Web in isolation? This capability is at the heart of the Semantic Web effort. Information that right now can only be obtained by individual users manipulating individual web interfaces will be discoverable by smart search engines.

The real problem: it will scale.

This is a big deal. It’s not just that previously hidden information will now be discoverable. Because standardized terms and assertions will be used to describe information in databases, smart search engines will be able to automatically interrelate data from otherwise unrelated database systems. When information from multiple places is integrated, new information is effectively created.

For a moment, let’s forget about databases and look at a simple example of information that might be stored statically in two websites. Here is an example adapted from the previous posting of this blog:

Assertion 1: Joe is tall for an athlete.
Assertion 2: Tall athletes should try out for basketball.

A new inference: Joe should try out for basketball.

The point here is that this new inference can be inferred automatically, without the intervention of a human being.

We noted in the previous posting that the information about Joe and the information about basketball might be on different websites. These websites could easily have been built independently. But a key notion - and that is the semantics of the word “tall” in the context of basketball - is what allows this information to be automatically integrated. Another site might point out that Timmy is tall for a kindergarten student, but this would not trigger the suggestion that Timmy try out for the NBA.

Now, let’s get back to database systems, these things that can contain countless terabytes of personal information. Perhaps there is a database at one site containing information about many thousands of athletes. Perhaps there are hundreds or thousands of such sites. The Semantic Web would allow us to find tall athletes without having to know in advance what databases around the world have this sort of data inside them, data that previously could only have been extracted through tedious, time-consume human/computer interaction. Now, a high school counselor or a sports agent looking for new clients can be far more effective at their jobs.

Or, maybe it’s a drug company matching potential customers up with expensive drugs targeted toward specific diseases, or toward people who might have vague symptoms of various diseases, and who might be easily convinced they are sick. Ora con artist looking to scam elderly people who are likely to have dementias.

Or - well, get it? The Semantic Web will scale because it will have access to huge databases, and not just a world wide web of static pages. That’s the danger.


Sep 9 2009   6:02PM GMT

Real-World Look at the Semantic Web, part 2



Posted by: Roger “Buzz” King
assertions, inferences, information, namespaces, RDF, SPARQL, triples, URI's, wikis

This blog is dedicated to the study of emerging Web technology, in particular, ongoing research and development aimed at building software tools that will underlie the emerging Semantic Web. Last time, we looked at DBpedia, something that a former graduate student at my university, Greg Ziebold, pointed me toward.

The Semantic MediaWiki.

In this posting, we look at the Semantic MediaWiki, something else that Greg told me about. It is an extension of MediaWiki, the application that the Wikipedia is built out of. You can learn all about it at the Semantic MediaWiki website. The idea behind Semantic MediaWiki is to provide a more powerful wiki tool, namely one that supports more than just human-readable things like text and images.

RDF and namespaces: creating machine-readable, web-based information.

The idea is to allow entries in wikis that contain machine-readable information, so that searching can be performed in a largely automatic fashion. Specifically, the Semantic MediaWiki allows users to export information from a wiki in RDF format. An RDF specification consists of “triples” that form “assertions”. Consider the following

Assertion 1: Joe is tall.
Assertion 2: Tall People should try out for Basketball.

The idea is for terms in triples (“Joe”, “tall”, “is”, “Tall People”, etc.) to be taken from predefined and globally accessible namespaces. This would ensure that everyone who uses a given term (like “tall” or “Should try out for”) will have the same meaning in mind. In this way, rather than having to painfully search for information that pertains to Tall People, for example, a smart search engine could do the searching for us.

Building locally, growing globally.

There is more to this. These namespaces can be available on the Web, and RDF statements can point to the relevant namespaces. This means that software searching the Web, and processing these triples, can easily find the relevant namespaces.

Also, the things in the right and left side of a triple (like “Joe” and “tall”) can themselves be Web-based resources. This means that information scattered around the Web can be interconnected - but all the work can be done locally. No one has to manually integrate millions of websites. The job can be done little by little, in a quiet way, as people start to store their information in an RDF compatible fashion.

This is how the Semantic Web will scale. Everyone will use shared namespaces and shared protocols like RDF. This will, in essence, turn the Web into one big website that can be searched in a partly automatic fashion.

SPARQL: querying RDF-based information.

How will we interrelate data scattered around the Web?

There is a query language out there, called SPARQL, that can be used to search the Web. SPARQL can follow RDF connections around the globe. How is this done? It has to do with being able to “infer” new things. Consider a fact that can be automatically deduced from the two assertions above:

A new inference: Joe should try out for Basketball.

Assertion 1 could be on a server in Detroit, and assertion 2 could be on a server in Miami, and SPARQL could do the job of making the leap that leads to the new inference.

This means that we could figure out what Joe should be doing right now without having to find the two pieces of information manually (the fact that he is tall, and that tall people should play basketball), and without having to make the inference ourselves.

This is a big deal. This sort of automation is what the Semantic Web is all about.

So what do real people do with the Semantic MediaWiki? We’ll look at this next.


Aug 31 2009   3:40AM GMT

A Real-World Look at the Semantic Web, part 1



Posted by: Roger “Buzz” King
assertions, databases, inferences, information, knowledge, namespaces, RDF, Semantic Web, SPARQL, triples, wikis, ontologies

This blog is dedicated to the study of emerging Web technology, in particular, ongoing research and development aimed at building software tools that will underlie the emerging Semantic Web. In this posting, we look at a little-known website that has the potential of setting the pace for the developers of the Semantic Web.

DBpedia.

It’s called DBpedia. A former graduate student at my university, Greg Ziebold, pointed me toward it. The goal of the DBpedia is to transform data from the Wikipedia into a chunk of the Semantic Web. To do this, DBpedia is using RDF technology, something we have discussed is past postings of this blog. Behind RDF is an extremely simple concept, but one that has proven extremely powerful and versatile.

The general idea is to break knowledge up into “triples” that describe relationships between pieces of information. These triples can be chained together to discover new relationships. And, importantly, triples must make use of widely shared sets of terminology, called namespaces, in order for knowledge from different places on the Web to be properly chained together.

RDF, triples, assertions, and inferences.

A thorough example can be found in a previous posting of this blog.

Here is a very simple example of triples (also known as “assertions”) and how they can be put together into “inferences”.

Assertion 1: Joe is tall.
Assertion 2: Tall People should try out for Basketball.
A new inference: Joe should try out for Basketball.

Keep in mind that we would want to make sure that the words used in these assertions have precise, global meanings. We might take the terms in these two assertions from a basketball namespace, one that would carefully dictate exactly what “tall” means in the basketball world. Certainly, it would be quite different from the meaning of “tall” in a kindergarten namespace.

More on DBpedia.

There’s a fancy word for sets of triples that use namespaces and represent various areas of knowledge. They are called “ontologies”, taken from the term used by philosophers to argue about the existence of various things, like God. The DBpedia is essentially a vast ontology, formed from triples and namespaces. Most of the knowledge defined by this ontology comes from the Wikipedia. The folks behind the DBpedia have been given direct access to the flow of information into the Wikipedia, so that the DBpedia can stay current.

One way to look at the DBpedia is that it takes the Wikipedia and reforms it into something that can be searched far more effectively. Right now, to search the Wikipedia, most of us simply type in terms (either into Google/Yahoo or into the Wikipedia search page). We try various terms and follow links inside the Wikipedia until we find what we think we are looking for. With the DBpedia, users can search with SPARQL, a language based on the structure of SQL and engineered specifically for searching large bases of triples. SPARQL allows us to traverse networks that consists of triples linked by inferences.

That way, if we were a coach looking for promising candidates for our team, we would use SPARQL to make the connection between Joe being tall and the fact that tall people should try out for basketball. This is clearly much faster and more accurate than googling things like “tall”, “basketball”, etc, until we happened to find Joe in one of the web pages that pop up.

The DBpedia website, by the way, claims to have a triple base that consists of 274 million RDF triples.

More on this in the next posting.


Aug 16 2009   4:16AM GMT

Dangers of the Semantic Web: Assertions, Inferences, and Surrogates



Posted by: Roger “Buzz” King
the Semantic Web, assertions, inferences, namespaces, next generation search engines, smart search engines, surrogates

This blog deals with advanced Web technology. Each posting should be quite understandable on its own, but the blog as a whole is a continuing story. We’ve been looking at the Semantic Web, which is a global effort to automate the searching of the Web, so that applications (we might call them smart search engines) can find, interpret, interrelate, and aggregate information stored in multiple, independent websites.

Assertions and Inferences.

A key concept is that of an “inference”, a fact that is created by putting together two or more pieces of information that we might call “assertions”. We used the following example in the example in a previous posting. The two assertions might be posted on the Web somewhere.

Assertion 1: THE BALL is ORANGE.
Assertion 2: ORANGE is an UGLY COLOR.
An inference created by putting the two assertions together: THE BALL is an UGLY COLOR.

We have also discussed the fact that terminology used in inferences must be very carefully defined and widely shared.

What is a Surrogate?

The word surrogate, in the programming world, refers to a measure or model that is being used to approximate the “real” measure or model. If I am trying to estimate the depth of the ocean at some point, but don’t have a direct way of measuring the distance to the ocean floor, I might judge the depth by using a table that associates the distance from the shore to the depth of the ocean. The assumption is that all points that are a particular distance from the shore will have the same depth more or less.

Here’s the important point for us: The Semantic Web will make very heavy use of surrogates. Let’s be precise about this. We’re not talking about approximations. We might search the Web for all banks that provide accounts that earn 5%, and our smart search engine might point us to banks that on the average, over the past two years, have paid at least 5.0% on their accounts. A surrogate is something different. Suppose we wanted to find all banks that never cheated their customers. This might be impossible to answer precisely, so we might look for banks that are in the bottom 10% when it comes to the number of formal complaints filed against them. That would be a surrogate.

Surrogates on the New Web.

Now, let’s consider the Web. It doesn’t matter if we are talking about the Web today or the emerging Semantic Web.

In fact, what we are concerned with here is global to computing in general: when we take a chore normally performed by a human using an interactive interface and turn that chore over to a computer program, we often turn a real world decision into a decision based on very simplified surrogates. A human can look at a bunch of information and, although it may take a very, very long time, make a “perfect” decision based on that data. But computer programs cannot think like a human. We can only crudely simulate with software the process of thinking that goes on in the mind of a real person.

Now, back to the Web, the new Semantic Web. Suppose we build a next generation website and use an official namespace (which is a structured set of terms) to specify assertions using terms from this namespace. What we’re doing is providing a surrogate for the smart search engine to use so that it can do the filtering of URLs and the integrating of information from multiple sites.

Consider our two assertions from above, along with the inference derived from them:

Assertion 1: THE BALL is ORANGE.
Assertion 2: ORANGE is an UGLY COLOR.
An inference created by putting the two assertions together: THE BALL is an UGLY COLOR.

Maybe we are shopping for a ball online. We mght have to follow hundreds of URLs and search hundreds of websites to find just the right ball. But who said the ball is orange? It’s an approximation made by the vendor of the ball in question. It has been labeled orange. But maybe it’s a shade of orange that we would actually have liked if we had looked at the picture of the ball ourselves instead of leaving it to the search engine.

Well, we might argue that the word orange, if it is precisely defined, won’t be confused with some other color. We can be confident that our notion of orange is the same as the vendor’s notion of orange. We do know how to express colors very precisely by using numbers.

So, let’s change the assertions and the inference a bit:

Assertion 1: DOROTHY THE DOLL is PRETTY.
Assertion 2: WE want a PRETTY DOLL.
An inference created by putting the two assertions together: WE might want DOROTHY THE DOLL.

Now, how could the notion of pretty ever be globally and uniformly defined?

It cannot.

Maybe we should shop for our own dolls and not leave it to a next generation search engine.

The Lesson.

The Semantic Web will trade speed for accuracy. No way around it.