Buzz’s Blog: On Web 3.0 and the Semantic Web


October 18, 2009  10:37 PM

Personal Information Management Applications and Web 3.0



Posted by: Roger King
advanced Web apps, databases, information, media applications, Multimedia, note-taking, notebooks, rich internet apps, tagging, Web 2.0, Web 3.0, web applications

This blog is devoted to the discussion of Semantic Web and Web 2.0/3.0 technology.

Managing personal and small group information.

When it comes to so-called Web 2.0 and 3.0 technology, one of the most prolific marketplaces is the explosion of applications for managing information for individuals and small groups. Looking only at applications developed for Macs, we see an array of information management technologies.

Notebooks.

One of the most popular formats for managing information uses the paradigm of a notebook. The user can create a notebook, often selecting from multiple canned formats, such as a diary, class notes, or a novel, complete perhaps with a notebook cover and a spiral wire down the left side. The application creates a table of contents, and users can create sections and pages – and stuff virtually any kind of information on each page. Two very good examples of this approach are NoteShare and Notebook.

Interestingly, and perhaps because many of the applications in this category have been around for a number of years, these tend not to be true web applications. Often you can share notebooks, including full read/write access, via a URL and a simple browser interface, and you can publish a notebook at a URL. But the products are primarily for single-user, desktop use.

A good example of a notebook application that is a true web application is Zoho Notebook. (Zoho actually provides a large set of web-based applications, of which the note program is just one.)

Buckets.

The other very popular note format uses the bucket or folder approach. The application may or may not support the nesting of these buckets and/or the creation of conceptual buckets, so that a given note can exist in more than one bucket. Two very good applications that use this approach are SOHO Notes and Yojimbo. These two applications are desktop-based, although most applications in this category support the synching of notes over multiple machines, using the Apple web-synching technology.

A hybrid desktop/web application is Evernote, which has elegant desktop applications for Windows machines, Macs, and a variety of handhelds and cell phones. It also has a very effective web interface. The user can sync multiple Evernote desktop instances via Evernote’s web server. Users can thus avoid ever using the web interface.

Outlines.

One specialized sort of information management application involves the creation of embedded outlines and bulleted lists. These applications, such as OmniOutliner, actually provide full notebook functionality as well. OmniOutliner notebooks can be published on the web, but it is very definitely a desktop application.

Task lists.

An even more specialized class of information management applications support To-Do lists. Great examples are Zenbe Lists (they also provide integrated email and collaborative software) and rememberthemilk.com. These are web applications.

Photos and video.

There are a rapidly growing number of applications that allow users to collect, sort, tag, edit, and share photographs and video. Apple’s iPhoto is a great example. It is very much a desktop app, although applications in this class typically support the publication of images and video on the web, and sometimes, even read/write access via the web.

Stories, scripts, novels, and storyboards.

There are a number of highly specialized applications that support the development of fiction, including Final Draft and Montage (scripts), Scrivener and StoryMill (fiction prose), and Toon Boom Storyboard (which is actually an impressive drawing program). Again, users can often publish to the web. Interestingly, many of these applications can easily be used as full-blown, generic note applications, and can manage many forms of media.

Diary Applications.

Perhaps the most popular diary application on Macs is MacJournal (by the Montage and StoryMill folks). An interesting twist is that it is also an excellent blogging program; I use it to write this blog. Blogging is, of course, one of the most widely used vehicles for sharing information on the web, and you can expect other sorts of personal information management systems to have blogging capabilities added to them.

Small, forms-based database management systems.

These applications are desktop apps. Apple’s Bento is a very good example. It actually is a sort of hybrid database/spreadsheet application. The most recent release allows multiple instances of Bento to share databases running on computers on a shared network.

Mind-Mapping.

The “circles and lines” applications have become highly specialized. The best-known one is MindManager, and there are versions for Windows machines and Macs. These are desktop apps. The vendor, MindJet, recently introduced both web interfaces for sharing and updating desktop mind maps, as well as a web-based application that has a fresh, smooth interface and provides team collaboration tools. Many forms of media can be placed in MindManager, including data from a wide variety of relational database management systems.

Screen and audio capture.

There are a number of applications that allow users to capture desktop video, along with audio voice-overs. Camtasia (which has Windows and Mac products) and Screenium are popular products.

These applications are, in a way, successors to slide applications like Microsoft PowerPoint and Apple Keynote. More and more presentations are being engineered with screen capture and audio applications, and these applications often support text and image data, as well as the insertion of video capture of the speaker. Sometimes, PowerPoint slides can be imported.

Conferencing apps.

There are several applications that provide hybrid desktop/browser live communication, including video, sound, and collaborative white-boarding. The best known one is probably Cisco WebEx, which comes in varieties for Macs and Windows machines. Skype supports a similar, limited product – which is free. One of the nice things about these products is that they come with their own voice lines. Other products, like Adobe ConnectNow, require the use of a cell phone to carry voice. With most of these products, a conference can be recorded for later use.

Finally…

Importantly, we note that in this rapidly-exploding marketplace, the borders between these various categories are being broken down, and applications often support a number of these capabilities at once. A good example is Curio, a desktop application that supports notes, lists, video, audio, white-boarding, mind-mapping, and limited web publishing.

October 11, 2009  11:07 PM

Making information management scale: leveraging metadata on the new Web



Posted by: Roger King
3D modeling, automating Web searches, databases, DB2, information, Multimedia, MySQL, Oracle, PostgreSQL, RDF, Semantic Web, Video, Web 3.0, Web development frameworks, Web3.0

Previous postings of this blog.

This blog is dedicated to advanced Web development tools and concepts. Previous blog postings have focused on the emerging Semantic Web, which promises to make the Web radically easier to search and to greatly enhance the value of the vast sea of currently-disconnected information spread across the Web. We have also looked at Web 3.0 efforts, which promise to make multimedia websites highly usable and capable of conveying far more information than the current generation of websites. Previous postings describe the breadth and depth of cutting-edge Web technology.

Metadata: making that ratio small.

Here’s something that’s very important: Much of the ongoing research and development that is loosely categorized as Semantic Web and Web 3.0 efforts is focused on a specific technical goal, one that has been at the core of information management technology since the mainframe era that was epitomized by the IBM 360 series. That goal is to leverage metadata as much as possible.

It’s our best weapon against the truly staggering amount of information on the Web. This includes traditional text-based and numeric data, as well as books, medical advice, photographs, entertainment and training videos, music and recorded books, investment information, educational materials, scientific materials, e-government information, etc., etc. How can we possibly organize information and then search it in a way that scales? The Web is far from a closed world. In traditional data processing environments like banking, insurance, and credit card processing, we could get our arms around all of the data, as vast as it may have seemed. But the world of information today is an open world, effectively infinite in size.

Very informally, if you look at the size of the metadata divided by the size of the data itself, the smaller that fraction, the better. In traditional relational databases (built with database management systems such as Oracle, MS SQL Server, MySQL, PostgreSQL, or DB2), the extreme focus on minimizing this ratio has enabled the fast processing of extremely large volumes of data. The tradeoff is that the table definitions (the “schema”), which form the heart of the metadata, are very, very simplistic.

The old days: relational database schemas.

An insurance claim may be defined as a table with such columns as Subscriber_Name, Medical_Provider, etc., and thus may consist of little more than a series of simple character and numeric fields. But if we need to process fifty thousand of them tonight, we must be able to bring many such table rows into memory at once and move through them quickly. The database world was an extension of the paper world: a row in an insurance claim table was effectively an electronic successor to the traditional claim form.

Today: a far more challenging problem.

But on the new Web, information can be far more complex in nature, making the metadata-to-data ratio far larger. We’ve looked at some of the emerging technology and technical trends for embedding metadata in advanced forms of data (and for processing that metadata); this data includes books, images, video, modeling and animation, and sound. This new generation of information formats makes up our personal health records and medical record images, industrial training materials, university “distance” courses, and the like. Each instance of these tends to be far more individual than an insurance claim form. And it takes a lot of metadata to properly convey its “meaning”.
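To make this concrete, here is a rough sketch, written as RDF-style triples in the Turtle syntax, of the kind of descriptive metadata a single industrial training video might carry. Every namespace and property name below is invented for illustration; the point is simply that conveying even part of the asset’s meaning takes a pile of hand-crafted assertions, where a relational row needed only a handful of column names.

@prefix ex: <http://www.example.org/training/> .

ex:video-4417
    ex:title       "Forklift safety, module 3" ;
    ex:medium      "video" ;
    ex:duration    "00:14:32" ;
    ex:audience    "new warehouse employees" ;
    ex:covers      ex:RampSafety, ex:LoadBalancing ;
    ex:supersedes  ex:video-2280 ;
    ex:reviewed-by "Safety office, spring 2009" .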

The challenge.

What we’re struggling with right now is to succinctly specify the meaning of modern media assets and to automate searching based on this metadata. This is our only hope for keeping that ratio of metadata size to data size manageable.


October 3, 2009  9:12 PM

Multimedia: The Problem of Subtle Semantics



Posted by: Roger King
3D animation, 3D modeling, advanced Web apps, automating Web searches, blob data, continuous data, databases, information, Multimedia, rich internet apps, Semantic Web, smart search engines, tagging, Text, Web 2.0, Web 3.0, web applications, Web development, Web development frameworks, XML

The challenge of the Semantic Web.

We’ve looked at the emerging Semantic Web technology in the previous postings of this blog. The idea is to have a far, far smarter Web, one where the process of finding and interpreting and making use of far flung information can be largely automated. This is in sharp contrast with today’s Web, where these things have to be done in a painful, extremely time-consuming fashion.

So that is the key challenge. It has to do with searching the kinds of information that are important to us in our daily lives. This information, as it turns out, is very difficult to process automatically. Why is this?

The complexity of modern multimedia.

I teach a very basic 3D animation class to mostly computer science students. We use Maya, arguably the most popular 3D animation application, one that is used in the making of many animated features. The interesting thing about animation is that it is truly multimedia. It can give us a lot of insight into what we need the new Web to do for us.

That’s because the number and diversity of applications used for drawing, documenting, modeling, animating, motion capture, texturing, video rendering, video editing, video conversion and compression, and sound editing, even in small projects, can be very impressive. Correspondingly, the wide variety and complexity of media formats involved in an animation project can be overwhelming.

What happens in an animation project? The workflow might begin with vector storyboard drawings to break the story down into scenes. In a typical animation project, 3D models in a variety of proprietary formats are used. Models must be transformed as they are exported from one application and imported into the next. Multiple video renders of animated models are made, and they must be edited together, along with multiple sound files. Multiple video and audio formats might be used. 2D images are used for textures; photographs of butterfly wings can be used to make an animated butterfly very realistic, and a checkerboard image made with Photoshop can be used to make a linoleum floor. And along the way, a variety of note taking, screen capture, and conferencing software might be used to facilitate group communication.

There is also a heavy focus on reuse in an animation project. Building every model, editing every texture, creating every environment and background, recording every sound from scratch is frequently intractable. If existing assets cannot be tailored and reused, the project would be far too expensive and time consuming, and would demand too wide a variety of professionals to always be available. This raises the multimedia stakes, as assets of widely differing forms must be constantly reconfigured and used in concert in new ways.

But what’s the real problem? We aren’t all trying to produce complex animated videos. But very interestingly, in our everyday lives we essentially face the animator’s challenge when we try to find and use information on the Web. That’s because we’re often looking for things whose meaning, whose interpretation, demands focused human thought. We are looking not for business data, but for pieces of media, and the problem is that today, most of our searching has to be based on tags or brief textual descriptions that are associated with pieces of media, and not on the true meaning of the media itself.

The needs of the business world are not our needs.

It’s the subjective nature of media assets – this is what is at the heart of the problem facing us. Existing technology for searching the web is based on keywords and very short pieces of text.

There is other technology, though, under active development, stuff that serves as the information storage backbone of most commercial websites. It’s the technology that has for decades been used in-house (not on the Web) by businesses when they process large databases. But this stuff was designed to handle traditional business data forms, like integers, character strings, real numbers, dates, timestamps, and full text.

There is more, though. All of the major database management systems, along with tools for building and searching advanced websites are being retrofitted (or in some cases, built from the ground up) to manage more than keywords and text, more than standard business data.

But up to now, the focus has not been on supporting the kinds of information you and I are most interested in. The focus has been on extending database and Web technology to support XML documents, as well as more complex data objects, like those inside a Java program, and other forms of data found inside programs. This includes arrays and lists and short pieces of textual data, like the names of diseases.

In other words, we’ve been busy extending our support of the business world, so they can store complex business data in databases and make that information processable over the Web. You and I have largely been left out.

Finally, we are attacking our needs.

But there are now many ongoing efforts to extend database and Web technology to make it useful to us. The new focus is on supporting blob and continuous media like images, video, and audio. This is extremely hard to do.

Why? Because the strongest means by which we deduce the meaning of business data is by looking at its internal structure and the terms that are used to describe that structure. A relational table named Prescriptions, with character attributes Patient Name, Doctor’s Name, and Medication, and with a numeric attribute Dosage, is pretty easy to interpret.

But what do we do with a photograph, which is just a grid of pixels with no internal structure? Or a long series of images, along with a sound track, put together to form a piece of video?

The U.S. military has been pumping money into image processing for several decades, and so all is not lost. There is a vast body of mathematical research and software development that allows us to write programs that can find a particular face in a crowd and search satellite photos for airplane runways. But in general, we cannot at this time write a program that can process an arbitrary photo or video clip and tell us what it means. That means we can’t quickly search vast media databases for useful pieces of information.

The goal behind the Semantic Web effort is to build a new generation of websites whose information can be searched automatically, and where information from multiple sites can be automatically integrated. To do this with numeric and character-based data is quite doable. But when it comes to multimedia, like images and sound and video and 3D models and engineering designs, well, we have a long way to go. The meaning – in other words, the semantics – of these forms of data is complex and subtle, and highly dependent upon an individual’s interpretation of that media.

So, we see that we have only just begun our journey to create the new Web.


September 25, 2009  11:31 PM

Semantics and the new Web: Built out of very old ideas.



Posted by: Roger King
automating Web searches, inferences, information, knowledge, Semantic Web, Web development

Describing the real world in computers.

The word “semantic” has been a buzzword in computer science for decades. The youthful Artificial Intelligence world invented these things called Semantic Networks or Semantic Nets a half century ago. The idea was to come up with a crisp, formal language for representing real world things inside a computer. This took the form of a small set of constructs that would be general purpose, in that they could be applied to almost any sort of information. Further, these constructs would somehow be intuitive and natural, in that they would get to the heart of what it means to describe everything from horses to insurance claims to marriages to the contents of the Bill of Rights.

Basic, long-standing, core concepts.

What emerged has certainly stood the test of time. Big time. Opinions differ widely on just what constitutes the core constructs. Different people have used different names for these terms, and, although the idea was to specify something formal, the definitions of these constructs were generally sloppy. But here is a reasonable specification, in its most rudimentary form:

There are objects (which might also be called entities, things, or concepts). Objects have unique names.

Objects are interrelated by attributes (which might also be called relationships or properties). Attributes are directional, and they have names.

In other words, things in the world can be represented as a simple directed graph. We could say that there are objects called Chickens that have an attribute called Are. The value of this might be an object called Birds. Birds might have an attribute called Lives-In, which links Birds to the object Barnyard. There might be an object called Mr. Fried, which has an attribute called IS, which connects Mr. Fried to the object Chickens.
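As a sketch, using the Turtle syntax for RDF triples with an invented ex: namespace (and with the names adapted slightly so they are legal identifiers), that little directed graph might be written like this:

@prefix ex: <http://www.example.org/farm/> .

ex:Chickens  ex:Are      ex:Birds .
ex:Birds     ex:Lives-In ex:Barnyard .
ex:Mr-Fried  ex:Is       ex:Chickens .

Each line is one object, one attribute, and one value; taken together, the lines form the directed graph.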

Many popular variants of this basic idea have emerged, and they tend to be of the following nature:

One idea is to make a sharp distinction between the notion of a subtype (or sub-kind or subset) and other attributes. So, our attribute Are might become a core concept itself, and we might name it Is-A. Chickens IS-A Birds, People IS-A Biped, etc. Other attributes like Lives-In would be considered inherently different from Is-A.

We could introduce another generalization. A general term for attributes Lives-In and other similar attributes might be Has-A. In fact, we could stop using special words for attributes in general, and just use the terms Is-A and Has-A. We would then say that Marriages Has-A Wife, as well as a Husband, as well as a Date.

These general ideas are actually old, and actually significantly predate computing. We have been struggling with the problem of describing real world objects (like Cows), real world concepts (like Marriages and Respects), and their interrelationships and categories since the emergence of the earliest philosophers. Aristotle distinguished between objects and their attributes, and carefully studied and described many animals and plants.

What does it all mean for the new Web?

So, what does all this mean to us, today, and what does it have to do with modern Web technology? Well, first of all, these concepts of objects and attributes have spread throughout all of computer science.

There have been some significant extensions, like distinguishing between an attribute that we might call a relationship, which interconnects complex objects or notions (like a driver owning a car) and attributes that interconnect complex objects and notions with atomic or simple things (like a car having a color or a driver having a name). Generally, these latter, simple kinds of attributes are now what we call attributes, and are considered inherently different from (and simpler than) relationships.

Another extension that has become a core concept in programming languages is something we might call an object identifier, which is a unique number or other identifier for individual objects; this allows us to carefully distinguish between two people who have the same mother, and two people who have mothers who just happen to have the same name.
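A hedged sketch of the same idea in RDF-style triples (all identifiers invented): the object identifiers ex:person-300 and ex:person-301 keep the two mothers distinct even though their name attributes happen to match, while two people who truly share a mother would simply point at the same identifier.

@prefix ex: <http://www.example.org/people/> .

ex:person-101 ex:name   "Ann Smith" ;
              ex:mother ex:person-300 .
ex:person-102 ex:name   "Bob Jones" ;
              ex:mother ex:person-301 .
ex:person-300 ex:name   "Mary Smith" .
ex:person-301 ex:name   "Mary Smith" .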

Programming languages also introduced the concept of methods, or little programs that can give life to objects. You might be able to ask a marriage object to tell you the names of the husband and wife.

But basic concepts have not changed. There seems to be something natural and fundamental about them.

Building a new world out of old concepts.

And the Web? A revolution is happening today. We are developing languages that allow Web designers to embed machine-readable specifications in Web-resident information. This will largely automate the process of searching the Web, as well as the integration of information at multiple sites. This will in turn lead to the discovery of knowledge by putting together diverse information from across the Web. We have discussed these emerging technologies in the previous postings of this blog; they are heavily and deliberately built on top of ideas that date back to the 1950’s, and in fact can trace their roots to ancient Greece.


September 18, 2009  3:02 AM

Dynamic pages, hidden data, and inferred information: the danger of scale.



Posted by: Roger King
assertions, databases, dynamic pages, hidden web content, inferences, next generation search engines, Semantic Web, smart search engines, static pages, triples

The good and bad sides of the powerful Semantic Web.

So what happens when the Semantic Web is here? It’s supposed to largely automate the process of searching the Web by allowing us to attach machine-readable assertions (perhaps by using RDF) to information posted on the Web. Then, instead of us poor flailing humans having to painstakingly chase down countless URLs until we get what we want, smart search engines would be able to find precisely what we want in a single shot.

There is an obvious danger to all of this. The new Web will scale, in both good ways and bad. I am certainly not the first person to point out that the smarter the Web, the easier it will be for software to peruse the Web and dig up personal information about us. There will be software that carefully crafts ads in Spam mail that will target our vulnerabilities and our preferences. Websites will dynamically create webpages that target us individually, as well. When we shop online, when we read news, when we make social connections online, the Web will be disarmingly efficient and effective, and this leaves lots of room for fraud and manipulation.

This is already happening to a significant degree, and most of us are aware of it.

The no-longer-hidden database factor.

There is something more subtle about all of this, however. One of the most difficult things to do with traditional Web technology is to expose the content of databases to Web visitors. That’s because the pages that deliver up content pulled from databases are highly dynamic in nature, and so it is very hard for web designers to make search engines (like Google) find and index the content of these databases. There are simple and somewhat effective things web designers can do, like creating static pages that contain terms that are meant to draw web visitors to their sites. These pages are not “destination” pages; rather, they exist only as a way of advertising the information contained in databases.

In the future, RDF assertions (and other machine-readable content) will be added to websites, and they will serve as far more effective draws.

But what about privacy? Will web designers inadvertently facilitate fraud and identity theft by enabling the automatic cross-referencing of detailed information existing in databases that have been built and deployed on the Web in isolation? This capability is at the heart of the Semantic Web effort. Information that right now can only be obtained by individual users manipulating individual web interfaces will be discoverable by smart search engines.

The real problem: it will scale.

This is a big deal. It’s not just that previously hidden information will now be discoverable. Because standardized terms and assertions will be used to describe information in databases, smart search engines will be able to automatically interrelate data from otherwise unrelated database systems. When information from multiple places is integrated, new information is effectively created.

For a moment, let’s forget about databases and look at a simple example of information that might be stored statically in two websites. Here is an example adapted from the previous posting of this blog:

Assertion 1: Joe is tall for an athlete.
Assertion 2: Tall athletes should try out for basketball.

A new inference: Joe should try out for basketball.

The point here is that this new fact can be inferred automatically, without the intervention of a human being.

We noted in the previous posting that the information about Joe and the information about basketball might be on different websites. These websites could easily have been built independently. But a key notion – and that is the semantics of the word “tall” in the context of basketball – is what allows this information to be automatically integrated. Another site might point out that Timmy is tall for a kindergarten student, but this would not trigger the suggestion that Timmy try out for the NBA.
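Here is a rough sketch of how the two contexts stay separate, written as RDF-style triples in Turtle syntax with two invented namespaces. Because bb:Tall and kg:Tall are different resources living in different namespaces, the Timmy triple can never be chained with the basketball suggestion.

@prefix ex: <http://www.example.org/people/> .
@prefix bb: <http://www.example.org/basketball-terms/> .
@prefix kg: <http://www.example.org/kindergarten-terms/> .

# On a sports site:
ex:Joe   bb:is       bb:Tall .
bb:Tall  bb:suggests bb:BasketballTryout .

# On a school site:
ex:Timmy kg:is       kg:Tall .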

Now, let’s get back to database systems, these things that can contain countless terabytes of personal information. Perhaps there is a database at one site containing information about many thousands of athletes. Perhaps there are hundreds or thousands of such sites. The Semantic Web would allow us to find tall athletes without having to know in advance what databases around the world have this sort of data inside them, data that previously could only have been extracted through tedious, time-consuming human/computer interaction. Now, a high school counselor or a sports agent looking for new clients can be far more effective at their jobs.

Or, maybe it’s a drug company matching potential customers up with expensive drugs targeted toward specific diseases, or toward people who might have vague symptoms of various diseases, and who might be easily convinced they are sick. Or a con artist looking to scam elderly people who are likely to have dementia.

Or – well, get it? The Semantic Web will scale because it will have access to huge databases, and not just a world wide web of static pages. That’s the danger.


September 9, 2009  6:02 PM

Real-World Look at the Semantic Web, part 2



Posted by: Roger King
assertions, inferences, information, namespaces, RDF, SPARQL, triples, URI's, wikis

This blog is dedicated to the study of emerging Web technology, in particular, ongoing research and development aimed at building software tools that will underlie the emerging Semantic Web. Last time, we looked at DBpedia, something that a former graduate student at my university, Greg Ziebold, pointed me toward.

The Semantic MediaWiki.

In this posting, we look at the Semantic MediaWiki, something else that Greg told me about. It is an extension of MediaWiki, the application that the Wikipedia is built out of. You can learn all about it at the Semantic MediaWiki website. The idea behind Semantic MediaWiki is to provide a more powerful wiki tool, namely one that supports more than just human-readable things like text and images.

RDF and namespaces: creating machine-readable, web-based information.

The idea is to allow entries in wikis that contain machine-readable information, so that searching can be performed in a largely automatic fashion. Specifically, the Semantic MediaWiki allows users to export information from a wiki in RDF format. An RDF specification consists of “triples” that form “assertions”. Consider the following:

Assertion 1: Joe is tall.
Assertion 2: Tall People should try out for Basketball.

The idea is for terms in triples (“Joe”, “tall”, “is”, “Tall People”, etc.) to be taken from predefined and globally accessible namespaces. This would ensure that everyone who uses a given term (like “tall” or “Should try out for”) will have the same meaning in mind. In this way, rather than having to painfully search for information that pertains to Tall People, for example, a smart search engine could do the searching for us.
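As a hedged sketch of what such machine-readable content might boil down to once it is exported as RDF (Turtle syntax here, with invented URLs standing in for the predefined, globally shared namespaces):

@prefix people: <http://www.example.org/shared/people/> .
@prefix sports: <http://www.example.org/shared/sports-terms/> .

people:Joe  sports:is                 sports:Tall .
sports:Tall sports:should-try-out-for sports:Basketball .

Both triples draw their terms from the shared sports namespace, so any search engine that understands that namespace understands the assertions.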

Building locally, growing globally.

There is more to this. These namespaces can be available on the Web, and RDF statements can point to the relevant namespaces. This means that software searching the Web, and processing these triples, can easily find the relevant namespaces.

Also, the things in the right and left side of a triple (like “Joe” and “tall”) can themselves be Web-based resources. This means that information scattered around the Web can be interconnected – but all the work can be done locally. No one has to manually integrate millions of websites. The job can be done little by little, in a quiet way, as people start to store their information in an RDF compatible fashion.

This is how the Semantic Web will scale. Everyone will use shared namespaces and shared protocols like RDF. This will, in essence, turn the Web into one big website that can be searched in a partly automatic fashion.

SPARQL: querying RDF-based information.

How will we interrelate data scattered around the Web?

There is a query language out there, called SPARQL, that can be used to search the Web. SPARQL can follow RDF connections around the globe. How is this done? It has to do with being able to “infer” new things. Consider a fact that can be automatically deduced from the two assertions above:

A new inference: Joe should try out for Basketball.

Assertion 1 could be on a server in Detroit, and assertion 2 could be on a server in Miami, and SPARQL could do the job of making the leap that leads to the new inference.

This means that we could figure out what Joe should be doing right now without having to find the two pieces of information manually (the fact that he is tall, and that tall people should play basketball), and without having to make the inference ourselves.
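Here is a rough sketch of what such a query might look like in SPARQL. The prefix and property names are invented, and in practice the query engine would need to see the triples from both servers (or federate across their endpoints) for the match to happen.

PREFIX sports: <http://www.example.org/shared/sports-terms/>

SELECT ?person
WHERE {
  ?person     sports:is                 sports:Tall .       # the Detroit assertion
  sports:Tall sports:should-try-out-for sports:Basketball . # the Miami assertion
}

The answer, Joe, is exactly the new inference, and no human had to read either server’s pages to get it.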

This is a big deal. This sort of automation is what the Semantic Web is all about.

So what do real people do with the Semantic MediaWiki? We’ll look at this next.


August 31, 2009  3:40 AM

A Real-World Look at the Semantic Web, part 1



Posted by: Roger King
assertions, databases, inferences, information, knowledge, namespaces, ontologies, RDF, Semantic Web, SPARQL, triples, wikis

This blog is dedicated to the study of emerging Web technology, in particular, ongoing research and development aimed at building software tools that will underlie the emerging Semantic Web. In this posting, we look at a little-known website that has the potential of setting the pace for the developers of the Semantic Web.

DBpedia.

It’s called DBpedia. A former graduate student at my university, Greg Ziebold, pointed me toward it. The goal of the DBpedia is to transform data from the Wikipedia into a chunk of the Semantic Web. To do this, DBpedia is using RDF technology, something we have discussed in past postings of this blog. Behind RDF is an extremely simple concept, but one that has proven extremely powerful and versatile.

The general idea is to break knowledge up into “triples” that describe relationships between pieces of information. These triples can be chained together to discover new relationships. And, importantly, triples must make use of widely shared sets of terminology, called namespaces, in order for knowledge from different places on the Web to be properly chained together.

RDF, triples, assertions, and inferences.

A thorough example can be found in a previous posting of this blog.

Here is a very simple example of triples (also known as “assertions”) and how they can be put together into “inferences”.

Assertion 1: Joe is tall.
Assertion 2: Tall People should try out for Basketball.
A new inference: Joe should try out for Basketball.

Keep in mind that we would want to make sure that the words used in these assertions have precise, global meanings. We might take the terms in these two assertions from a basketball namespace, one that would carefully dictate exactly what “tall” means in the basketball world. Certainly, it would be quite different from the meaning of “tall” in a kindergarten namespace.

More on DBpedia.

There’s a fancy word for sets of triples that use namespaces and represent various areas of knowledge. They are called “ontologies”, taken from the term used by philosophers to argue about the existence of various things, like God. The DBpedia is essentially a vast ontology, formed from triples and namespaces. Most of the knowledge defined by this ontology comes from the Wikipedia. The folks behind the DBpedia have been given direct access to the flow of information into the Wikipedia, so that the DBpedia can stay current.

One way to look at the DBpedia is that it takes the Wikipedia and reforms it into something that can be searched far more effectively. Right now, to search the Wikipedia, most of us simply type in terms (either into Google/Yahoo or into the Wikipedia search page). We try various terms and follow links inside the Wikipedia until we find what we think we are looking for. With the DBpedia, users can search with SPARQL, a language based on the structure of SQL and engineered specifically for searching large bases of triples. SPARQL allows us to traverse networks that consist of triples linked by inferences.

That way, if we were a coach looking for promising candidates for our team, we would use SPARQL to make the connection between Joe being tall and the fact that tall people should try out for basketball. This is clearly much faster and more accurate than googling things like “tall”, “basketball”, etc., until we happened to find Joe in one of the web pages that pop up.
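For flavor, here is roughly what such a search might look like against DBpedia’s public SPARQL endpoint. I am hedging here: the class and property names below (dbo:BasketballPlayer, dbo:height) reflect my understanding of the DBpedia ontology, and that vocabulary evolves, so treat this as a sketch rather than a recipe.

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?player ?height
WHERE {
  ?player rdf:type   dbo:BasketballPlayer .
  ?player dbo:height ?height .
  FILTER (?height > 2.05)
}
ORDER BY DESC(?height)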

The DBpedia website, by the way, claims to have a triple base that consists of 274 million RDF triples.

More on this in the next posting.


August 16, 2009  4:16 AM

Dangers of the Semantic Web: Assertions, Inferences, and Surrogates



Posted by: Roger King
assertions, inferences, namespaces, next generation search engines, smart search engines, surrogates, the Semantic Web

This blog deals with advanced Web technology. Each posting should be quite understandable on its own, but the blog as a whole is a continuing story. We’ve been looking at the Semantic Web, which is a global effort to automate the searching of the Web, so that applications (we might call them smart search engines) can find, interpret, interrelate, and aggregate information stored in multiple, independent websites.

Assertions and Inferences.

A key concept is that of an “inference”, a fact that is created by putting together two or more pieces of information that we might call “assertions”. We used the following example in a previous posting. The two assertions might be posted on the Web somewhere.

Assertion 1: THE BALL is ORANGE.
Assertion 2: ORANGE is an UGLY COLOR.
An inference created by putting the two assertions together: THE BALL is an UGLY COLOR.

We have also discussed the fact that terminology used in inferences must be very carefully defined and widely shared.

What is a Surrogate?

The word surrogate, in the programming world, refers to a measure or model that is being used to approximate the “real” measure or model. If I am trying to estimate the depth of the ocean at some point, but don’t have a direct way of measuring the distance to the ocean floor, I might judge the depth by using a table that associates the distance from the shore with the depth of the ocean. The assumption is that all points that are a particular distance from the shore will have more or less the same depth.

Here’s the important point for us: The Semantic Web will make very heavy use of surrogates. Let’s be precise about this. We’re not talking about approximations. We might search the Web for all banks that provide accounts that earn 5%, and our smart search engine might point us to banks that on the average, over the past two years, have paid at least 5.0% on their accounts. A surrogate is something different. Suppose we wanted to find all banks that never cheated their customers. This might be impossible to answer precisely, so we might look for banks that are in the bottom 10% when it comes to the number of formal complaints filed against them. That would be a surrogate.
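In query terms, the surrogate is what actually ends up in the WHERE clause. A hedged sketch in SPARQL (all terms invented, and with “bottom 10%” simplified to a fixed complaint-count threshold to keep the query short):

PREFIX fin: <http://www.example.org/banking-terms/>

SELECT ?bank
WHERE {
  ?bank fin:complaints-filed ?n .
  FILTER (?n < 25)
}

Nothing in this query says “never cheated a customer”; the complaint count is standing in for that, and the quality of the answer rests entirely on how good a stand-in it is.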

Surrogates on the New Web.

Now, let’s consider the Web. It doesn’t matter if we are talking about the Web today or the emerging Semantic Web.

In fact, what we are concerned with here is global to computing in general: when we take a chore normally performed by a human using an interactive interface and turn that chore over to a computer program, we often turn a real world decision into a decision based on very simplified surrogates. A human can look at a bunch of information and, although it may take a very, very long time, make a “perfect” decision based on that data. But computer programs cannot think like a human. We can only crudely simulate with software the process of thinking that goes on in the mind of a real person.

Now, back to the Web, the new Semantic Web. Suppose we build a next generation website and use an official namespace (which is a structured set of terms) to specify assertions using terms from this namespace. What we’re doing is providing a surrogate for the smart search engine to use so that it can do the filtering of URLs and the integrating of information from multiple sites.

Consider our two assertions from above, along with the inference derived from them:

Assertion 1: THE BALL is ORANGE.
Assertion 2: ORANGE is an UGLY COLOR.
An inference created by putting the two assertions together: THE BALL is an UGLY COLOR.

Maybe we are shopping for a ball online. We might have to follow hundreds of URLs and search hundreds of websites to find just the right ball. But who said the ball is orange? It’s an approximation made by the vendor of the ball in question. It has been labeled orange. But maybe it’s a shade of orange that we would actually have liked if we had looked at the picture of the ball ourselves instead of leaving it to the search engine.

Well, we might argue that the word orange, if it is precisely defined, won’t be confused with some other color. We can be confident that our notion of orange is the same as the vendor’s notion of orange. We do know how to express colors very precisely by using numbers.

So, let’s change the assertions and the inference a bit:

Assertion 1: DOROTHY THE DOLL is PRETTY.
Assertion 2: WE want a PRETTY DOLL.
An inference created by putting the two assertions together: WE might want DOROTHY THE DOLL.

Now, how could the notion of pretty ever be globally and uniformly defined?

It cannot.

Maybe we should shop for our own dolls and not leave it to a next generation search engine.

The Lesson.

The Semantic Web will trade accuracy for speed. No way around it.



August 5, 2009  9:39 PM

The Semantic Web: RDF and SPARQL, part 5



Posted by: Roger King
data, information, knowledge, ontologies, RDF, the Semantic Web, triples

This posting is a continuation of the previous posting. We are discussing RDF, the “triples” language that is serving as a cornerstone of the Semantic Web effort. The goal of the Semantic Web is to partly automate the searching of the Web, by using RDF to capture deeper semantics of information and SPARQL to query that information. This is in comparison to today’s search engine technology, which does not allow us to do much more than search for individual words in the text of webpages.

Let’s step back for a moment.

Just how universal is this notion of RDF-style triples? Will we ever have something substantially more useful, more powerful in the semantics it can express?

Data, Information, Knowledge, and Ontologies.

Academic and industrial researchers in computing like to trivialize big words. Let’s briefly look at the problem. “Data” is an old word, and most of us have a sense that virtually anything stored digitally can be considered data. This includes applications and other pieces of software, too. If you back up some applications to free up space on your hard drive, you’ve just turned applications into data, right?

“Information” is a word that came into play when researchers wanted something that was smarter than data. The word was broader, and vaguer, but information was essentially data that was ready to be used by interactive users. If I pull down a page from the Encyclopedia Britannica site, it’s filled with information.

Then, there were demands for an even richer word, one that suggests data that is beyond information, stuff that is rich in semantics that can be easily extracted. Often, knowledge was data or information that had been interconnected, turned into trees or graphs. Traversing the links in the structure told us how various things were interrelated, thereby exposing powerful semantics. The Web in a sense is knowledge. I can follow links between pages to discover how various pages on the Web are interrelated. I can follow connections on the Britannica site to connect a scientific discovery to the story of the discoverer’s life.

Here’s something significant. This blog and all its postings are related to new web technology, such as the Semantic Web. Our central concern has been the partial automation of the searching of the Web, so that users aren’t limited to typing words into Google and getting back stuff no richer than pages that happen to have these words in them. As it turns out, the term “knowledge” dates way back before the days of the Web, but back then, our notion of what it meant to be knowledge and not just data or information was pretty much the same as it is now. Knowledge can be processed by programs, thereby automating the task of finding the right knowledge and applying it to our problem domain.

Then came “ontology”. This is a relatively new word, but it’s perhaps the most embarrassing. The word, until recently, was reserved for philosophers to use. An ontological argument is an argument about the existence of something. Over the centuries, one common subject of ontological discussions has been the existence of God.

Hmm.

The same old, same old.

Flash forward to the Internet age: Computer researchers use the term to refer to a precise specification of the objects and properties (of these objects) in some well studied domain. I guess the idea is to suggest that we can capture the true nature of the existence of some domain.

These domains could be large, like banking, health insurance, or the stock market. Laying out all of the objects involved in one of these is a daunting task. Consider an insurance claim and all of its properties: type of claim, provider of medical service, patient name, etc., and then imagine laying this all out for insurance policies, underwriting tables, actuarial data, etc. To include all of the objects and properties involved in building software for an insurance company would lead us to thousands of interconnected terms. Triples, in other words.

Or our ontology could be the specification of a pencil object, which has properties like being made of wood and graphite and metal, of having yellow paint and a little pink eraser. Triples like this:

The pencil has a pink eraser.
The pencil is painted yellow.
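Written as RDF-style triples in the Turtle syntax, with an invented namespace, that entire pencil “ontology” is nothing more exotic than this:

@prefix ex: <http://www.example.org/stationery/> .

ex:pencil-1 ex:eraser-color "pink" ;
            ex:painted      "yellow" ;
            ex:made-of      ex:Wood, ex:Graphite, ex:Metal .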

This characterizes the nature of the challenge we have taken on in our efforts to build ontologies. We take on the problems of scale, not the problems involved in really capturing, in some formal fashion, the nature of the world around us. We build gigantic, but very simple, models of the things that concern us in the software world.

We have trivialized this term, ontology. In fact, for the most part, we’re simply referring to the same old, same old modeling construct: triples. Yes, that simple tool called RDF can be used to build a vast “ontology”.

There is something about the nature of triples that has conquered computing. It is a concept that, as we have seen in previous postings of this blog, underlies object-oriented data structures. It predates object-oriented languages, going back to the early days of AI and the attempts to model the real world.

So, what is an ontology?

An ontology is supposed to be the end of the Semantic Web rainbow: our ability to fully automate the specification and searching of the real world. But the next time some computer person tries to impress you by tossing this term at you, remember to just shake your head and say “Quit being a puff toad. You’re just talking about triples.”



July 29, 2009  2:17 AM

The Semantic Web: RDF and SPARQL, part 4



Posted by: Roger King
RDF, SPARQL, SQL, the Semantic Web, triples, XML

This posting is a continuation of the previous posting. We are discussing RDF, the “triples” language that is serving as a cornerstone of the Semantic Web effort. In this posting, we will look at SPARQL, the web language designed to search data that has been specified as RDF triples. The goal of the Semantic Web is to partly automate the searching of the Web, by using RDF to capture deeper semantics of information and SPARQL to query that information. This is in comparison to today’s technology, which does not allow us to do much more than search for individual words in the text of webpages.

From the last posting.

Here is a piece of the RDF code from the previous posting:

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:zx="http://www.someurl.org/zx/">

  <rdf:Description rdf:about="http://www.awebsite.org/index.html">

    <zx:created-by>http://www.anotherurl.org/buzz</zx:created-by>

  </rdf:Description>

</rdf:RDF>

This can be interpreted as: the webpage at www.awebsite.org/index.html was created by Buzz.

Another representation of RDF-based information: 3 triples.

We see from the above that RDF simply represents triples. We could simplify it even more as:

http://www.awebsite.org/index.html was created by Buzz

Part of the reason that the original RDF code above is so much more complex is that the full syntax lets us specify that we are using terms that are defined at specific web addresses. This allows people to use standardized terms and greatly enhances the specificity of an RDF specification. The full syntax also allows us to reference pieces of information that reside on the Web. (See the previous three postings, 1, 2, 3.)

Before we launch into a SPARQL example, we need to make an important distinction between syntax and semantics. The code above is written in a particular syntax for RDF, one that uses XML. We note that because syntax needs to be very precise, it tends to be verbose. This can cause syntax to obscure the conceptual simplicity of the underlying semantics, or meaning.

But this isn’t the only way to specify RDF triples. Let’s look at some information that is much simpler, and at the same time, let’s look at using a different syntax for specifying RDF-like triples. Here are three triples:

<http://awebsite.org/> was-created-by "Buzz"

<http://awebsite.org/> was-created-by "Suzy"

<http://anotherwebsite.org/> was-created-by "Alice"

This is a very simple program. It consists of two triples that say that a website named awebsite was created by Buzz and Suzy, and another triple that says that Alice created a website called anotherwebsite. We are not saying that was-created-by is a widely used term; it may have been invented only for a particular RDF specification, and its meaning would therefore not be precise. We can only interpret it from our general understanding of English words. We also have no idea who these people Buzz and Suzy and Alice are, and we have no other information about them.

SPARQL: searching triples distributed across the Web.

Now, here is a piece of code:

prefix website1: <http://awebsite.org/>

SELECT ?x
WHERE
{ website1: <was-created-by> ?x }

We’re getting very close to real SPARQL, by the way, and if you know SQL, you can see the extreme similarity. But syntax is not our issue here. We’re trying to look at concepts.

This code will find the creators of http://awebsite.org. You could imagine that there are actually many thousands of these triples, and that they tell us who built a large number of different websites. Now, we see the power of this query. It will search through all of these triples and find the two of interest to us, and then pluck off the names of the creators.

In fact, these triples could be distributed all around the Web, and we could imagine a search engine taking this query and running it everywhere on the Web where was-created-by triples are stored, and then having it bring back all the creators of awebsite, even if there are a hundred developers, and even if these names are spread around the Internet.
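And the pattern generalizes: put a variable in the subject position too, and the same machinery sweeps up every creator of every site described by these triples, wherever they happen to be stored. A sketch, using the same invented was-created-by term as above:

SELECT ?site ?creator
WHERE
{ ?site <was-created-by> ?creator }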

Next, the bigger issue.

In the next posting, we’ll look more closely at SPARQL. One thing we will consider is why it does look so much like SQL. There is a powerful reason for this that has to do with searching information in general.


