Semantic Web archives - Buzz’s Blog: On Web 3.0 and the Semantic Web

Buzz’s Blog: On Web 3.0 and the Semantic Web:

Semantic Web

Oct 11 2009   11:07PM GMT

Making information management scale: leveraging metadata on the new Web



Posted by: Roger “Buzz” King
3D modeling, automating Web searches, databases, DB2, information, Multimedia, MySQL, Oracle, PostgreSQL, RDF, Semantic Web, Video, Web 3.0, Web development frameworks, Web3.0

Previous postings of this blog.

This blog is dedicated to advanced Web development tools and concepts. Previous blog postings have focused on the emerging Semantic Web, which promises to make the Web radically easier to search and to greatly enhance the value of the vast sea of currently-disconnected information spread across the Web. We have also looked at Web 3.0 efforts, which promise to make multimedia websites highly usable and capable of conveying far more information than the current generation of websites. Previous postings describe breadth and depth of cutting edge Web technology.

Metadata: making that ratio small.

Here’s something that’s very important: Much of the ongoing research and development that is loosely categorized as Semantic Web and Web 3.0 efforts is focused on a specific technical goal, one that has been at the core of information management technology since the mainframe era that was epitomized by the IBM 360 series. That goal is to leverage metadata as much as possible.

It’s our best weapon against the truly staggering amount of information on the Web. This includes traditional text-based and numeric data, as well as books, medical advice, photographs, entertainment and training videos, music and recorded books, investment information, educational materials, scientific materials, e-government information, etc., etc. How can we possibly organize information and then search it in a way that scales? The Web is far from a closed world. In traditional data processing environments like banking, insurance, and credit card processing, we could get our arms around all of the data, as vast as it may have seemed. But the world of information today is an open world, effectively infinite in size.

Very informally, if you look at the size of the metadata divided by the size of the data itself, the smaller that fraction the better. In traditional relational databases (built with database management systems, such as Oracle, MS SQL Server, MySQL, PostgreSQL, or DB2), the extreme focus on minimizing this ratio has enabled the fast processing of extremely large volumes of data. The tradeoff is that the table definitions (or the “schema”), which form the heart of the metadata are very, very simplistic.

The old days: relational database schemas.

An insurance claim may be defined as a table with such columns as Subscriber_Name, Medical_Provider, etc., and thus, may consist of little or no more than a series of simple character and numeric fields. But if we need to process fifty thousand of them tonight, we must be able to bring many such table rows into memory at once, and quickly move through them. The database world was an extension of the paper world: a row in an insurance claim table was effectively an electronic successor to the traditional claim form.

Today: a far more challenging problem.

But on the new Web, information can be far more complex in nature, making the metadata to data ratio far larger. We’ve looked at some of the emerging technology and technical trends for embedding metadata in advanced forms of data (and for processing that metadata); this data includes books, images, video, modeling and animation, and sound. This new generation of information formats make up our personal health records and medical records images, industrial training materials, university “distance” courses, and the like. Each instance of these tends to be far more unique than individual insurance claim forms. And, it takes a lot of metadata to properly convey their “meaning”.

The challenge.

What we’re struggling with right now is to succinctly specify the meaning of modern media assets and to automate searching based on this metadata. This is our only hope for leveraging that ratio of metadata size divided by data size.

Oct 3 2009   9:12PM GMT

Multimedia: The Problem of Subtle Semantics



Posted by: Roger “Buzz” King
3D animation, 3D modeling, advanced Web apps, automating Web searches, blob data, continuous data, databases, information, Multimedia, rich internet apps, Semantic Web, smart search engines, tagging, Text, Web 2.0, Web 3.0, web applications, Web development, Web development frameworks, XML

The challenge of the Semantic Web.

We’ve looked at the emerging Semantic Web technology in the previous postings of this blog. The idea is to have a far, far smarter Web, one where the process of finding and interpreting and making use of far flung information can be largely automated. This is in sharp contrast with today’s Web, where these things have to be done in a painful, extremely time-consuming fashion.

So that is the key challenge. It has to do with searching the kinds of information that are important to us in our daily lives. This information, as it turns out, is very difficult to process automatically. Why is this?

The complexity of modern multimedia.

I teach a very basic 3D animation class to mostly computer science students. We use Maya, arguably the most popular 3D animation application, one that is used in the making of many animated features. The interesting thing about animation is that it is truly multimedia. It can give us a lot of insight into what we need the new Web to do for us.

That’s because the number and diversity of applications that are used for drawing, documenting, modeling, animating, motion capture, texturing, video rendering, video editing, video conversion and compression, sound editing, in even small projects, can be very impressive. Correspondingly, the wide variety and complexity of media formats involved in an animation project can be overwhelming.

What happens in an animation project? The workflow might begin with vector storyboard drawings to break the story down into scenes. In a typical animation project, 3D models in a variety of proprietary formats are used. Models must be transformed as they are exported from one application and imported into the next. Multiple video renders of animated models are made, and they must be edited together, along with multiple sound files. Multiple video and audio formats might be used. 2D images are used for textures; photographs of butterfly wings can be used to make an animated butterfly very realistic, and a checkerboard image made with Photoshop can be used to make a Linoleum floor. And along the way, a variety of note taking, screen capture, and conferencing software might be used to facilitate group communication.

There is also a heavy focus on reuse in an animation project. Building every model, editing every texture, creating every environment and background, recording every sound from scratch is frequently intractable. If existing assets cannot be tailored and reused, the project would be far too expensive and time consuming, and would demand too wide a variety of professionals to always be available. This raises the multimedia stakes, as assets of widely differing forms must be constantly reconfigured and used in concert in new ways.

But what’s the real problem? We aren’t all trying to produce complex animated videos. But very interestingly, in our everyday lives we essentially face the animator’s challenge when we try to find and use information on the Web. That’s because we’re often looking for things whose meaning, whose interpretation, demands focused human thought. We are looking not for business data, but for pieces of media, and the problem is that today, most of our searching has to be based on tags or brief textual descriptions that are associated with pieces of media, and not on the true meaning of the media itself.

The needs of the business world are not our needs.

It’s the subjective nature of media assets - this is what is at the heart of the problem facing us. Existing technology for searching the web is based on keywords and very short pieces of text.

There is other technology, though, under active development, stuff that serves as the information storage backbone of most commercial websites. It’s the technology that has for decades been used in-house (not on the Web) by businesses when they process large databases. But this stuff was designed to handle traditional business data forms, like integers, character strings, real numbers, dates, timestamps, and full text.

There is more, though. All of the major database management systems, along with tools for building and searching advanced websites are being retrofitted (or in some cases, built from the ground up) to manage more than keywords and text, more than standard business data.

But up to now, the focus has not been on supporting the kinds of information you and I are most interested in. The focus has been on extending database and Web technology to support xml documents, as well as more complex data objects, like those inside a Java program, as well as other forms of data found inside programs. This includes arrays and lists and short pieces of textual data, like the names of diseases.

In other words, we’ve been busy extending our support of the business world, so they can store complex business data in databases and make that information processable over the Web. You and I have largely been left out.

Finally, we are attacking our needs.

But there now many ongoing efforts to extend database and Web technology to make it useful to us. The new focus is on supporting blob and continuous media like images, video, and audio. This is extremely hard to do.

Why? Because the strongest means by which we deduce the meeting of business data is by looking at its internal structure and the terms that are used to describe that structure. A relational table named Prescriptions, with a character attributes Patient Name, Doctor’s Name, and Medication, and with a numeric attribute Dosage, is pretty easy to interpret.

But what do we do with a photograph, which is just a grid of pixels with no internal structure? Or a long series of images, along with a sound track, put together to form a piece of video?

The U.S. military has been pumping money into image processing for several decades, and so all is not lost. There is a vast body of mathematical research and software development that allows us to write programs that can find a particular face in a crowd and search satellite photos for airplane runways. But in general, we cannot at this time write a program that can process an arbitrary photo or video clip and tell us what it means. That means we can’t quickly search vast media database for useful pieces of information.

The goal behind the Semantic Web effort is to build a new generation of websites whose information can be searched automatically, and where information from multiple sites can be automatically integrated. To do this with numeric and character based data is quite doable. But when it comes to multimedia, like images and sound and video and 3D models and engineering designs, well, we have a long way to go. The meaning - in other words, the semantics - of these forms of data are complex and subtle, and highly dependent upon an individual’s interpretation of that media.

So, we see that we have only just begun our journey to create the new Web.


Sep 25 2009   11:31PM GMT

Semantics and the new Web: Built out of very old ideas.



Posted by: Roger “Buzz” King
automating Web searches, inferences, information, knowledge, Semantic Web, Web development

Describing the real world in computers.

The word “semantic” has been a buzzword in computer science for decades. The youthful Artificial Intelligence world invented these things called Semantic Networks or Semantic Nets a half century ago. The idea was to come up with a crisp, formal language for representing real world things inside a computer. This took the form of a small set of constructs that would be general purpose, in that they could be applied to almost any sort of information. Further, these constructs would somehow be intuitive and natural, in that they would get to the heart of what it means to describe everything from horses to insurance claims to marriages to the contents of the Bill of Rights.

Basic, long-standing, core concepts.

What emerged has certainly stood the test of time. Big time. Opinions differ widely on just what constitutes the core constructs. Different people have used different names for these terms, and, although the idea was to specify something formal, the definitions of these constructs were generally sloppy. But here is a reasonable specification, in its most rudimentary form:

There are objects (which might also be called entities, things, or concepts). Objects have unique names.

Objects are interrelated by attributes (which might also be called relationships or properties). Attributes are directional, and they have names.

In other words, things in the world can be represented as a simple directed graph. We could say that there are objects called Chickens that have an attribute called Are. The value of this might be an object called Birds. Birds might have an attribute called Lives-In, which links Birds to the object Barnyard. There might be an object called Mr. Fried, which has an attribute called IS, which connects Mr. Fried to the object Chickens.

There are many popular various of this basic idea that have emerged, and they tend to be of the following nature:

One idea is to make a sharp distinction between the notion of a subtype (or sub-kind or subset) and other attributes. So, our attribute Are might become a core concept itself, and we might name it Is-A. Chickens IS-A Birds, People IS-A Biped, etc. Other attributes like Lives-In would be considered inherently different from Is-A.

We could introduce another generalization. A general term for attributes Lives-In and other similar attributes might be Has-A. In fact, we could stop using special words for attributes in general, and just use the terms Is-A and Has-A. We would then say that Marriages Has-A Wife, as well as a Husband, as well as a Date.

These general ideas are actually old, and actually significantly predate computing. We have been struggling with the problem of describing real world objects (like Cows), real world concepts (like Marriages and Respects), and their interrelationships and categories since the emergence of the earliest philosophers. Aristotle distinguished between objects and their attributes, and carefully studied and described many animals and plants.

What does it all mean for the new Web?

So, what does all this mean to us, today, and what does it have to do with modern Web technology? Well, first of all, these concepts of objects and attributes have spread throughout all of computer science.

There have been some significant extensions, like distinguishing between an attribute that we might call a relationship, which interconnects complex objects or notions (like a driver owning a car) and attributes that interconnect complex objects and notions with atomic or simple things (like a car having a color or a driver having a name). Generally, these latter, simple kinds of attributes are now what we call attributes, and are considered inherently different from (and simpler than) relationships.

Another extension that has become a core concept in programming languages is something we might call an object identifier, which is a unique number or other identifier for individual objects; this allows us to carefully distinguish between two people who have the same mother, and two people who have mothers who just happen to have the same name.

Programing languages also introduced the concept of methods, or little programs that can give life to objects. You might be able to tell a marriage object to tell us the names of the husband and wife.

But basic concepts have not changed. There seems to be something natural and fundamental about them.

Building a new world out of old concepts.

And the Web? A revolution is happening today. We are developing languages that allow Web designers to embed machine-readable specifications in Web-resident information. This will largely automate the process of searching the Web, as well as the integration of information at multiple sites. This will in turn lead to the discovery of knowledge by putting together diverse information from across the Web. We have discussed these emerging technologies in the previous postings of this blog; they are heavily and deliberately built on top of ideas that date back to the 1950’s, and in fact can trace their roots to ancient Greece.


Sep 18 2009   3:02AM GMT

Dynamic pages, hidden data, and infered information: the danger of scale.



Posted by: Roger “Buzz” King
assertions, databases, dynamic pages, hidden web content, inferences, next generation search engines, Semantic Web, smart search engines, static pages, triples

The good and bad sides of the powerful Semantic Web.

So what happens when the Semantic Web is here? It’s supposed to largely automate the process of searching the Web by allowing us to attach machine-readable assertions (perhaps by using RDF) to information posted on the Web. Then, instead of us poor flailing humans having to painstakingly chase down countless URLs until we get what we want, smart search engines would be able to find precisely what we want in a single shot.

There is an obvious danger to all of this. The new Web will scale, in both good ways and bad. I am certainly not the first person to point out that the smarter the Web, the easier it will be for software to peruse the Web and dig up personal information about us. There will be software that carefully crafts ads in Spam mail that will target our vulnerabilities and our preferences. Websites will dynamically create webpages that target us individually, as well. When we shop online, when we read news, when we make social connections online, the Web will be disarmingly efficient and effective, and this leaves lots of room for fraud and manipulation.

This is already happening to a significant degree, and most of us are aware of it.

The no-longer-hidden database factor.

There is something more subtle about all of this, however. One of the most difficult things to do with traditional Web technology is to expose the content of databases to Web visitors. That’s because the pages that deliver up content pulled from databases are highly dynamic in nature, and so it is very hard for web designers to make search engines (like Google) find and index the content of these databases. There are simple and somewhat effective things web designers can do, like creating static pages that contain terms that are meant to draw web visitors to their sites. These pages are not “destination” pages; rather, they exist only as a way of advertising the information contained in databases.

In the future, RDF assertions (and other machine-readable content) will be added to websites, and they will server as far more effective draws.

But what about privacy? Will web designers inadvertently facilitate fraud and identity theft by enabling the automatic cross-referencing of detailed information existing in databases that have been built and deployed on the Web in isolation? This capability is at the heart of the Semantic Web effort. Information that right now can only be obtained by individual users manipulating individual web interfaces will be discoverable by smart search engines.

The real problem: it will scale.

This is a big deal. It’s not just that previously hidden information will now be discoverable. Because standardized terms and assertions will be used to describe information in databases, smart search engines will be able to automatically interrelate data from otherwise unrelated database systems. When information from multiple places is integrated, new information is effectively created.

For a moment, let’s forget about databases and look at a simple example of information that might be stored statically in two websites. Here is an example adapted from the previous posting of this blog:

Assertion 1: Joe is tall for an athlete.
Assertion 2: Tall athletes should try out for basketball.

A new inference: Joe should try out for basketball.

The point here is that this new inference can be inferred automatically, without the intervention of a human being.

We noted in the previous posting that the information about Joe and the information about basketball might be on different websites. These websites could easily have been built independently. But a key notion - and that is the semantics of the word “tall” in the context of basketball - is what allows this information to be automatically integrated. Another site might point out that Timmy is tall for a kindergarten student, but this would not trigger the suggestion that Timmy try out for the NBA.

Now, let’s get back to database systems, these things that can contain countless terabytes of personal information. Perhaps there is a database at one site containing information about many thousands of athletes. Perhaps there are hundreds or thousands of such sites. The Semantic Web would allow us to find tall athletes without having to know in advance what databases around the world have this sort of data inside them, data that previously could only have been extracted through tedious, time-consume human/computer interaction. Now, a high school counselor or a sports agent looking for new clients can be far more effective at their jobs.

Or, maybe it’s a drug company matching potential customers up with expensive drugs targeted toward specific diseases, or toward people who might have vague symptoms of various diseases, and who might be easily convinced they are sick. Ora con artist looking to scam elderly people who are likely to have dementias.

Or - well, get it? The Semantic Web will scale because it will have access to huge databases, and not just a world wide web of static pages. That’s the danger.


Aug 31 2009   3:40AM GMT

A Real-World Look at the Semantic Web, part 1



Posted by: Roger “Buzz” King
assertions, databases, inferences, information, knowledge, namespaces, RDF, Semantic Web, SPARQL, triples, wikis, ontologies

This blog is dedicated to the study of emerging Web technology, in particular, ongoing research and development aimed at building software tools that will underlie the emerging Semantic Web. In this posting, we look at a little-known website that has the potential of setting the pace for the developers of the Semantic Web.

DBpedia.

It’s called DBpedia. A former graduate student at my university, Greg Ziebold, pointed me toward it. The goal of the DBpedia is to transform data from the Wikipedia into a chunk of the Semantic Web. To do this, DBpedia is using RDF technology, something we have discussed is past postings of this blog. Behind RDF is an extremely simple concept, but one that has proven extremely powerful and versatile.

The general idea is to break knowledge up into “triples” that describe relationships between pieces of information. These triples can be chained together to discover new relationships. And, importantly, triples must make use of widely shared sets of terminology, called namespaces, in order for knowledge from different places on the Web to be properly chained together.

RDF, triples, assertions, and inferences.

A thorough example can be found in a previous posting of this blog.

Here is a very simple example of triples (also known as “assertions”) and how they can be put together into “inferences”.

Assertion 1: Joe is tall.
Assertion 2: Tall People should try out for Basketball.
A new inference: Joe should try out for Basketball.

Keep in mind that we would want to make sure that the words used in these assertions have precise, global meanings. We might take the terms in these two assertions from a basketball namespace, one that would carefully dictate exactly what “tall” means in the basketball world. Certainly, it would be quite different from the meaning of “tall” in a kindergarten namespace.

More on DBpedia.

There’s a fancy word for sets of triples that use namespaces and represent various areas of knowledge. They are called “ontologies”, taken from the term used by philosophers to argue about the existence of various things, like God. The DBpedia is essentially a vast ontology, formed from triples and namespaces. Most of the knowledge defined by this ontology comes from the Wikipedia. The folks behind the DBpedia have been given direct access to the flow of information into the Wikipedia, so that the DBpedia can stay current.

One way to look at the DBpedia is that it takes the Wikipedia and reforms it into something that can be searched far more effectively. Right now, to search the Wikipedia, most of us simply type in terms (either into Google/Yahoo or into the Wikipedia search page). We try various terms and follow links inside the Wikipedia until we find what we think we are looking for. With the DBpedia, users can search with SPARQL, a language based on the structure of SQL and engineered specifically for searching large bases of triples. SPARQL allows us to traverse networks that consists of triples linked by inferences.

That way, if we were a coach looking for promising candidates for our team, we would use SPARQL to make the connection between Joe being tall and the fact that tall people should try out for basketball. This is clearly much faster and more accurate than googling things like “tall”, “basketball”, etc, until we happened to find Joe in one of the web pages that pop up.

The DBpedia website, by the way, claims to have a triple base that consists of 274 million RDF triples.

More on this in the next posting.


Apr 9 2009   3:28AM GMT

The Dublin Core and the Metadata Object Description Schema: a look at namespaces



Posted by: Roger “Buzz” King
Semantic Web, namespaces, Dublin Core, MODS, the Metadata Object Description Schema

Namespaces.

As we have seen, namespaces are a core element of the emerging Semantic Web. By posting namespaces on the Web, we can share precise vocabularies that will hopefully enable us to automate the process of searching the Web.

Searching with today’s search engines, like Google, is an inaccurate and highly iterative process. Searches are based on matching our search words with words in the documents that have been found and indexed in advance by the search engine. It can be a very painstaking process: we have to click on the URLs that are returned, and for each one, make a decision as to whether or not the page is relevant. We typically end up changing our search words gradually, as we hone our search criteria.

Namespaces are intended as a key element of a long term goal to make search engines of the future smarter. If the terms we used to formulate our searches came from widely-adopted, standardized namespaces, there would be far less painstaking iteration involved in finding the right webpages. We would accompany our search requests with links to the namespaces that define terms we are using. And in fact, searching would become at least partly automatic, with the browser able to narrow the set of returned URLs by making use of its knowledge of namespaces.

The Dublin Core.

Let’s take a look at one of the most widely known namespaces. It’s called the Dublin Core. But, as it turns out, it proved too simple and has since been eclipsed, at least in part, by a somewhat more sophisticated namespace called the Metadata Object Description Schema.

To get started, here’s another way to look at a namespace: it is used to create metadata that describes some data source. In particular, the Dublin Core was engineered to provide metadata for resources that can be found on the Web, including text-based documents, images, and video, and in particular, web pages. Want to know what a web page is all about? Look at its metadata, specified with the Dublin Core standard.

By the way, the namespace is named after Dublin, Ohio, not the other Dublin. The namespace was the result of a workshop held in Dublin in 1995. It is not an XML extension, like SMIL, the language used for building multimedia presentations. However, the Dublin Core can be used to create metadata for documents that are specified with XML or one of its many extensions.

So, what is in the Dublin Core? Basically it is a set of terms such as Contributor, Publisher, and Language. Some of the terms generally refer to very simple values, like Contributer, which is the person or organization that created a document.

To look at one of the potentially more complex Dublin Core terms, Coverage can describe the 3D (x,y,z) coordinates, or the time period, or the nation referenced by the document being described. It could refer to all of these. Note that this is not the time the document was written, or where it was written. Coverage refers specifically to the content of the document itself.

So, if we tell a smart browser of the future to find all documents that pertain to the year 1865, it will not return documents that were written in 1865, but are about the year 1012.

One drawback of the Dublin Core is that it is very loosely defined. So, it often fails in its true purpose: to provide precisely-defined terms that all of us can use, and where we can be confident they will be uniformly interpreted.

A More Sophisticated Standard: MODS.

A newer proposed standard, called the Meta Object Description Schema, or MODS, is an XML language that has been very actively promoted as a successor to the Dublin Core. MODS has more terms, and more precisely-defined terms. Since it leverages the ability of XML to express nested or embedded structures, it can convey much more information than a list of Dublin Core terms can convey.

Here’s a little piece of MODS:

<name type=”personal”>
<namePart type=”family”>King</namePart>
<namePart type=”given”>Bugs</namePart>
</name>

This only gives a hint of the rich metadata that can be specified by using MODS. (The MODS website provides some far more detailed examples.)

Still, compare this to the Dublin Core Contributor term, which might have the value “Bugs King”. Is this a human name? Is it a pest control company?

But - even though it seems like an odd name, in the MODS example, we know that this is a person who goes by the name Bugs King.

Dublin Core might die and blow away - but it will always be recognized as a pivotal point in the development of the Semantic Web.