Namespaces archives - Buzz’s Blog: On Web 3.0 and the Semantic Web


Sep 9 2009   6:02PM GMT

Real-World Look at the Semantic Web, part 2



Posted by: Roger “Buzz” King
assertions, inferences, information, namespaces, RDF, SPARQL, triples, URI's, wikis

This blog is dedicated to the study of emerging Web technology, in particular, ongoing research and development aimed at building software tools that will underlie the emerging Semantic Web. Last time, we looked at DBpedia, something that a former graduate student at my university, Greg Ziebold, pointed me toward.

The Semantic MediaWiki.

In this posting, we look at the Semantic MediaWiki, something else that Greg told me about. It is an extension of MediaWiki, the application that the Wikipedia is built out of. You can learn all about it at the Semantic MediaWiki website. The idea behind Semantic MediaWiki is to provide a more powerful wiki tool, namely one that supports more than just human-readable things like text and images.

RDF and namespaces: creating machine-readable, web-based information.

The idea is to allow entries in wikis that contain machine-readable information, so that searching can be performed in a largely automatic fashion. Specifically, the Semantic MediaWiki allows users to export information from a wiki in RDF format. An RDF specification consists of “triples” that form “assertions”. Consider the following:

Assertion 1: Joe is tall.
Assertion 2: Tall People should try out for Basketball.

The idea is for terms in triples (“Joe”, “tall”, “is”, “Tall People”, etc.) to be taken from predefined and globally accessible namespaces. This would ensure that everyone who uses a given term (like “tall” or “Should try out for”) will have the same meaning in mind. In this way, rather than having to painfully search for information that pertains to Tall People, for example, a smart search engine could do the searching for us.
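To make this concrete, here is a minimal sketch in Python of the two assertions written as triples whose terms come from namespaces. The namespace URIs and the helper name `term` are invented for illustration, and treating “tall” and “Tall People” as a single term is a simplification.

```python
# Hypothetical namespace URIs, invented for this sketch.
PEOPLE = "http://example.org/people#"
SPORTS = "http://example.org/basketball#"


def term(namespace, local_name):
    """Build a globally unique term by prefixing a local name with its namespace."""
    return namespace + local_name


# Each assertion is a (subject, predicate, object) triple.
assertion_1 = (term(PEOPLE, "Joe"), term(SPORTS, "is"), term(SPORTS, "Tall"))
assertion_2 = (
    term(SPORTS, "Tall"),
    term(SPORTS, "shouldTryOutFor"),
    term(SPORTS, "Basketball"),
)

for subject, predicate, obj in (assertion_1, assertion_2):
    print(subject, predicate, obj)
```

Because both assertions spell “Tall” with the same full URI, software can tell that they are talking about the same thing.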

Building locally, growing globally.

There is more to this. These namespaces can be available on the Web, and RDF statements can point to the relevant namespaces. This means that software searching the Web, and processing these triples, can easily find the relevant namespaces.

Also, the things in the right and left side of a triple (like “Joe” and “tall”) can themselves be Web-based resources. This means that information scattered around the Web can be interconnected - but all the work can be done locally. No one has to manually integrate millions of websites. The job can be done little by little, in a quiet way, as people start to store their information in an RDF compatible fashion.

This is how the Semantic Web will scale. Everyone will use shared namespaces and shared protocols like RDF. This will, in essence, turn the Web into one big website that can be searched in a partly automatic fashion.

SPARQL: querying RDF-based information.

How will we interrelate data scattered around the Web?

There is a query language out there, called SPARQL, that can be used to search the Web. SPARQL can follow RDF connections around the globe. How is this done? It has to do with being able to “infer” new things. Consider a fact that can be automatically deduced from the two assertions above:

A new inference: Joe should try out for Basketball.

Assertion 1 could be on a server in Detroit, and assertion 2 could be on a server in Miami, and SPARQL could do the job of making the leap that leads to the new inference.

This means that we could figure out what Joe should be doing right now without having to find the two pieces of information manually (the fact that he is tall, and that tall people should play basketball), and without having to make the inference ourselves.
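Here is a rough sketch of that leap, written in plain Python rather than in SPARQL itself. The two sets stand in for the two servers, and the single chaining rule is hard-coded; in real systems the query language usually works alongside a separate set of inference rules, so this is only an illustration under those assumptions.

```python
# Two independent triple stores, standing in for the Detroit and Miami servers.
detroit = {("Joe", "is", "Tall")}
miami = {("Tall", "should try out for", "Basketball")}


def infer(triples):
    """Chain 'X is C' with 'C should try out for S' into 'X should try out for S'."""
    inferred = set()
    for subject, predicate, category in triples:
        if predicate != "is":
            continue
        for other_subject, other_predicate, sport in triples:
            if other_subject == category and other_predicate == "should try out for":
                inferred.add((subject, "should try out for", sport))
    return inferred


# Merge what the two servers know, then look for new facts.
merged = detroit | miami
new_facts = infer(merged)
print(new_facts)
```

Neither server alone holds enough information to produce the new fact; it only appears once the triples are combined.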

This is a big deal. This sort of automation is what the Semantic Web is all about.

So what do real people do with the Semantic MediaWiki? We’ll look at this next.

Aug 31 2009   3:40AM GMT

A Real-World Look at the Semantic Web, part 1



Posted by: Roger “Buzz” King
assertions, databases, inferences, information, knowledge, namespaces, RDF, Semantic Web, SPARQL, triples, wikis, ontologies

This blog is dedicated to the study of emerging Web technology, in particular, ongoing research and development aimed at building software tools that will underlie the emerging Semantic Web. In this posting, we look at a little-known website that has the potential of setting the pace for the developers of the Semantic Web.

DBpedia.

It’s called DBpedia. A former graduate student at my university, Greg Ziebold, pointed me toward it. The goal of the DBpedia is to transform data from the Wikipedia into a chunk of the Semantic Web. To do this, DBpedia is using RDF technology, something we have discussed in past postings of this blog. Behind RDF is an extremely simple concept, but one that has proven extremely powerful and versatile.

The general idea is to break knowledge up into “triples” that describe relationships between pieces of information. These triples can be chained together to discover new relationships. And, importantly, triples must make use of widely shared sets of terminology, called namespaces, in order for knowledge from different places on the Web to be properly chained together.

RDF, triples, assertions, and inferences.

A thorough example can be found in a previous posting of this blog.

Here is a very simple example of triples (also known as “assertions”) and how they can be put together into “inferences”.

Assertion 1: Joe is tall.
Assertion 2: Tall People should try out for Basketball.
A new inference: Joe should try out for Basketball.

Keep in mind that we would want to make sure that the words used in these assertions have precise, global meanings. We might take the terms in these two assertions from a basketball namespace, one that would carefully dictate exactly what “tall” means in the basketball world. Certainly, it would be quite different from the meaning of “tall” in a kindergarten namespace.
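A small sketch in Python shows why the namespace matters. The two namespace URIs and the height thresholds below are invented for illustration; no real namespace defines “tall” this way.

```python
# Hypothetical namespaces, each with its own invented threshold for "tall" (in cm).
NAMESPACES = {
    "http://example.org/basketball#": {"tall": 200},
    "http://example.org/kindergarten#": {"tall": 120},
}


def is_tall(namespace, height_cm):
    """Decide whether a height counts as 'tall' under the given namespace."""
    return height_cm >= NAMESPACES[namespace]["tall"]


height = 150
print(is_tall("http://example.org/basketball#", height))
print(is_tall("http://example.org/kindergarten#", height))
```

The same word, applied to the same person, gives different answers depending on which namespace supplied the term.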

More on DBpedia.

There’s a fancy word for sets of triples that use namespaces and represent various areas of knowledge. They are called “ontologies”, taken from the term used by philosophers to argue about the existence of various things, like God. The DBpedia is essentially a vast ontology, formed from triples and namespaces. Most of the knowledge defined by this ontology comes from the Wikipedia. The folks behind the DBpedia have been given direct access to the flow of information into the Wikipedia, so that the DBpedia can stay current.

One way to look at the DBpedia is that it takes the Wikipedia and reforms it into something that can be searched far more effectively. Right now, to search the Wikipedia, most of us simply type in terms (either into Google/Yahoo or into the Wikipedia search page). We try various terms and follow links inside the Wikipedia until we find what we think we are looking for. With the DBpedia, users can search with SPARQL, a language based on the structure of SQL and engineered specifically for searching large bases of triples. SPARQL allows us to traverse networks that consist of triples linked by inferences.

That way, if we were a coach looking for promising candidates for our team, we would use SPARQL to make the connection between Joe being tall and the fact that tall people should try out for basketball. This is clearly much faster and more accurate than googling things like “tall”, “basketball”, etc., until we happened to find Joe in one of the web pages that pop up.
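The coach’s search can be sketched as pattern matching over triples. This is plain Python that imitates the general style of a triple-pattern query; it is not SPARQL syntax, and the data and function names are invented for illustration.

```python
def match(pattern, triple, bindings):
    """Match one pattern against one triple; terms starting with '?' are variables."""
    result = dict(bindings)
    for pattern_term, triple_term in zip(pattern, triple):
        if pattern_term.startswith("?"):
            # A variable must keep the value it was bound to earlier, if any.
            if result.get(pattern_term, triple_term) != triple_term:
                return None
            result[pattern_term] = triple_term
        elif pattern_term != triple_term:
            return None
    return result


def query(patterns, triples):
    """Return every set of variable bindings that satisfies all patterns."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in triples:
                matched = match(pattern, triple, bindings)
                if matched is not None:
                    next_solutions.append(matched)
        solutions = next_solutions
    return solutions


triples = [
    ("Joe", "is", "Tall"),
    ("Ann", "is", "Short"),
    ("Tall", "should try out for", "Basketball"),
]
patterns = [
    ("?person", "is", "?trait"),
    ("?trait", "should try out for", "Basketball"),
]
candidates = [solution["?person"] for solution in query(patterns, triples)]
print(candidates)
```

The shared variable `?trait` is what links the two patterns, in the same way that “tall” links the two assertions.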

The DBpedia website, by the way, claims to have a triple base that consists of 274 million RDF triples.

More on this in the next posting.


Aug 16 2009   4:16AM GMT

Dangers of the Semantic Web: Assertions, Inferences, and Surrogates



Posted by: Roger “Buzz” King
the Semantic Web, assertions, inferences, namespaces, next generation search engines, smart search engines, surrogates

This blog deals with advanced Web technology. Each posting should be quite understandable on its own, but the blog as a whole is a continuing story. We’ve been looking at the Semantic Web, which is a global effort to automate the searching of the Web, so that applications (we might call them smart search engines) can find, interpret, interrelate, and aggregate information stored in multiple, independent websites.

Assertions and Inferences.

A key concept is that of an “inference”, a fact that is created by putting together two or more pieces of information that we might call “assertions”. We used the following example in a previous posting. The two assertions might be posted on the Web somewhere.

Assertion 1: THE BALL is ORANGE.
Assertion 2: ORANGE is an UGLY COLOR.
An inference created by putting the two assertions together: THE BALL is an UGLY COLOR.

We have also discussed the fact that terminology used in inferences must be very carefully defined and widely shared.

What is a Surrogate?

The word surrogate, in the programming world, refers to a measure or model that is being used to approximate the “real” measure or model. If I am trying to estimate the depth of the ocean at some point, but don’t have a direct way of measuring the distance to the ocean floor, I might judge the depth by using a table that associates the distance from the shore to the depth of the ocean. The assumption is that all points that are a particular distance from the shore will have the same depth more or less.

Here’s the important point for us: The Semantic Web will make very heavy use of surrogates. Let’s be precise about this. We’re not talking about approximations. We might search the Web for all banks that provide accounts that earn 5%, and our smart search engine might point us to banks that on the average, over the past two years, have paid at least 5.0% on their accounts. A surrogate is something different. Suppose we wanted to find all banks that never cheated their customers. This might be impossible to answer precisely, so we might look for banks that are in the bottom 10% when it comes to the number of formal complaints filed against them. That would be a surrogate.
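As a sketch of the bank example, the surrogate can be written as a simple ranking. The bank names and complaint counts below are invented, and “bottom 10%” is read here as the tenth of banks with the fewest complaints.

```python
# Invented complaint counts for ten hypothetical banks.
complaints = {
    "Bank A": 3, "Bank B": 41, "Bank C": 17, "Bank D": 88, "Bank E": 25,
    "Bank F": 9, "Bank G": 60, "Bank H": 33, "Bank I": 12, "Bank J": 54,
}


def fewest_complaints(counts, fraction=0.10):
    """Return the banks in the lowest `fraction` when ranked by complaint count."""
    ranked = sorted(counts, key=counts.get)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


print(fewest_complaints(complaints))
```

Notice what the code does not know: whether Bank A has ever cheated anyone. It answers the question we can measure in place of the question we actually asked.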

Surrogates on the New Web.

Now, let’s consider the Web. It doesn’t matter if we are talking about the Web today or the emerging Semantic Web.

In fact, what we are concerned with here is global to computing in general: when we take a chore normally performed by a human using an interactive interface and turn that chore over to a computer program, we often turn a real world decision into a decision based on very simplified surrogates. A human can look at a bunch of information and, although it may take a very, very long time, make a “perfect” decision based on that data. But computer programs cannot think like a human. We can only crudely simulate with software the process of thinking that goes on in the mind of a real person.

Now, back to the Web, the new Semantic Web. Suppose we build a next generation website and use an official namespace (which is a structured set of terms) to specify assertions using terms from this namespace. What we’re doing is providing a surrogate for the smart search engine to use so that it can do the filtering of URLs and the integrating of information from multiple sites.

Consider our two assertions from above, along with the inference derived from them:

Assertion 1: THE BALL is ORANGE.
Assertion 2: ORANGE is an UGLY COLOR.
An inference created by putting the two assertions together: THE BALL is an UGLY COLOR.

Maybe we are shopping for a ball online. We might have to follow hundreds of URLs and search hundreds of websites to find just the right ball. But who said the ball is orange? It’s an approximation made by the vendor of the ball in question. It has been labeled orange. But maybe it’s a shade of orange that we would actually have liked if we had looked at the picture of the ball ourselves instead of leaving it to the search engine.

Well, we might argue that the word orange, if it is precisely defined, won’t be confused with some other color. We can be confident that our notion of orange is the same as the vendor’s notion of orange. We do know how to express colors very precisely by using numbers.
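For instance, a shared definition of orange could be a band of hues on the color wheel. The sketch below uses Python’s standard `colorsys` module; the 15 to 45 degree band is an assumption made up for this example, not an agreed standard.

```python
import colorsys

# Assumed hue band for "orange", in degrees on the color wheel (invented here).
ORANGE_HUE_RANGE = (15.0, 45.0)


def is_orange(rgb):
    """Check whether an (R, G, B) color with 0-255 channels falls in the orange band."""
    red, green, blue = (channel / 255.0 for channel in rgb)
    hue, _saturation, _value = colorsys.rgb_to_hsv(red, green, blue)
    low, high = ORANGE_HUE_RANGE
    return low <= hue * 360.0 <= high


vendor_ball = (255, 165, 0)
print(is_orange(vendor_ball))
print(is_orange((255, 0, 0)))
```

With a definition like this, the vendor and the shopper can at least agree on what was labeled orange, even if they disagree about whether it is ugly.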

So, let’s change the assertions and the inference a bit:

Assertion 1: DOROTHY THE DOLL is PRETTY.
Assertion 2: WE want a PRETTY DOLL.
An inference created by putting the two assertions together: WE might want DOROTHY THE DOLL.

Now, how could the notion of pretty ever be globally and uniformly defined?

It cannot.

Maybe we should shop for our own dolls and not leave it to a next generation search engine.

The Lesson.

The Semantic Web will trade accuracy for speed. No way around it.



Jul 3 2009   4:28AM GMT

The Semantic Web: RDF and SPARQL, part 1



Posted by: Roger “Buzz” King
the Semantic Web, namespaces, RDF, SPARQL, XML, triples, automating Web searches

This blog is dedicated to advanced and emerging Web technology.  Each posting is meant to be understandable and informative on its own, but the blog as a whole tells a continuing story.

The Semantic Web.

In this posting, we will focus on the Semantic Web, which is a global effort at radically improving our ability to search the Web.

Currently, to search the web, we type keywords into a search engine like Google, which then searches its vast index of webpages for pages that have these keywords in them. Because this sort of search is very low-level, and not at all tied to the true meaning or purpose of the information stored in webpages, searching is painfully iterative and interactive.  A user must chase down countless URLs returned by a search engine to see if any of them are relevant.  Quite frequently, they are not.  And so, the user must refine the set of keywords and try again.  It might take many attempts before a satisfactory result is obtained.

One of the primary goals of the Semantic Web is to automate the process of searching the Web.  There are two stages to this.  First, people who post information on the Web must capture knowledge about the meaning of their information; this knowledge is commonly called “metadata”.  The metadata is then stored with the posted information.

The second stage happens when users search the Web.  Rather than using the low level keyword search approach, the search is at least partly automated.  The iterative process is sharply reduced by employing a smart search engine that knows how to find relevant information by searching for metadata that pertains precisely to whatever it is that the user is seeking.
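The difference between the two kinds of search can be sketched in a few lines of Python. The pages, URLs, and the `topic` metadata field are all invented for this example.

```python
# Two invented pages that share a keyword but mean different things.
pages = [
    {
        "url": "http://example.org/a",
        "text": "jaguar speed on the open road",
        "metadata": {"topic": "cars"},
    },
    {
        "url": "http://example.org/b",
        "text": "jaguar habitat and diet",
        "metadata": {"topic": "animals"},
    },
]


def keyword_search(pages, keyword):
    """Low-level search: return every page whose text contains the keyword."""
    return [page["url"] for page in pages if keyword in page["text"].split()]


def metadata_search(pages, key, value):
    """Metadata search: return only pages whose publisher declared this meaning."""
    return [page["url"] for page in pages if page["metadata"].get(key) == value]


print(keyword_search(pages, "jaguar"))
print(metadata_search(pages, "topic", "animals"))
```

The keyword search returns both pages and leaves the user to sort them out; the metadata search returns only the page whose stated topic matches.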

The bottom line.

The goal?

The Semantic Web would be able to ease the burden of searching for information, as well as find vast stores of “hidden data” that reside in databases that are accessible via webpages, but whose contents right now are not seen by search engines.

Ultimately, we would want the Web to be entirely searchable by software, without any humans guiding the process.  This would be the true Semantic Web.

Namespaces and triples.

In past postings of this blog, we have discussed a handful of key approaches to implement the Semantic Web.  One idea is to tag information with standardized sets of terminology called “namespaces”.

We have also looked at the idea of embedding these tags in things called “triples”.  In this posting, we look at this concept more closely and consider an existing language that would allow people to specify these triples.

RDF and SPARQL.

The most well-known standard for specifying triples is RDF, which stands for the Resource Description Framework.  SPARQL is a query language, heavily influenced by SQL, that can be used to search data that has been structured using RDF.

This is the first of a series of blog postings in which we will first look at RDF, and then at SPARQL.  Then, we’ll consider the big issue: will RDF and SPARQL enable the development of the true Semantic Web?

RDF.

So, what is RDF?  At its highest level, RDF is used to describe anything that can be found on the Web.  RDF has an XML syntax; in other words, RDF can be written as an XML program, using a set of predefined “element” and “attribute” tags.   (XML and XML languages were discussed in an earlier posting of this blog, as were XML and declarative languages.)

We might remember that on its own, XML is impotent.  It is not in itself a programming language.  It is simply a language standard for taking a set of tags and using them as “elements” and “attributes” in declarative, data-intensive languages.  A good example is SMIL, which is used to define multimedia presentations.

Here is a fragment in RDF, using its XML syntax.  Note that XML elements are nested inside one another, with each opening tag written as <name> and its closing tag as </name>.

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:zx="http://www.someurl.org/zx/">

  <rdf:Description rdf:about="http://www.awebsite.org/index.html">
    <zx:topic>funstuff</zx:topic>
  </rdf:Description>

</rdf:RDF>

This looks complicated, but it’s not.  This simple example illustrates the power of RDF.  It uses a set of standardized RDF-specific tags, and the second line of code tells us where these tags come from: the w3.org site, which contains a vast store of information about advanced web technology.  In other words, we can go to w3.org to find the precise definition of RDF specific tags.

RDF is engineered to also use other sets of tags, in particular, domain-specific tags.  In this example, these tags come from a (non-existing) URL called someurl.org.  The tags themselves are prefaced with “zx:” in the rest of the code, so we know which tags are native RDF and which come from a domain-specific set of tags  (called a namespace).

The XML “element” called Description is an RDF-specific tag that tells us we are giving the description of some resource on the Web, namely one at a (non-existing) website called awebsite.org.

The whole piece of code is one triple: It says that the topic of the resource at www.awebsite.org is funstuff.  Here it is as a triple, with all the XML syntax and the namespace information removed:

www.awebsite.org/index.html   topic   funstuff.

Let’s review this again.  RDF is an XML language, so it uses the syntax of XML.  One of the primary concepts in XML is that of an “element”, and Description is an XML element, one defined in the RDF standard.  The piece of code begins with two namespace statements, one telling us which RDF specification we are using, and the second telling us that we will also be using some tags from another, domain-specific specification, which includes the tag “topic”.  Then there are the guts of the triple, telling us that we are listing the topic of a Web-resident resource.
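To see the triple fall out of the XML, here is a minimal sketch that reads the fragment with Python’s standard `xml.etree.ElementTree` module. It handles only this simple shape (a Description with literal-valued child elements), not the full RDF/XML syntax, and the function name `extract_triples` is invented. For an XML parser to accept the fragment, the prefixes must be declared with `xmlns` attributes and plain straight quotes, as they are here.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

document = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:zx="http://www.someurl.org/zx/">
  <rdf:Description rdf:about="http://www.awebsite.org/index.html">
    <zx:topic>funstuff</zx:topic>
  </rdf:Description>
</rdf:RDF>"""


def extract_triples(rdf_xml):
    """Return (subject, predicate, object) tuples from a simple RDF/XML fragment."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree writes namespaced tags as "{namespace}localname";
            # dropping the braces gives the full URI of the predicate.
            predicate = prop.tag.replace("{", "").replace("}", "")
            triples.append((subject, predicate, prop.text))
    return triples


for triple in extract_triples(document):
    print(triple)
```

The predicate comes out as the namespace URI joined to the tag name, which is how the short tag “zx:topic” becomes a globally unique term.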

More on this in the next posting…


May 11 2009   3:07AM GMT

The Semantic Web: revealing hidden data.



Posted by: Roger “Buzz” King
the Semantic Web, Oracle, SQL Server, DB2, PostgreSQL, MySQL, triples, namespaces, static pages, dynamic pages, databases, indexing, hidden web content

The Hidden Web.

The Semantic Web - a primary topic of this continuing blog series - will help us search the web with greater ease. One of the things it will (hopefully) do is expose a vast sea of information that is currently invisible to our web browsers. In fact, some say that right now, we can see less than 1% of what’s out there. I cannot vouch for this number, but I can say that what we cannot see right now includes large volumes of extremely valuable data.

Perhaps you have heard of the mysterious “Hidden Web”? So, what is this stuff and where is it?

Forms, Databases, and Interactive Interfaces.

The Hidden Web refers to data that is out there on the web, publicly accessible - but only via webpage interfaces that are opaque to the indexing software of search engines like Google.

Let’s step back for a moment.

The way search engines work, in case you don’t know, is by constantly searching the web, looking for new webpages. When a new page is found, it is added to the search engine’s index, meaning that now, when people search the web with Google, they might get the URL for that page in their search results.

The important thing to note is that the primary source of information that Google uses when it indexes a page is the page itself. What words are on it?
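A toy version of such an index can be sketched in Python. Real search engines are far more elaborate; the pages and URLs here are invented, and the index simply maps each word to the pages that contain it.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of URLs whose page text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, word):
    """Return the URLs indexed under a word, in a stable order."""
    return sorted(index.get(word.lower(), set()))


# Invented static pages, keyed by URL.
pages = {
    "http://example.org/pharaohs": "ancient pharaohs of egypt",
    "http://example.org/museums": "museums with exhibits on ancient egypt",
}
index = build_index(pages)
print(search(index, "egypt"))
print(search(index, "pharaohs"))
```

Anything that never appears as words on a crawled page never enters the index, which is exactly the problem with dynamic pages.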

This sounds great for static webpages that are stored as-is on websites and delivered as-is to the Google user.

But suppose we want Google to find dynamic pages? A typical dynamic page has content that isn’t known until an interactive user types some words into a “web form”. A web form is a page where the browser user fills in blanks and then lets the browser send the completed page back to the server. There, the information in the form is used to select other information, which is plugged into a “dynamically” created page that is sent to the client machine and viewed by the browser user.

So, I might visit Amazon. I navigate to their search page, which is a form, and I type in the title of the book I want. That information goes back to the server. A description of this book, including its cost, is plugged into a dynamically created page, which is then downloaded to my machine so that I can read the material with my browser.
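The server side of that exchange can be sketched as follows. The book record, the price, and the function name are invented, and a dictionary stands in for the database; this is not how any particular site is built.

```python
from html import escape

# A stand-in for the server's database, with one invented record.
BOOKS = {
    "moby dick": {"title": "Moby Dick", "price": "9.99"},
}


def render_book_page(form_title):
    """Build a page on the fly from the title the user typed into the form."""
    book = BOOKS.get(form_title.strip().lower())
    if book is None:
        return "<html><body><p>No match found.</p></body></html>"
    return "<html><body><h1>{}</h1><p>Price: ${}</p></body></html>".format(
        escape(book["title"]), escape(book["price"])
    )


print(render_book_page(" Moby Dick "))
```

The page exists only after someone submits the form, so a crawler that merely follows links never gets to see it.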

Indexing Dynamic Pages.

So, if I have information that is not sitting in static pages, how can I get Google to index this information? There are multiple ways. For example, if the primary job of your website is to create large volumes of dynamically created pages, you might want to create a special directory page for your site - a static page - loaded with all the right words, and that contains links to the pages and forms you want the user to discover.

On the future Semantic Web, you might want to make sure that those magic words come at least in part from globally accessible namespaces, so that people who are using next-generation browsers, and who will be using these namespaces as a source of search keywords, will find your static page. As we have discussed, namespaces will provide us with detailed sets of terms, which will be tied to specific domains. This will make the search for static pages far more efficient than it is now.

As an example, a namespace concerning books might have words like ISBN-10 and ISBN-13. If the web designer uses these terms to describe static pages about books, and if the user of the browser can specify that they are looking for ISBN numbers, the browser will have a much more detailed idea of what is meant by those 10 and 13 digit numbers the user types in.
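As a small illustration of what “a more detailed idea” could mean, software that knows a number is supposed to be an ISBN can check it against the standard ISBN check-digit rules. The sketch below is in Python; the function names are invented, and it illustrates the checksums only, not any browser feature.

```python
DIGITS = "0123456789"


def is_valid_isbn10(text):
    """ISBN-10: weights 10 down to 1, sum divisible by 11; final 'X' means 10."""
    chars = text.replace("-", "")
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            value = 10
        elif char in DIGITS:
            value = int(char)
        else:
            return False
        total += value * (10 - position)
    return total % 11 == 0


def is_valid_isbn13(text):
    """ISBN-13: alternating weights 1 and 3, sum divisible by 10."""
    chars = text.replace("-", "")
    if len(chars) != 13 or any(char not in DIGITS for char in chars):
        return False
    total = sum(int(char) * (3 if i % 2 else 1) for i, char in enumerate(chars))
    return total % 10 == 0


def classify(text):
    """Say which kind of ISBN a string is, if any."""
    if is_valid_isbn13(text):
        return "ISBN-13"
    if is_valid_isbn10(text):
        return "ISBN-10"
    return "unknown"


print(classify("0306406152"))
print(classify("978-0-306-40615-7"))
```

Without the namespace term, those digits are just numbers; with it, they can be validated and matched as book identifiers.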

Here’s the critical part. Right now, Amazon lets you search by these numbers on their specialized web form page, but imagine if you could at any time tell your browser to look for ISBN numbers on whatever webpages it searches.

An example of a namespace that is used to describe documents on the web is the Dublin Core, by the way.

So, that’s one way to make your dynamic pages somewhat visible. Create a web page that is static and leads to the pages you want users to see, and to make it all the more powerful, use terms from a globally accepted namespace like the Dublin Core. This is something that is already partly doable. The Dublin Core, along with other namespaces, is in wide use.

Where Does that Information Come From?

Is there a better way, though? This technique will only point users to our static web directory, which will then enable interactive users to find our web forms. The users must then use our forms to get detailed data. Could the searching for dynamic pages be made more automatic?

Well, where does data in dynamic pages come from? Often from large databases built with such database management systems as Oracle, SQL Server, MySQL, PostgreSQL, and DB2. This is why some folks conjecture that the amount of information in the Hidden Web is vastly bigger than the web we see today. Databases can be BIG.

Imagine all the information on the ancient Pharaohs, genetic diseases, investments, philosophy, and countless other topics is sitting inside databases that right now are only accessible via web forms. Right now, we Google keywords like “pharaoh” and the first things we see are static, highly condensed Wikipedia pages, and perhaps some static pages posted by museums and academics.

What Will the Semantic Web Do?

The Semantic Web will have as a primary challenge the ability for us to ask for information, and know that the search space will contain information tucked away in databases dotted all around the globe.

This is a very complex problem. Right now, we need a human sitting at the keyboard of the client machine to navigate to the correct URL and then type terms into a web form. In the future, web designers will need ways of capturing information about what is contained in databases, and to specify that information in a fashion that browsers can access. And this information will have to be very detailed, sometimes very intricate.

The browser will also have to take information specified by the user and match it up with the information that describes databases on the web. This means that we will need some automatic way to search databases without a user interactively and incrementally screening tens or hundreds or thousands of URLs. In an earlier blog posting in this series we described one possible technique called “triples” that might, combined with namespaces, provide a partial solution to this problem.
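One way to picture this matching is to have each database publish a few triples describing what it holds. The sketch below is in Python; the database URLs and the predicates `covers` and `queryField` are invented for illustration and do not come from any real namespace.

```python
# Invented descriptions that database owners might publish about their contents.
descriptions = [
    ("http://example.org/egypt-db", "covers", "pharaohs"),
    ("http://example.org/egypt-db", "queryField", "dynasty"),
    ("http://example.org/genome-db", "covers", "genetic diseases"),
]


def databases_covering(topic, triples):
    """Return the databases whose published description says they cover a topic."""
    return sorted(
        {subject for subject, predicate, obj in triples
         if predicate == "covers" and obj == topic}
    )


def query_fields(database, triples):
    """Return the fields a database says it can be searched by."""
    return sorted(
        {obj for subject, predicate, obj in triples
         if subject == database and predicate == "queryField"}
    )


print(databases_covering("pharaohs", descriptions))
print(query_fields("http://example.org/egypt-db", descriptions))
```

The hard part, of course, is getting database owners to publish descriptions detailed enough to be useful, using terms everyone shares.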

We will look at this again, more closely, in a future blog posting.



May 3 2009   3:00AM GMT

Email addresses, the new Web, and NASCAR.



Posted by: Roger “Buzz” King
NASCAR-like web ads, Web 2.0, Web 3.0, the Semantic Web, XML, namespaces, free email accounts, web services, web-based ads

The Semantic Web.

This blog concerns advanced Web technology, in particular, Web 2.0/3.0 and the Semantic Web. Each blog entry should be fully understandable on its own, but the blog as a whole tells a continuing story.

Very roughly, we’ve defined the Web 2.0/3.0 as the class of emerging web applications that are highly responsive, to the point of being competitive with desktop apps. Another characteristic is that they can manage large volumes of very complex media, like images, sound, and animation, as well as interconnected forms of media. We’ve looked at some specific advanced web applications.

Our concern here, in this blog entry, is the Semantic Web, which we have also roughly defined. The Semantic Web is something that does not yet exist, but would meet the very aggressive goal of supporting largely automatic web searches, freeing us from excruciatingly interactive, manual Google and Yahoo sessions. And we’ve seen that we would use such things as shared namespaces, intelligent full text searching, and XML-based markup languages to embed information in websites that could be used by smart browsers to perform far more accurate searches.

Web services would help a lot, too, by taking humans out of the loop when providing powerful web-based capabilities; one website can now provide a vast amount of information, for example, by silently using web services to collect information from many other web-based sources.

(By the way, we have also looked at precisely what we mean by “semantic” in the Semantic Web.)

The way we pay.

This all sounds very good. The Web would be far more useful, with automatically searchable Semantic Web-sites. But there’s a bad side to all of this, and it has to do with how we often pay for Web use.

The problem is that we often do not pay at all. At least not directly, with money. We pay by putting up with ads. Free email services, such as those hustled by Yahoo, Hotmail, AOL, and Mail.com, are generally accessed via web browsers, and we find the main pages of these email accounts stuffed with ads.

Some free email accounts even stick ads in your outgoing mail!

Often, the only way to get the ads stripped from a web mail interface is to pay a fee. We might also get more than just ad-free web mail pages; paying sometimes allows users to access their email with POP or IMAP protocols, via desktop clients (like Outlook and Apple Mail), thus avoiding ads in another way.

(As an aside, there are free email sites that either have no ads in them, or only very subtle ones. Try Gmail.com and Inbox.com. My favorite, with its clean interface and growing set of accompanying capabilities, is GMX.com.)

As it turns out, folks looking to buy ad space online find that they have a vast array of choices, and this drives down the cost of ad space. But these two things, an ever-growing list of free online services and cheap ad space, are related. This is because it is all too easy to build useful web applications. Like browsers, bulletin boards, calendar apps, blogging services, and stickies applications, email servers are cheap to build and maintain. Vendors can use canned, largely free software components.

And, transmission costs on the Internet are effectively free, and the bandwidth is huge. Free email accounts often offer a gigabyte or several gigabytes of storage, because disk space is dirt cheap, too.

There is a lot of rebranding going on, too, where someone seems to be offering free email (or some other service), but it is actually being provided by a large email provider.

So, the way things have shaken out, is that free web apps like email servers look like NASCAR racing cars, covered with colorful ads. Many of these ads consist of video, and so we have to battle distracting, flashing colors so we can focus on our mail.

The trick behind online ads.

There is something happening in the online ad world: folks who provide these free, pay-for-it-with-ads services are learning to carefully target ads. There is specialized software available for this, and by plugging in some smarts, folks can make the ads that appear on your screen far more likely to be of interest to you.

How is this done? By watching what you type into search engines, by taking advantage of personal information you supply when you sign up for free email accounts and other services, and by carefully examining the content of the messages you send and receive, that’s how it’s done.

It’s important to point out that this works. The “click through” rate on ads can be radically improved, just by using some simple heuristics in choosing your ads. Folks who pay for ads love this, and it has allowed individuals who don’t even provide free web applications to turn themselves into ad space sellers. Your blog, your specialized website, can now host ads carefully targeted toward the visitors to your blog or your website.
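A “simple heuristic” of this kind might look like the following sketch in Python, which picks the ad whose keywords overlap most with what the user typed. The ads and keyword lists are invented, and real ad-targeting software is certainly more involved.

```python
# Invented ads, each with a set of trigger keywords.
ADS = {
    "running shoes": {"running", "marathon", "shoes"},
    "cat food": {"cat", "kitten", "pet"},
}


def pick_ad(user_text, ads):
    """Pick the ad sharing the most words with the user's text, or None if no overlap."""
    words = set(user_text.lower().split())
    scores = {name: len(words & keywords) for name, keywords in ads.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(pick_ad("training for my first marathon", ADS))
print(pick_ad("hello world", ADS))
```

Even something this crude beats showing ads at random, which is why the technique spread so quickly.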

But just wait for the Semantic Web.

But it will really kick in when the Semantic Web is here. The same technology that would make browsers far, far smarter about finding good URLs for you will make the targeting of ads at you extremely precise.

This slowly-emerging technology is badly needed by the folks who sell ad space and by the people who buy that ad space. That’s because you and I are starting to get used to this world of NASCAR websites. We are looking through or past or around the ads. They need to be made a lot smarter, in order to get our attention back.

But by using Semantic Web technology to radically increase click-through rates, by getting us interested in ads again, impulse shopping on the Web might skyrocket. It’s very easy to go from seeing an ad for a product you have never heard of before to having bought it.

Like little kids watching commercials for sugar-heavy cereals on Saturday cartoon shows, we will be manipulated like we have never imagined before. That’s the bad side to the Semantic Web.



Apr 26 2009   8:09PM GMT

The world of advanced Web applications: what are they?



Posted by: Roger “Buzz” King
Web 2.0, Web 3.0, the Semantic Web, XML, mashups, wikis, social networking sites, tagging, distance education, zenbe.com, evernote, GlideOS, namespaces, web services

This blog is dedicated to an ongoing discussion of Web 2.0/3.0 and the Semantic Web. The slant is on the technology itself, how it works and what’s going on inside advanced Web applications. We’ve looked at a couple of different Web 2.0 applications, in particular, Evernote and GlideOS. We’ve tried to characterize the capabilities of Web apps.

The impact of the new Web.

This posting addresses a non-technical question: What has been the impact of this technology on our society?

Technological advancement can be very roughly broken into two groups: incremental and radical. Which of these is Web 2.0/3.0? Is it a radical advance?

Consider what highly responsive, multimedia web applications have done for us. They have enabled the development of:

* Wikis: These are web applications that allow us to collaboratively develop sophisticated, easily searchable information bases. These can range from dictionaries for specialized disciplines to vast databases containing DNA information. Data can be vetted by experts and/or challenged by random users.

Everybody knows about Wikipedia, but like blog and bulletin board software, wiki software can be easily installed and configured for deployment on almost any web server, whether it is publicly accessible, or used privately within a corporation or by a professional organization.

* Social networking sites: These are web applications that allow us to actively participate in a myriad of communities based on professional and personal interests. We find work, develop contacts, share music and photographs and video, and develop lifelong collaborations with people we would never have met otherwise.

They are also used by people who are in daily physical contact, but who find they can deepen their relationships by posting personal information on public sites like MySpace and Facebook. The interesting thing about these sites is that new and successful ones keep emerging.

* Tagged content vendor sites: Volunteers and paid individuals can contribute multimedia content and collaboratively tag it, using both freeform tags and highly sophisticated tagging protocols, such as the MPEG-7 standard. (We will look at MPEG-7 in a future posting of this blog.) The content includes images, sound, and video, and many taggers are highly trained professionals who can carefully categorize content according to its detailed meaning. This technology makes a vast sea of otherwise-unknown assets available to us. It also makes these assets searchable, thus transforming a completely intractable task into something we easily perform.

In particular, this has radically enhanced the creative power of both professional and hobbyist animators by giving them complex scenery and character components to work with. Check out thoughtequity.com for an example of a content vendor. Take a look at daz3d.com for animation content.

* Mashups: These are portal or second tier web applications that take content from other web sources, such as Google Maps, investment information, medical advice, and scientific data. Often mashups take data from several or hundreds of other sites and create complex, highly valuable multimedia assets.

Take a look at woozor.com. It combines Google map and weather data.

* Distance learning: Universities, corporations, professional organizations, and lone instructors can develop and sell effective, multimedia educational packages that bring education to anyone who has Internet access. This allows us to retrain ourselves for new occupations, stay current in our professional skills, and find employment that is satisfying, steady, and high paying.

I teach on my university’s distance learning site, and we use video, sound, desktop video capture, slide presentations, and software demonstrations - and they can all be edited into a unified product. There are online universities now, where you can get a college degree. Take a look at jonesuniversity.com.

* Hybrid applications that support things like email, calendar, collaboration, RSS feeds, etc.

A good example of a hybrid application is zenbe.com, which provides a combined web-based email, list making, and calendar application, and in that sense is similar to many other email providers. But Zenbe also provides a collaborative tool called Zenbe Pages, which can be used by collaborators to organize their activities. A Zenbe page can have notes, calendars, lists, and RSS feeds (not new ones, but existing RSS feeds) on it. Zenbe also provides quick access to Twitter, Google Talk, and Facebook.

By the way, it’s important to point out that the categories I list above are not as clear-cut as one might think. Many modern web apps contain elements from more than one of these categories.

The software building blocks.

From a programming perspective, what specific Web 2.0/3.0 software has allowed all of this to come about? We’ve discussed much of this already in previous postings of this blog. It includes XML and the exploding class of XML languages, namespaces, IDEs (Integrated Development Environments), large code bases (such as the vast library of ready-made Java components), web service software development tools, and AJAX web page optimization technology. It also includes web development frameworks like Ruby on Rails, and newer ones, engineered toward high responsiveness, like Flex and Silverlight.

Also included are powerful media formats, codecs, players, and editors, which allow web users to do more than upload and search media; we can edit it and reform video, images, and sound, without leaving the simple world of our browsers. And of course, modern mega media apps enable us to build media assets. The list of contributing software tools goes on, but we’ll stop here.

It scales!

And there is something subtle, but important that gives advanced web technology extraordinary power: it scales. We manage shared resources that are truly gigantic in size, and are spread across countless machines around the world. We leverage global user bases, cheap server technology, and wide open Internet bandwidth to give media stores belonging to Web apps astonishing growth rates.

The bottom line.

Yep. Web 2.0/3.0, as a whole, is a truly radical advancement. It has fundamentally and globally changed society in a big way.



Apr 9 2009   3:28AM GMT

The Dublin Core and the Metadata Object Description Schema: a look at namespaces



Posted by: Roger “Buzz” King
Semantic Web, namespaces, Dublin Core, MODS, the Metadata Object Description Schema

Namespaces.

As we have seen, namespaces are a core element of the emerging Semantic Web. By posting namespaces on the Web, we can share precise vocabularies that will hopefully enable us to automate the process of searching the Web.

Searching with today’s search engines, like Google, is an inaccurate and highly iterative process. Searches are based on matching our search words with words in the documents that have been found and indexed in advance by the search engine. It can be a very painstaking process: we have to click on the URLs that are returned, and for each one, make a decision as to whether or not the page is relevant. We typically end up changing our search words gradually, as we hone our search criteria.

Namespaces are intended as a key element of a long term goal to make search engines of the future smarter. If the terms we used to formulate our searches came from widely-adopted, standardized namespaces, there would be far less painstaking iteration involved in finding the right webpages. We would accompany our search requests with links to the namespaces that define terms we are using. And in fact, searching would become at least partly automatic, with the browser able to narrow the set of returned URLs by making use of its knowledge of namespaces.

The Dublin Core.

Let’s take a look at one of the most widely known namespaces. It’s called the Dublin Core. But, as it turns out, it proved too simple and has since been eclipsed, at least in part, by a somewhat more sophisticated namespace called the Metadata Object Description Schema.

To get started, here’s another way to look at a namespace: it is used to create metadata that describes some data source. In particular, the Dublin Core was engineered to provide metadata for resources that can be found on the Web, including text-based documents, images, and video, and in particular, web pages. Want to know what a web page is all about? Look at its metadata, specified with the Dublin Core standard.

By the way, the namespace is named after Dublin, Ohio, not the other Dublin. The namespace was the result of a workshop held in Dublin in 1995. It is not an XML extension, like SMIL, the language used for building multimedia presentations. However, the Dublin Core can be used to create metadata for documents that are specified with XML or one of its many extensions.

So, what is in the Dublin Core? Basically it is a set of terms such as Contributor, Publisher, and Language. Some of the terms generally refer to very simple values, like Contributor, which is a person or organization that contributed to a document.

To look at one of the potentially more complex Dublin Core terms, Coverage can describe the 3D (x,y,z) coordinates, or the time period, or the nation referenced by the document being described. It could refer to all of these. Note that this is not the time the document was written, or where it was written. Coverage refers specifically to the content of the document itself.

So, if we tell a smart browser of the future to find all documents that pertain to the year 1865, it will not return documents that were written in 1865, but are about the year 1012.
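To make the Coverage-versus-date distinction concrete, here is a minimal Python sketch. The two records and their titles are made up for illustration, and the field names simply mirror Dublin Core terms; this is not how any real search engine stores metadata.

```python
# Hypothetical records: the keys mirror Dublin Core terms, the data is invented.
records = [
    {"title": "Appomattox Remembered", "date": "1990", "coverage": "1865"},
    {"title": "A Chronicle of 1012", "date": "1865", "coverage": "1012"},
]


def about_year(items, year):
    """Return titles whose Coverage (what the document is about) is the year."""
    return [r["title"] for r in items if r["coverage"] == year]


def written_in(items, year):
    """Return titles whose Date (when the document was created) is the year."""
    return [r["title"] for r in items if r["date"] == year]


print(about_year(records, "1865"))
print(written_in(records, "1865"))
```

A search on Coverage finds the document about 1865, while a search on the creation date finds the one about 1012 that happened to be written in 1865.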

One drawback of the Dublin Core is that it is very loosely defined. So, it often fails in its true purpose: to provide precisely-defined terms that all of us can use, and where we can be confident they will be uniformly interpreted.

A More Sophisticated Standard: MODS.

A newer proposed standard, called the Metadata Object Description Schema, or MODS, is an XML language that has been very actively promoted as a successor to the Dublin Core. MODS has more terms, and more precisely-defined terms. Since it leverages the ability of XML to express nested or embedded structures, it can convey much more information than a list of Dublin Core terms can convey.

Here’s a little piece of MODS:

<name type="personal">
  <namePart type="family">King</namePart>
  <namePart type="given">Bugs</namePart>
</name>

This only gives a hint of the rich metadata that can be specified by using MODS. (The MODS website provides some far more detailed examples.)

Still, compare this to the Dublin Core Contributor term, which might have the value “Bugs King”. Is this a human name? Is it a pest control company?

But - even though it seems like an odd name, in the MODS example, we know that this is a person who goes by the name Bugs King.
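Here is a rough sketch of how a program could take advantage of that structure, using Python's standard xml.etree.ElementTree module. The helper name describe_name is my own invention, and the fragment leaves out the namespace declaration a full MODS record would carry, so treat this as an illustration rather than a complete MODS reader.

```python
import xml.etree.ElementTree as ET

# The MODS fragment from above, without the namespace a full record would declare.
MODS_NAME = """<name type="personal">
  <namePart type="family">King</namePart>
  <namePart type="given">Bugs</namePart>
</name>"""


def describe_name(xml_text):
    """Pull the kind of name and its labeled parts out of a <name> element."""
    name = ET.fromstring(xml_text)
    parts = {part.get("type"): part.text for part in name.findall("namePart")}
    return {
        "kind": name.get("type"),
        "family": parts.get("family"),
        "given": parts.get("given"),
    }


print(describe_name(MODS_NAME))
```

Because the type attributes label each piece, the program never has to guess which word is the family name, or whether the name belongs to a person at all.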

Dublin Core might die and blow away - but it will always be recognized as a pivotal point in the development of the Semantic Web.



Mar 26 2009   11:31PM GMT

SQL and XML: declarative is exciting



Posted by: Roger “Buzz” King
namespaces, SMIL, XML

In the continuing series of blogs on the Semantic Web and other advanced web technology, we’ve looked at XML as a cornerstone of the technology that allows us to markup data, and in combination with namespaces, create powerful tools for sharing the meaning - and not just the structure - of data. There’s something special about XML that is at the core of its truly amazing widespread adoption, that explains its versatility as the language of choice for tagging data, no matter what the purpose.

What is it?

It’s that XML is “declarative”.

A declarative language is one that allows us to write programs that tell us what needs to be computed, not the order in which primitive operations need to be carried out in order to get the result. Java, C++, JavaScript, C#, Objective C, ActionScript, PHP - none of these are declarative.

Some Non-Declarative Code.

Here’s some code:

for (i = 0; i < 100; i++)
    stuff[i] = stuff[i] + 1;

It says to start i at 0, and for each value of i from 0 up through 99, add 1 to element i of an array called stuff, and then increment i.

This manipulating-an-array program is the classic piece of non-declarative code. It doesn’t just say to add one to every element in an array, it also tells the order in which to do it. This extra information shouldn’t really be needed, but in non-declarative languages - known as “imperative” languages - it is frequently necessary.
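As a small aside, here is the same job sketched in Python two ways. Python is an imperative language too, so the second form is only a step toward declarative style, but it shows the difference between spelling out the index bookkeeping and just describing the result. The variable names are mine.

```python
stuff = list(range(100))

# Imperative: spell out the index and the order in which elements are updated.
imperative = stuff[:]
for i in range(100):
    imperative[i] = imperative[i] + 1

# Closer to declarative: describe the new list and leave the iteration implicit.
declarative = [x + 1 for x in stuff]

print(imperative == declarative)
```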

Some Declarative Code.

Now, here is some declarative code. It’s SQL, the universal database language:

SELECT Firstname
FROM Clients
WHERE (Lastname = 'Smith') AND (City = 'Boulder') AND (Bday BETWEEN '2/10/1970' AND '2/10/1980')

Clients is a relational “table”, and Firstname, Lastname, City, and Bday are all “attributes” (or “columns”) of that table.

This piece of code gives us the first name of any client whose last name is Smith, and who is from Boulder, and was born between Feb 10 of 1970 and Feb 10 of 1980.

Notice that it tells the computer what data we want, and not the sequence of steps that must be carried out to return the value. We don’t know what order the rows in the table will be examined. We don’t know if the three conditions will all be checked at once, or if we will filter the table first by picking out all clients who are from Boulder.
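If you want to try a query like this yourself, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The four sample clients are invented, and I have assumed birthdays stored as ISO-style text (year-month-day) so that BETWEEN compares them correctly in SQLite; other database systems handle dates differently.

```python
import sqlite3

# An in-memory database with a made-up Clients table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Clients (Firstname TEXT, Lastname TEXT, City TEXT, Bday TEXT)"
)
conn.executemany(
    "INSERT INTO Clients VALUES (?, ?, ?, ?)",
    [
        ("Ann", "Smith", "Boulder", "1975-06-01"),
        ("Bob", "Smith", "Denver", "1975-06-01"),
        ("Cat", "Jones", "Boulder", "1972-03-15"),
        ("Dan", "Smith", "Boulder", "1985-01-20"),
    ],
)

# We state what we want; SQLite decides how to scan and filter the rows.
query = """
SELECT Firstname
FROM Clients
WHERE Lastname = ? AND City = ? AND Bday BETWEEN ? AND ?
"""
rows = conn.execute(
    query, ("Smith", "Boulder", "1970-02-10", "1980-02-10")
).fetchall()
first_names = [row[0] for row in rows]
print(first_names)
```

Nothing in the Python code says which row to look at first or which condition to test first. That is left to the database engine.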

XML is Declarative.

XML is a declarative language, too. Let’s look at it.

This is the XML from the previous posting of this blog.

<smil xmlns:qt="http://www.apple.com/quicktime/resources/smilextensions" qt:autoplay="true" qt:time-slider="true">
  <head>
    <meta name="title" content="Buzz's Video"/>
    <layout>
      <root-layout background-color="white" width="320" height="290"/>
      <region id="videoregion" top="0" left="0" width="320" height="290"/>
    </layout>
  </head>
  <body>
    <seq>
      <video src="http://files.me.com/kingbuzz/radljq.mov" region="videoregion"/>
      <video src="http://files.me.com/kingbuzz/radljq.mov" region="videoregion"/>
    </seq>
  </body>
</smil>

An XML program consists of “elements” and “attributes”. Notice <head> and </head> form the bounds for an element, as do <body> and </body>. Also note that there is an element nested within <head> and it’s marked by <layout> and </layout>. The tags <seq> and </seq> denote an element inside <body>.

The other major construct in XML is called an attribute, and name and content are two attributes with values “title” and “Buzz’s Video”, respectively. Attributes are always simple character values, and therefore cannot be nested.
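To see the nesting of elements and the flat attributes for yourself, here is a short Python sketch that walks the document with the standard xml.etree.ElementTree module. I have dropped the qt: namespace declaration from the copy below to keep the printed tag names plain, and the outline function is my own helper, not part of any SMIL tool.

```python
import xml.etree.ElementTree as ET

# The SMIL program from above, minus the qt: namespace declaration.
SMIL = """<smil>
  <head>
    <meta name="title" content="Buzz's Video"/>
    <layout>
      <root-layout background-color="white" width="320" height="290"/>
      <region id="videoregion" top="0" left="0" width="320" height="290"/>
    </layout>
  </head>
  <body>
    <seq>
      <video src="http://files.me.com/kingbuzz/radljq.mov" region="videoregion"/>
      <video src="http://files.me.com/kingbuzz/radljq.mov" region="videoregion"/>
    </seq>
  </body>
</smil>"""


def outline(element, depth=0):
    """List each element, indented by nesting depth, with its attributes."""
    lines = ["  " * depth + element.tag + " " + str(dict(element.attrib))]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(SMIL)
tree_lines = outline(root)
print("\n".join(tree_lines))
```

The indentation in the output mirrors the nesting of elements, while each element's attributes show up as a simple, flat set of name and value pairs.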

When this program is saved with the name buzz.smil, and then run by Quicktime, it will download a video from my website (a very nice piece of animation by a student named Jochen Wendel), and then play it twice in succession. See the previous blog for more of an explanation of how SMIL works. It also discusses the difference between XML and its extensions, such as SMIL.

Note that these tags are not part of XML itself; rather they are part of the namespace that has been defined for the SMIL extension of XML. This illustrates the power of XML: it can be used to define other languages.

To understand the XML above, all that is needed is a program that knows how to interpret XML that contains the SMIL tags. (The URL listed at the beginning of the code identifies the namespace for the qt: attributes, which are Apple’s Quicktime extensions to SMIL.) In this case, the XML defines a layout for the screen, and says that a video should be played twice, sequentially. Quicktime has been programmed to understand the elements and attributes of the SMIL XML language.

Going from “sequential” to “parallel”.

To make our point stronger, here’s a variation. Instead of playing the two videos sequentially, I am using the <par> and </par> tags that represent “parallel” in the SMIL namespace. I have also made the layout area twice as big, and broken it up into two regions. Now, the program plays the video twice, side-by-side, one in each region. At the bottom of this blog entry is what you should see if you save it as buzz.smil and run it with Quicktime. There is also a nice soundtrack.

<smil xmlns:qt="http://www.apple.com/quicktime/resources/smilextensions" qt:autoplay="true" qt:time-slider="true">
  <head>
    <meta name="title" content="Buzz's Video"/>
    <layout>
      <root-layout background-color="white" width="640" height="290"/>
      <region id="videoregion" top="0" left="0" width="320" height="290"/>
      <region id="videoregion2" top="0" left="320" width="320" height="290"/>
    </layout>
  </head>
  <body>
    <par>
      <video src="http://files.me.com/kingbuzz/radljq.mov" region="videoregion"/>
      <video src="http://files.me.com/kingbuzz/radljq.mov" region="videoregion2"/>
    </par>
  </body>
</smil>

IMPORTANT.

Notice that this program, written in the SMIL extension of XML, is quite declarative: it says to create a layout, break it into two regions, and then place the animation (a .mov video) in both regions, in parallel. It does not say how to do this. The program doesn’t specify the sequence of steps that are needed to get the job done - rather, what the result should look like.
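To get a feel for the “how” that a player has to work out on its own, here is a toy Python sketch that turns a seq or par element into start times. This is emphatically not how Quicktime is implemented; the schedule function and the 30-second clip length are assumptions I made up for illustration.

```python
import xml.etree.ElementTree as ET


def schedule(container_xml, clip_seconds):
    """Return (region, start_time) pairs for each <video> in a <seq> or <par>."""
    container = ET.fromstring(container_xml)
    starts = []
    clock = 0.0
    for video in container.findall("video"):
        starts.append((video.get("region"), clock))
        if container.tag == "seq":
            # In a sequence, the next clip waits for this one to finish.
            clock += clip_seconds
        # In a <par>, the clock never advances, so every clip starts at 0.
    return starts


SEQ = '<seq><video region="videoregion"/><video region="videoregion"/></seq>'
PAR = '<par><video region="videoregion"/><video region="videoregion2"/></par>'

print(schedule(SEQ, 30.0))
print(schedule(PAR, 30.0))
```

The SMIL author changes one tag; the player is the one that has to turn that into a different timeline.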

This makes XML programs far easier to read than programs in imperative languages, thus making the programs easier for a programmer to write, and easier for another programmer to read and perhaps change later on. This makes programs in XML far more likely to be written correctly and then used appropriately.

We’ll look at declarative languages again, in future entries of this blog.

An SMIL XML program that plays a video twice, in parallel.


Mar 21 2009   3:17AM GMT

XML and its powerful children



Posted by: Roger “Buzz” King
XML, SMIL, namespaces, Quicktime, the Semantic Web

A key purpose of this blog is to provide a continuing examination of the Semantic Web - and certainly one of the most critical technologies to discuss is XML. Why is it so important?

First of all, just what is XML?

XML stands for eXtensible Markup Language, and the extensible part is the key to its power.

Markup Languages.

Let’s step back though and look at the markup part first. “Markup” refers to the process of embedding commands in data. HTML is a markup language. When a browser fetches a web page from a web server, it processes the text-based HTML “markups” that appear in the page in order to present the page to us.

HTML: a Markup Language.

Importantly, HTML is focused on the visual appearance of information. It controls the layout of web pages, including “controls” such as menus and buttons. It also allows us to link pages together. One of the biggest jobs of HTML is to tell the browser how to lay out pieces of text, such as the descriptions of books sold by Amazon.

HTML has a fixed set of legal tags. Here is a sample HTML file:

<html>
<body>
<h1> This is a heading </h1>
</body>
</html>

Notice that every tag comes in pairs, one with a “<>” and the other with a “</>”.

This HTML opens by telling us that it is an html file. Then it says there is a body to the file, and that there is a heading to be printed. This file will print the words “This is a heading”.

The important point, though is that these tags - html, body, and h1 - are HTML specific tags, and we cannot invent our own.

XML: a Far More Powerful Markup Language.

Now, let’s see what happens when we can invent our own tags.

XML is also a markup language. It was developed as a way to embed markups in data, so that the meaning of information can be communicated. In order to do this, XML allows us to do something we cannot do with HTML: we can specify our own “tags” so that we can add a lot more expressive power to our markups. There are two particularly critical aspects of XML tags.

The first is that there are two main sorts of tags: “elements” and “attributes”. Elements can have complex structure, and in fact, we can embed elements inside elements. Attributes are simple values and have no internal structure.

The second critical thing is that we can use words taken from shared namespaces as values in tags in XML. This gives XML the power of shared, detailed terminologies that are available globally via the web.

The real power of XML is this: we can produce our own extensions of XML by defining our own tags. Each of these extensions is itself a complete markup language.

This is why it is such a critical part of Semantic Web technology: we can use it to capture the meaning (or “semantics”) of data so that it can be processed automatically. HTML controls the way a page is displayed only, and we have to use our eyes and minds to interactively interpret this information. But XML can be interpreted by a program, thus allowing powerful, automatic searching of the web.
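As a quick illustration of a program interpreting tags that we invented ourselves, here is a Python sketch. The catalog, book, title, and price tags, along with the titles and prices, are all made up for this example and do not come from any standard.

```python
import xml.etree.ElementTree as ET

# A made-up vocabulary: none of these tags come from a standard.
CATALOG = """<catalog>
  <book isbn="0000000000">
    <title>An Invented Title</title>
    <price currency="USD">12.50</price>
  </book>
  <book isbn="1111111111">
    <title>Another Invented Title</title>
    <price currency="USD">30.00</price>
  </book>
</catalog>"""


def titles_under(xml_text, limit):
    """Return the titles of books whose price is below the limit."""
    root = ET.fromstring(xml_text)
    return [
        book.findtext("title")
        for book in root.findall("book")
        if float(book.findtext("price")) < limit
    ]


cheap = titles_under(CATALOG, 20.0)
print(cheap)
```

With HTML, the price would just be some text on a page. Here the tag tells the program which number is the price, so no human has to read the page to find out.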

An Example of an XML Extension.

Let’s look at an XML-based language, in particular, at its use of elements, attributes, and values from a shared namespace.

Below is a piece of code written in an XML extended language called SMIL. SMIL allows us to create multimedia presentations, with various pieces of media laid out on the display, as well as being sequenced in time.  (SMIL stands for Synchronized Multimedia Integration Language.)

First, let’s start with the core of a SMIL program:

<smil>
<head>
<layout>

… here is where we put commands that control the visual layout of the page we are constructing with SMIL …

</layout>
</head>
<body>

… this is where we put the core of our SMIL program, the part that specifies the multimedia presentation that is to appear in the page …

</body>

</smil>

Here is the entire program, fleshed out:

<smil xmlns:qt="http://www.apple.com/quicktime/resources/smilextensions" qt:autoplay="true" qt:time-slider="true">
  <head>
    <meta name="title" content="Buzz's Video"/>
    <layout>
      <root-layout background-color="white" width="320" height="290"/>
      <region id="videoregion" top="0" left="0" width="320" height="290"/>
    </layout>
  </head>
  <body>
    <seq>
      <video src="http://files.me.com/kingbuzz/radljq.mov" region="videoregion"/>
      <video src="http://files.me.com/kingbuzz/radljq.mov" region="videoregion"/>
    </seq>
  </body>
</smil>

We don’t need to worry about the specifics of this code. The values of attributes are in quotes, and the values of elements are inside <> and </>. So, background-color is an attribute, and video is an element.

Let’s look at the beginning of this program:

<smil xmlns:qt="http://www.apple.com/quicktime/resources/smilextensions" qt:autoplay="true" qt:time-slider="true">

This code declares a namespace. That’s what xmlns stands for, by the way: XML namespace - i.e., a named set of attribute and element tags. Strictly speaking, the URL here names Apple’s Quicktime extensions to SMIL, which is where the qt:autoplay and qt:time-slider attributes come from; the other tags, like seq and video, belong to SMIL itself. Quicktime, which can play SMIL files, recognizes these tags and knows how to interpret them.
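Here is a small Python sketch of what a generic XML parser does with that opening tag. It uses the standard xml.etree.ElementTree module; the QT variable name is mine, and I have shortened the document to the opening tag plus an empty body.

```python
import xml.etree.ElementTree as ET

QT = "http://www.apple.com/quicktime/resources/smilextensions"
doc = (
    '<smil xmlns:qt="' + QT + '" qt:autoplay="true" qt:time-slider="true">'
    "<body/></smil>"
)

root = ET.fromstring(doc)

# The parser replaces the qt: prefix with the full namespace name.
print(root.attrib)
autoplay = root.get("{" + QT + "}autoplay")
print(autoplay)
```

Notice that the parser never visits the URL. It simply uses it as a globally unique name, so that an autoplay attribute from Apple's vocabulary cannot be confused with an autoplay attribute from someone else's.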

To see this, do this: Download Quicktime from Apple, if you don’t have it. Then put the above program in a file called buzz.smil. Then open buzz.smil with Quicktime.

Quicktime will read the file, recognize the namespace, then read the tags inside the SMIL program and interpret them. (The namespace URL serves as an identifier; a player does not need to download anything from it.) This will direct it to download a video from my site - an excellent piece of animation built by one of my Intro to 3D Animation students, named Jochen Wendel. And in fact, Quicktime will play it twice - that’s what the <seq> </seq> tags mean: play the enclosed items in sequence, and the video is listed twice.

The Exciting Part.

Do you see what happened? We used a predefined namespace belonging to the SMIL extension of XML to write a program that can find a video, download it, and play it twice!

Why do we care?  It’s not that building a language to play various pieces of media, like Jochen’s animation, is a big deal in itself.  It’s that XML is extremely versatile.  By defining a set of tags and then sharing them, we can embed within information the means for interpreting it - and thereby create an endless array of powerful languages.

This is very important. XML and its powerful children (such as SMIL) are changing the web in a big way.

There’s something  else, too, something that is equally important.  XML is declarative. We’ll look at this in another blog soon, but essentially it means that an XML language like SMIL is easier to read than imperative code, like Java or C.  Look at my SMIL program, and then look at a Java program.  Which is easier to understand?  We’ll get to this.