Relational database management systems, such as MySQL, Oracle, MS SQL Server, DB2, and PostgreSQL, support the relational model. A database is broken up into tables, and each table consists of rows. Each row is a series of values. A row in a table called Insured Drivers in a motor vehicle database might consist of:
Fred, 2010 Toyota Prius, State Farm Insurance, 1112233444.
1112233444 might be a unique identifier that the government assigns to each driver. This would be the “primary key” for the table Insured Drivers. The point is that human names are not at all unique, and so in relational databases, we introduce artificial keys in order to disambiguate queries. We still need the value Fred in the row because we want to know how to address him with a letter or email.
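The idea of an artificial primary key can be sketched in a few lines of Python using the standard library's sqlite3 module. The table and column names here are illustrative, not taken from any real schema:

```python
import sqlite3

# In-memory database; table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE insured_drivers (
        driver_id TEXT PRIMARY KEY,   -- government-assigned unique identifier
        name      TEXT,               -- human names are not unique
        vehicle   TEXT,
        insurer   TEXT
    )
""")
conn.execute("INSERT INTO insured_drivers VALUES (?, ?, ?, ?)",
             ("1112233444", "Fred", "2010 Toyota Prius", "State Farm Insurance"))

# The primary key disambiguates: there may be many Freds,
# but only one driver 1112233444.
row = conn.execute("SELECT name, vehicle FROM insured_drivers WHERE driver_id = ?",
                   ("1112233444",)).fetchone()
print(row)  # ('Fred', '2010 Toyota Prius')
```

Note that we still keep the name column even though we never use it as a key, exactly as described above: it is data we want, just not an identifier we can trust.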
Problems with relational databases.
There are a few critical points to note with this approach. First, such a simple way of representing data allows the database to quickly deliver large sets of rows from this table to the memory of a computer, so that they can be effectively searched in bulk. We might want to know the names of all people who drive a Toyota Prius and are insured by State Farm, for example.
Second, we might like to be able to put more complex items in a row. We might want another value in a row, one that gives a driver’s address. But an address has a few parts to it, and is not itself a simple value like a name, a car model, or the name of an insurance company.
It is important to also note, however, that relational databases do indeed support the creation of more complex values, such as an address. But the more complex values we put in rows in tables, the harder it is to read in a large number of rows at once.
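One common compromise is to serialize a structured value like an address into a single column, for instance as JSON text. This sketch (with invented field names) shows both the convenience and the cost: every row read now carries an extra parsing step:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE insured_drivers (
        driver_id TEXT PRIMARY KEY,
        name      TEXT,
        address   TEXT   -- structured value stored as serialized JSON
    )
""")

# The address has internal structure, so we flatten it into one column.
address = {"street": "12 Elm St", "city": "Springfield", "zip": "01101"}
conn.execute("INSERT INTO insured_drivers VALUES (?, ?, ?)",
             ("1112233444", "Fred", json.dumps(address)))

# Reading it back requires deserializing per row -- the overhead the text describes.
raw = conn.execute("SELECT address FROM insured_drivers WHERE driver_id = ?",
                   ("1112233444",)).fetchone()[0]
print(json.loads(raw)["city"])  # Springfield
```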
In fact, we could create a value that represents a very complex object, one that refers to rows in other tables. For example, we might want to replace the value Fred with a reference to a row in another table called Licensed Drivers, because there is a lot we might want to know about Fred, other than just his name. But then it would become very difficult to read in lots of rows of a single table quickly.
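Replacing the value Fred with a reference to a row in a Licensed Drivers table looks like this in a sqlite3 sketch (the extra columns, like birthdate and license class, are hypothetical). Following the link becomes a join, which buys us more information per driver but costs more work per row than scanning a single table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE licensed_drivers (
        driver_id TEXT PRIMARY KEY,
        name      TEXT,
        birthdate TEXT,
        license_class TEXT
    );
    CREATE TABLE insured_drivers (
        driver_id TEXT REFERENCES licensed_drivers(driver_id),  -- a link, not a name
        vehicle   TEXT,
        insurer   TEXT
    );
    INSERT INTO licensed_drivers VALUES ('1112233444', 'Fred', '1975-04-02', 'C');
    INSERT INTO insured_drivers  VALUES ('1112233444', '2010 Toyota Prius',
                                         'State Farm Insurance');
""")

# Chasing the reference means a join across tables.
row = conn.execute("""
    SELECT l.name, l.license_class, i.vehicle
    FROM insured_drivers i JOIN licensed_drivers l USING (driver_id)
""").fetchone()
print(row)  # ('Fred', 'C', '2010 Toyota Prius')
```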
It might be that if we follow a link to another table that describes drivers, these rows might themselves have links in them, thus allowing a value in a row to actually consist of an object, like we would in Java or C++. And in general, these links between tables could be chained together, and extend arbitrarily far. Do we chase all of these linked references down for every row of Insured Drivers, or do we not follow any of these links so we can read in a large number of rows? Then we would worry later about getting more information on each driver.
Importantly, relational databases are still very much the dominant database technology in use in businesses and other organizations, as well as on the Web. We need to keep in mind that we have already aggressively extended them by supporting values that have internal structure (like addresses) and with the ability to create complex objects (like drivers). How far do we go in extending them?
Where we stand today.
Indeed, the extensions we have already made to relational databases have created a serious optimization problem.
But it’s worse than that. Here’s something else to consider. Relational databases were born into a world where flat business data was pretty much the only game in town. However, relational databases are being asked to manage far more sophisticated forms of data, like photos and video clips and voice tracks. There are a couple of problems that crop up. First, a row with a video clip as a field could be huge. We might only be able to read in a single row at a time and this could make searching an entire table intractable. Worse, how do we even search for rows that contain certain pieces of video? How can we search for all video clips that show Fred getting into a car accident?
Where to go from here.
In previous postings of this blog we have looked at media databases, and in particular, at techniques that can be used to tag complex forms of blob and continuous media (like photos and video clips). What’s important to note, though, is that there is a major dilemma right now in the world of database software. Can we continue to shoehorn more and more complex forms of data into relational databases, or do we need to throw in the towel and start over?
More on this next time…
Metadata: making that ratio small.
Here’s something that’s very important: Much of the ongoing research and development that is loosely categorized as Semantic Web and Web 3.0 efforts is focused on a specific technical goal, one that has been at the core of information management technology since the mainframe era that was epitomized by the IBM 360 series. That goal is to leverage metadata as much as possible.
It’s our best weapon against the truly staggering amount of information on the Web. This includes traditional text-based and numeric data, as well as books, medical advice, photographs, entertainment and training videos, music and recorded books, investment information, educational materials, scientific materials, e-government information, etc., etc. How can we possibly organize information and then search it in a way that scales? The Web is far from a closed world. In traditional data processing environments like banking, insurance, and credit card processing, we could get our arms around all of the data, as vast as it may have seemed. But the world of information today is an open world, effectively infinite in size.
Very informally, if you look at the size of the metadata divided by the size of the data itself, the smaller that fraction, the better. In traditional relational databases (built with database management systems, such as Oracle, MS SQL Server, MySQL, PostgreSQL, or DB2), the extreme focus on minimizing this ratio has enabled the fast processing of extremely large volumes of data. The tradeoff is that the table definitions (or the “schema”), which form the heart of the metadata, are very, very simplistic.
The old days: relational database schemas.
An insurance claim may be defined as a table with columns such as Subscriber_Name, Medical_Provider, etc., and thus may consist of little more than a series of simple character and numeric fields. But if we need to process fifty thousand of them tonight, we must be able to bring many such table rows into memory at once, and quickly move through them. The database world was an extension of the paper world: a row in an insurance claim table was effectively an electronic successor to the traditional claim form.
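The "fifty thousand claims tonight" workload can be sketched with sqlite3: because each row is just a few simple fields, we can stream rows through memory in large batches. The column names and amounts are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE claims (
        claim_id         INTEGER PRIMARY KEY,
        subscriber_name  TEXT,
        medical_provider TEXT,
        amount           REAL
    )
""")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?, ?)",
    [(i, f"Subscriber {i}", f"Provider {i % 7}", 100.0 + i)
     for i in range(50_000)])

# Because rows are flat and small, we can pull them in big batches
# and sweep through the whole table quickly.
total = 0.0
cur = conn.execute("SELECT amount FROM claims")
while True:
    batch = cur.fetchmany(10_000)
    if not batch:
        break
    total += sum(amount for (amount,) in batch)
print(round(total))
```

This bulk-scan pattern is exactly what becomes intractable once each row carries a video clip instead of a few character and numeric fields.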
Today: a far more challenging problem.
But on the new Web, information can be far more complex in nature, making the metadata-to-data ratio far larger. We’ve looked at some of the emerging technology and technical trends for embedding metadata in advanced forms of data (and for processing that metadata); this data includes books, images, video, modeling and animation, and sound. This new generation of information formats makes up our personal health records and medical images, industrial training materials, university “distance” courses, and the like. Each instance of these tends to be far more distinctive than an individual insurance claim form. And it takes a lot of metadata to properly convey their “meaning”.
What we’re struggling with right now is to succinctly specify the meaning of modern media assets and to automate searching based on this metadata. This is our only hope for leveraging that ratio of metadata size divided by data size.
The Semantic Web – a primary topic of this continuing blog series – will help us search the web with greater ease. One of the things it will (hopefully) do is expose a vast sea of information that is currently invisible to our web browsers. In fact, some say that right now, we can see less than 1% of what’s out there. I cannot vouch for this number, but I can say that what we cannot see right now includes large volumes of extremely valuable data.
Perhaps you have heard of the mysterious “Hidden Web”? So, what is this stuff and where is it?
Forms, Databases, and Interactive Interfaces.
The Hidden Web refers to data that is out there on the web, publicly accessible – but only via webpage interfaces that are opaque to the indexing software of search engines like Google.
Let’s step back for a moment.
The way search engines work, in case you don’t know, is by constantly crawling the web, looking for new webpages. When a new page is found, it is added to the search engine’s index, meaning that now, when people search the web with Google, they might get the URL for that page in their search results.
The important thing to note is that the primary source of information that Google uses when it indexes a page is the page itself. What words are on it?
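The indexing idea can be shown with a toy inverted index: each word on a page maps to the set of pages containing it. The URLs and page text here are made up:

```python
from collections import defaultdict

# Toy corpus: URL -> the words on the page (the crawler's primary source).
pages = {
    "http://example.com/a": "toyota prius hybrid cars",
    "http://example.com/b": "insurance claims for cars",
    "http://example.com/c": "recorded books and music",
}

# Build an inverted index: word -> set of URLs containing that word.
index = defaultdict(set)
for url, text in pages.items():
    for word in text.split():
        index[word].add(url)

def search(query):
    """A search is then a lookup: intersect the URL sets for each query word."""
    sets = [index.get(word, set()) for word in query.split()]
    return set.intersection(*sets) if sets else set()

print(search("cars"))
```

Real engines add ranking, stemming, link analysis, and much more, but the core dependence on the words of the page is the same, and it is exactly what dynamic pages break.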
This sounds great for static webpages that are stored as-is on websites and delivered as-is to the Google user.
But suppose we want Google to find dynamic pages? A typical dynamic page has content that isn’t known until an interactive user types some words into a web “form”. A web form is a page where the browser user fills in blanks and then lets the browser send the completed page back to the server. There, the information in the form is used to select other information, which is plugged into a “dynamically” created page that is sent to the client machine and viewed by the browser user.
So, I might visit Amazon. I navigate to their search page, which is a form, and I type in the title of the book I want. That information goes back to the server. A description of this book, including its cost, is plugged into a dynamically created page, which is then downloaded to my machine so that I can read the material with my browser.
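The server-side step in that round trip can be sketched as a single function: form fields come in, data is looked up, and a page that never existed as a file is generated. The book catalog and prices here are invented:

```python
from urllib.parse import parse_qs

# A stand-in for the server's database of books (titles and prices invented).
BOOKS = {"Moby-Dick": 9.99, "Walden": 7.50}

def render_page(form_body: str) -> str:
    """Turn a submitted form body like 'title=Walden' into a dynamic HTML page."""
    fields = parse_qs(form_body)
    title = fields.get("title", ["?"])[0]
    price = BOOKS.get(title)
    if price is None:
        return f"<html><body>No results for {title}</body></html>"
    # The looked-up data is "plugged into" a page built only for this request.
    return f"<html><body>{title}: ${price:.2f}</body></html>"

print(render_page("title=Walden"))
```

The page exists only for the duration of the request, which is precisely why a crawler that reads stored files never sees it.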
Indexing Dynamic Pages.
So, if I have information that is not sitting in static pages, how can I get Google to index this information? There are multiple ways. For example, if the primary job of your website is to create large volumes of dynamically created pages, you might want to create a special directory page for your site – a static page – loaded with all the right words, and that contains links to the pages and forms you want the user to discover.
On the future Semantic Web, you might want to make sure that those magic words come at least in part from globally accessible namespaces, so that people who are using next-generation browsers, and who will be using these namespaces as a source of search keywords, will find your static page. As we have discussed, namespaces will provide us with detailed sets of terms, which will be tied to specific domains. This will make the search for static pages far more efficient than it is now.
As an example, a namespace concerning books might have terms like ISBN-10 and ISBN-13. If the web designer uses these terms to describe static pages about books, and if the user of the browser can specify that they are looking for ISBN numbers, the browser will have a much more detailed idea of what is meant by those 10- and 13-digit numbers the user types in.
Here’s the critical part. Right now, Amazon lets you search by these numbers on their specialized web form page, but imagine if you could at any time tell your browser to look for ISBN numbers on whatever webpages it searches.
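A browser that hunts for ISBNs on arbitrary pages might start with something like the naive pattern below. This only matches the visual shape of an ISBN; real validation would also verify the check digit, which is omitted here:

```python
import re

# Naive patterns for ISBN-10 and ISBN-13 as they commonly appear in text.
# This matches the shape only; it does not verify the check digit.
ISBN_RE = re.compile(
    r"\b(?:97[89][- ]?)?"                       # optional ISBN-13 prefix
    r"\d{1,5}[- ]?\d{1,7}[- ]?\d{1,6}[- ]?[\dX]\b"
)

page_text = "Two editions: 0-306-40615-2 (paper) and 978-0-306-40615-7 (cloth)."
print(ISBN_RE.findall(page_text))
```

The point of a shared namespace is that the browser would not have to guess like this: a page labeled with an ISBN term tells it exactly which numbers are identifiers.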
An example of a namespace that is used to describe documents on the web is the Dublin Core, by the way.
So, that’s one way to make your dynamic pages somewhat visible: create a static web page that leads to the pages you want users to see, and, to make it all the more powerful, use terms from a globally accepted namespace like the Dublin Core. This is something that is already partly doable; the Dublin Core, along with other namespaces, is already in wide use.
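Dublin Core terms are conventionally embedded in a page's head as meta tags with names like DC.title and DC.creator. Here is a sketch, using the standard library's html.parser, of how software could pull those terms out; the sample page and its values are invented:

```python
from html.parser import HTMLParser

SAMPLE = """
<html><head>
  <meta name="DC.title" content="Walden">
  <meta name="DC.creator" content="Henry David Thoreau">
  <meta name="DC.identifier" content="ISBN 0-691-09612-X">
</head><body>...</body></html>
"""

class DublinCoreParser(HTMLParser):
    """Collect meta tags whose name starts with 'DC.' (a common DC convention)."""
    def __init__(self):
        super().__init__()
        self.terms = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            name = d.get("name", "")
            if name.startswith("DC."):
                self.terms[name] = d.get("content", "")

parser = DublinCoreParser()
parser.feed(SAMPLE)
print(parser.terms["DC.creator"])
```

A crawler that understands these terms knows not just the words on the page but what roles they play: a title, an author, an identifier.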
Where Does that Information Come From?
Is there a better way, though? The technique above only points users to our static directory page, which in turn lets them find our web forms; they must then fill in those forms to get detailed data. Could the searching of dynamic pages be made more automatic?
Well, where does data in dynamic pages come from? Often from large databases built with such database management systems as Oracle, SQL Server, MySQL, PostgreSQL, and DB2. This is why some folks conjecture that the amount of information in the Hidden Web is vastly bigger than the web we see today. Databases can be BIG.
Imagine that all the information on the ancient Pharaohs, genetic diseases, investments, philosophy, and countless other topics is sitting inside databases that right now are only accessible via web forms. Right now, we Google keywords like “pharaoh” and the first things we see are static, highly condensed Wikipedia pages, and perhaps some static pages posted by museums and academics.
What Will the Semantic Web Do?
A primary challenge for the Semantic Web will be letting us ask for information and know that the search space includes information tucked away in databases dotted all around the globe.
This is a very complex problem. Right now, we need a human sitting at the keyboard of the client machine to navigate to the correct URL and then type terms into a web form. In the future, web designers will need ways of capturing information about what is contained in databases, and of specifying that information in a fashion that browsers can access. And this information will have to be very detailed, sometimes very intricate.
The browser will also have to take information specified by the user and match it up with the information that describes databases on the web. This means that we will need some automatic way to search databases without a user interactively and incrementally screening tens or hundreds or thousands of URLs. In an earlier blog posting in this series we described one possible technique called “triples” that might, combined with namespaces, provide a partial solution to this problem.
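The triples idea can be illustrated with a tiny in-memory store: every statement is a (subject, predicate, object) tuple, and a query is a pattern with wildcards. The terms and namespace-style prefixes here are illustrative, not from any real vocabulary:

```python
# A triple is a (subject, predicate, object) statement.
# Subjects and predicates use invented namespace-style prefixes.
triples = [
    ("book:42", "dc:title",      "Walden"),
    ("book:42", "dc:creator",    "Henry David Thoreau"),
    ("book:42", "dc:identifier", "ISBN 0-691-09612-X"),
    ("book:7",  "dc:creator",    "Herman Melville"),
]

def match(s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "What did Thoreau write?" -- find matching subjects, then ask for their titles.
for subj, _, _ in match(p="dc:creator", o="Henry David Thoreau"):
    print(match(s=subj, p="dc:title"))
```

Because the predicates can be drawn from shared namespaces, two independent databases that both say dc:creator mean the same thing by it, and that is what would let a browser query them without a human screening URLs.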
We will look at this again, more closely, in a future blog posting.