Posted by: Roger King
databases, DB2, dynamic pages, hidden web content, indexing, MySQL, namespaces, Oracle, PostgreSQL, SQL Server, static pages, the Semantic Web, triples
The Hidden Web.
The Semantic Web – a primary topic of this continuing blog series – will help us search the web with greater ease. One of the things it will (hopefully) do is expose a vast sea of information that is currently invisible to our web browsers. In fact, some say that right now, we can see less than 1% of what’s out there. I cannot vouch for this number, but I can say that what we cannot see right now includes large volumes of extremely valuable data.
Perhaps you have heard of the mysterious “Hidden Web”? So, what is this stuff and where is it?
Forms, Databases, and Interactive Interfaces.
The Hidden Web refers to data that is out there on the web, publicly accessible – but only via webpage interfaces that are opaque to the indexing software of search engines like Google.
Let’s step back for a moment.
The way search engines work, in case you don’t know, is by constantly searching the web, looking for new webpages. When a new page is found, it is added to the search engines index, meaning that now, when people search the web with Google, they might get the URL for that page in their search results.
The important thing to note is that the primary source of information that Google uses when it indexes a page is the page itself. What words are on it?
This sounds great for static webpages that are stored as-is on websites and delivered as-is to the Google user.
But suppose we want Google to find dynamic pages? A typical dynamic page has content that isn’t known until an interactive user types some words into a web “form”. A web form is a page where the browser user fills in blanks and then lets the browser send the completed page back to the server. There, the information in the form is used to select other information, which is plugged into a “dynamically” created page that is sent to the client machine and viewed by the browser user.
So, I might visit Amazon. I navigate to their search page, which is a form, and I type in the title of the book I want. That information goes back to the server. A description of this book, including its cost, is plugged into a dynamically created page, which is then downloaded to my machine so that I can read the material with my browser.
Indexing Dynamic Pages.
So, if I have information that is not sitting in static pages, how can I get Google to index this information? There are multiple ways. For example, if the primary job of your website is to create large volumes of dynamically created pages, you might want to create a special directory page for your site – a static page – loaded with all the right words, and that contains links to the pages and forms you want the user to discover.
On the future Semantic Web, you might want to make sure that those magic words come at least in part from globally accessible namespaces, so that people who are using next-generation browsers, and who will be using these namespaces as a source of search keywords, will find your static page. As we have discussed, namespaces will provide us with detailed sets of terms, which will be tied to specific domains. This will make the search for static pages far more efficient than it is now.
As an example, a namespace concerning books might have words like ISBN-10 and ISBN-13. If the web designer uses these terms to describe static pages about books, and if the user of the browser can specify that they are looking for ISBN numbers, the browser will have a much more detailed idea of what is meant by those 10 and 13 digit numbers the user types in.
Here’s the critical part. Right now, Amazon lets you search by the these numbers on their specialized web form page, but imagine if you could at any time tell your browser to look for ISBN numbers on whatever webpages it searches.
An example of a namespace that is used to describe documents on the web is the Dublin Core, by the way.
So, that’s one way to make your dynamic pages somewhat visible. Create a web page that is static and leads to the pages you want users to see, and to make it all the more powerful, use terms from a globally accepted namespace like the Dublin Core. This is something that is already partly doable. The Dublin Core, along with other namespaces, are in wide use.
Where Does that Information Come From?
Is there a better way, though? This technique will only point users to our static web directory, which will then enable interactive users to find our web forms. The users must then use our forms to get detailed data. Could the searching for dynamic pages be made more automatic?
Well, where does data in dynamic pages come from? Often from large databases built with such database management systems as Oracle, SQL Server, MySQL, PostgreSQL, and DB2. This is why some folks conjecture that the amount of information in the Hidden Web is vastly bigger than the web we see today. Databases can be BIG.
Imagine all the information on the ancient Pharaohs, genetic diseases, investments, philosophy, and countless other topics is sitting inside databases that right now are only accessible via web forms. Right now, we Google keywords like “pharaoh” and the first things we see are static, highly condensed Wikipedia pages, and perhaps some static pages posted by museums and academics.
What Will the Semantic Web Do?
The Semantic Web will have as a primary challenge the ability for us to ask for information, and know that the search space will contain information tucked away in databases dotted all around the globe.
This is a very complex problem. Right now, we need a human sitting at the keyboard of the client machine to navigate to the correct URL and then type terms into a web form. In the future, web designers will need ways of capturing information about what is contained in databases, and to specify that information in a fashion that browsers can access. And this information will have to be very detailed, sometimes very intricate.
The browser will also have to take information specified by the user and match it up with the information that describes databases on the web. This means that we will need some automatic way to search databases without a user interactively and incrementally screening tens or hundreds or thousands of URLs. In an earlier blog posting in this series we described one possible technique called “triples” that might, combined with namespaces, provide a partial solution to this problem.
We will look at this again, more closely, in a future blog posting.