Hidden Web Content archives - Buzz’s Blog: On Web 3.0 and the Semantic Web

Buzz’s Blog: On Web 3.0 and the Semantic Web:

hidden web content

Sep 18 2009   3:02AM GMT

Dynamic pages, hidden data, and infered information: the danger of scale.



Posted by: Roger “Buzz” King
assertions, databases, dynamic pages, hidden web content, inferences, next generation search engines, Semantic Web, smart search engines, static pages, triples

The good and bad sides of the powerful Semantic Web.

So what happens when the Semantic Web is here? It’s supposed to largely automate the process of searching the Web by allowing us to attach machine-readable assertions (perhaps by using RDF) to information posted on the Web. Then, instead of us poor flailing humans having to painstakingly chase down countless URLs until we get what we want, smart search engines would be able to find precisely what we want in a single shot.

There is an obvious danger to all of this. The new Web will scale, in both good ways and bad. I am certainly not the first person to point out that the smarter the Web, the easier it will be for software to peruse the Web and dig up personal information about us. There will be software that carefully crafts ads in Spam mail that will target our vulnerabilities and our preferences. Websites will dynamically create webpages that target us individually, as well. When we shop online, when we read news, when we make social connections online, the Web will be disarmingly efficient and effective, and this leaves lots of room for fraud and manipulation.

This is already happening to a significant degree, and most of us are aware of it.

The no-longer-hidden database factor.

There is something more subtle about all of this, however. One of the most difficult things to do with traditional Web technology is to expose the content of databases to Web visitors. That’s because the pages that deliver up content pulled from databases are highly dynamic in nature, and so it is very hard for web designers to make search engines (like Google) find and index the content of these databases. There are simple and somewhat effective things web designers can do, like creating static pages that contain terms that are meant to draw web visitors to their sites. These pages are not “destination” pages; rather, they exist only as a way of advertising the information contained in databases.

In the future, RDF assertions (and other machine-readable content) will be added to websites, and they will server as far more effective draws.

But what about privacy? Will web designers inadvertently facilitate fraud and identity theft by enabling the automatic cross-referencing of detailed information existing in databases that have been built and deployed on the Web in isolation? This capability is at the heart of the Semantic Web effort. Information that right now can only be obtained by individual users manipulating individual web interfaces will be discoverable by smart search engines.

The real problem: it will scale.

This is a big deal. It’s not just that previously hidden information will now be discoverable. Because standardized terms and assertions will be used to describe information in databases, smart search engines will be able to automatically interrelate data from otherwise unrelated database systems. When information from multiple places is integrated, new information is effectively created.

For a moment, let’s forget about databases and look at a simple example of information that might be stored statically in two websites. Here is an example adapted from the previous posting of this blog:

Assertion 1: Joe is tall for an athlete.
Assertion 2: Tall athletes should try out for basketball.

A new inference: Joe should try out for basketball.

The point here is that this new inference can be inferred automatically, without the intervention of a human being.

We noted in the previous posting that the information about Joe and the information about basketball might be on different websites. These websites could easily have been built independently. But a key notion - and that is the semantics of the word “tall” in the context of basketball - is what allows this information to be automatically integrated. Another site might point out that Timmy is tall for a kindergarten student, but this would not trigger the suggestion that Timmy try out for the NBA.

Now, let’s get back to database systems, these things that can contain countless terabytes of personal information. Perhaps there is a database at one site containing information about many thousands of athletes. Perhaps there are hundreds or thousands of such sites. The Semantic Web would allow us to find tall athletes without having to know in advance what databases around the world have this sort of data inside them, data that previously could only have been extracted through tedious, time-consume human/computer interaction. Now, a high school counselor or a sports agent looking for new clients can be far more effective at their jobs.

Or, maybe it’s a drug company matching potential customers up with expensive drugs targeted toward specific diseases, or toward people who might have vague symptoms of various diseases, and who might be easily convinced they are sick. Ora con artist looking to scam elderly people who are likely to have dementias.

Or - well, get it? The Semantic Web will scale because it will have access to huge databases, and not just a world wide web of static pages. That’s the danger.

May 11 2009   3:07AM GMT

The Semantic Web: revealing hidden data.



Posted by: Roger “Buzz” King
the Semantic Web, Oracle, SQL Server, DB2, PostgreSQL, MySQL, triples, namespaces, static pages, dynamic pages, databases, indexing, hidden web content

The Hidden Web.

The Semantic Web - a primary topic of this continuing blog series - will help us search the web with greater ease. One of the things it will (hopefully) do is expose a vast sea of information that is currently invisible to our web browsers. In fact, some say that right now, we can see less than 1% of what’s out there. I cannot vouch for this number, but I can say that what we cannot see right now includes large volumes of extremely valuable data.

Perhaps you have heard of the mysterious “Hidden Web”? So, what is this stuff and where is it?

Forms, Databases, and Interactive Interfaces.

The Hidden Web refers to data that is out there on the web, publicly accessible - but only via webpage interfaces that are opaque to the indexing software of search engines like Google.

Let’s step back for a moment.

The way search engines work, in case you don’t know, is by constantly searching the web, looking for new webpages. When a new page is found, it is added to the search engines index, meaning that now, when people search the web with Google, they might get the URL for that page in their search results.

The important thing to note is that the primary source of information that Google uses when it indexes a page is the page itself. What words are on it?

This sounds great for static webpages that are stored as-is on websites and delivered as-is to the Google user.

But suppose we want Google to find dynamic pages? A typical dynamic page has content that isn’t known until an interactive user types some words into a webform”. A web form is a page where the browser user fills in blanks and then lets the browser send the completed page back to the server. There, the information in the form is used to select other information, which is plugged into a “dynamically” created page that is sent to the client machine and viewed by the browser user.

So, I might visit Amazon. I navigate to their search page, which is a form, and I type in the title of the book I want. That information goes back to the server. A description of this book, including its cost, is plugged into a dynamically created page, which is then downloaded to my machine so that I can read the material with my browser.

Indexing Dynamic Pages.

So, if I have information that is not sitting in static pages, how can I get Google to index this information? There are multiple ways. For example, if the primary job of your website is to create large volumes of dynamically created pages, you might want to create a special directory page for your site - a static page - loaded with all the right words, and that contains links to the pages and forms you want the user to discover.

On the future Semantic Web, you might want to make sure that those magic words come at least in part from globally accessible namespaces, so that people who are using next-generation browsers, and who will be using these namespaces as a source of search keywords, will find your static page. As we have discussed, namespaces will provide us with detailed sets of terms, which will be tied to specific domains. This will make the search for static pages far more efficient than it is now.

As an example, a namespace concerning books might have words like ISBN-10 and ISBN-13. If the web designer uses these terms to describe static pages about books, and if the user of the browser can specify that they are looking for ISBN numbers, the browser will have a much more detailed idea of what is meant by those 10 and 13 digit numbers the user types in.

Here’s the critical part. Right now, Amazon lets you search by the these numbers on their specialized web form page, but imagine if you could at any time tell your browser to look for ISBN numbers on whatever webpages it searches.

An example of a namespace that is used to describe documents on the web is the Dublin Core, by the way.

So, that’s one way to make your dynamic pages somewhat visible. Create a web page that is static and leads to the pages you want users to see, and to make it all the more powerful, use terms from a globally accepted namespace like the Dublin Core. This is something that is already partly doable. The Dublin Core, along with other namespaces, are in wide use.

Where Does that Information Come From?

Is there a better way, though? This technique will only point users to our static web directory, which will then enable interactive users to find our web forms. The users must then use our forms to get detailed data. Could the searching for dynamic pages be made more automatic?

Well, where does data in dynamic pages come from? Often from large databases built with such database management systems as Oracle, SQL Server, MySQL, PostgreSQL, and DB2. This is why some folks conjecture that the amount of information in the Hidden Web is vastly bigger than the web we see today. Databases can be BIG.

Imagine all the information on the ancient Pharaohs, genetic diseases, investments, philosophy, and countless other topics is sitting inside databases that right now are only accessible via web forms. Right now, we Google keywords like “pharaoh” and the first things we see are static, highly condensed Wikipedia pages, and perhaps some static pages posted by museums and academics.

What Will the Semantic Web Do?

The Semantic Web will have as a primary challenge the ability for us to ask for information, and know that the search space will contain information tucked away in databases dotted all around the globe.

This is a very complex problem. Right now, we need a human sitting at the keyboard of the client machine to navigate to the correct URL and then type terms into a web form. In the future, web designers will need ways of capturing information about what is contained in databases, and to specify that information in a fashion that browsers can access. And this information will have to be very detailed, sometimes very intricate.

The browser will also have to take information specified by the user and match it up with the information that describes databases on the web. This means that we will need some automatic way to search databases without a user interactively and incrementally screening tens or hundreds or thousands of URLs. In an earlier blog posting in this series we described one possible technique called “triples” that might, combined with namespaces, provide a partial solution to this problem.

We will look at this again, more closely, in a future blog posting.