Databases archives - Buzz’s Blog: On Web 3.0 and the Semantic Web

Buzz’s Blog: On Web 3.0 and the Semantic Web:

databases

Oct 18 2009   10:37PM GMT

Personal Information Management Applications and Web 3.0



Posted by: Roger “Buzz” King
advanced Web apps, databases, information, media applications, Multimedia, note-taking, notebooks, rich internet apps, tagging, Web 2.0, Web 3.0, web applications

This blog is devoted to the discussion of Semantic Web and Web 2.0/3.0 technology.

Managing personal and small group information.

When it comes to so-called Web 2.0 and 3.0 technology, one of the most proliferate marketplaces involves the explosion of applications for managing information for individuals and small groups. Looking only at applications developed for Macs, we see an array of information management technologies.

Notebooks.

One of the most popular formats for managing information uses the paradigm of a notebook. The user can create a notebook, often selecting from multiple canned formats, such as a diary, class notes, or a novel, complete perhaps with a notebook cover and a spiral wire down the left side. The application creates a table of contents, and users can create sections and pages - and stuff virtually any kind of information on each page. Two very good examples of this approach are NoteShare and Notebook.

Interestingly, and perhaps because many of the applications in this category have been around for a number of years, these tend to not be true web applications. Often you can share notebooks, including full read/write access, via a URL and a simple browser interface, and you can publish a notebook at a URL. But the products are primarily for single-user, desktop use.

A good example of a notebook application that is a true web application is Zoho Notebook. (Zoho actually provides a large set of web based applications, of which the note program is just one.)

Buckets.

The other very popular note format uses the bucket or folder approach. The application may or may not support the nesting of these buckets and/or the creation of conceptual buckets, so that a given note can exist in more than one bucket. Two very good applications that use this approach are SOHO Notes and Yojimbo. These two applications are desktop-based, although most applications in this category support the synching of notes over multiple machines, using the Apple web-synching technology.

A hybrid desktop/web application is Evernote, which has elegant desktop applications for Windows machines, Macs, and a variety of handhelds and cell phones. It also has a very effective web interface. The user can sync multiple Evernote desktop instances via Evernote’s web server. Users can thus avoid ever using the web interface.

Outlines.

One specialized sort of information management application involves the creation of embedded outlines and bulleted lists. These applications, such as OmniOutliner, actually provide a full notebook functionality as well. OmniOutliner notebooks can be published on the web, but it is very definitely a desktop application.

Task lists.

An even more specialized class of information management applications support To-Do lists. Great examples are Zenbe Lists (they also provide integrated email and collaborative software) and rememberthemilk.com. These are web applications.

Photos and video.

There are a rapidly growing number of applications that allow users to collect, sort, tag, edit, and share photographs and video. Apple’s iPhoto is a great example. It is very much a desktop app, although applications in this class typically support the publication of images and video on the web, and sometimes, even read/write access via the web.

Stories, scripts, novels, and storyboards.

There are a number of highly specialized applications that support the development of fiction, including Final Draft and Montage (scripts), Scrivener and StoryMill (fiction prose), and Toon Boom storyboard (which is actually an impressive drawing program). Again, users can often publish to the web. Interestingly, many of these applications can easily be used as full blown, generic note applications, and can manage many forms of media.

Diary Applications.

Perhaps the most popular diary application on Macs is MacJournal (by the Montage and StoryMill folks). An interesting twist is that it is also an excellent blogging program. I use it to write this blog. This is, of course, one of the most widely used vehicles for sharing information on the web, and you can expect other sorts of personal information management systems to have blogging capabilities added to them.

Small, forms-based database management systems.

These applications are desktop apps. Apple’s Bento is a very good example. It actually is a sort of hybrid database/spreadsheet application. The most recent release allows multiple instances of Bento to share databases running on computers on a shared network.

Mind-Mapping.

The “circles and lines” applications have become highly specialized. The most well known one is MindManager, and there are versions for Windows machines and Macs. These are desktop apps. The vender, MindJet, recently introduced both web interfaces for sharing and updating desktop mind maps, as well as a web-based application that has a fresh, smooth interface, and provides team collaboration tools. Many forms of media can be placed in MindManager, including data from a wide variety of relational database management systems.

Screen and audio capture.

There are a number of applications that allow users to capture desktop video, along with audio voice-overs. Camtasia (which has Windows and Mac products) and Screenium are popular products.

These applications are, in a way, successors to slide applications like Microsoft Powerpoint and Apple Keynote. More and more presentations are being engineered with screen capture and audio applications, and these applications often support text and image data, as well as the insertion of video capture of the speaker. Sometimes, Powerpoint slides can be imported.

Conferencing apps.

There are several applications that provide hybrid desktop/browser live communication, including video, sound, and collaborative white-boarding. The best known one is probably Cisco WebEx, which comes in varieties for Macs and Windows machines. Skype supports a similar, limited product - which is free. One of the nice things about these products is that they come with their own voice lines. Other products, like Adobe ConnectNow, require the use of a cell phone to carry voice. With most of these products, a conference can be recorded for later use.

Finally…

Importantly, we note that in this rapidly-exploding marketplace, the borders between these various categories are being broken down, and applications often support a number of these capabilities at once. A good example is Curio, a desktop application that supports notes, lists, video, audio, white-boarding, mind-mapping, and limited web publishing.

Oct 11 2009   11:07PM GMT

Making information management scale: leveraging metadata on the new Web



Posted by: Roger “Buzz” King
3D modeling, automating Web searches, databases, DB2, information, Multimedia, MySQL, Oracle, PostgreSQL, RDF, Semantic Web, Video, Web 3.0, Web development frameworks, Web3.0

Previous postings of this blog.

This blog is dedicated to advanced Web development tools and concepts. Previous blog postings have focused on the emerging Semantic Web, which promises to make the Web radically easier to search and to greatly enhance the value of the vast sea of currently-disconnected information spread across the Web. We have also looked at Web 3.0 efforts, which promise to make multimedia websites highly usable and capable of conveying far more information than the current generation of websites. Previous postings describe breadth and depth of cutting edge Web technology.

Metadata: making that ratio small.

Here’s something that’s very important: Much of the ongoing research and development that is loosely categorized as Semantic Web and Web 3.0 efforts is focused on a specific technical goal, one that has been at the core of information management technology since the mainframe era that was epitomized by the IBM 360 series. That goal is to leverage metadata as much as possible.

It’s our best weapon against the truly staggering amount of information on the Web. This includes traditional text-based and numeric data, as well as books, medical advice, photographs, entertainment and training videos, music and recorded books, investment information, educational materials, scientific materials, e-government information, etc., etc. How can we possibly organize information and then search it in a way that scales? The Web is far from a closed world. In traditional data processing environments like banking, insurance, and credit card processing, we could get our arms around all of the data, as vast as it may have seemed. But the world of information today is an open world, effectively infinite in size.

Very informally, if you look at the size of the metadata divided by the size of the data itself, the smaller that fraction the better. In traditional relational databases (built with database management systems, such as Oracle, MS SQL Server, MySQL, PostgreSQL, or DB2), the extreme focus on minimizing this ratio has enabled the fast processing of extremely large volumes of data. The tradeoff is that the table definitions (or the “schema”), which form the heart of the metadata are very, very simplistic.

The old days: relational database schemas.

An insurance claim may be defined as a table with such columns as Subscriber_Name, Medical_Provider, etc., and thus, may consist of little or no more than a series of simple character and numeric fields. But if we need to process fifty thousand of them tonight, we must be able to bring many such table rows into memory at once, and quickly move through them. The database world was an extension of the paper world: a row in an insurance claim table was effectively an electronic successor to the traditional claim form.

Today: a far more challenging problem.

But on the new Web, information can be far more complex in nature, making the metadata to data ratio far larger. We’ve looked at some of the emerging technology and technical trends for embedding metadata in advanced forms of data (and for processing that metadata); this data includes books, images, video, modeling and animation, and sound. This new generation of information formats make up our personal health records and medical records images, industrial training materials, university “distance” courses, and the like. Each instance of these tends to be far more unique than individual insurance claim forms. And, it takes a lot of metadata to properly convey their “meaning”.

The challenge.

What we’re struggling with right now is to succinctly specify the meaning of modern media assets and to automate searching based on this metadata. This is our only hope for leveraging that ratio of metadata size divided by data size.


Oct 3 2009   9:12PM GMT

Multimedia: The Problem of Subtle Semantics



Posted by: Roger “Buzz” King
3D animation, 3D modeling, advanced Web apps, automating Web searches, blob data, continuous data, databases, information, Multimedia, rich internet apps, Semantic Web, smart search engines, tagging, Text, Web 2.0, Web 3.0, web applications, Web development, Web development frameworks, XML

The challenge of the Semantic Web.

We’ve looked at the emerging Semantic Web technology in the previous postings of this blog. The idea is to have a far, far smarter Web, one where the process of finding and interpreting and making use of far flung information can be largely automated. This is in sharp contrast with today’s Web, where these things have to be done in a painful, extremely time-consuming fashion.

So that is the key challenge. It has to do with searching the kinds of information that are important to us in our daily lives. This information, as it turns out, is very difficult to process automatically. Why is this?

The complexity of modern multimedia.

I teach a very basic 3D animation class to mostly computer science students. We use Maya, arguably the most popular 3D animation application, one that is used in the making of many animated features. The interesting thing about animation is that it is truly multimedia. It can give us a lot of insight into what we need the new Web to do for us.

That’s because the number and diversity of applications that are used for drawing, documenting, modeling, animating, motion capture, texturing, video rendering, video editing, video conversion and compression, sound editing, in even small projects, can be very impressive. Correspondingly, the wide variety and complexity of media formats involved in an animation project can be overwhelming.

What happens in an animation project? The workflow might begin with vector storyboard drawings to break the story down into scenes. In a typical animation project, 3D models in a variety of proprietary formats are used. Models must be transformed as they are exported from one application and imported into the next. Multiple video renders of animated models are made, and they must be edited together, along with multiple sound files. Multiple video and audio formats might be used. 2D images are used for textures; photographs of butterfly wings can be used to make an animated butterfly very realistic, and a checkerboard image made with Photoshop can be used to make a Linoleum floor. And along the way, a variety of note taking, screen capture, and conferencing software might be used to facilitate group communication.

There is also a heavy focus on reuse in an animation project. Building every model, editing every texture, creating every environment and background, recording every sound from scratch is frequently intractable. If existing assets cannot be tailored and reused, the project would be far too expensive and time consuming, and would demand too wide a variety of professionals to always be available. This raises the multimedia stakes, as assets of widely differing forms must be constantly reconfigured and used in concert in new ways.

But what’s the real problem? We aren’t all trying to produce complex animated videos. But very interestingly, in our everyday lives we essentially face the animator’s challenge when we try to find and use information on the Web. That’s because we’re often looking for things whose meaning, whose interpretation, demands focused human thought. We are looking not for business data, but for pieces of media, and the problem is that today, most of our searching has to be based on tags or brief textual descriptions that are associated with pieces of media, and not on the true meaning of the media itself.

The needs of the business world are not our needs.

It’s the subjective nature of media assets - this is what is at the heart of the problem facing us. Existing technology for searching the web is based on keywords and very short pieces of text.

There is other technology, though, under active development, stuff that serves as the information storage backbone of most commercial websites. It’s the technology that has for decades been used in-house (not on the Web) by businesses when they process large databases. But this stuff was designed to handle traditional business data forms, like integers, character strings, real numbers, dates, timestamps, and full text.

There is more, though. All of the major database management systems, along with tools for building and searching advanced websites are being retrofitted (or in some cases, built from the ground up) to manage more than keywords and text, more than standard business data.

But up to now, the focus has not been on supporting the kinds of information you and I are most interested in. The focus has been on extending database and Web technology to support xml documents, as well as more complex data objects, like those inside a Java program, as well as other forms of data found inside programs. This includes arrays and lists and short pieces of textual data, like the names of diseases.

In other words, we’ve been busy extending our support of the business world, so they can store complex business data in databases and make that information processable over the Web. You and I have largely been left out.

Finally, we are attacking our needs.

But there now many ongoing efforts to extend database and Web technology to make it useful to us. The new focus is on supporting blob and continuous media like images, video, and audio. This is extremely hard to do.

Why? Because the strongest means by which we deduce the meeting of business data is by looking at its internal structure and the terms that are used to describe that structure. A relational table named Prescriptions, with a character attributes Patient Name, Doctor’s Name, and Medication, and with a numeric attribute Dosage, is pretty easy to interpret.

But what do we do with a photograph, which is just a grid of pixels with no internal structure? Or a long series of images, along with a sound track, put together to form a piece of video?

The U.S. military has been pumping money into image processing for several decades, and so all is not lost. There is a vast body of mathematical research and software development that allows us to write programs that can find a particular face in a crowd and search satellite photos for airplane runways. But in general, we cannot at this time write a program that can process an arbitrary photo or video clip and tell us what it means. That means we can’t quickly search vast media database for useful pieces of information.

The goal behind the Semantic Web effort is to build a new generation of websites whose information can be searched automatically, and where information from multiple sites can be automatically integrated. To do this with numeric and character based data is quite doable. But when it comes to multimedia, like images and sound and video and 3D models and engineering designs, well, we have a long way to go. The meaning - in other words, the semantics - of these forms of data are complex and subtle, and highly dependent upon an individual’s interpretation of that media.

So, we see that we have only just begun our journey to create the new Web.


Sep 18 2009   3:02AM GMT

Dynamic pages, hidden data, and infered information: the danger of scale.



Posted by: Roger “Buzz” King
assertions, databases, dynamic pages, hidden web content, inferences, next generation search engines, Semantic Web, smart search engines, static pages, triples

The good and bad sides of the powerful Semantic Web.

So what happens when the Semantic Web is here? It’s supposed to largely automate the process of searching the Web by allowing us to attach machine-readable assertions (perhaps by using RDF) to information posted on the Web. Then, instead of us poor flailing humans having to painstakingly chase down countless URLs until we get what we want, smart search engines would be able to find precisely what we want in a single shot.

There is an obvious danger to all of this. The new Web will scale, in both good ways and bad. I am certainly not the first person to point out that the smarter the Web, the easier it will be for software to peruse the Web and dig up personal information about us. There will be software that carefully crafts ads in Spam mail that will target our vulnerabilities and our preferences. Websites will dynamically create webpages that target us individually, as well. When we shop online, when we read news, when we make social connections online, the Web will be disarmingly efficient and effective, and this leaves lots of room for fraud and manipulation.

This is already happening to a significant degree, and most of us are aware of it.

The no-longer-hidden database factor.

There is something more subtle about all of this, however. One of the most difficult things to do with traditional Web technology is to expose the content of databases to Web visitors. That’s because the pages that deliver up content pulled from databases are highly dynamic in nature, and so it is very hard for web designers to make search engines (like Google) find and index the content of these databases. There are simple and somewhat effective things web designers can do, like creating static pages that contain terms that are meant to draw web visitors to their sites. These pages are not “destination” pages; rather, they exist only as a way of advertising the information contained in databases.

In the future, RDF assertions (and other machine-readable content) will be added to websites, and they will server as far more effective draws.

But what about privacy? Will web designers inadvertently facilitate fraud and identity theft by enabling the automatic cross-referencing of detailed information existing in databases that have been built and deployed on the Web in isolation? This capability is at the heart of the Semantic Web effort. Information that right now can only be obtained by individual users manipulating individual web interfaces will be discoverable by smart search engines.

The real problem: it will scale.

This is a big deal. It’s not just that previously hidden information will now be discoverable. Because standardized terms and assertions will be used to describe information in databases, smart search engines will be able to automatically interrelate data from otherwise unrelated database systems. When information from multiple places is integrated, new information is effectively created.

For a moment, let’s forget about databases and look at a simple example of information that might be stored statically in two websites. Here is an example adapted from the previous posting of this blog:

Assertion 1: Joe is tall for an athlete.
Assertion 2: Tall athletes should try out for basketball.

A new inference: Joe should try out for basketball.

The point here is that this new inference can be inferred automatically, without the intervention of a human being.

We noted in the previous posting that the information about Joe and the information about basketball might be on different websites. These websites could easily have been built independently. But a key notion - and that is the semantics of the word “tall” in the context of basketball - is what allows this information to be automatically integrated. Another site might point out that Timmy is tall for a kindergarten student, but this would not trigger the suggestion that Timmy try out for the NBA.

Now, let’s get back to database systems, these things that can contain countless terabytes of personal information. Perhaps there is a database at one site containing information about many thousands of athletes. Perhaps there are hundreds or thousands of such sites. The Semantic Web would allow us to find tall athletes without having to know in advance what databases around the world have this sort of data inside them, data that previously could only have been extracted through tedious, time-consume human/computer interaction. Now, a high school counselor or a sports agent looking for new clients can be far more effective at their jobs.

Or, maybe it’s a drug company matching potential customers up with expensive drugs targeted toward specific diseases, or toward people who might have vague symptoms of various diseases, and who might be easily convinced they are sick. Ora con artist looking to scam elderly people who are likely to have dementias.

Or - well, get it? The Semantic Web will scale because it will have access to huge databases, and not just a world wide web of static pages. That’s the danger.


Aug 31 2009   3:40AM GMT

A Real-World Look at the Semantic Web, part 1



Posted by: Roger “Buzz” King
assertions, databases, inferences, information, knowledge, namespaces, RDF, Semantic Web, SPARQL, triples, wikis, ontologies

This blog is dedicated to the study of emerging Web technology, in particular, ongoing research and development aimed at building software tools that will underlie the emerging Semantic Web. In this posting, we look at a little-known website that has the potential of setting the pace for the developers of the Semantic Web.

DBpedia.

It’s called DBpedia. A former graduate student at my university, Greg Ziebold, pointed me toward it. The goal of the DBpedia is to transform data from the Wikipedia into a chunk of the Semantic Web. To do this, DBpedia is using RDF technology, something we have discussed is past postings of this blog. Behind RDF is an extremely simple concept, but one that has proven extremely powerful and versatile.

The general idea is to break knowledge up into “triples” that describe relationships between pieces of information. These triples can be chained together to discover new relationships. And, importantly, triples must make use of widely shared sets of terminology, called namespaces, in order for knowledge from different places on the Web to be properly chained together.

RDF, triples, assertions, and inferences.

A thorough example can be found in a previous posting of this blog.

Here is a very simple example of triples (also known as “assertions”) and how they can be put together into “inferences”.

Assertion 1: Joe is tall.
Assertion 2: Tall People should try out for Basketball.
A new inference: Joe should try out for Basketball.

Keep in mind that we would want to make sure that the words used in these assertions have precise, global meanings. We might take the terms in these two assertions from a basketball namespace, one that would carefully dictate exactly what “tall” means in the basketball world. Certainly, it would be quite different from the meaning of “tall” in a kindergarten namespace.

More on DBpedia.

There’s a fancy word for sets of triples that use namespaces and represent various areas of knowledge. They are called “ontologies”, taken from the term used by philosophers to argue about the existence of various things, like God. The DBpedia is essentially a vast ontology, formed from triples and namespaces. Most of the knowledge defined by this ontology comes from the Wikipedia. The folks behind the DBpedia have been given direct access to the flow of information into the Wikipedia, so that the DBpedia can stay current.

One way to look at the DBpedia is that it takes the Wikipedia and reforms it into something that can be searched far more effectively. Right now, to search the Wikipedia, most of us simply type in terms (either into Google/Yahoo or into the Wikipedia search page). We try various terms and follow links inside the Wikipedia until we find what we think we are looking for. With the DBpedia, users can search with SPARQL, a language based on the structure of SQL and engineered specifically for searching large bases of triples. SPARQL allows us to traverse networks that consists of triples linked by inferences.

That way, if we were a coach looking for promising candidates for our team, we would use SPARQL to make the connection between Joe being tall and the fact that tall people should try out for basketball. This is clearly much faster and more accurate than googling things like “tall”, “basketball”, etc, until we happened to find Joe in one of the web pages that pop up.

The DBpedia website, by the way, claims to have a triple base that consists of 274 million RDF triples.

More on this in the next posting.


May 11 2009   3:07AM GMT

The Semantic Web: revealing hidden data.



Posted by: Roger “Buzz” King
the Semantic Web, Oracle, SQL Server, DB2, PostgreSQL, MySQL, triples, namespaces, static pages, dynamic pages, databases, indexing, hidden web content

The Hidden Web.

The Semantic Web - a primary topic of this continuing blog series - will help us search the web with greater ease. One of the things it will (hopefully) do is expose a vast sea of information that is currently invisible to our web browsers. In fact, some say that right now, we can see less than 1% of what’s out there. I cannot vouch for this number, but I can say that what we cannot see right now includes large volumes of extremely valuable data.

Perhaps you have heard of the mysterious “Hidden Web”? So, what is this stuff and where is it?

Forms, Databases, and Interactive Interfaces.

The Hidden Web refers to data that is out there on the web, publicly accessible - but only via webpage interfaces that are opaque to the indexing software of search engines like Google.

Let’s step back for a moment.

The way search engines work, in case you don’t know, is by constantly searching the web, looking for new webpages. When a new page is found, it is added to the search engines index, meaning that now, when people search the web with Google, they might get the URL for that page in their search results.

The important thing to note is that the primary source of information that Google uses when it indexes a page is the page itself. What words are on it?

This sounds great for static webpages that are stored as-is on websites and delivered as-is to the Google user.

But suppose we want Google to find dynamic pages? A typical dynamic page has content that isn’t known until an interactive user types some words into a webform”. A web form is a page where the browser user fills in blanks and then lets the browser send the completed page back to the server. There, the information in the form is used to select other information, which is plugged into a “dynamically” created page that is sent to the client machine and viewed by the browser user.

So, I might visit Amazon. I navigate to their search page, which is a form, and I type in the title of the book I want. That information goes back to the server. A description of this book, including its cost, is plugged into a dynamically created page, which is then downloaded to my machine so that I can read the material with my browser.

Indexing Dynamic Pages.

So, if I have information that is not sitting in static pages, how can I get Google to index this information? There are multiple ways. For example, if the primary job of your website is to create large volumes of dynamically created pages, you might want to create a special directory page for your site - a static page - loaded with all the right words, and that contains links to the pages and forms you want the user to discover.

On the future Semantic Web, you might want to make sure that those magic words come at least in part from globally accessible namespaces, so that people who are using next-generation browsers, and who will be using these namespaces as a source of search keywords, will find your static page. As we have discussed, namespaces will provide us with detailed sets of terms, which will be tied to specific domains. This will make the search for static pages far more efficient than it is now.

As an example, a namespace concerning books might have words like ISBN-10 and ISBN-13. If the web designer uses these terms to describe static pages about books, and if the user of the browser can specify that they are looking for ISBN numbers, the browser will have a much more detailed idea of what is meant by those 10 and 13 digit numbers the user types in.

Here’s the critical part. Right now, Amazon lets you search by the these numbers on their specialized web form page, but imagine if you could at any time tell your browser to look for ISBN numbers on whatever webpages it searches.

An example of a namespace that is used to describe documents on the web is the Dublin Core, by the way.

So, that’s one way to make your dynamic pages somewhat visible. Create a web page that is static and leads to the pages you want users to see, and to make it all the more powerful, use terms from a globally accepted namespace like the Dublin Core. This is something that is already partly doable. The Dublin Core, along with other namespaces, are in wide use.

Where Does that Information Come From?

Is there a better way, though? This technique will only point users to our static web directory, which will then enable interactive users to find our web forms. The users must then use our forms to get detailed data. Could the searching for dynamic pages be made more automatic?

Well, where does data in dynamic pages come from? Often from large databases built with such database management systems as Oracle, SQL Server, MySQL, PostgreSQL, and DB2. This is why some folks conjecture that the amount of information in the Hidden Web is vastly bigger than the web we see today. Databases can be BIG.

Imagine all the information on the ancient Pharaohs, genetic diseases, investments, philosophy, and countless other topics is sitting inside databases that right now are only accessible via web forms. Right now, we Google keywords like “pharaoh” and the first things we see are static, highly condensed Wikipedia pages, and perhaps some static pages posted by museums and academics.

What Will the Semantic Web Do?

The Semantic Web will have as a primary challenge the ability for us to ask for information, and know that the search space will contain information tucked away in databases dotted all around the globe.

This is a very complex problem. Right now, we need a human sitting at the keyboard of the client machine to navigate to the correct URL and then type terms into a web form. In the future, web designers will need ways of capturing information about what is contained in databases, and to specify that information in a fashion that browsers can access. And this information will have to be very detailed, sometimes very intricate.

The browser will also have to take information specified by the user and match it up with the information that describes databases on the web. This means that we will need some automatic way to search databases without a user interactively and incrementally screening tens or hundreds or thousands of URLs. In an earlier blog posting in this series we described one possible technique called “triples” that might, combined with namespaces, provide a partial solution to this problem.

We will look at this again, more closely, in a future blog posting.



Apr 2 2009   5:59AM GMT

Full Text searching: cleaver heuristics for managing large web-based document collections.



Posted by: Roger “Buzz” King
XML, the Semantic Web, SMIL, web applications, Web 2.0, documents, Web 3.0, databases, MySQL, SQL Server, Multimedia, full text, full text searching

There is an explosion of technology for supporting sophisticated forms of media on websites and in web applications. In our continuing series on advanced web applications (in particular, as they pertains to the Semantic Web and Web 2.0/3.0), we’ve looked at continuous media, in particular, video and multimedia presentations. But there is a very old form of continuous media, something that is perhaps the dominant media on the Web, and that’s text.

It’s becoming a very major issue in web development.

Text.

In this blog entry, we’ll be looking at a particular form of text, called “full text”.

But just what is text to begin with? It’s character-based data, anything we can read.

And what will we want to do with it in next-generation web applications? It’s important to note that more and more vast libraries of documents are being put online. Web applications need to provide far faster and more accurate searches of documents than what we can perform with Google.

Interestingly, a successful technology, called “full text retrieval”, is already in place in the relational database systems that underlie modern web applications. It’s there working for us, and we are likely to not be aware of how clever it is.

It’s also something that should be used much more heavily by web application developers.

Let’s step back and consider three different - and increasingly more sophisticated - ways of managing character data.

Atomic Character Attributes.

First, there is the traditional relational database approach, whereby data is stored as tables made of rows of atomic, fixed sized attributes. By atomic, we mean that each attribute has no internal structure. So, a table of insurance claims might have rows with the following attributes: Claims-ID (an integer), Amount (an integer), Medical_Problem (a fixed length character string), and Subscriber_Name (a fixed length character string). Using SQL, the universal database “query” language, we might look for all rows that contain the name “Fred Jones”. Or, we might search for all rows that have claim numbers that are between 110 and 115.

Essentially, this approach limits us to comparing small strings of data to each other or to fixed values. There are some common extensions that we find in relational databases, such as being able to ask the question to find all rows where the Medical_Problem is something like “broken leg”. Then if a row actually has the value “broken legs”, we would most likely see this row in our results.

Full Text.

Second, there is the ability to search pieces of text according to their natural language (in this case, English) meaning. In this case, we consider the character data to have internal structure, and the values are not considered atomic. Often, these pieces of text are long and of variable length from one row to the next.

It is actually an extension of - but a very dramatic one - of the like operator in SQL.

It is what we call “full text” management or retrieval, and modern relational database management systems like MySQL and Microsoft SQL Server support this. This was seen long ago as a critical extension to relational database technology. Thus, we might rename our Medical_Problem field to Doctor’s_Diagnosis, and allow free form English text in this attribute, as well as allowing the value to be quite long. Then we might search for all rows where the doctor describes “fractures of the lower limbs”. Notice that none of these words might actually appear in the attribute, which might simply refer to “broken legs”.

Natural Language Processing.

This capability would clearly be very powerful, if we could do it right. The problem is that to support it fully, we would need to use highly advanced natural language processing techniques, which are very time consuming to execute, especially on huge databases of large documents. The full text approach tries to simulate true natural language searching in a far less expensive way. The real thing, by the way, might not be all that accurate anyway. Natural language is naturally ambiguous and very subtle.

True natural language searching would be our third way of processing character-based data, by the way. It is not a fully developed technology. And importantly, we usually don’t need anything that fancy.

The Clever Compromise.

So, our middle option, full text searching, is what dominates today - and it is a surprisingly accurate, and efficient, technique that operates on a small set of heuristics. It can transform a dumb webpage where we can only search for small, fixed character strings, to a rich next-generation webpage that can effectively be searched according to its meaning. It allows us to manage very large text documents in web applications - and get us surprisingly close to the semantic power of true natural language searching.

We’re not going to go into a lot of detail here, but here are some of the heuristics that are used in full text search. First, “stemming” and related techniques are used; they conjugate verbs, detect plurals of nouns, and remove prefixes and suffixes. Another technique is to use a “stop list” that lists words that should be ignored, like “the”. The system might also let us specify the “proximity” of words; this refers to how closely specific words should appear in a document. It can also be powerful to include a synonym checker. And the ability to allow for “wild cards”, in particular, letters that may vary in a passage without changing its meaning, can be quite useful. Dictionaries of technical words that pertain to specific domains (like medicine or law) are very useful. We might also provide a feedback capability, whereby users can train full text search engines to be more accurate.

This clearly doesn’t come anywhere near true natural language processing - but it is fast. It will be a growing technology on the new web, with a lot of hidden development, making this heuristic-based technique more and more effective.

Indexing.

We should note that there is a significant up front cost in preparing a document for full text searching: we need to build an index with an entry for every (non-stop) word in the text. Then, when a query is executed, we can look for words in the document by searching the index, instead of searching the full text. If there were no index, the search would be extremely time-consuming.

The Future.

As more and more governmental, educational, medical, and other complex documents become available on the web, advanced full text searching will enable us to search vast databases in a tractable fashion. Even more clever full text retrieval engines will turn dumb, “gotta Google them” document portals into true Web 3.0 and Semantic Web applications.