Web 2.0 archives - Buzz’s Blog: On Web 3.0 and the Semantic Web

Web 2.0

Oct 18 2009   10:37PM GMT

Personal Information Management Applications and Web 3.0



Posted by: Roger “Buzz” King
advanced Web apps, databases, information, media applications, Multimedia, note-taking, notebooks, rich internet apps, tagging, Web 2.0, Web 3.0, web applications

This blog is devoted to the discussion of Semantic Web and Web 2.0/3.0 technology.

Managing personal and small group information.

When it comes to so-called Web 2.0 and 3.0 technology, one of the most prolific marketplaces involves the explosion of applications for managing information for individuals and small groups. Looking only at applications developed for Macs, we see an array of information management technologies.

Notebooks.

One of the most popular formats for managing information uses the paradigm of a notebook. The user can create a notebook, often selecting from multiple canned formats, such as a diary, class notes, or a novel, complete perhaps with a notebook cover and a spiral wire down the left side. The application creates a table of contents, and users can create sections and pages - and stuff virtually any kind of information on each page. Two very good examples of this approach are NoteShare and Notebook.

Interestingly, and perhaps because many of the applications in this category have been around for a number of years, these tend to not be true web applications. Often you can share notebooks, including full read/write access, via a URL and a simple browser interface, and you can publish a notebook at a URL. But the products are primarily for single-user, desktop use.

A good example of a notebook application that is a true web application is Zoho Notebook. (Zoho actually provides a large set of web based applications, of which the note program is just one.)

Buckets.

The other very popular note format uses the bucket or folder approach. The application may or may not support the nesting of these buckets and/or the creation of conceptual buckets, so that a given note can exist in more than one bucket. Two very good applications that use this approach are SOHO Notes and Yojimbo. These two applications are desktop-based, although most applications in this category support the synching of notes over multiple machines, using the Apple web-synching technology.

A hybrid desktop/web application is Evernote, which has elegant desktop applications for Windows machines, Macs, and a variety of handhelds and cell phones. It also has a very effective web interface. The user can sync multiple Evernote desktop instances via Evernote’s web server. Users can thus avoid ever using the web interface.

Outlines.

One specialized sort of information management application involves the creation of embedded outlines and bulleted lists. These applications, such as OmniOutliner, actually provide a full notebook functionality as well. OmniOutliner notebooks can be published on the web, but it is very definitely a desktop application.

Task lists.

An even more specialized class of information management applications supports To-Do lists. Great examples are Zenbe Lists (they also provide integrated email and collaborative software) and rememberthemilk.com. These are web applications.

Photos and video.

There are a rapidly growing number of applications that allow users to collect, sort, tag, edit, and share photographs and video. Apple’s iPhoto is a great example. It is very much a desktop app, although applications in this class typically support the publication of images and video on the web, and sometimes, even read/write access via the web.

Stories, scripts, novels, and storyboards.

There are a number of highly specialized applications that support the development of fiction, including Final Draft and Montage (scripts), Scrivener and StoryMill (fiction prose), and Toon Boom storyboard (which is actually an impressive drawing program). Again, users can often publish to the web. Interestingly, many of these applications can easily be used as full blown, generic note applications, and can manage many forms of media.

Diary Applications.

Perhaps the most popular diary application on Macs is MacJournal (by the Montage and StoryMill folks). An interesting twist is that it is also an excellent blogging program. I use it to write this blog. This is, of course, one of the most widely used vehicles for sharing information on the web, and you can expect other sorts of personal information management systems to have blogging capabilities added to them.

Small, forms-based database management systems.

These applications are desktop apps. Apple’s Bento is a very good example. It actually is a sort of hybrid database/spreadsheet application. The most recent release allows multiple instances of Bento to share databases running on computers on a shared network.

Mind-Mapping.

The “circles and lines” applications have become highly specialized. The most well known one is MindManager, and there are versions for Windows machines and Macs. These are desktop apps. The vendor, MindJet, recently introduced both web interfaces for sharing and updating desktop mind maps and a web-based application that has a fresh, smooth interface and provides team collaboration tools. Many forms of media can be placed in MindManager, including data from a wide variety of relational database management systems.

Screen and audio capture.

There are a number of applications that allow users to capture desktop video, along with audio voice-overs. Camtasia (which has Windows and Mac products) and Screenium are popular products.

These applications are, in a way, successors to slide applications like Microsoft PowerPoint and Apple Keynote. More and more presentations are being engineered with screen capture and audio applications, and these applications often support text and image data, as well as the insertion of video capture of the speaker. Sometimes, PowerPoint slides can be imported.

Conferencing apps.

There are several applications that provide hybrid desktop/browser live communication, including video, sound, and collaborative white-boarding. The best known one is probably Cisco WebEx, which comes in varieties for Macs and Windows machines. Skype supports a similar, limited product - which is free. One of the nice things about these products is that they come with their own voice lines. Other products, like Adobe ConnectNow, require the use of a cell phone to carry voice. With most of these products, a conference can be recorded for later use.

Finally…

Importantly, we note that in this rapidly-exploding marketplace, the borders between these various categories are being broken down, and applications often support a number of these capabilities at once. A good example is Curio, a desktop application that supports notes, lists, video, audio, white-boarding, mind-mapping, and limited web publishing.

Oct 3 2009   9:12PM GMT

Multimedia: The Problem of Subtle Semantics



Posted by: Roger “Buzz” King
3D animation, 3D modeling, advanced Web apps, automating Web searches, blob data, continuous data, databases, information, Multimedia, rich internet apps, Semantic Web, smart search engines, tagging, Text, Web 2.0, Web 3.0, web applications, Web development, Web development frameworks, XML

The challenge of the Semantic Web.

We’ve looked at the emerging Semantic Web technology in the previous postings of this blog. The idea is to have a far, far smarter Web, one where the process of finding and interpreting and making use of far flung information can be largely automated. This is in sharp contrast with today’s Web, where these things have to be done in a painful, extremely time-consuming fashion.

So that is the key challenge. It has to do with searching the kinds of information that are important to us in our daily lives. This information, as it turns out, is very difficult to process automatically. Why is this?

The complexity of modern multimedia.

I teach a very basic 3D animation class to mostly computer science students. We use Maya, arguably the most popular 3D animation application, one that is used in the making of many animated features. The interesting thing about animation is that it is truly multimedia. It can give us a lot of insight into what we need the new Web to do for us.

That’s because the number and diversity of applications that are used for drawing, documenting, modeling, animating, motion capture, texturing, video rendering, video editing, video conversion and compression, sound editing, in even small projects, can be very impressive. Correspondingly, the wide variety and complexity of media formats involved in an animation project can be overwhelming.

What happens in an animation project? The workflow might begin with vector storyboard drawings to break the story down into scenes. In a typical animation project, 3D models in a variety of proprietary formats are used. Models must be transformed as they are exported from one application and imported into the next. Multiple video renders of animated models are made, and they must be edited together, along with multiple sound files. Multiple video and audio formats might be used. 2D images are used for textures; photographs of butterfly wings can be used to make an animated butterfly very realistic, and a checkerboard image made with Photoshop can be used to make a linoleum floor. And along the way, a variety of note taking, screen capture, and conferencing software might be used to facilitate group communication.

There is also a heavy focus on reuse in an animation project. Building every model, editing every texture, creating every environment and background, and recording every sound from scratch is frequently intractable. If existing assets could not be tailored and reused, the project would be far too expensive and time consuming, and would demand that too wide a variety of professionals always be available. This raises the multimedia stakes, as assets of widely differing forms must be constantly reconfigured and used in concert in new ways.

But what’s the real problem? We aren’t all trying to produce complex animated videos. But very interestingly, in our everyday lives we essentially face the animator’s challenge when we try to find and use information on the Web. That’s because we’re often looking for things whose meaning, whose interpretation, demands focused human thought. We are looking not for business data, but for pieces of media, and the problem is that today, most of our searching has to be based on tags or brief textual descriptions that are associated with pieces of media, and not on the true meaning of the media itself.

The needs of the business world are not our needs.

It’s the subjective nature of media assets - this is what is at the heart of the problem facing us. Existing technology for searching the web is based on keywords and very short pieces of text.

There is other technology, though, under active development, stuff that serves as the information storage backbone of most commercial websites. It’s the technology that has for decades been used in-house (not on the Web) by businesses when they process large databases. But this stuff was designed to handle traditional business data forms, like integers, character strings, real numbers, dates, timestamps, and full text.

There is more, though. All of the major database management systems, along with tools for building and searching advanced websites are being retrofitted (or in some cases, built from the ground up) to manage more than keywords and text, more than standard business data.

But up to now, the focus has not been on supporting the kinds of information you and I are most interested in. The focus has been on extending database and Web technology to support XML documents, as well as more complex data objects, like those inside a Java program, and other forms of data found inside programs. This includes arrays and lists and short pieces of textual data, like the names of diseases.

In other words, we’ve been busy extending our support of the business world, so they can store complex business data in databases and make that information processable over the Web. You and I have largely been left out.

Finally, we are attacking our needs.

But there are now many ongoing efforts to extend database and Web technology to make it useful to us. The new focus is on supporting blob and continuous media like images, video, and audio. This is extremely hard to do.

Why? Because the strongest means by which we deduce the meaning of business data is by looking at its internal structure and the terms that are used to describe that structure. A relational table named Prescriptions, with character attributes Patient Name, Doctor’s Name, and Medication, and with a numeric attribute Dosage, is pretty easy to interpret.
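To make that concrete, here is a minimal sketch in Python, using the standard sqlite3 module, of a Prescriptions table like the one just described. The column names follow the text; the in-memory database and the sample row are assumptions invented purely for illustration.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Column names mirror the attributes named in the text.
conn.execute(
    """
    CREATE TABLE Prescriptions (
        patient_name TEXT,
        doctor_name  TEXT,
        medication   TEXT,
        dosage       REAL
    )
    """
)

# Hypothetical sample row, made up for this example.
conn.execute(
    "INSERT INTO Prescriptions VALUES (?, ?, ?, ?)",
    ("A. Patient", "Dr. Example", "Amoxicillin", 500.0),
)

# The schema alone tells a program (or a person) what each value means.
columns = [row[1] for row in conn.execute("PRAGMA table_info(Prescriptions)")]
print(columns)
```

The point of the sketch is that the structure carries the meaning: a program can ask the database what the columns are called and make a reasonable guess at how to use them.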

But what do we do with a photograph, which is just a grid of pixels with no internal structure? Or a long series of images, along with a sound track, put together to form a piece of video?

The U.S. military has been pumping money into image processing for several decades, and so all is not lost. There is a vast body of mathematical research and software development that allows us to write programs that can find a particular face in a crowd and search satellite photos for airplane runways. But in general, we cannot at this time write a program that can process an arbitrary photo or video clip and tell us what it means. That means we can’t quickly search vast media databases for useful pieces of information.

The goal behind the Semantic Web effort is to build a new generation of websites whose information can be searched automatically, and where information from multiple sites can be automatically integrated. To do this with numeric and character based data is quite doable. But when it comes to multimedia, like images and sound and video and 3D models and engineering designs, well, we have a long way to go. The meaning - in other words, the semantics - of these forms of data is complex and subtle, and highly dependent upon an individual’s interpretation of that media.

So, we see that we have only just begun our journey to create the new Web.


Jun 11 2009   11:44AM GMT

The two duct tapes of computing: Excel and Firefox, and the New Web



Posted by: Roger “Buzz” King
Web 3.0, Web 2.0, the Semantic Web, Multimedia, Excel, browsers, models of computing, smart browsers

This blog concerns advanced Web technologies. Each posting should be readable on its own, but the series of postings as a whole tells a continuous story.

In this posting, we look at the Duct Tape Phenomenon.

Excel.

As a researcher, I have worked with biologists in the past. These were big biologists, not the microbiologists who tinker with DNA. The folks I worked with study macroscopic things mostly, species, in particular. They search for as-yet undocumented species. They tend to have appointments at major universities around the world, and then take extended field trips to study life. Most of them go to rain forests because that’s where biodiversity is at its greatest.

Each scientist has a chunk of the world and a kind of animal they specialize in. I know the butterfly man of Costa Rica, a fellow who has documented several thousand varieties of butterflies, some of which have wing spans of several inches. I know the bug man of the Amazon, who builds long tunnel-like things from the floor of the forest up to the canopy, fills the tunnels with bug killer, and then looks among the dead for bugs that are yet unheard-of.

Here’s the interesting part, at least from a computing perspective: a lot of the scientists I came into contact with store their data in Excel. This is a phenomenon that crosscuts the entire spectrum of computer users. They had to learn Excel at some point, maybe in school or at some workplace, and the next time they needed an application to do something, they found a way to make Excel do the job. For most people, learning the “right” application to use is far too much work, even if it’s hard to query Excel the way we would a database, even if Excel spreadsheets get way out of control size-wise, given the large amount of data many of us collect.

Excel, in many ways, is the duct tape of desktop and notebook computing.

Firefox (or your favorite browser).

But what about developers of desktop apps? What do they use as a design paradigm when building the interface to an app, even if it’s not meant for the Web?

Browsers.

Indeed, there is a merging of desktop GUI and web app interface technologies, and now you could sit down in front of a running app and not be sure which of the two you are seeing. In fact, the design impact is not the end of it. We actually use browsers now to interface with some desktop apps, but not often, not yet. However, at least as a user interface paradigm, the browser is becoming the duct tape of GUI design.

For developers of interfaces, Firefox has become a sort of duct tape.

The new Web.

These are the two things that underlie much of computing: the need to store and compute (as with Excel) and the need to interface (as with Firefox). But when the new Web (in the form of the Semantic Web and truly advanced Web 3.0 apps) begins to arrive, will a new paradigm emerge?

Perhaps it will take the form of extra smart browsers that can process code written with XML, namespaces, and other semantic technology, so they can do more than just look for pages according to the English keywords on them.

In other words, we could imagine them as extensions of what our browsers do for us now. They’re very stupid now, really. They’re not at all smart like Excel.

How does it work now? Crawlers commissioned by search engines like Google constantly search the Web and “invert” every static page they find by building an index on every word in them. And then later, we can search this gigantic index store according to the words that appear on the pages that the crawler has found. Once we find URLs of interest, we click on them and go visit the actual pages. These searches are far, far less than “semantic” in nature.
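For readers who have not seen one, here is a minimal sketch of that kind of inverted index in Python. It is a toy under stated assumptions: the pages dictionary stands in for whatever a crawler has fetched, the example.com URLs are hypothetical, and a real search engine does far more (ranking, stemming, and so on).

```python
import re
from collections import defaultdict

def build_inverted_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Return URLs containing every word in the query (simple AND search)."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical pages standing in for what a crawler might have fetched.
pages = {
    "http://example.com/a": "Butterflies of Costa Rica",
    "http://example.com/b": "Costa Rica travel guide",
    "http://example.com/c": "Butterflies and moths",
}

index = build_inverted_index(pages)
print(sorted(search(index, "costa rica")))
```

Notice that the index knows only which words appear where. It has no idea that a butterfly is an insect or that Costa Rica is a country, which is exactly why this style of search is not semantic.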

Our smart browsers will also have to let us build up organized libraries of specialized web content we have found, including documents, images, video, sound, animation, and such specialized data as medical treatment advice. We might maintain these in virtual space, or we might download frozen copies of pages to store on our machines. Our smart browsers could constantly look for updated versions of pages we have copied and downloaded.
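One simple way a browser might notice that a saved page has changed is to compare a fingerprint of the frozen copy with a fingerprint of the live page. The sketch below assumes that approach; the PageLibrary class and the example URL are hypothetical, and the page text is passed in directly so that nothing is actually downloaded.

```python
import hashlib

def fingerprint(content):
    """Return a SHA-256 hex digest of the page content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

class PageLibrary:
    """Toy library of saved pages, keyed by URL (hypothetical design)."""

    def __init__(self):
        self.saved = {}  # url -> fingerprint of the frozen copy

    def save(self, url, content):
        self.saved[url] = fingerprint(content)

    def is_stale(self, url, current_content):
        """True if the live page no longer matches the frozen copy."""
        return self.saved[url] != fingerprint(current_content)

library = PageLibrary()
library.save("http://example.com/advice", "Take two tablets daily.")
print(library.is_stale("http://example.com/advice", "Take one tablet daily."))
```

A fingerprint only says that something changed, not what the change means, so a truly smart browser would still need the semantic machinery discussed in this post.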

These smart browsers will also have to interrelate data of a wide variety of sorts, so that a description of certain symptoms can be accurately hooked up with the specifics of a diagnosis and a medical treatment plan. Our browsers will have to isolate conflicting information, as well.

So, in the future, we’ll need browsers with smarts. We’ll look at this much more carefully in a future posting of this blog, but for now, here’s the lesson: there are two things that applications do for us. They let us store and search things, and they let us compute things.

And what about viewing all this information? How will so much complex, multimedia information be presented? Not as simple webpages with images, text, and things you can click on. Perhaps the new browsers will lay out multimedia presentations of complex, integrated information that has been synthesized from many, many different sources.

The point.

So, what does this imply? That these two things underlie computing apps of almost all sorts: 1, storing and searching, and 2, viewing and manipulating.

And they will underlie the most complex and sophisticated end-user applications of the future.

In a vague, somewhat analogous fashion, most apps are a blend of Excel and Firefox.

Things change radically over time. And things never really change at all.



Jun 2 2009   4:53AM GMT

Ambient intelligence: empowering the new Web



Posted by: Roger “Buzz” King
RFID tags, Ubiquitous Computing, ambient computing, The Internet of Things, ambient intelligence, web services, Web 2.0

This blog concerns advanced Web technologies that can be roughly described as being part of the Web 2.0 and the Semantic Web efforts. Most recently, we’ve looked at technology that will either buttress new Web development technology or take advantage of it. In particular, in the last posting of this blog, we looked at the Internet of Things and ubiquitous computing, and how they might interface with advanced Web applications to produce a combined, more powerful computing environment. We’ve also looked at New Songdo City - the u-city - and how it will at least indirectly serve as a testing ground for new Web technology.

Ambient Intelligence: A Powerful Enhancer of Advanced Web Technology.

In this blog entry, we’ll look at another new technology and how it might dovetail with the new Web. It’s called “ambient intelligence”. Like other software advances, although it is not directly related to the Web, it will dovetail beautifully with new Web technology.

We consider how ambient intelligence will make the Web radically better at serving individuals.

Ambient Intelligence: Just What Is It?

The term refers to computerized devices that tailor their behavior according to the nature of each user. First of all, though, we should make it clear that this is not a particularly new term, that it does not have a highly specific definition, and there are lots of other terms that have been used to describe similar concepts. But there is something focused that is emerging under the banner of this name.

Ambient intelligence is commonly discussed in the context of embedded devices, machines that have processors in them and that perform specific information-based tasks, as opposed to being general purpose programmable computers. Embedded computers are in cell phones, our automobiles, and “smart cards”. Sometimes, they can indeed be programmed to do almost anything, like the ones inside cell phones. But even then, it’s assumed that very few people will do so. The point is that they generally do not have displays, keyboards, or mice dedicated to their use. They are found inside small and large devices, as well as in the smarts of complex systems, like assembly lines. Mass produced, but sophisticated items like insulin meters have computers in them.

As an example, you could imagine that the vending machine you put money into tomorrow might already know that you drink nothing but 20 ounce Pepsis. Maybe every vending machine in your complex at work knows your habits. Maybe if you switch to Sprite on one machine, it will tell the rest. Maybe the machines will offer you one or the other until a new pattern seems to emerge and it appears that you will never again drink Pepsi. Or you might be able to enter your “favorites” on the corporate website, and declare what you prefer to drink. The machines will know - and so will the company that services those machines. All of this could happen without human intervention.

Ambient devices don’t have to specifically target individuals. You could imagine a computing system in an airport that can smoothly transition between human languages, customs, and regulations, to better serve a global audience. We’re very close to this sort of thing right now, actually.

Ambient Intelligence at the Fingertips of the Web.

But wait. Let’s get back to those vending machines. How do they communicate with each other to pass on the critical news that you’re a Sprite person now? How do you enter your favorites? How does the vending machine company get the news so they know what to order?

The Web. Those ambient vending machines use the Web.

On the Web, embedded devices can be engaged by web applications and web services. (Remember that web services are programmatic interfaces to services; i.e., they don’t have to be activated by a human using a browser.) Embedded machines can also initiate web services, as well as trigger “push” tasks, whereby a user on a client machine somewhere is told that something is happening and it’s time to get to work. The embedded device and the user could be on opposite sides of the world, thanks to the Web.
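Here is a small sketch of how the vending machines might share the news, written in Python. Everything in it is an assumption made for illustration: PreferenceService is an in-process stand-in for a real web service, the subscribe callback plays the role of a “push”, and no actual network traffic is involved.

```python
class PreferenceService:
    """Stand-in for a web service that vending machines would call."""

    def __init__(self):
        self.favorites = {}    # customer id -> preferred drink
        self.subscribers = []  # callbacks to notify on change ("push")

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def report_purchase(self, customer_id, drink):
        """Called by a machine after a sale; pushes only real changes."""
        if self.favorites.get(customer_id) != drink:
            self.favorites[customer_id] = drink
            for notify in self.subscribers:
                notify(customer_id, drink)

class VendingMachine:
    def __init__(self, name, service):
        self.name = name
        self.service = service
        self.known = {}  # this machine's local copy of preferences
        service.subscribe(self.on_update)

    def on_update(self, customer_id, drink):
        self.known[customer_id] = drink

    def sell(self, customer_id, drink):
        self.service.report_purchase(customer_id, drink)

service = PreferenceService()
lobby = VendingMachine("lobby", service)
cafeteria = VendingMachine("cafeteria", service)

# A purchase at one machine becomes known to the other.
lobby.sell("buzz", "Sprite")
print(cafeteria.known["buzz"])
```

In a real deployment the service would sit behind a URL and the machines would be on opposite sides of a building, or the world, but the shape of the conversation would be much the same.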

RFID Technology: Tracking Things.

We’ve already looked at RFID technology.

As a reminder, the goal of RFID-based systems is to help us coordinate and carefully control the use of various objects. Of particular interest are mobile objects. One of the key components behind this idea is the RFID tag. RFID stands for “radio frequency identification”. A tag can be attached to almost anything. After tags are deployed, an RFID reader can send out a signal, which is picked up by the RFID tags, which then respond. As things move around, as things are used in concert to perform tasks, they can be carefully tracked and managed.

An Example:

There’s another aspect of ambient intelligence. When people talk about a device that has ambient intelligence, often they are referring to a dedicated device with a simple display, not a general purpose computer. By this quality, the soda machine example is a bit rudimentary, in that it probably doesn’t have any true native display at all, and the indirect way of accessing it, at least according to our example, is too general purpose - a website that is accessed with a full blown computer.

Consider something that is a major topic of discussion now, and a subject we will return to in this blog in the near future: electronic health records. The idea is that we would have life-long electronic medical information bases that would be accessible to medical providers (with our approval). This way, the fact that I had some disease as a child that makes me vulnerable to some other disease later in life would become apparent to my family doctor, and the necessary screening exam would be scheduled periodically. Otherwise, how am I supposed to know about the consequences of something that happened when I was a toddler? My “EHR” would also hold prescription records, imaging data, and anything else related to my health. It would, of course, be a web-based app.

But various sorts of doctors - not to mention non-medical types like me - need information displayed and abstracted in special ways. My family doc might want to see everything in its raw form, if for no other reason than my doctor would be expected to know my medical history, if it were readily available. (And yes, if I had a chronic disease or were the caregiver for someone with a chronic disease, the immense size of the EHR would be truly overwhelming. I imagine that doctors might be afraid of being expected to process huge EHRs belonging to new patients.)

Now, consider an emergency room doctor. If I were lying on a bed in an emergency room, barely conscious, having just collapsed after complaining of terrible pain from a horrendous headache and of nausea, and unable to answer questions, the doctor would need data fast. The display that the ER doc uses would not be on a general purpose desktop computer, would not provide that massive raw data view, and would present information in a highly readable form.

Most importantly, that computer would have to be instantly adaptable to suit the needs of an emergency, and then later, go back to a non-emergency mode, to be of help in further treatment.

Or, it might be that the web server and not the machine in the ER, contains the ambient software. The machine in the ER might be a very simple client. But either way, the combined web application and local client would have to be capable of searching my online EHR, to look for possible problems, and to display them. It might deliver up the fact that just this morning, I had minor surgery on the baby finger on my left hand - and since I was so squeamish, I was given general anesthesia.
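As a rough sketch of what “adapting the view to the viewer” could mean in code, consider the Python fragment below. The record layout, the role names, the 48-hour window, and every entry in the sample EHR are hypothetical, chosen only to echo the scenario in this post; a real EHR system would be vastly more involved.

```python
from datetime import datetime, timedelta

# Hypothetical EHR entries: (timestamp, kind, description).
ehr = [
    (datetime(1965, 3, 1), "illness", "childhood disease"),
    (datetime(2009, 6, 2, 8, 0), "procedure", "minor finger surgery"),
    (datetime(2009, 6, 2, 8, 5), "medication", "general anesthesia"),
]

def view(records, role, now):
    """Return descriptions suited to the viewer's role."""
    if role == "emergency":
        # Assumed rule: the ER view shows only the last 48 hours.
        cutoff = now - timedelta(hours=48)
        return [desc for ts, _kind, desc in records if ts >= cutoff]
    # Any other role (e.g. a family doctor) sees the full raw history.
    return [desc for _ts, _kind, desc in records]

now = datetime(2009, 6, 2, 20, 0)
print(view(ehr, "emergency", now))
```

The same data feeds both views. What changes is how much of it is shown, and that decision could live either on the web server or in the device in the ER.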

Boom. The ER doc figures out that my headache is from high blood pressure, which, along with nausea, is a common side effect of anesthesia, and it can hit hours later. The doc now knows that if I’m given a blood pressure reducing drug, I’ll be fine. But I might have to first be given an anti-nausea drug, and obviously, I wouldn’t be able to swallow that and keep it down, and so it would be administered at the other end of my food processing subsystem.

Wait, one more thing. What about RFID tags? Maybe I have one around my neck, and that’s how the doc figured out who I was in the first place, since I was stumbling around with no driver’s license. The machine in the ER scanned the tag - and voila.

The Reach of the Web.

If you think about it, by leveraging the Web, ambient devices can be empowered in incredible ways - and in the years to come, we’ll see a new generation of such web applications emerge.

(Finally, if my medical scenario is ridiculous, and you are a medical professional, then I’m sorry.)



May 17 2009   3:45AM GMT

The Internet of Things Meets the Internet of Web Apps.



Posted by: Roger “Buzz” King
the Semantic Web, Web 2.0, Web 3.0, The Internet of Things, Ubiquitous Computing, advanced Web apps, RFID tags, online retail shopping

Injecting Smarts into the Semantic Web and Web 2.0/3.0.

In our continuing series on advanced web technology, we’ve looked at the difference between the Semantic Web and Web 2.0/3.0. We’ve also looked closely at the Semantic Web, and in particular, we’ve discussed what we mean by that word “semantic”. And with respect to Web 2.0/3.0, we’ve considered just what constitutes an advanced web app. And we’ve looked at some specific advanced apps.

But one thing has stood out above all else: the new world of web applications depends on our ability to make web apps smarter. At the core of this are a handful of key technological advances: namespaces, XML languages, full text searching, and web services. Still, as we have seen, we can only crudely mimic intelligence, which we do largely by using a complex mixture of standards, heuristics, and pre-made components.

Importantly, this issue of being smart is very old, and has been a far off goal of the folks who build software development tools since the very early days of computing. In truth, some of the things that seem new and exciting to us have actually been around for a long time, and have existed under multiple names.

But this base of intelligence-injecting technology, could it be used to give the Semantic Web and Web 2.0/3.0 a shot in the arm? Can we leverage the greater world of smart technology to make the new web even more powerful?

Let’s focus on just one technology that has been around a while, but is still vibrant and rapidly growing.

The Internet of Things.

This concept is centered on the idea that the objects in our world would serve us a lot better if computers could coordinate their use. Of particular interest are mobile objects. One of the key components behind this idea is the RFID tag. RFID stands for “radio frequency identification”. A tag can be attached to almost anything. After tags are deployed, an RFID reader can send out a signal, which is picked up by the RFID tags, which then respond. As things move around, as things are used in concert to perform tasks, they can be carefully tracked and managed.
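The reader and tag exchange can be mimicked with a very small simulation. This Python sketch is only a model of the idea, not of any real RFID protocol or library: the RFIDTag and RFIDReader classes are hypothetical, and “being in range” is reduced to sharing a location name.

```python
class RFIDTag:
    """A passive tag that answers a reader's query with its ID."""

    def __init__(self, tag_id, location):
        self.tag_id = tag_id
        self.location = location

    def respond(self):
        return self.tag_id, self.location

class RFIDReader:
    """Toy reader: only tags at the reader's location 'hear' the signal."""

    def __init__(self, location):
        self.location = location

    def scan(self, tags):
        return sorted(
            tag.respond()[0] for tag in tags if tag.location == self.location
        )

tags = [
    RFIDTag("pallet-001", "dock"),
    RFIDTag("pallet-002", "warehouse"),
    RFIDTag("pallet-003", "dock"),
]

dock_reader = RFIDReader("dock")
print(dock_reader.scan(tags))   # tags currently at the dock

# When an object moves, the next scan picks up the change.
tags[1].location = "dock"
print(dock_reader.scan(tags))
```

Repeating the scan over time is what turns a pile of cheap tags into a tracking system.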

Other technologies for tracking objects can be employed, too, and RFID is just one example of something that is fairly cheap and very dependable.

It’s also true that objects can respond with more than a “Yo, I’m here.” In particular, they are likely to tell us exactly where they are, and whether they are in use. But for the most part, these things tend to be fairly inert when it comes to intelligence. They might be warehouse items or objects in retail stores. Volume is a key factor. RFID tags are cheap enough that an organization can tag tens of thousands or hundreds of thousands of items.

Immobile Things, but Mobile Users.

We can use the Internet of Things concept in another mode. The objects might be immobile, but the users might be highly mobile, and they might be carrying the tags. The objects might have computing capabilities in them, as well. If I work in a secure facility, and if I use a variety of computing devices in the course of the workday, I can be carefully tracked. And every machine could be engineered to allow me to perform only those functions for which I have been authorized. The computers could also track suspicious trends that involve multiple machines and multiple users over a period of time.
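A sketch of that second mode might look like the following Python fragment. The authorization table, the tag and machine names, and the idea of counting denied requests as a “suspicious trend” are all assumptions made for the sake of the example.

```python
# Hypothetical authorization table: tag id -> functions the wearer may perform.
AUTHORIZED = {
    "tag-buzz": {"read_reports", "print"},
    "tag-admin": {"read_reports", "print", "edit_records"},
}

access_log = []  # (tag id, machine, function, allowed) tuples

def request(tag_id, machine, function):
    """Check a request against the table and log the outcome."""
    allowed = function in AUTHORIZED.get(tag_id, set())
    access_log.append((tag_id, machine, function, allowed))
    return allowed

def denied_counts(log):
    """Count denied requests per tag across all machines."""
    counts = {}
    for tag_id, _machine, _function, allowed in log:
        if not allowed:
            counts[tag_id] = counts.get(tag_id, 0) + 1
    return counts

request("tag-buzz", "lab-1", "print")
request("tag-buzz", "lab-1", "edit_records")
request("tag-buzz", "lab-2", "edit_records")
print(denied_counts(access_log))
```

Because the log spans machines, a pattern that no single machine would notice, such as the same person being refused in two different labs, becomes visible.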

The Internet of Things and the Internet of Web Apps.

What does this all have to do with the Internet we are concerned with in this blog, the one that hosts next generation web apps? The two worlds could be blended together.

Consider this. When we buy things on the web, we normally use one of two retail models. If the object is software or data or in any downloadable electronic form, the website can ensure that by the end of the shopping session, our credit card has been charged and we have received the goods. This makes both the seller and the buyer happy.

Or, if the object is physical, like a printed book, the website will ensure that by the end of the session, our credit card has been charged, and we have been given a shipping number, a shipping date, or some other piece of information that gives us some assurance that we will get what we paid for. In this mode, the seller is likely to be quite happy, and the buyer might not be quite so happy.

But there’s another way. At the end of a retail session, the buyer of a physical product could be given the ID of the particular object being purchased, and then, via the retail website, track that object nonstop from the moment the session ends until the moment it arrives. The buyer could even track the construction of a purchased object out of many subcomponents.

The Bigger Picture.

Here’s something to think about, something else that can be used in concert with the advanced web technology and the Internet of things concept. It’s called “ubiquitous computing”, and it is a concept that has been around for many years. It refers to the expansion of computing technology into every aspect of our lives.

Putting all of this technology together means that the new web is working its way into law enforcement, supply chains, manufacturing processes, retail shopping, education, etc., etc., etc.

This will have a huge impact over the next decade.



May 3 2009   3:00AM GMT

Email addresses, the new Web, and NASCAR.



Posted by: Roger “Buzz” King
NASCAR-like web ads, Web 2.0, Web 3.0, the Semantic Web, XML, namespaces, free email accounts, web services, web-based ads

The Semantic Web.

This blog concerns advanced Web technology, in particular, Web 2.0/3.0 and the Semantic Web. Each blog entry should be fully understandable on its own, but the blog as a whole tells a continuing story.

Very roughly, we’ve defined the Web 2.0/3.0 as the class of emerging web applications that are highly responsive, to the point of being competitive with desktop apps. Another characteristic is that they can manage large volumes of very complex media, like images, sound, and animation, as well as interconnected forms of media. We’ve looked at some specific advanced web applications.

Our concern here, in this blog entry, is the Semantic Web, which we have also roughly defined. The Semantic Web is something that does not yet exist, but would meet the very aggressive goal of supporting largely automatic web searches, freeing us from excruciatingly interactive, manual Google and Yahoo sessions. And we’ve seen that we would use such things as shared namespaces, intelligent full text searching, and XML-based markup languages to embed information in websites that could be used by smart browsers to perform far more accurate searches.

Web services would help a lot, too, by taking humans out of the loop when providing powerful web-based capabilities; one website can now provide a vast amount of information, for example, by silently using web services to collect information from many other web-based sources.

(By the way, we have also looked at precisely what we mean by “semantic” in the Semantic Web.)

The way we pay.

This all sounds very good. The Web would be far more useful, with automatically searchable Semantic Web-sites. But there’s a bad side to all of this, and it has to do with how we often pay for Web use.

The problem is that we often do not pay at all. At least not directly, with money. We pay by putting up with ads. Free email services, such as those hustled by Yahoo, Hotmail, AOL, and Mail.com, are generally accessed via web browsers, and we find the main pages of these email accounts stuffed with ads.

Some free email accounts even stick ads in your outgoing mail!

Often, the only way to get the ads stripped from a web mail interface is to pay a fee. We might also get more than just ad-free web mail pages; paying sometimes allows users to access their email with POP or IMAP protocols, via desktop clients (like Outlook and Apple Mail), thus avoiding ads in another way.

(As an aside, there are free email sites that either have no ads in them, or only very subtle ones. Try Gmail.com and Inbox.com. My favorite, with its clean interface and growing set of accompanying capabilities, is GMX.com.)

As it turns out, folks looking to buy ad space online find that they have a vast array of choices, and this drives down the cost of ad space. But these two things, an ever-growing list of free online services and cheap ad space, are related. This is because it is all too easy to build useful web applications. Like browsers, bulletin boards, calendar apps, blogging services, and stickies applications, email servers are cheap to build and maintain. Vendors can use canned, largely free software components.

And, transmission costs on the Internet are effectively free, and the bandwidth is huge. Free email accounts often offer a gigabyte or several gigabytes of storage, because disk space is dirt cheap, too.

There is a lot of rebranding going on, too, where someone seems to be offering free email (or some other service), but it is actually being provided by a large email provider.

So, the way things have shaken out, free web apps like email servers look like NASCAR racing cars, covered with colorful ads. Many of these ads consist of video, and so we have to battle distracting, flashing colors so we can focus on our mail.

The trick behind online ads.

There is something happening in the online ad world: folks who provide these free, pay-for-it-with-ads services are learning to carefully target ads. There is specialized software available for this, and by plugging in some smarts, folks can make the ads that appear on your screen far more likely to be of interest to you.

How is this done? By watching what you type into search engines, by taking advantage of personal information you supply when you sign up for free email accounts and other services, and by carefully examining the content of the messages you send and receive, that’s how it’s done.

It’s important to point out that this works. The “click through” rate on ads can be radically improved, just by using some simple heuristics in choosing your ads. Folks who pay for ads love this, and it has allowed individuals who don’t even provide free web applications to turn themselves into ad space sellers. Your blog, your specialized website, can now host ads carefully targeted toward the visitors to your blog or your website.
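
Just to show how simple such a heuristic can be, here is a toy sketch: score each ad by how many of its keywords appear in something the user typed, and show the best match. The ad names, keywords, and the pick_ad function are all hypothetical; this is not how any particular ad network actually works.

```python
import re

# Hypothetical ad inventory: ad name -> keywords the advertiser chose.
ADS = {
    "Trail running shoes": {"running", "marathon", "shoes", "trail"},
    "Discount flights": {"flight", "travel", "hotel", "vacation"},
    "Guitar lessons": {"guitar", "music", "lessons", "chords"},
}


def words(text):
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def pick_ad(user_text):
    """Return the ad whose keywords overlap most with the user's text, or None."""
    seen = words(user_text)
    scores = {name: len(keywords & seen) for name, keywords in ADS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(pick_ad("Training for my first marathon, need new shoes for trail days"))
```

Feed it a search query, a sign-up form, or the text of an email, and the same few lines do the targeting.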

But just wait for the Semantic Web.

But ad targeting will really kick in when the Semantic Web is here. The same technology that would make browsers far, far smarter about finding good URLs for you will make the targeting of ads at you extremely precise.

This slowly-emerging technology is badly needed by the folks who sell ad space and by the people who buy that ad space. That’s because you and I are starting to get used to this world of NASCAR websites. We are looking through or past or around the ads. They need to be made a lot smarter, in order to get our attention back.

But by using Semantic Web technology to radically increase click-through rates, by getting us interested in ads again, impulse shopping on the Web might skyrocket. It’s very easy to go from seeing an ad for a product you have never heard of before to having bought it.

Like little kids watching commercials for sugar-heavy cereals on Saturday cartoon shows, we will be manipulated like we have never imagined before. That’s the bad side to the Semantic Web.



Apr 26 2009   8:09PM GMT

The world of advanced Web applications: what are they?



Posted by: Roger “Buzz” King
Web 2.0, Web 3.0, the Semantic Web, XML, mashups, wikis, social networking sites, tagging, distance education, zenbe.com, evernote, GlideOS, namespaces, web services

This blog is dedicated to an ongoing discussion of Web 2.0/3.0 and the Semantic Web. The slant is on the technology itself, how it works and what’s going on inside advanced Web applications. We’ve looked at a couple of different Web 2.0 apps, in particular, Evernote and GlideOS. We’ve tried to characterize the capabilities of Web apps.

The impact of the new Web.

This posting addresses a non-technical question: What has been the impact of this technology on our society?

Technological advancement can be very roughly broken into two groups: incremental and radical. Which of these is Web 2.0/3.0? Is it a radical advance?

Consider what highly responsive, multimedia web applications have done for us. They have enabled the development of:

* Wikis: These are web applications that allow us to collaboratively develop sophisticated, easily searchable information bases. These can range from dictionaries for specialized disciplines to vast databases containing DNA information. Data can be vetted by experts and/or challenged by random users.

Everybody knows about Wikipedia, but like blog and bulletin board software, wiki software can be easily installed and configured for deployment on almost any web server, whether it is publicly accessible, or used privately within a corporation or by a professional organization.

* Social networking sites: These are web applications that allow us to actively participate in a myriad of communities based on professional and personal interests. We find work, develop contacts, share music and photographs and video, and develop lifelong collaborations with people we would never have met otherwise.

They are also used by people who are in daily physical contact, but who find they can deepen their relationships by posting personal information on public sites like MySpace and Facebook. The interesting thing about these sites is that new and successful ones keep emerging.

* Tagged content vendor sites: Volunteers and paid individuals can contribute multimedia content and collaboratively tag it, using both freeform and highly sophisticated tagging protocols, such as the MPEG-7 standard. (We will look at MPEG-7 in a future posting of this blog.) This content includes images and sound and video, and many taggers are highly trained professionals who can carefully categorize content according to its detailed meaning. This technology makes a vast sea of otherwise-unknown assets available to us. It also makes these assets searchable, thus transforming a completely intractable task into something we easily perform.

In particular, this has radically enhanced the creative power of both professional and hobbyist animators by giving them complex scenery and character components to work with. Check out thoughtequity.com for an example of a content vendor. Take a look at daz3d.com for animation content.

* Mashups: These are portal or second tier web applications that take content from other web sources, such as Google Maps, investment information, medical advice, and scientific data. Often mashups take data from several or hundreds of other sites and create complex, highly valuable multimedia assets.

Take a look at woozor.com. It combines Google map and weather data.

* Distance learning: Universities, corporations, professional organizations, and lone instructors can develop and sell effective, multimedia educational packages that bring education to anyone who has Internet access. This allows us to retrain ourselves for new occupations, stay current in our professional skills, and find employment that is satisfying, steady, and high paying.

I teach on my university’s distance learning site, and we use video, sound, desktop video capture, slide presentations, and software demonstrations - and they can all be edited into a unified product. There are online universities now, where you can get a college degree. Take a look at jonesuniversity.com.

* Hybrid applications that support things like email, calendar, collaboration, RSS feeds, etc.

A good example of a hybrid application is zenbe.com, which provides a combined web-based email, list making, and calendar application, and in that sense is similar to many other email providers. But Zenbe also provides a collaborative tool called Zenbe Pages, which can be used by collaborators to organize their activities. A Zenbe page can have notes, calendars, lists, RSS feeds (not new ones, but existing RSS feeds) on them. Zenbe also provides quick access to Twitter, Google Talk, and Facebook.

By the way, it’s important to point out that the categories I list above are not as clear-cut as one might think. Many modern web apps contain elements from more than one of these categories.

The software building blocks.

From a programming perspective, what specific Web 2.0/3.0 software has allowed all of this to come about? We’ve discussed much of this already in previous postings of this blog. It includes XML and the exploding class of XML languages, namespaces, IDE’s (Integrated Development Environments), large code bases (such as the vast library of ready-made Java components), web service software development tools, and AJAX web page optimization technology. It also includes web development frameworks like Ruby on Rails, and newer ones, engineered toward high responsiveness, like Flex and Silverlight.

Also included are powerful media formats, codecs, players, and editors, which allow web users to do more than upload and search media; we can edit it and reform video, images, and sound, without leaving the simple world of our browsers. And of course, modern mega media apps enable us to build media assets. The list of contributing software tools goes on, but we’ll stop here.

It scales!

And there is something subtle, but important that gives advanced web technology extraordinary power: it scales. We manage shared resources that are truly gigantic in size, and are spread across countless machines around the world. We leverage global user bases, cheap server technology, and wide open Internet bandwidth to give media stores belonging to Web apps astonishing growth rates.

The bottom line.

Yep. Web 2.0/3.0, as a whole, is a truly radical advancement. It has fundamentally and globally changed society in a big way.



Apr 19 2009   2:31AM GMT

There are Web apps and then there are Web apps.



Posted by: Roger “Buzz” King
Web 2.0, Web 3.0, the Semantic Web, web applications, Filemaker, evernote, SMIL, XML, Glide

In our continuing series on Web 2.0/3.0 and the Semantic Web, we have looked at one simple, yet impressive Web application, called Evernote. There are significant advantages of Web apps; in particular, the application is available wherever you can get onto the Web, you don’t have to run and maintain complex desktop software, and your data sits on a (hopefully) secure and backed-up data server.

Web Apps.

We noted that some Web apps, including Evernote, are both Web-based and desktop-based. Seemingly, this might be a disadvantage, because now, the user does have to install and maintain the desktop version of the app. But, in exchange, you have two copies of your data, at different physical locations. You also can use the app when you are not on the Internet. And, as far as Evernote goes, the desktop app is very far from difficult to manage.

Let’s look at this a little closer. Not all Web apps are the same. One problem is that too many vendors feel compelled to brag about the Web capabilities of their projects, and so we have to be suspicious - especially when it comes to older applications that have been retrofitted with Web capabilities.

Let’s look at a few applications. Please keep in mind that the first two applications are not advertised as “Web apps”. I am describing them only as a way of categorizing the Web capabilities of applications in general.

Minimal capabilities: exporting to the Web.

Our first example is an application that runs on Macs and is very impressive. It’s called Curio, and is made by a company called Zengobi. It gives you a workspace to which you can append text notes, lists, images, video, and sound clips. It also supports diagrammatic mind-maps. It’s great for a wide class of brainstorming techniques from simple note-taking to sophisticated workflow planning. Its all-in-one nature makes it a little imposing and chaotic at first, but it is actually quick to master - and then its freeform nature proves itself to be very powerful. It is also very elegant.

Curio’s Web capabilities are extremely limited, however. All you can do is output a Curio file as a fixed HTML page. It cannot be updated over the Web. For convenience, it can export a file directly to your “Mac” Web account, if you own one.

Modest, often tacked-on Web capabilities.

Another example application is Filemaker. (I am referring to their products called Filemaker Pro and Filemaker Pro Advanced, since they are what I have used in my classes at the University of Colorado.) I teach database management systems, and I can say lots of good things about Filemaker. It is a very quick and simple way to get a full-fledged, scalable, visually-pleasing desktop database up and running. I like it.

But its Web capabilities are typical of applications that have added Web capabilities long after the fact. What you can do with Filemaker is “publish” a database on the Web, and allow Web-based updating and searching. It in effect turns the machine hosting the database into a simple server. But most of Filemaker’s capabilities are not available via the Web interface. And, the database only exists on its original site. All data remains there.

Native, full Web capabilities.

So, what’s a true Web app? I’d say it is an application whose native interface is Web-based, and where all or virtually all of its capabilities are available via a browser. Evernote is a good example.

There is a fuzzy line between “websites” and “Web applications”, as we have previously discussed. And in fact, some people consider virtually all powerful websites to be Web apps. This includes Amazon, Blogger, and Wikipedia, as well as countless lesser-known websites.

And, with respect to the deliberately narrow criteria we’re using here, these applications are indeed Web apps.

So, what characteristics do we see in applications that are powerful, and have native, complete Web interfaces? They are likely to store data persistently in a serverized database management system like MySQL, and present the user with web forms to fill in, and return to the user dynamic Web pages populated from the database. A website that we might be willing to label “Web 2.0” would be one that is highly responsive and manages large amounts of data.
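
The form-in, database, page-out cycle can be sketched in a few lines of Python. This is only an illustration under some loud assumptions: an in-memory SQLite database stands in for a server database like MySQL, a plain dictionary stands in for a submitted web form, and the notes table and function names are made up.

```python
import sqlite3
from html import escape

# In-memory SQLite stands in for a server database such as MySQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")


def handle_form(form):
    """Store one submitted web form (a dict of field name -> value)."""
    db.execute(
        "INSERT INTO notes (title, body) VALUES (?, ?)",
        (form["title"], form["body"]),
    )
    db.commit()


def render_page():
    """Build a dynamic HTML page from whatever is currently in the database."""
    rows = db.execute("SELECT title, body FROM notes ORDER BY id").fetchall()
    items = "".join(
        f"<li><b>{escape(title)}</b>: {escape(body)}</li>" for title, body in rows
    )
    return f"<html><body><ul>{items}</ul></body></html>"


handle_form({"title": "Groceries", "body": "milk & eggs"})
handle_form({"title": "Ideas", "body": "write a <blog> post"})
print(render_page())
```

The page is never stored anywhere; it is rebuilt from the database each time it is requested, which is what makes it “dynamic”.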

We might call it Web 3.0 if it also manages large volumes of continuous data (like audio and video), and presents to the user a highly multimedia web interface. But these terms are vague, and drawing lines between them is to a certain degree misleading and a distraction.

Perhaps something that might be a truly Web 3.0 characteristic is that the application, rather than just delivering up video and audio, uses a combination of multiple forms of media, in concert, to interact with the user. We looked at SMIL, an XML language that allows the user to build presentations that coordinate multiple forms of media, such as images, sound, and video. The SMIL programmer can arrange media on the screen, and specify how the various pieces of media will be displayed over time.
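
As a reminder of what that looks like, here is a small Python sketch that builds a SMIL document with the standard library. The region names and media file names are hypothetical. The layout section says where things go on the screen, and the par and seq elements say what plays together and what plays in sequence.

```python
import xml.etree.ElementTree as ET

smil = ET.Element("smil")

# Layout: where each piece of media appears on the screen.
layout = ET.SubElement(ET.SubElement(smil, "head"), "layout")
ET.SubElement(layout, "root-layout", width="640", height="480")
ET.SubElement(layout, "region", id="picture", left="0", top="0",
              width="640", height="400")
ET.SubElement(layout, "region", id="caption", left="0", top="400",
              width="640", height="80")

# Timing: <par> plays its children together, <seq> plays them one after another.
body = ET.SubElement(smil, "body")
par = ET.SubElement(body, "par")
ET.SubElement(par, "audio", src="narration.mp3")
seq = ET.SubElement(par, "seq")
ET.SubElement(seq, "img", src="slide1.jpg", region="picture", dur="5s")
ET.SubElement(seq, "img", src="slide2.jpg", region="picture", dur="5s")
ET.SubElement(par, "text", src="caption.txt", region="caption", dur="10s")

document = ET.tostring(smil, encoding="unicode")
print(document)
```

Here two slides are shown one after the other while the narration and a caption run alongside them.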

Glide: the Web-based desktop.

But let’s look at one very, very aggressive attempt at a true Web 3.0 application. It’s called Glide, and you can get yourself a free account. This application does not support any sort of desktop-based version, and so you do have to be online to use it. It also needs a very fast Internet connection, because of the wide variety and high volume of data it allows you to manipulate.

What’s Glide? It is advertised as “the complete mobile desktop solution”, and it provides a complete, virtual, web-based computer. With it, you can edit photos, draw diagrams, store media files, send and receive email, manage a calendar, manage video, write documents, even build a website - in other words, do almost everything a non-programmer might want to do with a computer.

Its interface consists of three main windows. One is a virtual desktop, with various applications ready to use; another is a portal where the user can access the Web and develop websites; the third is a virtual hard drive, where media and files created by the various applications can be stored and accessed.

Is this the way of the future? It completely frees a user from having to buy, install, and maintain complex, expensive applications, although you still need a computer with a browser to run it. One drawback is that none of its apps, as near as I could tell, can compete with the dominant desktop applications. It is not Photoshop, it is not Dreamweaver, and it is not MS Office Outlook. But its apps are not trivial: they do the job just fine. And the entire interface is simple and visually pleasing.

There is also a way to sync your files on your desktop with the files on the Glide servers, and their documents and spreadsheets are apparently compatible (to some degree) with Microsoft’s Word and Excel. But they apparently are not planning on creating any sort of hybrid web/desktop based product. Glide’s goal is to move us all toward the Web and away from our desktops.

The Glide servers seemed fast enough to me, by the way. That’s the big question. Can it be as responsive as a desktop computer? Well, it’s as fast as my Vista machine… But slower than my iMac.

Give it a try.




Apr 13 2009   3:27AM GMT

Mega Media Apps: A Huge Challenge for Web 3.0



Posted by: Roger “Buzz” King
3D animation, Web 2.0, Web 3.0, Maya, Video, codecs, video containers, continuous data, blob data, web applications, media applications, 3D modeling

What Are Web 2.0 and Web 3.0 Apps?

In our continuing series on Web 2.0/3.0 and Semantic Web technology, we’ve discussed one particularly impressive Web 2.0 app: Evernote. The challenge is to get the best of both worlds: the interactive performance of a desktop application, and the use-it-from-anywhere convenience of the Web. Many Web applications - such as Evernote - also ensure offline usability by providing both a desktop and webpage interface, and maintaining a local version of the database, which is periodically synched with the web-resident database.

But, as cleverly engineered as it is, and as useful as it is, Evernote is still a very simple application. What about big applications? What challenges face the developers of Web 3.0 applications, ones that will manipulate large databases of continuous data, and extra-large instances of blob data? (Video and sound are continuous; an image is blob data.)

Let’s consider one of the biggest media apps out there: Maya, the high-end 3D application that is widely used to make full length animated movies. (See http://autodesk.com for Maya.)

What’s the big problem? If an application like Maya was reengineered as a Web app along the lines of Evernote, would it be usable? Might it be intractable to be continuously moving complex animation data between the server and your client machine?

3D Geometry: Just How Big Is It?

Well, the problem is not the complex geometric models that an application like Maya must store and manipulate. 3D animation applications like Maya tend to support multiple ways of creating 3D shapes, and they do indeed tend to be very data-intensive. The first image at the bottom of this page shows a Maya screen with two spheres, one built with straight line geometry and one built with curved line geometry.

As it turns out, to make the straight line model smooth, you would need to use many more lines and vertices than I have in the model in the image. But if you think about it, the straight line model uses the geodesic dome approach; it builds a 3D sphere out of many 2D polygons - which are flat. The more polygons, the smoother the model. In the other model, we use curved lines, and so the model looks much smoother, even with not that much detail. But the mathematics are complex.

You can imagine that a dense scene, with a very large number of detailed, 3D models of these sorts would contain a lot of data. But no, that’s not the problem. These models can be uploaded and downloaded very quickly. They aren’t as big as you might imagine - because they are not continuous data. They are blobs, either binary or of code text, and are reasonably manageable.
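
A quick back-of-the-envelope calculation shows why. The sketch below counts the vertices and faces of a simple latitude/longitude polygon sphere and estimates its size. The storage format (three 4-byte numbers per vertex, four 4-byte indices per face) is my own assumption for illustration, not Maya’s actual file format.

```python
def uv_sphere_stats(rings, segments):
    """Count vertices and faces of a latitude/longitude ("UV") polygon sphere.

    rings    -- number of horizontal bands from pole to pole
    segments -- number of faces around each band
    """
    vertices = segments * (rings - 1) + 2      # inner rings plus the two poles
    faces = segments * rings                   # triangles at the poles, quads elsewhere
    # Assumed storage: 3 four-byte floats per vertex, 4 four-byte indices per face.
    size_bytes = vertices * 12 + faces * 16
    return vertices, faces, size_bytes


for rings, segments in [(8, 16), (32, 64), (128, 256)]:
    v, f, size = uv_sphere_stats(rings, segments)
    print(f"{rings:>3} x {segments:<3}: {v:>6} vertices, {f:>6} faces, ~{size / 1024:.0f} KB")
```

Under these assumptions, even the densest sphere in the list, with more than 32,000 faces, stays under a megabyte.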

The Killer Problem: Video.

The problem? It’s what Maya creates at the end of the design process, when Maya renders a scene so we can watch it. It renders video. And video, whether you are looking at video shot with your home camera, or at video rendered by Maya, or video I create when I capture desktop videos on how to use Maya and post it for my animation students, well, it’s big. Really big.

Video is the killer. Video makes a lot of mega apps, and even very simple apps that happen to create video, not scale. We could manage a modest number of modest-sized video segments via a web interface, but not big chunks of video. To make videos even worse, we usually have to add a sound track.

So, the lesson is that many or most applications that create and/or edit video in any form face this challenge.

This is why we use video compression. First, you need a container, which is a way of bundling the huge series of still images that make up the video with the sound, so that we can move it around as a single object. (Keep in mind that video often consists of at least 25 frames, or still images, per second - and that makes for big pieces of continuous data.) Popular containers for small scale projects (such as animations that will be marketed via CDs) are .mov and .avi. The first is the Apple Quicktime standard, and the second is due to Microsoft.

Once you have a container, you need a codec, which is a way of compressing and decompressing video, so that it isn’t so big when you move it over the Internet or store it on a small storage device. “Codec” is short for “coder-decoder”. It cannot be overstated how powerful a codec can be; I routinely turn gigabyte videos submitted by my students into less-than-100 megabyte videos. They can be uploaded to a website and then played, and at least in a small box on a web page, they look great.
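
To put numbers on it, here is a small worked calculation. The frame size and color depth are my own assumptions (640 by 480 pixels, 3 bytes per pixel); the 25 frames per second and the roughly ten-to-one shrinkage come from the discussion above.

```python
def raw_video_bytes(width, height, fps, seconds, bytes_per_pixel=3):
    """Size of uncompressed video: one full still image per frame."""
    return width * height * bytes_per_pixel * fps * seconds


# Assumed example: one minute of 640x480, 24-bit color video at 25 frames per second.
raw = raw_video_bytes(640, 480, fps=25, seconds=60)
print(f"raw: {raw / 1e9:.2f} GB")            # raw: 1.38 GB

# A codec that shrinks a gigabyte to under 100 megabytes achieves at least 10:1.
compressed = raw / 10
print(f"at 10:1: {compressed / 1e6:.0f} MB")  # at 10:1: 138 MB
```

So a single minute of modest, uncompressed video is already over a gigabyte, and that is before adding the sound track.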

But if you want quality, if you don’t want to lose detail, and in particular, if you are going to display a video on a large display (or at the movie theatre), you often cannot compress it enough.

That’s it. That’s the problem, and it’s one of the biggest challenges facing the makers of Web 3.0 apps, which are supposed to fluidly manipulate video segments.

A Far Bigger, Far More Universal Problem.

But perhaps the old video challenge, the one that is constantly shoved in the face of next-generation web app developers, is a distraction, something that draws us away from the real problem, the one that kills many media apps, even when they are totally desktop-based. What is it? Take a look at the animation designer’s interface to Maya, in the second image at the bottom of this page.

The problem is the size and complexity of these apps. They are made up of multiple complex windows. They have menus, palettes, and lots of little boxes that contain detailed information. Keep in mind that you only see one of the Maya windows in the image below, at the bottom of the page, and it is already too dense for a single screen, even a large one. Looking more closely at the window in this image, note that there are several places on it that contain drop down menus. Many of these menu items lead to other drop down menus. Even the main menu at the top is changed frequently during the process of creating an animation project. The designer’s GUI as a whole changes during the process of using the app.

It is very hard to fathom the incredible complexity of an interface like Maya’s until you use it. Professional video editing applications are typically simpler, but are still very complex, especially if the application supports special effects and the insertion of text. Even applications intended for the average Joe, like Photoshop Elements, are often horrifically complex.

The Bottom Line.

The problem that faces developers of all sorts of next-generation apps that must manipulate animation or sound or video or images, or that format complex documents for publication (like Adobe InDesign), or support the development of complex web pages (like Adobe Dreamweaver), is this: it is near-intractable or perhaps completely impossible to build an interface that explains to the user the process of using the application. Little wizards or chunks of documentation that contain “recipe” steps, don’t come within a thousand light-years of conveying how to use that app as a whole.

That’s it. True Web 3.0 applications would convey not just a vast, deeply embedded toolset, but the way the tools should be used. That’s the big challenge.

By the way, if you want to see a handful of videos made by my introductory animation students, go to my website at http://buzzking.squarespace.com and look at the right column, near the bottom of the page.


Apr 2 2009   5:59AM GMT

Full Text searching: clever heuristics for managing large web-based document collections.



Posted by: Roger “Buzz” King
XML, the Semantic Web, SMIL, web applications, Web 2.0, documents, Web 3.0, databases, MySQL, SQL Server, Multimedia, full text, full text searching

There is an explosion of technology for supporting sophisticated forms of media on websites and in web applications. In our continuing series on advanced web applications (in particular, as they pertain to the Semantic Web and Web 2.0/3.0), we’ve looked at continuous media, in particular, video and multimedia presentations. But there is a very old form of continuous media, something that is perhaps the dominant media on the Web, and that’s text.

It’s becoming a very major issue in web development.

Text.

In this blog entry, we’ll be looking at a particular form of text, called “full text”.

But just what is text to begin with? It’s character-based data, anything we can read.

And what will we want to do with it in next-generation web applications? It’s important to note that more and more vast libraries of documents are being put online. Web applications need to provide far faster and more accurate searches of documents than what we can perform with Google.

Interestingly, a successful technology, called “full text retrieval”, is already in place in the relational database systems that underlie modern web applications. It’s there working for us, and we are likely to not be aware of how clever it is.

It’s also something that should be used much more heavily by web application developers.

Let’s step back and consider three different - and increasingly more sophisticated - ways of managing character data.

Atomic Character Attributes.

First, there is the traditional relational database approach, whereby data is stored as tables made of rows of atomic, fixed sized attributes. By atomic, we mean that each attribute has no internal structure. So, a table of insurance claims might have rows with the following attributes: Claims-ID (an integer), Amount (an integer), Medical_Problem (a fixed length character string), and Subscriber_Name (a fixed length character string). Using SQL, the universal database “query” language, we might look for all rows that contain the name “Fred Jones”. Or, we might search for all rows that have claim numbers that are between 110 and 115.

Essentially, this approach limits us to comparing small strings of data to each other or to fixed values. There are some common extensions that we find in relational databases, such as the LIKE operator, which lets us ask for all rows where the Medical_Problem is something like “broken leg”. Then if a row actually has the value “broken legs”, we would most likely see this row in our results.
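Here is a small sketch of that kind of pattern match, again using Python's sqlite3 module with a made-up table and made-up rows. The trailing % in the pattern is the standard SQL wildcard for "any characters".

```python
import sqlite3

# Illustrative table and data; not taken from any real system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (claim_id INTEGER, medical_problem TEXT)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?)",
    [
        (1, "broken leg"),
        (2, "broken legs"),
        (3, "fractured tibia"),
    ],
)

# LIKE with a trailing % matches any value that starts with "broken leg".
like_matches = [
    row[0]
    for row in conn.execute(
        "SELECT claim_id FROM claims WHERE medical_problem LIKE ? ORDER BY claim_id",
        ("broken leg%",),
    )
]

print(like_matches)  # [1, 2]
```

Notice that row 3 is not returned, even though a fractured tibia is a kind of broken leg. The pattern match only looks at characters, not at meaning.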

Full Text.

Second, there is the ability to search pieces of text according to their natural language (in this case, English) meaning. Here, we consider the character data to have internal structure, and the values are not considered atomic. Often, these pieces of text are long and of variable length from one row to the next.

It is actually an extension, though a very dramatic one, of the LIKE operator in SQL.

It is what we call “full text” management or retrieval, and modern relational database management systems like MySQL and Microsoft SQL Server support this. This was seen long ago as a critical extension to relational database technology. Thus, we might rename our Medical_Problem field to Doctor’s_Diagnosis, and allow free form English text in this attribute, as well as allowing the value to be quite long. Then we might search for all rows where the doctor describes “fractures of the lower limbs”. Notice that none of these words might actually appear in the attribute, which might simply refer to “broken legs”.

Natural Language Processing.

This capability would clearly be very powerful, if we could do it right. The problem is that to support it fully, we would need to use highly advanced natural language processing techniques, which are very time consuming to execute, especially on huge databases of large documents. The full text approach tries to simulate true natural language searching in a far less expensive way. The real thing, by the way, might not be all that accurate anyway. Natural language is naturally ambiguous and very subtle.

True natural language searching would be our third way of processing character-based data, by the way. It is not a fully developed technology. And importantly, we usually don’t need anything that fancy.

The Clever Compromise.

So, our middle option, full text searching, is what dominates today - and it is a surprisingly accurate, and efficient, technique that operates on a small set of heuristics. It can transform a dumb webpage where we can only search for small, fixed character strings, to a rich next-generation webpage that can effectively be searched according to its meaning. It allows us to manage very large text documents in web applications - and gets us surprisingly close to the semantic power of true natural language searching.

We’re not going to go into a lot of detail here, but here are some of the heuristics that are used in full text search. First, “stemming” and related techniques are used; they reduce conjugated verbs and plural nouns to a common root form, typically by removing prefixes and suffixes. Another technique is to use a “stop list” that lists words that should be ignored, like “the”. The system might also let us specify the “proximity” of words; this refers to how closely specific words should appear in a document. It can also be powerful to include a synonym checker. And the ability to allow for “wild cards”, in particular, letters that may vary in a passage without changing its meaning, can be quite useful. Dictionaries of technical words that pertain to specific domains (like medicine or law) are very useful. We might also provide a feedback capability, whereby users can train full text search engines to be more accurate.
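To give a feel for how three of these heuristics fit together, here is a toy sketch in Python that combines a stop list, a very crude stemmer, and a synonym table. Everything in it is an assumption made for illustration: the word lists, the suffix rules, and the function names stem, normalize, and search are hypothetical, and real full text engines are far more careful than this.

```python
import re

# Tiny, hand-picked lists; a real engine would ship much larger ones.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "is"}
SYNONYMS = {"fracture": "break", "broken": "break", "limb": "leg"}


def stem(word):
    """Crude stemmer: strip one common suffix if enough of the word remains."""
    for suffix in ("ing", "ed", "s"):
        if (
            word.endswith(suffix)
            and len(word) - len(suffix) >= 3
            and not word.endswith("ss")
        ):
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, drop stop words, stem, then map synonyms to one term."""
    terms = set()
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in STOP_WORDS:
            continue
        root = stem(token)
        terms.add(SYNONYMS.get(root, root))
    return terms


def search(query, docs):
    """Return the documents sharing at least one normalized term, best first."""
    wanted = normalize(query)
    scored = [(len(wanted & normalize(doc)), doc) for doc in docs]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [doc for score, doc in ranked if score > 0]


docs = [
    "Patient has broken legs",
    "Sprained wrist after a fall",
    "Mild concussion",
]
hits = search("fractures of the lower limbs", docs)
print(hits)  # ['Patient has broken legs']
```

The query and the matching document share no words at all as written. They only meet after the stop words are dropped, the plurals are stemmed, and the synonyms are folded together.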

This clearly doesn’t come anywhere near true natural language processing - but it is fast. It will be a growing technology on the new web, with a lot of hidden development, making this heuristic-based technique more and more effective.

Indexing.

We should note that there is a significant up-front cost in preparing a document for full text searching: we need to build an index with an entry for every (non-stop) word in the text. Then, when a query is executed, we can look for words in the document by searching the index, instead of searching the full text. If there were no index, the search would be extremely time-consuming.
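This kind of index is usually called an inverted index. Here is a minimal sketch of one in Python, assuming a made-up stop list and three made-up documents; the function names build_index and lookup are hypothetical, and production indexes store much more (word positions, frequencies, and so on).

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "and", "is", "has"}


def build_index(docs):
    """Map every non-stop word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOP_WORDS:
                index[word].add(doc_id)
    return index


def lookup(index, query):
    """Return IDs of documents containing every non-stop word in the query."""
    words = [
        w for w in re.findall(r"[a-z]+", query.lower()) if w not in STOP_WORDS
    ]
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


docs = {
    1: "The patient has a broken leg",
    2: "The leg is healing well",
    3: "Broken wrist and bruised arm",
}
index = build_index(docs)  # the up-front cost is paid once, here

print(lookup(index, "the broken leg"))  # {1}
print(sorted(lookup(index, "leg")))     # [1, 2]
```

Each query now costs a few dictionary lookups and set intersections, no matter how long the documents are. The price is the time and space spent building the index in the first place.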

The Future.

As more and more governmental, educational, medical, and other complex documents become available on the web, advanced full text searching will enable us to search vast databases in a tractable fashion. Even more clever full text retrieval engines will turn dumb, “gotta Google them” document portals into true Web 3.0 and Semantic Web applications.