Managing advanced forms of media, such as images, sound, video, natural language text, and animated models, has been discussed a number of times in this blog in the past. Traditional information systems, such as relational databases, have been engineered largely to handle the sorts of data we have in business applications: primarily simple numeric and character string data. To the SQL database programmer, the nice part is that the data speaks for itself. If a field is called Name, and the value is Buzz King, the semantics of “Buzz King” is pretty obvious, and it can be processed in a largely automatic fashion. The same goes for a field called Age, with a value of “97”.
Searching advanced media: far, far more difficult.
But modern media is far more complex than this. “Blob” data, like images, and continuous data, like sound, video, and natural language text, are very difficult to search and interpret automatically. Two approaches have been taken to resolve this dilemma.
Tagging: the simple approach.
The first is tagging. Descriptive terms, often taken from large, shared vocabularies, are attached to pieces of media. These vocabularies can be very domain-specific, dedicated to areas like medicine, law, and engineering.
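As a small sketch of what controlled-vocabulary tagging might look like in practice (the vocabulary terms and asset names here are invented for illustration):

```python
# Sketch of controlled-vocabulary tagging: tags attached to a media
# asset must come from a shared, domain-specific vocabulary.
# The vocabulary and the asset tags are invented for illustration.

MEDICAL_VOCABULARY = {"radiograph", "femur", "fracture", "pediatric"}

def tag_asset(tags, vocabulary):
    """Accept only tags drawn from the shared vocabulary."""
    unknown = set(tags) - vocabulary
    if unknown:
        raise ValueError(f"terms not in vocabulary: {sorted(unknown)}")
    return set(tags)

asset_tags = tag_asset(["radiograph", "femur"], MEDICAL_VOCABULARY)
print(asset_tags)
```

The point of the validation step is exactly the point of shared vocabularies: a tag that is not in the agreed-upon term set is rejected rather than silently stored, so every stored tag has an agreed-upon meaning.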
Intelligent processing software: the second approach.
The second technique is the automatic processing of pieces of media using image processing, natural language processing, and other highly intelligent software. These applications are very sophisticated and understood only by experts. They often demand a lot of processing time, which makes bulk processing impossible. The results can also be haphazard: some pieces of media can be interpreted precisely, others not so precisely, and dramatic mistakes are frequent. A tennis court might be mistaken for an airplane runway. There's a huge trust factor involved in cranking up image, sound, or natural language processing software.
Often, we can provide feedback so that these applications can learn, over time, the way we want media to be interpreted. We can help the software learn the difference between a tennis player and a member of a ground crew on a small runway. All of this is hugely expensive, in terms of the cost of developing the software, and in terms of the physical resources needed to run the software.
A middle ground? Not really.
So, is there some middle ground? Something simple, yet more “intelligent”? Yes, and the answer is to take a sophisticated approach to what otherwise might be very simple tagging techniques. However, the core problem with tagging remains: we search and process tags – and not the actual data. It is an indirect, but fast process. The goal is to come as close as we can to simulating the results of such things as image processing, but to do it with a simple, yet comprehensive, accurate tag-based technology.
We’ve looked at some of the solutions that have been proposed. They include Dublin Core, MODS, and MPEG-7. The first is very simplistic. The second is more sophisticated, in that the terminology used is broader and far more precise. The third is very aggressive in that it supports the complex structuring of tag data elements.
So, what are we really doing?
In essence, we build a hierarchy of metadata and then instantiate it for every piece of media we want to catalogue and later search. What we are doing is creating a parallel database, one where every piece of blob or continuous data is accompanied by a possibly very large tree of structured tagging information. The parallel database has its own schema and an instance of it is created for every piece of media in the original media database.
The end result? Instead of creating some sort of media-centric query language, like an SQL-for-video, we give up on trying to search the media database itself. The query language remains largely ignorant of the nature of blob and continuous media. We can continue to refine and expand the schema of the parallel database until search results are satisfactory.
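A minimal sketch of this parallel-database idea (the schema fields, tag values, and media entries below are invented for illustration): every opaque blob is paired with a structured metadata record, and searches run against the records, never the media itself.

```python
# Sketch of a "parallel database": every opaque media blob is paired
# with a tree of structured tags, and searches run against the tags,
# never the media itself. Field names and records are invented.

media_store = {
    "clip-001": b"...binary video data...",
    "clip-002": b"...binary video data...",
}

# Parallel metadata: one structured record per media item.
metadata_store = {
    "clip-001": {
        "scene": {"setting": "tennis court", "time_of_day": "day"},
        "subjects": [{"kind": "person", "role": "tennis player"}],
    },
    "clip-002": {
        "scene": {"setting": "runway", "time_of_day": "dusk"},
        "subjects": [{"kind": "person", "role": "ground crew"}],
    },
}

def search(store, **criteria):
    """Return media ids whose scene metadata matches all criteria."""
    return [
        media_id
        for media_id, meta in store.items()
        if all(meta["scene"].get(k) == v for k, v in criteria.items())
    ]

print(search(metadata_store, setting="tennis court"))  # ['clip-001']
```

Refining the parallel schema, as described above, just means enriching these records; the query code never has to understand video.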
One of the most challenging applications of next-generation web technology is the support of sites that provide media assets that are used in 3D animation. Animation applications are used in television and movie productions, training and informational videos, product design and CAD, magazine ads, and on websites. Animated projects can be extremely expensive to develop, demanding highly skilled and experienced artists who are familiar with complex applications for modeling, animating, rendering, video and sound editing, storyboarding, and compositing.
This has created a rapidly-growing, high-dollar market for animation assets. There are a handful of sites that sell extensive libraries of animation assets, in particular 3D models of characters, buildings, vehicles, weapons, animals, plants, cityscapes, and the like. These models are used by a wide class of individuals, including professional animators, architects who want to place their designs in attractive surroundings, medical and scientific writers, and hobbyists.
These assets can vary from being free to costing several hundreds of dollars. They can be rudimentary, or extraordinarily detailed and lifelike.
But it can be very problematic to search these sites. Why?
Lack of tagging standards.
There are standards for tagging image and video data, publications, and many other web-based resources. These include MPEG-7, the Dublin Core, and the Metadata Object Description Schema, which have been discussed previously in this blog. But when it comes to complex forms of information, such as animation assets, it is a free-for-all, and the searching process is manual, highly iterative, and painstaking, even if you already know what sites are likely to have content you are interested in.
Complexity of evaluating an asset.
Another problem is that it is time-consuming to evaluate even a single asset once it has been identified. Models are highly complex, and they have to be read into animation applications before they can be judged. Often, you can't even download a model without buying it.
Interdependency among collections of assets.
Assets like animation models typically must be used in combination with other models and elements of animated scenes. There can be many conflicts between assets, such as highly varying vertex and line densities, differing artistic styles, and mismatched materials and textures. Putting together a reasonably matched set of assets can be extraordinarily time-consuming.
Complex and error-prone import/export processes.
There are also dozens of model formats in common use among modeling and animation applications, along with a handful of standards with names like OBJ, FBX, and COLLADA. Translating between the many proprietary and standardized formats can be very error-prone (with information being lost or changed), and the process often demands that the animator have access to applications that he or she doesn't own. These applications can easily cost thousands of dollars and take years to master, so a given animator is only likely to actively use a small set of them, often only one.
The challenge of Web 3.0.
Some people define Web 3.0 as the successor to Web 2.0 technology, which is meant to produce web applications that approach desktop applications in their interactive performance. Web 3.0, some say, would extend this technology to web apps that make use of advanced media in their interfaces and/or provide access to large media bases. Perhaps the biggest challenge facing Web 3.0 developers would be to attack the problem of animation assets, in particular, tagging, organizing, interrelating, searching, evaluating, and transforming them.
More on this in future entries of this blog…
The problem of searching for media assets.
We’ve already looked at advanced media, in particular video, audio, and animation data, in previous blog postings. In particular, we’ve looked at the subtle and complex nature of media asset semantics. We’ve seen that interpreting a piece of video, for example, is far, far more difficult than interpreting an integer or character field. Since the goal of the Semantic Web effort is to make the searching of the web highly automated, advanced media is becoming a huge and critical research and development focus for the builders of next-generation web development applications.
Just how do we provide an environment where media assets can be searched in a mostly automatic fashion, so that a human does not have to painfully paw through hundreds or thousands (or millions) of video chunks to find the right one? We’ve looked at emerging technologies for marking up advanced media information, and for making it usable in a variety of web applications. We’ve also looked at the dramatic challenge presented by mega apps to would-be users; the interfaces to these applications are truly massive and cannot, by themselves, convey to the user the way in which they are meant to be used.
The problem of proprietary formats.
One specific, and very difficult problem, is the massive heterogeneity, not just of media formats, compression technologies, and container technologies, but of the applications themselves. If we are going to automate the searching of complex modeling, video, audio, and other media assets, we’re going to have to address a key question: since many media apps make use of their own proprietary data formats, how are we going to provide automated ways of searching media assets that are stored in these formats?
The problem of highly imperfect generic formats.
There are indeed many existing, as well as soon-to-emerge, standards for importing and exporting data between powerful media applications, but transformations in and out of these formats are often “lossy”, in that information is lost or changed. In fact, locating and downloading assets that are in supposedly generic form is often very frustrating, because these assets end up not performing well. They can be difficult to edit and reuse. 3D animation models regularly blow up when animators try to import them into animation applications and then manipulate them. A hawk may look like a hawk until you try to render it with its wings flapping, and suddenly it’s a blob of geometric garbage.
One possible direction.
So, what do we do about the fact that many media assets must be manipulated by the original applications that created them? How can we facilitate reuse? It’s extremely unrealistic to expect users to master perhaps dozens of video or audio or animation applications. Filtering assets according to their file extensions is a good idea, and it is a well established practice.
But what we really need is a globally-known site that either literally or conceptually centralizes the massive network of import/export relationships, along with information about the relative success of these mappings. Are they ever lossy? If so, can they be fixed? What series of applications might we want an asset to be imported/exported through so that in the end it is in a usable format, given the applications that the user owns and has mastered?
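One way to picture such a site is as a graph: nodes are applications (or their formats), edges are known import/export conversions marked lossy or lossless, and finding a usable conversion chain is a graph search. A sketch, with invented application names and conversion data:

```python
from collections import deque

# Sketch of a centralized import/export map: edges are known
# conversions between (invented) applications, each marked as
# lossy or lossless. Finding a usable chain is a breadth-first search.

CONVERSIONS = {
    ("AppA", "AppB"): "lossless",
    ("AppB", "AppC"): "lossy",
    ("AppA", "AppD"): "lossless",
    ("AppD", "AppC"): "lossless",
}

def conversion_chain(src, dst, allow_lossy=False):
    """Find a shortest conversion chain from src to dst, or None."""
    edges = {}
    for (a, b), quality in CONVERSIONS.items():
        if allow_lossy or quality == "lossless":
            edges.setdefault(a, []).append(b)
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in edges.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(conversion_chain("AppA", "AppC"))  # ['AppA', 'AppD', 'AppC']
```

With the lossiness annotations in place, the search can prefer lossless chains, answer "can this be fixed?" by relaxing the constraint, and restrict the graph to applications the user actually owns.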
There is much to be done. Right now, searching for and reusing media assets is a painstaking process of trial and error.
As we have seen, namespaces are a core element of the emerging Semantic Web. By posting namespaces on the Web, we can share precise vocabularies that will hopefully enable us to automate the process of searching the Web.
Searching with today’s search engines, like Google, is an inaccurate and highly iterative process. Searches are based on matching our search words with words in the documents that have been found and indexed in advance by the search engine. It can be a very painstaking process: we have to click on the URLs that are returned, and for each one, make a decision as to whether or not the page is relevant. We typically end up changing our search words gradually, as we hone our search criteria.
Namespaces are intended as a key element of a long term goal to make search engines of the future smarter. If the terms we used to formulate our searches came from widely-adopted, standardized namespaces, there would be far less painstaking iteration involved in finding the right webpages. We would accompany our search requests with links to the namespaces that define terms we are using. And in fact, searching would become at least partly automatic, with the browser able to narrow the set of returned URLs by making use of its knowledge of namespaces.
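A toy illustration of the difference (the namespaces and documents below are invented): plain keyword matching treats “jaguar” as one ambiguous string, while a namespace-qualified term pins down which meaning we want.

```python
# Toy contrast between plain keyword search and search over
# namespace-qualified terms. Namespaces, prefixes, and documents
# here are invented for illustration.

documents = {
    "doc1": {"text": "jaguar spotted in the rainforest",
             "terms": ["zoo:jaguar"]},
    "doc2": {"text": "the new jaguar sedan was unveiled",
             "terms": ["auto:jaguar"]},
}

def keyword_search(word):
    """Naive matching: any document whose text contains the word."""
    return [d for d, doc in documents.items() if word in doc["text"]]

def namespace_search(qualified_term):
    """Match against namespace-qualified terms instead of raw text."""
    return [d for d, doc in documents.items()
            if qualified_term in doc["terms"]]

print(keyword_search("jaguar"))        # ['doc1', 'doc2'] -- ambiguous
print(namespace_search("zoo:jaguar"))  # ['doc1'] -- disambiguated
```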
The Dublin Core.
Let’s take a look at one of the most widely known namespaces. It’s called the Dublin Core. But, as it turns out, it proved too simple and has since been eclipsed, at least in part, by a somewhat more sophisticated namespace called the Metadata Object Description Schema.
To get started, here’s another way to look at a namespace: it is used to create metadata that describes some data source. In particular, the Dublin Core was engineered to provide metadata for resources that can be found on the Web, including text-based documents, images, and video, and in particular, web pages. Want to know what a web page is all about? Look at its metadata, specified with the Dublin Core standard.
By the way, the namespace is named after Dublin, Ohio, not the other Dublin. The namespace was the result of a workshop held in Dublin in 1995. It is not an XML extension, like SMIL, the language used for building multimedia presentations. However, the Dublin Core can be used to create metadata for documents that are specified with XML or one of its many extensions.
So, what is in the Dublin Core? Basically, it is a set of terms such as Contributor, Publisher, and Language. Some of the terms generally refer to very simple values, like Contributor, which identifies a person or organization that helped create a document.
To look at one of the potentially more complex Dublin Core terms, Coverage can describe the 3D (x,y,z) coordinates, or the time period, or the nation referenced by the document being described. It could refer to all of these. Note that this is not the time the document was written, or where it was written. Coverage refers specifically to the content of the document itself.
So, if we tell a smart browser of the future to find all documents that pertain to the year 1865, it will return documents whose content covers 1865. It will not return a document merely because it was written in 1865, if that document is actually about, say, the year 1012.
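A small sketch of that distinction (the records below are invented): filtering on a Dublin-Core-style Coverage field selects documents about 1865, regardless of when they were written.

```python
# Sketch of the Date vs. Coverage distinction in Dublin-Core-style
# metadata. Records are invented for illustration.

records = [
    {"Title": "Diary of a Soldier", "Date": "1865", "Coverage": "1012"},
    {"Title": "The End of the War", "Date": "1990", "Coverage": "1865"},
]

def about_year(year):
    """Select documents whose *content* covers the given year."""
    return [r["Title"] for r in records if r["Coverage"] == year]

print(about_year("1865"))  # ['The End of the War']
```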
One drawback of the Dublin Core is that it is very loosely defined. So, it often fails in its true purpose: to provide precisely-defined terms that all of us can use, and where we can be confident they will be uniformly interpreted.
A More Sophisticated Standard: MODS.
A newer proposed standard, called the Metadata Object Description Schema, or MODS, is an XML language that has been very actively promoted as a successor to the Dublin Core. MODS has more terms, and more precisely-defined terms. Since it leverages the ability of XML to express nested or embedded structures, it can convey much more information than a list of Dublin Core terms can convey.
Here’s a little piece of MODS:
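A minimal fragment of the kind meant here might look like the following (the element values are illustrative, matching the name discussed just below):

```xml
<mods xmlns="http://www.loc.gov/mods/v3">
  <name type="personal">
    <namePart>Bugs King</namePart>
    <role>
      <roleTerm type="text">author</roleTerm>
    </role>
  </name>
</mods>
```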
This only gives a hint of the rich metadata that can be specified by using MODS. (The MODS website provides some far more detailed examples.)
Still, compare this to the Dublin Core Contributor term, which might have the value “Bugs King”. Is this a human name? Is it a pest control company?
But even though it seems like an odd name, in the MODS example we know that this is a person who goes by the name Bugs King.
Dublin Core might die and blow away – but it will always be recognized as a pivotal point in the development of the Semantic Web.