In the previous post on this blog, we looked at SQL-based relational databases and why they are not well suited to managing advanced forms of media, like images, language, video, and sound.
Searching by semantics.
Here, we look closely at one specific issue related to managing complex media: how to categorize and search advanced forms of media by their “meaning” or “semantics”. This is extraordinarily difficult; in the general case, it is impossible. This is why we usually rely on relatively low-level heuristics and can only simulate search-by-semantics in simplistic ways.
Consider a library of soundless video clips. Let’s assume there are many thousands of them, and they vary in length from seconds to hours. First of all, the only clips we can afford to download and actually view in real time are the ones that are only seconds or minutes in length, and we can do this only if we are somehow able to limit the search space to a small handful of candidates. Keep in mind that a video can consist of twenty to forty images per second.
So what do we do?
We could search tiny samples of our video clips, perhaps taken from the beginning, the middle, and the end of each clip, but this doesn’t actually work well, either. We need something automated that can scale.
The dominant technique is to automatically extract information about low-level attributes of the video clips (such as their format and pixel count), and then have experts add more tagging information using widely adopted, formal namespaces. We might use a geography namespace to mark clips as having rivers and mountains in them.
These two forms of tagging information might be encoded together using MPEG-7, a widely used standard for describing multimedia content. This creates a very indirect way of searching video clips. We don’t actually search them. We search the hierarchically constructed MPEG-7 tag sets that describe the videos. This at least allows us to use SQL in a reasonably straightforward way to do the searching.
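To make the idea of searching the tags rather than the videos concrete, here is a minimal sketch in Python using the standard sqlite3 module. It assumes the tag sets have already been flattened into a single relational table. The table name clip_tags, its columns, and the sample rows are all hypothetical. Real MPEG-7 descriptions are XML documents with a far richer structure.

```python
import sqlite3

# Hypothetical sample data: (clip_id, namespace, tag).
SAMPLE_ROWS = [
    ("clip-001", "geography", "river"),
    ("clip-001", "geography", "mountain"),
    ("clip-002", "geography", "river"),
    ("clip-003", "format", "mpeg4"),
]


def build_tag_index(rows):
    """Load (clip_id, namespace, tag) rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE clip_tags (clip_id TEXT, namespace TEXT, tag TEXT)"
    )
    conn.executemany("INSERT INTO clip_tags VALUES (?, ?, ?)", rows)
    return conn


def clips_with_all_tags(conn, namespace, tags):
    """Return ids of clips that carry every tag in `tags` within one namespace."""
    wanted = sorted(set(tags))
    placeholders = ",".join("?" for _ in wanted)
    query = (
        "SELECT clip_id FROM clip_tags "
        f"WHERE namespace = ? AND tag IN ({placeholders}) "
        "GROUP BY clip_id HAVING COUNT(DISTINCT tag) = ? "
        "ORDER BY clip_id"
    )
    params = [namespace, *wanted, len(wanted)]
    return [row[0] for row in conn.execute(query, params)]
```

Calling clips_with_all_tags(build_tag_index(SAMPLE_ROWS), "geography", ["river", "mountain"]) finds the clips tagged with both rivers and mountains without ever touching a frame of video.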
Searching for specific images.
There is very good technology for processing images for fixed pixel-based subcomponents like individual faces. We can also search for video clips that have any faces at all in them.
In general, it’s easier to search for things made by people because they tend to be more angular and regular in shape. These include specific buildings and types of aircraft.
Searching for colors and shapes.
We can also search for more abstract subcomponents of images, like polygons, circles, and the like. Despite the fact that video images are pixel-based (or “raster”), there is good technology for isolating the lines that form the boundaries of subcomponents.
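As a rough illustration of pulling boundaries out of raster data, the sketch below marks a pixel as an edge when its brightness differs sharply from the pixel to its right or the pixel below it. This is a deliberately crude stand-in for real edge detectors. The function name edge_map, the list-of-lists grayscale format, and the threshold value are assumptions made for the example.

```python
def edge_map(gray, threshold=50):
    """Mark pixels whose brightness differs sharply from a neighbor.

    `gray` is a list of rows, each a list of brightness values (0-255).
    Returns a same-sized grid of booleans; True means "boundary here".
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    edges = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Compare with the right-hand and lower neighbors, if they exist.
            right = abs(gray[y][x] - gray[y][x + 1]) if x + 1 < width else 0
            down = abs(gray[y][x] - gray[y + 1][x]) if y + 1 < height else 0
            edges[y][x] = max(right, down) > threshold
    return edges
```

Once the boundary pixels are isolated, a later stage could try to fit lines, polygons, or circles to them. That stage is not shown here.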
And we can look for colors and compare the relative location and dominance of various colors, for example finding images in which 63% of the pixels are a particular shade of orange.
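Color dominance is one of the easier measures to compute. The sketch below assumes an image is a list of rows of (red, green, blue) tuples and treats a pixel as matching when every channel is within a tolerance of the target color. The function name, the image format, and the tolerance of 30 are assumptions chosen for illustration.

```python
def color_fraction(image, target, tolerance=30):
    """Return the fraction of pixels close to `target` on every RGB channel."""
    total = 0
    matches = 0
    for row in image:
        for pixel in row:
            total += 1
            if all(abs(c - t) <= tolerance for c, t in zip(pixel, target)):
                matches += 1
    return matches / total if total else 0.0
```

A search for mostly-orange images then becomes a filter such as color_fraction(image, (255, 140, 0)) >= 0.63, applied to each candidate image.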
Searching for change over time.
We can also search for pattern changes in the series of images that make up a video clip.
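One simple way to look for change over time is to compare each frame with the one before it. The sketch below assumes frames are equally sized grayscale grids (lists of rows of brightness values) and flags the points where the average pixel difference jumps above a threshold, which might indicate a cut or a sudden movement. The function names and the threshold are hypothetical.

```python
def frame_change(frame_a, frame_b):
    """Mean absolute brightness difference between two equally sized frames."""
    total = 0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0


def find_abrupt_changes(frames, threshold=50.0):
    """Return each index i where frame i differs sharply from frame i - 1."""
    return [
        i
        for i in range(1, len(frames))
        if frame_change(frames[i - 1], frames[i]) > threshold
    ]
```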
But none of this has much to do with the real meaning or semantics of images and the video clips they form. Taking this next step is a huge challenge.
How can we look for a setting sun or a ball moving across a tennis court, without knowing the details of the sunset or the particular tennis court in advance?
We can use the colors and shapes approach to look for a big orange ball descending below a possibly-jagged horizontal line. We could also look for a small, white or yellow spherical object moving across a big green rectangle.
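Here is a minimal sketch of the setting-sun heuristic, combining the color test with change over time. It tracks the average row of the orange pixels from frame to frame and reports a descent if that position only moves downward. It makes no attempt to find the horizon line, and the function names, the target shade of orange, and the thresholds are all assumptions for the example.

```python
def orange_centroid_row(frame, target=(255, 140, 0), tolerance=40):
    """Average row index of pixels near `target`, or None if there are none.

    `frame` is a list of rows of (red, green, blue) tuples; row 0 is the top.
    """
    rows = [
        y
        for y, row in enumerate(frame)
        for pixel in row
        if all(abs(c - t) <= tolerance for c, t in zip(pixel, target))
    ]
    return sum(rows) / len(rows) if rows else None


def looks_like_descent(frames, min_drop=1.0):
    """True if the orange region never rises and sinks by at least `min_drop` rows."""
    centroids = [orange_centroid_row(frame) for frame in frames]
    centroids = [c for c in centroids if c is not None]
    if len(centroids) < 2:
        return False
    # Row indices grow downward, so a sinking object has non-decreasing rows.
    never_rises = all(b >= a for a, b in zip(centroids, centroids[1:]))
    return never_rises and centroids[-1] - centroids[0] >= min_drop
```

A heuristic like this will happily match an orange balloon drifting down past a fence, which is exactly the gap between matching colors and shapes and understanding what an image means.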
One way to raise the bar a bit is to use domain-specific knowledge about the images being processed. It’s a whole lot easier to spot that tennis court if we know that’s what we’re looking for. Then we can fill our searching software with lots of detailed information about the various sorts of tennis courts. We can also more easily isolate the tennis court in a larger image if we know it’s there somewhere. This gives us an extra edge, so we can perhaps find the court, even if it turns out to be brown and not green, or if the surrounding terrain is almost the same color as the court.
Of course, we never get away from searching by heuristics that only simulate the process of determining the true meaning of a series of images. We can never truly search by semantics.
But we can do something else: we can get humans into the loop and train our software to do a better job. We’ll look at this next.