<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Buzz’s Blog: On Web 3.0 and the Semantic Web &#187; MySQL</title>
	<atom:link href="http://itknowledgeexchange.techtarget.com/semantic-web/tag/mysql/feed/" rel="self" type="application/rss+xml" />
	<link>http://itknowledgeexchange.techtarget.com/semantic-web</link>
	<description>Defining the necessary skills for future software professionals</description>
	<lastBuildDate>Sun, 16 Dec 2012 04:42:23 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
		<item>
		<title>Two easy ways to build static websites (or not so static)</title>
		<link>http://itknowledgeexchange.techtarget.com/semantic-web/two-easy-ways-to-build-static-websites-or-not-so-static/</link>
		<comments>http://itknowledgeexchange.techtarget.com/semantic-web/two-easy-ways-to-build-static-websites-or-not-so-static/#comments</comments>
		<pubDate>Fri, 01 Oct 2010 18:48:52 +0000</pubDate>
		<dc:creator>Roger King</dc:creator>
				<category><![CDATA[blogs]]></category>
		<category><![CDATA[dynamic websites]]></category>
		<category><![CDATA[Freeway]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[static websites]]></category>
		<category><![CDATA[website development]]></category>
		<category><![CDATA[WordPress]]></category>

		<guid isPermaLink="false">http://itknowledgeexchange.techtarget.com/semantic-web/?p=261</guid>
		<description><![CDATA[This blog is devoted (mostly) to cutting-edge Web and media technology.  Today and in the next posting, we will look at two website development tools that I have found surprisingly powerful, given their simplicity and elegance. One of them is WordPress, which can be run on Macs, Windows machines, and Linux machines, and the [...]]]></description>
				<content:encoded><![CDATA[<p>This blog is devoted (mostly) to cutting-edge Web and media technology.  Today and in the next posting, we will look at two website development tools that I have found surprisingly powerful, given their simplicity and elegance.</p>
<p>One of them is WordPress, which can be run on Macs, Windows machines, and Linux machines, and the other is Freeway (a Mac-only application).  We&#8217;ll look at Freeway next time.</p>
<p><strong>WordPress Installation.</strong></p>
<p>This is the famous blog server software, and in fact the ITKE blogs, of which this is one, use it.  What many people don&#8217;t realize is that it can be used to build more diverse websites.</p>
<p>First of all, it is extremely easy to install.  Anyone, and you certainly don&#8217;t need to be a programmer, can build their own WordPress server.  That means you have total control.  There is no need to have WordPress host your site.</p>
<p>Here is what you need:</p>
<p>A domain and someone to host your site.  I might suggest GoDaddy.com for both.  You need your hosting service to provide access to the MySQL database management system as well.  For the most part, you need to pay for your domain and hosting.</p>
<p>You also need an FTP program, of which there are very good free ones. FTP predates the web, and has long been used to move data from one machine to another on the Internet.</p>
<p>Finally, you need the WordPress software, which is also free.</p>
<p>If anyone needs pointers to any of these applications, just contact me.</p>
<p>I won&#8217;t go into the details here, but all you need is to download the WordPress software and make a couple of very simple changes to its configuration files.  Then it is ready to go.</p>
<p>You must also create an empty database with MySQL.  This is very simple.</p>
<p>Then you copy the WordPress software onto your hosted server by using your FTP program.</p>
<p>Again, if anyone wants more detailed help, just send me email.  My address is posted in my bio on this blog.</p>
<p><strong>Using it</strong>.</p>
<p>Now, you just use your browser to build your site.  You first choose a template and tailor it a bit.  (I use the default WordPress 3.0 template.)</p>
<p>You then create blog entries and other webpages.  You don&#8217;t have to use your FTP program again, as WordPress will now do all your uploading for you.</p>
<p>You can do a few different things with WordPress.  You can of course create a blog.  But you can also build what WordPress calls &#8220;pages&#8221; (as opposed to blog &#8220;postings&#8221;) and they make up fixed tabs on your home page.  In other words, you don&#8217;t need to make your site primarily a blog.</p>
<p>You can also create &#8220;widgets&#8221; to post links to other sites, point to RSS feeds, etc., etc.</p>
<p>WordPress will, as it turns out, use MySQL to store your data, but you don&#8217;t even have to know it&#8217;s doing it.  WordPress takes care of all of this for you.</p>
<p><strong>Dynamic sites.</strong></p>
<p>The definition of a &#8220;dynamic&#8221; website is that your site will build tailored webpages for your users.  They tell your site what they want to see and the page is built on the spot.</p>
<p>Now, given that WordPress uses MySQL to store information that it plugs into pages that then get immediately downloaded, it technically is a tool for building dynamic websites.  And there is a lot more that can be done with WordPress beyond what we have discussed here.</p>
<p>Still, I think its big plus is that it is almost trivial to install, very easy to use, and produces very elegant websites.</p>
<p>You can look at one of mine, if you want, which I set up for my 3D animation students: http://wordsbybuzz.com.  It is part blog, part general purpose website, and you can download my animation lessons if you want.  They are built with desktop capture and audio capture software.</p>
<p><strong>Next time, Freeway.</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://itknowledgeexchange.techtarget.com/semantic-web/two-easy-ways-to-build-static-websites-or-not-so-static/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Challenge of Complex Media in a Relational World, Part 1</title>
		<link>http://itknowledgeexchange.techtarget.com/semantic-web/the-challenge-of-complex-media-in-a-relational-world-part-1/</link>
		<comments>http://itknowledgeexchange.techtarget.com/semantic-web/the-challenge-of-complex-media-in-a-relational-world-part-1/#comments</comments>
		<pubDate>Tue, 13 Apr 2010 17:57:19 +0000</pubDate>
		<dc:creator>Roger King</dc:creator>
				<category><![CDATA[blob data]]></category>
		<category><![CDATA[continuous data]]></category>
		<category><![CDATA[databases]]></category>
		<category><![CDATA[Multimedia]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[SQL Server]]></category>
		<category><![CDATA[tagging]]></category>
		<category><![CDATA[Video]]></category>

		<guid isPermaLink="false">http://itknowledgeexchange.techtarget.com/semantic-web/the-challenge-of-complex-media-in-a-relational-world-part-1/</guid>
		<description><![CDATA[Relational databases: the dominant technology. Relational database management systems, such as MySQL, Oracle, MS SQL Server, DB2, and PostgreSQL, support the relational model. A database is broken up into tables, and each table consists of rows. Each row is a series of values. A row in a table called Insured Drivers in a motor vehicle [...]]]></description>
				<content:encoded><![CDATA[<p><strong>Relational databases: the dominant technology.</strong></p>
<p>Relational database management systems, such as MySQL, Oracle, MS SQL Server, DB2, and PostgreSQL, support the relational model.  A database is broken up into tables, and each table consists of rows.  Each row is a series of values.  A row in a table called Insured Drivers in a motor vehicle database might consist of:</p>
<p>Fred, 2010 Toyota Prius, State Farm Insurance, 1112233444.</p>
<p>1112233444 might be a unique identifier that the government assigns to each driver.  This would be the “primary key” for the table Insured Drivers.  The point is that human names are not at all unique, and so in relational databases, we introduce artificial keys in order to disambiguate queries.  We still need the value Fred in the row because we want to know how to address him with a letter or email.</p>
<p><strong>Problems with relational databases.</strong></p>
<p>There are a few critical points to note with this approach.  First, such a simple way of representing data allows the database to quickly deliver large sets of rows from this table to the memory of a computer, so that they can be effectively searched in bulk.  We might want to know the names of all people who drive a Toyota Prius and are insured by State Farm, for example.</p>
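<p>As a minimal sketch of the example so far, the following Python uses the standard sqlite3 module with an in-memory database.  The table layout, the column names, and the extra sample rows are assumptions made for illustration; only the row for Fred comes from the text above.</p>

```python
import sqlite3

# Hypothetical schema for illustration; the column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE insured_drivers (
        driver_id INTEGER PRIMARY KEY,  -- the government-assigned identifier
        name      TEXT NOT NULL,
        vehicle   TEXT NOT NULL,
        insurer   TEXT NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO insured_drivers VALUES (?, ?, ?, ?)",
    [
        (1112233444, "Fred", "2010 Toyota Prius", "State Farm Insurance"),
        # A second Fred: names repeat, which is why the key is artificial.
        (1112233445, "Fred", "2008 Honda Civic", "State Farm Insurance"),
        (1112233446, "Wilma", "2010 Toyota Prius", "State Farm Insurance"),
    ],
)


def prius_drivers_with_state_farm(connection):
    """Return (driver_id, name) for every Prius driver insured by State Farm."""
    rows = connection.execute(
        """
        SELECT driver_id, name
        FROM insured_drivers
        WHERE vehicle LIKE '%Toyota Prius%'
          AND insurer = 'State Farm Insurance'
        ORDER BY driver_id
        """
    )
    return rows.fetchall()


matches = prius_drivers_with_state_farm(conn)
print(matches)
```

<p>Because every value in a row is simple, the query only has to scan one table and compare short strings, which is the property that lets a database move through many rows quickly.</p>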
<p>Another thing is that we might like to be able to put more complex items in a row.  We might want to have another value in a row, one that gives a driver’s address.  But an address has a few parts to it, and is not itself a simple value like a name or a car model or the name of an insurance company.</p>
<p>It is important to also note, however, that relational databases do indeed support the creation of more complex values, such as an address.  But the more complex values we put in rows in tables, the harder it is to read in a large number of rows at once.  </p>
<p>In fact, we could create a value that represents a very complex object, one that refers to rows in other tables.  For example, we might want to replace the value Fred with a reference to a row in another table called Licensed Drivers, because there is a lot we might want to know about Fred, other than just his name.  But then it would become very difficult to read in lots of rows of a single table quickly.  </p>
<p>It might be that if we follow a link to another table that describes drivers, these rows might themselves have links in them, thus allowing a value in a row to actually consist of an object, as it would in Java or C++.  And in general, these links between tables could be chained together, and extend arbitrarily far.  Do we chase all of these linked references down for every row of Insured Drivers, or do we not follow any of these links so we can read in a large number of rows?  In the latter case, we would worry later about getting more information on each driver.</p>
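<p>The two choices can be sketched in Python with the standard sqlite3 module.  This is only an illustration under assumed names: the Licensed Drivers columns, the policy_id column, and the sample address are all hypothetical.</p>

```python
import sqlite3

# Hypothetical two-table layout; every column name here is an assumption.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE licensed_drivers (
        driver_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        address   TEXT NOT NULL
    );
    CREATE TABLE insured_drivers (
        policy_id INTEGER PRIMARY KEY,
        driver_id INTEGER NOT NULL REFERENCES licensed_drivers(driver_id),
        vehicle   TEXT NOT NULL,
        insurer   TEXT NOT NULL
    );
    INSERT INTO licensed_drivers VALUES (1112233444, 'Fred', '12 Elm Street');
    INSERT INTO insured_drivers
        VALUES (1, 1112233444, '2010 Toyota Prius', 'State Farm Insurance');
    """
)


def scan_without_following_links(connection):
    """Bulk read of one table only; the driver stays an unresolved reference."""
    return connection.execute(
        "SELECT policy_id, driver_id, vehicle FROM insured_drivers"
    ).fetchall()


def scan_following_links(connection):
    """Resolve each reference with a join, which touches a second table."""
    return connection.execute(
        """
        SELECT i.policy_id, d.name, d.address, i.vehicle
        FROM insured_drivers AS i
        JOIN licensed_drivers AS d ON d.driver_id = i.driver_id
        """
    ).fetchall()


shallow = scan_without_following_links(conn)
deep = scan_following_links(conn)
print(shallow)
print(deep)
```

<p>The first function reads a single table and leaves the reference as a bare number.  The second resolves it, and each further level of linking would add another join to the query.</p>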
<p>Importantly, relational databases are still very much the dominant database technology in use in businesses and other organizations, as well as on the Web.  We need to keep in mind that we have already aggressively extended them by supporting values that have internal structure (like addresses) and with the ability to create complex objects (like drivers).  How far do we go in extending them?  </p>
<p><strong>Where we stand today.</strong></p>
<p>Indeed, the extensions we have already made to relational databases have created a serious optimization problem.</p>
<p>But it’s worse than that.  Here’s something else to consider.  Relational databases were born into a world where flat business data was pretty much the only game in town.  However, relational databases are being asked to manage far more sophisticated forms of data, like photos and video clips and voice tracks.  There are a couple of problems that crop up.  First, a row with a video clip as a field could be huge.  We might only be able to read in a single row at a time and this could make searching an entire table intractable.  Worse, how do we even search for rows that contain certain pieces of video?  How can we search for all video clips that show Fred getting into a car accident?</p>
<p><strong>Where to go from here.</strong></p>
<p>In previous postings of this blog we have looked at <a href="http://itknowledgeexchange.techtarget.com/semantic-web/multimedia-what-is-it-why-do-we-care/">media databases</a>, and in particular, at techniques that can be used to <a href="http://itknowledgeexchange.techtarget.com/semantic-web/tag/dublin-core/">tag</a> complex forms of blob and continuous media (like photos and video clips).  What’s important to note, though, is that there is a major dilemma right now in the world of database software.  Can we continue to shoehorn more and more complex forms of data into relational databases, or do we need to throw in the towel and start over?</p>
<p><strong>More on this next time&#8230;</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://itknowledgeexchange.techtarget.com/semantic-web/the-challenge-of-complex-media-in-a-relational-world-part-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Making information management scale: leveraging metadata on the new Web</title>
		<link>http://itknowledgeexchange.techtarget.com/semantic-web/making-information-management-scale-leveraging-metadata-on-the-new-web/</link>
		<comments>http://itknowledgeexchange.techtarget.com/semantic-web/making-information-management-scale-leveraging-metadata-on-the-new-web/#comments</comments>
		<pubDate>Sun, 11 Oct 2009 23:07:36 +0000</pubDate>
		<dc:creator>Roger King</dc:creator>
				<category><![CDATA[3D modeling]]></category>
		<category><![CDATA[automating Web searches]]></category>
		<category><![CDATA[databases]]></category>
		<category><![CDATA[DB2]]></category>
		<category><![CDATA[information]]></category>
		<category><![CDATA[Multimedia]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[RDF]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Video]]></category>
		<category><![CDATA[Web 3.0]]></category>
		<category><![CDATA[Web development frameworks]]></category>
		<category><![CDATA[Web3.0]]></category>

		<guid isPermaLink="false">http://itknowledgeexchange.techtarget.com/semantic-web/making-information-management-scale-leveraging-metadata-on-the-new-web/</guid>
		<description><![CDATA[Previous postings of this blog. This blog is dedicated to advanced Web development tools and concepts. Previous blog postings have focused on the emerging Semantic Web, which promises to make the Web radically easier to search and to greatly enhance the value of the vast sea of currently-disconnected information spread across the Web. We have [...]]]></description>
				<content:encoded><![CDATA[<p><strong>Previous postings of this blog.<br />
</strong><br />
This blog is dedicated to <a href="http://itknowledgeexchange.techtarget.com/semantic-web/the-difference-between-web-2-and-the-semantic-web/">advanced Web development</a> tools and concepts. Previous blog postings have focused on the emerging <a href="http://itknowledgeexchange.techtarget.com/semantic-web/what-do-we-mean-by-semantic-web/">Semantic Web</a>, which promises to make the Web radically easier to search and to greatly enhance the value of the vast sea of currently disconnected information spread across the Web.   We have also looked at <a href="http://itknowledgeexchange.techtarget.com/semantic-web/mega-media-apps-a-huge-challenge-for-web-30/">Web 3.0</a> efforts, which promise to make multimedia websites highly usable and capable of conveying far more information than the current generation of websites.  Previous postings describe the breadth and depth of cutting-edge Web technology. </p>
<p><strong>Metadata: making that ratio small.</strong></p>
<p>Here’s something that’s very important:  Much of the ongoing research and development that is loosely categorized as Semantic Web and Web 3.0 efforts is focused on a specific technical goal, one that has been at the core of information management technology since the mainframe era that was epitomized by the IBM 360 series.   That goal is to leverage metadata as much as possible.</p>
<p>It’s our best weapon against the truly staggering amount of information on the Web.  This includes traditional text-based and numeric data, as well as books, medical advice, photographs, entertainment and training videos, music and recorded books, investment information, educational materials, scientific materials, e-government information, etc., etc.  How can we possibly organize information and then search it in a way that scales?  The Web is far from a closed world.  In traditional data processing environments like banking, insurance, and credit card processing, we could get our arms around all of the data, as vast as it may have seemed.  But the world of information today is an open world, effectively infinite in size.</p>
<p>Very informally, if you look at the size of the metadata divided by the size of the data itself, the smaller that fraction the better. In traditional relational databases (built with database management systems, such as Oracle, MS SQL Server, MySQL, PostgreSQL, or DB2), the extreme focus on minimizing this ratio has enabled the fast processing of extremely large volumes of data.  The tradeoff is that the table definitions (or the “schema”), which form the heart of the metadata, are very, very simplistic. </p>
<p><strong>The old days: relational database schemas.</strong></p>
<p>An insurance claim may be defined as a table with such columns as Subscriber_Name, Medical_Provider, etc., and thus, may consist of little or no more than a series of simple character and numeric fields.  But if we need to process fifty thousand of them tonight, we must be able to bring many such table rows into memory at once, and quickly move through them.  The database world was an extension of the paper world: a row in an insurance claim table was effectively an electronic successor to the traditional claim form.</p>
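<p>To make the informal ratio concrete, here is a rough sketch in Python using the standard sqlite3 module.  Treating the stored table definition as the metadata and a text rendering of every value as the data is an assumption made for illustration, as are the amount column and the generated sample rows.</p>

```python
import sqlite3

# Hypothetical claims table; column names follow the text where it gives them.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE claims ("
    "claim_id INTEGER PRIMARY KEY, "
    "subscriber_name TEXT, "
    "medical_provider TEXT, "
    "amount INTEGER)"
)
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?, ?)",
    [(i, f"Subscriber {i}", f"Provider {i % 100}", 100 + i) for i in range(50_000)],
)

# Metadata: the text of the table definition, as SQLite stores it.
schema_sql = conn.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = 'claims'"
).fetchone()[0]
metadata_bytes = len(schema_sql.encode("utf-8"))

# Data: a rough byte count of every value, rendered as text.
data_bytes = sum(
    len(str(value).encode("utf-8"))
    for row in conn.execute("SELECT * FROM claims")
    for value in row
)

ratio = metadata_bytes / data_bytes
print(f"{metadata_bytes} bytes of metadata, {data_bytes} bytes of data, ratio {ratio:.6f}")
```

<p>One short table definition describes all fifty thousand rows, so the ratio is tiny.  The byte counts are crude, but the order of magnitude is the point.</p>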
<p><strong>Today: a far more challenging problem.</strong></p>
<p>But on the new Web, information can be far more complex in nature, making the metadata to data ratio far larger.  We’ve looked at some of the <a href="http://itknowledgeexchange.techtarget.com/semantic-web/the-semantic-web-rdf-and-sparql-part-5/">emerging technology</a> and <a href="http://itknowledgeexchange.techtarget.com/semantic-web/ambient-intelligence-empowering-the-new-web/">technical trends</a> for embedding metadata in advanced forms of data (and for processing that metadata); this data includes books, images, video, modeling and animation, and sound.  This new generation of information formats makes up our personal health records and medical record images, industrial training materials, university “distance” courses, and the like.  Each instance of these tends to be far more distinctive than an individual insurance claim form.  And it takes a lot of metadata to properly convey their “meaning”.</p>
<p><strong>The challenge.</strong></p>
<p>What we’re struggling with right now is to succinctly specify the meaning of modern media assets and to automate searching based on this metadata.  This is our only hope for leveraging that ratio of metadata size divided by data size. </p>
]]></content:encoded>
			<wfw:commentRss>http://itknowledgeexchange.techtarget.com/semantic-web/making-information-management-scale-leveraging-metadata-on-the-new-web/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>The Semantic Web: revealing hidden data.</title>
		<link>http://itknowledgeexchange.techtarget.com/semantic-web/the-semantic-web-revealing-hidden-data/</link>
		<comments>http://itknowledgeexchange.techtarget.com/semantic-web/the-semantic-web-revealing-hidden-data/#comments</comments>
		<pubDate>Mon, 11 May 2009 03:07:55 +0000</pubDate>
		<dc:creator>Roger King</dc:creator>
				<category><![CDATA[databases]]></category>
		<category><![CDATA[DB2]]></category>
		<category><![CDATA[dynamic pages]]></category>
		<category><![CDATA[hidden web content]]></category>
		<category><![CDATA[indexing]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[namespaces]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[SQL Server]]></category>
		<category><![CDATA[static pages]]></category>
		<category><![CDATA[the Semantic Web]]></category>
		<category><![CDATA[triples]]></category>

		<guid isPermaLink="false">http://itknowledgeexchange.techtarget.com/semantic-web/the-semantic-web-revealing-hidden-data/</guid>
		<description><![CDATA[The Hidden Web. The Semantic Web &#8211; a primary topic of this continuing blog series &#8211; will help us search the web with greater ease. One of the things it will (hopefully) do is expose a vast sea of information that is currently invisible to our web browsers. In fact, some say that right now, [...]]]></description>
				<content:encoded><![CDATA[<p><strong>The Hidden Web.</strong></p>
<p>The <a href="http://itknowledgeexchange.techtarget.com/semantic-web/the-difference-between-web-2-and-the-semantic-web/" target="_blank">Semantic Web</a> &#8211; a primary topic of this continuing blog series &#8211; will help us search the web with greater ease. One of the things it will (hopefully) do is expose a vast sea of information that is currently invisible to our web browsers. In fact, some say that right now, we can see less than 1% of what&#8217;s out there. I cannot vouch for this number, but I can say that what we cannot see right now includes large volumes of extremely valuable data.</p>
<p>Perhaps you have heard of the mysterious &#8220;Hidden Web&#8221;? So, what is this stuff and where is it?</p>
<p><strong>Forms, Databases, and Interactive Interfaces.</strong></p>
<p>The Hidden Web refers to data that is out there on the web, publicly accessible &#8211; but only via webpage interfaces that are opaque to the indexing software of search engines like Google.</p>
<p>Let&#8217;s step back for a moment. </p>
<p>The way search engines work, in case you don&#8217;t know, is by constantly searching the web, looking for new webpages. When a new page is found, it is added to the search engine&#8217;s index, meaning that now, when people search the web with Google, they might get the URL for that page in their search results. </p>
<p>The important thing to note is that the primary source of information that Google uses when it indexes a page is the page itself. What words are on it?</p>
<p>This sounds great for static webpages that are stored as-is on websites and delivered as-is to the Google user. </p>
<p>But suppose we want Google to find dynamic pages? A typical dynamic page has content that isn&#8217;t known until an interactive user types some words into a web &#8220;form&#8221;. A web form is a page where the browser user fills in blanks and then lets the browser send the completed page back to the server. There, the information in the form is used to select other information, which is plugged into a &#8220;dynamically&#8221; created page that is sent to the client machine and viewed by the browser user.</p>
<p>So, I might visit Amazon. I navigate to their search page, which is a form, and I type in the title of the book I want. That information goes back to the server. A description of this book, including its cost, is plugged into a dynamically created page, which is then downloaded to my machine so that I can read the material with my browser.</p>
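<p>The round trip just described can be sketched in a few lines of Python using only the standard library.  This is not how Amazon or any real site works internally; the books table, its columns, and the sample title are hypothetical, and the function stands in for whatever the server runs when the form arrives.</p>

```python
import html
import sqlite3

# Hypothetical catalog; the table, columns, and sample row are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT PRIMARY KEY, price_cents INTEGER)")
conn.execute("INSERT INTO books VALUES ('Example Book', 1999)")


def render_book_page(connection, submitted_title):
    """Build the HTML for one search result from the submitted form field."""
    row = connection.execute(
        "SELECT title, price_cents FROM books WHERE title = ?",
        (submitted_title,),
    ).fetchone()
    if row is None:
        # Escape the user's text before echoing it back into the page.
        body = "No book found for " + html.escape(submitted_title)
    else:
        title, price_cents = row
        body = f"{html.escape(title)}: ${price_cents / 100:.2f}"
    return f"<html><body><p>{body}</p></body></html>"


page = render_book_page(conn, "Example Book")
print(page)
```

<p>Nothing resembling this page exists on disk until someone submits the form, which is exactly why a crawler that only follows links to stored pages never sees it.</p>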
<p><strong>Indexing Dynamic Pages.</strong></p>
<p>So, if I have information that is not sitting in static pages, how can I get Google to index this information? There are multiple ways. For example, if the primary job of your website is to create large volumes of dynamically created pages, you might want to create a special directory page for your site &#8211; a static page &#8211; loaded with all the right words, and that contains links to the pages and forms you want the user to discover.</p>
<p>On the future Semantic Web, you might want to make sure that those magic words come at least in part from globally accessible <a href="http://itknowledgeexchange.techtarget.com/semantic-web/namespaces-and-the-semantic-web/" target="_blank">namespaces</a>, so that people who are using next-generation browsers, and who will be using these namespaces as a source of search keywords, will find your static page. As we have discussed, namespaces will provide us with detailed sets of terms, which will be tied to specific domains. This will make the search for static pages far more efficient than it is now.</p>
<p>As an example, a namespace concerning books might have words like <em>ISBN-10</em> and <em>ISBN-13. </em>If the web designer uses these terms to describe static pages about books, and if the user of the browser can specify that they are looking for ISBN numbers, the browser will have a much more detailed idea of what is meant by those 10 and 13 digit numbers the user types in. </p>
<p>Here&#8217;s the critical part. Right now, Amazon lets you search by these numbers on their specialized web form page, but imagine if you could at any time tell your browser to look for ISBN numbers on whatever webpages it searches.</p>
<p>An example of a namespace that is used to describe documents on the web is the <a href="http://itknowledgeexchange.techtarget.com/semantic-web/the-dublin-core-and-the-metadata-object-description-schema-a-look-at-namespaces/" target="_blank">Dublin Core</a>, by the way.</p>
<p>So, that&#8217;s one way to make your dynamic pages somewhat visible. Create a web page that is static and leads to the pages you want users to see, and to make it all the more powerful, use terms from a globally accepted namespace like the Dublin Core. This is something that is already partly doable. The Dublin Core, along with other namespaces, is in wide use.</p>
<p><strong>Where Does that Information Come From?</strong></p>
<p>Is there a better way, though? This technique will only point users to our static web directory, which will then enable interactive users to find our web forms. The users must then use our forms to get detailed data. Could the searching for dynamic pages be made more automatic?</p>
<p>Well, where does data in dynamic pages come from? Often from large databases built with such database management systems as Oracle, SQL Server, MySQL, PostgreSQL, and DB2. This is why some folks conjecture that the amount of information in the Hidden Web is vastly bigger than the web we see today. Databases can be BIG.</p>
<p>Imagine that all the information on the ancient Pharaohs, genetic diseases, investments, philosophy, and countless other topics is sitting inside databases that right now are only accessible via web forms. Right now, we Google keywords like &#8220;pharaoh&#8221; and the first things we see are static, highly condensed Wikipedia pages, and perhaps some static pages posted by museums and academics.</p>
<p><strong>What Will the Semantic Web Do?</strong></p>
<p>A primary challenge for the Semantic Web will be to let us ask for information and know that the search space includes information tucked away in databases dotted all around the globe. </p>
<p>This is a very complex problem. Right now, we need a human sitting at the keyboard of the client machine to navigate to the correct URL and then type terms into a web form. In the future, web designers will need ways of capturing information about what is contained in databases, and to specify that information in a fashion that browsers can access. And this information will have to be very detailed, sometimes very intricate. </p>
<p>The browser will also have to take information specified by the user and match it up with the information that describes databases on the web. This means that we will need some automatic way to search databases without a user interactively and incrementally screening tens or hundreds or thousands of URLs. In an earlier blog posting in this series we described one possible technique called &#8220;<a href="http://itknowledgeexchange.techtarget.com/semantic-web/what-do-we-mean-by-semantic-web/" target="_blank">triples</a>&#8221; that might, combined with namespaces, provide a partial solution to this problem.</p>
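<p>As a rough illustration of the idea, a triple can be treated as a (subject, predicate, object) tuple, and a search becomes pattern matching over a list of them.  The sketch below is plain Python rather than any real Semantic Web toolkit.  The subjects, the ex:isbn13 predicate, and the ISBN values are invented placeholders; dc:title is borrowed from the Dublin Core.</p>

```python
# Each triple is (subject, predicate, object). The subjects, the ex:isbn13
# predicate, and the ISBN values are invented placeholders for illustration.
TRIPLES = [
    ("book:1", "dc:title", "Example Book"),
    ("book:1", "ex:isbn13", "9780000000002"),
    ("book:2", "dc:title", "Another Book"),
    ("book:2", "ex:isbn13", "9780000000019"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every field that is not None."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def title_for_isbn(triples, isbn13):
    """Find the subject carrying this ISBN, then look up that subject's title."""
    for subject, _, _ in match(triples, predicate="ex:isbn13", obj=isbn13):
        for _, _, title in match(triples, subject=subject, predicate="dc:title"):
            return title
    return None


found = title_for_isbn(TRIPLES, "9780000000019")
print(found)
```

<p>Because the predicate names come from a shared namespace, software can tell that a thirteen digit string is meant as an ISBN without a human choosing the right web form first.</p>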
<p>We will look at this again, more closely, in a future blog posting.</p>
]]></content:encoded>
			<wfw:commentRss>http://itknowledgeexchange.techtarget.com/semantic-web/the-semantic-web-revealing-hidden-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Full Text searching: clever heuristics for managing large web-based document collections.</title>
		<link>http://itknowledgeexchange.techtarget.com/semantic-web/full-text-searching-cleaver-heuristics-for-managing-large-web-based-document-collections/</link>
		<comments>http://itknowledgeexchange.techtarget.com/semantic-web/full-text-searching-cleaver-heuristics-for-managing-large-web-based-document-collections/#comments</comments>
		<pubDate>Thu, 02 Apr 2009 05:59:17 +0000</pubDate>
		<dc:creator>Roger King</dc:creator>
				<category><![CDATA[databases]]></category>
		<category><![CDATA[documents]]></category>
		<category><![CDATA[full text]]></category>
		<category><![CDATA[full text searching]]></category>
		<category><![CDATA[Multimedia]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[SMIL]]></category>
		<category><![CDATA[SQL Server]]></category>
		<category><![CDATA[the Semantic Web]]></category>
		<category><![CDATA[Web 2.0]]></category>
		<category><![CDATA[Web 3.0]]></category>
		<category><![CDATA[web applications]]></category>
		<category><![CDATA[XML]]></category>

		<guid isPermaLink="false">http://itknowledgeexchange.techtarget.com/semantic-web/full-text-searching-cleaver-heuristics-for-managing-large-web-based-document-collections/</guid>
		<description><![CDATA[There is an explosion of technology for supporting sophisticated forms of media on websites and in web applications. In our continuing series on advanced web applications (in particular, as they pertain to the Semantic Web and Web 2.0/3.0), we&#8217;ve looked at continuous media, in particular, video and multimedia presentations. But there is a very old [...]]]></description>
				<content:encoded><![CDATA[<p>There is an explosion of technology for supporting sophisticated forms of <a href="http://itknowledgeexchange.techtarget.com/semantic-web/multimedia-what-is-it-why-do-we-care/" target="_blank">media</a> on websites and in web applications. In our continuing series on <a href="http://itknowledgeexchange.techtarget.com/semantic-web/the-difference-between-web-2-and-the-semantic-web/" target="_blank">advanced web</a> applications (in particular, as they pertain to the Semantic Web and Web 2.0/3.0), we&#8217;ve looked at continuous media, in particular, <a href="http://itknowledgeexchange.techtarget.com/semantic-web/sql-and-xml-declarative-is-exciting/" target="_blank">video and multimedia presentations</a>. But there is a very old form of continuous media, something that is perhaps the dominant medium on the Web, and that&#8217;s text.</p>
<p>It&#8217;s becoming a very major issue in web development.</p>
<p><strong>Text.</strong></p>
<p>In this blog entry, we&#8217;ll be looking at a particular form of text, called &#8220;full text&#8221;.</p>
<p>But just what is text to begin with? It&#8217;s character-based data, anything we can read.</p>
<p>And what will we want to do with it in next-generation web applications? It&#8217;s important to note that a growing number of vast document libraries are being put online. Web applications need to provide far faster and more accurate searches of documents than we can perform with Google.</p>
<p>Interestingly, a successful technology, called &#8220;full text retrieval&#8221;, is already in place in the relational database systems that underlie modern web applications. It&#8217;s there working for us, and we are often unaware of how clever it is.</p>
<p>It&#8217;s also something that should be used much more heavily by web application developers.</p>
<p>Let&#8217;s step back and consider three different &#8211; and increasingly sophisticated &#8211; ways of managing character data.</p>
<p><strong>Atomic Character Attributes.</strong></p>
<p>First, there is the traditional relational database approach, whereby data is stored as tables made of rows of atomic, fixed-size attributes. By atomic, we mean that each attribute has no internal structure. So, a table of insurance claims might have rows with the following attributes: Claims-ID (an integer), Amount (an integer), Medical_Problem (a fixed-length character string), and Subscriber_Name (a fixed-length character string). Using SQL, the universal database &#8220;query&#8221; language, we might look for all rows that contain the name &#8220;Fred Jones&#8221;. Or, we might search for all rows whose Claims-ID values are between 110 and 115.</p>
<p>Essentially, this approach limits us to comparing small strings of data to each other or to fixed values. Relational databases do offer some common extensions, such as the ability to ask for all rows where the Medical_Problem is something <em>like</em> &#8220;broken leg&#8221;. If a row actually has the value &#8220;broken legs&#8221;, we would most likely see this row in our results.</p>
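<p>To make the three kinds of query above concrete, here is a minimal sketch using Python&#8217;s built-in sqlite3 module. The table name, the column names, and the sample rows are assumptions made up for illustration; they simply mirror the insurance claims example, and any relational database would accept similar SQL.</p>

```python
import sqlite3

# Hypothetical in-memory table mirroring the insurance-claims example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE claims ("
    "claims_id INTEGER PRIMARY KEY, amount INTEGER, "
    "medical_problem TEXT, subscriber_name TEXT)"
)
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?, ?)",
    [
        (109, 400, "sprained wrist", "Ann Lee"),
        (112, 1500, "broken legs", "Fred Jones"),
        (114, 900, "broken leg", "Mary Smith"),
        (120, 250, "flu", "Fred Jones"),
    ],
)


def ids(sql, params=()):
    """Run a query and return the matching claim IDs in ascending order."""
    return sorted(row[0] for row in conn.execute(sql, params))


# Exact comparison against a fixed value.
by_name = ids(
    "SELECT claims_id FROM claims WHERE subscriber_name = ?", ("Fred Jones",)
)

# Range comparison on an integer attribute.
in_range = ids(
    "SELECT claims_id FROM claims WHERE claims_id BETWEEN ? AND ?", (110, 115)
)

# Pattern match: % stands for any run of characters,
# so the value "broken legs" matches as well as "broken leg".
like_leg = ids(
    "SELECT claims_id FROM claims WHERE medical_problem LIKE ?", ("broken leg%",)
)

print(by_name, in_range, like_leg)
```

<p>Note that the pattern match is still a comparison of characters: it would not find a row that describes the same injury in different words.</p>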
<p><strong>Full Text.</strong></p>
<p>Second, there is the ability to search pieces of text according to their natural language (in this case, English) meaning. In this case, we consider the character data to have internal structure, and the values are not considered atomic. Often, these pieces of text are long and of variable length from one row to the next.</p>
<p>It is actually an extension, though a very dramatic one, of the <em>like</em> operator in SQL.</p>
<p>It is what we call &#8220;full text&#8221; management or retrieval, and modern relational database management systems like MySQL and Microsoft SQL Server support this. This was seen long ago as a critical extension to relational database technology. Thus, we might rename our Medical_Problem field to Doctor&#8217;s_Diagnosis, and allow free form English text in this attribute, as well as allowing the value to be quite long. Then we might search for all rows where the doctor describes &#8220;fractures of the lower limbs&#8221;. Notice that none of these words might actually appear in the attribute, which might simply refer to &#8220;broken legs&#8221;.</p>
<p><strong>Natural Language Processing.</strong></p>
<p>This capability would clearly be very powerful, if we could do it right. The problem is that to support it fully, we would need to use highly advanced natural language processing techniques, which are very time-consuming to execute, especially on huge databases of large documents. The full text approach tries to simulate true natural language searching in a far less expensive way. The real thing, by the way, might not be all that accurate anyway. Natural language is naturally ambiguous and very subtle.</p>
<p>True natural language searching would be our third way of processing character-based data, by the way. It is not a fully developed technology. And importantly, we usually don&#8217;t need anything that fancy.</p>
<p><strong>The Clever Compromise.</strong></p>
<p>So, our middle option, full text searching, is what dominates today &#8211; and it is a surprisingly accurate and efficient technique that operates on a small set of heuristics. It can transform a dumb webpage where we can only search for small, fixed character strings, to a rich next-generation webpage that can effectively be searched according to its meaning. It allows us to manage very large text documents in web applications &#8211; and gets us surprisingly close to the semantic power of true natural language searching.</p>
<p>We&#8217;re not going to go into a lot of detail here, but here are some of the heuristics that are used in full text search. First, &#8220;stemming&#8221; and related techniques are used; they reduce conjugated verbs and plural nouns to a common root form, typically by removing prefixes and suffixes. Another technique is to use a &#8220;stop list&#8221; that lists words that should be ignored, like &#8220;the&#8221;. The system might also let us specify the &#8220;proximity&#8221; of words; this refers to how closely specific words should appear in a document. It can also be powerful to include a synonym checker. And the ability to allow for &#8220;wild cards&#8221;, in particular, letters that may vary in a passage without changing its meaning, can be quite useful. Dictionaries of technical words that pertain to specific domains (like medicine or law) are very useful. We might also provide a feedback capability, whereby users can train full text search engines to be more accurate.</p>
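<p>Three of these heuristics (the stop list, stemming, and synonym checking) can be sketched in a few lines of Python. This is only a toy under stated assumptions: the stop list, the suffix list, and the synonym table below are tiny hand-made stand-ins, and the crude suffix stripping is not the stemmer used by any particular database engine.</p>

```python
import re

# Toy resources, invented for this sketch; real engines use far larger ones.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}
SUFFIXES = ("ing", "ed", "es", "s")
SYNONYMS = {"fractur": "broken", "limb": "leg"}  # keyed by stemmed form


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word


def normalize(text):
    """Lowercase, drop stop words, stem, then map synonyms to one term."""
    words = re.findall(r"[a-z]+", text.lower())
    kept = [w for w in words if w not in STOP_WORDS]
    stems = [stem(w) for w in kept]
    return [SYNONYMS.get(s, s) for s in stems]


query = normalize("fractures of the lower limbs")
document = normalize("Patient has two broken legs")

# Terms that the query and the document share after normalization.
shared = sorted(set(query).intersection(document))
print(query, document, shared)
```

<p>After normalization, the query and the document share terms even though the original phrases had no words in common, which is the effect the heuristics are after.</p>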
<p>This clearly doesn&#8217;t come anywhere near true natural language processing &#8211; but it is fast. It will be a growing technology on the new web, with a lot of hidden development, making this heuristic-based technique more and more effective.</p>
<p><strong>Indexing.</strong></p>
<p>We should note that there is a significant up-front cost in preparing a document for full text searching: we need to build an index with an entry for every (non-stop) word in the text. Then, when a query is executed, we can look for words in the document by searching the index, instead of searching the full text. If there were no index, the search would be extremely time-consuming.</p>
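<p>The index described here is commonly called an inverted index. The following Python sketch shows the idea under simple assumptions: the stop list, the sample documents, and the function names build_index and search are all hypothetical, and a real engine would also store word positions, apply stemming, and rank the results.</p>

```python
import re
from collections import defaultdict

# Small illustrative stop list; not taken from any real engine.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "has"}


def build_index(documents):
    """Map each non-stop word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOP_WORDS:
                index[word].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents that contain every non-stop query word."""
    words = [
        w for w in re.findall(r"[a-z]+", query.lower()) if w not in STOP_WORDS
    ]
    if not words:
        return set()
    # Start from the first word's documents and narrow with each later word.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result = result.intersection(index.get(word, set()))
    return result


documents = {
    1: "The patient has a broken leg",
    2: "Broken wrist and sprained ankle",
    3: "The leg is bruised",
}
index = build_index(documents)
print(search(index, "broken leg"))
```

<p>The up-front cost is the single pass over every document in build_index; after that, each query touches only the index entries for its own words rather than the full text.</p>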
<p><strong>The Future.</strong></p>
<p>As more and more governmental, educational, medical, and other complex documents become available on the web, advanced full text searching will enable us to search vast databases in a tractable fashion. Even more clever full text retrieval engines will turn dumb, &#8220;gotta Google them&#8221; document portals into true Web 3.0 and Semantic Web applications.</p>
]]></content:encoded>
			<wfw:commentRss>http://itknowledgeexchange.techtarget.com/semantic-web/full-text-searching-cleaver-heuristics-for-managing-large-web-based-document-collections/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
