<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Enterprise IT Consultant Views on Technologies and Trends &#187; GFS</title>
	<atom:link href="http://itknowledgeexchange.techtarget.com/enterprise-IT-tech-trends/tag/gfs/feed/" rel="self" type="application/rss+xml" />
	<link>http://itknowledgeexchange.techtarget.com/enterprise-IT-tech-trends</link>
	<description>Everything from Mainframes to Cloud</description>
	<lastBuildDate>Fri, 10 May 2013 20:03:12 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
		<item>
		<title>Google BigTable &#8211; distributed data storage</title>
		<link>http://itknowledgeexchange.techtarget.com/enterprise-IT-tech-trends/google-bigtable-distributed-data-storage/</link>
		<comments>http://itknowledgeexchange.techtarget.com/enterprise-IT-tech-trends/google-bigtable-distributed-data-storage/#comments</comments>
		<pubDate>Fri, 17 Sep 2010 07:36:17 +0000</pubDate>
		<dc:creator>Sasirekha R</dc:creator>
				<category><![CDATA[Bigtable]]></category>
		<category><![CDATA[distributed data storage]]></category>
		<category><![CDATA[GFS]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[NoSQL]]></category>

		<guid isPermaLink="false">http://itknowledgeexchange.techtarget.com/enterprise-IT-tech-trends/google-bigtable-distributed-data-storage/</guid>
		<description><![CDATA[Google BigTable &#8211; distributed data storage Google&#8217;s data volumes are in petabytes and the typical requirement can be translated to joining two tables where tables are distributed over 100,000 nodes, and relational databases are not the right fit for them. Google&#8217;s motivation for developing its own solutions is driven by its need for massive scalability, [...]]]></description>
				<content:encoded><![CDATA[<h1><span style="color: #800000;">Google BigTable &#8211; distributed data storage</span></h1>
<p><a href="http://searchstorage.techtarget.com/news/2240183724/EMC-World-2013-Google-storage-coming-to-your-data-center" target="_blank">Google&#8217;s data volumes</a> are in petabytes, and a typical requirement amounts to joining two tables that are distributed over 100,000 nodes; relational databases are not the right fit for such workloads. Google&#8217;s motivation for developing its own solutions is driven by its need for massive scalability, better control of performance characteristics, and the ability to run on commodity hardware so that each new service or increase in load results in only a small incremental cost.</p>
<p>Google developed BigTable, a distributed data storage system built on top of Google&#8217;s other services (specifically GFS, the Scheduler and the Chubby Lock Service) and designed to work alongside MapReduce. It has been in active use since February 2005. Its data model is simple and gives users <em>dynamic control</em> over data layout and format.<span id="more-65"></span></p>
<p>Each table is a sparse, distributed, persistent, multi-dimensional sorted map. The map is indexed by a row key (up to 64KB in size), a column key and a timestamp. Each value in the map is treated as an uninterpreted array of bytes, although clients can serialize various forms of structured and semi-structured data into it.</p>
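<p>The map described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the class <code>SparseTable</code>, its methods and the sample keys are hypothetical names invented here, not part of any Bigtable API.</p>

```python
# Toy model of the data map: (row key, column key, timestamp) -> bytes.
# SparseTable and its methods are hypothetical names for this sketch only.

class SparseTable:
    MAX_ROW_KEY_BYTES = 64 * 1024  # row keys can be up to 64KB

    def __init__(self):
        self._cells = {}  # only cells that were written take up space (sparse)

    def put(self, row, column, timestamp, value):
        if len(row.encode("utf-8")) > self.MAX_ROW_KEY_BYTES:
            raise ValueError("row key exceeds 64KB")
        if not isinstance(value, bytes):
            raise TypeError("values are uninterpreted bytes")
        self._cells[(row, column, timestamp)] = value

    def get(self, row, column, timestamp):
        return self._cells.get((row, column, timestamp))

    def scan(self):
        # Entries come back sorted by row key, then column key, then timestamp.
        return sorted(self._cells.items())


table = SparseTable()
table.put("com.example/index", "contents:html", 2, b"new page")
table.put("com.example/index", "contents:html", 1, b"old page")
table.put("com.acme/home", "anchor:link", 1, b"Acme")
```

<p>Because the keys are plain tuples, sorting them yields the lexicographic row order that the table maintains; serializing richer values into bytes is left to the caller.</p>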
<p>The table maintains data in lexicographic order by row key. The row range of a table is dynamically partitioned into <em>tablets</em> of approximately 100-200 MB each, and each machine stores about 100 tablets. This makes reads of short row ranges efficient, as they involve only a small number of machines. The small tablet size also allows fast rebuilding and fine-grained load balancing.</p>
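<p>A minimal sketch of why sorted row ranges keep short scans local. The split points and server names below are made up for illustration, and a binary search over end keys is just one plausible way to model the partitioning.</p>

```python
import bisect

# Each tablet covers a contiguous row range; here a tablet is identified by
# the last row key it holds. The keys and server names are made up.
tablet_end_keys = ["f", "m", "s", "\uffff"]          # sorted split points
tablet_servers = ["ts-1", "ts-2", "ts-3", "ts-4"]    # one server per tablet


def tablet_for(row_key):
    """Index of the tablet whose row range contains row_key."""
    return bisect.bisect_left(tablet_end_keys, row_key)


def servers_for_range(start_row, end_row):
    """Servers that must be contacted to scan rows start_row..end_row."""
    first, last = tablet_for(start_row), tablet_for(end_row)
    return tablet_servers[first:last + 1]
```

<p>With these sample split points, a scan from <code>"apple"</code> to <code>"banana"</code> stays on a single server, while a scan from <code>"kiwi"</code> to <code>"orange"</code> touches two.</p>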
<p>Columns store arbitrary name-value pairs. A column key takes the form <strong><em>column-family</em></strong>:label and its value is a string. The set of possible column families for a table is fixed at table creation time; the number of distinct column families in a table is expected to be small and to change rarely. The actual columns (i.e., labels) within a column family can be created dynamically at any time.</p>
<p>As in column-oriented databases, the data of a column family is stored close together, resulting in efficient data access. For example, a sales analysis may read only the data in the location column family, while a market analysis may use only the product column family. Access control, as well as disk and memory accounting, is done at the column-family level.</p>
<p>Each cell (row, column) can contain <strong>multiple versions</strong> of the data, indexed by timestamp. Versions are stored in decreasing timestamp order so that the most recent can be read first. Timestamps assigned automatically represent real time in microseconds; applications that need to avoid collisions must generate unique timestamps themselves. Using the automatic garbage collection of cell versions, the client can specify that only the last <em>n</em> versions of a cell, or only versions that are new enough (say, written in the last <em>d</em> days), be retained.</p>
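<p>The two retention policies can be sketched as follows. The helper names <code>keep_last_n</code> and <code>keep_newer_than</code> are hypothetical, and the sample cell is invented; the sketch only shows the selection logic, not how a real tablet server applies it.</p>

```python
# Versions of one cell, kept newest first; timestamps are in microseconds.
# keep_last_n and keep_newer_than are hypothetical helper names.

MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000


def sort_versions(versions):
    """Order (timestamp, value) pairs by decreasing timestamp."""
    return sorted(versions, key=lambda tv: tv[0], reverse=True)


def keep_last_n(versions, n):
    """Retain only the n most recent versions."""
    return sort_versions(versions)[:n]


def keep_newer_than(versions, now_micros, days):
    """Retain only versions written within the last `days` days."""
    cutoff = now_micros - days * MICROS_PER_DAY
    return [tv for tv in sort_versions(versions) if tv[0] >= cutoff]


now = 10 * MICROS_PER_DAY
cell = [
    (1 * MICROS_PER_DAY, b"v1"),
    (9 * MICROS_PER_DAY, b"v3"),
    (5 * MICROS_PER_DAY, b"v2"),
]
```

<p>For the sample cell, keeping the last two versions retains <code>v3</code> and <code>v2</code>, while a three-day window measured from <code>now</code> retains only <code>v3</code>.</p>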
<p>The BigTable implementation has three major components: a library linked into every client, one Master server, and many tablet servers (which can be added dynamically). The Master is responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, balancing tablet-server load, garbage collection, and handling schema changes such as table and column family creation. Each tablet server manages the read and write requests for a set of tablets (ten to a thousand tablets per server).</p>
<p>BigTable uses a three-level hierarchy similar to that of a B+ tree to store tablet location information.</p>
<p>1. The first level is a file, stored in Chubby, that contains the location of the <em>root tablet</em> (the 1<sup>st</sup> METADATA tablet, which is never split).</p>
<p>2. The root tablet contains the locations of all tablets in a special METADATA table.</p>
<p>3. Each METADATA tablet contains the locations of a set of user tablets (each METADATA row is around 1KB).</p>
<p>Client data does not move through the Master. The client library caches tablet locations and also prefetches them by reading the metadata for more than one tablet at a time. As a result, clients do not have to rely on the Master for tablet locations, so the Master is lightly loaded in practice and does not become a bottleneck.</p>
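<p>The three-level lookup and the client-side cache might be modelled roughly like this. Everything here is an assumption made for illustration: the dictionaries stand in for Chubby, the root tablet and the METADATA tablets, and <code>ClientLibrary</code> is a hypothetical name.</p>

```python
# Hypothetical in-memory stand-ins for Chubby, the root tablet and METADATA.
chubby_file = {"root_tablet": "ts-root"}                  # level 1
root_tablet = {"metadata-0": "ts-7"}                      # level 2: METADATA tablet -> server
metadata_tablets = {                                      # level 3: user tablet -> server
    "metadata-0": {"users:a-m": "ts-3", "users:n-z": "ts-9"},
}


class ClientLibrary:
    def __init__(self):
        self.cache = {}    # user tablet -> tablet server
        self.lookups = 0   # number of full three-level walks performed

    def locate(self, user_tablet):
        if user_tablet in self.cache:
            return self.cache[user_tablet]
        self.lookups += 1
        _root_server = chubby_file["root_tablet"]   # where the root tablet lives
        for meta_name in root_tablet:
            rows = metadata_tablets[meta_name]
            if user_tablet in rows:
                # Prefetch: remember every location read from this METADATA tablet.
                self.cache.update(rows)
                return rows[user_tablet]
        raise KeyError(user_tablet)
```

<p>Locating <code>users:a-m</code> walks the hierarchy once; a later request for <code>users:n-z</code> is answered from the prefetched cache, and the Master is never consulted.</p>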
<p>The API provides functions for creating and deleting tables and column families, and for changing cluster, table and column family metadata. Applications can write or delete values, look up values in individual rows, or iterate over a subset of the data. Single-row transactions are provided, supporting atomic read-modify-write sequences on data stored under a single row key.</p>
<p>BigTable does not currently support general transactions across row keys; instead, it provides an interface for batching writes across row keys. A table can be used in conjunction with the <em>MapReduce</em> framework for running large-scale parallel computations: wrappers are provided that allow Bigtable to be used both as an input source and as an output target for MapReduce.</p>
<p>In addition, the execution of client-supplied scripts in the address spaces of the servers is supported. The scripts are written in Sawzall, a language developed at Google, and allow various forms of data transformation, filtering and summarization. At present, client scripts are not allowed to write back into Bigtable.</p>
<p>The Google SSTable file format is used internally to store Bigtable data. An SSTable can optionally be mapped completely into memory, which allows lookups and scans without touching disk. Updates are committed to a commit log that stores redo records. The recently committed updates are kept in memory in a sorted buffer called a memtable, while older updates are stored in a sequence of SSTables. The persistent state of a tablet (the commit log and the SSTables) is stored in GFS. A read operation is executed on a merged view of the sequence of SSTables and the memtable.</p>
<p>The only mutable data structure accessed by both reads and writes is the memtable. The Master removes obsolete SSTables through a <strong>mark-and-sweep garbage collection</strong> over the set of SSTables, in which the METADATA table contains the set of roots.</p>
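<p>A toy version of this write and read path, assuming a hypothetical <code>Tablet</code> class. Python lists and dictionaries stand in for the commit log, the memtable and the SSTables, so this shows only the lookup order, not the real file format or GFS storage.</p>

```python
# Toy tablet: a commit log, a memtable for recent writes and immutable SSTables.
# The class and method names are assumptions made for this sketch.

class Tablet:
    def __init__(self):
        self.commit_log = []   # redo records, appended before a write is applied
        self.memtable = {}     # recent updates, kept in memory
        self.sstables = []     # older updates, newest SSTable last

    def write(self, key, value):
        self.commit_log.append((key, value))
        self.memtable[key] = value

    def flush(self):
        """Freeze the memtable into a new immutable, sorted SSTable."""
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        """Read from the merged view: memtable first, then newest SSTable."""
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for k, v in sstable:
                if k == key:
                    return v
        return None
```

<p>Checking the memtable before the SSTables, and newer SSTables before older ones, is what lets a read return the latest value without ever modifying an SSTable in place.</p>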
<p>Tablet servers use <strong>two levels of caching &#8211; Scan Cache and Block Cache</strong> &#8211; to improve read performance. Scan Cache is useful for applications that tend to read the same data repeatedly. Block Cache is for applications that tend to read data that is close to the recently used data (e.g., sequential reads).</p>
<p>There is a <strong>single commit log per tablet server</strong>, in which the mutations for different tablets are mingled in the same physical log. Using one log provides significant performance benefits: it reduces the number of disk seeks and makes the group commit optimization effective. The trade-off is that it complicates recovery: when a tablet server dies, its tablets are moved to a large number of other servers, each of which needs only the log entries for its own tablets.</p>
<p>Clients can control whether SSTables are compressed and which <strong>compression format</strong> is used. Compression is applied to each SSTable block separately; although this is not ideal for saving space, it enables reading small portions of an SSTable without decompressing the entire file.</p>
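<p>The trade-off of per-block compression can be sketched with Python&#8217;s standard <code>zlib</code> module. The block size and payload are invented for the example, and zlib is used only because it ships with Python; it is not a claim about the compression formats Bigtable offers.</p>

```python
import zlib

BLOCK_SIZE = 16  # bytes per block; tiny on purpose so the example stays readable


def compress_blocks(data):
    """Compress each fixed-size block of data independently."""
    return [
        zlib.compress(data[i:i + BLOCK_SIZE])
        for i in range(0, len(data), BLOCK_SIZE)
    ]


def read_block(blocks, index):
    """Decompress only the requested block, leaving the others untouched."""
    return zlib.decompress(blocks[index])


payload = b"row-0001=alpha;;row-0002=bravo;;row-0003=delta;;"
blocks = compress_blocks(payload)
```

<p>Compressing the blocks independently gives up redundancy shared between blocks, which costs space, but <code>read_block(blocks, 1)</code> can return the second record without decompressing the first or the third.</p>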
<p>BigTable is now used by a number of Google applications &#8211; Google Reader, Google Maps, Google Book Search, &#8220;My Search History&#8221;, Google Earth, Blogger.com, Google Code hosting, Orkut, YouTube and Gmail.</p>
<p>BigTable is currently not distributed or used outside of Google, although Google offers access to it as part of its Google App Engine. NoSQL databases such as Google&#8217;s Datastore, Amazon&#8217;s SimpleDB, Apache&#8217;s Cassandra, Kosmix&#8217;s KDI and HBase for Hadoop are based on similar data models.</p>
]]></content:encoded>
			<wfw:commentRss>http://itknowledgeexchange.techtarget.com/enterprise-IT-tech-trends/google-bigtable-distributed-data-storage/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Google File System (GFS) &#8211; massively parallel and fault tolerant distributed file system</title>
		<link>http://itknowledgeexchange.techtarget.com/enterprise-IT-tech-trends/google-file-system-gfs-massively-parallel-and-fault-tolerant-distributed-file-system/</link>
		<comments>http://itknowledgeexchange.techtarget.com/enterprise-IT-tech-trends/google-file-system-gfs-massively-parallel-and-fault-tolerant-distributed-file-system/#comments</comments>
		<pubDate>Thu, 09 Sep 2010 09:23:03 +0000</pubDate>
		<dc:creator>Sasirekha R</dc:creator>
				<category><![CDATA[commodity servers]]></category>
		<category><![CDATA[fault tolerant]]></category>
		<category><![CDATA[GFS]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[large files]]></category>
		<category><![CDATA[Massively parallel]]></category>

		<guid isPermaLink="false">http://itknowledgeexchange.techtarget.com/enterprise-IT-tech-trends/?p=61</guid>
		<description><![CDATA[Google File System (GFS) &#8211; massively parallel and fault tolerant distributed file system Google File System (GFS) is optimized for handling Google&#8217;s core data storage and usage needs involving generating and retaining enormous amounts of data. GFS is designed to manage significantly large files using a large distributed cluster of commodity servers connected by a [...]]]></description>
				<content:encoded><![CDATA[<h1><span style="color: #800000">Google File System (GFS) &#8211; massively parallel and fault tolerant distributed file system</span></h1>
<p>Google File System (GFS) is optimized for Google&#8217;s core data storage and usage needs, which involve generating and retaining enormous amounts of data. GFS is designed to manage very large files using a large distributed cluster of commodity servers connected by a high-speed network. It is designed to expect and tolerate hardware failures, even while a file is being read or written, and to support parallel reads, writes and appends by multiple client programs.</p>
<p>GFS splits large files into chunks of 64MB that are <em>extremely rarely overwritten</em> or shrunk, but are typically appended to or read. These chunks are stored on cheap commodity servers (nodes) called chunk servers, so the design has to take precautions against the high failure rate of individual nodes and the resulting data loss. Another design decision is to favor high data throughput, even at the cost of latency.<span id="more-61"></span></p>
<p>The nodes are divided into two types &#8211; <em>one Master node</em> and a large number of chunk servers. Each chunk is replicated at least three times, with replicas placed on different physical racks and different networks to handle various possible failures. For files that are in high demand or require more redundancy, the replication factor can be higher than three.</p>
<p>Each chunk is assigned a unique 64-bit label and the logical mappings of files to constituent chunks are maintained. The Master server stores all the metadata associated with the chunks:</p>
<p>1. Mapping the 64-bit labels to chunk locations and the files they are part of</p>
<p>2. Details of the location of the copies of the chunk and which of them is primary</p>
<p>3. Details of the processes that are reading or writing to a particular chunk</p>
<p>The Master keeps the metadata current by periodically receiving updates (&#8220;heartbeat messages&#8221;) from each chunk server.</p>
<p>To read a file, the client program sends the full path and offset to the GFS Master, which returns the metadata for one of the replicas of the chunk. The client then reads the data directly from the designated chunk server. The client does not cache the data it reads, as most reads are large, but it does cache the metadata so that it need not contact the Master every time.</p>
<p>In the case of an append, GFS sends back the metadata for all the replicas of the chunk to which the data is to be appended. The append then proceeds as follows:</p>
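<p>The read path can be sketched as below. This is a simplified model under stated assumptions: the <code>Master</code> and <code>Client</code> classes, the file path, chunk labels and server names are all hypothetical, and having the client turn the byte offset into a chunk index before asking the Master is a choice made for this sketch.</p>

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB chunks


class Master:
    """Holds only metadata: (path, chunk index) -> (chunk label, replica servers)."""

    def __init__(self, chunk_table):
        self.chunk_table = chunk_table
        self.requests = 0

    def lookup(self, path, chunk_index):
        self.requests += 1
        return self.chunk_table[(path, chunk_index)]


class Client:
    def __init__(self, master):
        self.master = master
        self.metadata_cache = {}  # metadata is cached; file data is not

    def locate(self, path, offset):
        """Return (chunk label, chunk server, offset within the chunk)."""
        chunk_index = offset // CHUNK_SIZE
        key = (path, chunk_index)
        if key not in self.metadata_cache:
            self.metadata_cache[key] = self.master.lookup(path, chunk_index)
        label, replicas = self.metadata_cache[key]
        return label, replicas[0], offset % CHUNK_SIZE


master = Master({
    ("/logs/web.log", 0): (0x1A, ["cs-1", "cs-4", "cs-7"]),
    ("/logs/web.log", 1): (0x1B, ["cs-2", "cs-5", "cs-8"]),
})
client = Client(master)
```

<p>Two reads that fall in the same 64MB chunk cost the Master a single request; the second is resolved entirely from the client&#8217;s metadata cache.</p>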
<ol>
<li>The client sends the data to be appended to all the chunk servers</li>
<li>The client informs the primary chunk server once all of them have acknowledged receipt of the data</li>
<li>The primary chunk server first appends its copy of the data at an offset of its choice (which may be beyond the EOF, as multiple writers may be appending to the file simultaneously)</li>
<li>The primary then forwards the request to all replicas</li>
<li>The replicas in turn try to write the data at the same offset as the primary, or return failure</li>
<li>In case of failure, the primary rewrites the data at a different offset and retries the process</li>
</ol>
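<p>The append steps above can be sketched as a small simulation. The <code>Replica</code> class, the <code>fail_once</code> flag and the zero-byte padding are assumptions made for illustration; real chunk servers and their failure handling are far more involved.</p>

```python
# Toy record append: each replica is a bytearray standing in for one chunk copy.
# fail_once simulates a replica that rejects the first write it receives.

class Replica:
    def __init__(self, fail_once=False):
        self.data = bytearray()
        self.fail_once = fail_once

    def write_at(self, offset, record):
        if self.fail_once:
            self.fail_once = False
            return False
        if len(self.data) < offset:
            # Pad the gap left by an earlier failed attempt.
            self.data.extend(b"\x00" * (offset - len(self.data)))
        self.data[offset:offset + len(record)] = record
        return True


def record_append(primary, secondaries, record):
    """Return the offset at which every replica holds the record."""
    while True:
        offset = len(primary.data)          # the primary picks the offset
        primary.write_at(offset, record)    # and writes its own copy first
        if all(r.write_at(offset, record) for r in secondaries):
            return offset
        # A replica failed: loop again so the primary retries at a new offset.
```

<p>After a failed attempt the record ends up at the same, later offset on every replica, while the region written by the failed attempt may differ between replicas. A production system would also bound the number of retries, which this sketch does not.</p>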
<p>Modifications, which are expected to be extremely rare, are handled through time-limited, expiring &#8220;leases&#8221;. The Master server grants a process permission to modify a chunk for a finite period of time (provided there is no other pending lease), during which no other process is granted permission to modify that chunk. The modifying chunk server (which is always the primary chunk holder) first modifies its own copy and then propagates the changes to all the replicas. The changes are not saved until all the chunk servers acknowledge them, thereby guaranteeing the completion and atomicity of the operation.</p>
<p>As there is a large amount of redundant data, Google makes heavy use of compression. Because the data is predominantly text based, the benefit is significant, with compressed data occupying as little as 10% of the original size.</p>
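<p>A lease table of this kind might look like the following sketch. <code>LeaseManager</code> is a hypothetical name, and the 60-second duration is an assumption chosen for the example rather than a value taken from the text above.</p>

```python
# Hypothetical lease table kept by the Master: chunk label -> (holder, expiry).

class LeaseManager:
    def __init__(self, duration_seconds=60.0):
        self.duration = duration_seconds  # assumed lease length
        self.leases = {}

    def grant(self, chunk, holder, now):
        """Grant a lease unless another holder's lease is still valid."""
        current = self.leases.get(chunk)
        if current is not None:
            current_holder, expiry = current
            if now < expiry and current_holder != holder:
                return False
        self.leases[chunk] = (holder, now + self.duration)
        return True
```

<p>Passing the current time in explicitly keeps the sketch deterministic: a second holder asking at 30 seconds is refused, while the same request at 61 seconds succeeds because the first lease has expired.</p>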
<p>GFS differs from traditional distributed file systems in its design assumptions: component failures are treated as the norm, files are very large, and most writes are appends rather than overwrites. Unlike many filesystems, GFS is not implemented in the kernel of an operating system but is instead provided as a userspace library. Google BigTable is a distributed storage system that is built on GFS.</p>
]]></content:encoded>
			<wfw:commentRss>http://itknowledgeexchange.techtarget.com/enterprise-IT-tech-trends/google-file-system-gfs-massively-parallel-and-fault-tolerant-distributed-file-system/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
