Google File System (GFS) – massively parallel and fault tolerant distributed file system
Google File System (GFS) is optimized for handling Google’s core data storage and usage needs involving generating and retaining enormous amount of data. GFS is designed to manage significantly large files using a large distributer cluster of commodity servers connected by a high speed network. It is designed to expect and tolerate hardware failures even while reading/writing the file and support parallel reads, writes and appends by multiple client programs.
GFS splits large files into chunks of 64MB that are extremely rarely overwritten or shrunk, but typically appended to or read. These chunks are stored in cheap commodity servers (also nodes) called chunk servers and hence necessitated the design to take precautions against high failure rate of individual nodes and the subsequent data loss. Another design decision is to go for high data throughputs, even if it comes at the cost of latency.
The nodes are divided into two types – one Master node and a large number of Chunk Servers. Each chunk is replicated at least three times on a different physical rack as well as a different network to handles various possible failures. For files that have high demand or require more redundancy, the replication can be higher than three.
Each chunk is assigned a unique 64-bit label and the logical mappings of files to constituent chunks are maintained. The Master server stores all the metadata associated with the chunks:
1. Mapping the 64-bit labels to chunk locations and the files they are part of
2. Details of the location of the copies of the chunk and which of them is primary
3. Details of the processes that are reading or writing to a particular chunk
The metadata is kept current by the Master by periodically receiving updates (or “Heart-beat messages”) from each chunk server.
To read a file, the client program sends the full path and offset to GFS (Master) which returns the metadata for one of the replicas of the chunk. The client directly reads data from the designated chunk server. The client does not cache the data that is read as most reads are large, but caches the metadata instead so that it need not contact the Master every time.
In case of append, the GFS sends back the metadata for all the replicas of the chunk where the data is to be found. And the data append happens as follows:
- Client sends the data to be appended to all the chunk servers
- Client informs the primary chunk server once all acknowledge receipt of data
- The primary chunk first appends its copy of the data into an offset of its choice (it may be beyond the EOF as multiple writers may be appending the files simultaneously)
- The primary then forwards the request to all replicas
- The replicas in turn try to write the data at the same offset as primary or return failure
- In case of failure, the primary rewrite the data at a different offset and retries the process
Modifications, which are expected to be extremely rare, are handled by permitting time-limited, expiring “leases”. The Master server grants permission to the process for a finite period of time (provided there is no other pending lease) during which no other process will be granted permission to modify the chunk. The modifying chunk server (which always is the primary chunk holder) first modifies it copy and then propagates the changes to all the replicas. The changes are not saved until all the chunk servers acknowledge, thereby guaranteeing the completion and atomicity of the operation.
As there is a large amount of redundant data, google makes heavy use of compression which in turn gives significant benefits – with compressed data occupying as low as 10% of the original – as the data is predominantly text based.
GFS differs from traditional distributed file systems as it allows all nodes to have direct concurrent access to the same shared block storage. Unlike many filesystems, GFS is not implemented in the kernel of an operating system but is instead provided as a userspace library. Google BigTable is a distributed storage system that is built on GFS.