In DB2 version 10 for z/OS, IBM introduces a new access type called hash access and a new way of organizing table data called the hash space. Organizing a table by hash improves the performance of queries that access individual rows using an equality predicate (for example, retrieving data by customer number or product number).
DB2 uses an internal hash algorithm against the hash space to locate data rows. In most cases, a hash access path means only one I/O is needed to retrieve a row from the table, which reduces CPU usage and improves response time, making it a very compelling proposition.
Using hash access also removes the need to maintain data in clustering sequence or to maintain a clustering index (in fact, index clustering is not allowed if the table is hash organized). This makes insert processing more efficient and avoids the data sharing contention involved in maintaining a clustering sequence or clustering index.
When creating a table, hash access can be enabled by adding the organization-clause to the CREATE TABLE statement.
ORGANIZE BY HASH specifies that a hash is used to organize the data of the table. The list of column names defines the hash key, which determines the placement of each row. Specifying UNIQUE prevents the table from containing more than one row with the same hash key value (this applies even to the NULL value).
The amount of fixed hash space to be pre-allocated for the table can also be specified (the default is 64 MB). For tables partitioned by range, this space applies to each partition, and the value can be given in KB, MB or GB.
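As a minimal sketch of the organization-clause described above (the table and column names, and the sizes, are illustrative):

    CREATE TABLE CUSTOMER
      (CUST_ID    INTEGER      NOT NULL,
       CUST_NAME  VARCHAR(40),
       CUST_CITY  VARCHAR(30))
      ORGANIZE BY HASH UNIQUE (CUST_ID)
      HASH SPACE 256 M;

Here CUST_ID is the hash key, UNIQUE disallows duplicate key values, and 256 MB of hash space is pre-allocated for the table.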
Hash organization is optimal when a table is of a stable or predictable size. When a table is organized by hash, DB2 automatically creates an overflow index: if the table exceeds the specified hash space, the extra rows are placed in an overflow area and located through this index. Such rows are not reached by hash access; DB2 scans the index to retrieve them.
An existing table can be altered to use hash organization by specifying ADD ORGANIZE BY HASH in the ALTER TABLE statement. The table must then be reorganized, and incompatible features such as index clustering are disabled in the process. For existing tables, specifying AUTOESTSPACE(YES) on the REORG utility lets DB2 automatically estimate the best size for the hash space using real-time statistics.
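A hedged sketch of the conversion (the table, column and table space names are illustrative):

    ALTER TABLE CUSTOMER
      ADD ORGANIZE BY HASH UNIQUE (CUST_ID)
      HASH SPACE 128 M;

    -- The change takes effect once the table space is reorganized,
    -- e.g. with the REORG utility statement:
    --   REORG TABLESPACE MYDB.CUSTTS AUTOESTSPACE(YES)

With AUTOESTSPACE(YES), DB2 sizes the hash space from real-time statistics rather than from the value specified in the ALTER.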
Using hash obviously requires additional disk storage, so hash access should be monitored and the storage space tuned for good performance. If the ACCESSTYPE column in the plan table has a value of ‘H’, ‘HN’ or ‘MH’, hash access is used to access the data. HASHACCESS (in SYSIBM.SYSTABLESPACESTATS) indicates the number of times that hash access paths have been used to access the table, and HASHLASTUSED shows when DB2 last used a hash access path.
If the HASHACCESS value is very low (even after many queries have accessed the table), or if DB2 has not used the hash access path recently, hash organization can probably be removed from the table, saving storage space.
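A hedged sketch of a monitoring query against the real-time statistics table (the database name in the predicate is illustrative):

    -- How often, and how recently, has hash access been used?
    SELECT DBNAME, NAME, HASHACCESS, HASHLASTUSED
      FROM SYSIBM.SYSTABLESPACESTATS
     WHERE DBNAME = 'MYDB'
     ORDER BY HASHACCESS;

Table spaces showing a persistently low HASHACCESS count, or an old HASHLASTUSED date, are candidates for dropping hash organization.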
The size of the hash space can be altered, if required, using the ALTER TABLE statement; in particular, the hash space should be increased if a significant number of rows have spilled into the overflow area.
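A hedged sketch of the resize (the table name and new size are illustrative):

    ALTER TABLE CUSTOMER
      ALTER ORGANIZATION SET HASH SPACE 512 M;

As with the initial conversion, the new size takes effect when the table space is next reorganized.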
Hash organization is only available on universal table spaces (UTS), which are both segmented and partitioned. Hash access cannot be used in the following cases:
o LOB and XML table spaces
o Tables defined with APPEND YES
o Materialized query tables
o Tables in basic row format (hash organization requires reordered row format, RRF)
Some of the restrictions on using hash-organized tables include:
o Parallelism is not used for parallel groups when hash access is used
o Hash access is not used to access fact or dimension tables in a qualified star join
o For queries that use multi-row fetch, hash access is chosen only if the query contains an IN clause that returns multiple rows.
Hash table organization is most useful where unique keys are accessed using equality predicates (customer id, product id, document id) and the order of the data is immaterial.
Database Sharding (made popular by Google) is a method of horizontal partitioning in a database; the approach is highly scalable and provides improved throughput and overall performance. Database Sharding can be defined as a “shared-nothing” partitioning scheme for large databases across a number of servers, each with its own CPU, memory and disk. Simply put, break the database into smaller chunks called “shards” (each becoming a smaller database in itself) and spread those across a number of distributed servers.
The main advantage of the Database Sharding approach is improved scalability, growing in a near-linear fashion as more servers are added to the network, along with the general benefits that come from working with smaller databases.
Sharding typically uses a distributed hash table (DHT), which provides a lookup service similar to a hash table in which any participating node can efficiently retrieve the value associated with a given key.
The characteristics emphasized by a DHT are decentralization (no central co-ordination), scalability (functioning efficiently even with thousands of nodes) and fault tolerance (remaining reliable even as nodes continuously join, leave and fail). These goals are achieved by ensuring that any one node needs to coordinate with only a few other nodes in the system, so that only a limited amount of work needs to be done for each change in membership. This allows a DHT to scale to extremely large numbers of nodes and to handle continual node arrivals, departures and failures.
Most DHTs use consistent hashing to map keys to nodes. This technique employs a function δ(k1, k2) that defines an abstract notion of the distance from key k1 to key k2. Each node is assigned a single key called its identifier (ID). A node with ID ix owns all the keys km for which ix is the closest ID, measured according to δ(km, ix). Consistent hashing has the property that the removal or addition of one node changes only the set of keys owned by the nodes with adjacent IDs, and leaves all other nodes unaffected.
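As a concrete (and simplified) illustration, consider the Chord DHT, one well-known design: keys and node IDs are m-bit integers arranged on a circle, and the distance function is δ(k1, k2) = (k2 − k1) mod 2^m. Each key k is then owned by the first node whose ID is equal to or follows k on the circle, so when a node leaves, only its keys move, and they move to its immediate successor.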
Conceptually, Sharding broadly falls into three categories:
1. Vertical partitioning – All the data related to a specific feature of a product is stored on the same machines. Storing infrequently used or very wide columns on a physically different device is an example. It is also referred to as “row splitting”, as the row is split by its columns.
2. Key-based partitioning – Here, part of the data itself is used to do the partitioning. The most common approach is to use a one-way hashing algorithm to map each data item to one of the shards that store it. Natural keys can be used as well: for numeric keys, the key mod N (where N is the number of shards); for dates, a time interval; for names, the first letter; for amounts, a range of values. Similarly, a list of values can be used to assign a partition, e.g. a list of countries grouped into continents (see the sketch after this list).
3. Directory-based partitioning – In this scheme, a lookup table that records which data is stored in which shard is maintained in the cluster. This approach has two drawbacks: the directory can become a single point of failure, and there is a performance overhead because the directory must be consulted every time a shard has to be located.
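A hedged sketch in SQL (all names and values here are illustrative, not tied to any particular product): with N = 4 shards, key-based partitioning can route a numeric key with key mod N, so customer 10234 lands on shard MOD(10234, 4) = 2. A directory-based scheme instead maintains an explicit lookup table:

    -- Directory table mapping key ranges to shards
    CREATE TABLE SHARD_DIRECTORY
      (KEY_LOW   INTEGER  NOT NULL,   -- lowest key in the range
       KEY_HIGH  INTEGER  NOT NULL,   -- highest key in the range
       SHARD_ID  SMALLINT NOT NULL);  -- shard that owns the range

    -- Locating the shard for a given key costs one directory lookup
    SELECT SHARD_ID
      FROM SHARD_DIRECTORY
     WHERE 10234 BETWEEN KEY_LOW AND KEY_HIGH;

The mod scheme needs no lookup but reshuffles most keys when N changes, whereas the directory scheme can move individual ranges freely at the cost of the extra lookup.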
Composite partitioning, which combines these partitioning schemes, is also used; for example, applying range partitioning first and then hash partitioning.
The distributed nature of multiple shard databases makes a well-designed, fault-tolerant and reliable approach all the more critical.
Distributed queries can perform faster by processing interim results in parallel on each shard server, but the system needs to handle them in a manner that is seamless to the application (MapReduce is one such example).
Various Sharding schemes exist, and each has inherent characteristics and performance advantages when applied to a specific problem domain. To be effective, Database Sharding needs to be application specific, and a single application can use more than one sharding scheme, each applied to a specific portion of the application to achieve optimum results.