Posted by: Denny Cherry
Federated Database, Scaling Out
There are several techniques which can be used to federate your database. The first of which, which we will be talking about is the Pyramid Federation (I have no idea if that what it’s actually called, but that what I’ve named it). In a pyramid federation we have a single server which holds the bulk of the tables. Then a set of servers sits beneath this server holding the large table which has been spread over the federation. Normally no data is replicated between all these servers, however data can be replicated if this will improve query performance. That’s a decision which you’ll have to make depending on your system design and platform load.
This type of system is great, if you have just a few tables which you need to federate because of their size, and length of time to return queries issued against them.
The basic layout of the system is that we have a single front end server which holds all our other tables. Using a data warehouse as an example, we keep all out dimensions on this front end server. While our large fact tables and then spread of the 3 servers which make up our backend system. While a user could connect to any of the four servers in the system, the only server which all the data will be available from is the front end server which holds the dimensions. The servers don’t need to know that the table is spread out across three physical servers, as they will query a few as they normally would a single table on the system.
In our example our head server will be SQL_Main, and our three back end servers will be SQL0, SQL1 and SQL2. The table which we have spread over the federation is called FACT_Sales, and we have designed it to hold many, many years worth of sales data totaling in the several billions of rows. Each of the SQL0-SQL2 servers will hold 1/3 of the data for the table. We use the MOD (%) function to decide which SQL Server the data is stored on. (We’ll cover this later, I promise.)
On our SQL_Main server we have a view called FACT_Sales. This view will be setup something like this.
CREATE VIEW FACT_Sales
As you can see from the view definition, the view the fairly simple, we simply query the three remote servers for all the data, matching whatever parameters we pass to the view when we call it. When we create the FACT_Sales tables on the SQL0-SQL2 servers an additional column should be created. As we are using the SalesId value (which is populated by our sales system, not the data warehouse) to figure out which server the row should be stored on, we place a SalesMod column on the table. We will also place a constraint on this column so that the table on SQL0 can only have a SalesMod value of 0, and the table on SQL1 can only have a SalesMod value of 1, and the table on SQL2 can only have a SalesMod value of 2. Loading the data can be done in two ways.
1. The first option is to simply bulk load all the data into the FACT_Sales view, and let the SQL Servers figure out where everything needs to go. This technique will work just fine for smaller sets of data. Just make sure to include a column as part of the select from the sales system(s) which has the formula of SalesId%2, this will give you the value of the SalesMod column to split the data between servers.
2. The second option is to split the data into three select statements through our ETL process and load each of the backend servers separately. If you have a larger amount of data to process this may be faster as there is one less server processing the data, and therefore one less network hop to work with. In addition when using the first technique all data must be written through the linked servers to the backend database servers, and linked servers are not the most efficient way to move a large amount of data.
When using this technique to federate your database, you must be very careful with your queries. You may find that if your dimensions are large, and you are using the dimensions to filter your data, you can end up with some extremely inefficient queries. If this happens you may wish to replicate some of the dimensions from the SQL_Main server to the three back end servers, and reference these replicated dimensions in your query. This will make your query much more complex, but if some correctly if can help the SQL optimizer make much more effective decisions. An example query could be using the DIM_DateTime to filter your records.
JOIN DIM_DateTime on FACT_SalesData.DateTimeId = DIM_DateTime.DateTimeId
AND DIM_DateTime.Year = 2006
This could, under some circumstances, cause the SQL Optimizer to make some “interesting” decisions. A more effective query plan could result from a query something like this. Adjusting the indexes of the FACT tables will usually resolve this issue, however in some cases it may not. SQL Profiler will be your best friend when attempting to resolve these issues, as it will allow you to see exactly what commands the SQL Server you are connected to is sending to the remote server.
LEFT OUTER JOIN SQL0.MyDataWarehouse.dbo.DIM_DateTime d0 ON FACT_SalesData.DateTimeId = d0.DateTimeId
AND d0.Year = 2006
AND FACT_SalesData.SalesMod = 0
LEFT OUTER JOIN SQL1.MyDataWarehouse.dbo.DIM_DateTime d1 ON FACT_SalesData.DateTimeId = d1.DateTimeId
AND d1.Year = 2006
AND FACT_SalesData.SalesMod = 1
LEFT OUTER JOIN SQL2.MyDataWarehouse.dbo.DIM_DateTime d2 ON FACT_SalesData.DateTimeId = d2.DateTimeId
AND d2.Year = 2006
AND FACT_Salesdata.SalesMod = 2
By joining to all three servers DateTime dimension, and specifying that it should join to the local servers FACT_SalesData values only SQL Server should restrict the queries to the local server, and return the subset of data that we are looking for. It may however be necessary to manually break up the queries against each server within their own UNION ALL blocks.
FROM SQL0.MyDataWarehouse.dbo.FACT_SalesData FACT_SalesData
LEFT OUTER JOIN SQL0.MyDataWarehouse.dbo.DIM_DateTime DIM_DateTime ON FACT_SalesData.DateTimeId = DIM_DateTime.DateTimeId
AND d0.Year = 2006
FROM SQL1.MyDataWarehouse.dbo.FACT_SalesData FACT_SalesData
LEFT OUTER JOIN SQL1.MyDataWarehouse.dbo.DIM_DateTime DIM_DateTime ON FACT_SalesData.DateTimeId = DIM_DateTime.DateTimeId
AND d0.Year = 2006
FROM SQL2.MyDataWarehouse.dbo.FACT_SalesData FACT_SalesData
LEFT OUTER JOIN SQL2.MyDataWarehouse.dbo.DIM_DateTime DIM_DateTime ON FACT_SalesData.DateTimeId = DIM_DateTime.DateTimeId
AND d0.Year = 2006
Be sure to use the UNION ALL clause, and not the UNION clause so that the head SQL Server doesn’t try and do a distinct on these values. They are all going to be distinct between each server, and the MOD will be different on each one.
Depending on each situation you’ll need to make some decisions on which query technique gives you the best performance based on your specific data layout and dimension size. Different queries in your environment may have different query requirements.
When working with a federated database platform it is especially important to have an experienced query writer writing the bulk of the queries against the database platform, to reduce as much as possible the poor execution plans written against the database engine.
Look for my next post on database federation where we look into another technique for federating your database.