Well today is the one year birthday of my blog. It’s been a very fun and exciting year. Here’s hoping that the next year will be just as fun.
The company I work for just finished our office move into our newer slightly larger office. I’ve managed to snag myself a Window seat this time around. I just wanted to share the view with everyone. This is the view out my desk window.
Some times its great to live in Southern California.
Orange County now has its own SQL Server User Group. It is being run as a part of the Orange County .NET Users Group. The first speaker was my friend Lynn Langit last month on August 14th. I’ve been asked to be the next speaker on October 16th. The meeting starts at 6pm and runs until 9pm. The meeting is being held at New Horizons in Anaheim at 1900 S. State College Blvd near the corner of Katella Ave.
I’ll be talking about Clustering SQL Server. It is going to be similar to the session about Clustering SQL Server which I’ve given at the prior SoCal Code Camps. If you would like a preview of the slide deck it is available for download. If I change the slide deck any I’ll update the version available for download.
If you are not able to make it to the session feel free to post your questions below and I’ll answer them as quickly as I can.
With SQL Injection Attacks being all the rage these days, I’ve been asked a couple of times for T/SQL code to clean up the database.
So I threw this code together to clean up the data. This code will clean all the character and uni-code columns in all the user defined tables in the system. You’ll need to be dbo or sysadmin to run this without error. If you have TEXT or NTEXT columns it will through an error for those columns. Cleaning TEXT and NTEXT columns is a little more complex as you can’t use the REPLACE function on a TEXT or NTEXT datatype.
DECLARE @sql NVARCHAR(4000) DECLARE @InsertedValue NVARCHAR(1000) SET @InsertedValue = 'The Script tags which were inserted' DECLARE cur CURSOR FOR select 'update [' + sysusers.name + '].[' + sysobjects.name + '] set [' + syscolumns.name + '] = replace([' + syscolumns.name + '], ''' + @InsertedValue + ''', '''')' from syscolumns join sysobjects on syscolumns.id = sysobjects.id and sysobjects.xtype = 'U' join sysusers on sysobjects.uid = sysusers.uid where syscolumns.xtype in (35, 98, 99, 167, 175, 231, 239, 241, 231) OPEN cur FETCH NEXT FROM cur INTO @sql WHILE @@FETCH_STATUS = 0 BEGIN exec (@sql) FETCH NEXT FROM cur INTO @sql END CLOSE cur DEALLOCATE cur
Hopefully you find this useful. If you need code for TEXT or NTEXT columns just post a comment and I’ll throw something together.
This code will work on SQL 2000 and up (it’ll probably work on SQL 7 as well, but I don’t have a SQL 7 machine to test against).
This morning I released another tip on SearchSQLServer.com called “SQL Server consolidation: Why it’s an optimization technique“. It’s a quick overview on consolidating SQL Server Instances onto a single server along with some of the gotchas.
When dealing with a high load replication environment network latency and network traffic can become as issue. There are a few settings you can play with, within the subscription agents themselves such as the –PacketSize, -UseInprocLoader, and –SubscriptionStreams. While working with these settings may or may not help (personally I’ve had very limited success getting them to improve the data flow rate) there are some other techniques you can use.
The best technique to you is to include your stored procedures as articles within the publication. This will allow SQL Server to instead of replicating the entire record from one server to another, it will simply run the stored procedure on the subscriber with the same input settings as it was run with on the publisher. If you have very complex stored procedures, you’ll want to evaluate the increase load on the subscribers CPU against the reduction of network traffic between the distributor and subscriber.
This technique is critical to use to reduce the network traffic if you have a multi-tier replication topology. This is where your subscribers are also publishers to subscribers of their own. This is because of the way that the replication writes data to the subscriber. Replication uses stored procedures on the subscriber to update the values of the record. Assuming a table which looks like this:
CREATE TABLE Employee
(EmployeeId INT IDENTITY(1,1),
The stored procedure which Microsoft uses for replication will look like this.
CREATE PROCEDURE MS_Employee_Update
SET EmployeeId = CASE WHEN @Columns & 1 = 1 THEN @EmployeId ELSE EmployeeId END,
LastName = CASE WHEN @Columns & 2 = 2 THEN @LastName ELSE LastName END,
FirstName = CASE WHEN @Columns & 4 = 4 THEN @FirstName ELSE FirstName END,
DepartmentId = CASE WHEN @Columns & 8 = 8 THEN @DepartmentId ELSE DepartmentId END
WHERE EmployeeId = @KeyField
While this technique works great for a two tier replication topology, we suddenly have a problem when we try to replicate from the second tier to the third tier. On the second tier all the columns have been updated, no matter which columns have actually been updated. This will cause an undo amount of replication traffic between the second and third tier servers as every column must now be replicated instead of just the values which have changed.
When you replicate the stored procedures as well as the tables when the stored procedure is fired to update the data on the publisher, that same stored procedure will be fired on the subscriber thereby reducing the amount of data which needs to be sent from the second tier to the third tier as again only the stored procedure execution is replicated. On very wide tables this reduction of data can reduce your network traffic load up to 70% (or higher depending on the data being changed and the width of the table).
You can verify that SQL Server is indeed using the stored procedures by using SQL Profiler to profile the distribution process on the subscriber. If you see your stored procedure names being fired then it’s using this technique. If you see procedures named by Microsoft, then it is not.
This technique will only work, if your application has no inline SQL. If you use a combination of stored procedures and inline SQL within the application, the data which is changed by stored procedure execution will be replicated using this technique; while the data which is changed by the inline code will be replicated using the traditional technique. If you have dynamic SQL within your stored procedures the SQL Server will not be able to use the replicated procedure to update the data on the subscriber. The dynamic SQL will prevent the SQL Server from recognizing that the stored procedure is what caused the data change, and the data will be replicated using the normal technique.
If you have a very high load on your database, but no specifically massive tables you may want to look into a fully load balanced solution for your database. This solution falls somewhere between your normal replication solution, and a federation solution. It’s not truly a federation as no one table is spread across several servers; every server in the federation holds every record of every table.
When laid out on paper this type of replication looks very similar to the Pyramid Federation technique we talked about earlier. This type of setup is ideal for an OLTP environment. We use transactional replication to move the data from one server to another as quickly as possible (you’ll want a pretty fast distributor to handle this). Because all writes are done to a single server (to prevent any potential identity column issues) this solution requires the most code change at the application layer as all commands which write data to the database must go to one connection string which talks to the publisher, which all the reads go to the load balancer which reads data from the federation of subscribers.
Because the publisher and distributor are both single points of failure in this setup, it’s recommended that they both be clustered so that they can survive a hardware failure. There is no need to cluster the subscribers as they are redundant by the fact that there are several of them behind a load balancer.
This concludes my mini-series on database federation. I hope that you have found it useful. As always questions or comments are welcome in the comments section below. (There’s no alerts when a response it posted, it’s something that I’ve already brought up with the ITKE staff, so check back for a response.)
Another type of database federation is what I call the full replication federation. This is where you place all the dimension tables (sticking with our data warehouse example from last time) on all servers of the federation. In addition to having the dimension tables on all the servers in the federation, we also allow all the users to connect to all the servers in the federation. This effectively creates an Active/Active solution as users should be connecting to the SQL Servers through a load balancer. As the dimensions are going to be read only as far as the users are concerned it doesn’t matter which server they connect to.
I call this the full replication federation as we setup replication on all tables which aren’t our large table which has been federated.
As we are connecting to all the servers, we can’t have our view and table named the same thing. In this case we want to have our actual table and view with different names. I prefer to simply use a different schema to hide the table where I want it. This changes our view to look more like this (using a three server federation).
CREATE VIEW dbo.FACT_Sales AS
I like to put the local database name in the view script, so that the same script can be easily deployed to each server. You can at your discretion remove the local server and database name.
You can now query the Data.FACT_Sales table on all three servers by simply querying the view on the local table.
You may end up with some of the same “interesting” optimizer query plans as when using the Pyramid federation technique, and the same solutions which we discussed in the “The Pyramid Federation ” post will still apply.
I’d like to say thanks to the San Diego SQL Server Users Group for inviting me to speak last night.
I had a great time speaking to the group, and just like last time the questions were all excellent.
The slide deck for last night session on Federated Databases is now available. I believe that it will also be made available on the San Diego SQL Server Users Group website.
For those that missed the session, it is one of the sessions which I’m giving at the SoCal Code Camp on October 25 and 26 at USC in Los Angeles. Based on the time the presentation took last night, I’ll be expanding it a bit to better fill the two hours I’ve allocated for it at the Code Camp. If you are going to attend the Code Camp, be sure to mark the interested check box after you register so that the Code Camp staff knows how large of a room to put all the session in.
There are several techniques which can be used to federate your database. The first of which, which we will be talking about is the Pyramid Federation (I have no idea if that what it’s actually called, but that what I’ve named it). In a pyramid federation we have a single server which holds the bulk of the tables. Then a set of servers sits beneath this server holding the large table which has been spread over the federation. Normally no data is replicated between all these servers, however data can be replicated if this will improve query performance. That’s a decision which you’ll have to make depending on your system design and platform load.
This type of system is great, if you have just a few tables which you need to federate because of their size, and length of time to return queries issued against them.
The basic layout of the system is that we have a single front end server which holds all our other tables. Using a data warehouse as an example, we keep all out dimensions on this front end server. While our large fact tables and then spread of the 3 servers which make up our backend system. While a user could connect to any of the four servers in the system, the only server which all the data will be available from is the front end server which holds the dimensions. The servers don’t need to know that the table is spread out across three physical servers, as they will query a few as they normally would a single table on the system.
In our example our head server will be SQL_Main, and our three back end servers will be SQL0, SQL1 and SQL2. The table which we have spread over the federation is called FACT_Sales, and we have designed it to hold many, many years worth of sales data totaling in the several billions of rows. Each of the SQL0-SQL2 servers will hold 1/3 of the data for the table. We use the MOD (%) function to decide which SQL Server the data is stored on. (We’ll cover this later, I promise.)
On our SQL_Main server we have a view called FACT_Sales. This view will be setup something like this.
CREATE VIEW FACT_Sales
As you can see from the view definition, the view the fairly simple, we simply query the three remote servers for all the data, matching whatever parameters we pass to the view when we call it. When we create the FACT_Sales tables on the SQL0-SQL2 servers an additional column should be created. As we are using the SalesId value (which is populated by our sales system, not the data warehouse) to figure out which server the row should be stored on, we place a SalesMod column on the table. We will also place a constraint on this column so that the table on SQL0 can only have a SalesMod value of 0, and the table on SQL1 can only have a SalesMod value of 1, and the table on SQL2 can only have a SalesMod value of 2. Loading the data can be done in two ways.
1. The first option is to simply bulk load all the data into the FACT_Sales view, and let the SQL Servers figure out where everything needs to go. This technique will work just fine for smaller sets of data. Just make sure to include a column as part of the select from the sales system(s) which has the formula of SalesId%2, this will give you the value of the SalesMod column to split the data between servers.
2. The second option is to split the data into three select statements through our ETL process and load each of the backend servers separately. If you have a larger amount of data to process this may be faster as there is one less server processing the data, and therefore one less network hop to work with. In addition when using the first technique all data must be written through the linked servers to the backend database servers, and linked servers are not the most efficient way to move a large amount of data.
When using this technique to federate your database, you must be very careful with your queries. You may find that if your dimensions are large, and you are using the dimensions to filter your data, you can end up with some extremely inefficient queries. If this happens you may wish to replicate some of the dimensions from the SQL_Main server to the three back end servers, and reference these replicated dimensions in your query. This will make your query much more complex, but if some correctly if can help the SQL optimizer make much more effective decisions. An example query could be using the DIM_DateTime to filter your records.
JOIN DIM_DateTime on FACT_SalesData.DateTimeId = DIM_DateTime.DateTimeId
AND DIM_DateTime.Year = 2006
This could, under some circumstances, cause the SQL Optimizer to make some “interesting” decisions. A more effective query plan could result from a query something like this. Adjusting the indexes of the FACT tables will usually resolve this issue, however in some cases it may not. SQL Profiler will be your best friend when attempting to resolve these issues, as it will allow you to see exactly what commands the SQL Server you are connected to is sending to the remote server.
LEFT OUTER JOIN SQL0.MyDataWarehouse.dbo.DIM_DateTime d0 ON FACT_SalesData.DateTimeId = d0.DateTimeId
AND d0.Year = 2006
AND FACT_SalesData.SalesMod = 0
LEFT OUTER JOIN SQL1.MyDataWarehouse.dbo.DIM_DateTime d1 ON FACT_SalesData.DateTimeId = d1.DateTimeId
AND d1.Year = 2006
AND FACT_SalesData.SalesMod = 1
LEFT OUTER JOIN SQL2.MyDataWarehouse.dbo.DIM_DateTime d2 ON FACT_SalesData.DateTimeId = d2.DateTimeId
AND d2.Year = 2006
AND FACT_Salesdata.SalesMod = 2
By joining to all three servers DateTime dimension, and specifying that it should join to the local servers FACT_SalesData values only SQL Server should restrict the queries to the local server, and return the subset of data that we are looking for. It may however be necessary to manually break up the queries against each server within their own UNION ALL blocks.
FROM SQL0.MyDataWarehouse.dbo.FACT_SalesData FACT_SalesData
LEFT OUTER JOIN SQL0.MyDataWarehouse.dbo.DIM_DateTime DIM_DateTime ON FACT_SalesData.DateTimeId = DIM_DateTime.DateTimeId
AND d0.Year = 2006
FROM SQL1.MyDataWarehouse.dbo.FACT_SalesData FACT_SalesData
LEFT OUTER JOIN SQL1.MyDataWarehouse.dbo.DIM_DateTime DIM_DateTime ON FACT_SalesData.DateTimeId = DIM_DateTime.DateTimeId
AND d0.Year = 2006
FROM SQL2.MyDataWarehouse.dbo.FACT_SalesData FACT_SalesData
LEFT OUTER JOIN SQL2.MyDataWarehouse.dbo.DIM_DateTime DIM_DateTime ON FACT_SalesData.DateTimeId = DIM_DateTime.DateTimeId
AND d0.Year = 2006
Be sure to use the UNION ALL clause, and not the UNION clause so that the head SQL Server doesn’t try and do a distinct on these values. They are all going to be distinct between each server, and the MOD will be different on each one.
Depending on each situation you’ll need to make some decisions on which query technique gives you the best performance based on your specific data layout and dimension size. Different queries in your environment may have different query requirements.
When working with a federated database platform it is especially important to have an experienced query writer writing the bulk of the queries against the database platform, to reduce as much as possible the poor execution plans written against the database engine.
Look for my next post on database federation where we look into another technique for federating your database.