SQL Server with Mr. Denny:

Tables

Jul 21 2008   11:00AM GMT

A better way to index text data



Posted by: mrdenny
Index Performance, T/SQL, SELECT statement, Query tuning, Tables, Uni-code

Indexing text data (varchar, nvarchar, char, etc) is a good way to make it faster to find the data you are looking for.  However these indexes can end up being very hard on the disks behind the index, as well as the memory of the server.  This is because of the large amount of data being put in the index.

As an example, let’s say that we have a table like this.

CREATE TABLE Employee
(EmployeeID INT,
FirstName VARCHAR(50),
LastName VARCHAR(50),
EmailAddress VARCHAR(255))

Now assume that you want to be able to search by the EmailAddress field.  We will then want to index the EmailAddress field with a non-clustered index.  If we work for a company like AMD, then our email addresses will be pretty short (f.lastname@amd.com).  However if we work for a company like I work for then the email addresses are a bit longer (flastname@awarenesstechnologies.com).  Now when we index this column we will be putting the entire email address into the index, taking up a lot of space within the index; especially compared to a numeric value such as an integer.  This becomes doubly true if you are using a uni-code data type as each character requires two bytes of storage instead of the usual one.

This also becomes a problem if you are working on a system with URLs in the field to be indexes.  Depending on the length of the URL, the values may be longer than is allowed in an index which could then give you sorting problems on the indexes.

There are a couple of variations on this technique which I’ve seen.  The one I’ve used the most is to use the CHECKSUM function as part of a calculated column, and then index the calculated column.  This way you simply get the CHECKSUM of the value you want to find, and search the calculated column.  As we are now have an index made up of integers the index can fit a lot more data on each physical data page reducing the IO cost of the index seek as well as saving space on the disk.

So doing this turns our table into something like this.

CREATE TABLE Employee
(EmployeeID INT,
FirstName VARCHAR(50),
LastName VARCHAR(50),
EmailAddress VARCHAR(255),
EmailAddressCheckSum AS CHECKSUM(EmailAddress))

Now I wouldn’t recommend using this technique for each table you create.   I usually only recommend a technique like this when the value to be indexes won’t fit within the bounds of the index, or the table will be very large and searched often so the memory saved is worth the extra CPU time of having to hash the values before doing the lookup.

Now there are a couple of gotchas with this technique.  If you are check summing domain names, some characters don’t check sum correctly.  Also check summing a Unicode version of a string will give you a different result than the non-unicode version of the same string.

You can see that with these three SELECT statements.

SELECT CHECKSUM(’google.com’), CHECKSUM(’g-oogle.com’)
SELECT CHECKSUM(’google.com’), CHECKSUM(N’google.com’)
SELECT CHECKSUM(N’google.com’), CHECKSUM(N’g-oogle.com’)

As you can see the first one you get two different values as you would expect ( 1560309903 and 1560342303 respectively).  With the second query you get two very different values between the Unicode and character strings (1560309903 and -1136321484 respectively).  Based on the first query you would expect to get two different values for the third query, but you don’t.  With the Unicode strings the - appears to not count as part of the CHECKSUM giving you the same CHECKSUM value for both strings (-1136321484).

Another version of this technique which Kevin Kline talked about recently uses the HASHBYTES function of SQL Server 2005 to get the hash of a column and use that.  In his blog he’s talking about using it for partitioning a table, but that same technique can be used here as well.

CREATE TABLE Employee
(EmployeeID INT,
FirstName VARCHAR(50),
LastName VARCHAR(50),
EmailAddress VARCHAR(255),
EmailAddressCheckSum AS HASHBYTES(’SHA1′, EmailAddress)

This will however give you a longer string, therefor taking up more space within the index. However if working with long Unicode strings this may be a better option for you to use.

Denny

Jul 14 2008   11:00AM GMT

Using batching to do large operations without filling the transaction log



Posted by: mrdenny
SQL, DELETE statement, Query tuning, Tables

Deleting large amounts of data from a table is usually an important task, but if you don’t have a maintenance window to work within then you can end up causing locking and blocking against the other processes which are accessing the database.  Not to mention you will cause a very large amount of data into the transaction log no matter what your transaction logging level is set to.

Say you have a table with a date column and you need to delete a million plus records.  Doing this in a single transaction will put all million transactions into your transaction log, plus cause any other processes which are trying to access the table to be blocked.

 However if we batch the transaction into many smaller transactions our transaction log will not fill up as we can backup the log using our normal log backup methods throughout the process, or if we use SIMPLE recovery on our database then transactions will be removed from the log automatically.

In SQL 2000 and below you have to set the ROWCOUNT session variable to a number above 0, which would cause SQL to delete the first n records that it comes across.  In SQL 2005 we can use the TOP parameter as part of our DELETE command having the same effect, but without having to reset the session variable.

In SQL 2000 or below you can use a syntax like this one.

DECLARE @c BIT, @d DATETIME
SET @d = dateadd(yy, -1, getdate())
SET @c = 0
SET ROWCOUNT = 1000
WHILE @c = 0
BEGIN
DELETE FROM Table
WHERE CreateDate

If you are using SQL 2005 you can use this very similar syntax.

DECLARE @c BIT, @d datetime
SET @d = dateadd(yy, -1, getdate())
SET @c = 0
WHILE @c = 0
BEGIN
DELETE TOP (1000) FROM Table
WHERE CreateDate

Both pieces of code are very similar. Declare a variable which tells the loop when to exit. Then start deleting the data. If no records are deleted, then set the variable to 1 causing us to exit the loop. Now this will usually take a little bit longer to complete than a single delete statement, but the system will continue to be responsive during the process.

The number of records which you are deleting should be adjusted based on the width of your records, the load on the database at the time of deletion, and the speed of your hard drives which hold the data files and transaction logs. I usually start at 1000 records and see how the system responds. For tables which a just a few numbers I’ll put it up as high as 50k or 100k records. For very wide tables I’ll drop it down to 100 or 500 records if the system can’t handle 1000 records.

Denny

Update:
Sorry this post didn’t look very good at first. The blog site seams to have eaten it, and I didn’t notice until just now.


Apr 10 2008   10:00AM GMT

Back To Basics: Tables, without them we have nothing to do



Posted by: mrdenny
SQL, Back To Basics, Tables

Tables are the core of any database platform.  Without tables we would be able to process data, but we would have no way to store the data.  In the simplest terms tables look like Excel sheets.  They both have columns and rows.  When you view a table in the client tools it looks much like an Excel sheet does.  While the basic concept is the same tables are very, very different that Excel sheets.  SQL Server is optomized to store large quanties and data and search through that data as quickly as possible.

Unlike Excel, tables have indexes.  Indexes are copies of the column which comprise the index.  The reason that we use indexes is to speed up the processing of the query.  Whle the index is a copy of the column, the copy within the index in sorted in order while the table is not.  Because the data is sorted SQL Server can search through the data much easier and there for faster.

To make this easier to understand think of an Excel sheet with 1000 rows of data in it.  Each row has a single value in column A.  Each value is a random number between 1 and 1000.  Now try to find all the rows with the value of 25.  You need to search down the column looking for the data.  This is how your table work.  When you tell SQL Server to search the table for the value of 25 using this statement SQL has to look at every record in the table.

SELECT *
FROM Table
WHERE Column1 = 25

Now when you create an index on this column think of sorting the Excel sheet in order.  Now find the records with the value of 25.  You can simply scan down the column and find the records and not look any further.  This is the same thing that SQL Server does when you us the same command but with the index.  SQL Server uses the index automatically, so no changes to code are required when you create the index.

Denny