SQL Server with Mr. Denny:

DataManagement

May 12 2008   9:00AM GMT

Data Normalization, with storage being so cheap what’s the point?



Posted by: mrdenny
Data Normalization, Data integrity, SQL

Back in the old days one of the big reasons that people enforced such strict rules on data normalization was because it would greatly increase the amount of information you could fit on the hard drive.  Back when everything was measured in kilobytes or megabytes storing redundant information simply wasn’t an option.  You had to be able to cram as much information into that little database.

These days space isn’t the issue.  Storage is very cheep these days, and if you have the money you can get absolutely massive amounts of storage.  1 TB hard drives are easily found, and when you start talking about RAIDing drives together the storage limits start to become just insane.  100 TB of storage just isn’t what it used to be.

The problem now becomes that with so much data for the database platform to search through we need to make sure that the indexes which SQL Server is searching as as small as possible so that we can get those indexes read from the disk and into memory as quickly as possible.

Say we have an Employee table which looks like this.

CREATE TABLE Employee
(EmployeeId INT,
FirstName NVARCHAR(50),
LastName NVARCHAR(50),
Department NVARCHAR(100),
Manager_First_Name NVARCHAR(50),
Manager_Last_Name NVARCHAR(50),
Location NVARCHAR(50),
Active VARCHAR(5))

Now this table is clearly not setup as a normalized table.  We are assuming that this table has best case indexing.  If we have 1000 employees records and we are going to search the Department column the index must be read from disk.  Now assuming that the department names are nice long bureaucratic names which average 30 characters each we have about 30000 bytes (more if you take the fill factor into account) which needs to be read from the disk.  Now if we had a relational version of the table.

CREATE TABLE Employee
(EmployeeId INT,
FirstName NVARCHAR(50),
LastName NVARCHAR(50),
DepartmentID INT,
ManagerEmployeeID INT,
LocationId INT,
Active BIT)

When we search the Employee table we now are only loading an integer value from the disk into memory. This means that we only have to load 8000 bytes of data from the disk into memory. Now knowing that SQL Server reads data from the disk in 8k blocks even with a 50% fill factor the index only takes 2 blocks on the disk and can then be completed in two read operations to the disk. Now our original table with no fill factor will require 30 data pages (20.297 to be specific), about 60 if we use the same fill factor.

With tables of this size this isn’t much. But as your database grows the number of additional data pages which would have to be read for a simple search of the table grows exponentially.

Denny

Apr 17 2008   8:00AM GMT

SQL 2008 one click database encryption gives a false sense of security



Posted by: mrdenny
Compliance, DataManagement, SQL Server 2008, Encryption

While I think that the one click database encryption that Microsoft has built into SQL Server 2008 is a good idea, but I’m not sure how useful it actually is.  It’s touted as giving you data encryption of the entire database without any code change.

What this actually means is that if someone stops the SQL Server and copies the data files to another server and attaches the files to that SQL Server the files won’t be of any use.  This however isn’t the common way that data is stolen from a SQL Server.  A much more common method is someone logs into the database using a perfectly legit account and exports the table or tables which contain the useful information to Excel, Access, csv, etc.  This method of database encryption does nothing to protect against this as if you query the table directly you are displayed the data in clear text.

All this database encryption does (as it appears to me) is that it protects the data at rest (ie. when the SQL Server service is stopped) and does nothing else to protect the data.  If all you are looking for is a compliance check off saying that you data is encrypted for SOX, HIPPA, the California Online Privacy Protection Act of 2003, etc then this will probably do the trick.  If you want actual data encryption where the data is not viewable when you query the table unless you know the key to unlock the data this isn’t it.

If you do want that quick and easy encryption for a compliance check off be sure to check your data storage requirements and IO requirements.  Encrypted data is typically larger than unencrypted data and you may need to increase your allocated storage or IO requirements in order to have the encryption completed.  Also remember that the higher level of encryption you use the more CPU time is required to decrypt the data.

Feel free to share your opinions below.  I’m interested to here what others have to say on the topic.

Denny


Mar 27 2008   6:54PM GMT

New Article: SQL Server tempdb best practices increase performance



Posted by: mrdenny
Article, System Configuration, DataManagement

I’ve recently published a new tip on SearchSQLServer.com called “SQL Server tempdb best practices increase performance“.

Denny


Jan 11 2008   6:40PM GMT

New Article: Determining SQL Server database storage requirements



Posted by: mrdenny
Article, Config, DataManagement, SQL, SQL Server 2005, SQL Server 2008

I’ve published a new tip over on SearchSQLServer.com entitled “Determining SQL Server database storage requirements“.

Denny


Jan 3 2008   8:00AM GMT

Using a foreach loop to process an XML document.



Posted by: mrdenny
SQL, DataManagement, XML, SQL Server 2005, SQL Server 2008, SSIS

Receintly I was working on a project where I needed to use a foreach loop with an SSIS project, but couldn’t for the life of me get it to properly process the XML document which I was giving it.  Well with some major work and digging I was able to get it working correctly, but it took me for ever to get all the little setting correct so I figured that I’d throw the info up here for anyone else who is looking for it.

Some background on what the process is that I’m working on.  Basically I’ve got a table with catagorized data in it.  I need to export all the data from the table info one file per catagory (don’t ask, I didn’t design it, I’ve just got to automate it; and it actually makes sence in the grant schem of things).  Well I figured that the easiest way to do this was to use a foreach loop and give it an XML document with the list of catagories to process.  (This was better than looping through and getting the next value from the database.)

 So needless to say, I get started on my little process.

The query which I’m using within an Execute SQL Task is below.  The Execute SQL Task puts the XML data into an SSIS variable called v_CategoryList with a data type of string.

SELECT WebsiteBlockCategoryId as id
FROM dbo.WebsiteBlockCategory cat with (nolock)
where WebsiteBlockCategoryId  <> 0
for XML AUTO

The XML document looks like this. (SSIS seams to be wrapping it within <ROOT></ROOT> tags for me which is why I’m not doing it my self.)  My XML document is actually must longer than this, but you get the idea.  It’s a very basic XML document.

<ROOT>
 <cat id=”18″ />
 <cat id=”19″ />
 <cat id=”20″ />
 <cat id=”21″ />
 <cat id=”22″ />
 <cat id=”23″ />
 <cat id=”24″ />
 <cat id=”25″ />
 <cat id=”26″ />
 <cat id=”27″ />
 <cat id=”28″ />
</ROOT>

As you can see on the screenshot of the forloop properties I’ve set the source to be my variable and the EnumerationType to Element Collection.  Since I know where the data is within the XML document I use DirectInput for both the outer and Inner XPath strings.  For the Outer XPath string I’m using “//cat”.  Because I’m not putting in the ROOT level name it doesn’t matter what gets put in there as long as there is a parent level.    For the Inner XPath string I’ve got the Element name with an @ sign in front of it “@id”. 

Collection Screen

Over on the Variables page of the UI I’ve got my v_CategoryId variable mapped to Index 0 of my document.  If you need to output more than one element from your XML document set your Inner XPath to “*”.  This “should” allow you to bring back all the elements and refer to them by index number starting with 0.  I havn’t actually tried this, as I’ve always only needed a single element hense the “should”.

Variable Mappings Screen

And to think, that little thing took me a couple of days to get working right.  I can only imagine trying to do this in SQL 2000 and DTS.

Denny


Dec 13 2007   8:00AM GMT

So what’s the difference between null and a blank field



Posted by: mrdenny
SQL, Data Types, T/SQL, Data integrity, NULL

There is always a discussion going on somewhere over having NULL values in a database.  Some people (such as myself) are fine with NULL values.  Some people don’t like having NULL values in there database for some reason.  I’ve heard all sorts of reasons; “The screw up performance”, “They are hard to deal with”, “I like using a default value better”, “Default values are easier to work with”.

The performance one is just bunk.  NULL values are indexed just like any other value.

The rest of the reasons to not use NULL values ignore the fact that a NULL value is a perfectly legit value for some fields.  Not all fields should allow a NULL value to be placed in them.  If you are building a shopping cart and have a line items table, the Quantity and Price fields probably shouldn’t allow nulls as you should know the price and quantity when creating the invoice.  However in the same shopping card application the users phone number may need to allow NULL values.

Something which is key to remember about NULL values is that NULL doesn’t mean nothing.  It means unknown.  There may be a value, there may not, we simply do not know what that value is.  Putting in some default value will then put a value on the field when there shouldn’t be.  A great example which I read recently involved a hypothetical database at a blood bank.  A person comes in needing blood.  They are entered into the system but the recipients blood type isn’t known.  If you put in a default value of a blank value when you go to query the supply table, if there are pints of blood which have not been typed they will also have a blank value (as that is your standard default value).  You’ll get a match when you run the query and possibly be giving the recipient blood which is of the wrong type.  Now if you were using NULL values in the database the query would return no matching values as NULL <> NULL (under the default settings of Microsoft SQL Server).  Granted we should be able to trust that the person give the recipient the blood would check the typing, but would you rather trust a person or the computer.  People are overworked and usually underpaid, and nurses and doctors make simple mistakes just like the rest of us.  Personally I’d rather know that the system wouldn’t find a match rather than have to worry that I’d be getting the wrong type of blood.

If you really want to display that default value when the data is returned that’s what the ISNULL system function is for.  This way you don’t have to write long case statements around each field which allows NULL.

NULL values have there place, just like data does.  NULL values don’t mean that we don’t care about the data, it means that we simply don’t know the value of the data as of yet.  It shows that we care about the system and the people using it.  How annoying is it to look at a form which is full of default values and having to read through all of them instead of simply looking at the ones which are blank.


Nov 15 2007   7:23PM GMT

New tip published: Update SQL Server table statistics for performance kick



Posted by: mrdenny
Article, SQL, Statistics, T/SQL

I’ve published a new tip over at SearchSQLServer.com entitled Update SQL Server table statistics for performance kick.

 Denny


Oct 25 2007   6:03PM GMT

New tip published about table statistics



Posted by: mrdenny
Statistics, SQL, T/SQL, Article

My next tip on SQL Server Statistics has been published on SearchSQLServer.com entitled Update SQL Server table statistics for performance kick.


Oct 24 2007   5:09AM GMT

Upgrade for sp_who2 called sp_who3



Posted by: mrdenny
T/SQL, DataManagement, SQL

I’ve published this before over on tek-tips.com, but I figured that I’d republish it here as well.  I’ve written an update for sp_who2 which I call sp_who3.  It can be most useful when trying to diagnose slow running queries as it can provide a wealth of information in a single screen.

 Usage:
exec sp_who3
exec sp_who3 active
exec sp_who3 blocked
exec sp_who3 72 /*Any active spid*/

Download: SQL 7 / SQL 2000 / SQL 2005

When using no parameter the output will match the output of sp_who2.

When using the “active” keyword the output will match the output of sp_who2 active.

When using the “blocked” keyword the output will have the same columns as sp_who3 active but show only the blocking and blocked processes.

This procedure has been used at companies like IGN / GameSpy, MySpace, Awareness Technologies, and Triad Financial to name a few.


Oct 10 2007   1:58AM GMT

Getting the right disk config for your SQL Server



Posted by: mrdenny
SQL, DataManagement, Storage

Many DBAs so not have the knowledge needed in the storage area to be able to ask the right questions to there storage engineers, and often storage engineers do not know the right questions to ask a DBA to get the answers that they need.  Because of this database storage ends up being configured correctly for a file server, but not for a database.  This can lead to slow response time within the database, which can lead to blocking and eventually timeouts on the front end.

Correctly configured and sized storage is essential to running a database at peek performance.  Unfortunately when configuring storage even the smallest misconfiguration can be detrimental to the entire database and it’s front end application.

As some initial reading please take a look at this tip I wrote for SearchSQLServer.com called Optimize disk configuration in SQL Server.  There is also an excellent tip by Hilary Cotter called SAN considerations for your SQL Server environment.

Today I’ll be focusing on databases which use SAN storage, but some of it will still apply to databases using local storage.

When configuring your storage be sure to look beyond just the storage capacity of the drives.  You also need to consider the Input/Output operations (IOs) that each drive and array can support.  If you have enough space, but not enough IOs your database will not work at the speed you need.  With drive sizes growing as quickly as they are, it is very easy to get into trouble and not have enough IO capacity.  If you have a database which needs 500 Gigs of Storage, you could fit that database onto a single 750 Gig drive.  Would that hard drive have enough speed to handle the IO requirements of your application, probably not.

Unfortunately there is no good tool available to help you figure out how many IOs you will need in order to support your database.  I recommend planning high, but not crazy high.  If you are expecting 10000 transactions per hour you do not need to be able to support 10 million transactions per hour, however the number of transactions and IOs which are needed will change over time as the business grows.  Because of this it is recommended to routinely evaluate the IO requirements for the database and ensure that you can meet them.  If not, it may be time for an upgrade.  Before you go requesting budget for an upgrade do not forget to tune your indexes.  As tables grow index and statistic usage can change and may need to be reviewed before making changes to the storage.

While there is a correlation between transactions and IOs, there is no direct formula from one to the other.  Some transactions may work completely from the buffer cache and cause no transactions, while some may cause tens of thousands of reads while exporting data for loading into your data warehouse.

When working in a SAN environment be sure to take into account not just the requirements of your database, but also of the other LUNs which will be sharing your RAID group.  A classic example is where there is a RAID 0+1 RAID group made up of 4 disks, each with the potential of handling 120 IOs per second for a total of 240 IOs per second.  There are three databases which need storage on this RAID Group.  Each has a requirement of 100 IOs per second.  But because they are created at different times the IO requirements aren’t being tracked very well.  After the second LUN is created everything should still continue to work fine.  When the SAN administrator gets in his request for a third LUN to be created he looks at his RAID group and sees that there is plenty of space on the RAID Group.  With no IO information from the DBA he doesn’t know that this LUN will overload the RAID Group and cause a slowdown of all three LUNs.  Because the DBA doesn’t know that the three databases are all on the same RAID group he doesn’t know that there will be a problem either.  But as soon as the third database goes into production and starts getting a load all three applications start to feel the slowdown.

 Because the proper questions were not asked by either the DBA or the SAN administrator we now have applications which are not behaving correctly.

One question that I am often asked is “How many IOs can a disk actually handle?”.  Unfortunately there is no good answer to that question.  Each drive vendor will be slightly different.  Something which will also effect the answer to this question is what is called the IO profile of the drive.  Sequential IO operations (reading Sector 1,  then Sector 2, then Sector 3, etc) is much faster than random IO operations (reading Sector 158, then Sector 21, then Sector 851, etc).  Unfortunately most database traffic is random by nature.  OLTP applications involve systems accessing small bits of the database all over the physical file.  SQL will try to correct for this by caching as much data as it can into memory so that it doesn’t have to go to the disk very often, but most disk access will still be very random in nature.  Operations such as OLAP data loads and OLTP table scans are typically sequential, providing that your tables are laid out correctly in separate file groups.

There are other things which can effect the randomness of your disk access such as table scans, low buffer cache hit ratios, data file location, page file location, other applications using SQL Servers disks, etc.