Posted by: Denny Cherry
Microsoft Windows, Performance Problems, SQL Server, Storage, TSQL2sDay
So today is #TSQL2sDay, and this months topic is Disk IO. Storage is something I love working in, so I figured why not, I’ll post something.
Hopefully everyone knows that your storage solution is the most critical when it comes to keeping your database up and running day after day. If your databases are two slow, then your database will be slow and there just isn’t anything you can do about that besides adding more disks to the storage system.
However figuring out that the problem is actually slow disks can be a problem in itself, especially if you work in a large company and don’t have access to the servers themselves or the storage array that is hosting the databases.
Within SQL Server you have a couple of places you can look. The easiest is to look at the sys.dm_io_pending_io_requests DMV. If you have a lot of rows being returned that say that the IO is pending then you may have a problem as SQL may be trying to push more IO than your storage solution can handle.
Another DMV you can look at is the sys.dm_io_virtual_file_stats DMV. This will tell you how many IO requests have been processed since the instance was started, as well as how many of those requests were stalled when writing them. Using these numbers requires doing some math to see how your doing over the current runtime of the instance. As the instance has been up for longer and longer these numbers can get harder and harder to make sense of.
Within Windows you’ll be using our good friend Performance Monitor to see what’s going on. There’s a few counters which are really critical to looking at. They include the Reads and Writes per second, the seconds per read and write, and the queuing counters.
The reads and writes per second will tell you how many requests are going between the server and the disks per second. If these numbers are very high and stay there they you are pushing the disks very hard. If this is the case, stop here, and make sure that you don’t have any indexes that need to be built, and that all your statistics are up to date.
The seconds per read and write are very critical numbers. This tells us how fast the disks are processing each read and write request. These numbers should be very very low, somewhere in the .00n range with the smaller the number the better. If you are seeing numbers which are higher than .010 then you may be pushing your disks to hard. Anything over 1 second and your SQL Server is probably dying for more disks.
The disk queuing numbers are also very important. These will tell you how many commands are backing up while the disks are processing the other commands which were given to them. The general number is that the queue shouldn’t ever reach n*2 the number of disks that you have which are actively serving the data. So if you have a 10 disk RAID 10 array, the queue should go no higher than 10 as there are only 5 disks serving the data, but those same 10 disks in a RAID 5 array are OK with a disk queue of about 18 as there are 9 disks actively serving the data.
Now this doesn’t mean that you should always have a disk queue. Disks work best when data is sent to or read from them is bursts, not in a constant massive stream. This means that you want to aim for an average queue length of 0, with occasional spikes up.
On the array
If you are working with your basic local disk array, then there isn’t much you’ll be able to look at past the server unless your RAID card has metrics which it can expose to you through the diagnostic tools which come with it.
However if you are working in with a SAN solution your SAN will have some diagnostics available to your SAN admin. These diagnostic numbers will give you the full story as you’ll be able to see what Windows sees from the Server, as well as what the array is seeing.
When looking at the array itself you can now see not only what the performance on the LUN which is presented to Windows, but what the performance of each specific disk which is under the LUN is doing. This will allow you to for example see if a specific spindle under the LUN is causing the slow down, perhaps because it is failing.
Getting the full picture is very important when it comes to looking at storage performance issues. This means looking at the performance numbers from all sides so that you can get a full understanding on exactly where the performance problem may be coming from.
I forgot to put in the link to Mike’s post about T-SQL Tuesday, so here it is.