Nice article in SQLServerCentral, but I’m going to have to disagree with you on the point of using RAID 5 for the data files. First, though, the excepted clause – I’m not one of those that is vehemently opposed to RAID 5, and I’m not saying not RAID 5 for all data files, but based on your article and in particular the section that discusses I/O.
Ok, what do I not agree on. Well, firstly, IMHE, in all but the cases where there is less than 10% writing going to disk, calculations based on known (or estimated) I/O requirements demonstrate that RAID 5 requires more spindles than RAID 10. It is a myth that RAID 10 requires more spindles when you base the calculations on I/O requirements and the read/write ratio is greater than 90:10 (r:w) in favour of writes. I could go into the calculations, but these calculations are well known and I ask that you test it yourself. I’d be interested to know if you find the results different to what I am stating here.
Secondly, you talk about how expensive disks are. Ok, I’m detracting slightly from your main point here on improving performance, but I think its important. Well, this will depend on the worth of the data to the business. Also, this will depend on the cost to the business if you should lose one of the RAID 5 spindles, both in the degredation to performance and risk to the business if another spindle should fail and you lose the whole array. Think it won’t happen? I’ve seen this happen twice in my time as a DBA (turned out disks had a common fault in the manufactured batch). In fairness, it will also depend on whether you’re using a SAN compared to DAS, as SAN spindles are too expensive – you listening EMC?). Whenever I have talked to the guys who approve and pay for these thing, and we have gone through the advantages/disadvantages/cost, rarely have I had anyone stay on RAID 5 – in fact, without exception I think everyone has accepted RAID 10 – at the end of the day, the RAID format is a form of insurance and the level of cover is based on the one you choose.
Again, though, this may not fit everyone, but for many DBAs it does.
Just my thoughts and experiences.
You have brought up some good points. When I wrote my article I was working under the assumption of SAN Storage. One thing to keep in mind with SAN storage is that if you are writing directly to the disk then you have bigger problems then selecting the correct RAID level. (This applies to DAS storage as well, but with DAS as there is much less cache writing directly to the disks is more likly.)
With SAN Storage all writes are first staged to the cache on the SAN controller. At which point the SQL Server is told that the write has been completed. After that SQL continues on while the SAN destages the data from the cache to the disk after the fact. While RAID 10 will allow the SAN to destage the data faster than a RAID 5 array it won’t effect the write performance of the SQL Server.
Howdy. Question about your statements on the 64k offset. When we set up our SAN, I looked online and the consensus was that a 32k cluster size (block size?) on the disks was fine in the vast majority of cases. So, our SAN is set up that way. Is this not the case? And, if we’re on a 32k disk cluster size, does the offset matter? We’ll have to do two IO ops for a cluster anyhow, so I wouldn’t think that the offset would matter.
If your storage is setup with a 32k cluster then the offset probably isn’t going to hurt you any. However if you can change it up to 64k you’ll get better performance from your SQL Server.
Sorry I haven’t got back sooner…
I was also speaking of SANs and not just DAS. A common misconception is that the large caches on the SANs (4GB+ on the ones we we’re using) would solve all the disk writing performance for ‘high performance’ systems. This is not necessarily the case. LUNs do not have access to the whole 4GB, a segment of cache is allocated to a LUN. On top of this these segments have to be split between read and write (usually set somewhere around 70% read and 30% writes, though we adjust to our estimates of what we expect), thus making the cache available for writes even smaller. And in a high performance system, the cache can be quickly saturated when large amounts of ‘write’ data are being pushed through the databases.
I was fortunate to work on a system that was streaming multi-GBs (100+) into the the database 24 hrs per day. We were also having to delete approximately the same amount each day too – and without the luxury of table partitioning, which would have made our life a lot easier. But the best part was that there was the same databases with the same amount of data, running on the same hardware and software, but…and this was the key bit …a different SAN subsystem layout. Once we had ramped up to the volumes mentioned above, our partners in the US, who had utilised RAID 5 and large Meta-LUN layouts, had horrendous performance problems with the bottleneck showing as writes, whereas, we we able to cope with the load. They have now changed to a layout similar to ours and the performance problems have all but disappeared.
Again, just to re-emphasise, I’m talking of systems requiring ‘high’ performance in an enterprise environment, with GBs and even TBs of data being pumped into and/or changed in the database(s). And again, the calculations showed that for I/O thoughput RAID 5 required more disks that for RAID 10.
Not a problem. If your buffer cache hit ratio is high, and your SAN supports it, try reducing the amount of cache you are dedicating to reads. If you decrease this, and increase the amount of cache for writes you will do more if not all writes to cache allowing the SAN to destage the writes in a more logical manor.
When I was at EMC World they were telling us about this technique which makes sense if you think about it. If the database finds what it needs in memory 99% of the time, that 1% of the time that it’s going to disk, that data is probably not going to be in cache as its probably been a while since the database read from disk. Not to mention the last time the database read from disk, it was probably a different sector so it wouldn’t be in cache anyway.
By increasing the write cache and reducing the read cache to almost nothing you are giving the SAN a much larger buffer to write to before if needs to start flushing to disk. It can then make more intelligent decisions as to which clusters to write to disk from cache first so that the heads have to move the least amount when doing all this writing.
The potential down side is if you get into a force flush situation where the write cache fills up it will take longer to get to the point where the SAN has gotten the cache below the high watermark so that it will start accepting writes again.
Based on what you have said about your database, I wouldn’t put it on RAID 5 either. With that much load, RAID 10 will be the better option, but I would still want to look at the cache settings to see about getting even more performance out of it.
There is no end all be all solution for storage. Each config and each system is different, and each database running on that storage works differently. Most OLTP database are mostly read, with a very small amount of writing going on. Your database seams to be very much the other way around. I would agree with you 100% that once you get past the 9:1 (r:w) then RAID 10 is going to be the better option. However most OLTP databases are way less then 9:1, and unfortunately the articles can only be so long so not every possible situation can be covered. When writing them I have to go with what the bulk of people will be dealing with the bulk of the time. Those of us that are dealing with systems where the ratio is off (my main OLTP system is about 7:3 (w:r)) have different situations than are described in the bulk of the articles out there.
In order to position a more fare position out there, I can pitch another article on how to configure storage when dealing with a system which is more writes than reads. I think that you’ll find I present a completely different set of conclusions in an article based on that sort of configuration.
Hmmmmm…seems that I’m not explaining myself very well, for which I apologise.
As I mentioned in my last comment post, we did adjust the cache read/writes to meet our needs. If I remember correctly, following testing we found that the best balance for our system at that time was 80% writes to 20% reads – we review every 2 – 3 months (which is easy to do) just to ensure things haven’t changed massively.
The setting of 70% reads to 30% writes is just a general default setting by the manufacturers based on a general out-of-the-box setup for all users, to be modified as required. In fact, EMC in their Best Practices suggest that LUNs with very random read environments and those with sequential streams greater than 12 (as often the case in DSS) can benefit from having Read Caching switched off. What need to be also borne in mind is that SAN caching algorithm is a general based one of al sort of evirnoments, such as File management, Application File management, and Database File management. What it chooses to hold/select in cache may not be what SQL Server’s caching algorithm chooses.
I’m a bit lost on your explanation of the ratio for read/writes whan you talk of making a selection for RAID 5 or RAID 10? You say that you agree that if you get past the 9:1 (r:w) then RAID 10 is going to be better. But, for ‘the bulk of people dealing with the bulk of the time’, and you quote your system as about 7:3 (w:r), you seem to suggest that RAID 10 would not be the right option. I’m not sure if I’m reading this correctly, but just to confirm my
understanding of how this works – if writes are [B]greater than[/B] 10% of the activity (ie, 8:2 (r:w), 7:3 (r:w) …3:7 (r:w), etc), then based on I/O throughput calculations more disks are required for a RAID 5 setup than for RAID 10. The EMC guys even demonstrate this in the document mentioned below on pages 36 – 39.
Please don’t take this as an argument to your article. I think it is a very useful document and you have broached an area often neglected. It’s just that I raise these points because your article was about squeezing out performance gains in the milliseconds and procs running 2,000 times per min. Even here, the EMC engineers agree that RAID 10 is the option over RAID 5 where lots of small random writes, such as may be seen in an OLTP environment (http://bbs.doit.com.cn/attachment.php?aid=6757 – ‘EMC CLARiiON Best Practices for Fibre Channel Storage: CLARiiON Release 22 Firmware Update – Best Practices Planning’, page 22).
I hope that I’ve explained my concerns with regard to your article and our different comments a bit better.
I came across this and had a somewhat related question. I had database backup in SQL 2005 going to SATA drives that we found out was causing numerous forced flushes on our CX700. This affected a few other SQL servers, namely a biztalk message box server and a reindexing process on another server. Both are very write heavy processes. What happened was that the log files grew to fill up the whole drive. Both databases were set to simple recovery and had been fine for months up until we rescheduled this backup job. Upon further research, I saw that in the sysprocesses table there was a process or two that had an open transaction. Once they were killed, the log shrank immediately down to zero. Have you ever seen this before?
I eventually resolved this by migrating the LUNs to FC drives, but the fact that the log kept growing troubles me.
The full transaction log and the forced flushes are probably going to be two different problems.
The forced flushes will be caused by the SATA Drives not being able to accept the writes of the backup process quickly enough. Once the cache on the SAN fills the SAN performes a force flush to empty the cache which causes all operations to write directly to disk until the force flush has been complete. It sounds like you took the correct action to correct this problem already which was to move the backup LUN to a fibre channel raid group rather than a SATA raid group.
The open transaction would be what is causing the log file to fill. This could have been a side effect of the backup job, as while the backup is running SQL Server won’t checkpoint changes from the log into the database which does cause the log to fill. If the database backup was taking a lot longer on the SATA disks I can see this causing the log file at fill as more transactions are now being queued into the log and not flushed to disk.
Does the log continue to grow even after the backup has been moved from SATA to FC drives?
Thanks for the response. Just to be clear, the backup job happens on a completely different SQL server than where the log file problems occur. There’s no problems with the log file where the backup is occurring. On another server, called VLDB, there’s a reindex process that occurs as the same time as the backup. That’s been running for 3 years without a problem. The log file never grew above 30 GB. On this day, however, it grew to 100 GB and filled up the drive. This only happened once, so I don’t have a lot of extra troubleshooting information on it, but believe it’s related to the backup job on the first server since they were running at the same time. On a third server, the biztalk messagebox, I had the same problem of the log file growing and found that the time of the last_batch of the session with the open transaction was always during the time of the backup on the first server. I changed the time of the backup on the first server and the time of the last_batch of the session with the open transaction always followed it. That led me to believe that the issue was on the SAN and sure enough, there were forced flushes during that time. The issue is resolved now with the faster drives, but the powers that be want to find out how it caused the log to grow out of control, if that is indeed the cause.
OK, I see where I misunderstood before.
I suppose that it is possible that the force flush on the SAN, caused the checkpoint operation to timeout which would have prevented the SQL Server from flushing the transactions from the log causing the log to fill. Especially if the disks that the database volumes were on were busy. The database that is being backed up, is the database hosted on the same physical disks as the machines that had the log problems?
Without being able to reproduce the problem I’d just be guessing.
Once you get into a force flush state all the normal rules go out the window as the system is simply trying to keep up at that point.
That’s a good question. The disks are actually on separate RAID groups, though for sure, the disks were busy. The thing is, after the backup job finishes, the log file still continued to grow on those databases.
Thanks for your assistance! I learned a lot reading from this thread.
If the log growth was caused by the forced flush I can see the logs continuing to grow after the backup has completed until the SAN is able to get out of the force flush state. Remember that the EMC will stay in a forced flush state until the cache has been cleared which is one of the reasons that a forced flush can be so dangerous.