SQL Server has three different storage terms which you need to know when dealing with physical file storage: files, file groups and disks. While each of these is a distinct concept, they work together to give the SQL Server database engine a very flexible way to define where the data is actually stored.
Disks are the physical storage within the server. If you look at the front of the server they are the little devices inserted into the front of the chassis, typically with lights on them that flash. In larger environments where there are lots of servers you may have a storage array (also called a SAN). In its simplest terms a storage array is a bunch of disks in one large box. The space on these disks is combined in various ways, and parts of that space are allocated to specific servers. While the storage array may have anywhere from a couple of terabytes to thousands of terabytes, each server will only have access to a small percentage of that space. Either way, whether you're using local disks in the front of the server or a large storage array, the concept is the same as far as SQL Server is concerned: the storage connected to the server is presented as a drive letter in Windows.
If you were to log onto the console of the server you would see a computer which would look very familiar to you. This is because the computer running SQL Server in the server room or data center is running Windows, just like the computer on your desk and the computer at your home (assuming that you aren't using Linux or a Mac). If you opened My Computer on a server you would see hard drives just like you do on your home computer. The server probably has more hard drives than your workstation or home computer, and those drives probably have a lot more space, but the idea is still very much the same. Any files that are needed are written to the disks and then accessed by users or by programs, with Microsoft SQL Server simply being a large, complex program.
The files which the SQL Server database uses are called MDF, NDF and LDF files (for the most part). These are physical files which sit on the disks (see above). A database will have at least two files: an MDF file which holds all of the data, and an LDF file which holds the transaction log. In its simplest form the transaction log is a complete record of all the changes which have happened to the database since the database was first created. We manage the size of the transaction log through backups (which I'll talk about in a later post), so for now we'll just say that the transaction log holds the record of everything that has changed within the MDF data file. If a row is added, that change is written to the transaction log first. If a row is deleted, that change is written to the transaction log before the row is actually deleted from the database.
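You can see the files that make up a database for yourself by querying the sys.database_files catalog view:

```sql
-- List the physical files for the current database.
-- type_desc shows ROWS for MDF/NDF files and LOG for LDF files.
SELECT name, type_desc, physical_name, size * 8 / 1024 AS size_mb
FROM sys.database_files;
```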
The NDF database files are simply additional data files. While the database must have one MDF data file and one LDF log file, the use of NDF data files is optional. For the most part there is no performance benefit to using NDF data files instead of just an MDF data file (this is only true 99.9% of the time, but that's good enough for this post). The only time you'll get a performance benefit from using NDF data files is when the MDF data file is on one physical hard drive and the NDF is on a different physical hard drive.
Each specific file exists on one and only one hard drive. If you need to store the data on multiple hard drives then you’ll need multiple data files.
File groups are simply logical groupings of files. We use file groups so that specific tables can be larger than any one hard drive. If we didn't have file groups, then when we created a table and told SQL Server where to store it we would have to name a single physical data file, which would mean tables were limited by the size of the largest available disk. Because we want to be able to spread our data across multiple physical files we have file groups. A file group allows multiple files to be put into a single group; SQL Server then creates the table within that file group, putting portions of the data on all the files within the group. This allows us to have very large tables, as we are only limited by the number of disks we can attach to the server and the size of those disks.
When we create a new table (or index, or service broker queue, or any other object which is a physical object which holds data) we specify which file group we want to create the object within. This allows us to leverage any files which are members of the file group. If there are multiple files the object will be spread over all the physical files. If there is just one file today when we create the table but then we add another file to the file group later there’s nothing that we need to do. The SQL Server will automatically begin using the new file as well as the old file. When you add a new file to a file group SQL Server doesn’t rebalance the data across the database files, you have to do this manually by rebuilding all the indexes, but new rows would be written to the new file.
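As a quick sketch (the database name, file paths and sizes here are made up for illustration), adding a file group with two files on different drives and creating a table on it looks like this:

```sql
-- Add a new file group to an existing database (names/paths are examples).
ALTER DATABASE MyDatabase ADD FILEGROUP Archive;

-- Add two files to the file group, each on a different physical drive.
ALTER DATABASE MyDatabase ADD FILE
    (NAME = Archive1, FILENAME = 'E:\Data\Archive1.ndf', SIZE = 10GB)
TO FILEGROUP Archive;

ALTER DATABASE MyDatabase ADD FILE
    (NAME = Archive2, FILENAME = 'F:\Data\Archive2.ndf', SIZE = 10GB)
TO FILEGROUP Archive;

-- Create a table on the file group; SQL Server spreads the data
-- across both files using proportional fill.
CREATE TABLE dbo.Orders
(
    OrderId INT IDENTITY PRIMARY KEY,
    OrderDate DATETIME NOT NULL
) ON [Archive];
```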
When SQL Server is writing data to the files within a file group it uses something called proportional fill. This means that SQL Server will do its best to keep the amount of free space within the data files the same. It isn't perfect at this, as there are a lot of things that can affect its ability to do so, but SQL Server will do its best.
Hopefully this helped explain these three complementary yet different concepts.
The recovery model tells SQL Server how much transaction log data to keep within the transaction log, and therefore how far back the database can be recovered. SQL Server has three different recovery models: SIMPLE, BULK_LOGGED and FULL.
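You can check which recovery model each database on an instance is using with a quick query against sys.databases:

```sql
-- Show the recovery model for every database on the instance.
SELECT name, recovery_model_desc
FROM sys.databases;
```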
Simple Recovery Model
Simple recovery model is the most basic recovery model and it provides the least amount of protection. The simple recovery model supports only full and differential backups, meaning that you can only restore the database to the state it was in when a full or differential backup was taken. Typically the simple recovery model is only used for databases like data warehouses, or other databases where the data can easily be reloaded from another source. The reason this is acceptable for data warehouses is that they are not the production system of record for the data; they are loaded from the production system of record on some schedule, whether nightly, weekly, hourly, etc.
Many people assume that using the simple recovery model means that the transaction log isn't kept. This is not the case. For most transactions the normal transaction log information is written to the transaction log. The difference between the simple recovery model and the other recovery models is that with the simple recovery model SQL Server doesn't wait for the transaction log to be backed up before removing the data from the transaction log. Instead it waits for the data pages in the MDF and NDF files to be fully written to disk. Once that happens SQL Server marks the rows within the transaction log as available for reuse.
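Switching a database between recovery models is a single ALTER DATABASE statement (the database names here are just examples):

```sql
-- Put a reloadable data warehouse into the simple recovery model.
ALTER DATABASE MyWarehouse SET RECOVERY SIMPLE;

-- Production databases that need point-in-time restores use FULL.
ALTER DATABASE MyProduction SET RECOVERY FULL;
```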
Full Recovery Model
In the full recovery model the database engine keeps all the transactions within the transaction log file (the LDF file) until they are captured by a transaction log backup. Because we are taking transaction log backups with the full recovery model, we are able to restore to any moment in time between the creation of the database and now, so long as we still have the backups covering that period.
If, for example, we take full backups on Monday morning at midnight and transaction log backups every hour at 10 minutes past the hour, then in the event of a database crash we can restore the SQL Server database to any point in time, provided that we still have the full backup from the Monday before and all the transaction log backups taken between that full backup and the transaction log backup taken after the problem happened. With this example, if we wanted to restore the database to Tuesday afternoon at 1:45:00pm we could do that by restoring the full backup from Monday, then restoring the transaction log backups starting at ten minutes after midnight on Monday up through the 2:10:00pm backup on Tuesday. While restoring that final transaction log backup we would simply tell the restore database command that we wanted to stop the restore process at 1:45:00pm on that Tuesday, and the restore would stop at that point.
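That restore sequence would look something like this (the database name, paths and backup file names are made up for illustration):

```sql
-- Restore the Monday full backup, leaving the database ready
-- to accept additional log restores.
RESTORE DATABASE MyDatabase
FROM DISK = 'E:\Backups\MyDatabase_Full_Monday.bak'
WITH NORECOVERY;

-- Restore each hourly log backup in order WITH NORECOVERY
-- (repeat for every log backup between the full backup and
-- Tuesday's 2:10pm log backup), then stop the final one at 1:45pm.
RESTORE LOG MyDatabase
FROM DISK = 'E:\Backups\MyDatabase_Log_Tue_1410.trn'
WITH STOPAT = '2023-01-10 13:45:00', RECOVERY;
```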
Bulk Logged Recovery Model
The bulk logged recovery model is very similar to the full recovery model. Most transactions are fully logged within the database's transaction log. There are a few commands which aren't, such as bulk inserts using the BCP command line tool or the BULK INSERT T-SQL statement, as well as a few others. All normal INSERT, UPDATE and DELETE statements are fully logged just like in the full recovery model. The bulk logged recovery model allows for point in time restores just like the full recovery model does. The difference is that when a bulk operation is minimally logged under the bulk logged recovery model, not all of the data changes are logged; only the allocation of space within the data files is logged. This presents a problem when the transaction log backup is taken, because SQL Server still needs to back up those changes. Since the full transaction log information for these commands isn't available, the database engine simply copies the data pages which were allocated by the bulk operation into the transaction log backup. Because of this, the specific times that the database can be restored to are limited to times when no bulk operations were running. So if a bulk operation runs from midnight until 1am every night, there wouldn't be any way to restore to a specific point in time during that one hour window. If you need the ability to restore to specific times within that window and have the data be perfect, the full recovery model must be used.
Statistics are magical little objects within the database engine that have the ability to make your queries run fast or painfully slow. The reason that statistics are so important is that they tell the database engine what data exists within a database table and how much of it there is. The problem with statistics comes from how often they are updated. By default, in all versions and editions of SQL Server, the statistics are updated when 20% + 500 rows within the table change. So if a table has 10,000 rows in it, we need to change 2,500 rows (2,000 rows is 20%, plus an additional 500 rows) for the statistics to be updated. With smaller tables like this, having out of date statistics usually doesn't cause too many problems. The problems really come into play with larger tables. For example, if there are 50,000,000 rows in a table, then for the statistics to be automatically updated we would need to change 10,000,500 rows. Odds are it is going to take quite a while to change that number of rows. To fix this we can manually tell SQL Server to update the statistics by using the UPDATE STATISTICS command.
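A manual statistics update, using a hypothetical table and statistic name, looks like this:

```sql
-- Update all statistics on the table, sampling every row.
UPDATE STATISTICS dbo.Orders WITH FULLSCAN;

-- Or update just one named statistic on the table.
UPDATE STATISTICS dbo.Orders IX_Orders_OrderDate;
```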
Within a statistic there are up to 200 values which are sampled from the column. A statistic contains a few different columns: a series of sampled values from the column which the statistic is built on, along with the count of the number of rows between that value and the next one in the statistic. From this information SQL Server is able to build the execution plan which is used to access the data. When the data within the statistic is out of date, SQL Server doesn't make the correct assumptions about how much data there is and what the best way to access that data is. When the statistic gets updated SQL Server is able to make better assumptions, so the execution plan becomes better and SQL Server is able to get the data faster.
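You can look inside a statistic yourself with DBCC SHOW_STATISTICS (the table and statistic names here are examples); the histogram result set shows the sampled values and the row counts between them:

```sql
-- Show the header, density vector and histogram for one statistic.
DBCC SHOW_STATISTICS ('dbo.Orders', 'IX_Orders_OrderDate');
```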
I get job postings emailed to me all the time from various recruiters. Usually they are, we'll call them, OK. But sometimes the requirements just to get to the interview are just stupid. Every once in a while, usually about twice a year, I get an email that says "remote candidates are welcome, but the candidate will have to pay their own way to get to the job interview". Now when you are talking about paying for gas to get from your house to their office, that's fine. However these jobs are often not in the city, or even the state, that I live in. So let me get this straight: you want me to pay to fly out to see you, so that you can tell me that you don't want me to work for you. That's really not how this works.
If you’ve exhausted the talent in your local city and you need to get talent from out of town, then it’s on you to pay for the travel to get the person to the interview.
Even if there was a position open at a company in my local area, if the posting said this I probably wouldn't even consider the job. This tells me that as a company you don't respect my time and my resources. From this I assume that you won't want to pay for any of my training so that I can better support the company's systems. I can assume that you'll expect me to work on projects on nights and weekends (I've got no problem with nights and weekends for emergency system-down issues, but not for projects that weren't properly planned).
If you are a company that puts these sorts of silly statements in your job descriptions, and you are wondering why you can't get any candidates, stuff like this is why.
When dealing with SQL Server databases we have to deal with locking, and blocking, within our application databases. All too often we talk about blocking as being a bad thing. However, in reality blocking isn't a bad thing. SQL Server uses blocking to ensure that only one person is accessing a given part of the database at a time. Specifically, blocking is used to ensure that while someone is writing data, no one else can read that specific data.
While this presents as a royal pain, in that users' queries run slower than expected, the reality is that we don't want users accessing incorrect data, and we don't want to allow two users to change the same bit of data at once. Because of this we have locking, which then leads to blocking within the database.
All of this is done to ensure that data integrity is maintained while the users are using the application so that they can ensure that the data within the database is accurate and correct.
Without locking and blocking we wouldn’t have data that we could trust.
The NOLOCK table hint gets used way, way too frequently. The place that I hate seeing it the most is in financial applications, where I see it far too often.
Developers who are working on financial applications need to understand just how important not using NOLOCK is. NOLOCK isn't just a go-faster button; it changes the way that SQL Server lets the user read the data they are trying to access. With the NOLOCK hint in place the user is allowed to read pages which other users already have locked for changes. This allows the user's query to get incorrect data back.
If the user is running a long-running query that is accessing lots of rows which are in the process of being changed, the user could get duplicate rows, or missing rows. This can obviously cause all sorts of problems, as the data in the report won't actually be accurate. In reports that internal staff are running this is not good; if it's your external users who are getting incorrect data, for example account debits and credits being processed while the user is requesting data, they could suddenly see all sorts of invalid numbers.
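As an illustration (the table here is hypothetical), these two queries can return different answers while a transfer transaction is in flight, because the first one is allowed to read the half-finished change:

```sql
-- Reads uncommitted data: may see a debit row without its
-- matching credit, so the total can be wrong.
SELECT SUM(Amount) AS Balance
FROM dbo.LedgerEntries WITH (NOLOCK);

-- Default READ COMMITTED: waits for the writer to finish
-- and returns a consistent total.
SELECT SUM(Amount) AS Balance
FROM dbo.LedgerEntries;
```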
If you are working with a financial application and you are seeing NOLOCK hints in there, you'll want to work on getting rid of them. For any that must remain for some reason, make sure that the business users understand exactly how the data they are looking at can be incorrect and shouldn't be trusted.
If the application is using the NOLOCK hint to solve performance problems, those problems need to be resolved in other ways, typically by fixing indexing problems on the tables which are causing index or table scans.
As our VMware environments become larger and larger, with more and more hosts and guests, more thought needs to be given to the vCenter database, which typically runs within a SQL Server database.
With the vCenter database running within Microsoft SQL Server (which is the default), there will be lots of locking and blocking happening as the queries which the vCenter server runs aggregate the data into the summary tables. The larger the environment, the more data needs to be aggregated every 5 minutes, hourly, daily, etc.
The problem here is that in order for these aggregations to run, the source and destination tables have to be locked. This is normal data-integrity behavior within the SQL Server database engine.
Thankfully there is a way to get out of this situation: enable a setting called snapshot isolation for the vCenter database. This setting changes the way that SQL Server handles concurrency by allowing people to write to the database while at the same time allowing readers to see the last committed version of the data, thereby preventing blocking. SQL Server does this by keeping a copy of each row's prior version in the tempdb database when the row is modified. Any user that queries the data being changed is given the old version from the version store in tempdb.
If you’ve seen problems with the vCenter client locking up and not returning performance data when the aggregation jobs are running, this will make these problems go away.
Turning this feature on is pretty simple. In SQL Server Management Studio simply right click on the vCenter database, find the "Allow Snapshot Isolation" setting on the Options tab, change the setting from False to True and click OK.
If you’d rather change the settings via T-SQL it’s done via the ALTER DATABASE command shown below.
ALTER DATABASE [vCenter] SET ALLOW_SNAPSHOT_ISOLATION ON
Hopefully this will help fix some performance problems within the vCenter database.
This week I've found some great things for you to read; here are a few of my favorites.
- Bootstrapping SQL Server bloggers and blog readers with Twitter!
- Whiteboard Wednesday #1: Top Visualization Mistakes
- The Accidental Architect
- Personally Identifiable Information (PII) and Data Encryption
- Traversing the Facebook Graph using Data Explorer
- This week's SQL Server person to follow on Twitter is: sqlpass also known as PASS
Hopefully you find these articles as useful as I did.
Don’t forget to follow me on Twitter where my username is @mrdenny
As we know with Microsoft SQL Server everything is processed from disk and loaded into the buffer pool for processing by the query engine. So what happens to the buffer pool when backups are taken?
The answer is that nothing happens to the buffer pool.
When SQL Server is backing up data, it simply takes the data from the data files on disk and writes it to the backup file. During the backup process the dirty pages are written to disk by a checkpoint that is triggered by the backup database process.
Because the backup process simply reads the data files and writes them to the backup location there’s no need to cache the data in the buffer pool as this data isn’t being queried by a normal SQL query.
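If you want to see this for yourself, you can count the pages a database has in the buffer pool before and after a backup; the counts should be roughly the same (the database name and backup path are examples):

```sql
-- Count buffer pool pages belonging to the current database.
SELECT COUNT(*) AS pages_in_buffer_pool
FROM sys.dm_os_buffer_descriptors
WHERE database_id = DB_ID();

-- Take a backup, then run the count above again; the backup
-- neither flushes nor fills the buffer pool.
BACKUP DATABASE MyDatabase TO DISK = 'E:\Backups\MyDatabase.bak';
```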
The other day I was looking at parallel query plans on a customer's system and I noticed that the bulk of the parallel queries on the system were coming from Spotlight for SQL Server.
The query in question is used by Spotlight to figure out when the most recent full, differential and log backups were taken on the server. The query itself is pretty short, but it was showing a query cost of 140 on this system. A quick index created within the msdb database solved this problem, reducing the cost of the query down to 14. The query cost was reduced because a clustered index scan of the backupset table was changed into a nonclustered index scan of a much smaller index.
The index I created was:
```sql
CREATE INDEX mrdenny_databasename_type_backupfinishdate
ON dbo.backupset (database_name, type, backup_finish_date)
WITH (FILLFACTOR = 70, ONLINE = ON, DATA_COMPRESSION = PAGE);
```
Now if you aren't running Enterprise Edition you'll want to turn the online index build off (ONLINE = OFF), and you may need to remove the data compression option depending on the edition and version of SQL Server that you are running.
If you are running Spotlight for SQL Server I'd recommend adding this index, as it will fix the performance of one of the queries which Spotlight for SQL Server runs against the database engine pretty frequently. I'd recommend adding this index to all the SQL Servers which Spotlight for SQL Server monitors.