Thursday was the final day of VMworld. Like with all conferences I’m sad that it’s over, but I’m damn glad that it’s over. These things are just exhausting to go to.
Today was an interesting (and short) day. On the final day of VMworld, VMware has decided that they won’t give a product related keynote. Instead they have a keynote that is pretty much unrelated to VMware and their technologies. So today’s keynote was all about the human brain. Three PhDs spoke at the keynote, for about 20 minutes each. The first (and the best) was Dr. David Eagleman (website | @davideagleman), who is a neuroscientist from Baylor College of Medicine. He gave a very interesting presentation on why people think that time slows down during traumatic events such as falling from a tree or a bike. He and his team came up with an experiment where they basically threw a person off a building (it was actually a big scaffolding looking thing) into a net so they could test whether the brain actually perceives time slowing down, or whether it just feels that way after the fact.
The second and third speakers, Dr. V.S. Ramachandran (website) and Dr. Deb Roy (website | @dkroy), while good speakers, simply weren’t as good as Dr. Eagleman, who was a very hard act to follow. Unfortunately I don’t actually remember what Dr. Ramachandran spoke about. Dr. Roy talked to us about the interactions between people and the way that those interactions can be tracked at the micro level (within the home) and at the macro level (worldwide).
At the micro level he installed cameras and microphones in his home and recorded everything for 36 months starting shortly after his first child was born. His research team then developed some software that tracked movement through the house and matched it to his child learning to speak, and they were able to visually map on a diagram of the house where different words were learned. For example the word “water” was learned in and near the kitchen and bathroom while the word “bye” was learned near the front door.
At the macro level he founded a company which tracks just about every show on TV and analyses Twitter, Google+, Facebook, etc. traffic to see what people are talking about online, so that studios and marketing companies can see how people are reacting online to specific items when they see them on TV. It was interesting (albeit a little creepy) to see.
As far as sessions went today, there were only three slots, and I only had two sessions scheduled. The first that I attended was a Resource Management Deep Dive into vSphere 4.1 and 5.0. During this session they really went into how vSphere allocates resources (CPU, Memory and IO) to various VMs on the host depending on how reservations, guarantees, resource pools, etc. are all configured. I’m not going to try to talk too much about this at the moment. It’s going to take me a couple of times listening to the recording online to catch everything.
One thing that I did want to share that I didn’t know was how much data DPM (Distributed Power Management) uses when it’s deciding to power down hosts at night and bring them back up in the morning. When vCenter is deciding whether to power down a host, it looks at the last 40 minutes of data to decide if the load is low enough to bring down a host. For bringing a host back online it only looks at the last 5 minutes. vCenter will never power a host down if doing so will lead to a performance problem. When deciding to power hosts down, performance is considered first, with the power savings being a side effect. Power will always be spent to get performance.
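To make the asymmetry between the two look-back windows concrete, here is a minimal Python sketch. The 40 and 5 minute windows come from the session; everything else (the function name, the use of a simple average, and the two thresholds) is a made-up illustration and not how DPM is actually implemented.

```python
from statistics import mean

POWER_DOWN_WINDOW_MIN = 40  # from the session: look-back before powering a host down
POWER_UP_WINDOW_MIN = 5     # from the session: look-back before powering a host back up

# Hypothetical utilisation thresholds, chosen only for illustration.
LOW_LOAD = 0.45
HIGH_LOAD = 0.81


def dpm_recommendation(load_per_minute):
    """load_per_minute: utilisation samples (0.0 to 1.0), one per minute, oldest first."""
    recent = load_per_minute[-POWER_UP_WINDOW_MIN:]
    history = load_per_minute[-POWER_DOWN_WINDOW_MIN:]
    # Performance comes first: five busy minutes are enough to bring a host back.
    if len(recent) == POWER_UP_WINDOW_MIN and mean(recent) > HIGH_LOAD:
        return "power_on"
    # Powering down needs a full 40 minutes of sustained low load.
    if len(history) == POWER_DOWN_WINDOW_MIN and mean(history) < LOW_LOAD:
        return "power_off"
    return "no_change"
```

With these assumed thresholds, 40 quiet minutes produce a power-off recommendation, while 35 quiet minutes followed by 5 busy ones produce a power-on recommendation, so a short burst wins over a long quiet stretch.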
The second session was one on performance tuning of the vCenter database itself, which I figured would be pretty interesting. It was interesting, but also frustrating as the speaker didn’t really know much about SQL Server (the default database platform for hosting the vCenter database). Some of the information presented was pretty interesting, such as how the tables are laid out and what is stored in which table, and I’ve now got a much better understanding of how the performance data gets loaded into the database and how the rollups are done. I also now know that I need to put together some scripts to jump start the process if it gets backed up, as well as put together a best practices document for DBAs (and VMware folks that don’t know SQL at all) so that they can get better performance on their vCenter databases.
If you need to find the historical performance data within your vCenter database, look into the tables which start with vpx_hist_stat. There are 4 of these tables: vpx_hist_stat1, vpx_hist_stat2, vpx_hist_stat3 and vpx_hist_stat4. The historical data is rolled up daily, weekly, monthly and annually into those four tables respectively. You’ll also want to look into the vpx_sample_time tables, of which there are also 4: vpx_sample_time1, vpx_sample_time2, vpx_sample_time3 and vpx_sample_time4.
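As a quick reference, the table layout above can be written down as a small Python helper. This is only a sketch: the function names are made up, and pairing each vpx_sample_time table with the vpx_hist_stat table of the same number is my assumption. The query it builds is a plain row count, so it doesn't depend on any column names.

```python
# Rollup level -> table-name suffix, in the daily/weekly/monthly/annual order described above.
ROLLUP_SUFFIX = {"daily": 1, "weekly": 2, "monthly": 3, "annual": 4}


def rollup_tables(level):
    """Return the (stats table, sample-time table) pair for a rollup level."""
    suffix = ROLLUP_SUFFIX[level.lower()]
    # Assumption: the sample-time table with the same number belongs to the same rollup.
    return f"vpx_hist_stat{suffix}", f"vpx_sample_time{suffix}"


def row_count_query(level):
    """Build a simple T-SQL row count against the stats table for that level."""
    stat_table, _ = rollup_tables(level)
    return f"SELECT COUNT(*) AS row_count FROM {stat_table};"
```

Calling `row_count_query("weekly")` gives a statement against vpx_hist_stat2, which is a cheap way to see how much history has built up at each level.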
Apparently vCenter 4.0 and below has a habit of causing deadlocks when loading data, especially in larger environments. The fixes that were provided are pretty much all hit or miss as to whether they will work, and his description of the cause of the problem was pretty vague. The gist of what I got was that the data loading is deadlocking with the code which handles the rollups and causing problems. Something which could be tried to fix this would be to enable snapshot isolation mode for the database. Personally I think this would have a better chance of fixing the problem than the crappy workaround which he provided (which I’m not listing here on purpose).
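For anyone who wants to try the snapshot isolation idea, SQL Server has two database-level settings for row versioning, and the sketch below simply builds the ALTER DATABASE statement for each. The helper name is made up and the database name is whatever yours happens to be. Treat this as something to test in a lab, not as a VMware-supported fix.

```python
def snapshot_isolation_statements(database_name):
    """Return the T-SQL statements that turn on row versioning for a database."""
    # Escape closing brackets so the name is safe inside [ ] delimiters.
    quoted = "[" + database_name.replace("]", "]]") + "]"
    return [
        # Lets sessions opt in with SET TRANSACTION ISOLATION LEVEL SNAPSHOT.
        f"ALTER DATABASE {quoted} SET ALLOW_SNAPSHOT_ISOLATION ON;",
        # Makes the default READ COMMITTED level read row versions instead of taking shared locks.
        f"ALTER DATABASE {quoted} SET READ_COMMITTED_SNAPSHOT ON;",
    ]
```

Since vCenter's code can't be changed to request snapshot isolation, READ_COMMITTED_SNAPSHOT is probably the setting that matters here. It needs the database to have no other active connections while it is switched on. Row versioning only removes blocking between readers and writers, so it won't help if the deadlock is between two writers.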
The workaround which VMware came up with for this problem, introduced in vCenter 4.0, can have its own problem in large environments: data goes missing for servers at random intervals. This is because VMware came up with the idea of creating three staging tables and using each staging table for 5 minutes, then processing the data from that staging table into the vpx_hist_stat and vpx_sample_time tables while moving on to the next staging table. However if it takes too long to process the data from the first table, and the third table has been used, it is now time to move back to the first table, and data is lost as it can’t be written into the first table. VMware needs to do some major redesign of this entire process for the next release to come up with a better solution that won’t allow for data loss. There are plenty of ways to do it that won’t cause problems. Don’t you love it when developers that don’t know databases very well try and come up with screwy solutions to problems?
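To show how the rotation loses data, here is a small Python simulation of the three staging tables. The 5 minute window and the table count come from the session. The rest is my simplification: one table is processed at a time, and a window's data is dropped entirely if its table hasn't been drained when the rotation comes back to it. The function name is made up.

```python
WINDOW_MINUTES = 5   # each staging table collects data for 5 minutes (from the session)
TABLE_COUNT = 3      # three staging tables used in rotation (from the session)


def lost_windows(process_minutes, total_windows):
    """Return the indexes of the collection windows whose data would be dropped."""
    free_at = [0.0] * TABLE_COUNT   # time at which each staging table is empty again
    processor_free_at = 0.0         # assumption: rollup processing handles one table at a time
    lost = []
    for window in range(total_windows):
        table = window % TABLE_COUNT
        start = window * WINDOW_MINUTES
        end = start + WINDOW_MINUTES
        if free_at[table] > start:
            # The rotation came back around before this table was drained.
            lost.append(window)
            continue
        # Processing starts once the window closes and the processor is idle.
        begin_processing = max(end, processor_free_at)
        processor_free_at = begin_processing + process_minutes
        free_at[table] = processor_free_at
    return lost
```

In this model `lost_windows(4, 12)` returns an empty list because processing keeps up with collection, while `lost_windows(8, 9)` returns `[4, 5, 6]`: once processing falls behind, three windows in a row (15 minutes of data) are dropped before the tables free up again.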
Based on what was talked about in this session there are a lot of SQL Scripts that need to be written to help people improve performance of their vCenter databases. Guess I’ve got a lot of script writing to do.
Today was day 3 of VMworld. All the sessions that I attended today were pretty much a recap of the things which I covered earlier in the week. I went to these more in depth sessions because the information learned today will help me with my day to day deployments of VMware, as well as helping me learn more about the specific items within VMware that I need to look at to ensure that VMware is running smoothly day to day.
The big event of tonight was the attendee party with a performance by The Killers, which was opened by Recycled Percussion. It was a great concert, especially considering that I’m not a mega fan of The Killers and hadn’t heard of Recycled Percussion before tonight. After the show and party VMware had after parties at the hotel pools, which were a blast. I met and had some great conversations with some great guys from a variety of places and companies.
While today doesn’t make a great blog post, it was another amazing day at VMworld 2011 and I can’t wait for tomorrow.
So today was day 2 of VMworld 2011 and it was a great day at the conference. We had a great keynote with some demos which were pretty funny (I really hope that they were supposed to be funny). Granted I was a little late to the keynote so I missed the first few minutes, but I overslept and, damn it, breakfast is the most important meal of the day.
The first thing I saw was a project called Project Octopus. This allows your users to access the same files via Windows, Mac or Linux PCs, phones, tablets, etc. It also allows users to edit any files which they have access to on any device. This is done via HTML 5 so as long as the device supports HTML 5 (which most everything new does) you can access full Windows applications on the machine. In the demo the user was sent an Excel file via IM which they then opened on an iPad, and they were able to edit it in a fully functional copy of Excel 2010. There was a small application installed on the iPad which connected to the server via the web browser and uploaded the file to the server (or opened the file from the server, not really sure here, but either way), then the user was able to edit the Excel sheet and save it back to the server.
The next product which we were shown was called VMware Go. Go is a software as a service offering where the user signs into the site and is then able to scan an IP subnet from the webpage, looking for servers which are capable of running vSphere 5.0. The user can then select which Windows servers they would like to deploy vSphere 5.0 to, and vSphere 5.0 is then deployed to those servers. I’m not sure what happens to the Windows OS and services which are already installed on the servers, so this could be very dangerous if pushed to the wrong server by accident.
A new product which I’m really excited about is aimed directly at the small / medium business (SMB) market and will allow you to take two servers with only local storage and configure them in a highly available vSphere 5.0 cluster. This new product is called Virtual Storage Appliance (VSA). The way this is done is that the VSA, which is a virtual appliance, is installed on all the hosts (it supports two and three node clusters only). When installed and configured it will take the local storage and present it to the cluster as shared storage. Redundancy for this solution is done by using software based replication and setting up each VM to be replicated to another host in the cluster. This way the cluster can always survive a single node failure without losing the ability to run any guest on the cluster.
There are some big changes coming to vSphere Site Recovery Manager (SRM) 5.0 which is no longer called VMware vSphere SRM. One of the biggest is the ability to automatically fail back after a site has failed and been restored. In prior versions of SRM failover was a one way operation; in order to fail back to the first site you would have to totally reconfigure SRM and then trigger a failover. With the new 5.0 version of SRM you simply configure the failback as part of the policies, then when the failed site comes back online SRM will fail back as configured.
Another cool thing you can do with SRM 5.0 is DR your site to a cloud provider instead of to your own backup data center. This allows you to run your primary site on your hardware, but rent your DR systems from a cloud service provider that is certified as an SRM site. Currently there are only a couple of options, but as time goes on there will be more options available.
I went to a couple of sessions today, the most informative of which was about the new features of vSphere 5.0. VMware is upgrading the VMFS version from 3 to 5, but this time it is a non-disruptive upgrade, unlike the upgrade from VMFS 2 to 3. The new version of ESXi is much thinner than the prior 4.1 version, leaving more resources available for the guest machines.
vSphere officially supports only 32 hosts in a cluster. Clusters have been tested with over 100 nodes, but still only 32 are supported. Something which will make a lot of Linux shops happy is that vCenter no longer requires Windows as the OS for the vCenter server. It can now be installed on a Linux OS (they didn’t specify which Linux flavor). There is an embedded database which supports up to 5 hosts and 50 VMs. For installs which are larger than this you’ll need to install an instance of Oracle. Currently only Oracle is supported; eventually other databases will be supported. Another limitation of running vCenter on Linux is that you can’t run vCenter in linked mode. Linked mode is where you have a vCenter server at each site and they are linked so that you have redundancy at the vCenter level.
There is a new web based client which will be included with vSphere 5.0. This won’t be a fully featured UI, but it will support most of the features. The nice thing about this new web client is that it will work on Windows, Mac, and Linux. Eventually the web client will become the default client for vSphere and vCenter but this isn’t the case yet.
The last change I want to talk about today is the fact that vMotion now supports slower links. In vSphere 4.1 and below using vMotion required using a network which had a 5ms or lower network latency. In vSphere 5.0 this limit is increased to 10ms latency which allows you to vMotion over city wide networks.
See you tomorrow for VMworld Day 3.
Today was day 1 of VMworld and I had a blast, even though I was only able to attend for part of the day. I flew into Vegas this morning instead of spending the night last night. I didn’t hit any sessions today, but I did catch the keynote which was given by Paul Maritz, the CEO of VMware.
The keynote was interesting, but didn’t provide a whole lot of new information. Paul officially announced that vSphere 5.0 was released along with VMware View 5.0.
vSphere 5.0 is the third new major annual release of the vSphere product. 2009 had vSphere 4.0, 2010 had vSphere 4.1 giving 2011 vSphere 5.0. vSphere 5.0 has 200 new features (which weren’t listed). VMware has put 1 million man hours into building the new vSphere 5.0 platform and another 2 million man hours into testing the new version.
A few new features were talked about, and they basically boiled down to a few key points. The first is probably the most important: with vSphere 5.0, VMware expects to be able to run almost every production workload. Virtual machines running under vSphere 5.0 can now have up to 32 vCPUs and 1 TB of RAM each. VMware has also added some storage load balancing features, as well as automatic storage tiering which looks very interesting, and I’m really hoping that I can learn more about both as the week continues.
There were some interesting stats which Paul talked about as well. There were 19,000 attendees who actually made it to the conference. Over 20,000 people registered, but some got stuck on the east coast thanks to the weather. Some additional stats: analysts currently estimate that worldwide 50% of production workloads are running under a hypervisor today. A new VM is built every 6 seconds (which is faster than people are being born). It is estimated that there are over 20 million virtual machines running under VMware’s hypervisor platform. If these machines were put end to end they would be twice the length of the Great Wall of China. More machines are being moved from host to host via vMotion than there are airplanes in the sky.
Needless to say there is a lot of great information which I’m hoping to learn and share with you.
Tom LaRock is starting up a video series for Confio called Afternoon Ignite. He has asked me to be his first victim, er, guest on the show. We will be talking about pretty much whatever comes to mind, which will probably involve performance tuning, VMs, PASS, SQL Excursions, and Bacon.
Feel free to join us via GoTo Meeting at 11am Pacific / 2pm Eastern and check out the excitement.
So the other day I had to restore a SQL Server replication publisher. When I restored it I made sure to use the KEEP_REPLICATION option on the restore (also available in the SSMS UI) so that replication would come back online. However when I restarted the log reader I got the following error message.
The log scan number (6367:10747:6) passed to log scan in database ‘%d’ is not valid. This error may indicate data corruption or that the log file (.ldf) does not match the data file (.mdf). If this error occurred during replication, re-create the publication. Otherwise, restore from backup if the problem results in a failure during startup. (Source: MSSQLServer, Error number: 9003)
Needless to say this error looks pretty damn scary. In reality it isn’t actually that bad. What this error is basically saying is that the LSN that is returned from the database is older than the one logged in the replication database. The best part is that the fix is pretty easy. Simply run the stored procedure “sp_replrestart” in the published database.
As the SQL PASS summit is a big confusing event for people attending the summit for the first time, I’m taking it upon myself to do something about this.
On Tuesday September 6th, 2011 at 10am Pacific Time (1pm Eastern, 5pm GMT) I will be putting on a webcast to give first timers (and people that have attended before) some critical information about Seattle and the summit that they should probably have before actually getting to Seattle for the summit.
No registration is needed, just sign into the live meeting when the webcast is supposed to start. You can also go to the meeting lobby that day and I should be able to approve you in that way, but I think signing into the live meeting directly would be the best bet. I have made a calendar invite that you can download and import into your calendar. The invite is in iCal format.
I will be recording the session and I’ll make it available for viewing after the session, but this is your first chance to ask some questions about the summit during the Q & A at the end of the session.
All I ask is that you pass the information about this session along to others who are attending the PASS summit, as they will hopefully get some useful information about Seattle and the PASS summit as well.
I look forward to your attending my session on September 6th, and I’ll see you at the PASS summit.
When working with SQL in a cluster, the account rights on both nodes of the cluster need to be the same
Recently I was working with a client’s SQL Server cluster. The managed service provider had installed some Windows patches, causing the SQL Cluster to fail over to the other node. No big deal, everything appeared to be working as normal.
After a couple of days we noticed something a little strange. There was a very strange wait type which was showing a LOT of wait time. This wait type was PREEMPTIVE_OS_GETPROCADDRESS, which means that SQL Server is waiting on something outside of the database engine to respond. When I looked into the spid which was doing the waiting I saw that it was running the extended stored procedure xp_delete_file. What this procedure does, in case you aren’t aware, is remove old SQL Server backups from the hard drive of the server based on the parameters that you specify.
The first thing that I did was look at the permissions of the files, which appeared to be set up correctly: the local admin group had full control, users had no rights, and the owner had full control. I knew that the SQL Account should be a member of the administrators group on these servers (I didn’t set the machines up, so don’t get me started on minimum permissions). However when I looked in the admin group for this node of the cluster, the SQL Account wasn’t a member. I jumped on to the other node and it was in that machine’s admin group.
The reason that this was a problem is the way that NTFS handles permissions on new files when the user is an owner of the folder and has full control rights. Because the folder is owned by the local admin group, and the SQL Account was a member of the local admin group when the files were created, the files inherited their rights from the folder: admins had full control, users had no rights, and the owner had full control. Except that in this case the owner of the folder was BUILTIN\Administrators, which also carried down to the files. So when the SQL Account came through on the second machine looking to delete files, it didn’t have the rights because on that node it wasn’t in the BUILTIN\Administrators group.
Fortunately fixing this problem was pretty easy. I simply put the SQL Account in the local admin group on the misconfigured node and scheduled a short outage to restart SQL on that node so that it could pick up the new permissions. Then the long waits went away and the older backups were able to be deleted as they should be.
If you’d like to read more about why you don’t normally want to have the SQL Server running with admin rights, and what the minimum needed rights are, might I recommend you check out my security book Securing SQL Server (paperback | kindle | website), available on Amazon.com and other online retailers.
I’m very pleased to tell you all about my new storage blog on sqlmag.com titled “Troubleshooting SQL Server Storage Problems”. On this new blog (there’s just the one post for now, but that will change shortly) I’ll be talking all about SQL Server and Storage and how these things should be working together.
My sqlmag.com blog is all about helping you solve your storage problems, so the blog will work best with your questions, issues and problems. So please post your questions on the blog, post them here, or email them to me and I’ll get them answered (I will only include your name and/or company if you ask me to), so that not only can we get your questions answered and your problems fixed, but we can help solve other people’s problems in the process.
Last night the NETDA User Group in Redmond was nice enough to ask me to present to their group while I was up here this week. It was great talking to the group, and it was great giving a presentation at the Microsoft Corporate office.
As promised here’s the slide deck and sample code which was shown. Everything is included except for the LINQ to SQL code that I captured from my production environment.
I had a great time talking to the user group, hopefully I’ll be able to present there again in the future.