Posted by: Denny Cherry
Performance Problems, vCenter, VMware, VMworld, VMworld 2011
Thursday was the final day of VMworld. Like with all conferences I’m sad that it’s over, but I’m damn glad that it’s over. These things are just exhausting to go to.
Today was an interesting (and short) day. On the final day of VMworld, VMware has decided that they won’t give a product related keynote. Instead they have a keynote that is pretty much unrelated to VMware and their technologies. So today’s keynote was all about the human brain. There were three PhDs speaking at the keynote, about 20 minutes each. The first (and the best) was Dr. David Eagleman (website | @davideagleman) who is a neuroscientist from Baylor College of Medicine. He gave a very interesting presentation on why people think that time slows down during traumatic events such as falling from a tree or bike. He and his team came up with an experiment where they basically threw a person off a building (it was actually a big scaffolding looking thing) into a net so they could test whether the person’s brain actually thought time was slowing down, or if it just felt like it after the fact.
The second and third speakers, Dr. V.S. Ramachandran (website) and Dr. Deb Roy (website | @dkroy), while good speakers, simply weren’t as good as Dr. Eagleman, who was a very hard act to follow. Unfortunately I don’t actually remember what Dr. Ramachandran spoke about. Dr. Roy talked to us about the interactions between people and the way that those interactions can be tracked at the micro level (within the home) and at the macro level (worldwide).
At the micro level he installed cameras and microphones in his home and recorded everything for 36 months starting shortly after his first child was born. His research team then developed some software that tracked movement through the house and matched it to his child’s learning to speak, and they were able to map visually, on a diagram of the house, in which parts of the house different words were learned. For example the word “water” was learned in and near the kitchen and bathroom while the word “bye” was learned near the front door.
At the macro level he founded a company which tracks just about every show on TV and analyses Twitter, Google+, Facebook, etc. traffic to see what people are talking about online, so that studios and marketing companies can see how people react online to specific things they see on TV. It was interesting (albeit a little creepy) to see.
As far as sessions went today, there were only three slots, and I only had two sessions scheduled. The first that I attended was a Resource Management Deep Dive into vSphere 4.1 and 5.0. During this session they really went into how vSphere allocates resources (CPU, Memory and IO) to various VMs on the host depending on how reservations, guarantees, resource pools, etc. are all configured. I’m not going to try to talk too much about this at the moment. It’s going to take me a couple of times listening to the recording online to catch everything.
One thing that I did want to share that I didn’t know was how much data DPM (Distributed Power Management) uses when it’s deciding to power down hosts at night and bring them back up in the morning. When vCenter is deciding whether to power down a host it looks at the last 40 minutes of data to decide if there is little enough load to bring down a host. As for bringing a host back online, it only looks at the last 5 minutes. vCenter will never power a host down if it will lead to a performance problem. When deciding to power hosts down, performance is considered first, with the side effect being that power is saved. Power will always be spent to get performance.
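To make the asymmetry between the two windows concrete, here is a minimal sketch of that kind of decision. The 40 and 5 minute windows come from the session; everything else (the function name `dpm_decision`, the use of a simple average, and the 0.3 / 0.8 thresholds) is an assumption made up for illustration and is not how vCenter actually computes it.

```python
from statistics import mean

# Window lengths as described in the session.
POWER_DOWN_WINDOW_MIN = 40
POWER_UP_WINDOW_MIN = 5

# Hypothetical thresholds, chosen only for this illustration.
LOW_LOAD = 0.3
HIGH_LOAD = 0.8


def dpm_decision(load_per_minute, low=LOW_LOAD, high=HIGH_LOAD):
    """Return "power_on", "power_off" or "no_change".

    load_per_minute: utilization samples between 0 and 1,
    one per minute, oldest first.
    """
    # Performance comes first: a short burst of high load is
    # enough to bring a host back online.
    recent = load_per_minute[-POWER_UP_WINDOW_MIN:]
    if len(recent) == POWER_UP_WINDOW_MIN and mean(recent) > high:
        return "power_on"

    # Powering down needs a much longer quiet period.
    history = load_per_minute[-POWER_DOWN_WINDOW_MIN:]
    if len(history) == POWER_DOWN_WINDOW_MIN and mean(history) < low:
        return "power_off"

    return "no_change"
```

The point of the sketch is only the shape of the logic: the power-on check runs first and needs far less history, so a busy five minutes wins over a quiet half hour.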
The second session was one on performance tuning of the vCenter database itself, which I figured would be pretty interesting. It was interesting, but also frustrating as the speaker didn’t really know much about SQL Server (the default database to host the vCenter database). Some of the information presented was pretty interesting about how the tables are laid out and what is stored in which table, and I’ve got a much better understanding about how the performance data gets loaded into the database and how the rollups are done now. I also now know that I need to put together some scripts to jump start the process if it gets backed up as well as put together a best practices document for DBAs (and VMware folks that don’t know SQL at all) so that they can get better performance on their vCenter databases.
If you need to find the historical performance data within your vCenter database, look into the tables which start with vpx_hist_stat. There are four of these tables: vpx_hist_stat1, vpx_hist_stat2, vpx_hist_stat3 and vpx_hist_stat4. The historical data is rolled up daily, weekly, monthly and annually into those four tables respectively. You’ll also want to look into the vpx_sample_time tables, of which there are also four: vpx_sample_time1, vpx_sample_time2, vpx_sample_time3 and vpx_sample_time4.
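As a quick reference, the table naming can be captured in a small helper. The table names and rollup levels are the ones from the session; the helper names (`history_tables`, `row_count_query`) are made up here, and the generated query deliberately sticks to `COUNT(*)` because I am not assuming anything about the column layout.

```python
# Rollup level for each numeric table suffix, as described in the session.
ROLLUP_LEVELS = {1: "daily", 2: "weekly", 3: "monthly", 4: "annually"}


def history_tables(level):
    """Return the table names that hold one rollup level."""
    if level not in ROLLUP_LEVELS:
        raise ValueError(f"level must be one of {sorted(ROLLUP_LEVELS)}")
    return {
        "rollup": ROLLUP_LEVELS[level],
        "stat_table": f"vpx_hist_stat{level}",
        "sample_time_table": f"vpx_sample_time{level}",
    }


def row_count_query(table_name):
    """Build a query that only counts rows, so no columns are assumed."""
    return f"SELECT COUNT(*) AS row_count FROM {table_name};"


if __name__ == "__main__":
    for level in sorted(ROLLUP_LEVELS):
        tables = history_tables(level)
        print(tables["rollup"], row_count_query(tables["stat_table"]))
```

Running the row counts across all four levels is a cheap first check on whether the rollups are keeping up.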
Apparently vCenter 4.0 and below has a habit of causing deadlocks when loading data, especially in larger environments. The fixes that were provided are pretty much all hit or miss as to whether they will work, and his description of the cause of the problem was pretty vague. The gist of what I got was that the data loading is deadlocking with the code which handles the rollups and causing problems. Something which could be tried to fix this would be to enable snapshot isolation mode for the database. Personally I think this would have a better chance of fixing the problem than the crappy workaround which he provided (which I’m not listing here on purpose).
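For anyone who wants to experiment with the snapshot isolation idea, the sketch below just builds the T-SQL that turns it on. `ALLOW_SNAPSHOT_ISOLATION` and `READ_COMMITTED_SNAPSHOT` are real SQL Server database options; the database name `VCDB` and the helper names are placeholders, and nothing here has been tested against a vCenter database, so treat it as something to try on a test system and check with VMware support first.

```python
def quote_name(name):
    """Bracket-quote a SQL Server identifier, escaping any ']'."""
    return "[" + name.replace("]", "]]") + "]"


def snapshot_isolation_statements(database, read_committed_snapshot=False):
    """Build the ALTER DATABASE statements that enable row versioning.

    database: name of the vCenter database (placeholder in the example).
    read_committed_snapshot: also switch the default READ COMMITTED
    behavior to use row versions.
    """
    db = quote_name(database)
    statements = [f"ALTER DATABASE {db} SET ALLOW_SNAPSHOT_ISOLATION ON;"]
    if read_committed_snapshot:
        statements.append(f"ALTER DATABASE {db} SET READ_COMMITTED_SNAPSHOT ON;")
    return statements


if __name__ == "__main__":
    # "VCDB" is a hypothetical database name.
    for statement in snapshot_isolation_statements("VCDB", read_committed_snapshot=True):
        print(statement)
```

The distinction matters: `ALLOW_SNAPSHOT_ISOLATION` only lets a session ask for the SNAPSHOT isolation level, so it does nothing unless the application requests it, while `READ_COMMITTED_SNAPSHOT` changes how ordinary READ COMMITTED reads behave without touching the application. Neither one removes deadlocks between two writers.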
The workaround which VMware came up with for this problem, introduced in vCenter 4.0, can have its own problem in large environments. This problem is that data is missing for servers at random intervals. This is because VMware came up with the idea of creating three staging tables and using each staging table for 5 minutes, then processing the data from that staging table into the vpx_hist_stat and vpx_sample_time tables while moving on to the next staging table. However, if it takes too long to process the data from the first table, then by the time the third table has been used and it is time to move back to the first table, the first table is still busy and the new data is lost because it can’t be written there. VMware needs to do some major redesign of this entire process for the next release to come up with a better solution that won’t allow for data loss. There are plenty of ways to do it that won’t cause problems. Don’t you love it when developers that don’t know databases very well try and come up with screwy solutions to problems?
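The data loss is easy to see in a toy simulation of the rotation. The three tables and the 5 minute interval are from the session; the function name, the assumption that each table is processed independently starting the moment its interval ends, and the assumption that a whole interval is dropped when its table is still busy are all simplifications made up for this sketch.

```python
def lost_intervals(process_minutes, intervals, tables=3, interval_minutes=5):
    """Return the indexes of collection intervals whose data is lost.

    process_minutes: how long it takes to process one staging table.
    intervals: how many collection intervals to simulate.
    """
    # Minute at which each staging table is free to be written again.
    busy_until = [0] * tables
    lost = []
    for i in range(intervals):
        start = i * interval_minutes
        table = i % tables  # round-robin through the staging tables
        if busy_until[table] > start:
            # The table is still being processed, so this interval's
            # data has nowhere to go.
            lost.append(i)
            continue
        # The table collects data for one interval and is then
        # processed into the history tables.
        busy_until[table] = start + interval_minutes + process_minutes
    return lost


if __name__ == "__main__":
    for minutes in (4, 10, 12):
        print(minutes, lost_intervals(minutes, intervals=10))
```

Under these assumptions nothing is lost as long as a table can be processed within the 10 minutes the other two tables buy it. At 12 minutes per table, the sketch drops intervals 3, 4, 5 and 9 out of the first ten, which matches the "missing at random intervals" symptom.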
Based on what was talked about in this session there are a lot of SQL Scripts that need to be written to help people improve performance of their vCenter databases. Guess I’ve got a lot of script writing to do.