If you follow me on Twitter then you’ve already heard bits and pieces of this.
In preparation for the onslaught of security patches which were released on October 14, 2009, I went ahead and patched all our servers the day before. It had been a couple of weeks since I had triggered patching, and I wanted to get everything else installed so that there wouldn’t be any dependency issues with installing the new patches. All of our servers, about 60 in all, patched correctly except for two of the three servers which host the web application our customers use. Now I say that two of the three servers didn’t patch correctly, but the third server wasn’t allowed to reboot, so I don’t know whether that one patched correctly or not at the time. But with two of the three machines offline, I wasn’t willing to chance it.
These machines are not exactly complex machines. They are Windows 2008 Web Edition virtual machines (under VMware vSphere 4.0), with IIS 7 installed. Each machine has 5 private IPs as each of the sites has an SSL certificate on it. The only other software which is installed is the VMware Tools and the Trend Micro Anti Virus. Our ASP.NET site is sitting on the machine, but it requires no extra DLLs to be registered so doesn’t really count as being “installed”.
Unfortunately we don’t have Software Assurance on our servers as it was outside of our budget so we have to settle for the paid support with Microsoft. As I didn’t really want to front the few hundred dollars to call Microsoft at midnight, and knowing that we had a decent support contract with Dell from when we purchased our servers from them, I figured that Dell would be a good place to call, then they could get Microsoft involved.
The symptoms were that when trying to boot normally the machine would hang at “Applying Computer Settings”. If the network ports were disconnected then the machine would boot, but nothing would function correctly as no services would start.
When I booted into safe mode there was an error in the event log which said “The COM+ Event System detected a bad return code during its internal processing. HRESULT was 8007043c from line 45 of d:\vistasp1_gdr\com\complus\src\events\tier1\eventsystemobj.cpp. Please contact Microsoft Product Support Services to report this error.” To me this is a pretty clear sign that there’s some major problem with the machine.
Dell support wanted to look at some stuff, so I was OK with that. I figured, how long can this take? Well, 5 hours later we had gotten nowhere. Some of the highlights of the stuff that we did:
- Uninstall all the patches that were installed (there were 5 including IE8 on this server)
- Attempt to reinstall VMware Tools, except that the installer wouldn’t start.
- Check the IP and DNS settings
- Disable IPv6
- Check to make sure we can ping the domain controller
- Run SFC against the system.
That’s pretty much all that was done except to reboot the machine a bunch of times and watch it sit and spin at the applying computer settings section.
At one point I asked her if she knew what COM+ was. A valid question considering the error message that I saw in the event log. After several minutes of silence (I assume I was on mute and she was searching the net for the answer) she told me something about it having to do with the .NET architecture. While .NET can use COM to do stuff, um, no. Points for trying, but no. She was very upset and seemed quite offended that I called her bluff on this.
After 5 hours of this BS I finally had enough and unloaded on the support person and the Technical Account Manager at Dell, and they got Microsoft involved. But because this wasn’t an OEM license but a Microsoft Volume License, they had to open a ticket under the normal “hotfix broke my machine” queue. This lovely queue has an 8 hour window for Microsoft to call you back. At this point, with one machine running, I went to get some sleep while I waited for the call back.
Working with Microsoft
It took a couple of tries to get in touch with the actual Microsoft engineer. For some reason when he called me at 10:30 in the morning it didn’t wake me up (I got to bed around 7 or so). I got up at 2 when my alarm went off so that I could get to my doctor’s appointment. I emailed the engineer and explained that the phone didn’t wake me, and asked for a call back at around 5. The engineer told me not a problem, he’d schedule a call for 5, but with another engineer as he was off work before then. No problem, I wouldn’t want to work late either if I didn’t have to.
Come 5pm, no call. Come 6pm, no call. Come 7pm, no call. So I call into the technical router and I’m told that there is an 8 hour call back window. I tell him it’s been 12 hours. He then tells me that it’s after hours and I’ll need to wait until the next day. At this point I pretty much go ballistic on him and get to a supervisor. The supervisor looked at the case, listened to my rant, understood that I was pissed off, and an MVP to boot, and found someone to help me.
Once I got on the phone with the engineers it took Microsoft 2 hours and 3 guys to find the problem, but the end result is that it’s a very simple problem, with an even simpler solution.
To understand the problem you have to dig into how the OS boots. When the OS starts, the drivers and services are organized into groups, and each group is started together. In this case we are concerned with the HTTP.SYS driver and the Cryptographic Services service. They are grouped together in a single group. The problem comes into play when you have websites which are secured by SSL certificates and the SSL certificate is hosted in IIS.
In some cases the HTTP.SYS driver will start and attempt to contact the Cryptographic Services service, which hasn’t yet started. Because of this the HTTP.SYS driver waits for the Cryptographic Services service to start. Now the problem comes into play with the services database which Windows maintains. When a service begins to start it takes a lock on this database. When the service finishes starting it releases the lock. So when HTTP.SYS begins to start it takes a lock, then looks for the Cryptographic Services service. When it can’t find the Cryptographic Services service it waits for it to be found. But because the HTTP.SYS driver hasn’t finished starting, it still holds the lock on the services database. We are now in a deadlock situation. HTTP.SYS is waiting for Cryptographic Services to start, but Cryptographic Services can’t start until HTTP.SYS finishes starting and releases the lock.
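As a rough illustration of the deadlock described above, here is a toy Python model. It is not how the Service Control Manager is actually implemented; the Service class, its needs_at_startup and depend_on fields, and the boot function are all made up for this sketch. The only point is to show why an undeclared startup need hangs the boot while a declared dependency does not.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    # Services this one talks to while starting (the loader does not know about these).
    needs_at_startup: list = field(default_factory=list)
    # Services the loader is told to start first (like a DependOnService list).
    depend_on: list = field(default_factory=list)


def boot(load_order, services):
    """Start services in load_order and return (running, stuck_on).

    stuck_on is None on a clean boot, otherwise the name of the service
    that holds the services database lock while waiting on something else.
    """
    running = []

    def start(name):
        if name in running:
            return None
        svc = services[name]
        # Declared dependencies are started before this service takes the lock.
        for dep in svc.depend_on:
            stuck = start(dep)
            if stuck:
                return stuck
        # From here on this service holds the services database lock.
        for need in svc.needs_at_startup:
            if need not in running:
                # The needed service cannot start without the lock: deadlock.
                return name
        running.append(name)  # finished starting, lock released
        return None

    for name in load_order:
        stuck = start(name)
        if stuck:
            return running, stuck
    return running, None


order = ["HTTP", "CryptSvc"]  # HTTP happens to be loaded first
broken = {
    "HTTP": Service("HTTP", needs_at_startup=["CryptSvc"]),
    "CryptSvc": Service("CryptSvc"),
}
fixed = {
    "HTTP": Service("HTTP", needs_at_startup=["CryptSvc"], depend_on=["CryptSvc"]),
    "CryptSvc": Service("CryptSvc"),
}

print(boot(order, broken))  # ([], 'HTTP')
print(boot(order, fixed))   # (['CryptSvc', 'HTTP'], None)
```

In the broken case nothing ever finishes starting, which matches a machine that sits and spins during boot. In the fixed case the load order is the same, but the declared dependency forces CryptSvc to start first.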
The resolution for this is sadly simple, much simpler than the 7 hours on the phone it took to get to this point. Set a startup dependency so that the HTTP.SYS driver depends on the Cryptographic Services service. Doing this requires a manual registry change, so all the normal warnings about editing the registry apply.
To fix the problem, open RegEdit. Navigate to HKLM > SYSTEM > CurrentControlSet > Services > HTTP. Create a new Multi-String Value named “DependOnService” (don’t forget that case counts). Edit the new value and type in CryptSvc (again, case counts). Reboot the server.
Now when the server boots up the HTTP service won’t attempt to start until after the Cryptographic Services service has finished starting.
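If you would rather not click through RegEdit on every web server, the same change can be expressed as a .reg file. The Python below is a minimal sketch that only builds the text of such a file; the function names multi_sz_hex and depend_on_service_reg are made up for this example. It assumes the usual .reg layout, where a Multi-String (REG_MULTI_SZ) value is written as hex(7) followed by UTF-16LE bytes. Importing the result overwrites any existing DependOnService value on that key, so try it on a test machine before trusting it in production.

```python
def multi_sz_hex(values):
    """Encode strings the way a .reg file stores a REG_MULTI_SZ value."""
    # Each string is UTF-16LE with a NUL terminator; one extra NUL ends the list.
    raw = b"".join(v.encode("utf-16-le") + b"\x00\x00" for v in values) + b"\x00\x00"
    return ",".join(f"{byte:02x}" for byte in raw)


def depend_on_service_reg(service="HTTP", depends_on=("CryptSvc",)):
    """Return the text of a .reg file that sets DependOnService on a service key."""
    key = "HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Services\\" + service
    return (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        f"[{key}]\n"
        f'"DependOnService"=hex(7):{multi_sz_hex(depends_on)}\n'
    )


print(depend_on_service_reg())
```

Save the printed text to a file with a .reg extension and import it the same way you would any other registry file, then reboot.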
So the end solution to fix the problem required adding 23 characters to the registry in the right location.
Why did it happen
I was told that sometimes this just happens. For some reason the Cryptographic Services service just takes longer to load. In this case, I had installed a bunch of hotfixes, several of which were security hotfixes. These would have made changes to the Cryptographic Services service, which would have slowed down its startup during boot while it tried to finish dealing with any leftover install pieces. This allowed the HTTP.SYS driver to start first, which started this whole mess.
A simple reboot didn’t fix the problem because the Cryptographic Services service was never allowed to do what it needed to do. Every time it tried to load, HTTP.SYS didn’t let it finish, so it would never start.
Finding more information
Where can you find more information about this? That’s an excellent question. At the moment you can’t. Microsoft has seen this several times now on Windows 2008, but they haven’t yet released the MSKB which says how to resolve the issue.
As best as I can tell this is the only info on the web about this so far.
The reason that I didn’t see the same problem on any other servers? Well, that’s because this cluster of servers is the only one that is hosting the SSL certificates themselves. For all the other clusters of servers the SSL certificates are hosted by our Cisco load balancer, so they don’t have this problem. If there are no certificates being used by IIS then the HTTP.SYS driver doesn’t try to talk to the Cryptographic Services service.
If you are having this same strange kind of issue, there are some things you can look at to see if you might be having the problem.
First, check to see if there is a lock being held on the services database. This can be done with the sc command line tool. From a command line run sc querylock.
If the following services aren’t running, then your HTTP.SYS driver may not have started completely.
- Print Spooler
- Routing and Remote Access
- SSDP Discovery
- UPnP Device Host
- Windows Event Collector
- World Wide Web Publishing Service (IIS)
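Checking the services in the list above by hand gets tedious across several machines, so here is a small Python sketch that does it with the sc query command. The short service names in the SERVICES table are the ones these services normally use on Windows Server 2008, but treat them as an assumption and confirm them on your own build. The function names is_running and not_running are made up for this example, and the parsing assumes the usual sc query output where the STATE line contains the word RUNNING.

```python
import subprocess

# Display name -> short service name (assumed defaults, verify on your servers).
SERVICES = {
    "Print Spooler": "Spooler",
    "Routing and Remote Access": "RemoteAccess",
    "SSDP Discovery": "SSDPSRV",
    "UPnP Device Host": "upnphost",
    "Windows Event Collector": "Wecsvc",
    "World Wide Web Publishing Service": "W3SVC",
}


def is_running(sc_query_output):
    """Return True if the STATE line of `sc query <name>` output says RUNNING."""
    for line in sc_query_output.splitlines():
        if line.strip().startswith("STATE"):
            return "RUNNING" in line
    return False  # no STATE line at all, e.g. the service does not exist


def not_running(run_sc=None):
    """Return the display names of the listed services that are not running."""
    if run_sc is None:
        def run_sc(short_name):
            # Only works on Windows, where sc.exe is on the path.
            result = subprocess.run(
                ["sc", "query", short_name], capture_output=True, text=True
            )
            return result.stdout
    return [
        display
        for display, short_name in SERVICES.items()
        if not is_running(run_sc(short_name))
    ]
```

On an affected server, calling not_running() with no arguments would return the services worth looking at. Keep in mind that some of these services may be stopped or disabled on purpose on your machines, so a name showing up in the result is a hint rather than proof.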
If you have SSL certificates on Windows 2008 it is recommended that you put this registry setting in place in order to prevent this problem from coming up until Microsoft is able to address it within Windows.
Hopefully this saves someone some of the headache that I had to deal with.
While it turned out that the problem wasn’t with COM+ like I suspected, I’m guessing that the COM+ Event System does actually depend on the Cryptographic Services service for encrypting any communication which it does, which would be what caused the error to be thrown (I’m guessing here).
Now if anyone from Dell would like to contact me to suck up to me and make me happy with Dell again, you’ve got my contact information.