So your ESXi host is stuck at a PSOD, the “Purple Screen of Death”. What do you do? One would figure it's hardware, but it could also be software related. I am going to show you how to download and review the error logs. Mind you, the method I describe assumes the host can boot up and connect to either vCenter or the VI Client. I will also show you a command you can run from the service console if you just want the support logs to send to VMware.
On to the information.
First, get the host back up and running. It may be unstable at the moment, but you should have enough time to pull the support logs.
Highlight the host in question. Click File (top left of the VI Client), then “Export”, then “Export System Logs”.
The next screen lets you select the system logs you would like to export; I just select them all. Click Next, choose where you want to export them to, then click Next again to start the export.
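If you would rather skip the client and work from the service console, the tool I would expect to use for a bundle to send to VMware is vm-support. Here is a minimal sketch, assuming vm-support is on the console's PATH and that running it with no options is enough for your case. The wrapper name collect_support_bundle is my own, not a VMware command.

```shell
# Hypothetical wrapper: run vm-support only if it is actually installed here.
collect_support_bundle() {
    if command -v vm-support >/dev/null 2>&1; then
        # With no options, vm-support collects its default set of logs.
        vm-support
    else
        echo "vm-support not found; run this on the host's console" >&2
        return 1
    fi
}
```

Run collect_support_bundle from a directory with enough free space, since the bundle can be large.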
Use a program like 7-Zip to extract the newly created file to a temporary location. Once it is extracted, you need to extract the result again. I know, they doubled up the compression, more so to keep the normal folk out! 🙂
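If your workstation has a shell handy, tar can do both passes in one go. This is only a sketch, assuming the exported file is a gzip-compressed tar (.tgz), which is the usual reason 7-Zip needs two rounds. The function name and the paths in the usage line are placeholders of mine.

```shell
# Unpack a gzip-compressed tar bundle into a destination directory.
# Usage: extract_bundle <bundle.tgz> <destination-dir>
extract_bundle() {
    bundle=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # -x extract, -z gunzip on the fly, -f archive file, -C target directory
    tar -xzf "$bundle" -C "$dest"
}
```

For example, extract_bundle ./exported-logs.tgz ./psod-logs would leave the extracted folders under ./psod-logs.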
Once everything is extracted you should see the following folders.
The most important one is the “Core” folder, which contains the kernel dump. The PSOD dumps what was in memory to a file called vmkernel-zdump.1, or something to that effect, and places it in that directory.
You will need something like Notepad++ to open the vmkernel-zdump file. Once it is open, search for “error”, “fail”, or “panic” and you should find your issue. In my example there is a memory bank error; see below.
2012-12-17T13:07:25.816Z cpu19:8211)MCE: 1278: CMCI on cpu19 bank9: Status:0x900000400800009f Misc:0x0 Addr:0x0: Valid.Err enabled.
2012-12-17T13:07:25.816Z cpu19:8211)MCE: 1282: Status bits: "Memory Controller Read Error."
2012-12-17T13:07:26.367Z cpu19:8211)MCE: 1278: CMCI on cpu19 bank9: Status:0x900000400800009f Misc:0x0 Addr:0x0: Valid.Err enabled.
2012-12-17T13:07:26.367Z cpu19:8211)MCE: 1282: Status bits: "Memory Controller Read Error."
2012-12-17T13:07:28.528Z cpu19:8211)MCE: 1278: CMCI on cpu19 bank9: Status:0x900000400800009f Misc:0x0 Addr:0x0: Valid.Err enabled.
2012-12-17T13:07:28.528Z cpu19:8211)MCE: 1282: Status bits: "Memory Controller Read Error."
2012-12-17T13:07:33.595Z cpu19:8211)MCE: 1278: CMCI on cpu19 bank9: Status:0x900000400800009f Misc:0x0 Addr:0x0: Valid.Err enabled.
2012-12-17T13:07:33.595Z cpu19:8211)MCE: 1282: Status bits: "Memory Controller Read Error."
Once you know what you are looking for, you can run memory diagnostics on your host and find the offending memory modules.
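The same keyword search works from a shell if you would rather not scroll through an editor. Below is a rough sketch, assuming grep, sort, and uniq are available wherever you extracted the logs. The function names are mine, and the file you pass in is whichever dump or log you are reviewing.

```shell
# Print every line mentioning an error, failure, or panic, with line numbers.
# -a treats the file as text in case the dump contains binary bytes.
search_log() {
    grep -a -inE 'error|fail|panic' "$1"
}

# Count corrected machine-check (CMCI) events per CPU and bank,
# so a repeat offender such as "cpu19 bank9" stands out.
summarize_mce() {
    grep -a -oE 'CMCI on cpu[0-9]+ bank[0-9]+' "$1" | sort | uniq -c
}
```

Running summarize_mce against a log like the excerpt above should show one line per CPU and bank pair with a count in front, which makes it easier to tell a single hiccup from a module that keeps failing.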
If you cannot see the vmkernel-zdump file, follow the steps below. Thanks to gsilver in the forums for this info.
If you don’t have a vmkernel-zdump in /root, you’ll need to retrieve it first. Look at your disk and find the “Unknown” partition (in my case /dev/cciss/c0d0p9):
fdisk -l /dev/cciss/c0d0
Disk /dev/cciss/c0d0: 146.7 GB, 146778685440 bytes
255 heads, 63 sectors/track, 17844 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device             Boot  Start    End  Blocks      Id  System
/dev/cciss/c0d0p1  *         1     65  522081      83  Linux
/dev/cciss/c0d0p2           66   1370  10482412+   83  Linux
/dev/cciss/c0d0p3         1371   1631  2096482+    82  Linux swap
/dev/cciss/c0d0p4         1632  17844  130230922+  f   Win95 Ext'd (LBA)
/dev/cciss/c0d0p5         1632   1892  2096451     83  Linux
/dev/cciss/c0d0p6         1893   2153  2096451     83  Linux
/dev/cciss/c0d0p7         2154   2414  2096451     83  Linux
/dev/cciss/c0d0p8         2415   2479  522081      83  Linux
/dev/cciss/c0d0p9         2480   2493  112423+     fc  Unknown
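If you do not feel like eyeballing the partition table, a small awk filter can pick the partition out for you. This is a sketch that assumes the dump partition is the one with Id fc, as in the listing above. The function name find_dump_partition is my own.

```shell
# Read `fdisk -l` output on stdin and print the first device whose Id is "fc".
# Fields are scanned rather than indexed because the optional Boot flag (*)
# shifts the column positions on some rows.
find_dump_partition() {
    awk '$1 ~ /^\/dev\// {
        for (i = 2; i <= NF; i++) {
            if ($i == "fc") { print $1; exit }
        }
    }'
}
```

On my host, fdisk -l /dev/cciss/c0d0 | find_dump_partition would be expected to print /dev/cciss/c0d0p9, which is the device to hand to vmkdump.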
Then get the dump
vmkdump -d /dev/cciss/c0d0p9
Then convert the binary dump into a readable log:
vmkdump -l vmkernel-zdump.1
Then you can analyze it:
tail -20 vmkernel-log.1
There you have it, ESXi troubleshooting at its finest! Good Luck! Any questions, you know where to find me.
Links Used to find this information.