The Real (and Virtual) Adventures of Nathan the IT Guy

Dec 18 2012   2:22PM GMT

How to troubleshoot a Purple Screen of Death on an ESXi Host

Nathan Simon

So your ESXi host is stuck at a PSOD, or “Purple Screen of Death”. What do you do? You might figure it's hardware, but it could also be software related. I am going to show you how to download and review the error logs. Mind you, the method I describe assumes the host can boot up and connect to either vCenter or the VI Client. I will also show you a command you can run from the service console if you just want the support logs to send to VMware.

On to the information.

First you want to have the host back up and running. It could be unstable at the moment, but you should have enough time to pull the support logs.

Highlight the host in question. Click on File (top left of the VI Client), then click on “Export”, then “Export System Logs”.

Your next screen will allow you to select the system logs you would like to export; I just select them all. Once you click Next you can select where you want to export them to. Click Next to start the export.
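If you only need a bundle to hand to VMware support, the console route is the vm-support script that ships with ESX and ESXi. A minimal sketch, assuming you are logged in to the host's console or an SSH session as root; collect_support_bundle is just an illustrative wrapper name, and the guard is there so the snippet fails cleanly on any machine that does not have the script:

```shell
# Hypothetical wrapper: run vm-support only if this machine has it.
collect_support_bundle() {
    if command -v vm-support >/dev/null 2>&1; then
        # With no arguments, vm-support gathers the host logs into a .tgz bundle.
        vm-support
    else
        echo "vm-support not found; run this on the ESX/ESXi host itself" >&2
        return 1
    fi
}
```

Call collect_support_bundle on the host, then copy the resulting bundle off the box.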

Use a program like 7-Zip to extract the newly created file to a temporary location. Once it is extracted you need to extract again. I know, they doubled up the compression, more so to keep the normal folk out! :)
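On a Linux or Mac workstation the double extraction collapses into one step, because tar can undo the gzip layer and the tar layer together. A sketch, assuming the exported bundle is a gzip-compressed tar (.tgz); the function name and the file and folder names are placeholders:

```shell
# Extract a .tgz support bundle into a scratch directory.
# Usage: extract_bundle <bundle.tgz> <destination-dir>
extract_bundle() {
    bundle=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # -x extract, -z undo gzip, -f read the named file, -C unpack into dest
    tar -xzf "$bundle" -C "$dest"
}

# Example (placeholder names):
# extract_bundle esx-support-bundle.tgz /tmp/psod-logs
```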

Once everything is extracted you should see a number of folders.

The most important one is the “Core” folder, which contains the kernel dump. The PSOD will dump what was in memory to a file called vmkernel-zdump.1, or something to that effect, and place it in that directory.

You will have to use something like Notepad++ to open the vmkernel-zdump file. Once you do, you can pretty much search for “error”, “fail” or “panic” and you should find your issue. In my example, there is a memory bank error; see below.

2012-12-17T13:07:25.816Z cpu19:8211)MCE: 1278: CMCI on cpu19 bank9: Status:0x900000400800009f Misc:0x0 Addr:0x0: Valid.Err enabled.

2012-12-17T13:07:25.816Z cpu19:8211)MCE: 1282: Status bits: “Memory Controller Read Error.”

2012-12-17T13:07:26.367Z cpu19:8211)MCE: 1278: CMCI on cpu19 bank9: Status:0x900000400800009f Misc:0x0 Addr:0x0: Valid.Err enabled.

2012-12-17T13:07:26.367Z cpu19:8211)MCE: 1282: Status bits: “Memory Controller Read Error.”

2012-12-17T13:07:28.528Z cpu19:8211)MCE: 1278: CMCI on cpu19 bank9: Status:0x900000400800009f Misc:0x0 Addr:0x0: Valid.Err enabled.

2012-12-17T13:07:28.528Z cpu19:8211)MCE: 1282: Status bits: “Memory Controller Read Error.”

2012-12-17T13:07:33.595Z cpu19:8211)MCE: 1278: CMCI on cpu19 bank9: Status:0x900000400800009f Misc:0x0 Addr:0x0: Valid.Err enabled.

2012-12-17T13:07:33.595Z cpu19:8211)MCE: 1282: Status bits: “Memory Controller Read Error.”
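The search that surfaces lines like these can also be run from a command line instead of a text editor. A minimal sketch, assuming the log is readable text on a machine with grep; find_trouble is an illustrative name, not a VMware tool:

```shell
# Print every line mentioning error, fail or panic, with its line number.
# Usage: find_trouble <logfile>
find_trouble() {
    # -n line numbers, -i ignore case, -E extended regex for the alternation
    grep -niE 'error|fail|panic' "$1"
}
```

Ignoring case matters here, since the log writes “Error” with a capital E.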

Once you know what you are looking for, you can go ahead and run memory diagnostics on your host and find the offending memory modules.
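Before you start pulling modules, it can help to tally which CPU and bank the error reports cluster on. A rough sketch, assuming log lines shaped like the CMCI entries in my example; tally_mce is a made-up helper name:

```shell
# Count CMCI machine-check reports per CPU and bank.
# Usage: tally_mce <logfile>
tally_mce() {
    # Pull the "cpuN bankM" pair out of each CMCI line, then count duplicates.
    sed -n 's/.*CMCI on \(cpu[0-9]*\) \(bank[0-9]*\):.*/\1 \2/p' "$1" |
        sort | uniq -c | awk '{ print $2, $3, $1 }'
}
```

Each output line is a CPU, a bank and a count. Run against the excerpt in this post, it reports cpu19 bank9 with a count of 4.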

If you cannot see the vmkernel-zdump file, follow the steps below. Thanks to gsilver in the forums for this info.

If you don’t have a vmkernel-zdump in /root, you’ll need to retrieve it first. Look at your disk and find the “Unknown” partition (in my case /dev/cciss/c0d0p9):

fdisk -l /dev/cciss/c0d0

Disk /dev/cciss/c0d0: 146.7 GB, 146778685440 bytes
255 heads, 63 sectors/track, 17844 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

           Device Boot    Start       End    Blocks   Id  System
/dev/cciss/c0d0p1   *         1        65    522081   83  Linux
/dev/cciss/c0d0p2            66      1370  10482412+  83  Linux
/dev/cciss/c0d0p3          1371      1631   2096482+  82  Linux swap
/dev/cciss/c0d0p4          1632     17844 130230922+   f  Win95 Ext'd (LBA)
/dev/cciss/c0d0p5          1632      1892   2096451   83  Linux
/dev/cciss/c0d0p6          1893      2153   2096451   83  Linux
/dev/cciss/c0d0p7          2154      2414   2096451   83  Linux
/dev/cciss/c0d0p8          2415      2479    522081   83  Linux
/dev/cciss/c0d0p9          2480      2493    112423+  fc  Unknown
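Partition type fc is the VMware core dump type, which older fdisk builds label “Unknown”, so you can pick the partition out by its Id rather than by eye. A sketch, assuming the classic fdisk column layout shown in the listing above; find_dump_partition is an illustrative name:

```shell
# Print the device name of every partition whose Id column is "fc".
# Reads `fdisk -l` output on stdin.
find_dump_partition() {
    awk '/^\/dev\// {
        # A "*" in the Boot column shifts the remaining columns right by one.
        id = ($2 == "*") ? $6 : $5
        if (id == "fc") print $1
    }'
}

# Example: fdisk -l /dev/cciss/c0d0 | find_dump_partition
```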

Then get the dump:

vmkdump -d /dev/cciss/c0d0p9

Then convert the binary dump into a useful log:

vmkdump -l vmkernel-zdump.1

Then you can analyze it:

tail -20 vmkernel-log.1

There you have it, ESXi troubleshooting at its finest! Good Luck! Any questions, you know where to find me.

Links used to find this information:

Collecting diagnostic information for VMware ESX/ESXi using the vSphere Client

Notepad++
