Hello,
I have two EMC Clariion CX700s (which, for convenience, we call CX700-1 and CX700-2) and one recently added CX3-40 in my SAN environment, connected through a Fibre Channel switch fabric. Over the weekend, I experienced a hardware failure on SPA on CX700-1. After my vendor's technician replaced it and the SP rebooted successfully, all was fine until 2 hours later, when I got another unwelcome alert, this time about an uncorrectable sector (Data Sector Invalidated):
"Event Number 840
Severity Warning Host CX700-1_SPA
Storage Array APMxx SPA Device Bus 1 Enclosure 3 Disk 3
Description Data Sector Invalidated"
Since then, I've been getting the same "Data Sector Invalidated - Bus 1 Enclosure 3 Disk 3" error at least once every day. According to my vendor, the only solution is to unbind and then rebind the LUNs on this disk, since there is a good chance that some of the data within those LUNs has been corrupted, which would explain the uncorrectable sector error. :(
Has anyone else had this experience before? Is this a common occurrence with the Clariion CX-series products? As of now, I am still uncertain how the sector became uncorrectable and what caused the data corruption.
Unrelated to this, I am also having performance issues (in particular, forced flushing) every other week. Looking at the long term, I am getting quite concerned about my SAN environment. If possible, can anyone share some of their experiences with the EMC Clariion CX-series line?
Any input is greatly appreciated!
Thank you.
ASKED:
November 15, 2006 5:44 PM
UPDATED:
September 25, 2008 3:51 PM
While I haven’t had your exact problem, I had an issue with Flare 16 on my CX700 that prevented degraded RAID groups from invoking my hot spares. I had to unbind and bind the hot spares to get the array to invoke one of them. This was fixed by upgrading to Flare 19.
Are your performance issues occurring during your BBU cache check? When the BBU discharges and recharges, I believe the SP write cache is disabled. My check was set for Saturday at 01:00 AM, the same time as a large snapshot operation. Moving the check to a later time cleared up that issue.
All in all, I’ve been happy with the performance on my CX700; in the past two years, it’s been a real workhorse for us.
Thank you for the responses.
PSFischer: I am actually running Flare 19 on my two CX700s and Flare 22 on my CX3-40. I believe these Flare levels are the latest from EMC. I’ve also run the “sniffer” program, and from the ‘getsniffer’ output I am seeing 1 uncorrectable sector (a checksum error) on one of my LUNs on the affected disks. The LUN (which was on ATA disks) is now being rebuilt (unbind/rebind). Unfortunately, my logs did not show any disk failures, so I am still a little uncertain as to how the bad sector came about. But I suspect there may be a connection to the SPA failure that occurred hours earlier.
klewis: Unfortunately, my performance issues occur throughout the day, even during prime-time production hours (8am-5pm). Sometimes they last only 5 minutes, but other times there can be up to 2 hours of constant forced flushing.
However, I am seeing a pattern: forced flushing on the SAN usually coincides with a database (Oracle) refresh. Through Navisphere Analyzer, I’m seeing that when a DB refresh is occurring, a single 150GB LUN owned by this server starts to push out over 1500 write IO/s, and that is when the forced flushing begins. I am now wondering whether that is the limit my CX700 can take before performance degradation begins.
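A rough way to sanity-check that question is to compare the back-end disk load implied by 1500 host writes per second with what the RAID group behind the LUN can absorb. The sketch below is only a back-of-envelope estimate: the write penalties are the usual rule-of-thumb values, while the disk count (5) and per-disk figure (150 IO/s) are hypothetical placeholders, not measurements from this array. It also ignores write-cache coalescing and full-stripe writes, both of which reduce the real back-end load.

```python
# Rule-of-thumb back-end disk I/Os generated by one small random host write.
WRITE_PENALTY = {"raid10": 2, "raid5": 4, "raid6": 6}


def backend_write_iops(host_write_iops, raid_type):
    """Estimate disk-level I/Os per second caused by host writes."""
    return host_write_iops * WRITE_PENALTY[raid_type]


def sustainable(host_write_iops, raid_type, disks, iops_per_disk):
    """True if the RAID group can destage writes as fast as they arrive."""
    capacity = disks * iops_per_disk
    return backend_write_iops(host_write_iops, raid_type) <= capacity


if __name__ == "__main__":
    host_writes = 1500   # write IO/s reported for the single LUN
    disks = 5            # ASSUMPTION: disks in the RAID group
    iops_per_disk = 150  # ASSUMPTION: rough figure for one FC disk

    for raid in ("raid10", "raid5"):
        load = backend_write_iops(host_writes, raid)
        ok = sustainable(host_writes, raid, disks, iops_per_disk)
        print(f"{raid}: back-end load {load} IO/s, "
              f"group capacity {disks * iops_per_disk} IO/s, sustainable={ok}")
```

If the estimated back-end load stays well above the group's capacity for long, the write cache fills faster than it can be destaged, which is the condition under which forced flushing would be expected. Plugging in the real RAID type, disk count, and per-disk figures for the LUN would show whether the limit is the RAID group rather than the CX700 itself.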
Of course, my CX700-1 environment is most likely different from yours. I have about 50 hosts connected to it, with all 16 DAEs filled mostly with FC disks and a few ATA disks. I have Oracle PeopleSoft, email, and NFS server environments on my CX700-1.