EMC Clariion CX700 Problems – Data Sector Invalidated

Tags: Fibre channel, Storage management, Storage servers
Hello, I have two EMC Clariion CX700s (which, for convenience, I'll call CX700-1 and CX700-2) and one recently added CX3-40 in my SAN environment, using a fibre-switch fabric. Over the weekend, I experienced a hardware failure on SP A of CX700-1. After my vendor's technician replaced it and the SP rebooted successfully, all was fine until two hours later, when I got another unwelcome alert, this time about an uncorrectable sector (Data Sector Invalidated):

"Event Number 840 Severity Warning Host CX700-1_SPA Storage Array APMxx SPA Device Bus 1 Enclosure 3 Disk 3 Description Data Sector Invalidated"

Since then, I've been getting the same "Data Sector Invalidated - Bus 1 Enclosure 3 Disk 3" error at least once every day. The only solution, as I gathered from my vendor, is to unbind and then rebind the LUNs on this disk, as there is a good chance some of my data within those LUNs has been corrupted -- hence the uncorrectable sector error. :(

Has anyone else had this experience before? Is this a common occurrence with the Clariion CX-series? As of now, I am still uncertain how the sector became uncorrectable and what led to the data corruption.

Unrelated to this, I am also having performance issues (in particular, forced flushing) every other week. Looking at the long term, I am getting quite concerned about my SAN environment. If possible, can anyone share some of their experiences with the EMC Clariion CX-series line? Any input is greatly appreciated! Thank you.

Answer Wiki


I went through this exact same thing on one of our CX700s in the last couple of months. Here is what happened:
A disk had faulted and was being rebuilt to a hot spare. While the disks in the RAID group were being read to rebuild the failing disk onto the hot spare, the array encountered a sector read error, but no longer had all the parity and data to rebuild with (in effect, a double fault at the data-stripe level across the RAID group). When FLARE encounters this condition, it has no choice but to mark the sector as invalid.
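To see why a single bad sector during a rebuild becomes unrecoverable, here is a minimal sketch of RAID-5-style XOR parity (illustrative Python, not Flare's actual implementation): with one disk already missing, its data can be recovered by XOR-ing the survivors, but once a second read error hits the same stripe, no combination of reads can reconstruct either missing block.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR byte blocks together (the RAID-5 parity calculation)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# A toy 4-disk stripe: three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# One disk fails: its block can be rebuilt from the survivors + parity.
rebuilt = xor_blocks([data[1], data[2], parity])
assert rebuilt == data[0]

# Double fault: a second block in the same stripe is unreadable, so only
# two of the four members remain. XOR-ing them yields data[0] ^ data[1],
# which cannot be separated into the two original blocks -- the sector
# has to be marked invalid.
survivors = [data[2], parity]  # data[0] lost, data[1] unreadable
```

This is exactly the double-fault condition described above: the redundancy budget of RAID-5 is one failure per stripe, and the rebuild consumed it before the read error appeared.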

This situation is typically a hardware problem, but there are software tools in place to minimize the risk of it happening. A "sniffer" program is incorporated into the FLARE code that continually scans each LUN to ensure it can read all sectors and that data and parity calculations match. If the sniffer encounters an error, it attempts to remap the sector. Release 19 of the code has vast improvements in the way the sniffer operates, allowing it to run at a much faster rate without impacting the hosts.
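The sniffer described above is essentially a background scrub: read every sector, verify its checksum (and parity), and remap anything that fails while redundancy still exists. A toy sketch of the idea in Python (hypothetical names and a CRC32 stand-in for the on-disk checksum; Flare's actual algorithm is proprietary):

```python
import zlib

def scrub(lun, spare_sectors):
    """Toy background-verify loop: read every (data, checksum) sector,
    detect mismatches, and relocate bad sectors to spares while the
    data is still recoverable. Illustrative only, not Flare's code."""
    remapped = []
    for i, (data, stored_crc) in enumerate(lun):
        if zlib.crc32(data) != stored_crc:
            # Checksum mismatch caught early: relocate to a spare sector
            # and rewrite the data with a fresh checksum.
            new_loc = spare_sectors.pop()
            lun[i] = (data, zlib.crc32(data))
            remapped.append((i, new_loc))
    return remapped

# A toy LUN of (data, checksum) pairs, with sector 2 gone stale.
lun = [(b"ok", zlib.crc32(b"ok")) for _ in range(4)]
lun[2] = (b"ok", 0xDEADBEEF)  # simulated bad sector
moves = scrub(lun, spare_sectors=[99])
```

The point of running this continuously is timing: a latent bad sector found while the RAID group is healthy can be remapped from redundancy, whereas the same sector discovered mid-rebuild becomes the double fault described above.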

In our situation, we were extremely lucky that the affected LUNs were not in use, because they did need to be unbound and rebound, and EMC actually had to physically reseat both the hot spare and the original failed disk before we could bind and use the LUNs again.

What version of FLARE are you running? I am guessing it is pre-19; if so, you should upgrade to Release 19 as soon as possible, especially if you have ATA drives in your Clariion.

Discuss This Question: 2 Replies

  • Klewis
    While I haven't had your exact problem, I had an issue with FLARE 16 on my CX700 that prevented degraded RAID groups from invoking my hot spares. I had to unbind and rebind the hot spares to get the array to invoke one of them. This was fixed by upgrading to FLARE 19. Are your performance issues occurring during your BBU cache check? When the BBU discharges and recharges, I believe the SP write cache is disabled. My check was set for Saturday at 01:00 AM, the same time as a large snapshot operation. Moving the check to a later time cleared up that issue. All in all, I've been happy with the performance of my CX700; in the past two years, it's been a real workhorse for us.
  • UniCal
    Thank you for the responses.

    PSFischer: I am actually running FLARE 19 on my two CX700s and FLARE 22 on my CX3-40. I believe these FLARE levels are the latest from EMC. I've also run the "sniffer" program, and from the 'getsniffer' output I am seeing 1 uncorrectable sector (a checksum error) on one of the LUNs on the affected disks. That LUN (which was on ATA disks) is now being rebuilt (unbind/rebind). Unfortunately, my logs did not show any disk failures, so I am still a little uncertain as to how the bad sector arose in my situation. But I suspect there may be a connection to the SP A failure that occurred hours earlier.

    klewis: Unfortunately, my performance issues occur throughout the day, even during prime production hours (8am-5pm). Sometimes they last for only 5 minutes, but other times I see up to 2 hours of constant forced flushing :( However, I am seeing a pattern: when a database (Oracle) refresh is occurring, that is usually when forced flushing occurs on the SAN as well. Through Navisphere Analyzer, I can see that when a DB refresh is running, a single LUN (150 GB) owned by that server starts to push out over 1500 write IO/s, and that is when the forced flushing begins. I am now wondering whether that is the limit my CX700 can take before performance degradation begins. Of course, my CX700-1 environment is most likely different from yours. I have about 50 hosts connected to it, with all 16 DAEs filled mostly with FC disks and a few ATA disks. I have Oracle PeopleSoft, email, and NFS server environments on my CX700-1.
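As a rough sanity check on that forced-flushing pattern, here is some back-of-the-envelope arithmetic (a sketch in Python; the write-cache size, destage rate, and 8 KB I/O size are assumptions for illustration, not measured values): 1500 write IO/s is only on the order of 12 MB/s, but if the back-end disks can't destage that fast, the SP write cache fills in minutes and the array has no choice but to force-flush.

```python
# Back-of-the-envelope estimate of how quickly a sustained write burst
# can fill an SP write cache. All figures below are assumptions: the
# CX700's usable write cache is on the order of a few GB, and Oracle
# commonly writes 8 KB blocks.
write_iops = 1500              # observed during the DB refresh
io_size_kb = 8                 # assumed Oracle block size
cache_mb = 3072                # assumed usable write-cache size (MB)
destage_mb_s = 5               # assumed back-end destage rate (MB/s)

inflow_mb_s = write_iops * io_size_kb / 1024   # incoming write rate
net_fill_mb_s = inflow_mb_s - destage_mb_s     # rate the cache grows at

if net_fill_mb_s > 0:
    minutes_to_full = cache_mb / net_fill_mb_s / 60
    print(f"inflow {inflow_mb_s:.1f} MB/s; cache full in "
          f"~{minutes_to_full:.0f} minutes, then forced flushing")
else:
    print("destage keeps up; no forced flushing expected")
```

The takeaway is that forced flushing is governed by the destage rate of the LUN's underlying disks, not by the cache size alone, which is why moving a hot LUN like this onto more (or faster) spindles in its RAID group is the usual remedy.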
