Question

  Asked: Nov 15 2006   5:44 PM GMT
  Asked by: UniCal


EMC Clariion CX700 Problems - Data Sector Invalidated


Storage, Storage products and equipment, Adapters/Interfaces, Arrays, Fibre channel controllers/Host bus adapters, SAN, Storage management, Storage servers

Hello,
I have two EMC Clariion CX700s (which, for convenience, we call CX700-1 and CX700-2) and one recently added CX3-40 in my SAN environment, all connected through a switched Fibre Channel fabric. Over the weekend, I experienced a hardware failure on SPA on CX700-1. After my vendor's technician replaced it and the SP rebooted successfully, all was fine until two hours later, when I got another unwelcome alert, this time about an uncorrectable sector (Data Sector Invalidated):

"Event Number 840
Severity Warning Host CX700-1_SPA
Storage Array APMxx SPA Device Bus 1 Enclosure 3 Disk 3
Description Data Sector Invalidated"

Since then, I've been getting the same "Data Sector Invalidated - Bus 1 Enclosure 3 Disk 3" error at least once every day. According to my vendor, the only solution is to unbind and then rebind the LUNs on this disk, as there is a good chance that some of the data within the LUNs has been corrupted, hence the uncorrectable sector error. :(
Has anyone else had this experience before? Is this a common occurrence with the Clariion CX-series products? As of now, I am still uncertain how the sector became uncorrectable and what caused the data corruption.

Unrelated to this, I am also having performance issues (in particular, forced flushing) every other week. I am getting quite concerned about the long-term health of my SAN environment. If possible, can anyone share their experiences with the EMC Clariion CX-series line?

Any input is greatly appreciated!
Thank you.


Answer Wiki



I went through this exact same thing on one of our CX700s in the last couple of months. Here is what happened:
A disk had faulted and was being rebuilt to a hot spare. While the array was reading the remaining disks in the RAID group to rebuild the failed disk onto the hot spare, it encountered a sector read error and no longer had all the parity and data it needed to rebuild that stripe (in effect, a double fault at the data-stripe level across the RAID group). When FLARE encounters this condition, it has no choice but to mark the sector as invalid.
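
To illustrate why a second unreadable sector in the same stripe cannot be recovered, here is a minimal sketch of generic single-parity (RAID 5 style) XOR reconstruction. It is not FLARE's actual implementation; the function names `xor_blocks` and `rebuild_missing` and the sample bytes are made up for illustration.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(stripe: List[Optional[bytes]]) -> Optional[bytes]:
    """Rebuild the single missing block (None) of a single-parity stripe.

    Returns None when the stripe cannot be rebuilt this way: with no
    missing block there is nothing to rebuild, and with two or more
    missing blocks single parity does not hold enough information.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        return None
    survivors = [block for block in stripe if block is not None]
    return xor_blocks(survivors)


# Hypothetical stripe: three data blocks plus one parity block.
data = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
parity = xor_blocks(data)

# Single fault: one disk lost, every other block in the stripe readable.
single_fault = rebuild_missing([data[0], None, data[2], parity])

# Double fault: one disk lost AND a sector read error on a second disk.
double_fault = rebuild_missing([data[0], None, None, parity])
```

In the single-fault case the lost block comes back intact. In the double-fault case there is nothing left to reconstruct from, which corresponds to the situation where the array has to invalidate the sector.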

This situation is typically caused by a hardware problem, but there are software tools in place to minimize the risk of it happening. A "sniffer" program is incorporated into the FLARE code that continually scans each LUN to ensure that it can read all sectors and that the data and parity calculations match. If the sniffer encounters an error, it attempts to remap the sector. Release 19 of the code vastly improves the way the sniffer operates, allowing it to run at a much faster rate without impacting the hosts.
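
The kind of background check described above can be sketched as a simple parity scrub. This is only a generic illustration of the idea, assuming single-parity XOR stripes; it does not reflect how the FLARE sniffer is actually written, and `parity_of` and `scrub` are hypothetical names.

```python
from functools import reduce
from typing import List, Tuple


def parity_of(blocks: List[bytes]) -> bytes:
    """Compute XOR parity over equal-length data blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def scrub(stripes: List[Tuple[List[bytes], bytes]]) -> List[int]:
    """Return the indexes of stripes whose stored parity does not match
    the parity recomputed from the data blocks."""
    bad = []
    for index, (data_blocks, stored_parity) in enumerate(stripes):
        if parity_of(data_blocks) != stored_parity:
            bad.append(index)
    return bad


# Hypothetical stripes: (data blocks, stored parity).
good = ([b"\x01", b"\x02"], b"\x03")
stale = ([b"\x01", b"\x02"], b"\xff")  # parity no longer matches the data

mismatches = scrub([good, stale, good])
```

Finding a mismatch while the rest of the stripe is still readable gives the array a chance to repair it before a disk failure turns it into an unrecoverable double fault.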

In our situation, we were extremely lucky that the affected LUNs were not in use, because they did need to be unbound and rebound, and EMC then had to physically reseat the hot spare and the original failed disk before we could bind and use the LUNs again.

What version of FLARE are you on? I am guessing it is pre-19, and if so, you need to upgrade to 19 as soon as possible, especially if you have ATA drives in your Clariion.


Discuss This Answer



klewis  |   Nov 16 2006  9:40AM GMT

While I haven’t had your exact problem, I had an issue with Flare 16 on my CX700 that prevented degraded RAID groups from invoking my hot spares. I had to unbind and bind the hot spares to get the array to invoke one of them. This was fixed by upgrading to Flare 19.

Are your performance issues occurring during your BBU cache check? When the BBU discharges and recharges, I believe the SP write cache is disabled. My check was set for Saturday at 01:00 AM, the same time as a large snapshot operation. Moving the check to a later time cleared up that issue.

All in all, I’ve been happy with the performance on my CX700; in the past two years, it’s been a real workhorse for us.

 

UniCal  |   Nov 16 2006  2:28PM GMT

Thank you for the responses.

PSFischer: I am actually running FLARE 19 on my two CX700s and FLARE 22 on my CX3-40. I believe these FLARE levels are the latest from EMC. I’ve also run the “sniffer” program, and the ‘getsniffer’ output shows 1 uncorrectable sector (a checksum error) on one of my LUNs on the affected disks. The LUN (which was on ATA disks) is now being rebuilt (unbind/rebind). Unfortunately, I did not see any disk failures in my logs, so I am still a little uncertain how the bad sector came about. But I suspect there may be a connection to the SPA failure that occurred hours earlier.

klewis: Unfortunately, my performance issues occur throughout the day, even during prime production hours (8am-5pm). Sometimes they last only 5 minutes, but other times there can be up to 2 hours of constant forced flushing :(
However, I am seeing a pattern: forced flushing on the SAN usually occurs while a database (Oracle) refresh is running. Through Navisphere Analyzer, I’m seeing that during a DB refresh, a single 150GB LUN owned by this server starts to push out over 1500 write IO/s, and that is when the forced flushing begins. I am now wondering whether that is the limit my CX700 can take before performance degradation begins.
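
One rough way to reason about this is to treat the write cache as a buffer that fills whenever hosts write faster than the back-end disks can destage. The sketch below is a simplified model only, not how the array actually manages its cache. Apart from the 1500 write IO/s figure above, every number in it (free cache, destage rate, IO size) is a made-up assumption, and `seconds_until_cache_full` is a hypothetical helper.

```python
from typing import Optional


def seconds_until_cache_full(
    free_cache_mb: float,
    incoming_iops: float,
    drain_iops: float,
    io_size_kb: float,
) -> Optional[float]:
    """Estimate how long a write burst can run before the cache fills.

    Returns None when the back end drains at least as fast as the
    hosts write, in which case this model never fills the cache.
    """
    net_iops = incoming_iops - drain_iops
    if net_iops <= 0:
        return None
    net_kb_per_second = net_iops * io_size_kb
    return free_cache_mb * 1024 / net_kb_per_second


# 1500 write IO/s comes from the Analyzer observation above; the other
# values are illustrative assumptions, not measurements from this array.
burst = seconds_until_cache_full(
    free_cache_mb=1024, incoming_iops=1500, drain_iops=1000, io_size_kb=8
)

# If the disks can keep up with the incoming rate, the cache never fills.
sustained = seconds_until_cache_full(
    free_cache_mb=1024, incoming_iops=1500, drain_iops=1500, io_size_kb=8
)
```

Under this model the limit is not a fixed IO/s number for the array. It depends on how fast the RAID group behind that one LUN can destage, so a refresh lasting longer than the computed window would be expected to push the cache into forced flushing.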

Of course, my CX700-1 environment is most likely different from yours. I have about 50 hosts connected to it, with all 16 DAEs filled mostly with FC disks and a few ATA disks. It hosts Oracle PeopleSoft, email, and NFS server environments.