Posted by: Denny Cherry
So a few years ago a new storage concept was introduced to the market. That platform is now known as the IBM XIV platform. What makes this system so different from every other storage platform on the market is that it doesn’t have any hardware level RAID like a traditional storage array does. Instead, the system assigns 1 Meg chunks of space in pairs on disks throughout the system so that there is always a redundant copy of the data.
The hardware that makes up the system is relatively standard hardware. Each shelf of storage is actually a 3U server with up to two CPUs (depending on whether the shelf is a data module or an interface module, I’ll explain the differences in a little bit) and 8 Gigs of memory for use as both system memory and read/write cache. Because of this architecture, as you add more shelves you also add more CPUs and more cache to the system.
As I mentioned above there are two kinds of shelves, interface modules and data modules. They are effectively the same equipment with some slight additions for the interface modules. Each module is a single chip quad core server with 8 Gigs of RAM and 4 one Gig Ethernet ports for back-end connectivity (I have it on good authority that this will be increasing to a faster back end in the future). Each shelf contains 12 1TB or 2TB SATA hard drives spinning at 7200 RPM. The interface modules have a second quad core CPU, 4 four Gig fibre channel ports, and 2 one Gig iSCSI ports.
Now the system comes with a minimum of 6 shelves which gives you 2 interface modules and 4 data modules. From there you can upgrade to a 9 shelf system which gives you 4 interface modules and 5 data modules. After you have the 9 shelf system you can upgrade to anywhere from 10 to 15 shelves, with new interface modules being added at 11 and 13 shelves. There’s a nice chart in this IBM PDF (down on page 2) which shows how many fibre and iSCSI ports you get with each configuration.
All these modules are tied together through two redundant 1 Gig network switches which use iSCSI to talk back and forth between the shelves. My contacts at IBM tell me that they haven’t ever had a customer max out the iSCSI back plane, but personally I see the potential for a bottleneck. Because of how distributed the system is I can see why they haven’t, but if things don’t balance across the interface modules just right there is bottleneck potential here (my contacts tell me that the next hardware version of the product should have a faster back plane, so this is something they are addressing). There’s a nice picture in this IBM PDF which shows how the modules talk to each other, and how the servers talk to the storage modules.
The really nice thing about this system is that as you grow the system you add processing power and cache as well as fibre and iSCSI ports, so that should really help eliminate any bottlenecks. The downside that I see here is that the cost to get into the system is probably a little higher than some of the competitors’ products, as you can’t get a system with fewer than 6 shelves.
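To put some rough numbers on that scaling, here is a small Python sketch that adds up the per-module figures quoted above for each shelf count. The function names are made up for illustration, the port counts simply multiply the per-module numbers from this post (IBM's chart is the authority on what is actually enabled in each configuration), and the mirrored capacity is just raw space divided by two, ignoring spare space and metadata.

```python
# Per-module figures as quoted in this post; real shipping configs may differ.
DISKS_PER_MODULE = 12
CACHE_GB_PER_MODULE = 8
FC_PORTS_PER_INTERFACE = 4
ISCSI_PORTS_PER_INTERFACE = 2


def interface_modules(shelves: int) -> int:
    """Interface module count for a given shelf count, per the post."""
    if shelves == 6:
        return 2
    if 9 <= shelves <= 15:
        # 4 at 9 shelves, one more added at 11 and another at 13.
        return 4 + (shelves >= 11) + (shelves >= 13)
    raise ValueError("the post only describes 6 and 9-15 shelf systems")


def totals(shelves: int, disk_tb: int = 1) -> dict:
    """Aggregate resources for a system with the given number of shelves."""
    iface = interface_modules(shelves)
    raw_tb = shelves * DISKS_PER_MODULE * disk_tb
    return {
        "interface_modules": iface,
        "data_modules": shelves - iface,
        # One CPU per module, plus a second CPU in each interface module.
        "cpus": shelves + iface,
        "cache_gb": shelves * CACHE_GB_PER_MODULE,
        "fc_ports": iface * FC_PORTS_PER_INTERFACE,
        "iscsi_ports": iface * ISCSI_PORTS_PER_INTERFACE,
        "raw_tb": raw_tb,
        # Two copies of everything; ignores spare space and metadata.
        "mirrored_tb": raw_tb / 2,
    }


if __name__ == "__main__":
    for shelves in (6, 9, 15):
        print(shelves, totals(shelves))
```

Running it for 6, 9 and 15 shelves shows how CPUs, cache and ports all climb together as shelves are added.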
How it works
From what I’ve seen this whole thing is pretty cool when you start throwing data at it. When you create a LUN and assign it to a host the system doesn’t really do a whole lot. As the write requests start coming in it starts writing two copies of the data at all times to the disks in the array. As the data is written, one copy is written to an interface module and one copy is written to a data module. This way even if an entire interface module were to fail, there would be no loss of data. That’s right, looking back to the hardware config, we can lose all 12 disks in a shelf and not lose any data, because that data is duplicated to a data module. So if we had the largest 15 shelf system with all 6 interface modules, we could lose 5 of those 6 interface modules and not lose any data on the system. Now if we had a heavily loaded system we might start to see performance problems as we start to max out the fibre on the front end ports, or the 1 Gig back end interconnect ports, until those interface modules are replaced, but that’s probably an acceptable problem to have as long as the data is intact.
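The write path described above can be sketched in a few lines of Python. This is only a toy model: the random placement, the module names, and the helper functions are all assumptions for illustration, not IBM's actual distribution algorithm. It just follows the rule from this post that each 1 Meg chunk gets one copy on an interface module and one copy on a data module.

```python
import random

CHUNK_MB = 1
DISKS_PER_MODULE = 12


def place_chunks(lun_mb, interface_modules, data_modules, seed=0):
    """Return one ((module, disk), (module, disk)) pair per 1 MB chunk.

    Hypothetical placement: disks are picked at random, with one copy on an
    interface module and the other on a data module.
    """
    rng = random.Random(seed)
    placement = []
    for _ in range(lun_mb // CHUNK_MB):
        primary = (rng.choice(interface_modules), rng.randrange(DISKS_PER_MODULE))
        mirror = (rng.choice(data_modules), rng.randrange(DISKS_PER_MODULE))
        placement.append((primary, mirror))
    return placement


def chunks_lost(placement, failed_modules):
    """Count chunks whose two copies both sit on failed modules."""
    failed = set(failed_modules)
    return sum(1 for a, b in placement if a[0] in failed and b[0] in failed)


if __name__ == "__main__":
    interface = ["if1", "if2"]
    data = ["d1", "d2", "d3", "d4"]
    layout = place_chunks(1024, interface, data)  # a 1 GB LUN
    print("lost with one interface module down:", chunks_lost(layout, ["if1"]))
    print("lost with if1 and d1 both down:", chunks_lost(layout, ["if1", "d1"]))
```

In this model, taking out any number of interface modules never loses a chunk because the second copy always lives on a data module. Losing an interface module and a data module at the same time is a different story.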
Because there is no RAID there’s no parity overhead to deal with, which keeps everything nice and fast. Because the disks aren’t paired up 1 to 1 like they would be in a series of RAID 1 arrays, if a disk fails the rebuild time is much quicker because the data is coming from lots of different source disks, so the odds of a performance problem during that rebuild operation are next to nothing.
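A back-of-the-envelope Python sketch shows why spreading the rebuild across many disks helps. Everything here is an assumption for illustration: the function name, the flat per-disk copy rate, and especially the number of target disks, since how widely the rebuilt data gets written is not something I can confirm from what I've seen.

```python
def rebuild_estimate(used_gb, source_disks, target_disks, disk_mb_s):
    """Return (GB each source disk must read, estimated rebuild minutes).

    Assumes every disk sustains disk_mb_s during the rebuild and that the
    copy is limited by whichever side (readers or writers) has fewer disks.
    """
    per_source_gb = used_gb / source_disks
    bottleneck_disks = min(source_disks, target_disks)
    minutes = used_gb * 1024 / (disk_mb_s * bottleneck_disks) / 60
    return per_source_gb, minutes


if __name__ == "__main__":
    # Hypothetical numbers: 1000 GB to re-protect at 50 MB/s per disk.
    print("RAID 1 pair:", rebuild_estimate(1000, 1, 1, 50))
    print("distributed:", rebuild_estimate(1000, 100, 100, 50))
```

The first number is the interesting one for performance: in a RAID 1 pair the one surviving disk has to read everything, while in the distributed case each source disk only contributes a small slice.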
The system is able to keep everything running very quickly because every LUN is evenly distributed across every disk. When you create a LUN within the management tool it correctly sizes the LUN for you for maximum performance. While this will cost you a few gigs of space here or there, the performance benefits are going to greatly outweigh the lost storage space, especially when you remember that these are SATA disks, so the cost per Gig is already very low.
My Thoughts on the System
Now I’ve only had a couple of hours to work with one of these units. I’d really like to get access to one for a week or two to really pound on the system and really beat the crap out of the system to see what I can really make the system do (hint, hint IBM).
The potential IO that this system can serve up to a host server, such as a SQL Server, is massive. Once you load up a few high IO servers against it the system should be able to handle the load pretty well. The odds of getting a physical hot spot on one disk are pretty low since the LUNs aren’t laid out in the same manner on each disk (in other words the first meg of each LUN isn’t on disk 1, the second meg of each LUN isn’t on disk 2, etc).
The management tools for the XIV system are pretty cool. They make data replication between two arrays very easy. It’s just a quick wizard and the data is moving between the systems. One very cool thing about the management tools, and where this system is a step above other arrays, is that the XIV is aware of how much space has been used in each LUN. This makes disk management much easier at companies where the storage admins don’t have server access and the server admins don’t have access to the storage array, as each can monitor free space from their respective sides. That gives a better chance of someone seeing full disks sooner and being able to do something about it before it becomes a problem.
As with every RAID system, if you lose the right two disks you’ll have some data loss. If you have a standard RAID 5 array and you lose any 2 disks in the array then you lose all the data on the array. If you have a RAID 10 array and you lose a matching pair of disks then you lose everything on the array. With the XIV system if you lose two disks you’ll probably be ok, as the odds that you lose two disks that have the same 1 Meg block of data on them are very slim, but if you did lose those two disks before the system was able to rebuild you could lose the data on that LUN, or at least some of the data on the LUN. Now IBM’s docs say that the system rebuilds from a failed disk to a hot spare within 40 minutes or less (page 4), but I’d want to see this happening under a massive load before I would put my stamp on this.
Overall I would say that the XIV platform looks pretty stable. From what I’ve heard about the next generation of the hardware, it appears that most if not all of the issues that I see with the platform will be resolved. The one thing which I’d really like to see would be three copies of each block of data throughout the system, as the odds of losing three disks all containing the same 1 Meg block of data would be next to 0. Maybe this will be a configuration option with the 2TB disks, or maybe when the 3TB disks come out (whenever that happens). But then again, I’m a DBA so I love multiple copies of everything.
Now I’m sure that some of the other storage vendors have some opinions about the XIV platform, so bring it on folks.