<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Ask the IT Consultant &#187; SLA</title>
	<atom:link href="http://itknowledgeexchange.techtarget.com/it-consulting/tag/sla/feed/" rel="self" type="application/rss+xml" />
	<link>http://itknowledgeexchange.techtarget.com/it-consulting</link>
	<description>Boston SIM Consultants' Roundtable Blog</description>
	<lastBuildDate>Sat, 27 Apr 2013 21:32:19 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
		<item>
		<title>The Illusion of Cloud High Availability – Hardcore risk management</title>
		<link>http://itknowledgeexchange.techtarget.com/it-consulting/the-illusion-of-cloud-high-availability-%e2%80%93-hardcore-risk-management/</link>
		<comments>http://itknowledgeexchange.techtarget.com/it-consulting/the-illusion-of-cloud-high-availability-%e2%80%93-hardcore-risk-management/#comments</comments>
		<pubDate>Sat, 25 Feb 2012 02:00:30 +0000</pubDate>
		<dc:creator>Beth Cohen</dc:creator>
				<category><![CDATA[Cloud architectures]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[cloud data center]]></category>
		<category><![CDATA[cloud infrastructure]]></category>
		<category><![CDATA[Cloud operations]]></category>
		<category><![CDATA[Data center operations]]></category>
		<category><![CDATA[Dev/Ops]]></category>
		<category><![CDATA[IT operations]]></category>
		<category><![CDATA[IT risk management]]></category>
		<category><![CDATA[risk management]]></category>
		<category><![CDATA[SLA]]></category>

		<guid isPermaLink="false">http://itknowledgeexchange.techtarget.com/it-consulting/?p=613</guid>
		<description><![CDATA["The enterprise needs to balance the cost of duplicating hardware throughout the cloud ecosystem, against the need for keeping operating expenses low. "]]></description>
				<content:encoded><![CDATA[<p><strong><em>Question</em></strong><em>:  I am building a private cloud and am concerned about how to meet the SLA high availability requirements.  What are my risks and how can I best manage them?</em></p>
<p align="left">To understand how to manage your high availability options, it helps to first discuss how high availability, risk and component failure work in a cloud environment.  High availability is very important for cloud environments; the cloud provider is often required to meet strict service level agreements for 99.99% or 99.999% (the so-called 5-nines) availability.  In theory, that is the very reason that customers are interested in using cloud services.  It should be noted that most public cloud service providers have many ways of measuring availability that work in their favor, so even in the case of catastrophic systems failure, they are rarely held accountable for the downtime.  This has been one of many sources of caution for enterprises that have wanted to leverage public cloud services.</p>
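<p align="left">To make those availability percentages concrete, here is a minimal sketch that converts an availability target into a yearly downtime budget.  The function name and the choice of a 365-day year are assumptions made for illustration; real SLAs define their own measurement windows and exclusions.</p>

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes(availability_percent):
    """Return the yearly downtime budget, in minutes, for an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

<p align="left">Under these assumptions, 99.99% leaves a little under an hour of downtime per year, and 99.999% leaves only a few minutes.</p>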
<p align="left">Before going into the details of how to quantify the cost of risk mitigation for a cloud, a short discussion of the science of risk management will help with understanding how it all works.  The goal of business risk management is to detail what kinds of risks exist in your specific business and determine how to prevent them entirely or minimize their impact on the business as a whole.  Business risk management essentially quantifies risk as the probability that a given system will fail multiplied by the cost of that failure.  Cost is further broken into two categories: out of pocket costs, also referred to as sunk costs, and lost opportunity costs.  Sunk costs are costs that you will need to pay out to fix the problem, while lost opportunity costs are revenue lost due to the system unavailability.  For example, the risk that there will be regular earthquakes in Japan is high.  The Japanese have responded to this threat by having some of the strongest earthquake-resistant building codes in the world.  However, as last year&#8217;s 9.0 tremor and following tsunami so dramatically demonstrated, it is impossible to prepare for such extreme and rare events.</p>
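<p align="left">The risk calculation described above can be sketched in a few lines.  This is only an illustration: the function names, the probabilities and the dollar figures are all hypothetical, and a real assessment would use failure data and cost estimates from your own business.</p>

```python
def expected_annual_loss(failure_probability, out_of_pocket_cost, lost_opportunity_cost):
    """Probability of a failure in a year times the total cost of one failure."""
    return failure_probability * (out_of_pocket_cost + lost_opportunity_cost)


def net_benefit_of_mitigation(loss_before, loss_after, mitigation_cost):
    """Positive when the mitigation saves more per year than it costs."""
    return loss_before - loss_after - mitigation_cost


# Hypothetical figures, chosen only to illustrate the arithmetic.
loss_unmitigated = expected_annual_loss(0.05, 20_000, 180_000)
loss_mitigated = expected_annual_loss(0.01, 20_000, 180_000)
benefit = net_benefit_of_mitigation(loss_unmitigated, loss_mitigated, 5_000)

print(f"Expected annual loss without mitigation: {loss_unmitigated:,.0f}")
print(f"Expected annual loss with mitigation:    {loss_mitigated:,.0f}")
print(f"Net annual benefit of the mitigation:    {benefit:,.0f}")
```

<p align="left">If the net benefit comes out negative, the mitigation costs more than the risk it removes, which is the balance the rest of this discussion is about.</p>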
<p align="left">High availability is best addressed by redundancy.  However, redundancy can be achieved at several levels of the IT infrastructure: hardware, software, network, or a combination.  Traditional IT organizations have reduced the risk of downtime by concentrating almost exclusively on hardware redundancy.  At the scale of the cloud, where there are already thousands or hundreds of thousands of systems, hardware redundancy at the component level quickly becomes unsustainably expensive.  A telling scholarly article that looked at the reported hardware failures from several large data centers shows that, as would be expected, by far the most likely components to fail at data center scale are those with moving parts, such as hard drives and power supplies.</p>
<p align="left"><a href="http://cdn.ttgtmedia.com/ITKE/uploads/blogs.dir/122/files/2012/02/mttfchart.gif"><img class="alignleft size-medium wp-image-616" src="http://cdn.ttgtmedia.com/ITKE/uploads/blogs.dir/122/files/2012/02/mttfchart.gif" alt="" /></a></p>
<p align="left">In practice, the enterprise needs to balance the cost of duplicating hardware throughout the cloud ecosystem, which is the traditional approach to solving the risk management problem, against the need for keeping operating expenses low.  Another consideration is that duplicating everything at the hardware level does not automatically guarantee that you have eliminated every single point of failure in the environment.  For example, you might have remembered to contract with multiple carriers to spread the risk of a network outage, but if they all come into the data center at a single location, the data center is still prone to a catastrophic &#8220;backhoe failure&#8221;, which is what happens when a backhoe severs all the up-link cables in one fell swoop.  It is an expensive and time-consuming repair that leaves many unhappy customers in its wake.  Yes, there are ways to mitigate this risk, but they are expensive and need to be balanced against the relative probability of such an event.</p>
<p align="left">The best approach is to look at the probability of failure of each component in the context of the entire ecosystem.  Since hard drives fail at such a high rate, the hardware approach is to mirror or RAID the drives across thousands of systems.  This translates to data redundancy and added costs that far exceed the optimum for availability and cost reduction.  At scale, building a storage system that handles the data redundancy at the software level is far more efficient.  Another example is planning for power supply (or more precisely fan) failures.  Again, since they are generally the next most common component to fail, instead of filling the cloud data center with thousands of extra power supplies and fans, it is better to build the cloud to be resistant to downtime if a server node fails.  In the end, addressing cloud high availability is not only about determining the MTTF of hard drives, cables or switch ports; it is also about balancing that against the likelihood of a given failure at the data center macro level.</p>
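<p align="left">As a rough sketch of why component failures look so different at the data center macro level, the following estimates fleet-wide failures from a component MTTF.  It assumes a constant failure rate (an exponential model), independent failures, and immediate replacement of failed parts.  The fleet size and MTTF are hypothetical numbers, not figures from the study mentioned above.</p>

```python
import math

HOURS_PER_YEAR = 8760


def expected_failures_per_year(component_count, mttf_hours):
    """Expected failures across the fleet in one year, assuming a constant failure rate."""
    return component_count * HOURS_PER_YEAR / mttf_hours


def probability_of_any_failure(component_count, mttf_hours, window_hours):
    """Chance that at least one component fails within the window (exponential model)."""
    return 1 - math.exp(-component_count * window_hours / mttf_hours)


# Hypothetical fleet: 10,000 drives, each with an MTTF of 300,000 hours.
drives = 10_000
mttf = 300_000

print(f"Expected drive failures per year: {expected_failures_per_year(drives, mttf):.0f}")
print(f"Chance of a drive failure in the next 24 hours: {probability_of_any_failure(drives, mttf, 24):.0%}")
```

<p align="left">A component that almost never fails on a single server becomes a routine, near-daily event across a large fleet, which is why the design needs to tolerate the failure rather than try to prevent it with extra hardware.</p>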
<p>About the Author</p>
<p><em>Beth Cohen, </em><a href="http://www.cloudtp.com/"><em>Cloud Technology Partners, Inc</em></a><em>.  Transforming Businesses with Cloud Solutions</em></p>
<!-- wpms-network-global-inserts -->]]></content:encoded>
			<wfw:commentRss>http://itknowledgeexchange.techtarget.com/it-consulting/the-illusion-of-cloud-high-availability-%e2%80%93-hardcore-risk-management/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tangled Up in Clouds &#8212; Interdependency lessons from the AWS outage</title>
		<link>http://itknowledgeexchange.techtarget.com/it-consulting/tangled-up-in-clouds-interdependency-lessons-from-the-aws-outage/</link>
		<comments>http://itknowledgeexchange.techtarget.com/it-consulting/tangled-up-in-clouds-interdependency-lessons-from-the-aws-outage/#comments</comments>
		<pubDate>Wed, 27 Apr 2011 16:00:47 +0000</pubDate>
		<dc:creator>Beth Cohen</dc:creator>
				<category><![CDATA[Amazon Cloud Services]]></category>
		<category><![CDATA[Backup]]></category>
		<category><![CDATA[Cloud architectures]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[Cloud IT]]></category>
		<category><![CDATA[Cloud Services]]></category>
		<category><![CDATA[data protection]]></category>
		<category><![CDATA[Disaster Recovery]]></category>
		<category><![CDATA[high availability]]></category>
		<category><![CDATA[IT Infrastructure]]></category>
		<category><![CDATA[leveraging IT investment]]></category>
		<category><![CDATA[service level agreements]]></category>
		<category><![CDATA[SLA]]></category>

		<guid isPermaLink="false">http://itknowledgeexchange.techtarget.com/it-consulting/?p=426</guid>
		<description><![CDATA["What this outage highlighted for many companies is that even if they had built in the best fail-over and high availability into their systems, they were still dependent on vendors and services that might not have been quite so diligent."]]></description>
				<content:encoded><![CDATA[<p><strong><em>Question</em></strong><em>:  Amazon&#8217;s recent AWS outage affected a surprisingly large number of sites.  What can we learn about cloud resiliency and how can we minimize these outages in the future?</em></p>
<p>AWS, Amazon&#8217;s hosted web services offering, suffered a major outage with some data loss at one of its data centers on April 21, 2011.  It was not the first such outage and I rather doubt it will be the last, but it was the one that exposed what I call the dirty secret about cloud computing: the illusion of low cost high availability, systems backup and protection, and how quickly so many cloud services have become interdependent.</p>
<p>Ultimately, data protection and high availability boil down to having multiple copies of your data and IT systems in multiple locations with good reliable bandwidth connecting them.  Traditionally, high availability (that is 5 nines and up) has been expensive due to the cost of the bandwidth and hardware needed to deliver the level of service required.</p>
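<p>To see why multiple copies in multiple locations help, here is a minimal sketch of the standard formula for redundant systems.  It assumes the sites fail independently and that any one site can carry the full load; both assumptions are optimistic, and the function name and availability figures are purely illustrative.</p>

```python
def redundant_availability(site_availability, site_count):
    """Availability when any one of several independent sites can serve the load."""
    return 1 - (1 - site_availability) ** site_count


for sites in (1, 2, 3):
    combined = redundant_availability(0.99, sites)
    print(f"{sites} site(s) at 99% each give about {combined:.6f} combined availability")
```

<p>Each additional independent site multiplies the chance of total failure by the single-site failure rate, but only if the sites truly share no common point of failure.</p>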
<p>On the surface, moving your IT infrastructure to the cloud looks and sounds very attractive.  In theory, the cloud offers a great solution.  By purchasing cloud services, anyone can leverage the investments of Amazon, Google, Rackspace and the other major cloud vendors in state-of-the-art data centers with full redundancy and big network pipes for a tiny fraction of the cost of doing it in-house.  By moving IT infrastructure to the cloud you can take advantage of the redundancy and resiliency of using multiple vendors and multiple data centers and get enterprise class data protection at rock bottom prices.  However, reading between the lines of the standard service level agreements for the low cost cloud services paints a very different picture.  Amazon guarantees 98% up-time, hardly earth-shatteringly difficult to achieve.  Once you add in all those pesky asterisks and inter-dependencies, it is unlikely that anyone is going to be able to collect on this incident or any downtime at all.</p>
<p>Setting aside the issue of Amazon&#8217;s service level agreements, all of this assumes that you have control over most if not all of the systems and services in your IT stack.  What this outage highlighted for many companies is that even if they had built the best fail-over and high availability into their systems, they were still dependent on vendors and services that might not have been quite so diligent.  As more companies take advantage of the increasingly specialized cloud services built on top of the cloud utility vendors&#8217; infrastructure, ensuring up-time is going to become increasingly difficult, because it has to be determined through a maze of inter-dependent services.</p>
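<p>The effect of those inter-dependencies can be sketched with simple arithmetic.  If your site needs every upstream service to be working, and the services fail independently, the overall availability is the product of the individual availabilities.  The function name and the figures below are hypothetical and only illustrate the calculation.</p>

```python
import math


def chained_availability(dependency_availabilities):
    """Availability of a service that needs every listed dependency to be up."""
    return math.prod(dependency_availabilities)


# Hypothetical stack: your own service plus three upstream vendors.
own_service = 0.9999
upstream_vendors = [0.999, 0.999, 0.999]

overall = chained_availability([own_service] + upstream_vendors)
print(f"Own service alone:       {own_service:.4%}")
print(f"With upstream dependencies: {overall:.4%}")
```

<p>Under these assumptions the combined figure is always lower than the weakest link in the chain, no matter how well your own service is built.</p>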
<p>The bottom line for a business that wants to gain the advantage of high availability at low cost is that you need to make sure not only that you have architected your own service to have a full fail-over solution, but also that you have spent time doing due diligence on all of your vendors&#8217; policies and architectures.  No matter how good the SLA is, if one of your upstream service providers does not have a good policy in place, your site will still be affected by their lack of planning.</p>
<p>About the Author</p>
<p><em>Beth Cohen, Cloud Technology Partners, Inc. Moving companies&#8217; IT services into the cloud the right way, the first time!</em></p>
<!-- wpms-network-global-inserts -->]]></content:encoded>
			<wfw:commentRss>http://itknowledgeexchange.techtarget.com/it-consulting/tangled-up-in-clouds-interdependency-lessons-from-the-aws-outage/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
