<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Ask the IT Consultant &#187; business continuity</title>
	<atom:link href="http://itknowledgeexchange.techtarget.com/it-consulting/tag/business-continuity/feed/" rel="self" type="application/rss+xml" />
	<link>http://itknowledgeexchange.techtarget.com/it-consulting</link>
	<description>Boston SIM Consultants' Roundtable Blog</description>
	<lastBuildDate>Sat, 27 Apr 2013 21:32:19 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
		<item>
		<title>Cloud High Availability Take Two – Supporting Rack Level Failure</title>
		<link>http://itknowledgeexchange.techtarget.com/it-consulting/cloud-high-availability-take-two-%e2%80%93-supporting-rack-level-failure/</link>
		<comments>http://itknowledgeexchange.techtarget.com/it-consulting/cloud-high-availability-take-two-%e2%80%93-supporting-rack-level-failure/#comments</comments>
		<pubDate>Sun, 04 Mar 2012 02:00:55 +0000</pubDate>
		<dc:creator>Beth Cohen</dc:creator>
				<category><![CDATA[BD/DR]]></category>
		<category><![CDATA[business continuity]]></category>
		<category><![CDATA[Cloud architectures]]></category>
		<category><![CDATA[cloud computing models]]></category>
		<category><![CDATA[Cloud computing standards]]></category>
		<category><![CDATA[cloud data center]]></category>
		<category><![CDATA[cloud hardware]]></category>
		<category><![CDATA[cloud infrastructure]]></category>
		<category><![CDATA[Cloud innovation]]></category>
		<category><![CDATA[Cloud IT]]></category>
		<category><![CDATA[Disaster Recovery]]></category>
		<category><![CDATA[OpenStack]]></category>

		<guid isPermaLink="false">http://itknowledgeexchange.techtarget.com/it-consulting/?p=631</guid>
		<description><![CDATA["There is no need for redundant Top of Rack switches if the unit of failure is assumed to be the rack in a cloud ecosystem.  The average time to replace a switch including configuration is going to be under 2 hours. "]]></description>
				<content:encoded><![CDATA[<p style="text-align: left" align="left"><strong><em><span style="font-family: &quot;Calibri&quot;,&quot;sans-serif&quot;">Question</span></em></strong><em><span style="font-family: &quot;Calibri&quot;,&quot;sans-serif&quot;">:  I am concerned that the network is the weakest link in my private cloud. What will happen if any of my network hardware components fail?</span></em></p>
<p class="MsoNormal">In a previous discussion of <a href="../the-illusion-of-cloud-high-availability-%e2%80%93-hardcore-risk-management/">cloud high availability</a>, I covered in general terms some of the principles and approaches that make sense in a cloud environment. This time we will dive into the details of how this can be achieved in an OpenStack environment.</p>
<p class="MsoNormal">The average published MTBF (mean time between failures) for switches seems to be between 100,000 and 200,000 hours, which translates to roughly 11 to 22 years. This number depends on the ambient temperature around the switch in the data center; I am assuming that most modern data centers are properly cooled for maximum switch life. Even in the worst case of poor ventilation and high ambient temperatures, the MTBF is still 2-3 years, based on research found at <a href="http://www.garrettcom.com/techsupport/papers/ethernet_switch_reliability.pdf">http://www.garrettcom.com/techsupport/papers/ethernet_switch_reliability.pdf</a>.</p>
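<p>The conversion from hours to years is simple division by the number of hours in a year. A minimal Python sketch, assuming a 365-day year (8,760 hours) of continuous operation; the function name is illustrative:</p>

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap days ignored


def mtbf_years(mtbf_hours: float) -> float:
    """Convert a published MTBF figure in hours into years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


# The low and high ends of the published switch MTBF range.
for hours in (100_000, 200_000):
    print(f"{hours:,} hours is about {mtbf_years(hours):.1f} years")
```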
<p class="MsoNormal">The mean time to replacement (MTTR) for a switch depends on exactly how the data center is staffed and what processes are used for replacing switches. Assuming that you keep a few spares in the data center and that it is fully staffed 24 hours a day, the average time to replace a switch, including configuration, is going to be under 2 hours. Most modern switches are auto-configured, so the actual provisioning time after the switch is powered up in the rack is under 5 minutes.</p>
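<p>MTBF and MTTR together give the steady-state availability of a single switch, A = MTBF / (MTBF + MTTR). The sketch below applies that standard formula to the figures in the text: a 2-hour MTTR, the 100,000-200,000 hour published range, and 17,520 hours (2 years) as the low end of the poorly cooled worst case. The helper names are illustrative.</p>

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the component is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected minutes per year that one switch (and therefore its rack) is offline."""
    return (1 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR * 60


MTTR = 2  # hours to swap and configure a spare, per the staffing assumptions in the text
for mtbf in (17_520, 100_000, 200_000):
    print(f"MTBF {mtbf:,} h: availability {availability(mtbf, MTTR):.4%}, "
          f"about {downtime_minutes_per_year(mtbf, MTTR):.1f} min/year offline")
```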
<p class="MsoNormal">Let me walk through what will happen in the case of a top of rack (ToR) switch failure in the Swift cluster. Swift is by its nature fault tolerant at the rack level, which means that the system will continue to operate without data loss if an entire rack goes off-line. The cluster would detect that the rack is off-line and send out a notification that the NOC staff would see within 5 minutes. When a rack goes off-line, Swift does not automatically move any data, because the NOC staff first needs to determine why the rack went off-line and how long it will take to bring it back on-line. In the case of a switch failure, the data in the rack is still intact, so it is far more efficient to replace the switch and then bring the rack back on-line without having to move the data. Even if the NOC staff decides to move data around, which they would only do if the fault is in the servers rather than the switch, the network overhead that it adds to the cluster is in the range of 3-5% for a large cluster with a properly tuned ring rebuild cycle. Clearly, taking a rack off-line is not considered a problem. I would argue that, as a matter of course, you should expect to be able to take racks off-line for maintenance, upgrades and other reasons with no impact on the system as a whole.</p>
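<p>A rack can go off-line without data loss because each object's replicas live in different failure domains. The toy model below is <em>not</em> Swift's ring-builder algorithm; it assumes three replicas per object, one Swift zone per rack, and a hash-based placement that never puts two replicas in the same rack, then checks how many copies survive when any single rack disappears. Zone and object names are hypothetical.</p>

```python
import hashlib

REPLICAS = 3  # three copies of every object, as described for Swift


def place_replicas(object_name: str, zones: list, replicas: int = REPLICAS) -> list:
    """Toy placement: hash the name to pick a starting zone, then take the next
    consecutive zones so every replica lands in a different rack."""
    if replicas > len(zones):
        raise ValueError("need at least as many zones as replicas")
    start = int(hashlib.md5(object_name.encode()).hexdigest(), 16) % len(zones)
    return [zones[(start + i) % len(zones)] for i in range(replicas)]


def surviving_copies(placement: list, failed_zone: str) -> int:
    """Number of replicas still reachable when one zone (rack) is off-line."""
    return sum(1 for zone in placement if zone != failed_zone)


zones = [f"rack-{n}" for n in range(1, 6)]  # assumption: one zone per rack
placements = {f"obj-{i}": place_replicas(f"obj-{i}", zones) for i in range(1000)}

# Take each rack off-line in turn and record the worst case across all objects.
worst = min(
    surviving_copies(p, failed)
    for failed in zones
    for p in placements.values()
)
print(f"minimum surviving replicas after any single rack failure: {worst}")
```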
<p class="MsoNormal">Nova behaves slightly differently in the case of a rack failure. Unlike Swift, its architecture does not assume that the rack is the base unit of failure. It does have the concept of an availability zone, which, just to confuse things, is quite different from a Swift zone. That does not mean that you cannot create an equally fault-tolerant Nova architecture; it just requires building more high availability into the application level, combined with using availability zones to balance the applications across different locations. The assumption is that it is the responsibility of the application to build in fault tolerance, not of the underlying infrastructure to keep track of the individual VM instances. Combining availability zones with live migration and HA application design will allow you to build support for rack-level failure. Again, the next steps (replacing only the switch or rebuilding the entire rack) will depend on which specific component failed. See the recent discussion at <a href="http://lists.us.dell.com/pipermail/crowbar/2012-January/000643.html">http://lists.us.dell.com/pipermail/crowbar/2012-January/000643.html</a> for more ideas on how to architect such a system.</p>
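<p>The application-level approach can be illustrated with a simple placement rule: spread the instances of each tier across availability zones so that losing any one zone (here assumed to map to a single rack) leaves the rest of the tier running. This hypothetical Python sketch covers only the planning logic and does not call Nova; the zone and instance names are made up. In a real deployment the chosen zone would be requested when each instance is launched.</p>

```python
from collections import Counter
from itertools import cycle


def spread_across_zones(instance_names, zones):
    """Assign instances to zones round-robin so each zone gets an even share."""
    zone_cycle = cycle(zones)
    return {name: next(zone_cycle) for name in instance_names}


def still_running(plan, failed_zone):
    """Instances that survive when every instance in one zone is lost."""
    return [name for name, zone in plan.items() if zone != failed_zone]


zones = ["az-rack-a", "az-rack-b", "az-rack-c"]  # hypothetical: one zone per rack
web_tier = [f"web-{i}" for i in range(6)]
plan = spread_across_zones(web_tier, zones)

print(Counter(plan.values()))
for zone in zones:
    print(f"{zone} lost: {len(still_running(plan, zone))} of {len(web_tier)} web instances still running")
```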
<p class="MsoNormal">Another approach would be to create high availability through redundant hardware; in this case you could provision each rack with two switches. However, this is an expensive option in a large data center with hundreds of racks, since it roughly doubles the ToR switch spend for every rack. From a risk perspective, you have substantially increased your per-rack costs with little or no reduction in risk, since the rate of failure is so low to begin with and the architected unit of failure for a cloud infrastructure should be the rack anyway.</p>
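<p>The risk argument can be checked with back-of-the-envelope arithmetic. Assuming independent failures at a constant rate, a fleet of N switches with a given MTBF sees about N &#215; 8,760 / MTBF failures per year. The sketch below uses a hypothetical 300-rack fleet, the midpoint of the published MTBF range, and the 2-hour replacement time discussed earlier:</p>

```python
HOURS_PER_YEAR = 8760


def expected_failures_per_year(switches: int, mtbf_hours: float) -> float:
    """Expected switch failures per year across a fleet, assuming a constant failure rate."""
    return switches * HOURS_PER_YEAR / mtbf_hours


def rack_hours_offline_per_year(switches: int, mtbf_hours: float, mttr_hours: float) -> float:
    """Total rack-hours lost per year when each failure takes one rack offline for MTTR hours."""
    return expected_failures_per_year(switches, mtbf_hours) * mttr_hours


RACKS = 300     # hypothetical fleet size: one ToR switch per rack
MTBF = 150_000  # midpoint of the published 100,000-200,000 hour range
MTTR = 2        # hours to replace and configure a spare switch

failures = expected_failures_per_year(RACKS, MTBF)
lost = rack_hours_offline_per_year(RACKS, MTBF, MTTR)
print(f"{failures:.1f} switch failures/year, {lost:.1f} rack-hours offline/year")
print(f"A second switch in every rack means buying {RACKS} extra switches "
      f"to cover about {failures:.0f} failures a year")
```

<p>Under these assumptions the whole fleet loses only a few dozen rack-hours a year, which a design that treats the rack as the unit of failure absorbs without data loss.</p>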
<p><span style="font-family: &quot;Calibri&quot;,&quot;sans-serif&amp;quot&amp;quot&#038;quot">About the Author</span></p>
<p><em>Beth Cohen, </em><a href="http://www.cloudtp.com/"><em>Cloud Technology Partners, Inc</em></a><em>.<span> </span>Transforming Businesses with Cloud Solutions</em></p>
<!-- wpms-network-global-inserts -->]]></content:encoded>
			<wfw:commentRss>http://itknowledgeexchange.techtarget.com/it-consulting/cloud-high-availability-take-two-%e2%80%93-supporting-rack-level-failure/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Cloud Redundancy – A different approach to component failure</title>
		<link>http://itknowledgeexchange.techtarget.com/it-consulting/cloud-redundancy/</link>
		<comments>http://itknowledgeexchange.techtarget.com/it-consulting/cloud-redundancy/#comments</comments>
		<pubDate>Sun, 15 Jan 2012 15:00:37 +0000</pubDate>
		<dc:creator>Beth Cohen</dc:creator>
				<category><![CDATA[business continuity]]></category>
		<category><![CDATA[Cloud business models]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[cloud computing models]]></category>
		<category><![CDATA[Cloud Services]]></category>
		<category><![CDATA[Disaster Recovery]]></category>
		<category><![CDATA[enterprise cloud]]></category>
		<category><![CDATA[enterprise cloud services]]></category>
		<category><![CDATA[hardware failure]]></category>
		<category><![CDATA[OpenStack]]></category>

		<guid isPermaLink="false">http://itknowledgeexchange.techtarget.com/it-consulting/?p=581</guid>
		<description><![CDATA["Unlike traditional IT operations, over-design to protect against obsolescence is not desirable when scaling to thousands of nodes."]]></description>
				<content:encoded><![CDATA[<p><strong><em>Question</em></strong><em>:  What is the best way to manage the thousands of components in a typical cloud?  How does managing &#8220;at scale&#8221; change my systems administration practices?</em></p>
<p align="left">People have been managing data centers for 30-40 years now, so there should be a good set of standard best practices for building highly available, resilient components. That is true for the old-style data center, but the old best practices are expensive and do not scale well for cloud architectures. Duplicating hardware to protect against failure works well when you have hundreds of components, but the costs grow linearly, so it does not scale. Unlike traditional IT operations, over-design to protect against obsolescence is not desirable when scaling to thousands of nodes. For example, spending an extra $6,000 per rack for 10 Gb switches might seem a sensible way to protect against hardware obsolescence if you have 10 racks, but that extra cost is much harder to justify when you are provisioning 1,000 racks and it has turned into an extra $6 million!</p>
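<p>Because the premium is paid on every rack, it grows linearly with the size of the deployment. A trivial Python sketch of that arithmetic, using the $6,000-per-rack figure from the example (the rack counts are illustrative):</p>

```python
EXTRA_COST_PER_RACK = 6_000  # USD premium per rack for faster switches, from the example


def total_premium(racks: int, per_rack: int = EXTRA_COST_PER_RACK) -> int:
    """Extra spend grows linearly with the number of racks provisioned."""
    return racks * per_rack


for racks in (10, 100, 1_000):
    print(f"{racks:,} racks: ${total_premium(racks):,}")
```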
<p>The principle of &#8216;replacement management&#8217; takes on great importance when managing the thousands of physical devices required for a cloud deployment. The advantage of the cloud is that you do not need to build expensive high-availability redundant systems, because the assumption that components will fail is built into the architecture. By leveraging the huge pools of cloud resources, the level of redundancy can be considerably reduced: if a component fails, the system will continue to work until someone replaces it. Since low-cost commodity devices typically have a high rate of failure, the whole architecture needs to be based on &#8220;availability&#8221; and &#8220;partial failure&#8221;.</p>
<p>In a cloud environment, it makes much more sense to simply replace a component than to worry about what caused the failure and try to troubleshoot it. The components most likely to fail are disks, since they have mechanical moving parts. A typical disk failure rate in a cloud data center is about 10-15%. Fans, power supplies and memory also fail, though less frequently. For example, the OpenStack Swift architecture assumes that disks, systems and entire zones can and will disappear (fail) at any time. Yet there are only three copies of every file, and no additional redundancy in the hardware.</p>
<p>This approach to failure at scale can be very cost effective, but it takes a different mindset from traditional operations. Every cloud operations engineer should learn what is in the service, where the critical parts are located, and how to replace a failed component, and then incorporate that knowledge into standard operations processes. Automated tools need to be written to help identify the location of failed disks and other components so they can quickly be isolated from the environment and replaced. To maintain a high level of robustness without sacrificing cost efficiency, the system needs to be designed to replicate data at the application/software level, not at the disk or network level.</p>
<p>In conclusion, the biggest paradigm shift is that development and operations groups need to work together to optimize the systems and drive down costs. Tests and metrics need to be created to determine the optimum system configurations. Understanding how changes in the components affect the system as a whole will allow you to configure the systems flexibly to meet the application requirements as they change.</p>
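<p>The kind of automated tool described above can start very small. The following is a hypothetical Python sketch: it assumes an inventory that records, for every disk, its serial number, rack, host, slot and a health flag (which in practice might come from SMART data or kernel-log checks), and turns the failed entries into a replacement worklist ordered by physical location.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disk:
    serial: str
    rack: str
    host: str
    slot: int
    healthy: bool  # assumption: set by whatever health check the site already runs


def replacement_worklist(inventory):
    """Return failed disks ordered by rack, host and slot so a technician
    can visit each location once while walking the floor."""
    failed = [d for d in inventory if not d.healthy]
    return sorted(failed, key=lambda d: (d.rack, d.host, d.slot))


inventory = [
    Disk("S3", "rack-07", "storage-12", 4, False),
    Disk("S1", "rack-02", "storage-03", 9, True),
    Disk("S2", "rack-02", "storage-03", 1, False),
]
for d in replacement_worklist(inventory):
    print(f"{d.rack} / {d.host} / slot {d.slot}: replace disk {d.serial}")
```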
<p>About the Author</p>
<p><em>Beth Cohen, </em><a href="http://www.cloudtp.com/"><em>Cloud Technology Partners, Inc</em></a><em>.  Transforming Businesses with Cloud Solutions</em></p>
<!-- wpms-network-global-inserts -->]]></content:encoded>
			<wfw:commentRss>http://itknowledgeexchange.techtarget.com/it-consulting/cloud-redundancy/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Real Life Business Continuity Planning</title>
		<link>http://itknowledgeexchange.techtarget.com/it-consulting/real-life-business-continuity-planning/</link>
		<comments>http://itknowledgeexchange.techtarget.com/it-consulting/real-life-business-continuity-planning/#comments</comments>
		<pubDate>Tue, 13 Apr 2010 22:00:57 +0000</pubDate>
		<dc:creator>ITKE</dc:creator>
				<category><![CDATA[BC/DR]]></category>
		<category><![CDATA[business continuity]]></category>
		<category><![CDATA[Data center operations]]></category>
		<category><![CDATA[Disaster Recovery]]></category>
		<category><![CDATA[Enterprise datacenter]]></category>

		<guid isPermaLink="false">http://itknowledgeexchange.techtarget.com/it-consulting/?p=298</guid>
		<description><![CDATA[Just because you are using your backup site, does not mean that all the normal business operations rules no longer apply.  ]]></description>
				<content:encoded><![CDATA[<p><strong><em>Question</em></strong><em>:  What are some operational considerations to expect while running a disaster recovery (DR) site during an actual disaster?</em></p>
<p>You are in a panic. Suddenly, your primary data center is down and you are planning to fail over your business-critical production applications to your carefully planned DR site. Assuming a successful recovery, your first realization is that the DR site is now your production site, albeit temporarily. You need to run your backup site in full production mode, i.e. it needs to be run the way you ran your recently disabled production site. Just because you are using the backup site does not mean that the normal business rules no longer apply.</p>
<p>When you were putting together your BC/DR plan, you figured that your backup site only needed to be a &#8216;bare bones&#8217; operation supporting critical functions, but the reality is that life at the DR site will include most, if not all, of the normal production operations headaches. Including the following considerations in your BC/DR plan will make an actual disaster situation that much less painful:</p>
<ul class="unIndentedList">
<li> Full operations management of the DR environment is necessary to keep recovered production running. DR servers have all of the same issues that any server does. Is a full set of your administration and monitoring tools ready to use at the DR site?</li>
<li> Backups will be required. Production data requires the same level of protection, especially if customer service level agreements are involved. Have you provided for these?</li>
<li> Support of the ‘hands on&#8217; variety may be needed even though you can manage your infrastructure remotely. If your DR site is far away from your primary site, getting staff there may be a challenge. Have you arranged for appropriate on-site assistance?</li>
<li> Security controls have to be as stringent as ever because the risks are the same (and perhaps even worse) and all legal requirements still hold. Can you control and monitor access to your DR site?</li>
<li> Applications still require support. Patches and emergency releases will inevitably be needed to keep the business running. Are all your code libraries and the tools needed for development and testing installed at the DR site?</li>
</ul>
<p>The planning implications are clear &#8211; since your DR site is a substitute for your primary production site, think of it as such and outfit it to perform at the same level. Even though it will not be as large as the primary site, it should offer all of the same capabilities as the primary site.  In some situations, it just might be your home for a long period of time.</p>
<p>About the Author</p>
<p><em>John McWilliams, JH McWilliams &amp; Associates, Business Continuity Consultants</em></p>
<!-- wpms-network-global-inserts -->]]></content:encoded>
			<wfw:commentRss>http://itknowledgeexchange.techtarget.com/it-consulting/real-life-business-continuity-planning/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Five Tenets for Achieving High Availability for Critical Applications</title>
		<link>http://itknowledgeexchange.techtarget.com/it-consulting/five-tenets-for-achieving-high-availability-for-critical-applications/</link>
		<comments>http://itknowledgeexchange.techtarget.com/it-consulting/five-tenets-for-achieving-high-availability-for-critical-applications/#comments</comments>
		<pubDate>Thu, 18 Feb 2010 14:00:09 +0000</pubDate>
		<dc:creator>ITKE</dc:creator>
				<category><![CDATA[business applications]]></category>
		<category><![CDATA[business continuity]]></category>
		<category><![CDATA[Business Value]]></category>
		<category><![CDATA[high availability]]></category>

		<guid isPermaLink="false">http://itknowledgeexchange.techtarget.com/it-consulting/?p=287</guid>
		<description><![CDATA[Question:  My organization demands critical applications such as email and ERP be &#8220;always on&#8221; (whenever users want to access them).  What is the best way to achieve &#8220;always on&#8221; IT systems? The pervasiveness of the Internet has put IT executives in a bind.  Nowadays, organizations rely so heavily on IT to run their businesses that [...]]]></description>
				<content:encoded><![CDATA[<p><strong><em>Question</em></strong><em>:  My organization demands critical applications such as email and ERP be &#8220;always on&#8221; (whenever users want to access them).  What is the best way to achieve &#8220;always on&#8221; IT systems?</em></p>
<p>The pervasiveness of the Internet has put IT executives in a bind.  Nowadays, organizations rely so heavily on IT to run their businesses that users have become IT&#8217;s top priority.  IT is expected to deliver high availability and predictable performance for key user applications &#8212; the <em>always on imperative</em>.</p>
<p>Yet many IT departments lack sufficient resources &#8211; skilled personnel, streamlined processes and effective technology &#8211; to keep IT operations running smoothly when needed. Moreover, existing applications were often designed and deployed from IT&#8217;s perspective, not from that of the business or its users. For example, IT generally concentrates on analyzing technical specifications, defining IT acceptance testing, and managing project deliverables. User requirements get little attention, which often leads to poor user adoption, negating the expected project ROI.</p>
<p>This technical mindset of designing, deploying and managing IT systems is no longer sustainable. Application development needs to be prioritized by business requirements, user needs and business value. The deployment phase should be concerned first with user acceptance, training and usage. Incorporating metrics for business impact (e.g. application availability &amp; performance for the users, user adoption rates, and productivity gains) ensures that these goals are met. Operations must concentrate on high availability and performance. A thorough awareness of, and visibility into, the applications and all the other elements that must function optimally will ensure that critical business services are enabled properly.</p>
<p>Adopting the following five tenets will deliver &#8220;always on&#8221; applications and lead to high availability and predictable performance of key applications for users:</p>
<ul class="unIndentedList">
<li> Design with the end in mind &#8211; Meet the objective of high availability and performance over the entire multi-year usage period</li>
<li> Follow the money &#8211; Stay focused on financial and business benefits of IT systems instead of technical benefits</li>
<li> Focus on user experience &#8211; Shift perspective from technology performance to user productivity</li>
<li> Break the silos &#8211; Create cross-functional teams to achieve close collaboration between IT and users</li>
<li> Manage from the business process perspective &#8211; Monitor critical applications down through underlying equipment to understand business impact of all system components. This will hasten problem resolution and reduce unplanned downtime.</li>
</ul>
<p>About the Author</p>
<p><em>Robert Johnson, Director of Product Marketing, Atrion Networking Corporation</em></p>
<!-- wpms-network-global-inserts -->]]></content:encoded>
			<wfw:commentRss>http://itknowledgeexchange.techtarget.com/it-consulting/five-tenets-for-achieving-high-availability-for-critical-applications/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Preparing IT for Flu Epidemics</title>
		<link>http://itknowledgeexchange.techtarget.com/it-consulting/preparing-it-for-flu-epidemics/</link>
		<comments>http://itknowledgeexchange.techtarget.com/it-consulting/preparing-it-for-flu-epidemics/#comments</comments>
		<pubDate>Tue, 24 Nov 2009 23:00:25 +0000</pubDate>
		<dc:creator>ITKE</dc:creator>
				<category><![CDATA[business continuity]]></category>
		<category><![CDATA[Business Value]]></category>
		<category><![CDATA[Disaster Recovery]]></category>

		<guid isPermaLink="false">http://itknowledgeexchange.techtarget.com/it-consulting/?p=236</guid>
		<description><![CDATA[Question: Are there ways we can prepare for a large percentage of our technical support staff out with H1N1 flu? With the heightened interest in how to respond to the H1N1 pandemic, every organization should be considering how to manage production support operations in the face of high absenteeism rates that could exceed 30%. Because [...]]]></description>
				<content:encoded><![CDATA[<p class="text"><strong><em><span style="font-size: 11pt;font-family: &quot;Arial&quot;,&quot;sans-serif&quot;">Question</span></em></strong><em><span style="font-size: 11pt;font-family: &quot;Arial&quot;,&quot;sans-serif&quot;">: Are there ways we can prepare for a large percentage of our technical support staff being out with H1N1 flu?</span></em></p>
<p class="MsoNormal">With the heightened interest in how to respond to the H1N1 pandemic, every organization should be considering how to manage production support operations in the face of absenteeism rates that could exceed 30%. Because the illness often hits suddenly, staff members could be sick, or at home caring for a family member, and unable to work. Temporarily losing key individuals, such as system administrators and DBAs, can be traumatic without proper planning and redundancy. A good response plan should focus on ensuring that the necessary skill sets are available when needed. The following approach is a good start towards making sure you are covered:</p>
<ul>
<li>Identify the critical functions and their performance timeframes. This information may have already been gathered as part of a business impact analysis. If not, draw up a simple list of the functions or tasks and how time-critical they are.</li>
<li>List the skills and knowledge required to perform the critical functions and the staff who possess them. These might include UNIX administration or knowledge of a custom finance application. Management and the operational staff will know.</li>
<li>Identify the primary and secondary staff members who can provide backup for each task or skill. In particular, identify critical skills that are possessed by only one staff member. Gaps such as these are the biggest risks.</li>
<li>Develop a plan for backfilling those critical skills. This may include documenting procedures and training other staff members, or locating an outside resource to provide the skill on a temporary basis.</li>
<li>Practice running operations using backup staff and documentation. This validates the ability of the backup staff to perform the tasks and also provides on-the-job training and job enrichment opportunities.</li>
<li>Plan for working at home (WAH). In many organizations, technical staff members are already required to be available 7&#215;24, so the mechanisms are in place.</li>
<li>Develop a contingency plan for reducing workload when absenteeism is high. Discuss with senior management the possibility of performing only minimal system changes and delaying major deployments to reduce risk and maintain system stability. Given the possible business implications of such a plan, buy-in from all stakeholders is essential. Define the conditions and triggers for putting the plan into action.</li>
</ul>
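<p>The second and third steps amount to building a skills matrix and looking for critical skills covered by fewer than two people. A minimal Python sketch, assuming the matrix is kept as a simple mapping from staff member to skills; the names and skills are hypothetical:</p>

```python
from collections import defaultdict


def coverage_by_skill(staff_skills):
    """Invert a {person: set of skills} mapping into {skill: set of people}."""
    coverage = defaultdict(set)
    for person, skills in staff_skills.items():
        for skill in skills:
            coverage[skill].add(person)
    return coverage


def single_points_of_failure(staff_skills, critical_skills):
    """Critical skills held by fewer than two people (zero coverage included)."""
    coverage = coverage_by_skill(staff_skills)
    return sorted(s for s in critical_skills if len(coverage.get(s, set())) < 2)


staff = {
    "alice": {"UNIX administration", "Oracle DBA"},
    "bob": {"UNIX administration"},
    "carol": {"custom finance application"},
}
critical = ["UNIX administration", "Oracle DBA", "custom finance application", "backup operations"]

for skill in single_points_of_failure(staff, critical):
    print(f"Backfill needed: {skill}")
```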
<p class="MsoNormal">Having to deliver services without a full staff is a situation that every organization encounters sooner or later. It can be triggered by events other than a flu pandemic. Preparing for it will make your organization more resilient and provide unexpected benefits.</p>
<p class="text"><span style="font-size: 12pt;font-family: &quot;Arial&quot;,&quot;sans-serif&quot;;color: windowtext">About the Author</span></p>
<p class="text"><em><span style="font-size: 11pt;font-family: &quot;Arial&quot;,&quot;sans-serif&quot;">John McWilliams, JH McWilliams &amp; Associates, Business Continuity Consultants</span></em></p>
<!-- wpms-network-global-inserts -->]]></content:encoded>
			<wfw:commentRss>http://itknowledgeexchange.techtarget.com/it-consulting/preparing-it-for-flu-epidemics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Disaster Recovery and Data Integrity</title>
		<link>http://itknowledgeexchange.techtarget.com/it-consulting/disaster-recovery-and-data-integrity/</link>
		<comments>http://itknowledgeexchange.techtarget.com/it-consulting/disaster-recovery-and-data-integrity/#comments</comments>
		<pubDate>Mon, 12 Oct 2009 02:00:03 +0000</pubDate>
		<dc:creator>ITKE</dc:creator>
				<category><![CDATA[Backup]]></category>
		<category><![CDATA[business continuity]]></category>
		<category><![CDATA[Data Integrity]]></category>
		<category><![CDATA[Disaster Recovery]]></category>

		<guid isPermaLink="false">http://itknowledgeexchange.techtarget.com/it-consulting/?p=209</guid>
		<description><![CDATA[Question: When recovering an application at a remote disaster recovery site, how can the data&#8217;s integrity be verified before resuming production processing? One of the primary goals of a disaster recovery (DR) plan is to protect business data.  The RTO (recovery time objective) component is often the most time consuming part of a DR plan. [...]]]></description>
				<content:encoded><![CDATA[<p><em><strong>Question:</strong></em> <em>When recovering an application at a remote disaster recovery site, how can the data&#8217;s integrity be verified before resuming production processing?</em></p>
<p>One of the primary goals of a disaster recovery (DR) plan is to protect business data. The RTO (recovery time objective) component is often the most time-consuming part of a DR plan. The objective is to shorten the recovery task as much as possible while maintaining trust in your data, so that the business can resume regular operations quickly. Since the complexity of many applications leaves far too many places for bad data to hide, the following recommendations will help speed up the recovery process.</p>
<p><strong>Disaster Recovery Preparation</strong></p>
<ul class="unIndentedList">
<li> Implement the fastest replication to your DR site that budget and technology allow. The best option to minimize data loss is synchronous replication of your business-critical data to a remote backup site. This ensures full data integrity, because each transaction is copied to both sites simultaneously and the replication logs record data currency.</li>
<li> The next best option is an asynchronous replication scheme. Whether the delay is measured in minutes or is based on periodic file copies, you will be exposed to some known level of data loss. However, since this option is considerably less expensive, the business owners might decide that the reduced costs offset the increased risk of data loss. Make sure that everyone understands the impact of data loss on the business recovery process. This includes establishing data loss tolerances and identifying likely types of data loss during the development of the DR plan.</li>
<li> Copy verifying data to your DR site along with the primary data. This means finding supporting data from outside the application that corroborates your database. This might include data input forms, activity logs, emails, checksums, or input files. Some of these may be paper-based and could be scanned or copied to the DR site.</li>
</ul>
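<p>Checksums are one of the simplest kinds of verifying data to ship alongside the primary copy. The Python sketch below assumes file-based data: it builds a SHA-256 manifest of a directory at the primary site and, after recovery, reports files at the DR site that are missing or whose contents no longer match. The function names and sample files are illustrative.</p>

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large data files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict) -> dict:
    """Compare recovered files against a manifest produced at the primary site."""
    current = build_manifest(root)
    return {
        "missing": sorted(set(manifest) - set(current)),
        "changed": sorted(
            name for name in manifest.keys() & current.keys()
            if manifest[name] != current[name]
        ),
    }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "orders.csv").write_text("1001,shipped\n")
    (root / "ledger.csv").write_text("1001,49.95\n")
    manifest = build_manifest(root)                  # taken at the primary site
    (root / "ledger.csv").write_text("1001,0.00\n")  # simulate a corrupted file
    (root / "orders.csv").unlink()                   # simulate a file that never arrived
    report = verify_manifest(root, manifest)

print(report)
```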
<p><strong>Recovery Plan Preparation</strong></p>
<ul class="unIndentedList">
<li> Assess the data at the DR site by knowing both its currency (i.e. how old it is, as precisely as possible) and its consistency across multiple applications. Currency is important because it will identify any lost transactions or updates, while consistency is important because incomplete transactions can cause data corruption problems after you have resumed production.</li>
<li> Review the replication logs for the last data transmission to establish currency.</li>
<li> Compare verifying data at the DR site with the recovered data to follow specific transactions and identify the degree of data loss.</li>
<li> Have the business application users access the data and determine if recent changes to production data are found in the recovered data.</li>
<li> Execute a build acceptance test (BAT) like the ones used to verify application installation and integrity. These tests often point out data anomalies.</li>
<li> Have a process and tools in place for tracking and resolving data problems after you return the applications to production.</li>
</ul>
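<p>Several of these checks reduce to simple comparisons once the data is in hand. The sketch below assumes you can extract three things: the timestamp of the last successful transmission from the replication log, a set of transaction IDs from your verifying data (input files, activity logs, emails), and order headers and line items from the recovered database. The function names and the order/line-item layout are hypothetical and would need to be adapted to your own schema.</p>

```python
from collections import Counter
from datetime import datetime, timedelta


def data_currency(last_replicated: datetime, failure_time: datetime) -> timedelta:
    """Currency: how far the DR copy lags the moment the primary site failed.

    `last_replicated` would come from the replication log entry for the last
    successful transmission.
    """
    return failure_time - last_replicated


def missing_transactions(verifying_ids, recovered_ids):
    """IDs corroborated by verifying data (input files, activity logs, emails,
    scanned forms) that never reached the recovered database."""
    return sorted(set(verifying_ids) - set(recovered_ids))


def incomplete_orders(expected_line_counts, recovered_lines):
    """Consistency: orders whose header was recovered but whose line items
    were only partially recovered.

    expected_line_counts maps order_id -> number of line items recorded on
    the order header; recovered_lines is an iterable of (order_id, line_no)
    pairs found at the DR site. Both structures are hypothetical.
    """
    found = Counter(order_id for order_id, _ in recovered_lines)
    return sorted(
        order_id
        for order_id, expected in expected_line_counts.items()
        if found[order_id] != expected
    )
```

<p>The output of these checks feeds directly into the tracking process described in the last item: each missing or incomplete transaction becomes a data problem to resolve after the application is back in production.</p>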
<p>These recommendations may not identify every issue, but they will go a long way toward meeting your RTOs and customer requirements, allowing you to start using the application while you continue verifying the recovered data.</p>
<p><em>John McWilliams, JH McWilliams &amp; Associates, Business Continuity Consultants</em></p>
]]></content:encoded>
			<wfw:commentRss>http://itknowledgeexchange.techtarget.com/it-consulting/disaster-recovery-and-data-integrity/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Mapping Application Disaster Recovery to Business Requirements</title>
		<link>http://itknowledgeexchange.techtarget.com/it-consulting/understanding-disaster-recovery/</link>
		<comments>http://itknowledgeexchange.techtarget.com/it-consulting/understanding-disaster-recovery/#comments</comments>
		<pubDate>Tue, 23 Jun 2009 12:00:42 +0000</pubDate>
		<dc:creator>ITKE</dc:creator>
				<category><![CDATA[Application testing]]></category>
		<category><![CDATA[business continuity]]></category>
		<category><![CDATA[Business Value]]></category>
		<category><![CDATA[Disaster Recovery]]></category>
		<category><![CDATA[IT consultant]]></category>
		<category><![CDATA[IT Infrastructure]]></category>
		<category><![CDATA[Security]]></category>

		<guid isPermaLink="false">http://itknowledgeexchange.techtarget.com/it-consulting/?p=139</guid>
		<description><![CDATA[Question: Now that my organization has acquired space at a remote co-location data center and we&#8217;ve installed hardware, what do we need to consider in setting up recovery for our critical business applications? While it would be impossible in this forum to go into all the possible strategies that you could employ for application recovery, [...]]]></description>
				<content:encoded><![CDATA[<p><em><strong>Question: </strong>Now that my organization has acquired space at a remote co-location data center and we&#8217;ve installed hardware, what do we need to consider in setting up recovery for our critical business applications?</em></p>
<p>While it would be impossible in this forum to go into all the possible strategies that you could employ for application recovery, this answer describes the areas you should consider when developing a recovery solution for your company.</p>
<p>Before you think about any technology, remember that disaster recovery is really more about business risk management. As such, start by meeting with the business owners of each application to identify its recovery requirements, such as the recovery time objective (RTO), recovery point objective (RPO), end user workload, and any other applications or services the application depends on. In short, understand the main parameters of your recovery solution from the business perspective first. Keep in mind that the business owners may not be familiar with the technological underpinnings of the application, so involve the application support staff. This ensures a full understanding of the recovery requirements, so that managers can make reasonable decisions based on what is achievable with the current technology and architecture.</p>
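<p>Once the business owners have stated RTOs, RPOs, and dependencies, it helps to record them in a structured form so that the recovery sequence can be derived rather than guessed. The sketch below is one way to do that, assuming Python 3.9 or later for the standard-library <code>graphlib</code> module. The <code>AppRequirements</code> record, the rule "dependencies first, then tightest RTO," and the example applications are illustrative assumptions, not a prescribed method.</p>

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class AppRequirements:
    name: str
    rto_hours: float    # maximum tolerable downtime agreed with the business owner
    rpo_minutes: float  # maximum tolerable data loss; drives the replication design
    depends_on: list[str] = field(default_factory=list)


def recovery_order(apps):
    """Order applications so that every dependency is recovered first.

    Among applications whose dependencies are all satisfied, the one with the
    tightest RTO goes first. Dependencies that are not listed as applications
    themselves (shared services) are treated as the most urgent.
    """
    by_name = {app.name: app for app in apps}

    def urgency(name):
        return by_name[name].rto_hours if name in by_name else 0.0

    sorter = TopologicalSorter({app.name: app.depends_on for app in apps})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    order = []
    while sorter.is_active():
        for name in sorted(sorter.get_ready(), key=urgency):
            order.append(name)
            sorter.done(name)
    return order


if __name__ == "__main__":
    # Hypothetical requirements gathered from business owners.
    apps = [
        AppRequirements("orders", rto_hours=4, rpo_minutes=15, depends_on=["database", "auth"]),
        AppRequirements("database", rto_hours=2, rpo_minutes=5),
        AppRequirements("auth", rto_hours=1, rpo_minutes=60),
        AppRequirements("reporting", rto_hours=48, rpo_minutes=1440, depends_on=["database"]),
    ]
    print(recovery_order(apps))
```

<p>A circular dependency surfaces immediately as an error, which is itself a useful finding to take back to the application support staff.</p>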
<p>From here, design your recovery solution while considering the following:</p>
<ul type="disc">
<li><strong>Server power</strong> &#8211; How much processing power will the recovered application need at the DR site? Will the DR site support production only, or will development activities also take place there?</li>
<li><strong>Replication</strong> &#8211; How much data has to be available at the DR site, how fresh does it need to be, and how will it get to the DR site?</li>
<li><strong>Network</strong> &#8211; How much network capacity will be needed to support data replication and end user access, and what protocols should the network support?</li>
<li><strong>End user access</strong> &#8211; How will users access the application while it runs at the recovery site?</li>
<li><strong>Application installation and code management</strong> &#8211; How do you ensure that the latest version of the application is available at the DR site?</li>
<li><strong>Application recovery process</strong> &#8211; What is the step-by-step process for recovering the application, and who will execute it?</li>
<li><strong>Change control</strong> &#8211; How do you ensure that changes to the production version of the application are reflected in the DR environment?</li>
<li><strong>Testing</strong> &#8211; How will you test the resources at the DR site and the recovery process?</li>
</ul>
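<p>The network question usually starts as simple arithmetic: the link has to carry the peak rate of changed data for replication plus the traffic of users working at the DR site. The sketch below shows a rough first-pass estimate, assuming decimal units (1 GB = 8,000 Mbit) and an assumed 25% allowance for protocol overhead. All input figures are hypothetical, and real sizing should be based on measured change rates and user traffic.</p>

```python
def replication_mbps(peak_change_gb_per_hour, compression_ratio=1.0, overhead=1.25):
    """Link capacity (Mbit/s) needed to keep replication from falling behind
    at peak. Uses decimal units: 1 GB = 8,000 Mbit. `overhead` is an assumed
    allowance for protocol overhead and retransmissions."""
    megabits_per_hour = peak_change_gb_per_hour * 8_000
    return megabits_per_hour / 3600 / compression_ratio * overhead


def end_user_mbps(concurrent_users, kbps_per_user):
    """Capacity for users reaching the application at the DR site."""
    return concurrent_users * kbps_per_user / 1000


if __name__ == "__main__":
    # Hypothetical figures: 18 GB of changed data in the busiest hour,
    # 500 concurrent users at roughly 100 kbit/s each.
    replication = replication_mbps(18)
    users = end_user_mbps(500, 100)
    print(f"replication {replication:.0f} Mbit/s, users {users:.0f} Mbit/s, "
          f"total {replication + users:.0f} Mbit/s")
```

<p>With these figures, replication needs about 50 Mbit/s and end users another 50 Mbit/s. A link sized below the replication peak does not fail outright; instead, the replication backlog grows during busy periods, which quietly widens the data loss window the business agreed to.</p>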
<p>When designing your recovery solution, think of it as an ongoing resource that must be managed with the same attention as your production environment. That&#8217;s because it might someday <span style="text-decoration: underline">be</span> your production environment.</p>
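<p>Managing the DR environment like production includes catching drift between the two. A minimal sketch, assuming each environment can export a simple manifest of component names and deployed versions (how you collect it, whether from a configuration management tool or a deployment log, is left open), is to compare the manifests on a schedule and treat any difference as a change control item. The function name and manifest contents are hypothetical.</p>

```python
def config_drift(production, dr_site):
    """Compare component -> version manifests from the two environments.

    Returns {component: (production_version, dr_version)} for every component
    that differs; None means the component is missing from that environment.
    """
    drift = {}
    for component in sorted(set(production) | set(dr_site)):
        prod_version = production.get(component)
        dr_version = dr_site.get(component)
        if prod_version != dr_version:
            drift[component] = (prod_version, dr_version)
    return drift


if __name__ == "__main__":
    # Hypothetical manifests exported from each environment.
    production = {"app": "4.2.1", "db_schema": "118", "report_engine": "7.0"}
    dr_site = {"app": "4.2.0", "db_schema": "118"}
    for component, (prod, dr) in config_drift(production, dr_site).items():
        print(f"{component}: production={prod} dr={dr}")
```

<p>An empty result is what you want to see before every DR test; anything else means the DR site would come up running something other than what production runs today.</p>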
<p><em>John McWilliams, JH McWilliams &amp; Associates, Business Continuity Consultants</em></p>
]]></content:encoded>
			<wfw:commentRss>http://itknowledgeexchange.techtarget.com/it-consulting/understanding-disaster-recovery/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
