Adventures in Data Center Automation

Apr 24 2008   3:57PM GMT

Performance and Availability vs. Analytics - Part 3 of ?



Posted by: Ryan Shopp
DataCenter

In part one we talked about data collection and the basic functionality offered by Performance/Capacity/Availability Management for Data Center Automation. In part two we touched on how some vendors are taking this collected data and extending its effectiveness in the name of proactive, even predictive, analysis while attempting to drive false-positive issue identification to absolute zero. Now, how do we take this information and apply it back to the Data Center Automation Blueprint we’ve been working on?

We currently have two distinct entities combined within the Performance & Availability functional category:

  • Reactive events/alarms/alerts…something has already happened, and a single pane of glass, event correlation or root cause analysis is attempting to automatically sort through all this to weed out the less important or false-positive concerns.
  • Proactive determination…we are attempting to identify issues before they happen…many times these proactive tools also feed their event/alert/alarm information up to the reactive single-pane-of-glass consoles.

These two distinct areas are both very necessary, and I believe that with further automation they will continue to consolidate. So I believe they should stay together as one entity and we should continue to push vendors to pull them together further. What we really want is a unified list of events that are 100% accurate and detailed instances of 1) things that will be going wrong very soon and 2) immediate conditions/concerns that are upon us due to unforeseen or uncontrollable circumstances.  A term I’ve seen used in the past for these conjoined areas, mostly in the service provider space, is “service assurance.” I really like this term as it’s all about assuring our data center is providing us with the business services we have come to expect from it.  Maybe I’ll use that in the blueprint going forward.
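To make the "unified list of events" idea a little more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration only: the `Event` fields, the integer severity scale and the `unify` function are hypothetical and do not come from any vendor's product. It simply merges a reactive feed and a proactive feed, keeps one entry per resource/condition pair, and sorts by urgency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    resource: str    # hypothetical resource name, e.g. "web01"
    condition: str   # hypothetical condition label, e.g. "high_latency"
    severity: int    # assumed scale: higher means more urgent
    source: str      # "reactive" or "proactive"


def unify(reactive, proactive):
    """Merge both feeds into one list, one entry per (resource, condition)."""
    best = {}
    for ev in list(reactive) + list(proactive):
        key = (ev.resource, ev.condition)
        # When both feeds report the same condition, keep the more severe report.
        if key not in best or ev.severity > best[key].severity:
            best[key] = ev
    # Most urgent first.
    return sorted(best.values(), key=lambda e: e.severity, reverse=True)
```

A real service assurance console would also need time windows, topology awareness and root cause logic; this only shows the shape of the merged list.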

One other area where I encourage, and expect to see, continued convergence into this “service assurance” category is not performance or outage related at all: security and privacy events. When an abnormality caused by a worm leads to an outage or degraded performance, there is no reason the metrics should not be correlated together rather than sitting in the separate silos of today. But that is a whole other novel.

So with that said, I’m planning to update the Data Center Automation Blueprint to relabel Performance & Availability to Service Assurance (Performance, Capacity & Availability).  Now, about analytics? That’s the next topic to tackle in part 4 of this series.

Apr 23 2008   4:50PM GMT

links for 2008-04-23



Posted by: Ryan Shopp
DataCenter


Apr 22 2008   4:44PM GMT

links for 2008-04-22



Posted by: Ryan Shopp
DataCenter


Apr 21 2008   3:11PM GMT

Performance and Availability vs. Analytics - Part 2 of ?



Posted by: Ryan Shopp
DataCenter, Analytics, eg innovations, Indicative, HP Software, Integrien, Netuitive, BMC, NetQoS, Opnet

So in part 1 we talked through the collection of performance/capacity/availability data. Next up: where innovations using this collected data are taking us.

The next level of Performance & Availability I previously mentioned is coming from a variety of companies doing cross-metric analysis or even automated behavioral analytics. These vendors typically classify themselves as Service Level Management, some type of Business Service Management, or Analytics. They either leverage a variety of data collection entities or they themselves offer capabilities that span multiple sources to elevate and/or automate results in the hope of proactive (even predictive) identification of issues with minimal (striving for zero) false positives. Here are some more thoughts on each of these areas:

  • Service Level Management vendors seem to focus on leveraging a variety of data sources/metrics and normalizing them into very detailed quality of service/performance agreements between a service provider and their customers (in some situations the service provider is the internal IT department itself).
  • Business Service Management vendors in the realm of performance/capacity/availability seem to focus on mapping each business service (e.g., the application(s) and the infrastructure that supports them) from an end-to-end perspective. Then, if any component in the mapped bundle shows signs of trouble, an alert is raised for proactive resolution.  NOTE:  BSM is a very broad term - I’m focusing it down here on just this functional area; I’m not talking about a comprehensive dashboard spanning all functional areas, service desks, etc.
  • Real-time Analytics vendors seem to leverage a variety of time-series metrics from various collection sources mapped together appropriately (like BSM), then use behavioral algorithms to dynamically determine normal behavior. If something deviates from that behavior, an alarm is raised in real time (now we’re getting predictive).
  • Historical Analytics or modeling/simulation vendors seem to leverage a variety of data sources coupled with other cross-functional details (e.g., CMDB, configuration settings) to establish a model and expected behavior. Then you can tweak, tune or even re-design to see impact of potential changes, upgrades, etc.
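As a rough illustration of the "dynamically determine normal behavior" idea from the real-time analytics bullet, here is a minimal Python sketch. It is not any vendor's algorithm: the rolling mean and standard deviation baseline, the `BaselineDetector` class and its `window`, `k` and `min_samples` parameters are all assumptions picked for simplicity.

```python
from collections import deque
from statistics import mean, stdev


class BaselineDetector:
    """Learns 'normal' from recent samples and flags large deviations."""

    def __init__(self, window=20, k=3.0, min_samples=5):
        self.history = deque(maxlen=window)  # only the most recent samples
        self.k = k                           # allowed deviation, in std devs
        self.min_samples = min_samples       # samples needed before judging

    def observe(self, value):
        """Return True if value deviates from the baseline, then record it."""
        anomalous = False
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            sigma = stdev(self.history)
            # Guard against a perfectly flat history (sigma == 0).
            anomalous = abs(value - mu) > self.k * max(sigma, 1e-9)
        self.history.append(value)
        return anomalous
```

Note that this sketch folds anomalous samples back into the baseline and ignores time-of-day or day-of-week patterns, both of which a production behavioral engine would presumably handle more carefully.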

We could probably come up with better names for these higher-level performance/capacity/availability areas, but Service Level Management, Business Service Management and Performance Analytics are the ones being marketed today.

One area of data collection and reporting that does continue to innovate is the end-user, passive traffic flow perspective. This first popped up on the scene back in the late 1990s, and since then there seems to have been a major resurgence in vendors focusing on specific, mission-critical applications. Since these agents typically reside on and monitor from the desktop or mobile device, I’ve placed them beyond the scope and control of Data Center Automation. Some vendors are doing the end-to-end monitoring (as mentioned before) from an appliance in the data center, making some TCP/IP assumptions (e.g., NetQoS, CA Wily).

So now we’ve discussed Performance/Capacity/Availability management and how it also has analytics occurring within that functional silo. So what does that mean for the Data Center Automation Blueprint from my perspective? Stay tuned for part 3.


Apr 18 2008   4:45PM GMT

links for 2008-04-18



Posted by: Ryan Shopp
DataCenter


Apr 17 2008   9:58PM GMT

Performance and Availability Management vs. Analytics - Part 1 of ?



Posted by: Ryan Shopp
nimsoft, cittio, eg innovations, Alcatel-Lucent, Analytics, Apparent Networks, Brix Networks, Compuware, Entuity, Fluke Networks, Gomez, Groundwork, Hyperic, Indicative, Application monitoring, DCAB, Firescope, HP Software, IBM Tivoli, InfoVista, Integrien, NetScout, Netuitive, Solarwinds, Systems monitoring, BMC, Quest Software, NetIQ, Network monitoring, Packet Design, Performance management, CA, Keynote, Nagios, NetQoS, Network Instruments, OpenNMS, Opnet, Xangati, ZenOSS

I’ve had an opportunity to be briefed over the past couple of months by a number of vendors currently in the Performance & Availability category of the Data Center Automation Blueprint (e.g., CITTIO, eG Innovations, InfoVista, Integrien, Nimsoft).  With that and some further research I think I’m ready to take another pass at this area of the blueprint.

First up, all these vendors use a variety of techniques to collect a variety of data from as many points of view as possible.

  • Their own server agents that collect data about systems, services, applications, databases, etc., and then aggregate it back to a centralized console
  • Agent-less centralized consoles that leverage infrastructure-standard communications protocols (e.g., SNMP, RPC, ODBC, WMI, SSH, TCP, UDP, HTTP) to query or connect remotely to collect data from networks, systems, services, applications, databases, etc.
  • Passive traffic flow collectors (which can be agents or appliances) that are either in-line with the traffic flows or receive an exact copy of all traffic flows traversing a network connection (e.g., switch port uplink) through hardware vendor capabilities (e.g., spanning)

These data collection points can be statistics about a specific IT infrastructure resource: physical devices, virtual devices, physical connections, virtual connections, or resources running on physical or virtual devices like services, processes, applications, databases, etc.

Or the data collection points can be traffic flows or end-to-end specifics, including passive traffic flows, synthetic transactions or even something as simple as pinging from remote points.
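For a feel of what the simplest end-to-end check looks like, here is a minimal Python sketch of a remote up/down and latency probe. It is a stand-in under stated assumptions: a true ping uses ICMP, which usually needs raw-socket privileges, so this hypothetical `tcp_probe` function times a TCP connection instead. The function name, parameters and return shape are illustrative only.

```python
import socket
import time


def tcp_probe(host, port, timeout=2.0):
    """Return (is_up, latency_seconds) for one TCP connection attempt."""
    start = time.perf_counter()
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.perf_counter() - start
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False, None
```

Run on a schedule from several remote locations, results like these become the up/down status and latency series described here; a full synthetic transaction would go further and step through an actual application workflow.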

Metrics that are captured typically revolve around throughput, errors, utilization, latency, up/down status, etc. (there are way too many to mention here).

After saying all this, there is a list a mile long of vendors (a number already noted on the DCAB) that capture these predominantly time-series-oriented data points about performance, capacity and availability using any or all of these methods or vantage points (I know, passive traffic flows are not time-series data, but patterns/usage/performance etc. can be determined from them).

So, with all that data, what most of these vendors offer is two primary types of functionality: 1) a variety of graphical reports and 2) metric thresholding capabilities that produce a list of outstanding issues/alerts/alarms/events/concerns (whatever you want to call them).
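Metric thresholding in its most basic form can be sketched in a few lines of Python. This is only an illustration of the general technique, not how any particular product works; the `Sample` type, the metric names and the limits in `THRESHOLDS` are all made-up assumptions.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    resource: str   # hypothetical resource name, e.g. "web01"
    metric: str     # hypothetical metric name, e.g. "cpu_util_pct"
    value: float


# Assumed static upper limits per metric; real products make these configurable.
THRESHOLDS = {"cpu_util_pct": 90.0, "latency_ms": 250.0}


def check_thresholds(samples, thresholds=THRESHOLDS):
    """Return one alert string per sample that exceeds its metric's limit."""
    alerts = []
    for s in samples:
        limit = thresholds.get(s.metric)
        # Metrics without a configured threshold are ignored.
        if limit is not None and s.value > limit:
            alerts.append(f"{s.resource}: {s.metric}={s.value} exceeds {limit}")
    return alerts
```

Fixed limits like these are exactly what tends to generate false positives, since a value that is normal at month-end may be alarming on a quiet Sunday.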

Ok, so why did I organize and point all this out? So I can draw a line around where most of the innovation, from my perspective, is occurring. The above is for the most part, in my eyes, a commodity these days. Most companies have had collection/reporting/thresholding capabilities spanning multiple technology silos since pretty close to the start of enterprise networking. The reports continue to get fancier, the number of data sources a single product collects from continues to expand, etc.  Another sign of commoditization is the variety of business models through which these products are offered: open source, managed service providers, internet-distributed products, appliance deployment models with indirect sales forces, large enterprise direct sales forces, completely flexible frameworks for service providers to basically “build their own,” etc.

For the most part where the majority of technical innovation is occurring these days is the next layer above this data collection, reporting and alerting. Now let me say this, yes…there is some great innovation still occurring in the data collection realm (e.g., Xangati offering real-time Netflow down to a user level, PacketDesign monitoring routing messages, NetQoS leveraging advanced TCP/IP theory to analyze where end-to-end bottlenecks are occurring). But, for the most part these new data sources are being used to augment or replace currently deployed data sources in an attempt to see things from either as many vantage points or the best vantage points to avoid surprises within their unique enterprise IT environment.

So where is the serious innovation coming from…stay tuned for part 2.


Apr 17 2008   4:40PM GMT

links for 2008-04-17



Posted by: Ryan Shopp
DataCenter


Apr 16 2008   4:43PM GMT

links for 2008-04-16



Posted by: Ryan Shopp
DataCenter


Apr 15 2008   4:47PM GMT

links for 2008-04-15



Posted by: Ryan Shopp
DataCenter


Apr 14 2008   9:45PM GMT

Mapping HP Software to the Data Center Automation Blueprint



Posted by: Ryan Shopp
DataCenter, Analytics, CMDB, DCAB, HP Software, Integrien, Netuitive

I recently had the chance to chat with an executive at HP to break down what pieces and parts ended up where after the Peregrine, Mercury and Opsware acquisitions. Here is my attempt at mapping them to the Data Center Automation Blueprint.

  • Configuration & Change
    • for networks - Network Automation Software (formerly Opsware, formerly Rendition)
    • for servers - Server Automation Software (formerly Opsware)
    • for storage - Storage Essentials Software (formerly Appilog)
  • Resource Reconciliation
    • Universal CMDB software (formerly Mercury, formerly AppLogic)
  • Process Orchestration - Operations Orchestration Software (formerly Opsware, formerly iConclude)

The focus of our call was around the above areas…from here I’m trying to piece things together using the website and the knowledge that:

  • The Business Service Management group is where all the monitoring products reside: Mercury (excluding QA products) and the original OpenView monitoring products.  There still seems to be a ton of overlap here…
  • The IT Service Management group is where Peregrine and the original HP Service Desk products reside.

So that means for the other functional areas of the Data Center Automation Blueprint we have:

  • Analytics
    • HP Dashboard software & HP Business Service Level Management  - each offers a unified user interface consolidating reports and statistics spanning multiple other product lines, from Performance & Availability to IT Service Desks.
  • Performance & Availability
    • Products that are event/availability centric for the Data Center Infrastructure
      • HP Network Node Manager software - agent-less performance and availability software for networks
      • HP Operations Manager software - agent-based performance and availability software for servers/services/applications/databases.
      • HP Problem Isolation software - agent-less performance and availability software for servers/services/applications/databases.
      • HP Process Monitor software & HP TransactionVision software - agent-based performance and availability software for services/applications/databases
    • Products that are trend/capacity centric for the Data Center Infrastructure
      • HP Performance Insight software - agent-less time series performance and capacity reporting software for networks that also consolidates data for reporting on servers/services/applications/databases
      • HP SiteScope software - agent-less performance and availability software for servers/services/applications/databases
      • HP Performance Manager software & HP GlancePlus software - agent-based time series performance & capacity statistics collected from servers/services/applications/databases.
      • HP Real-User Monitor software - monitors applications/services/data traffic flows
  • Security & Prevention
    • HP WebInspect software - web application vulnerability scanning
      • **NOTE:  In my eyes, this is more a security extension to the QA and Testing products from Mercury than part of a security & prevention software portfolio like that of Symantec, McAfee or EMC RSA.

So there we have it (I think).  Now please correct me if I’m wrong, but one thing I didn’t see in the portfolio was anything that does proactive performance analytics like Integrien, Netuitive or ProactiveNet (acquired by BMC).  Besides that, from an outside perspective they merely have a very confusing Performance & Availability functional category (due to Mercury/OpenView overlap) that does seem to have all the pieces.  So for HP Software, it’s just about executing and tying things together based on end-to-end use cases from their customers.  One other area to keep an eye on is Configuration & Change for databases (from companies like GridApp).  As more and more enterprises deploy the Server Automation Software, they may start wanting to get more detailed in the world of databases; if so, that may be a build/buy decision point to consider in the future.  One other thing, based on what I’ve read, is that all these products are busy making sure they extend beyond physical systems support into the virtualized world.

I guess one outstanding thing to ponder is why shouldn’t HP also offer a comprehensive security & prevention offering to help them better compete against IBM?  At some point many people assume/expect security and operations to converge, why not help drive that with a comprehensive security offering?