Adventures in Data Center Automation:

Integrien

May 20 2008   10:47PM GMT

Performance and Availability vs. Analytics - Part 4 of 5



Posted by: Ryan Shopp
DataCenter, Integrien, Managed Objects, Firescope, Opnet

Sorry for the delay but family time called as we were blessed with a baby boy a couple weeks ago.  So back on track; in part one we hit data collection, part two talked about applying analytics and business/service mapping and part three we hit on evolving the Data Center Automation Blueprint from Performance & Availability to Service Assurance.  So what does that mean for analytics?

Well, here is where it gets tricky.  I believe there are two types of analytics that are sometimes confused or blended together…

Type 1: Per functional category - meaning software automation that uses algorithms, automated analysis, etc. focused on one of the 3 functional categories (i.e., Performance & Availability, Configuration & Change, Security & Protection).

Type 2: Cross-functional - like Process Orchestration & Resource Reconciliation, you have a rolled-up, aggregated view of metrics that are mapped to the business (beyond IT-specific metrics).  By most definitions, this is what is commonly called Business Service Management.
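
As a rough illustration of the Type 2 roll-up idea, here is a minimal Python sketch that maps component status up to the business services that depend on them. The service names, component names and the "worst status wins" rule are assumptions made for the example, not something taken from any vendor's product.

```python
from enum import IntEnum


class Status(IntEnum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


# Hypothetical mapping of business services to the components that support them.
SERVICE_MAP = {
    "online-ordering": ["web-01", "app-01", "db-01"],
    "payroll": ["app-02", "db-01"],
}


def roll_up(service_map, component_status):
    """Return the worst component status for each business service."""
    result = {}
    for service, components in service_map.items():
        # Assumption: a component with no reported status counts as OK.
        result[service] = max(
            (component_status.get(name, Status.OK) for name in components),
            default=Status.OK,
        )
    return result
```

Calling `roll_up(SERVICE_MAP, {"db-01": Status.CRITICAL})` marks both services as critical, because both depend on the shared database. That is the kind of business-level view a per-category (Type 1) tool would not give on its own.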

Some quick examples… companies like Integrien and Opnet fall into type 1, while companies like Managed Objects and Firescope map closer to type 2.  Now this all gets very confusing, as there are overlaps where vendors who do mostly type 1 analytics and some type 2 analytics claim both and even call themselves BSM vendors… meanwhile, the same occurs where vendors doing mostly type 2 analytics (aka BSM) also claim to do some type 1.  I’m not a BSM guru, but I do exchange blogs/emails with some and would love to hear them chime in on this thread.  Based on this feedback and some further reading over at my favorite BSM blog, my next post will wrap up this series and I’ll update the Data Center Automation Blueprint.

Apr 21 2008   3:11PM GMT

Performance and Availability vs. Analytics - Part 2 of ?



Posted by: Ryan Shopp
DataCenter, Analytics, eg innovations, Indicative, HP Software, Integrien, Netuitive, BMC, NetQoS, Opnet

So in part 1 we talked through the collection of performance/capacity/availability data. Next up is focused on where innovations using this collected data are taking us.

The next level of Performance & Availability I previously mentioned is coming from a variety of companies doing cross-metric analysis or even automated behavioral analytics. These vendors typically classify themselves as Service Level Management, some type of Business Service Management, or Analytics. They either leverage a variety of data collection entities or they themselves offer capabilities that span multiple sources to elevate and/or automate results in the hope of proactive (even predictive) identification of issues with minimal (striving for zero) false positives. Here are some more thoughts on each of these areas:

  • Service Level Agreement vendors seem to focus on leveraging a variety of data sources/metrics and normalizing them into very detailed quality of service/performance agreements between a service provider and their customers (in some situations the service provider is the internal IT department itself).
  • Business Service Management vendors in the realm of performance/capacity/availability seem to focus on the mapping of each business service (e.g., application(s) and the infrastructure that supports those application(s) from an end-to-end perspective). Then, if any component in the mapped bundle shows signs of trouble, an alert is raised for proactive resolution.  NOTE:  BSM is a very broad term - I’m focusing it down here on just this functional area; I’m not talking about a comprehensive dashboard spanning all functional areas, service desks, etc.
  • Real-time Analytic vendors seem to leverage a variety of time-series metrics from various collection sources mapped together appropriately (like BSM), then using behavioral algorithms they dynamically determine normal behavior. If something deviates from that behavior, then in real time it raises an alarm (now we’re getting predictive).
  • Historical Analytics or modeling/simulation vendors seem to leverage a variety of data sources coupled with other cross-functional details (e.g., CMDB, configuration settings) to establish a model and expected behavior. Then you can tweak, tune or even re-design to see the impact of potential changes, upgrades, etc.
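
To make the real-time analytics bullet a bit more concrete, here is a minimal Python sketch of "dynamically determine normal behavior, then alarm on deviation". It uses a rolling mean and standard deviation over recent samples, which is only one simple way to do it; the class name, window size and tolerance are assumptions for illustration, and commercial products use far more sophisticated algorithms.

```python
from collections import deque
from statistics import mean, stdev


class BehavioralBaseline:
    """Learns 'normal' from a sliding window of recent samples."""

    def __init__(self, window=20, tolerance=3.0, min_samples=5):
        self.samples = deque(maxlen=window)  # oldest samples fall off automatically
        self.tolerance = tolerance           # allowed deviation, in standard deviations
        self.min_samples = min_samples       # do not judge until some history exists

    def observe(self, value):
        """Return True if value deviates from learned behavior, then learn from it."""
        alarm = False
        if len(self.samples) >= self.min_samples:
            mu = mean(self.samples)
            sigma = stdev(self.samples)
            # Guard against a perfectly flat history (sigma == 0).
            alarm = abs(value - mu) > self.tolerance * max(sigma, 1e-9)
        self.samples.append(value)
        return alarm
```

Feeding it response times that hover around 100 ms raises nothing, while a sudden 250 ms sample trips the alarm without anyone having configured a fixed limit. One design caveat of this sketch: anomalous samples are still added to the window, so a sustained problem eventually becomes the new "normal".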

We could probably come up with better names for these higher-level performance/capacity/availability areas, but Service Level Management, Business Service Management and Performance Analytics are the ones being marketed today.

One area of data collection and reporting that does continue to innovate is the end-user, passive traffic flow perspective. This first popped up on the scene back in the late 1990s, and since then there seems to have been a major resurgence in vendors focusing on specific, mission-critical applications. Since these agents typically reside and monitor from the desktop or mobile device perspective, I’ve placed them beyond the scope and control of Data Center Automation. Some vendors are doing the end-to-end monitoring (as mentioned before) from an appliance in the data center, making some TCP/IP assumptions (e.g., NetQoS, CA Wily).

So now we’ve discussed Performance/Capacity/Availability management and how it also has analytics occurring within that functional silo. So what does that mean to the Data Center Automation Blueprint from my perspective? Stay tuned for part 3.


Apr 17 2008   9:58PM GMT

Performance and Availability Management vs. Analytics - Part 1 of ?



Posted by: Ryan Shopp
nimsoft, cittio, eg innovations, Alcatel-Lucent, Analytics, Apparent Networks, Brix Networks, Compuware, Entuity, Fluke Networks, Gomez, Groundwork, Hyperic, Indicative, Application monitoring, DCAB, Firescope, HP Software, IBM Tivoli, InfoVista, Integrien, NetScout, Netuitive, Solarwinds, Systems monitoring, BMC, Quest Software, NetIQ, Network monitoring, Packet Design, Performance management, CA, Keynote, Nagios, NetQoS, Network Instruments, OpenNMS, Opnet, Xangati, ZenOSS

I’ve had an opportunity to be briefed over the past couple months by a number of current Data Center Automation Blueprint’s Performance & Availability vendors (e.g., CITTIO, eG Innovations, InfoVista, Integrien, Nimsoft).  With that and some further research I think I’m ready to take another pass at this area of the blueprint.

First up, all these vendors use a variety of techniques to collect a variety of data from as many points of view as possible.

  • Their own server agents that collect data about systems, services, applications, databases, etc and then aggregate back to a centralized console
  • Agent-less centralized consoles that leverage infrastructure standard communications protocols (e.g., SNMP, RPC, ODBC, WMI, SSH, TCP, UDP, HTTP) to query or connect remotely to collect data from networks, systems, services, applications, databases, etc.
  • Passive traffic flow collectors (which can be agents or appliances) that are either in-line with the traffic flows or receive an exact copy of all traffic flows traversing a network connection (e.g., switch port uplink) through hardware vendor capabilities (e.g., spanning)

These data collection points can be statistics about a specific IT infrastructure resource: physical devices, virtual devices, physical connections, virtual connections or resources running on physical or virtual devices like services, processes, applications, databases, etc.

Or the data collection points can be traffic flows or end-to-end specifics, including passive traffic flows, synthetic transactions or even something as simple as pinging from remote points.
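
For readers new to the term, a synthetic transaction is just a scripted request that is run on a schedule and timed. Here is a minimal Python sketch, assuming the "transaction" is any callable supplied by the caller (an HTTP request, a database query, a login script); the function name and the `slow_after_s` parameter are made up for illustration.

```python
import time


def run_synthetic_transaction(transaction, slow_after_s=2.0):
    """Run one scripted transaction and report up/down status and response time."""
    start = time.perf_counter()
    try:
        transaction()  # any failure (refused connection, error, etc.) counts as down
        up = True
    except Exception:
        up = False
    elapsed = time.perf_counter() - start
    return {
        "up": up,
        "latency_s": elapsed,
        # Only a successful transaction can be classified as slow.
        "slow": up and elapsed > slow_after_s,
    }
```

Run from several remote points, the same probe gives both an up/down signal and a latency measurement per vantage point.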

Metrics that are captured typically revolve around throughput, errors, utilization, latency, up/down status, etc. (there are way too many to mention here).

After saying all this, there is a list a mile long of vendors (a number already noted on the DCAB) that capture these predominantly time-series oriented data points about performance, capacity and availability using any/all of these methods or vantage points (I know, passive traffic flows are not time-series data, but patterns/usage/performance etc. can be determined from them).

So, with all that data, what most of these vendors offer are two primary types of functionality: 1) a variety of graphical reports and 2) metric thresholding capabilities that produce a list of outstanding issues/alerts/alarms/events/concerns (whatever you want to call them).
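
The thresholding half of that is simple enough to sketch. Below is a minimal Python example of static metric thresholds producing a list of outstanding alerts; the metric names and limits are hypothetical, and real products layer severity levels, hysteresis and de-duplication on top of this.

```python
# Hypothetical metric names and upper limits, for illustration only.
THRESHOLDS = {
    "cpu_utilization_pct": 90.0,
    "interface_errors_per_min": 10.0,
    "latency_ms": 250.0,
}


def evaluate_thresholds(samples, thresholds=THRESHOLDS):
    """Return an alert string for every (resource, metric, value) over its limit."""
    alerts = []
    for resource, metric, value in samples:
        limit = thresholds.get(metric)  # metrics without a limit are ignored
        if limit is not None and value > limit:
            alerts.append(f"{resource}: {metric}={value} exceeds {limit}")
    return alerts
```

The weakness is visible right in the code: the limits are fixed numbers someone had to pick in advance, regardless of time of day or what is normal for that particular resource.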

Ok, so why did I organize and point all this out? So I can draw a line around where most of the innovation, from my perspective, is occurring. The above is, for the most part, a commodity in my eyes these days. Most companies have had collection/reporting/thresholding capabilities spanning multiple technology silos since pretty close to the start of enterprise networking. The reports continue to get fancier, the number of data sources a single product collects from continues to expand, etc.  Another sign of commoditization is the variety of business models offering these products: open source, managed service providers, internet-distributed products, appliance deployment models and indirect sales forces, large enterprise direct sales forces, completely flexible frameworks for service providers to basically “build their own,” etc.

For the most part, the majority of technical innovation these days is occurring in the next layer above this data collection, reporting and alerting. Now let me say this: yes… there is some great innovation still occurring in the data collection realm (e.g., Xangati offering real-time Netflow down to a user level, Packet Design monitoring routing messages, NetQoS leveraging advanced TCP/IP theory to analyze where end-to-end bottlenecks are occurring). But, for the most part, these new data sources are being used to augment or replace currently deployed data sources in an attempt to see things from either as many vantage points or the best vantage points to avoid surprises within their unique enterprise IT environment.

So where is the serious innovation coming from…stay tuned for part 2.


Apr 14 2008   9:45PM GMT

Mapping HP Software to the Data Center Automation Blueprint



Posted by: Ryan Shopp
DataCenter, Analytics, CMDB, DCAB, HP Software, Integrien, Netuitive, GridApp Systems

I recently had the chance to chat with an executive at HP to break down what pieces and parts ended up where after the Peregrine, Mercury and Opsware acquisitions. Here is my attempt at mapping them to the Data Center Automation Blueprint.

  • Configuration & Change
    • for networks - Network Automation Software (formerly Opsware, formerly Rendition)
    • for servers - Server Automation Software (formerly Opsware)
    • for storage - Storage Essentials Software (formerly AppIQ)
  • Resource Reconciliation
    • Universal CMDB software (formerly Mercury, formerly Appilog)
  • Process Orchestration - Operations Orchestration Software (formerly Opsware, formerly iConclude)

The focus of our call was around the above areas…from here I’m trying to piece together by using the website and the knowledge that:

  • The Business Service Management group is where all the monitoring products reside; Mercury (excluding QA products) and original OpenView monitoring products. There still seems like a ton of overlap here…
  • The IT Service Management group is where Peregrine and the original HP Service Desk products reside.

So that means for the other functional areas of the Data Center Automation Blueprint we have:

  • Analytics
    • HP Dashboard software & HP Business Service Level Management - each offers a unified user interface consolidating reports and statistics spanning multiple other product lines, from Performance & Availability to IT Service Desks.
  • Performance & Availability
    • Products that are event/availability centric for the Data Center Infrastructure
      • HP Network Node Manager software - agent-less performance and availability software for networks
      • HP Operations Manager software - agent-based performance and availability software for servers/services/applications/databases.
      • HP Problem Isolation software - agent-less performance and availability software for servers/services/applications/databases.
      • HP Process Monitor software & HP TransactionVision software - agent-based performance and availability software for services/applications/databases
    • Products that are trend/capacity centric for the Data Center Infrastructure
      • HP Performance Insight software - agent-less time series performance and capacity reporting software for networks that also consolidates data for reporting on servers/services/applications/databases
      • HP SiteScope software - agent-less performance and availability software for servers/services/applications/databases
      • HP Performance Manager software & HP GlancePlus software - agent-based time series performance & capacity statistics collected from servers/services/applications/databases.
      • HP Real-User Monitor software - monitors applications/services/data traffic flows
  • Security & Prevention
    • HP WebInspect software - web application vulnerability scanning
      • **NOTE: In my eyes, this is more a security extension to the QA and Testing products from Mercury than part of a security & prevention software portfolio like that of Symantec, McAfee or EMC RSA.

So there we have it (I think). Now please correct me if I’m wrong, but one thing I didn’t see in the portfolio was anything that does proactive performance analytics like Integrien, Netuitive or ProactiveNet (acquired by BMC). Besides that, from an outside perspective they merely have a very confusing Performance & Availability functional category (due to Mercury/OpenView overlap) that does seem to have all the pieces. So for HP Software, it’s just about executing and tying things together based on end-to-end use cases from their customers. One other area to keep an eye on is Configuration & Change for databases (from companies like GridApp). As more and more enterprises deploy the Server Automation Software, they may start wanting to get more detailed in the world of databases; if so, that may be a build/buy decision point to consider in the future. One other thing, based on what I’ve read, is that all these products are busy making sure they extend beyond physical systems support into the virtualized world.

I guess one outstanding thing to ponder is why shouldn’t HP also offer a comprehensive security & prevention offering to help them better compete against IBM? At some point many people assume/expect security and operations to converge, why not help drive that with a comprehensive security offering?


Mar 26 2008   2:03PM GMT

IT Performance Management Call for Resources; I have a dream for performance management



Posted by: Ryan Shopp
Integrien, Netuitive, RBA, Run Book Automation, BMC, Performance management

So in my last posting I called out for some links and resources that people recommend to others when it comes to understanding the variety of options and functions for Network & Application Performance Management.  Upon making the request I decided to spend a few minutes looking around.  First up for me was a quick trip over to Wikipedia to see what they have on the topic.

On the topic of Network Performance Management; there is a nice write-up  on factors that contribute to performance issues - Latency, Packet loss, retransmission, throughput.

On the topic of Application Performance Management; there were some very in-depth graphs focused around monitoring response time which I found intriguing.

On the topic of Performance Engineering; I was very surprised not only by a nice write-up of principles and perspectives related to the software development lifecycle, but also a laundry list of interesting and applicable whitepapers at the bottom.

So at this point I stopped and started pondering: is there a product out there that goes beyond grabbing statistics and reporting on them?  Some tools collect data from flows, some collect data from individual resources, some tools set up endpoints that systematically send synthetic transactions to measure response times, etc.

What do I really mean by this… is there a product that takes a troubleshooting workflow (think Run Book Automation) approach to the different steps involved with diagnosing a performance concern?  Here is what I mean…

  • Start with monitoring traffic flows for their response time
  • Automatically baseline this and when a major deviation occurs go to the next bullet point
  • Is this traffic delay specific to one type of traffic or is it affecting all traffic?
  • What is causing this anomaly? Calculate which points of the infrastructure are traversed by these traffic flows
  • Look at each input/output point on the infrastructure (e.g., interfaces) to see if there are errors, retransmissions, etc
  • If no errors, next look at each input/output point on the infrastructure to see if throughput is bottlenecked.
  • If no bottlenecks, next look at the processors/CPU on each point of the infrastructure to see if that is causing the delay
  • If no processor delays, look at…. (etc, etc, etc)
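
The steps above can be sketched as an ordered chain of checks, which is essentially what a run book is. This is a minimal Python illustration, not a description of any existing product: the field names (`errors`, `utilization_pct`, `cpu_pct`), the 90% limits and the "2x baseline" deviation rule are all assumptions, and the per-traffic-type question is left out for brevity.

```python
def diagnose(flow_latency_ms, baseline_ms, path, deviation_factor=2.0):
    """Walk the troubleshooting steps in order and return the first finding."""
    # Steps 1-2: compare the flow's response time against its baseline.
    if flow_latency_ms <= baseline_ms * deviation_factor:
        return "no major deviation from baseline"

    # Remaining steps: the same ordered checks applied to every hop the flow traverses.
    checks = [
        ("errors/retransmissions", lambda hop: hop["errors"] > 0),
        ("throughput bottleneck", lambda hop: hop["utilization_pct"] >= 90),
        ("CPU saturation", lambda hop: hop["cpu_pct"] >= 90),
    ]
    for label, is_bad in checks:
        offenders = [hop["name"] for hop in path if is_bad(hop)]
        if offenders:
            return f"{label} on {', '.join(offenders)}"

    return "no cause found; continue with further checks"
```

For example, a flow at 500 ms against a 100 ms baseline, crossing a clean router and a switch running at 95% utilization, returns "throughput bottleneck on sw-2". The point is that the tool, rather than the administrator, walks the decision tree.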

At this point I think we get the picture.  Most products I’m familiar with collect data metrics from one, two, three, etc. points of view on the network and roll those up into impressive-looking graphical reports.  Then it’s up to the administrator to review each report and self-analyze.  As mentioned in previous posts, I’m familiar with Integrien, Netuitive & BMC (ProactiveNet), who perform impressive behavioral baselining to create more intelligent alerts to forward to the event management console, but I’m looking for more here.  I want someone to take all the collected data and basically apply root cause analysis/run book automation principles.  If someone is out there doing this, please speak up and throw a link to your site down in the comments so I can come take a look.


Feb 28 2008   4:55PM GMT

Analytics; What are the top capabilities?



Posted by: Ryan Shopp
Analytics, DataCenter, DCAB, Integrien, NCCM, BMC, Alterpoint, Configuresoft, Netuitive, Opnet

Recently, I made some adjustments to the Data Center Automation Blueprint where we combined 2 original areas and added a new one for Analytics.  Steve Henning just posted a great guest blog entry over at Doug McClure’s blog called “Why Real Time Analytics?” I personally liked the analogy to TQM and the manufacturing industry.

He also recently jotted down some of his thoughts on capabilities within the comments section for the posting “Data Center Automation Blueprint; now includes virtualization thoughts.”

Here are some of my initial thoughts that I will take another pass at cleaning up in the next week or two.  I wanted to get this posted in a timely manner to hopefully inspire some discussions:

1) Inter-domain Integrations - Steve called it “Cross Silo” in his comment post. But the analytics solutions need to have a data model and API/SDK that is not specific to one domain (e.g., databases, Windows systems, network devices, WebSphere applications).  To perform holistic analysis you need more than one point of view.

2) Pattern Logic Automation- Automation through algorithms, rules etc that work to mimic the human problem solving / analysis process.

3) “Advanced” Graphical Visualization - more than summary graphics, pie charts, etc… what I’m thinking of here is something I can look at that helps me see the pattern or some unique situation/trend affecting the business (e.g., correlation of trouble ticket and performance monitoring details).  A better name than “advanced” is needed here for sure.
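
As a rough illustration of capability 1, here is a minimal Python sketch of a data model that is not tied to any one domain: every sample carries its domain as plain data, so database, network and system metrics can sit in the same structure and be lined up by time. The class name, field names and 60-second bucket are assumptions for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSample:
    domain: str       # e.g. "database", "network", "windows" (just a label)
    resource: str     # the monitored thing, e.g. "db-01"
    metric: str       # e.g. "query_time_ms"
    value: float
    timestamp: float  # seconds since the epoch


def group_by_time_bucket(samples, bucket_s=60):
    """Group samples from any domain into shared time buckets for cross-silo analysis."""
    buckets = defaultdict(list)
    for sample in samples:
        buckets[int(sample.timestamp // bucket_s)].append(sample)
    return dict(buckets)
```

Once samples from different silos land in the same bucket, the pattern logic in capability 2 has something to compare side by side.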

So far the vendors I’m thinking of when I’m creating the above functionality list (as noted in the DCAB) include;

Who else do we believe should be in this analytics bucket? Thoughts on these 3 capabilities?  What are some others?


Jan 24 2008   3:11PM GMT

Innovations and evolutions in Performance Management



Posted by: Ryan Shopp
Integrien, Netuitive, BMC, NetQoS

Great write-up by Glenn on the innovation occurring in Performance Management; Get Innovative About Performance.  He has tremendous perspective and I’m excited to see his candid perspective back now that he has departed EMC.  Great job Glenn, keep it up!

A well-articulated summary of the entire post, from my perspective, is this statement:  “Analysis has proven effective for fault management (evaluation of up/down conditions), but performance is a different animal. Whereas fault management deals with binary conditions of black and white, performance involves the full pallet of colors and shades of gray. Of course, dealing in colors is much more difficult than black and white, but help is now here.”

If you’re a vendor or enterprise doing something innovative, beyond reporting, in Performance Management, please throw down some details in the comments section sharing the company, capability and benefits for others (including myself) to check out.

BTW, my conversation thread on virtualization I’ve put on hold for a week or two.  I’m still pulling together some research and thoughts.  In the meanwhile, a great resource I’ve come across for learning and tracking the world of virtualization is Virtualization.info. 


Dec 28 2007   11:31PM GMT

Digging into each of these 6 functional areas: Performance and Capacity



Posted by: Ryan Shopp
DataCenter, HP Software, IBM Tivoli, InfoVista, Integrien, Netuitive, Systems monitoring, OSS, BMC, Quest Software, NetIQ, Network monitoring, Performance management, CA, Zabbix, ZenOSS, OpenNMS, Nagios, Hyperic, Groundwork, Packet Design, Apparent Networks, Xangati, Gomez, Keynote, Brix Networks, Entuity, Opnet, Network Instruments, Fluke Networks, Alcatel-Lucent, Compuware, NetScout, NetQoS, Symantec, EMC

First things first, we have many of the same vendors from the Availability & Notification functional area of this Data Center Automation Blueprint in this category. Which probably begs the question, do we combine Availability & Notification with Performance & Capacity? I know in the OSS (not Open Source Software but telco-oriented Operational Support Systems) model they do this and call it “Service Assurance”; another name could be Service Level Management, as the two monitoring-centric functions are about ensuring service levels are met… or do I simply call it Availability & Performance? I’ll come back to this at the end after I type up the players in this Performance & Capacity area:

But then, we have a slew of others that have been around for quite some time now…

And some innovative up-and-comers in some unique technology/approaches…

Real-Time Behavior/Pattern Analysis through Dynamic Thresholding

IP Traffic/Packet Flow Monitoring & Analysis

Open Source Software (OSS) vendors

Whew… that was more work than I expected to pull together and I’m not done yet…  Please throw into the comments who I’ve missed (I know there have to be a few).

The major challenge here is organizing and breaking down this functional area.  There are so many approaches to obtain performance metrics from/for the data center.  Some of the techniques and perspectives include;

  • passive vs. active
  • agent vs. agent-less
  • in-line appliance vs. out-of-band appliance (e.g., span a port)
  • proprietary vs. leverage infrastructure mgmt. capabilities (e.g., Cisco Netflow)
  • outside the data center looking in vs. inside the data center itself.
  • Reactive troubleshooting vs. Proactive Predictive

I’m going to need to have a part two (and maybe more) for this functional category breaking down the pros and cons of various approaches, which vendors do what, etc.  I also need to revisit that question from the top: do we combine this into a single “availability & performance” functional category?  For now, this first pass will have to do…


Dec 4 2007   10:04PM GMT

What are the Six Functional Areas of Data Center Automation



Posted by: Ryan Shopp
DataCenter, Alterpoint, BladeLogic, Cassatt, Integrien, IT Process Automation, HP Software, IBM Tivoli, InfoVista, BMC, Microsoft Windows, NetIQ, Netuitive, Opalis, Optinuity, PlateSpin, RealOps, Scalent, Stratavia, Veeam, Vizioncore

Alright, here is my first pass at a graphic I’m attempting to build that will capture the spirit of my previous posts (this is a work still in progress as previously mentioned);

I’m attempting to come up with a 30,000 foot reference model (functionality focused) for when you’re building out a data center’s software automation architecture.

The yellow areas are the 6 current areas I’ve functionally identified. The tricky part is based on the complexities of each category in the Data Center Infrastructure (e.g., Network vs. System), many of the functional areas require technical depth and audience-specific focus (e.g., network engineers vs. SAP administrators). The arrows are trying to capture that.

I know this still needs work but this is an evolution, and I only have a little time each week to currently work on it during these blog posts.

Below the graphic are some current vendors, by function, that have product(s) in each function and that I’ve mentioned in previous blog postings so far.

data-center-automation-reference-model-v1.jpg

  • Configuration & Change: BMC (Marimba), CA, EMC (Voyence), HP (Opsware), IBM, BladeLogic, Cassatt, AlterPoint, Platespin, Scalent, Veeam, Vizioncore
  • Security & Protection: Symantec, IBM, EMC, McAfee, nCircle, Lumension, ArcSight
  • Performance & Capacity: BMC, CA, EMC, HP, IBM, Quest, InfoVista
  • Availability & Notification: BMC, CA, EMC, HP, IBM, Microsoft, Quest, Integrien, Netuitive, NetIQ
  • Process Orchestration: BMC (RealOps), HP (iConclude), Opalis, Optinuity, NetIQ, Stratavia
  • Resource Reconciliation: Symantec, IBM, HP, BMC, EMC

I know I’ve missed many and also it would probably be helpful to not simply mention the company but also the product name but that will have to wait until another time.


Dec 3 2007   11:41PM GMT

Availability Management, so what’s been going on here?



Posted by: Ryan Shopp
DataCenter, Netuitive, HP Software, IBM Tivoli, Symantec, BMC, Microsoft Windows, CA, EMC, Quest Software, Integrien

As mentioned in my November 2007 round-up, I haven’t given any love to automation products watching for outages, faults or other events related to infrastructure availability.

Part of the reason for this oversight is these days most data centers are locked into a product from the “big 4” vendors; BMC (Performance, formerly Patrol), CA (formerly Aprisma), HP (NNM, Operations), IBM (NetView, formerly Micromuse) or the “upcoming 5” vendors EMC, Oracle, Microsoft, Quest Software and Symantec due to their overall IT infrastructure architecture and strategy.

But there are other innovative players in town to consider for replacing or complementing these bigger guys. Self-learning technologies are being advanced by companies like Netuitive and Integrien. These technologies are focused on monitoring real-time events and then leveraging mathematical algorithms to estimate baselines and set thresholds in an attempt to accurately predict system and service level degradation.