Adventures in Data Center Automation

Mar 26 2008   2:03PM GMT

IT Performance Management Call for Resources; I have a dream for performance management



Posted by: Ryan Shopp
Integrien, Netuitive, RBA, Run Book Automation, BMC, Performance management

So in my last posting I called out for links and resources that people recommend to others for understanding the variety of options and functions in Network & Application Performance Management. After making the request I decided to spend a few minutes looking around. First up for me was a quick trip over to Wikipedia to see what they have on the topic.

On the topic of Network Performance Management, there is a nice write-up on the factors that contribute to performance issues: latency, packet loss, retransmission, and throughput.

On the topic of Application Performance Management, there were some very in-depth graphs focused on monitoring response time, which I found intriguing.

On the topic of Performance Engineering, I was very surprised not only by a nice write-up of principles and perspectives related to the software development lifecycle, but also by a laundry list of interesting and applicable whitepapers at the bottom.

So at this point I stopped and started pondering: is there a product out there that goes beyond grabbing statistics and reporting on them? Some tools collect data from flows, some collect data from individual resources, some tools set up endpoints that systematically send synthetic transactions to measure response times, etc.

What do I really mean by this…is there a product that takes a troubleshooting workflow (think Run Book Automation) approach to the different steps involved in diagnosing a performance concern? Here is what I mean…

  • Start with monitoring traffic flows for their response time
  • Automatically baseline this and when a major deviation occurs go to the next bullet point
  • Is this traffic delay specific to one type of traffic or is it affecting all traffic?
  • What is causing this anomaly? Calculate which points of the infrastructure are traversed by these traffic flows
  • Look at each input/output point on the infrastructure (e.g., interfaces) to see if there are errors, retransmissions, etc
  • If no errors, next look at each input/output point on the infrastructure to see if throughput is bottlenecked
  • If no bottlenecks, next look at the processors/CPU on each point of the infrastructure to see if that is causing the delay
  • If no processor delays, look at…. (etc, etc, etc)
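
To make the idea a bit more concrete, here is a rough Python sketch of that workflow. It is only an illustration of the run book idea, not how any product mentioned in this post works. The Hop record, the function names, the three-sigma baseline test and the 1% / 90% cut-offs are all assumptions I made up for the example.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Hop:
    """One point of the infrastructure traversed by a flow (hypothetical model)."""
    name: str
    error_rate: float   # fraction of packets with errors or retransmissions
    utilization: float  # throughput as a fraction of interface capacity
    cpu: float          # processor load as a fraction of capacity


def deviates(history, current, sigmas=3.0):
    """True when the current response time is far above the learned baseline."""
    if len(history) < 2:
        return False  # not enough samples to baseline yet
    return current > mean(history) + sigmas * stdev(history)


def diagnose(flows, path, sigmas=3.0):
    """Walk the troubleshooting steps in order and return the first finding.

    flows: {traffic_type: (past_response_times, current_response_time)}
    path:  list of Hop objects traversed by the monitored flows
    """
    # Steps 1-2: baseline each flow and look for a major deviation.
    slow = [t for t, (hist, cur) in flows.items() if deviates(hist, cur, sigmas)]
    if not slow:
        return "no major deviation from baseline"

    # Step 3: is the delay specific to some traffic types or affecting all?
    scope = "all traffic" if len(slow) == len(flows) else ", ".join(sorted(slow))

    # Steps 4-7: check each point on the path, one class of cause at a time.
    # The cut-off values below are illustrative assumptions, not recommendations.
    checks = [
        ("errors/retransmissions", lambda h: h.error_rate > 0.01),
        ("throughput bottleneck", lambda h: h.utilization > 0.90),
        ("processor delay", lambda h: h.cpu > 0.90),
    ]
    for label, failed in checks:
        culprits = [h.name for h in path if failed(h)]
        if culprits:
            return f"{scope}: {label} at {', '.join(culprits)}"
    return f"{scope}: no cause found, continue with the next check"


if __name__ == "__main__":
    flows = {
        "http": ([100, 102, 98, 101, 99], 250),  # response times in ms
        "sql": ([20, 21, 19, 20, 22], 21),
    }
    path = [
        Hop("edge-router", 0.0, 0.40, 0.30),
        Hop("core-switch", 0.0, 0.97, 0.20),
    ]
    print(diagnose(flows, path))
```

With the made-up numbers above, only the http flow deviates from its baseline, no interface shows errors, and the run stops at the throughput check on core-switch. The point is the ordering: each check only runs when the previous one comes back clean, which is exactly the part I don't see products doing for me today.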

At this point I think we get the picture. Most products I’m familiar with collect data metrics from one, two, three, etc. points of view on the network and roll those up into impressive-looking graphical reports. Then it’s up to the administrator to review each report and self-analyze. As mentioned in previous posts, I’m familiar with Integrien, Netuitive & BMC (ProactiveNet), which perform impressive behavioral baselining to create more intelligent alerts to forward to the event management console, but I’m looking for more here. I want someone to take all the collected data and basically apply root cause analysis/run book automation principles. If someone is out there doing this, please speak up and throw a link to your site down in the comments so I can come take a look.

Comment on this Post



Shenning  |   Mar 26 2008   7:36PM GMT

Ryan - let me talk a bit more about what Integrien Alive does because we provide far more than just baselining and more actionable alerts. We integrate whatever monitoring data sources a customer has available so that we are able to provide our insights cross-silo. We can analyze the data across an entire business service’s components, which is critical to troubleshooting. Of course, we do the intelligent, behavioral baselining you spoke about, but the main reason for this is to establish the abnormal precursors to problems, which static threshold-based alerting cannot. What really separates our solution from the competition is that we allow our customers to set key indicators based on business service performance, end user experience, and/or pure IT metrics.

The key indicators drive our patented problem modeling analytics. When a key indicator is exceeded (either exceeding its normal behavior or a preset problem indicator value), our solution creates a model of the building pattern of abnormalities that lead to the problem (we refer to this as “Problem Fingerprinting”). This problem model provides a forensics tool - similar to what you are describing above - that allows the Ops team to focus their troubleshooting efforts. The model will pinpoint the silos of the business service where the problem resides. For example, the model may show that the application server and database tiers of the application are where the abnormal behaviors are occurring when a serious transaction slowdown occurs. The application server and DB server experts are provided with the specific abnormal symptom behaviors that occurred up to an hour before the problem manifested. Armed with this information, the experts can use their deep dive tools to determine the root cause of the problem. To continue my example, it may be found that the database abnormalities are the result of duplicate transactions being submitted by the application logic. Therefore the problem does not reside in the DB tier. The application expert may use a tool such as CA Wily to find the errant code that caused the problem and note this as a regression from a previous build.

The focus provided by the problem modeling significantly reduces the MTTI/MTTR compared with getting representatives from all silo teams on a bridge call. The other benefit is that once a model has been captured, if a similar building pattern of events occurs in the future, the transaction slowdown can be predicted so that it can be addressed before it occurs. The predictive alert that is sent provides complete information on what to look for and how the problem was solved the first time it occurred.

While the troubleshooting workflow approach you speak of would certainly work in individual silos (and I believe is already implemented in some tools like EMC SMARTS for the network silo), the complexity of today’s multi-tier applications and their interdependencies makes this approach unlikely to work across an entire business service. The complexity of managing performance and availability in multi-tier business services requires an analytics-based solution, with data-agnostic algorithms, that can model problems and allow Ops to focus their efforts. Once this focus is provided, the silo-based experts and their troubleshooting workflow tools can do their thing.


 

Amena  |   Mar 27 2008   12:58PM GMT

Ryan, OPNET has an integrated application performance management solution that provides real-time monitoring and global visibility all the way down to local troubleshooting and problem remediation.

ACE Live delivers an end-to-end solution that spans monitoring, measurement, and detection of violations, and then bridges seamlessly into uncovering the root-cause of application performance problems. It provides visibility of all transactions and users across the enterprise, with detailed real-time and historical information about performance, utilization, route quality, ISP performance, and end-user response times.

While ACE Live provides visibility of all users and transactions across the network, OPNET Panorama collects detailed data across all the servers in your application’s environment, then feeds this into our expert analysis engine, to produce real-time dashboards, historical reports and in-depth data views based on events and behavior patterns. Alarming thresholds are established dynamically, automatically adjusting their limits based on historical performance. Forensic ‘snapshots’ capture and archive in-depth data on key events for detailed troubleshooting. Drill-down capabilities identify specific resources, such as CPU, Java/.NET classes or database components, that scale inefficiently. Deep transaction tracing enables a detailed analysis of execution times at the method level, pinpointing statements in the application code that are responsible for performance problems.

OPNET’s ACE Analyst provides the “forensic” analysis that complements ACE Live. ACE Analyst automatically deconstructs individual application transactions to determine protocol delay, error messages, retransmissions, and arrival times. Diagnostic reports pinpoint performance bottlenecks and summarize sources of response time delay, providing actionable recommendations for improving response time (e.g. is application “chattiness” or TCP windowing between a specific client and server causing performance issues?). Hundreds of protocol and transaction level decodes provide code-level visibility into application statements. ACE Analyst even goes a step further than root-cause analysis. Its unique predictive model, created from captured traces, allows the troubleshooter to quickly and easily validate fixes to performance problems before implementation. For example, you can adjust infrastructure and application design parameters (such as bandwidth and application message turns) and immediately see the impact on the application’s response time.

All the pieces described above are tightly integrated with seamless workflows, delivering a complete monitoring and root-cause analysis solution, as you describe in your “dream”. Here’s a link to our website so you can come take a look: http://www.opnet.com/solutions/application_performance/index.html

One final note on the subject of Performance Engineering: OPNET provides a methodology service for establishing a performance engineering practice at your organization, which has been highly successful operationally at a number of our key client sites.