March 26, 2008 2:03 PM
Posted by: Ryan Shopp
Categories: Performance management, Run Book Automation
So in my last post I called out for links and resources that people recommend for understanding the variety of options and functions in Network & Application Performance Management. After making the request, I decided to spend a few minutes looking around myself. First up for me was a quick trip over to Wikipedia to see what they have on the topic.
On the topic of Network Performance Management, there is a nice write-up on the factors that contribute to performance issues: latency, packet loss, retransmission, and throughput.
On the topic of Application Performance Management, there were some very in-depth graphs focused on monitoring response time, which I found intriguing.
On the topic of Performance Engineering, I was pleasantly surprised not only by a nice write-up of principles and perspectives related to the software development lifecycle, but also by a laundry list of interesting and applicable whitepapers at the bottom.
So at this point I stopped and started pondering: is there a product out there that goes beyond grabbing statistics and reporting on them? Some tools collect data from flows, some collect data from individual resources, some set up endpoints that systematically send synthetic transactions to measure response times, etc.
What do I really mean by this? Is there a product that takes a troubleshooting-workflow (think Run Book Automation) approach to the different steps involved in diagnosing a performance concern? Here is what I mean…
- Start with monitoring traffic flows for their response time
- Automatically baseline this and, when a major deviation occurs, go to the next bullet point
- Is this delay specific to one type of traffic, or is it affecting all traffic?
- To find what is causing the anomaly, calculate which points of the infrastructure are traversed by these traffic flows
- Look at each input/output point on the infrastructure (e.g., interfaces) to see if there are errors, retransmissions, etc.
- If no errors, next look at each input/output point on the infrastructure to see if throughput is bottlenecked.
- If no bottlenecks, next look at the processors/CPU on each point of the infrastructure to see if that is causing the delay
- If no processor delays, look at…. (etc, etc, etc)
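The escalation above can be sketched as a run-book-style checklist: an ordered list of checks, each returning a finding or nothing, walked in order until one fires. Everything here (the shape of the `metrics` dict, the metric names, the thresholds) is a hypothetical sketch of the idea, not any vendor's implementation.

```python
def check_traffic_scope(metrics):
    """Is the delay confined to certain traffic classes, or global?"""
    slow = [c for c, rt in metrics["response_by_class"].items()
            if rt > metrics["baseline_rt"] * 2]
    # Report only when SOME (not all) classes are slow; a global
    # slowdown means we keep walking the run book.
    if slow and len(slow) < len(metrics["response_by_class"]):
        return "delay isolated to traffic class(es): " + ", ".join(slow)
    return None

def check_interface_errors(metrics):
    """Any traversed interface reporting errors or retransmissions?"""
    bad = [i for i, e in metrics["interface_errors"].items() if e > 0]
    return "errors on interface(s): " + ", ".join(bad) if bad else None

def check_throughput(metrics):
    """Any interface running near line rate (bottlenecked)?"""
    hot = [i for i, u in metrics["interface_util"].items() if u > 0.9]
    return "throughput bottleneck on: " + ", ".join(hot) if hot else None

def check_cpu(metrics):
    """Any device processor saturated?"""
    hot = [d for d, u in metrics["cpu_util"].items() if u > 0.9]
    return "CPU saturation on: " + ", ".join(hot) if hot else None

RUN_BOOK = [check_traffic_scope, check_interface_errors,
            check_throughput, check_cpu]

def diagnose(metrics):
    """Walk the run book in order; return the first finding that fires."""
    for check in RUN_BOOK:
        finding = check(metrics)
        if finding:
            return finding
    return "no root cause found; continue to the next run-book step"
```

The point of the structure is that each bullet above becomes one function, and adding a new step is just appending to `RUN_BOOK`.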
At this point I think we get the picture. Most products I’m familiar with collect data metrics from one, two, three, or more points of view on the network and roll those up into impressive-looking graphical reports. Then it’s up to the administrator to review each report and self-analyze. As mentioned in previous posts, I’m familiar with Integrien, Netuitive & BMC (ProactiveNet), which perform impressive behavioral baselining to create more intelligent alerts to forward to the event management console, but I’m looking for more here. I want someone to take all the collected data and apply root cause analysis/run book automation principles to it. If someone out there is doing this, please speak up and drop a link to your site in the comments so I can come take a look.
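On the behavioral baselining piece: a minimal sketch of the idea (my own toy version, not how Integrien, Netuitive, or ProactiveNet actually work) is a rolling window that flags a sample as a major deviation when it falls more than k standard deviations from the recent mean.

```python
from collections import deque
from statistics import mean, stdev

class Baseline:
    """Rolling behavioral baseline: flag a sample as a major deviation
    when it falls more than `k` standard deviations from the mean of
    the last `window` samples."""
    def __init__(self, window=60, k=3.0):
        self.samples = deque(maxlen=window)
        self.k = k

    def observe(self, value):
        anomalous = False
        if len(self.samples) >= 10:  # need some history before judging
            mu = mean(self.samples)
            sigma = stdev(self.samples)
            # max(...) guards against a perfectly flat history (sigma == 0)
            if abs(value - mu) > self.k * max(sigma, 1e-9):
                anomalous = True
        self.samples.append(value)
        return anomalous
```

Real products do far more sophisticated modeling (seasonality, multi-metric correlation), but the "learn normal, alert on deviation, then hand off to the next step" loop is the piece I want chained into a run book.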
March 24, 2008 12:00 PM
Posted by: Ryan Shopp
The deeper we dig, the tougher it becomes to draw dividing lines. Originally we had two DCAB areas: Performance & Capacity as one and Availability as another. Then, based on some further thought, we made an adjustment and bundled Performance & Availability together, along with Capacity & Analytics. I still find myself questioning this when I look at things like yesterday’s announcements by Xangati (End-User Activity to Front-line support) and NetQoS (adding Network Behavior Analysis).
The reason for my questioning is that many performance solutions retain their data, allowing for historical/capacity analysis. To say it another way, you will more often find a performance vendor also doing capacity management than doing availability monitoring (unless you’re talking about the big 4 or 5). So it’s time to step back, take a deeper look at approaches and functionality, and then figure out which vendors go where (aka a bottom-up perspective).
Looking back at the original Data Center Automation Blueprint post, Digging into these 6 functional areas: Performance & Capacity, we notice a discussion that started talking through approaches:
- passive vs. active
- agent vs. agent-less
- in-line appliance vs. out-of-band appliance (e.g., span a port)
- proprietary vs. leverage infrastructure mgmt. capabilities (e.g., Cisco Netflow)
- outside the data center looking in vs. inside the data center itself.
- reactive troubleshooting vs. proactive/predictive
So let’s start pondering these a little more.
Passive Monitoring – collecting statistics by passively observing actual traffic flows; for example, an inline appliance, or a spanned port on a switch that mirrors a copy of all traffic flows over to that appliance.
Active Monitoring – collecting statistics by querying endpoints using different protocols; for example, create a TCP packet and query an application/service that should respond to it, e.g., SMTP on port 25.
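A minimal sketch of an active probe along these lines (the timeout value and the return convention are my own choices, not any product’s behavior): measure how long a TCP connect to the service’s port takes.

```python
import socket
import time

def tcp_probe(host, port, timeout=5.0):
    """Actively probe a TCP service: return the connect time in seconds,
    or None if the service did not respond within the timeout."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None
```

A real synthetic-transaction tool would go further and complete a protocol-level exchange (e.g., read the SMTP banner after connecting to port 25) rather than stopping at the TCP handshake, but the shape is the same: send, time the response, record the statistic centrally.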
Now even starting with these two you can start pulling things apart, due to various hybrid approaches and positioning by vendors. The case can be made that passive is nothing more than collection of data that is then passed back to a centralized point…just like active, which requests the data and receives an immediate response to place within the centralized aggregation point. It gets further convoluted when some vendors let you use the passive data as it’s gathered on the appliance, while others wait for it to be aggregated back to the central management point. So with all this confusion, where do we go from here…
I was reading another blog post last night and thought it had an interesting way of talking about these two types of performance statistics. They called them rows & columns: rows meaning infrastructure, columns meaning flows.
So from here let’s step down one more level: what types of statistics do we want to capture…
- How much (e.g., bit rate/throughput/activity)
- How fast (e.g., latency/round-trip time/response time)
- How ready (e.g., availability/response)
But there are also statistics in between that help us identify potential bottleneck points: processor, memory, etc.
Also, we need to be able to gather data down to a specific endpoint for a specific application, or we may want to aggregate up to all traffic types for a specific data center.
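As a toy illustration of those three statistic families (the sample format and field names here are invented for the example, not from any product), here is how they might be rolled up from a batch of probe samples:

```python
def summarize(samples):
    """Roll probe samples up into the three statistic families:
    how much (throughput), how fast (latency), how ready (availability).
    Each sample is (bytes_transferred, latency_seconds_or_None);
    a latency of None means the probe got no response."""
    ok = [s for s in samples if s[1] is not None]
    total_seconds = sum(s[1] for s in ok)
    return {
        # how much: bits moved per second of measured time (a toy
        # simplification -- real tools use the actual observation window)
        "throughput_bps": (sum(s[0] for s in ok) * 8 / total_seconds)
                          if total_seconds else 0.0,
        # how fast: mean round-trip time over successful probes
        "avg_latency_s": (total_seconds / len(ok)) if ok else None,
        # how ready: fraction of probes that got a response
        "availability": len(ok) / len(samples) if samples else 0.0,
    }
```

The same roll-up could then run at either granularity mentioned above: per endpoint/application, or aggregated across everything headed to one data center.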
So I’m going to push pause here. With all the performance vendors out there, I know some of y’all must have found good resources (whitepapers, etc.) that already attack this question of what statistics to collect, from what vantage point, using what technology. I’m going to do some more research and ponder how we can better articulate mapping things for Performance, Capacity & the Analytics of those details. Please feel free to share quality whitepapers that, as independently as possible, attempt to answer this.