Adventures in Data Center Automation


April 7, 2008  7:08 PM

Data Center Automation Blueprint – made a round of updates



Posted by: Ryan Shopp
DataCenter, DCAB

It’s been a few weeks since I’ve taken a pass at the Data Center Automation Blueprint, so it was time for some additions and clean-up. First up, I added descriptions for some of the categories and made sure the known vendor list was up to date (e.g., BladeLogic acquired by BMC).

Resource Reconciliation
Description – Automation that captures a complete view of all IT resources, assets, services, etc. and their relationships, layers 1 through 7. This comprehensive view of all IT resources is the “record of truth” and needs to always be 100% accurate. Once in place, this is the hub of information that keeps all other monitoring and management solutions on the same page so nothing is missed or overlooked.

Process Orchestration
Description – Cross-silo automation for mundane manual or high-occurrence tasks. The capabilities are focused on helping individual technology domains (e.g., network, Windows, Unix, database, etc.) communicate and collaborate to automate tasks that previously required numerous people passing around a trouble ticket.

Configuration & Change
Description – Automation around making configuration or software changes en masse or, even at an individual level, in a more controlled, systematic way. This includes understanding the potential impact and risks of a change, and keeping tabs on what is changing and whether it is authorized and in line with established standards.

Top Capabilities
1) Making changes easier through a simplified user interface – enables more junior administrators to make complex changes that traditionally required senior individuals.
2) Abstraction layer that enables the same change to be applied to numerous resources, including resources from multiple vendors.
3) Ability to flag when a change is not recommended or even unauthorized…understanding the interdependencies and risks associated with a change.

One area I’m still going back and forth on is Analytics.  The more I research and dig into things, the more I see that analytics automation is specific to a functional category (e.g., Config, Performance or Availability), with only a hint of integration across DCAB categories today.  Examples include:

  • Configuration & Change Management vendors offering analytics for servers/applications that integrate details from help desk solutions about the change tickets, and that offer a hint of cross-functional analysis by pulling in applicable incidents from an availability solution.
  • Performance & Capacity Management vendors offering analytics for end-to-end applications/services with a hint of configuration & change specifics, so they know what changed and whether it had an impact.
  • Performance & Capacity Management vendors offering analytics through real-time algorithms that perform dynamic thresholding and problem fingerprinting based on performance and availability conditions.

It seems that at the point where it goes seriously cross-functional, we find it discussed in terms of Business Intelligence, Business Service Management or Dashboards. So my gut is telling me to go back to these 6 areas of the Data Center Automation Blueprint, where analytics is a key set of capabilities within each functional area…not its own standalone category:

  • Resource Reconciliation (aka CMDB)
  • Process Orchestration (aka RBA)
  • Availability & Notification
  • Performance & Capacity
  • Security & Protection
  • Configuration & Change

Any thoughts on this? Please speak up!

    April 7, 2008  6:32 PM

    links for 2008-04-07



    Posted by: Ryan Shopp
    DataCenter


    April 3, 2008  6:33 PM

    links for 2008-04-03



    Posted by: Ryan Shopp
    DataCenter


    April 2, 2008  6:34 PM

    links for 2008-04-02



    Posted by: Ryan Shopp
    DataCenter


    April 1, 2008  6:35 PM

    links for 2008-04-01



    Posted by: Ryan Shopp
    DataCenter


    March 31, 2008  8:25 PM

    Month in Review – March 2008



    Posted by: Ryan Shopp
    DataCenter

     Acquisitions 

    Infrastructure & Application Performance Management 

    Data Center Automation Blueprint development & discussion


    March 31, 2008  6:37 PM

    links for 2008-03-31



    Posted by: Ryan Shopp
    DataCenter


    March 26, 2008  2:03 PM

    IT Performance Management Call for Resources; I have a dream for performance management



    Posted by: Ryan Shopp
    BMC, Integrien, Netuitive, Performance management, RBA, Run Book Automation

    So in my last posting I called out for some links and resources that people recommend to others when it comes to understanding the variety of options and functions for Network & Application Performance Management.  Upon making the request I decided to spend a few minutes looking around.  First up for me was a quick trip over to Wikipedia to see what they have on the topic.

    On the topic of Network Performance Management, there is a nice write-up on factors that contribute to performance issues – latency, packet loss, retransmission, throughput.

    On the topic of Application Performance Management, there were some very in-depth graphs focused on monitoring response time, which I found intriguing.

    On the topic of Performance Engineering, I was very surprised not only by a nice write-up of principles and perspectives related to the software development lifecycle, but also by a laundry list of interesting and applicable whitepapers at the bottom.

    So at this point I stopped and started pondering: is there a product out there that goes beyond grabbing statistics and reporting on them?  Some tools collect data from flows, some collect data from individual resources, some tools set up endpoints that systematically send synthetic transactions to measure response times, etc.

    What do I really mean by this…is there a product that takes a troubleshooting workflow (think Run Book Automation) approach to the different steps involved with determining a performance concern?  Here is what I mean…

    • Start with monitoring traffic flows for their response time
    • Automatically baseline this and when a major deviation occurs go to the next bullet point
    • Is this traffic delay specific to one type of traffic, or is it affecting all traffic?
    • What is causing this anomaly? Calculate which points of the infrastructure are traversed by these traffic flows
    • Look at each input/output point on the infrastructure (e.g., interfaces) to see if there are errors, retransmissions, etc.
    • If no errors, next look at each input/output point on the infrastructure to see if throughput is bottlenecked.
    • If no bottlenecks, next look at the processors/CPU on each point of the infrastructure to see if that is causing the delay
    • If no processor delays, look at…. (etc, etc, etc)
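
To make the workflow above a bit more concrete, here is a rough Python sketch of the decision chain. It is only an illustration under my own assumptions: the `Hop` fields, the 3-sigma baseline test and the 0.9 utilization/CPU cut-offs are hypothetical, not taken from any product, and the "one traffic type vs. all traffic" step is left out to keep it short.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Hop:
    """One point of infrastructure traversed by a flow (hypothetical fields)."""
    name: str
    interface_errors: int   # errors/retransmissions seen on its interfaces
    utilization: float      # 0.0-1.0 share of link capacity in use
    cpu_load: float         # 0.0-1.0 processor load


def deviates(history, current, sigmas=3.0):
    """True when `current` sits more than `sigmas` std devs above the baseline."""
    if len(history) < 2:
        return False  # not enough data to form a baseline yet
    return current > mean(history) + sigmas * stdev(history)


def diagnose(history, current, path):
    """Walk the checks in order: errors first, then throughput, then CPU."""
    if not deviates(history, current):
        return "response time within baseline"
    checks = [
        ("interface errors", lambda hop: hop.interface_errors > 0),
        ("throughput bottleneck", lambda hop: hop.utilization > 0.9),
        ("processor delay", lambda hop: hop.cpu_load > 0.9),
    ]
    # Each check is applied to every hop before moving on to the next check,
    # mirroring the "if not errors, next look at..." ordering of the bullets.
    for label, test in checks:
        for hop in path:
            if test(hop):
                return f"{label} on {hop.name}"
    return "no cause found; escalate"
```

A call such as `diagnose(past_response_times, latest_response_time, hops_on_path)` returns the first finding as a short string; a real tool would presumably collect every finding and keep going down the list.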

    At this point I think we get the picture.  Most products I’m familiar with collect data metrics from one, two, three, etc. points of view on the network and roll those up into impressive-looking graphical reports.  Then it’s up to the administrator to review each report and self-analyze.  As mentioned in previous posts, I’m familiar with Integrien, Netuitive & BMC (ProactiveNet), who perform impressive behavioral baselining to create more intelligent alerts to forward to the event management console, but I’m looking for more here.  I want someone to take all the collected data and basically apply root cause analysis/run book automation principles.  If someone is out there doing this, please speak up and throw a link to your site down in the comments so I can come take a look.
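
For readers who have not run into behavioral baselining, here is a minimal Python sketch of the general idea: learn what is normal for each hour of the day and alert only when a value is well above that. This is my own simplification, not how Integrien, Netuitive or BMC implement it; the `HourlyBaseline` class, the 3-sigma rule and the `min_samples` cut-off are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


class HourlyBaseline:
    """Keeps a separate baseline of a metric for each hour of the day."""

    def __init__(self, sigmas=3.0, min_samples=5):
        self.sigmas = sigmas
        self.min_samples = min_samples
        self.samples = defaultdict(list)  # hour of day -> observed values

    def observe(self, hour, value):
        """Record one measurement taken during the given hour (0-23)."""
        self.samples[hour].append(value)

    def threshold(self, hour):
        """Dynamic alert threshold for that hour, or None if too little data."""
        values = self.samples[hour]
        if len(values) < self.min_samples:
            return None
        return mean(values) + self.sigmas * stdev(values)

    def is_anomalous(self, hour, value):
        """True only when a baseline exists and the value exceeds it."""
        limit = self.threshold(hour)
        return limit is not None and value > limit
```

The point of the per-hour split is that a response time that is routine at 9 AM can be flagged at 3 AM, which a single static threshold cannot do.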


    March 25, 2008  6:22 PM

    links for 2008-03-25



    Posted by: Ryan Shopp
    DataCenter


    March 24, 2008  12:00 PM

    DCAB 6: Performance, Availability, Capacity & Analytics revisited



    Posted by: Ryan Shopp
    DataCenter

    The deeper we dig, the tougher it becomes to draw dividing lines.  Originally we had 2 DCAB areas: Performance & Capacity as one and Availability as another.  Then, based on some further thoughts, we made an adjustment and bundled Performance & Availability together along with Capacity & Analytics.  I still find myself questioning this when I take a look at things like yesterday’s announcements by Xangati (End-User Activity to Front-line support) and NetQoS (add Network Behavior Analysis).

    The reason for my questioning this is that many performance solutions retain their data, allowing for historical/capacity analysis.  To say it another way, you will more often find a performance vendor also doing capacity management than doing availability monitoring (unless you’re talking about the big 4 or 5).  So it’s time to step back and take a deeper look at approaches and functionality, then figure out which vendors go where (aka a bottom-up perspective).

    Looking back at the original post for the Data Center Automation Blueprint, called “Digging into these 6 functional areas: Performance & Capacity”, we notice a discussion that started talking through approaches.

    • passive vs. active
    • agent vs. agent-less
    • in-line appliance vs. out-of-band appliance (e.g., span a port)
    • proprietary vs. leverage infrastructure mgmt. capabilities (e.g., Cisco Netflow)
    • outside the data center looking in vs. inside the data center itself.
    • reactive troubleshooting vs. proactive/predictive

    So let’s start pondering these a little more.

    Passive Monitoring – the monitoring of actual traffic flows passively to collect statistics; for example, an inline appliance, or a spanned port on a switch that mirrors a copy of all traffic flows to that appliance.

    Active Monitoring – the monitoring of endpoints using different protocols to collect statistics; for example, create a TCP packet and query an application/service that should respond to it…e.g., SMTP on port 25.
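
As a small illustration of the active approach, here is a Python sketch that opens a TCP connection to a service port and times how long the connect takes. The function name `probe_tcp` and the 3-second timeout are my own choices; it only checks that the port accepts connections and does not speak SMTP or any other application protocol.

```python
import socket
import time


def probe_tcp(host, port, timeout=3.0):
    """Return the TCP connect time in seconds, or None if the port did not answer."""
    start = time.perf_counter()
    try:
        # create_connection performs the TCP handshake; the socket is
        # closed again as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return None
```

For example, `probe_tcp("mail.example.com", 25)` (a placeholder host name) would give a rough "how ready" and "how fast" reading for a mail server in a single call.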

    Now even starting with these two, you can start pulling things apart due to various hybrid approaches and positioning by vendors.  The case can be made that passive is nothing more than collection of data that is then passed back to a centralized point…just like active, which requests the data and receives an immediate response to place within the centralized aggregation point.  It gets further convoluted when some vendors allow you to use passive data as it’s gathered on the appliance, while others wait for it to be aggregated back to the central management point.  So with all this confusion, where do we go from here…

    I was reading another blog posting last night and thought it had an interesting way of talking about these two types of performance statistics.  They called them rows & columns.  Rows meaning infrastructure and columns meaning flows.

    So from here let’s step down one more level: what types of statistics do we want to capture?

    How much (e.g., bitrate/throughput/activity)

    How fast (e.g., latency/round-trip time/response time)

    How ready (e.g., availability/response)

    But, there are also statistics in between to help us identify potential bottleneck points; processor, memory, etc.

    Also, we need to be able to drill down to a specific endpoint for a specific application, or aggregate up to all traffic types for a specific data center.
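
One way to picture the three statistics and the drill-down/roll-up requirement is the Python sketch below. The `Sample` record and its fields are hypothetical, and the averages are deliberately naive; the point is only that the same data can be grouped per endpoint and application or per data center by swapping the key.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    """One measurement of one transaction (hypothetical record layout)."""
    data_center: str
    endpoint: str
    application: str
    bytes_sent: int
    response_ms: float
    answered: bool


def summarize(samples, key):
    """Group samples by `key` and report how much, how fast and how ready."""
    groups = defaultdict(list)
    for sample in samples:
        groups[key(sample)].append(sample)
    report = {}
    for name, group in groups.items():
        report[name] = {
            "how_much_bytes": sum(s.bytes_sent for s in group),
            "how_fast_ms": sum(s.response_ms for s in group) / len(group),
            "how_ready_pct": 100.0 * sum(s.answered for s in group) / len(group),
        }
    return report
```

Calling `summarize(samples, lambda s: (s.endpoint, s.application))` gives the drill-down view, while `summarize(samples, lambda s: s.data_center)` rolls everything up to the data center.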

    So I’m going to push pause here. With all the Performance vendors out there, I know some of y’all must have found some good resources (whitepapers etc.) that already attack this question of what statistics, from what vantage point, using what technology, etc.  I’m going to do some more research and ponder how we can better articulate mapping things for Performance, Capacity & the Analytics of those details.  Please feel free to share with me quality whitepapers that attempt to answer this as independently as possible.