Adventures in Data Center Automation:

Network monitoring

Jun 23 2008   3:00PM GMT

So let’s talk a little about Traffic Flow Reporting and Analysis



Posted by: Ryan Shopp
Network monitoring, Performance management, Alcatel-Lucent, NetScout, DataCenter, Application monitoring, SolarWinds, InfoVista, Accellent, HP Software, NetQoS, Compuware, Opnet, Xangati, Packet Design

Next up, I plan to dig into this sector a little deeper (as always from a purely data center centric perspective - aka no End-User Monitoring that requires a desktop agent).

The priority for these products is to provide an end-to-end service/application perspective on traffic performance and capacity. The goals: help quickly troubleshoot from an application or end-point perspective, or better understand what traffic is going where, and at what levels, across the infrastructure. All this from a network-centric control point (no loading of agents on a server or client - since the network team doesn’t own the responsibility for those).

So on the surface I see two main categories (each has subcategories that I’ll dig into during follow-up posts)

Flow Reporting-centric (these vendors gather Cisco NetFlow, J-Flow, sFlow from infrastructure agents and report in various ways)

  • NetScout, SolarWinds, CA eHealth, NetQoS, Mazu Networks, Xangati, InfoVista, Opnet, Lancope, Packet Design, Q1 Labs, Alcatel-Lucent VitalNet, HP Performance Insight - to name a few

Flow Self-Collection & Reporting (these vendors span/tap actual traffic flows and report in various ways)

  • NetQoS, Mazu Networks, InfoVista (through acquisition of Accellent), Lancope, CA Wily, Q1 Labs, Compuware - to name a few

I quickly noticed that many of the vendors actually support both - which I assume is about flexibility, as some customers don’t have NetFlow-type capabilities enabled or don’t wish to enable them for a variety of reasons.

So my first set of questions/experiences I’m now reading/researching about are:

1) What are the key benefits to going the self-collection route over the reporting-only route? Unique metrics? Scalability? Limitations around NetFlow (e.g., performance)?

2) When it comes to reporting only using NetFlow, etc - what metrics are being used these days?

I remember first integrating and being able to report on RMON2 probes and early Cisco NetFlow data back in 2001 within the Lucent VitalNet product…so where are things 6 years later now that NetFlow is much more pervasive and I’m sure improved.

My assumptions on some of these are as follows (vendors & users, please leave comments to help educate me for my follow-up posts).

When it comes to reporting, there are historical/capacity-centric reports and there are real-time/troubleshooting-centric views. My assumption (again, currently an assumption..I haven’t read too much on this topic yet) is that most of the reporting-centric vendors (those that don’t also offer their own passive flow monitoring capability) are focused more on those historical/capacity reports (e.g., eHealth, SolarWinds, InfoVista). These reports show how much data is going where, and what type of data it is, over a day/week/month etc. Once this data is archived, they slice & dice it in a variety of ways. But basically it’s about looking at it for trends over time.
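To make the “slice & dice” idea concrete, here is a minimal sketch in Python. It assumes flow records have already been exported and archived as simple tuples; the field layout, addresses and byte counts are invented for illustration and are not taken from any vendor’s product or from a real NetFlow export.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical, simplified flow records: (timestamp, source, destination, application, bytes).
# Real NetFlow/sFlow exports carry many more fields; this layout is illustrative only.
FLOWS = [
    ("2008-06-21 09:15", "10.0.0.5", "10.1.0.20", "http", 120_000),
    ("2008-06-21 17:40", "10.0.0.5", "10.1.0.20", "http", 80_000),
    ("2008-06-22 11:05", "10.0.0.7", "10.1.0.30", "sql", 50_000),
    ("2008-06-22 11:30", "10.0.0.5", "10.1.0.20", "http", 30_000),
]


def daily_totals(flows):
    """Sum bytes per (day, destination, application)."""
    totals = defaultdict(int)
    for timestamp, _src, dst, app, byte_count in flows:
        day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M").date().isoformat()
        totals[(day, dst, app)] += byte_count
    return dict(totals)


def top_talkers(flows, limit=3):
    """Rank sources by total bytes sent across the whole archive."""
    per_source = defaultdict(int)
    for _timestamp, src, _dst, _app, byte_count in flows:
        per_source[src] += byte_count
    return sorted(per_source.items(), key=lambda item: item[1], reverse=True)[:limit]


if __name__ == "__main__":
    for key, total in sorted(daily_totals(FLOWS).items()):
        print(key, total)
    print(top_talkers(FLOWS))
```

The same archive answers both “how much went where each day” and “who sent the most overall”, which is roughly what a capacity report does at much larger scale.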

Now, when it comes to real-time, since so much data is coming in so quickly there needs to be extra intelligence/automation helping out - building a “what looks normal” model and then focusing on identifying and alerting someone when something “odd” is noted. Of course, these products need to store/report on much of the same data as the historic/capacity-centric products as they build credibility and trust with their users.
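One simple way to picture a “what looks normal” model is a rolling baseline: keep a window of recent samples and flag anything that lands too many standard deviations from the average. The sketch below is only an assumption about how such a check could work, not a description of any vendor’s algorithm; the class name, window size and tolerance are made up for illustration.

```python
from collections import deque
from statistics import mean, stdev


class Baseline:
    """Rolling baseline that flags a sample straying too far from recent history."""

    def __init__(self, window=12, tolerance=3.0, min_samples=5):
        self.history = deque(maxlen=window)  # only the most recent samples are kept
        self.tolerance = tolerance           # allowed distance, in standard deviations
        self.min_samples = min_samples       # do not judge until enough history exists

    def observe(self, value):
        """Return True if value looks odd against the current baseline, then learn it."""
        odd = False
        if len(self.history) >= self.min_samples:
            average = mean(self.history)
            spread = stdev(self.history)
            # Guard against a perfectly flat history (spread of zero).
            odd = abs(value - average) > self.tolerance * max(spread, 1e-9)
        # Every sample, odd or not, joins the history in this simple version.
        self.history.append(value)
        return odd


if __name__ == "__main__":
    baseline = Baseline()
    for mbps in [100, 102, 98, 101, 99, 100, 400, 101]:
        print(mbps, "ODD" if baseline.observe(mbps) else "ok")
```

Note that this naive version lets an odd sample pollute the baseline afterwards; a production-grade approach would likely handle that, along with time-of-day and day-of-week patterns.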

So when it comes down to it..much of the same data is being used for 2 unique users…one focused on planning improvements and the other focused on quickly resolving issues. So now that I’ve finished writing this post a better way to probably organize the field of play is not by technology (NetFlow vs. Self-Collect) but by usage. I’ll read some more and do that next time.

Another angle to ponder on this topic will be around the WAN acceleration/optimization vendors…but again, for another day.

Apr 17 2008   9:58PM GMT

Performance and Availability Management vs. Analytics - Part 1 of ?



Posted by: Ryan Shopp
Network monitoring, Performance management, BMC, NetIQ, Alcatel-Lucent, NetScout, Analytics, CA, Systems monitoring, Application monitoring, SolarWinds, InfoVista, IBM Tivoli, HP Software, Quest Software, Netuitive, Integrien, NetQoS, Compuware, Fluke Networks, Network Instruments, Opnet, Entuity, Brix Networks, Keynote, Gomez, Xangati, Apparent Networks, Packet Design, Groundwork, Hyperic, Nagios, OpenNMS, ZenOSS, Firescope, Indicative, DCAB, eg innovations, cittio, nimsoft

I’ve had an opportunity to be briefed over the past couple months by a number of current Data Center Automation Blueprint’s Performance & Availability vendors (e.g., CITTIO, eG Innovations, InfoVista, Integrien, Nimsoft).  With that and some further research I think I’m ready to take another pass at this area of the blueprint.

First up, all these vendors use a variety of techniques to collect a variety of data from as many points of view as possible.

  • Their own server agents that collect data about systems, services, applications, databases, etc and then aggregate back to a centralized console
  • Agent-less centralized consoles that leverage infrastructure standard communications protocols (e.g., SNMP, RPC, ODBC, WMI, SSH, TCP, UDP, HTTP) to query or connect remotely to collect data from networks, systems, services, applications, databases, etc.
  • Passive traffic flow collectors (which can be an agent or an appliance) that are either in-line with the traffic flows or receive an exact copy of all traffic flows traversing a network connection (e.g., switch port uplink) through hardware vendor capabilities (e.g., spanning)

These data collection points can be statistics about a specific IT infrastructure resource: physical devices, virtual devices, physical connections, virtual connections, or resources running on physical or virtual devices like services, processes, applications, databases, etc.

Or the data collection points can be traffic flows or end-to-end specifics, including passive traffic flows, synthetic transactions or even something as simple as pinging from remote points.

Metrics that are captured typically revolve around throughput, errors, utilization, latency, up/down status, etc. (there are way too many to mention here).

After saying all this, there is a list a mile long of vendors (a number already noted on the DCAB) that capture these predominately time-series oriented data points about performance, capacity, availability using any/all these methods or vantage points (I know, passive traffic flows are not time-series data but patterns/usage/performance etc can be determined from them).

So, with all that data, what most of these vendors offer are two primary types of functionality: 1) a variety of graphical reports and 2) metric thresholding capabilities that produce a list of outstanding issues/alerts/alarms/events/concerns (whatever you want to call them).
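As a rough illustration of metric thresholding, the sketch below compares collected samples against fixed limits and returns the resulting alert list. The metric names, limits and device names are hypothetical, chosen only to show the shape of the idea.

```python
# Hypothetical thresholds; real products let you tune these per device or per metric.
THRESHOLDS = {
    "cpu_utilization_pct": 90.0,
    "interface_errors_per_min": 10.0,
    "latency_ms": 250.0,
}

# Hypothetical samples: (resource, metric, value).
SAMPLES = [
    ("router-1", "cpu_utilization_pct", 95.0),
    ("router-1", "latency_ms", 40.0),
    ("switch-3", "interface_errors_per_min", 25.0),
    ("db-2", "disk_queue", 7.0),  # no threshold defined, so never alerts
]


def evaluate(samples, thresholds=THRESHOLDS):
    """Return one alert dict for every sample whose value exceeds its threshold."""
    alerts = []
    for resource, metric, value in samples:
        limit = thresholds.get(metric)
        if limit is not None and value > limit:
            alerts.append(
                {"resource": resource, "metric": metric, "value": value, "limit": limit}
            )
    return alerts


if __name__ == "__main__":
    for alert in evaluate(SAMPLES):
        print(alert)
```

Static limits like these are easy to reason about, which is part of why this layer has become a commodity.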

Ok, so why did I organize and point all this out? So I can draw a line around where most of the innovation, from my perspective, is occurring. The above is for the most part, in my eyes, a commodity these days. Most companies have had collection/reporting/thresholding capabilities spanning multiple technology silos since pretty close to the start of enterprise networking. The reports continue to get fancier, the number of data sources a single product collects from continues to expand, etc.  Another sign of commoditization is the variety of business models offering these products: open source, managed service providers, internet distributed products, appliance deployment models and indirect sales forces, large enterprise direct sales forces, completely flexible frameworks for service providers to basically “build their own,” etc.

For the most part where the majority of technical innovation is occurring these days is the next layer above this data collection, reporting and alerting. Now let me say this, yes…there is some great innovation still occurring in the data collection realm (e.g., Xangati offering real-time Netflow down to a user level, PacketDesign monitoring routing messages, NetQoS leveraging advanced TCP/IP theory to analyze where end-to-end bottlenecks are occurring). But, for the most part these new data sources are being used to augment or replace currently deployed data sources in an attempt to see things from either as many vantage points or the best vantage points to avoid surprises within their unique enterprise IT environment.

So where is the serious innovation coming from…stay tuned for part 2.


Mar 5 2008   7:59PM GMT

Top Enterprise Management Tools vs. Data Center Automation Blueprint



Posted by: Ryan Shopp
Network monitoring, Performance management, BMC, DataCenter, Networkingchannel, Analytics, CA, Systems monitoring, CMDB, Application monitoring, InfoVista, IBM Tivoli, HP Software, Network Configuration, RealOps, RBA, Run Book Automation, IT Process Automation, Netuitive, NetQoS, Opnet, DCAB, Tideway

I was doing some “light” reading this morning and came upon this recent article:  Top 10 Enterprise Management Tools

It’s focused on Complete Enterprise Management, not specifically focused on the Data Center so I thought I would summarize and then compare/contrast/discuss:

  • Network Fault & Performance: CA eHealth & Spectrum
  • Consolidated Event Management: IBM Tivoli Netcool
  • Service Impact Monitoring: IBM Tivoli Business Service Manager & Service Level Advisor
  • Application Discovery Mapping: Tideway Foundation
  • Business Intelligence: Cognos
  • ITSM Workflow, CMDB and Service Desk: BMC Remedy ITSM and Atrium
  • Network & Systems Configuration Management: HP Automation (formerly Opsware SAS & NAS)
  • Process Automation: BMC RunBook Automation

Since it isn’t data center centric, it’s light on automated management for applications & databases.  It also chooses to stay away from the very congested and sometimes confusing security/protection market.

Next up, I thought  it would be fun to do a quick mapping to the Data Center Automation Blueprint.

  • Network Fault & Performance, Consolidated Event Management, Service Impact Monitoring = Availability & Performance
  • Application Discovery Mapping, CMDB = IT Resource Reconciliation
  • Business Intelligence = Analytics (maybe…Analytics is still a work in progress…need to figure out this vs. BSM etc)
  • ITSM Workflow, Service Desk = outside of DCAB listed as Manual Task Orchestration

I was surprised not to see an End-User Application Performance Monitoring category.  These products either do their duty from passive agents on the endpoint or from data center appliances using slick algorithms, TCP/IP theory, etc.  Maybe that could have indirectly been rolled under Network Fault & Performance, as CA acquired Wily, which offers that.  The other one missing was more towards Capacity Planning and Trending Analytics, either based off historical data like what Opnet offers or off real-time data patterns like what Netuitive offers.

Needless to say, I found it a really nice write-up and summary of those products/offerings.  The only thing I struggle with is that all of the big 4 (BMC, CA, HP, IBM) are represented in this mix.  Which means you will have 4 sales guys all continuously battling it out to grab more land.  This may be good from a cost competition standpoint, but it’s a real fiasco for making sure all parts are playing nicely with each other or simply managing those vendor relationships.  Bottom line, you’re always going to have at least one of the big 4 in there as they continue to snap up the innovative smaller companies/technologies to enhance their portfolio and offer differentiation.  So I’d typically recommend a strategy where you pick 2 of the big 4 and keep them in check versus each other while continually looking for those innovative start-ups to fill in the gaps.  Here is an example of how you could do this using the categories in the original article.

  • Network Fault & Performance: HP Network Node Manager, Operations Manager, Performance Insight
  • Consolidated Event Management: IBM Tivoli Netcool
  • Service Impact Monitoring: IBM Tivoli Business Service Manager & Service Level Advisor
  • Application Discovery Mapping: IBM Tivoli Application Dependency Discovery Manager
  • Business Intelligence: Cognos (which IBM recently acquired)
  • ITSM Workflow, CMDB and Service Desk: HP AssetCenter (formerly Peregrine)
  • Network & Systems Configuration Management: HP Data Center Automation (formerly Opsware SAS & NAS)
  • Process Automation: HP Operations Orchestration (formerly iConclude that Opsware acquired)

Or, if you want to completely rebel and go the non-big 4 route, take a look at the above mappings to the DCAB and look for a name that’s not big-4.  Example:  Network Fault & Performance: InfoVista or NetQoS


Dec 28 2007   11:31PM GMT

Digging into each of these 6 functional areas: Performance and Capacity



Posted by: Ryan Shopp
Network monitoring, Performance management, Symantec, BMC, EMC, NetIQ, Alcatel-Lucent, NetScout, DataCenter, CA, OSS, Systems monitoring, InfoVista, IBM Tivoli, HP Software, Quest Software, Netuitive, Integrien, NetQoS, Compuware, Fluke Networks, Network Instruments, Opnet, Entuity, Brix Networks, Keynote, Gomez, Xangati, Apparent Networks, Packet Design, Groundwork, Hyperic, Nagios, OpenNMS, ZenOSS, Zabbix

First things first, we have many of the same vendors from the Availability & Notification functional area of this Data Center Automation Blueprint in this category. Which probably begs the question: do we combine Availability & Notification with Performance & Capacity? I know in the OSS (not Open Source Software but telco-oriented Operational Support Systems) model they do this and call it “Service Assurance”. Another name could be Service Level Management, as the two monitoring-centric functions are about ensuring service levels are met…or do I simply call it Availability & Performance? I’ll come back to this at the end after I type up the players in this Performance & Capacity area:

But then, we have a slew of others that have been around for quite some time now…

And some innovative up-and-comers in some unique technology/approaches…

Real-Time Behavior/Pattern Analysis through Dynamic Thresholding

IP Traffic/Packet Flow Monitoring & Analysis

Open Source Software (OSS) vendors

Whew..that was more work than I expected to pull together and I’m not done yet…  Please throw into the comments who I’ve missed (I know there have to be a few).

The major challenge here is organizing and breaking down this functional area.  There are so many approaches to obtain performance metrics from/for the data center.  Some of the techniques and perspectives include:

  • passive vs. active
  • agent vs. agent-less
  • in-line appliance vs. out-of-band appliance (e.g., span a port)
  • proprietary vs. leverage infrastructure mgmt. capabilities (e.g., Cisco Netflow)
  • outside the data center looking in vs. inside the data center itself.
  • Reactive troubleshooting vs. Proactive Predictive

I’m going to need to have a part two (and maybe more) for this functional category breaking down the pros and cons of various approaches, which vendors do what, etc.  I also need to revisit that question from the top: do we combine this into a single “availability & performance” functional category?  For now, this first pass will have to do…


Dec 17 2007   5:59PM GMT

Next pass on Data Center Automation “Blueprint”



Posted by: Ryan Shopp
Storage, Security, Network monitoring, Performance management, Virtualization, DataCenter, Systemschannel, ITIL, Systems monitoring, CMDB, Application monitoring, WAN optimization, FCAPS, eTOM, RBA, Run Book Automation, IT Process Automation

Thanks for the feedback, I’ve incorporated some points that have been made into an updated version of the Data Center Automation Blueprint (DCAB).

data-center-automation-blueprint2.jpg

As mentioned previously, this is a work in progress and I love getting feedback, ideas, concerns etc. on the model. I’m trying to build a functional model (at the 30,000 foot level) that represents key software functionality to automate the data center towards someday becoming “lights out.”

Also, with that said, it needs to be comprehensive but not overwhelming. I want to keep the yellow DCA functional areas limited in number…if this grows to be much more than the current six I feel it becomes too complex. So to add any new areas I need to assess how they compare to the current areas and whether I could combine any areas.

One I’m struggling with right now is I’ve received feedback that analytics itself is an area. The interesting thing is analytics currently fits to some degree within each of the 4 horizontal functional areas (e.g., Configuration/Change, Security/Protection) as each of those products offer advanced reporting and as that progresses they do predictive reporting and analytics around that functional area.

Analytics would also show up at the dashboard level (currently beyond the scope of what I’m defining as the functional areas of the Data Center Automation Blueprint) where you would correlate business intelligence, patterns etc. across not just Data Center Automation functional categories but also across manual task orchestration (e.g., service/help desk) details.

Thoughts?

One more thing to clear up, I know some (many) of these functional categories and their products extend beyond the Data Center. The lens this blog looks through is exclusively focused on the challenges posed by large, complex data centers. For example, I know performance products are also useful in all sized companies (big & small) and also beyond the data center (e.g., headquarters, remote offices, partner networks, etc).


Nov 12 2007   10:54PM GMT

How to be a network admin god?



Posted by: Ryan Shopp
Networking, Network monitoring, DataCenter, OSS, NCCM, Alterpoint

Simple, take advantage of FREE but powerful tools to do your job better/faster/easier! Then share these cool tool finds with your friends.

I had the chance to take a look at ZipTie, a free network administrator “cockpit”, over the weekend. The utility, available for download from www.ziptie.org, is part of a growing open source movement in network and systems management. I recommend putting aside 60 minutes over lunch one day to download and check this out while you’re eating your sandwich.

The best comparison I can make around current ZipTie capabilities would be to imagine PuTTY or SecureCRT on steroids.  NOTE: you need to have credential password access to the network devices to get the value I’m going to talk about from here on out…so if you don’t have those rights on your network devices then this may not be for you. Below is a quick screen shot that shows the primary cool features I’m going to hit.

ziptie.JPG

What is so impressive about this desktop utility is its simplicity. Download, install, discover and now you have a personal inventory list (e.g., routers, switches, wireless access points, application acceleration devices). From that device list you can take a variety of forensic or troubleshooting actions when you need to:

  • telnet/ssh
  • ping
  • traceroute
  • nslookup
  • SNMP MIB walk
  • Port status
  • Interfaces status
  • View current configuration files (search it)
  • Compare to historical configuration files
  • NIPPER (a really cool configuration auditing tool that analyzes your configurations for vulnerabilities)
  • and much more…
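To show what “compare to historical configuration files” boils down to, here is a minimal sketch using Python’s standard difflib module. This is not how ZipTie itself is implemented; the two configuration snapshots below are invented for illustration.

```python
import difflib

# Hypothetical router configuration snapshots, invented for illustration.
PREVIOUS = """hostname core-rtr-1
interface GigabitEthernet0/1
 description uplink
 no shutdown
snmp-server community public RO
"""

CURRENT = """hostname core-rtr-1
interface GigabitEthernet0/1
 description uplink to dist-sw-2
 no shutdown
"""


def config_diff(old, new):
    """Return a unified diff between two configuration snapshots as a list of lines."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )


if __name__ == "__main__":
    for line in config_diff(PREVIOUS, CURRENT):
        print(line)
```

Lines starting with “-” exist only in the older snapshot and lines starting with “+” only in the newer one, which is usually enough to spot an unplanned change quickly.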

If you don’t see a tool that represents a current script you typically use when you’re troubleshooting, no worries. You can build one (remember this is open source) or, if that’s not your forte, head up to their user community, post the current script you use and ask for someone else to help build it. Same thing goes for making sure ZipTie has support for the network devices you need. Say for example you have some firewall that it seems no other network management vendor supports: not a problem for ZipTie. There is a “how to guide” to build it yourself or, again, post up to the community and ask for help! Also, while you’re up on the site, check out the other capabilities the utility offers, and make sure you review the complete road map they publish.

It’s amazing how far network management has evolved and is still evolving. Functionality like this would have cost an enterprise tens if not hundreds of thousands of dollars less than 10 years ago. This will be another angle to consider as I get back on track and continue to build out the Data Center Automation Taxonomy I’ve been working on. Just wanted to take a moment and share this find.

Full disclosure: I worked for AlterPoint over a year ago. This ZipTie initiative was just about to start when I left. This was my first chance to check it out and since I was so impressed I felt compelled to share my perspective.


Nov 6 2007   5:17PM GMT

Building out a reference model for data center monitoring/automation



Posted by: Ryan Shopp
Security, Network monitoring, Performance management, Virtualization, DataCenter, Systems monitoring

So as this blog is just getting rolling I’m quickly realizing I need to come up with a graphical reference model, key approaches and metrics to reference. So to get that process started, I’m going to brainstorm some items here and hope to get some feedback on areas I should make sure I don’t forget. I’m not trying to re-create the wheel here, but in my experience with ITIL, FCAPS, TMN, OSS, etc I still haven’t found a model that is technical enough to capture the essence of the challenges I’m solving - while not so technical I get lost in the weeds. I’ve seen 50,000 foot views and I’ve seen 10,000 foot views but I’m aspiring to find something that is at the 30,000 foot level.

Data Center Infrastructure categories:

  • Network Connectivity: Routers, Switches, CSU/DSU, WiFi
  • Network/Application Optimization: Load Balancers, WAN Optimizers
  • Network/Application Security: Firewalls, Intrusion Prevention, Data Leakage
  • Application Servers: Windows, Solaris, Linux, Virtualization
  • Applications: ERP, CRM, Web, Databases, VoIP, Streaming Media (may need to break this down further)

Data Center Automation categories

  • Performance/Capacity Management - throughput, processor usage, memory usage, latency
  • Event/Fault Management - availability, consolidator of all alerts/messages into single pane of glass
  • Configuration/Software Management - upgrades, functionality changes, deployment, provisioning
  • Security Management - vulnerabilities, intrusions, leakage

The first area I’m thinking through is Performance Management where you gather key metrics over time to assist in the identification of current or future performance hindering situations that may ultimately result in productivity or revenue losses by an enterprise.

Key Performance Metrics

  • Basic (all components in the Data Center should provide these): Processor Usage, Memory Usage, Throughput, Latency
  • Advanced (will be unique/specific to each Data Center category): Bandwidth savings (e.g., WAN optimization), Transaction failures, Page faults, etc.

Point of View for actual metric

  • System-centric - something specific to a Data Center infrastructure category (e.g., processor utilization)
  • Flow-centric - something watching transactions end-to-end at some point in the infrastructure (e.g., VoIP transaction, DNS resolution request)

Then the last area to consider and discuss is the methods by which this information is gathered: proprietary agent, agentless, hardware appliance, leveraging an established vendor’s agent, etc. Certain information may only be available through certain methods. Those methods may or may not be an option for use depending on the enterprise’s business requirements. I’m going to need to come up with a way to organize/categorize these based on business uses (e.g., NetFlow, RMON2, SNMP, WMI, RPC, XML, Proprietary).

So stay tuned as I work to pull this together over the days ahead. Once I’ve hashed out this model I hope to provide a taxonomy of vendors and how they map to each. Once we have that in place then it will be time to start going through best practices and methodologies around evaluating vendors to meet your company’s individual business requirements.

As always, please provide feedback, thoughts, ideas as we build this out.  Note to self:  This is currently centered on managing the IP portion of the Data Center, not inclusive of power, space, non-IP storage, etc…once I get the IP portion down I hope to extend into those areas.


Oct 31 2007   8:12PM GMT

Activities in Application, System & Network Performance Monitoring



Posted by: Ryan Shopp
Networking, Network monitoring, Performance management, Microsoft Windows, Symantec, BMC, EMC, DataCenter, CA, Systems monitoring, Application monitoring, SolarWinds, InfoVista, Accellent, IBM Tivoli, HP Software, Quest Software

Big item to post about right out of the gate!  We all are familiar with the “Performance Management” sector within the Data Center.  Quick couple sentence summary.  Software that automates the collection and identification of potential performance bottlenecks within the data center.  Performance bottlenecks meaning real-time delays, conditions that are affecting productivity, or analytics that leverage historically collected data to help predict a potential performance concern before it happens.

Now there are a TON of large players in this space which we will review in more detail in upcoming posts (e.g., BMC, CA, HP, IBM, EMC, Symantec, Quest Software, Microsoft) but today I want to hit on a couple vendors you should consider if you’re tired of working with your current vendor (most likely one of the big names above).

InfoVista is one of the last pure-play companies that provide solutions for automating Data Center Performance Management/Monitoring.  Yesterday, they finally announced a move (after years of OEM’ing various products) to round out their solution on the application performance management side.  I’ve talked to a number of large global enterprise/telecom customers who speak the gospel about the quality and capabilities of their products.  They’ve been known in the past for their network and systems centric capabilities but with the acquisition of Accellent they now own the application monitoring technology.  Now, let’s be clear - their solution is designed for large, large Enterprises and/or Telecommunication companies.  If you’re not looking to do a major global deployment spanning large data centers and/or vast numbers of remote offices this solution may be overkill for you.   If that is the case I would recommend you take a look at another company.

SolarWinds is making some major investments in their offerings.  If you’re a small or medium business, or wishing to manage a portion (specific group/organization) within a larger enterprise, then take a look at their Orion product line.  You get a major bang for your buck (many times 75% of the functionality you use from one of the big guys at a price point most likely less than your annual maintenance contract).  The other beautiful thing is you can download and evaluate the product in all its glory without ever talking to a single sales person.  Also, they have a very active community behind their products including a great blog, Geek Speak by my friend Josh Stephens, that provides very useful insights and perspective on leveraging their products.