Welcome to the next installment of the Network Management and Operations mini-series. This is part of the CLN VIP Network Specializations series of blog posts.
Here is a quick link to the original blog post for those who did not see it. It offers a great primer on many of the facets of Network Management: http://goo.gl/P6jwG. Please take a few minutes to read it if you have not already; it will provide a seamless transition into this blog post.
Most network administrators think they run a great network - and for good reason. If your customers are the only ones who tell you when things are broken, you really only notice the major failures. In a stable network where you are not constantly changing things, those can occur rather infrequently. Even worse, when you have distributed IT support and on-site personnel do not recognize an end-user issue as a possible network problem, you as the Network Administrator may not find out about it for quite a while, or at all. While this is a problem, it is a failure to implement the correct tools more than anything else.
This is where it can get fun - introducing Network Monitoring. With options that range from simple ping-based up/down tracking to full-blown SNMP-based Network Management Systems (NMS), you can track as little or as much information as you have space to store. Simple up/down tools are fairly straightforward, typically using vanilla pings to track the state of a device. We won't get into much detail on those here, as they should be self-explanatory. Most are very easy to set up and maintain.
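For readers who have never looked inside one of these up/down tools, here is a minimal sketch in Python of the idea. The names `ping_probe` and `poll_once`, and the "three missed polls means down" rule, are illustrative assumptions rather than the behavior of any particular product.

```python
import subprocess
from typing import Callable, Dict, Iterable, List, Tuple


def ping_probe(host: str) -> bool:
    """Send one ICMP echo and report whether it was answered.

    Assumes a Unix-like `ping` that accepts -c (count); Windows uses -n.
    """
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=5,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def poll_once(
    hosts: Iterable[str],
    probe: Callable[[str], bool],
    failures: Dict[str, int],
    down_after: int = 3,  # hypothetical rule: 3 missed polls in a row = DOWN
) -> List[Tuple[str, str]]:
    """Run one polling cycle and return the (host, "DOWN"/"UP") transitions."""
    events = []
    for host in hosts:
        if probe(host):
            # Only report UP if the host had previously been declared DOWN.
            if failures.get(host, 0) >= down_after:
                events.append((host, "UP"))
            failures[host] = 0
        else:
            failures[host] = failures.get(host, 0) + 1
            # Report DOWN exactly once, when the limit is first reached.
            if failures[host] == down_after:
                events.append((host, "DOWN"))
    return events
```

Calling `poll_once(hosts, ping_probe, failures)` on a timer, with `failures` kept between calls, gives you state changes rather than a line of output per ping. Requiring several missed polls is one simple way to avoid flapping on a single dropped packet.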
We'll discuss two major facets - monitoring and alerting - and touch on a tertiary component, reporting. They are indeed separate functions, because you do not necessarily want to alert on all of the items you monitor. For example, you may want to track utilization on all uplinks, but only alert on WAN uplink utilization. You can keep LAN link utilization statistics in your database without setting alerts on them. This keeps emails and trouble tickets to a manageable number without losing visibility. We'll talk more about tracking vs alerting later on.
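The split between tracking and alerting can be sketched in a few lines of Python. This is only an illustration: `LinkPolicy`, `record`, the link names, and the 75% figure are all hypothetical, and a real NMS would express the same idea through its own configuration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LinkPolicy:
    role: str                            # e.g. "WAN" or "LAN" (illustrative labels)
    alert_above: Optional[float] = None  # percent; None means "store only"


def record(
    link: str,
    policy: LinkPolicy,
    utilization_pct: float,
    history: List[Tuple[str, float]],
) -> Optional[str]:
    """Always store the sample; return an alert message only if policy says so."""
    history.append((link, utilization_pct))
    if policy.alert_above is not None and utilization_pct > policy.alert_above:
        return f"ALERT {link}: {utilization_pct:.0f}% > {policy.alert_above:.0f}%"
    return None


history: List[Tuple[str, float]] = []
wan = LinkPolicy(role="WAN", alert_above=75.0)
lan = LinkPolicy(role="LAN")  # tracked, never alerted on

print(record("branch-wan-uplink", wan, 82.0, history))
print(record("floor2-lan-uplink", lan, 95.0, history))
print(len(history))
```

Both samples land in `history`, but only the WAN link produces a message, which is the whole point: the data is there when you need it, and the inbox stays quiet.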
Where a network administrator can really earn their bread and butter is with the higher-end NMS suites that allow a broad view of the state of the network. Until you see the sheer number of items you can track via SNMP MIBs, you may not appreciate how detailed you can choose to be in your monitoring.
Monitoring is where you can either throw a monkey wrench into your illusion of a great network, or confirm your networking prowess. Most of the time, unfortunately, turning on monitoring results in more work - fixing the problems you uncover! While this is the bread and butter of network administration - the maintenance and improvement of the existing facilities - it can often seem counterintuitive to some.
You know that user who calls the help desk every Sunday, saying their network is slow? You log into the router and try to view some historical statistics, and see nothing of consequence. No errors, no up/down events, etc - you are always stumped. It seems to come and go throughout the year, as well, but since you're a busy network admin you don't break out your Sherlock Holmes kit to figure out the minutiae of the situation. Well, an NMS would save you all of that time, as you would receive the link utilization alerts, which oddly enough always correspond to the local sports team's games. Coincidence? I think not... case closed.
You can track events by a few different methods, the most popular being SNMP. SNMP stands for Simple Network Management Protocol, and there are two versions in popular use today. SNMPv2 (most commonly deployed as SNMPv2c) is a community-based protocol with the ability to read and write data on devices; however, it is not secure. SNMPv3 is secured and allows granular access control, but it is not always supported by older devices. New deployments should gravitate towards SNMPv3, as it is fairly straightforward to configure while offering far superior security - albeit outside of the scope of this blog post.
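To make the link utilization numbers mentioned above a little less magical, here is a rough Python sketch of how a poller can turn two readings of an interface octet counter (such as the IF-MIB `ifInOctets` or `ifHCInOctets` objects) into a percentage. The function name and parameters are my own for illustration, and the sketch assumes the counter wrapped at most once between polls.

```python
def utilization_pct(
    prev_octets: int,
    curr_octets: int,
    interval_s: float,
    speed_bps: float,
    counter_bits: int = 64,  # use 32 for the older 32-bit counters
) -> float:
    """Percent utilization of a link between two counter readings."""
    modulus = 1 << counter_bits
    # Modulo arithmetic gives the right delta even if the counter wrapped once.
    delta_octets = (curr_octets - prev_octets) % modulus
    return 100.0 * delta_octets * 8 / (interval_s * speed_bps)


# 3.75 GB received in a 5-minute poll interval on a 1 Gbps link
print(utilization_pct(0, 3_750_000_000, 300, 1_000_000_000))  # 10.0
```

The counter width matters in practice: a 32-bit octet counter on a fast link can wrap more than once between polls, which is one reason the 64-bit counters are generally preferred where the device supports them.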
Now you ask, why wouldn't I want to track every possible variable, and alert on everything? That is a great question, and the answer depends largely on the size of your infrastructure divided by the staff that handle these alerts. If you enable 900 traps on 10,000 devices, with an average of 3 alerts per device per day, and you have 12 folks on staff, you're left with 2,500 alerts per tech, per day, to reconcile. That is a ridiculous number of alerts to deal with. So let us make it manageable. To do this, we need to figure out which metrics we need to see pass a threshold before they cause a problem. Oftentimes, this includes:
- WAN Utilization
- Core LAN link utilization
- CPU Utilization
  - This is a primary reason for slow network response outside of bandwidth shortages.
- Crucial network infrastructure link up/down status
- Device up/down status
- QoS thresholds surpassed
- Events tracked by SNMP or syslog
  - BGP/IGP convergence/neighborship events
  - IP SLA failures
  - Module failures
  - Supervisor failovers
  - HSRP failover
As you can see, there is no shortage of things you can track. SNMP MIBs have had books written about them, and I am not going to try to re-hash that here. SNMP is an extremely powerful tool that can not only read information from devices, but also push/write information to them, including configurations.
Many man-hours are often spent trying to identify the most critical, yet effective, metrics to monitor within your network. You can tune this per box, per device type, etc, depending on what solution you use. This can be done at the device configuration level or, often, within the NMS configuration itself. The myriad possibilities are mind-boggling. Tracking everything but alerting on only specific things, as discussed next, is a great strategy to maintain the historical data set without flooding your IT staff with a cataclysmic number of alerts.
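If you want to run the staffing arithmetic from earlier against your own environment, it fits in a few lines of Python. The function name is made up for illustration, and the 8-hour shift used for the per-minute figure is my assumption, not a number from the example.

```python
def alerts_per_tech(devices: int, alerts_per_device_per_day: float, staff: int) -> float:
    """Average number of alerts each tech must reconcile per day."""
    return devices * alerts_per_device_per_day / staff


per_day = alerts_per_tech(devices=10_000, alerts_per_device_per_day=3, staff=12)
print(per_day)  # 2500.0

# Assumption: an 8-hour shift, to show what that volume means minute to minute.
print(round(per_day / (8 * 60), 1))  # 5.2
```

Roughly five alerts a minute, per person, all day long. Plug in your own device count and headcount and it becomes clear very quickly how selective your alerting needs to be.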
Now that you know what you are going to monitor - or you have a decent idea of where you will start - on to alerting. Alerting is fairly self-explanatory at face value: it is the means by which you inform your IT staff of events logged by the monitoring system which crossed a configured threshold. This could be a tracked interface going down, utilization on a link going over 75%, an HSRP failover, or a device not responding to the NMS. Whatever it is, you've configured a threshold and the event crossed it. Now you want to know about these specific events, in real time, as they occur.
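One detail worth seeing in code is the difference between "the value is above the threshold" and "the value just crossed the threshold". The Python sketch below only reports the crossings, so a link that sits at 90% for an hour generates one alert instead of one per poll. The function name and the RAISED/CLEARED labels are illustrative assumptions; the 75% default simply mirrors the example above.

```python
from typing import Iterable, List, Tuple


def threshold_events(
    samples: Iterable[Tuple[int, float]],
    threshold: float = 75.0,
) -> List[Tuple[int, str]]:
    """Turn (timestamp, value) samples into RAISED/CLEARED events."""
    events = []
    above = False  # are we currently in the alerted state?
    for timestamp, value in samples:
        if value > threshold and not above:
            events.append((timestamp, "RAISED"))
            above = True
        elif value <= threshold and above:
            events.append((timestamp, "CLEARED"))
            above = False
    return events


samples = [(0, 50.0), (1, 80.0), (2, 90.0), (3, 60.0), (4, 76.0)]
print(threshold_events(samples))
# [(1, 'RAISED'), (3, 'CLEARED'), (4, 'RAISED')]
```

Many tools go further and add a separate, lower clear threshold or a hold-down timer so that a value hovering right at 75% does not flap; that is left out here to keep the sketch short.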
Depending on the NMS you're using, you may have rudimentary options to alert on events, or you may have an option-rich set of tools to alert across multiple mediums, including emails, automatic ticket opening, etc. A popular option is to set up an email alias for your monitoring tool to send to, so that alerts can be forwarded to many people, as well as to other groups.
Another component of alerting, which the higher-end tools typically do a fair job on, is suppressing alerts via hierarchical alerting schemes. The concept behind this is "teaching" the NMS tool a certain level of your topology, so that it understands the notion of dependencies. For example, if you lose your internet router, and all of your VPN sites traverse this router to hit the VPN concentrator, there is no reason to alert on all 250 VPN sites going down; you merely alert on the internet router. You as the network admin know full well that means all VPN sites are down - there is no need to receive 251 emails when only 1 would do. This also does wonders for campus networks where you do not have fully redundant topologies in place. If you lose the core or distribution layers, there is no need to report on all of their single-homed access switches.
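The dependency idea is easy to sketch in Python. This is a simplified illustration, assuming each device has at most one upstream parent; the `root_causes` function and the device names are hypothetical, and a real NMS will model topology in its own way.

```python
from typing import Dict, List, Optional, Set


def root_causes(down: Set[str], parent: Dict[str, Optional[str]]) -> List[str]:
    """Return only the down devices that have no down device upstream of them."""
    roots = []
    for device in sorted(down):
        upstream = parent.get(device)
        seen = {device}  # guards against an accidental loop in the parent map
        suppressed = False
        while upstream is not None and upstream not in seen:
            if upstream in down:
                suppressed = True  # an ancestor already explains this outage
                break
            seen.add(upstream)
            upstream = parent.get(upstream)
        if not suppressed:
            roots.append(device)
    return roots


# 250 VPN sites that all depend on a single internet router
parent: Dict[str, Optional[str]] = {"internet-router": None}
for n in range(1, 251):
    parent[f"vpn-site-{n}"] = "internet-router"

everything_down = set(parent)
print(len(everything_down))                   # 251
print(root_causes(everything_down, parent))   # ['internet-router']
print(root_causes({"vpn-site-7"}, parent))    # ['vpn-site-7']
```

When the router and every site behind it are down, only the router is reported. When a single site drops on its own, it is still reported, because nothing upstream explains it.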
With that said, receiving notifications of an event in real time can be a rough process for companies to adjust to. Knowing that something occurred does not always give one the ability to resolve it immediately. During a transition period, fixing all of the previously unknown chronic issues can be a burden. Some companies bring on additional headcount for remediation; others just suppress notifications on chronic issues until they can be resolved. Both are viable options, as long as there is a solid plan in place moving forward.
You can also extend this functionality to internal customers as a value-add, providing further visibility that they may not have today. Monitoring servers or services can help them maintain better SLAs. Even merely providing proactive notifications to customers that you are tracking an issue and working to resolve it can stave off a bombardment of calls and emails asking "What happened? Why can't I reach this? What's going on?". We all know how that goes - it's like dominoes. One person notices a problem, calls three other people looking for answers, those three call three more each, and soon you're getting three dozen calls every 5 minutes requesting status reports.
Ideally, notifications should be worked into a well-documented, strictly followed reporting structure, both within the IT organization and out to your business partners/customers. Depending on how well insulated these groups currently are, it is a business call as to how far you peel back that insulation, or whether you maintain it. Never forget, knowledge is power, so be careful who you share it with, and when!
As a network admin, you will often have business units come to you asking for an explanation of a situation they either do not understand or do not find acceptable. More often than not this involves availability, performance, or both. The burden of proof is then on you because, unless you are a company that makes money off of its IT services, you are a cost center at the mercy of the business. Too often the IT management team will field these questions, only to turn to the Network Admin to provide the reasoning/explanation/proof. Do you have the tools in your toolbox to adequately do that today?
I've seen cases like this all too often, albeit typically more work-related ones. In many of those cases, we had solid monitoring and reporting tools which allowed us to dig back through historical data and re-create a map of network behavior over a period of time. We then compiled this data through reporting tools in a way that even the users could understand, and forwarded it along to the business unit. In many cases we found legitimate utilization issues, specifically when the entire site went to log in during the morning and after-lunch rush. With this information compiled and put together in a professional manner, it was easy to hand it off to the Business Unit for them to make an informed decision on whether they wanted to upgrade the bandwidth to those sites. Without the facilities in place to collect, organize, store and retrieve those data sets, however, our job would have been infinitely more complex. As a provider of IT services, you should consider your burden of proof when it comes time to tell or ask someone to spend more money on something intangible such as bandwidth. Make it tangible - put it in a graph!
This can also be used to justify billing agreements and provide proof of SLAs being met - or not being met by external providers - and can make the difference in a meeting. You can sit down and track availability against your approved maintenance windows, and claim Five Nines, with proof to back you up. Or, upon closer inspection, you may find you didn't reach that goal. Not a huge deal (unless you lose money over it...); now you have a specific goal to reach in the next quarter/year.
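Here is a small Python sketch of what "tracking availability against your approved maintenance windows" can look like as arithmetic. The function name is made up, the sample numbers are invented for illustration, and the sketch assumes approved maintenance time is simply excluded from the measurement period; your SLA may define it differently.

```python
def availability_pct(
    period_minutes: float,
    unplanned_outage_minutes: float,
    maintenance_minutes: float = 0.0,
) -> float:
    """Availability over a period, excluding approved maintenance windows."""
    scheduled = period_minutes - maintenance_minutes
    return 100.0 * (scheduled - unplanned_outage_minutes) / scheduled


FIVE_NINES = 99.999

# Hypothetical quarter: 90 days, 4 hours of approved maintenance,
# and 10 minutes of unplanned outage.
quarter = 90 * 24 * 60
result = availability_pct(quarter, unplanned_outage_minutes=10, maintenance_minutes=240)
print(f"{result:.3f}%")       # roughly 99.992%
print(result >= FIVE_NINES)   # False

# Downtime budget that Five Nines allows over a 365-day year, in minutes
print(round(365 * 24 * 60 * (1 - FIVE_NINES / 100), 3))  # 5.256
```

Ten minutes of unplanned downtime in a quarter already misses Five Nines, which allows only a little over five minutes in a whole year. That is exactly the kind of number that is far easier to discuss when it comes out of stored monitoring data instead of memory.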
Hopefully this edition of the Specialist blog series has illuminated another facet of network administration - one that you can take from here and add to your list of things to learn. As I said earlier, this can make or break a network group, and it can truly add value if executed properly. In this entry I merely scratched the surface of monitoring, alerting and reporting - but now that you have a solid foundation of the basics you can expand on them and really add a critical skillset to your toolbox!
Thanks again for taking the time to visit our blog!