I got involved in a conversation at the whiteboard the other day where we were talking about the value of managing network devices via their loopback address. It occurs to me that this may be a best practice that many of you haven't implemented so I thought I'd write a little bit about why it's important and how it can help you.
For the sake of keeping things simple, let's talk about managing and monitoring routers. As you know, a router will typically have several IP addresses - at least one per interface - and you can usually use any of these addresses to communicate with the router from your NMS. Now, let's assume that you use the IP address of interface serial 1/0/1.1 to monitor the device. What happens if that interface goes down? Suddenly, your NMS thinks that the entire device is offline. You'll get alerts that the router is down, and the NMS will stop collecting data for the other interfaces as well as router-level statistics like latency, CPU, memory, and buffer usage.
So, how do you get around this issue? You implement a loopback address on all of your routers and make sure that these addresses are reachable throughout the network. A loopback address is basically an address on a virtual interface at the core of the device. Because that interface isn't tied to any physical port, it doesn't go down when a single link does. As long as you have a path to the device, the loopback address should be reachable.
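To make the failure mode concrete, here's a minimal Python sketch of a toy model. It isn't anything an NMS actually runs: the Router class, the addresses, and the interface states are all made up for illustration, and it assumes the loopback is advertised through routing so that any working interface provides a path to it.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Toy model of a router: physical interfaces plus one loopback."""
    loopback_ip: str
    # Maps interface IP -> True if that interface is up (hypothetical data).
    interfaces: dict = field(default_factory=dict)

    def answers_on(self, ip: str) -> bool:
        """Would a poll sent to this address get a reply?"""
        if ip == self.loopback_ip:
            # The loopback never goes down on its own; it is reachable
            # as long as at least one interface still offers a path in.
            return any(self.interfaces.values())
        # An interface address only answers while that interface is up.
        return self.interfaces.get(ip, False)


router = Router(
    loopback_ip="10.255.0.1",
    interfaces={"192.0.2.1": True, "192.0.2.5": True},
)

# The link that the NMS happens to poll goes down.
router.interfaces["192.0.2.1"] = False

print(router.answers_on("192.0.2.1"))   # polled by interface address
print(router.answers_on("10.255.0.1"))  # polled by loopback address
```

With one of two links down, the poll to the interface address fails while the poll to the loopback still succeeds, which is exactly the false "router down" alert that the loopback avoids.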
This best practice can save you a lot of time and effort down the road if you implement the loopback addresses before upgrading or deploying a new NMS.
Tomorrow I'm hosting a webcast on best practices for monitoring applications and application servers. To attend the live event register here.
Otherwise, we'll be posting a recorded version within the next few weeks.
I'm working on an issue with a few of our customers that I thought was interesting to mention here. We're attempting to collect NetFlow data from some Enterasys N7 switches. We've been able to successfully manage the switches with Orion NPM, and configuring the switches to export NetFlow data was all very straightforward. However, once we started looking at the NetFlow data we ran into an issue.
Within a NetFlow packet there are typically several NetFlow PDUs - or individual records of conversations. Typically, each PDU will have information like the source and destination IP address of the traffic, source and destination port number, protocol, AS information, and the ingress and egress interfaces of the router/switch that the conversation traversed. The interface information should be represented by the ifIndex (from the MIB-II ifTable) of the interface the traffic went through.
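For anyone who wants to see where those fields live, here's a rough Python sketch that unpacks a single flow record. I'm assuming the NetFlow version 5 record layout (48 bytes per record) purely for illustration; the post doesn't say which version these switches export, and the sample values below are made up.

```python
import socket
import struct

# NetFlow v5 flow record layout: 48 bytes, network byte order.
# (Assumption: v5 is used here only as a well-known fixed format.)
V5_RECORD = struct.Struct("!4s4s4sHHIIIIHHBBBBHHBBH")


def parse_v5_record(data: bytes) -> dict:
    """Unpack one 48-byte v5 flow record into the fields discussed above."""
    (src, dst, _nexthop, in_if, out_if, _pkts, octets, _first, _last,
     sport, dport, _pad1, _flags, proto, _tos, src_as, dst_as,
     _smask, _dmask, _pad2) = V5_RECORD.unpack(data)
    return {
        "src_ip": socket.inet_ntoa(src),
        "dst_ip": socket.inet_ntoa(dst),
        "src_port": sport,
        "dst_port": dport,
        "protocol": proto,
        "src_as": src_as,
        "dst_as": dst_as,
        "input_ifindex": in_if,    # should match a row in the ifTable
        "output_ifindex": out_if,  # should match a row in the ifTable
        "octets": octets,
    }


# Build a made-up record so the example runs without a live exporter.
sample = V5_RECORD.pack(
    socket.inet_aton("10.1.1.10"), socket.inet_aton("10.2.2.20"),
    socket.inet_aton("0.0.0.0"),
    3, 7,            # input and output ifIndex
    12, 9000,        # packets, octets
    0, 0,            # first/last sysUpTime
    51512, 443,      # source and destination port
    0, 0x18, 6, 0,   # pad, TCP flags, protocol (6 = TCP), ToS
    65001, 65002,    # source and destination AS
    24, 24, 0,       # source mask, destination mask, pad
)
record = parse_v5_record(sample)
print(record["src_ip"], record["input_ifindex"], record["output_ifindex"])
```

The two ifIndex fields are the ones a collector has to map back to interfaces it already knows about from SNMP.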
On these particular switches, we're seeing some indexes that don't appear anywhere within the ifTable so I can't tell where to associate the traffic. It looks like maybe these indexes are from the bridge table, but it's hard to tell for sure.
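A quick way to size up a problem like this is to tally which interface indexes in the flow data have no match in the ifTable. Here's a minimal Python sketch of that check; the function name, the dictionary keys, and the sample numbers are all hypothetical, and it assumes you've already parsed the flow records and polled the list of ifIndex values some other way.

```python
from collections import Counter


def unmapped_ifindexes(flows, iftable_indexes):
    """Count flow references to interface indexes missing from the ifTable."""
    known = set(iftable_indexes)
    missing = Counter()
    for flow in flows:
        for key in ("input_ifindex", "output_ifindex"):
            if flow[key] not in known:
                missing[flow[key]] += 1
    return dict(missing)


# Made-up data: indexes polled from the ifTable and three parsed flows.
iftable = [1, 2, 3, 7]
flows = [
    {"input_ifindex": 3, "output_ifindex": 7},
    {"input_ifindex": 11001, "output_ifindex": 7},
    {"input_ifindex": 11001, "output_ifindex": 11002},
]

print(unmapped_ifindexes(flows, iftable))
```

If the unknown indexes cluster in a recognizable range, that's a useful clue about which table the exporter is actually pulling them from.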
I'm working with the customers and Enterasys on this and hope to have a resolution soon but I thought I'd throw it out there to see if anyone else is seeing this.
If you want to stay up to date with this issue, I'll be tweeting about it as the situation progresses...
I read an interesting article today on Windows Version 7 and what it will take to make it successful. It's a great little article and really got me to thinking about the impact that "geek approval" has upon new technology as a whole. I'm not sure that many of us ever really appreciate the awesome power that we have as a "geek nation" and the fact that we can pretty much control the world from the comfort of our own ergonomic computer chairs if we will but band together and let our opinions be known.
Just remember this - with great power comes great responsibility. Yeah, who said that?
As a guy that's been using PCs pretty much since day one, I must say that I really enjoyed the new "I'm a PC" commercial that Microsoft recently released. After suffering through the Emmys with my wife last night, I can honestly say that this commercial was the only bright spot in what was otherwise somewhat like a 3-hour root canal without the pleasure of nitrous oxide.
Apple's released some very creative and entertaining Mac commercials the last few years and it was really nice to finally see Microsoft step up a little and lose the pocket protector.
I get asked a lot about integrating Orion with helpdesk/trouble ticketing systems. As a matter of fact, I was in a meeting with a group of people here in NY today and the subject came up again, so I thought I'd blog about it tonight.
There are several really easy ways to integrate SolarWinds applications like Orion Network Performance Monitor (Orion NPM) and Orion Network Configuration Manager (Orion NCM) with trouble ticketing systems, and these types of integrations can offer a lot of payback to your business. The easiest way to integrate the two systems is to have the Orion alerting engines send a message to the ticketing system when an alert is triggered. This can be done with an SNMP trap, a Syslog message, an e-mail, a script, or any of several other ways. The real key is to have Orion include the fields that you want in the ticket and then configure the receiver to use those fields appropriately.
Additionally, many customers take advantage of Orion's open SQL database and the easy customization of the Orion website to do deeper integrations. There's really no limit here, as long as you plan ahead and document what you want to accomplish before you start hacking away (yeah, I know it isn't how we like to do it but trust me it works).
If you've integrated Orion with these types of systems in your environment or are interested in learning more about this, please comment here, contact me directly, or add to the forum thread on this subject here.
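To show what "include the fields you want and have the receiver use them" can look like, here's a small Python sketch built around a Syslog message. This is not Orion's alerting engine or any ticketing product's API; the function names, the "nms-alert" tag, and the field names are all made up for illustration, and the only thing assumed about Syslog is the usual priority prefix (facility times 8 plus severity) carried over UDP.

```python
import re
import socket

LOCAL0 = 16   # syslog facility local0
WARNING = 4   # syslog severity "warning"


def build_ticket_message(fields, facility=LOCAL0, severity=WARNING):
    """Sender side: pack the alert fields as key="value" pairs."""
    pri = facility * 8 + severity
    body = " ".join(f'{key}="{value}"' for key, value in fields.items())
    return f"<{pri}>nms-alert: {body}"


def parse_ticket_fields(message):
    """Receiver side: pull the key="value" pairs back out."""
    return dict(re.findall(r'(\w+)="([^"]*)"', message))


def send_syslog(message, host, port=514):
    """Send the message to a syslog listener over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), (host, port))


# Hypothetical alert fields; use whatever your ticket form needs.
alert = {
    "node": "core-rtr-01",
    "alert": "Interface Down",
    "interface": "Serial1/0/1.1",
}
message = build_ticket_message(alert)
print(message)
print(parse_ticket_fields(message))
# send_syslog(message, "ticketing.example.com")  # hypothetical receiver
```

The point of the design is that both ends agree on the field names up front, so the ticketing side can map them straight into the ticket rather than scraping free text.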
Anyways, I think I'll drop off for now as I got lost somewhere between the Philadelphia Airport and Atlantic City tonight in a rental car that was way too small and didn't have a GPS and ended up driving for an extra hour or two and my eyes are getting a bit blurry.
Although truthfully, I must admit that it's cool chillin' here at the Taj Mahal in Atlantic City and I just noticed that even the soap has Trump's name on it. Dude, that's pretty cool.
Well, I'm headed out to New York and Atlantic City for a few days for some presentations and customer meetings. Should be a fun trip and definitely cooler there than it's been here in Texas for the last few months.
I was working with a customer the other day and we were analyzing some of the data that Orion NPM is collecting from his core routers. On some of his gigabit interfaces we noticed that every few hours we got a couple of hundred discards (all at once, not spread through the hours). This caused us to investigate the root cause and also got us to talking about errors and discards, and the more I thought about it, the more I thought that some of this might be useful to other people.
First off, when you see errors or discards within your network management system you need to ask yourself two questions:
a) Do you trust the NMS?
b) Are you seeing any issues on those interfaces?
I mention trusting your NMS first as I've definitely seen cases where the network management software misreported these stats. If your software is from us here at SolarWinds, then you can skip this part, as in the 10+ years I've been helping to create and using these applications I've never seen them misbehave in this particular way.
When it comes to the second question, what I mean is this: if you hadn't noticed the stats being reported in your NMS, would you have been thinking about this interface? If not, and the number of errors or discards is relatively low, then you might just keep an eye on it to see if it gets any worse.
But let's assume that you've decided to go investigate these stats. One very important thing to understand is that there's a world of difference between discards and errors. Errors indicate packets that were received but couldn't be processed because there was a problem with the packet. In most cases, when you're seeing inbound errors on a router interface the issue is upstream of that device. It could be a bad cable, a misconfiguration on one end or the other, and so on. In most cases, these issues are resolved outside of the router where you're seeing the errors. Error reporting is documented within RFC 1213 (among others, including RFC 1573) and is typically pulled from the IF-MIB (ifInErrors and ifOutErrors).
With discards, the situation is almost the opposite. The packets were received with no errors but were dumped before being passed on to a higher layer protocol. A typical cause of discards is when the router needs to regain some buffer space. In the case of discards, the issue is almost always with the router that's reporting the discards (not with a next hop device, bad cable, etc). RFC 1213 also documents discard reporting, and the discard counters (ifInDiscards and ifOutDiscards) sit right beside the error counters within the IF-MIB.
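Since these are running counters, what you actually care about is how much they moved between two polls. Here's a minimal Python sketch of that arithmetic along with the rule of thumb from above. It assumes 32-bit counters and at most one wrap between polls; the function names, the hint wording, and the sample values are mine and not how any particular NMS does it.

```python
COUNTER32_MODULUS = 2 ** 32


def counter_delta(previous, current):
    """Change between two Counter32 samples, allowing for one wrap.

    Note: a counter reset (for example after a reboot) looks the same
    as a wrap here, so a real poller should also watch sysUpTime.
    """
    return (current - previous) % COUNTER32_MODULUS


def triage(prev_poll, curr_poll):
    """Turn two polls of an interface into where-to-look hints."""
    errors = counter_delta(prev_poll["ifInErrors"], curr_poll["ifInErrors"])
    discards = counter_delta(prev_poll["ifInDiscards"],
                             curr_poll["ifInDiscards"])
    hints = []
    if errors:
        hints.append(f"{errors} inbound errors: look upstream "
                     "(cable, configuration on either end)")
    if discards:
        hints.append(f"{discards} inbound discards: look at this router "
                     "(buffer space)")
    return hints


# Made-up samples: errors unchanged, discards jumped by a couple hundred.
previous = {"ifInErrors": 42, "ifInDiscards": 4294967290}
current = {"ifInErrors": 42, "ifInDiscards": 194}

for hint in triage(previous, current):
    print(hint)
```

In the sample, the discard counter wrapped between polls, so a naive subtraction would have produced a huge negative number instead of the real increase of 200.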
This blog post is getting long so I'll stop the description here, but ping me if you want to know more about this as I never really tire of talking about packets...
Recently a good buddy of mine Jimmy Ray Purser started a blog on Network World's site. Jimmy Ray is an engineer with Cisco and is one of the best all around dudes I've ever met. Network World did a short article announcing the new blog which can be read here.
Jimmy Ray is not only a great engineer but the dude is also hilarious. I highly recommend checking it out and adding it to your list of blogs that you watch regularly.
Today I was troubleshooting some issues with VoIP where we had a user connecting to our Austin office from our Brno office in the Czech Republic and voice quality was pretty much zero. It got me to thinking about the way that many of us monitor, support, and manage our VoIP infrastructures and our networks in general respective to latency, jitter, and packet loss.
In the old days so long as bandwidth was available and latency was somewhat under control your job as the network administrator was done. Not so today. In today's environments telephony is really just another component of our network infrastructure and we're not only managing the routers and switches but the call managers and gateways.
Understanding true latency, jitter, and packet loss (and as a result MOS) is a big part of getting a handle on VoIP performance - or for that matter the performance of any latency-sensitive application. This is why leveraging technologies like IP SLA is so important, as it helps you understand performance within the WAN vs. monitoring from a single point on the WAN.
Additionally, since most of our Network Operations Centers (NOCs) are now handling tickets on phone issues, we need to have visibility into these statistics and into the health and performance of the call managers from within our central NMS.
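To give a feel for how those three numbers roll up into a MOS, here's a rough Python sketch. The conversion from R-factor to MOS follows the ITU-T G.107 E-model mapping, but the way I turn latency, jitter, and loss into an R-factor is a commonly circulated rule of thumb that I'm assuming for illustration. It is not what IP SLA or any particular product computes, and the sample paths are made up.

```python
def estimate_r_factor(latency_ms, jitter_ms, loss_pct):
    """Rule-of-thumb R-factor (an assumption, not a standard).

    Jitter is counted double and 10 ms is added for codec delay,
    then packet loss is charged at 2.5 points per percent.
    """
    effective_latency = latency_ms + 2 * jitter_ms + 10
    if effective_latency < 160:
        r = 93.2 - effective_latency / 40
    else:
        r = 93.2 - (effective_latency - 120) / 10
    return r - 2.5 * loss_pct


def r_factor_to_mos(r):
    """Map an R-factor to an estimated MOS (ITU-T G.107 mapping)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)


# Made-up measurements: a clean LAN path and a rough intercontinental one.
good = r_factor_to_mos(estimate_r_factor(20, 2, 0.0))
bad = r_factor_to_mos(estimate_r_factor(300, 50, 5.0))

print(round(good, 2), round(bad, 2))
```

The exact numbers matter less than the shape: once latency plus jitter gets past a certain point, or loss creeps up by a few percent, the score drops off quickly.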
As many of you are aware, Orion has a module called the "VoIP Monitor" which is aimed at doing exactly this. It leverages IP SLA to pull statistics on WAN performance and pulls health and status data from Cisco Call Manager. As we continue to add features to this product I'd love to hear more about how you manage your environments and where you'd like to see it expanded.