NPM 12.4 is available today, December 4, on the Customer Portal! The release notes are a great place to get a broad overview of everything in the release. Here, I'd like to go into greater depth on the brand-new Cisco ACI support. Let’s talk a bit about how software-defined networks are different than traditional networks, what that means for monitoring, and how to get the most out of the new ACI monitoring feature.
What is SDN?
The first time I heard the term Software Defined Network, I thought it was stupid. All networks are defined by software. Software moves packets and frames, or programs the hardware that does it. Software is used to manually configure networks via CLI. Software is used to automatically configure networks with protocols like OSPF, STP, and LLDP. Networks were alreadysoftware-defined!
Whether SDN is a good name or not, it is an important concept. There’s a lot of people trying to define SDN, usually with some ulterior motive of placing themselves in a favorable position. For a slightly less biased view, check out the Wikipedia definition. The thing that stands out to me is:
“SDN suggests to centralize network intelligence in one network component by disassociating the forwarding process of network packets (data plane) from the routing process (control plane).”
This is a big change. In an SDN environment, network devices like routers and switches become simple devices that just move traffic at a high rate. All the intelligence is in a separate device called the controller. The controller learns how everything is connected, what connectivity applications need, and writes instructions to all of the network devices so they know how to forward traffic.
There are a ton of SDN solutions available today. The two most popular commercial solutions seem to be Cisco ACI and VMware NSX. Cisco ACI is more commonly requested by our customers (see NPM Monitor Cisco ACI and Support of a Cisco ACI networks in Network Performance Monitor compared to Vmware NSX Support), so we’ve built support for it first.
How Do I Monitor SDN?
An SDN fabric consists of a data plane and a control plane. The data plane is comprised of physical devices, Nexus switches, and, in the case of Cisco ACI, cabling. The control plane is comprised of many logical components that fit together to define what endpoints are allowed to send network traffic to each other. The modular nature of the configuration reminds me of Cisco’s MQC. To make sure your SDN environment is running well, you need to monitor both layers.
Data Plane (aka Underlay aka Infrastructure Layer)
AKA the boring stuff. This is not the glamorous part of SDN. It’s the stuff you’ve been doing for years: power supplies, fans, temperatures, CPU, RAM, and interface stats. The fact of the matter is, these things all need to function properly for your SDN environment to be performant and reliable.
The data plane for Cisco ACI environments is made up of the Cisco Nexus model line. Fortunately, NPM 12.3, the release before this one, introduced Network Insight for Nexus. This gave NPM better than ever support for this hardware.
It’s easy to set up. Navigate over to Settings (top menu bar) -> Manage Nodes -> Add Node. Add your spine switches and leaf switches as SNMP nodes. On the last step, make sure to check this box:
If you already have your switches in NPM, you can find the same checkbox when you edit a node.
You’ll be prompted for your CLI credentials. CLI is the only way some of this very important data is available, so that’s how NPM gets it. This will cover the basics like power supplies, fans, temperature sensors, CPU, RAM, and interface statistics, plus the advanced stuff like VPC. Those of you with NCM can also get access list version control and analysis. Those of you with NTA will get flow analysis. You can check all of that out on our demo site here.
Okay, let’s get to the new interesting stuff.
Control Plane (aka Overlay aka Control Layer)
In an SDN environment, the controller has all the intelligence. This has a big impact on monitoring. Instead of polling dozens or hundreds of devices that each have their own very narrow view of the network, we can poll the controller directly. It has to know where everything is or it couldn’t control it. This means we can learn a lot from monitoring it.
This part is also easy to set up. Navigate again to Settings (top menu bar) -> Manage Nodes -> Add Node. In a Cisco ACI environment, the controller is called an APIC. Add your controller as SNMP nodes. At the bottom of the first screen you’ll see this checkbox:
Check it! If you’ve already got your APIC added, edit the node and you can find the same box to check.
Cisco strongly recommends each ACI fabric have three APICs. Since each APIC must be able to control the entire network if necessary, each APIC has a complete view of the network. Polling them all results in a lot of duplication of work and potentially duplicate alerts. You have a choice in how you approach monitoring of these devices:
- 1) Add all three APICs to monitor but enable API-based ACI polling (the checkbox) for only one controller.
- a. Pros: efficient for the APICs and efficient for NPM.
- b. Cons: if the controller you’re doing API-based polling on goes down, you’ll see the APIC is down, but you’ll lose visibility to the control plane until you fix it or enable API-based polling for another controller.
- 2) Add all three APICs to monitoring and enable API based ACI polling for all three controllers.
- a. Pros: Control plane monitoring works, even if one or two of the three APICs go down.
- b. Cons: NPM has to poll the same data three times. APICs have to provide the same data three times. You will get duplicate alerts and reporting data unless you’re careful to write your alerts in consideration of the duplicate data. More on this in a future post.
Our recommendation is to do #1, but either way will work.
The API-based polling runs over TLS. If you have a valid cert on your controllers, everything will add fine and you’ll be good to go. If you have a self-signed cert, you will receive a warning about it and you’ll have to accept the risk or replace it with a properly signed cert before proceeding. You do have a real cert on your APIC, right?
Once you complete the add node wizard, navigate on over to Node Details for one of your APICs with API-based polling enabled. You can click along with me right now on the Online Demo. On the left side, you’ll see two new views: Members and Map. Let’s look at Members first.
The Members view shows all of the logical components we have discovered. This includes Tenants, Application Profiles, and EndPoint Groups. It also includes the APIC’s view of the physical components: leaf switches and spine switches.
This uses the framework’s List View, which is a polished way to deal with large lists. You can do multilevel filtering on the left, like sort, and search. The list contains the name of the component (example: Tenant3), the type of component (example: Tenant), and the distinguished name (example: uni/tn-Tenant3). On the right, we see the health score. Let’s talk about that.
Since the controller has visibility into all components and their relationships, for the first time, part of the network infrastructure is in a position to accurately assess its health. Cisco ACI does this by assigning a health score. The health score is an integer from 1 to 100, where 100 is perfectly healthy and less than 100... isn’t. The health score takes into consideration both parents and descendants in the ACI model. You can check out the exact formula here. Since health scores represent status, they’re polled at the status interval in NPM. As always, you can adjust this interval. All of this data is polled via Cortex, incidentally, our new polling framework that you previously saw powering PerfStack Real-Time Polling.
Health scores will be colored red, yellow, or green according to thresholds. There are thresholds on the APIC already for this that determine what color that score is in the APIC GUI. To stay consistent, NPM learns the thresholds from the APIC and applies those. If you customize the thresholds on the APIC, NPM will learn and apply the new threshold settings.
You can click on a health score to get the history in the PerfStack dashboard:
Thanks to this being in PerfStack, it’s easy to start correlating other metrics about the APIC, leaf switches, and spine switches. It gets more interesting when you start correlating to end node availability, latency, and other data NPM has. If you own other modules on the Orion Platform, you can correlate that data too; for example, application counters, database wait time, IOPs, logs, and all the rest. Seeing all this data normalized on the same shared timeline is powerful for troubleshooting. If a health score is in bad shape and you think the issue is on the controller, it’s time to log in to the APIC itself. The APIC can tell you what is causing the score to be what it is and has a bunch of additional ways to troubleshoot.
Returning to the sub-view menu on the left, let’s check out the Map tab.
When you first open the map, you’re only going to see the APIC in the center. To get more on the map, select the APIC. On the right side, the inspector panel will open. Here you can check the box next to related entities and press Add at the bottom to add them to the map. You can use this method to continue to spider through your ACI environment. This works well for creating a map of a small ACI environment or of a specific section of a larger ACI environment, like a tenant or an app. Once you’ve got a map you like, you can select to Save as a group in the top right. From that point forward, you can navigate to that group and press the Map tab to see the map again. Here’s an example of one I saved in my lab:
Pretty slick! One important note: the APIC GUI already has some capability to map an ACI environment. In talking to NPM users who run ACI environments, I frequently heard that they would like to grant read-only access via a common platform for folks who don’t have access to the APIC directly, like NOC engineers. This accomplishes that goal and lets you correlate and visualize with all of the other data currently available in Orion Maps.
To upgrade now, customers with NPM under active maintenance can head over to the Customer Portal and download NPM 12.4. Thanks to the improved Orion Installer, upgrade is faster than ever with centralized upgrade of additional polling engines. Once you’re installed, add those ACI nodes and reply here to let us know how it’s working for you!