The team continues to hammer away on enhanced and new application template content for Server & Application Monitor. The list below adds to what has been discussed in recent earlier posts, which you can find here, here, and here.

 

In this update, we will walk through the latest updates, including:

 

  • Version 2 of Office 365 monitoring – We’ve reorganized the templates a bit, but more importantly, fixed an issue some customers were experiencing where components would randomly go into unknown status
  • Citrix XenServer – Net-new template support
  • Citrix PVS Accelerator for XenServer – Net-new template support
  • Oracle RAC (Real Application Clusters) – Net-new template support

 

As always, please let us know if you have any comments about these templates or requests to add to our list for new template creation. 

 

The info provided in this post is relatively high-level. Click on the links to see the complete detail for each new or updated template.

 

With that, let’s jump right in!

 

Office 365 Exchange Enhancements:

 

As mentioned above, besides reorganizing the templates a bit, the main update here is the fix for the issue some folks had reported where components would randomly go unknown. The cause was Microsoft’s “Global Throttling Policy,” which limits simultaneous connections to O365 from one client to a maximum of three.

 

To overcome this concurrency issue, we have implemented a locking mechanism that restricts the number of scripts establishing a connection with Office 365 at the same time to three.
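For readers who want a feel for how such a limit works, here is a minimal Python sketch of the general technique: a counting semaphore with three slots. It is not the code shipped in the templates. The function names are hypothetical, the sleep stands in for a real Office 365 session, and scripts that run as separate processes would need a cross-process lock rather than an in-process semaphore.

```python
import threading
import time

MAX_CONNECTIONS = 3  # the limit described in the post
_slots = threading.BoundedSemaphore(MAX_CONNECTIONS)

# Bookkeeping only, so the sketch can show the limit is respected.
_active = 0
peak_active = 0
_counter_lock = threading.Lock()


def run_component_script(name, work_seconds=0.05):
    """Hypothetical stand-in for one script that connects to Office 365."""
    global _active, peak_active
    with _slots:  # blocks until one of the three slots is free
        with _counter_lock:
            _active += 1
            peak_active = max(peak_active, _active)
        time.sleep(work_seconds)  # placeholder for the real remote session
        with _counter_lock:
            _active -= 1
    return name


def run_all(names):
    """Start every script at once and return the peak number of concurrent sessions."""
    threads = [threading.Thread(target=run_component_script, args=(n,)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak_active
```

Even if eight scripts start together, at most three hold a slot at any moment; the rest wait instead of being rejected.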

 

Oracle RAC:

Next up, following on from the previous Oracle template updates, we are also releasing a new template for Oracle RAC, which you can download and read more about at https://thwack.solarwinds.com/docs/DOC-203744

 

The list of metrics available for monitoring includes:

  • Average MTS response time
  • Average MTS wait time
  • Sort ratio
  • MTS UGA memory
  • Database file I/O reads
  • User locks
  • Locked users
  • Global cache service utilization
  • Global cache block lost
  • Global cache average block receive time
  • Long queries elapsed time
  • Redo logs contentions
  • Active users
  • Buffer cache hit ratio
  • Dictionary cache hit ratio
  • Average enqueue timeouts
  • Global cache block access latency
  • Nodes down
  • Long queries count
  • Database file I/O write operation
  • Global cache corrupt blocks

 

 

The thing to keep in mind about this template is, just like our other Oracle templates, it requires some prerequisites be set up on the Orion Server and/or Poller for it to work.

 

 

 

 

Citrix XenServer:

The third template we are releasing is for XenServer, which you can download and read more about here – https://thwack.solarwinds.com/docs/DOC-203745

 

The template monitors the host as well as the guest VMs running on that host, including the following metrics:

 

  • Host - Free Memory
  • Host - Average CPU
  • Host - Control Domain Load
  • Host - Reclaimed Memory
  • Host - Potential Reclaimed Memory
  • Host - Total Memory
  • Host - Total NIC Receive
  • Host - Total NIC Send
  • Host - Agent Memory Allocation
  • Host - Agent Memory Usage
  • Host - Agent Memory Free
  • Host - Agent Memory Live
  • Host - Physical Interface Receive
  • Host - Physical Interface Sent
  • Host - Physical Interface Receive Error
  • Host - Physical Interface Send Error
  • Host - Storage Repository Cache Size
  • Host - Storage Repository Cache Hits
  • Host - Storage Repository Cache Misses
  • Host - Storage Repository Inflight Requests
  • Host - Storage Repository Read Throughput
  • Host - Storage Repository Write Throughput
  • Host - Storage Repository Total Throughput
  • Host - Storage Repository Write IOPS
  • Host - Storage Repository Read IOPS
  • Host - Storage Repository Total IOPS
  • Host - Storage Repository I/O Wait
  • Host - Storage Repository Read Latency
  • Host - Storage Repository Write Latency
  • Host - Storage Repository Total Latency
  • Host - CPU C State
  • Host - CPU P State
  • Host - CPU Utilization
  • Host - HA Statefile Latency
  • Host - Tapdisks_in_low_memory_mode
  • Host - Storage Repository Write
  • Host - Storage Repository Read
  • Host - Xapi Open FDS
  • Host - Pool Task Count
  • Host - Pool Session Count
  • VM - CPU Utilization
  • VM - Total Memory
  • VM - Memory Target
  • VM - Free Memory
  • VM - vCPUs Full Run
  • VM - vCPUs Full Contention
  • VM - vCPUs Concurrency Hazard
  • VM - vCPUs Idle
  • VM - vCPUs Partial Run
  • VM - vCPUs Partial Contention
  • VM - Disk Write
  • VM - Disk Read
  • VM - Disk Write Latency
  • VM - Disk Read Latency
  • VM - Disk Read IOPs
  • VM - Disk Write IOPs
  • VM - Disk Total IOPs
  • VM - Disk IO Wait
  • VM - Disk Inflight Requests
  • VM - Disk IO Throughput Total
  • VM - Disk IO Throughput Write
  • VM - Disk IO Throughput Read
  • VM - VIF Receive
  • VM - VIF Send
  • VM - VIF Receive Errors
  • VM - VIF Send Errors

 

 

Citrix PVS Accelerator for XenServer

Last but not least, we added a net-new template for Citrix PVS Accelerator for XenServer, which you can read more about and download here - https://thwack.solarwinds.com/docs/DOC-203773

 

The template includes the following metrics available for collection:

 

  • PVS - Accelerator Eviction Rate
  • PVS - Accelerator Hit Rate
  • PVS - Accelerator Miss Rate
  • PVS - Accelerator Traffic Clients Sent
  • PVS - Accelerator Traffic Servers Sent
  • PVS - Accelerator Read Rate
  • PVS - Accelerator Saved Network Traffic
  • PVS - Accelerator Space Utilization

 

That’s it for this round of content updates! We have more in process and will post to let you all know as soon as they are ready. As always, you can suggest new templates or features for SAM by creating a Feature Request.

 

Network Configuration Manager (NCM) v7.9 is available today on the customer portal! For a broad overview of this release, the release notes are a great place to start. This is a particularly pleasing release as we are delivering a feature that has received over 470 votes: Multi-Device Baselines.

 

What are Configuration Baselines?

Baselines are often attached to the act of measuring and rating the performance of a given object (interface, device, or similar) in real time. In configuration management terms, baselines are used to provide a framework for change control and management. The configuration baselines measure and evaluate the content set within the config and indicate whether the content is aligned to the baseline or not.      

 

Given that configuration changes over time are more difficult to directly observe and more complex to manage, baselines play a role in monitoring and preventing unwanted changes. I find that this definition of baselines from Techopedia is interesting and accurate:

“It is the center of an effective configuration management program whose purpose is to give a definite basis for change control in a project by controlling various configuration items like work, features, product performance and other measurable configuration.”

 

This means that manual monitoring may be possible for a small number of nodes, but it is neither practical nor reasonable to scale this type of manual monitoring framework. Actively monitoring each device’s config makes the validation of consistency and alignment to corporate or regulatory requirements reliable and possible.

 

Baselines

The great news is that NCM already helps with mitigating the challenges related to monitoring configuration drift by providing config change reports, Real Time Change Detection, rules and policies that monitor configurations based on a set of user-defined conditions, and one-to-one configuration baselining. What we implemented in the latest version of NCM extends and improves configuration baselines to include:

  1. Creating new baseline(s) through
    1. Promoting an existing config to be a baseline, or
    2. Creating a new baseline by copy/paste or loading a file
  2. Ignoring unnecessary configuration lines (or lines unique to each device)
  3. Applying baseline(s) to a single node or multiple nodes

 

<New!> Baseline Management

In this release, there is a new list view of all baselines that have been created or migrated from an upgrade. From this new page, users can create new baselines, edit existing, apply or remove nodes for a given baseline, enable or disable a baseline, update the status of the baseline, or delete a baseline.

 

<New!> Updated Diff Viewer

A major improvement in this release is the implementation of a new diff viewer for baselines. This new diff viewer will collapse lines that are unchanged, highlight ignored lines as gray, and mark all changes as yellow.

 

 

More Ways to Create a Baseline

The process of creating baselines should be easy—take an existing config and simply apply it against a set of nodes, right? In NCM, you can do just that by promoting an existing configuration, loading a config from file, or copying and pasting.

 

Promoting a config is now nested under the node and in the baseline column:

 

Creating a new baseline can be done through the new Baseline Management Page:

 

No matter the steps to create the baseline, each will ultimately lead to applying the baseline to the nodes and configs.

 

Ignoring Extraneous Config Lines

One of the key challenges with baselines is being able to get an accurate assessment of the config and not having false positives for config lines that are unique to a node or not relevant to the baseline. In NCM v7.9, we have introduced an ignore line capability that allows users to click through lines that are not relevant to the baseline to aid in reducing false positives. To read more on this, check out this link.
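To make the idea concrete, here is a small Python sketch of comparing a config against a baseline while skipping ignored lines. It illustrates the concept only and says nothing about how NCM implements it: the function name, the regex-based ignore list, and the sample config lines are all assumptions made for this example.

```python
import difflib
import re


def baseline_violations(baseline_lines, config_lines, ignore_patterns=()):
    """Return (kind, line) pairs where the config deviates from the baseline.

    Lines matching any ignore pattern are dropped from both sides first,
    so device-specific lines cannot produce false positives.
    """
    ignored = [re.compile(p) for p in ignore_patterns]

    def keep(line):
        return not any(rx.search(line) for rx in ignored)

    base = [line for line in baseline_lines if keep(line)]
    conf = [line for line in config_lines if keep(line)]

    changes = []
    matcher = difflib.SequenceMatcher(a=base, b=conf, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            changes += [("missing", line) for line in base[i1:i2]]
        if tag in ("replace", "insert"):
            changes += [("unexpected", line) for line in conf[j1:j2]]
    return changes


# Hypothetical sample data: the hostname is unique per device.
baseline = ["hostname CORE-SW", "ntp server 10.0.0.1", "snmp-server community public RO"]
config = ["hostname EDGE-SW-07", "ntp server 10.0.0.1", "snmp-server community private RO"]

violations = baseline_violations(baseline, config, ignore_patterns=[r"^hostname "])
```

With the hostname line ignored, only the SNMP community difference is reported. Without the ignore pattern, the hostname would show up as a mismatch on every device the baseline is applied to.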

 

Baseline Status Indicators

To monitor whether or not a node (config) is in compliance with a baseline or baselines, there needs to be a visual and written indication. Baseline Management, Configuration Management, and the ‘Baseline vs. Config Conflicts’ report now all have visual and written indicators. On the Configuration Management page, there is a new baseline column that contains the visual and written indication of whether or not that node is in alignment with the baselines applied.

 

For each status, there is a hover that provides a list of all the baselines and their associated status for that node.

 

The new Baseline Management view provides a complete list view of all baselines that have been created. This view is meant to show the alignment of all the nodes that are applied against a single baseline.

 

Each baseline can be expanded to show the status for different nodes to which it is applied (similar to the hover for Configuration Management). Each one of the statuses is clickable and will load the diff of that baseline vs. the config selected.

 

Lastly, the “Baseline vs. Config Conflicts” report also inherits the visual indicators and now shows the status of a node to one or many baselines.

 

This is a major step forward for baselines and the monitoring of configuration drift within NCM. Of course, please be sure to create new feature requests for any additional functionality you would like to see with baselines or NCM in general.

 

Helpful Links:

NCM v7.9 Release Notes

NCM Support Documentation

Network Configuration Management Software

Network Performance Monitor (NPM) 12.4 and the Orion Platform 2018.4 are now generally available in your customer portal. For those of you subscribing to the updates in What We're Working on for NPM (Updated June 1st, 2018), you may have noticed a line item called "Centralized Upgrades." This update will give you the first chance to experience Centralized Upgrades in your environment.

 

Great news: this upgrade is going to be easier than ever!

 

 

Planning for Your Upgrade to 2018.4

 

Read the release notes and minimum system requirements prior to installation as you may be required to migrate to new server or database infrastructure. For quick reference, I have provided a consolidated list of release notes below.

Note: Customers running Windows Server 2012, 2012 R2, and SQL 2012 will be unable to upgrade to these latest releases prior to migrating to a newer Windows operating system or SQL database version. Check for the recommended Microsoft upgrade path through the upgrade center.

 

See more information about why these infrastructures are deprecated here: Preparing Your Upgrade to Orion Platform 2018.4 and Beyond - Deprecation & Other Important Items

 

SolarWinds strongly recommends that you update to Windows Server 2016 or higher and SQL Server 2016 or higher at your earliest convenience. 

 

 

 

 

 

Refresh your upgrade knowledge with the following upgrade planning references.

 

 

Always back up your database and if possible take a snapshot of your Orion environment.

 

 

Start Your Upgrade on the Main Polling Engine

 

Download any one of the latest release installers to your main polling engine.

 

For the screenshots that follow I'm upgrading my Orion deployment with the following setup:

  • Main Polling Engine is installed with Virtualization Manager (VMAN) 8.3 and will be upgraded to VMAN 8.3.1
    • Utilizes a SQL 2016 database
  • Three scalability engines
    • One Free Additional Polling Engine for VMAN on Windows 2012
    • One Free Additional Polling Engine for VMAN on Windows 2016
    • One HA Backup on Windows 2016

 

My first screen confirms my upgrade path to go from 8.3 to 8.3.1.

  • If I'm out of maintenance for a specific product, I would see indicators here first on the screen. Being out of active maintenance will prevent you from upgrading this installation to the latest, so please pay attention to the messaging here.
  • The SolarWinds installer will upgrade all of the products on this server to the versions of product that are compatible with this version of the Orion Platform for optimal stability. This may mean that you'll be upgrading more than just one product.
  • When in doubt, feel free to run the installer to see the upgrade path provided, so you can plan for your downtime. Cancelling out at the pre-flight check stage will give you all the information needed to plan ahead, without surprises and without changes to your environment.  This information can also be used for your change request before scheduling downtime for your organization.

The second step will run pre-flight checks to see if anything would prevent my upgrade from being successful on the main polling engine.

  • In case there are no blocking, warning, or informational pre-flight checks, we will proceed straight to the next step, accepting the EULA.
    • My main polling engine server and DB meet all infrastructure system requirements for the 2018.4 Orion Platform, so I am not shown any blocking pre-flight checks at this stage.
  • Pre-flight checks can block you from moving forward with your installation
    • You may need to confirm whether you meet new infrastructure requirements (e.g. NTA 4.2.3 -> 4.4 upgrade) to proceed. Blockers will prevent you from successfully installing or upgrading, so the installer will not allow you to proceed until those issues have been resolved.
    • Warning pre-flight checks give you important information that could affect the functionality of your install after upgrade but will not prevent you from successfully installing or upgrading.
    • Informational pre-flight checks give you helpful troubleshooting information for "what if" scenarios, in case we don't have enough information to determine whether this would be an active issue for your installation.
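The three kinds of pre-flight checks can be summarized in a short Python sketch. This models only the behavior described in the list above; the class names, check names, and the function are hypothetical and are not part of the SolarWinds installer.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    BLOCKER = "blocker"        # must be resolved before the installer continues
    WARNING = "warning"        # may affect functionality, does not stop the upgrade
    INFO = "informational"     # troubleshooting hints for "what if" scenarios


@dataclass(frozen=True)
class PreflightResult:
    name: str                  # hypothetical check name, for illustration only
    severity: Severity


def can_proceed(results):
    """An upgrade may continue unless at least one check is a blocker."""
    return not any(r.severity is Severity.BLOCKER for r in results)


# Hypothetical results for two servers.
healthy = [PreflightResult("Disk space is getting low", Severity.WARNING)]
outdated = healthy + [PreflightResult("Operating system not supported", Severity.BLOCKER)]
```

An empty result list also lets the upgrade proceed, which matches the case above where no checks are reported and the installer moves straight to the EULA.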

 

The online installer will start to download all installers needed from the internet.

  • SolarWinds recommends that you use the online installer because it will be able to auto-update and download exactly what's needed for the installation. Not only is it more efficient, but it will save you from downloading unnecessary or outdated bits.

 

This screen gives you an overview of next steps. The Configuration wizard will launch next, to allow you to configure database settings and website settings.

In this release, all scalability engines, including Additional Polling Engines, Additional Websites and HA Backup Servers, can be upgraded in parallel manually, using the scalability engine installer. Manual upgrades are still supported, but if you have scalability engines, please try our centralized upgrade workflow to save you time.

 

Follow the configuration wizard steps to completion. If you only have a main polling engine to upgrade, your installation is now complete. Log in to your SolarWinds deployment and enjoy the new features that have been built with care for your use cases.

 

Centralized Upgrades of the Scalability Engines

For those customers who have chosen to scale out their environment using scalability engines, such as Additional Polling Engines, HA Backup Servers, or Additional Websites, this is the section for you.

 

If you kept the "Launch Orion Web Console" checkbox checked in the final step of the Configuration Wizard, the launched web browser session will navigate you directly to the Updates Available page, where you can continue with the Centralized Upgrade workflow. If you want to open a new web browser session on a different system, you can quickly navigate to where you want to go by following these steps.

 

Launch the web browser and log in.

Navigate to 'My Orion Deployment' from the Settings drop-down.

 

Click the UPDATES AVAILABLE tab. If this tab is not showing, that means there are no updates available for you to deploy.

Click Start to begin the process of connecting to your scalability engines.

My environment is not experiencing any issues connecting to my scalability engines.

Bookmark the page "Connection problems during an Orion Deployment upgrade" (SolarWinds Worldwide, LLC. Help and Support) for future guidance on common "gotcha" scenarios and how to handle them.

After contact with the scalability engines has been established, pre-flight checks will be run against all scalability engines.

Looking at my pre-flight checks you can see that one server PRODMGMT-49 has a blocker that would prevent upgrades from occurring, mainly that it does not meet infrastructure requirements for this version of the Orion Platform.

However, my "Start Upgrade" is enabled. This is because if at least one scalability engine is eligible for upgrade, we will allow you to proceed. Only when none of the scalability engines are eligible will this button be disabled. Pay attention to servers that have blocking pre-flight checks, as you will have to manually upgrade them or move items being monitored via this scalability engine to one that is upgraded.
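The rule for when the button is enabled is simple enough to express in a few lines of Python. This is only a sketch of the behavior described above. PRODMGMT-49 is the server from my environment; the other server names and the blocker text are hypothetical.

```python
def eligible_engines(blockers_by_server):
    """Servers with no blocking pre-flight checks can be centrally upgraded."""
    return [server for server, blockers in blockers_by_server.items() if not blockers]


def start_upgrade_enabled(blockers_by_server):
    """The button is enabled as soon as at least one engine is eligible."""
    return len(eligible_engines(blockers_by_server)) > 0


# Blocking pre-flight checks per scalability engine (illustrative data).
deployment = {
    "PRODMGMT-49": ["Infrastructure requirements not met"],
    "APE-2016": [],
    "HA-BACKUP": [],
}

to_upgrade = eligible_engines(deployment)
```

Any server left out of the eligible list still needs a manual plan, as noted above.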

 

Clicking "Start Upgrade" begins the centralized upgrade process, first by downloading all the necessary bits to all the scalability engines in parallel. Notice how my scalability engine that was on incompatible 2012 infrastructure is not being upgraded.

Grab a coffee as the rest of your installation and configuration happens silently on each of the servers being centrally upgraded.

Oh no, an error occurred. What can you do at this point?

  • Click Retry download after troubleshooting (e.g. did the scalability engine lose connectivity to the main polling engine?)
  • RDP directly into the server using the convenient RDP link that is provided

 

Common scenarios to investigate:

  • Is this scalability engine set up inconsistently from the other servers? For instance, you may have Engineer's Toolset on the Web installed on this server and not on the others.
  • Do some of the installed products have dependencies on .NET 3.5? Engineer's Toolset on the Web has a dependency on .NET 3.5 to be able to upgrade. Ensure that you have enabled .NET 3.5 and try again.
  • Check the Customer Success Center for more scenarios to help while troubleshooting.

 

In my case, I clicked Retry and was able to get past the issue.

My upgrade is complete! Congratulations on an upgrade well done.

Click Finish to complete your Centralized Upgrade session.

 

Gotchas - What to do with Unreachable Servers

If your server isn't being blocked because of incompatible infrastructure, you have an opportunity to manually upgrade that server in parallel while the rest of your environment is being centrally upgraded.

 

In the installation example captured below, if I were to run the installer on the Additional Website that is currently being upgraded by Centralized Upgrades, I would be blocked from running the installer on that server. However, for the listed unreachable Additional Website, I can run that upgrade manually in parallel with no problem.

If you're blocked from proceeding on a manual upgrade, you would see the following. Only after you have finished the Centralized Upgrade process will you be allowed to proceed with a manual upgrade that is blocked in this fashion. For these scenarios, simply navigate to My Orion Deployment and exit out of the deployment wizard flow to cancel the centralized upgrade session.

 

Manual Upgrades

Manual upgrades of your deployment are still supported. If you have only one scalability engine, Centralized Upgrades may not be the fastest way to upgrade. However, if you have more, it is. This upgrade is still beneficial for those considering using manual upgrades for their deployment, and the reason is the installation and configuration wizard process can now be run in parallel. Existing customers have always known that there were some scenarios where you could run the configuration wizard in parallel across servers (e.g. same server type) and some that you could not. It took time and training to understand what scenarios those were. In this release, that limitation is lifted, and all server types can be configured in parallel.

 

There are times where you may need to consider falling back to manual upgrades in combination with your Centralized Upgrade. As an example, take this installation: two have completed, one has the configuration wizard in process.

If the download, installation, or configuration is taking a long time for one of your scalability engines, and you need to see more information that is only available in the client, you may consider canceling out of the Centralized Upgrade session to resume the rest of your upgrade manually. The servers that have been upgraded thus far will remain in a good spot, so you can cancel out with confidence. Proceed with this option carefully, as you will want to ensure that you have upgraded everything by the end of your scheduled downtime.

Check the My Orion Deployment page to ensure that all the servers in your Orion deployment are upgraded.

 

Support

We have all been there: despite all the best intentions and all the preparation in the world, something went wrong. No worries! File a support ticket (Submit a Ticket | SolarWinds Customer Portal) and start gathering diagnostics via our new web-based, centralized diagnostics.

 

Click the Diagnostics tab, select all the servers in your deployment, and click "Collect Diagnostics."

Sit back and relax as your diagnostics are centrally gathered in preparation for your support call.

 

Customer Experience

 

Early adopters and those who have participated in our release candidates have already begun to enjoy the benefits of centralized upgrades. Check out our THWACK forums for testimonials from customers just like you as they experience the new and improved "Easy Button" upgrade experience. Here's a link to one from one of our very own THWACK MVPs: The "Easy Button" has arrived with the December 2018 install of NAM (and other Solarwinds modules). If you'd like to share your upgrade experience with me, I'm very interested, and we'd love to see screenshots and your feedback on this new way to upgrade your SolarWinds deployment.

 

More centralized upgrade success - Success with Centralized Upgrades

IPAM 4.8 has arrived and is now generally available! You can find this latest release in your Customer Portal. In recent releases, we’ve brought you integration with VMware vRealize Automation and Orchestrator and monitoring support for Amazon Web Services (AWS) Route 53 and Azure DNS. In this release, we have extended our support (yet again) to additional platforms and bring you these goodies:

 

Monitoring Support for Infoblox

You asked for it, you got it! This is our #1 integration feature request on THWACK®, and I’ve spoken to many of you at tech conferences about wanting us to monitor your Infoblox DHCP and DNS environments. IPAM provides valuable resources, alerting, and reporting capabilities without having to purchase add-ons, as well as a centralized management console across heterogeneous environments.


 

Migration to Core Custom Properties
We have migrated from product-specific custom fields to the unified custom properties designed to be simple and powerful for you to use with other Orion® Platform products. Now you can add new custom properties the same way you would for other modules and use them for IPAM entities in Reports and Alerts.

Support for More Linux Versions
We have extended DHCP and DNS support to the following Linux distributions:

    • Ubuntu 14.04
    • Ubuntu 16.04
    • Debian 9.5
    • Debian 8.6 (DHCP only)

 

HELPFUL LINKS:

 

 

The SolarWinds trademarks, service marks, and logos are the exclusive property of SolarWinds Worldwide, LLC or its affiliates.  All other trademarks are the property of their respective owners.

  • IPAM IP address management software

We’re delighted to announce the release of version 4.5 of NetFlow Traffic Analyzer (NTA)!

 

The latest release of SolarWinds® NetFlow Traffic Analyzer is designed to help create alerts based on application flows. In past releases, we could alert on the overall utilization of an interface and provide a view of the top talkers when the configured threshold was exceeded. In this release, you can set a threshold on the volume of a specific application in order to trigger an alert. We're making use of the Orion Platform alerting framework, so that flexibility is available to you.

 

You’ve outlined a small set of critical problems in multiple requests, and in this release, we’re delivering on the five most popular of these.

 

  • Application traffic exceeds a threshold – Alert triggered when we observe a specific application rate exceeds a user-defined threshold
  • Application traffic falls below a threshold – Alert that can provide visibility when an application “goes off the air” and stops communicating
  • Application traffic appears in the “TopN” list of applications – This alert triggers when application traffic increases suddenly relative to other applications
  • Application traffic drops from the “TopN” list of applications – Likewise, alert triggers for a sudden reduction relative to other applications
  • Flow data stops from a configured flow source – Alerts on the loss of flow instrumentation, and prompts to take action to help restore visibility
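The two TopN triggers boil down to comparing the set of top applications between two evaluation intervals. Here is a minimal Python sketch of that comparison. It is not NTA's implementation; the function names, the tie-breaking rule, and the traffic figures are assumptions for illustration.

```python
def top_n(bytes_by_app, n):
    """Names of the n applications with the most traffic (ties broken by name)."""
    ranked = sorted(bytes_by_app.items(), key=lambda kv: (-kv[1], kv[0]))
    return {app for app, _ in ranked[:n]}


def top_n_changes(previous, current, n):
    """Applications that entered or left the TopN between two intervals."""
    before, after = top_n(previous, n), top_n(current, n)
    return {"appeared": after - before, "dropped": before - after}


# Hypothetical byte counts per application for two consecutive intervals.
previous = {"https": 900, "smb": 400, "dns": 50, "bittorrent": 10}
current = {"https": 850, "smb": 30, "dns": 60, "bittorrent": 700}

changes = top_n_changes(previous, current, n=2)
```

In this sample, bittorrent enters the top two and smb drops out, even though neither value is compared against an absolute threshold. That is what makes these triggers useful for changes relative to other applications.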

 

Contextual Alerting

The approach we're using to create alerts is built to guide users into a particular context—a source of flow where we see the application traffic—and then offers a simple user experience to create the alert.

To create an alert based upon any of these triggers, we must first select a source of flow data as a point of reference. We can do this in one of two ways.

 

We can visit the NTA Summary Page, and navigate to a particular source of flow data:

 

 

If the application of interest is in the TopN, we can expand it to see where this application is visible and select that source. That will take us to a detail page, which is already filtered by both application and source of the flow data.

 

We can also select our source of flow data directly in the Flow Navigator. We can build our alert based upon a node that reports flow, or upon a specific interface:

 

 

Once we have a context for an alert, we can select an application. If we use the "TopN Applications" resource, we have already identified both the application and the node or interface where it's visible.

Another way to arrive at this context can make use of the Flow Navigator, where we can explicitly select the application we’re interested in:

 

 

 

We can select either Applications, or NBAR2 Applications, to help describe the traffic. With the context now fully described, we are able to open the "Create a Flow Alert" panel and create our first alert:

 

 

At the top of the panel, we'll see the source of the flow data that we'll evaluate, and a default alert name prefix. We can customize the alert name to help make searching simpler. The severity of the alert is configurable:

 

For the Trigger Condition, we'll select one of the options described above. In this case, we'll select "Application Traffic exceeds Threshold," and we'll set a threshold of 50 MBps on the ingress. We'll evaluate the last five minutes of traffic; this is configurable. The alert will trigger when our traffic rate averages greater than 50 MBps over the five-minute period.

 

Finally, we can specify one or several protocols; if we specify more than one, we'll sum the traffic volumes for all the protocols.
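The arithmetic behind this trigger can be sketched in Python as follows. This is an illustration of the evaluation described above, not the query NTA generates. The function names and sample volumes are hypothetical, and reading "MBps" as megabytes per second is an assumption.

```python
def average_rate(bytes_by_protocol, protocols, window_seconds=300):
    """Average bytes per second over the window, summed across the chosen protocols."""
    total_bytes = sum(bytes_by_protocol.get(p, 0) for p in protocols)
    return total_bytes / window_seconds


def exceeds_threshold(bytes_by_protocol, protocols, threshold_bytes_per_s, window_seconds=300):
    """True when the averaged rate is strictly greater than the threshold."""
    return average_rate(bytes_by_protocol, protocols, window_seconds) > threshold_bytes_per_s


# Hypothetical ingress volume seen during the last five minutes (300 seconds).
window = {"tcp": 12_000_000_000, "udp": 6_000_000_000}
THRESHOLD = 50_000_000  # 50 MBps, assuming megabytes per second

tcp_only = exceeds_threshold(window, ["tcp"], THRESHOLD)            # 40 MB/s average
tcp_and_udp = exceeds_threshold(window, ["tcp", "udp"], THRESHOLD)  # 60 MB/s average
```

Because the rate is averaged over the whole window, a short burst above 50 MBps will not trigger the alert on its own, and adding a second protocol can push the summed rate over the threshold.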

 

To create the alert, there are two options. We can select "Create Alert" immediately, and this will simply log the alert when it triggers. Or, we can check the box to open the alert in the Advanced Alert Editor and then select "Create Alert." Selecting this option will redirect us to the last step in the "Add New Alert" wizard, where we can modify the trigger actions, reset actions, or time-of-day schedule.

 

 

The trigger condition is an advanced SWQL query, pre-populated with the contextual information on the source and application.

 

Before submitting this new alert, we'll see a message indicating whether the alert will trigger immediately.

 

Practical Alert Scenarios

Use the "exceeds threshold" alert to detect application traffic levels that average above or below the specified threshold.

Use the ">" (greater than) or "<=" (less than or equal to) operator to determine whether the alert triggers above or below the threshold. For example:

  • To determine when backup application traffic is running out of schedule
  • To identify large file transfers in the middle of the day
  • To identify DDOS attacks, or when Port 0 traffic is present at all

Use the "<=" operator with the “exceeds threshold” alert to help detect when an application server process goes offline and stops sending traffic.

  • The application service may have crashed
  • An intermediate connectivity problem (firewall or outage) may have reduced traffic

Alerts related to applications appearing in, or dropping out of, the TopN can be useful for detecting sudden changes in traffic volume relative to other applications. Examples include:

  • Detecting streaming or peer-to-peer file sharing applications that are transient
  • Detecting changes in the mix of applications that usually traverse an interface

 

You can also set up an alert for each of your NetFlow sources to help take action if the configuration is modified, or firewall rules block flow traffic.

 

User Experience Improvements

This release of NTA also includes a number of small but significant improvements in the user interface to help enhance scalability and improve ease of use. Several long lists are now uniformly ordered, and we’ve changed how we label certain features to be clearer in the navigation.

 

Additional Resources

Check out the Release Notes, download the new release on the Customer Portal, and get additional help with the upgrade at the Success Center.

 

You can see these new features in action in the webcast, “Up, Down, and Gone: A Tale of Applications and Flow.”

 

This is an initial introduction of the traffic alerting feature. Be sure to enter additional feature requests and expanded functionality that you'd like to see with this capability!

 

jreves

NPM 12.4 is available today, December 4, on the Customer Portal! The release notes are a great place to get a broad overview of everything in the release. Here, I'd like to go into greater depth on the brand-new Cisco ACI support. Let’s talk a bit about how software-defined networks are different than traditional networks, what that means for monitoring, and how to get the most out of the new ACI monitoring feature.

 

What is SDN?

 

The first time I heard the term Software Defined Network, I thought it was stupid. All networks are defined by software. Software moves packets and frames, or programs the hardware that does it. Software is used to manually configure networks via CLI. Software is used to automatically configure networks with protocols like OSPF, STP, and LLDP. Networks were already software-defined!

 

Whether SDN is a good name or not, it is an important concept. There are a lot of people trying to define SDN, usually with some ulterior motive of placing themselves in a favorable position. For a slightly less biased view, check out the Wikipedia definition. The thing that stands out to me is:

SDN suggests to centralize network intelligence in one network component by disassociating the forwarding process of network packets (data plane) from the routing process (control plane).

 

This is a big change. In an SDN environment, network devices like routers and switches become simple devices that just move traffic at a high rate. All the intelligence is in a separate device called the controller. The controller learns how everything is connected, what connectivity applications need, and writes instructions to all of the network devices so they know how to forward traffic.

 

There are a ton of SDN solutions available today. The two most popular commercial solutions seem to be Cisco ACI and VMware NSX. Cisco ACI is more commonly requested by our customers (see NPM Monitor Cisco ACI and Support of a Cisco ACI networks in Network Performance Monitor compared to Vmware NSX Support), so we’ve built support for it first.

 

How Do I Monitor SDN?

 

An SDN fabric consists of a data plane and a control plane. The data plane is comprised of physical devices (Nexus switches, in the case of Cisco ACI) and cabling. The control plane is comprised of many logical components that fit together to define what endpoints are allowed to send network traffic to each other. The modular nature of the configuration reminds me of Cisco’s MQC. To make sure your SDN environment is running well, you need to monitor both layers.

 

Data Plane (aka Underlay aka Infrastructure Layer)

 

AKA the boring stuff. This is not the glamorous part of SDN. It’s the stuff you’ve been doing for years: power supplies, fans, temperatures, CPU, RAM, and interface stats. The fact of the matter is, these things all need to function properly for your SDN environment to be performant and reliable.

 

The data plane for Cisco ACI environments is made up of the Cisco Nexus model line. Fortunately, NPM 12.3, the release before this one, introduced Network Insight for Nexus. This gave NPM better-than-ever support for this hardware.

 

It’s easy to set up. Navigate over to Settings (top menu bar) -> Manage Nodes -> Add Node. Add your spine switches and leaf switches as SNMP nodes. On the last step, make sure to check this box:

 

 

If you already have your switches in NPM, you can find the same checkbox when you edit a node.

 

You’ll be prompted for your CLI credentials. CLI is the only way some of this very important data is available, so that’s how NPM gets it. This will cover the basics like power supplies, fans, temperature sensors, CPU, RAM, and interface statistics, plus the advanced stuff like vPC. Those of you with NCM can also get access list version control and analysis. Those of you with NTA will get flow analysis. You can check all of that out on our demo site here.

 

Okay, let’s get to the new interesting stuff.

 

 

Control Plane (aka Overlay aka Control Layer)

 

In an SDN environment, the controller has all the intelligence. This has a big impact on monitoring. Instead of polling dozens or hundreds of devices that each have their own very narrow view of the network, we can poll the controller directly. It has to know where everything is or it couldn’t control it. This means we can learn a lot from monitoring it.

 

This part is also easy to set up. Navigate again to Settings (top menu bar) -> Manage Nodes -> Add Node. In a Cisco ACI environment, the controller is called an APIC. Add your controllers as SNMP nodes. At the bottom of the first screen you’ll see this checkbox:

 

 

Check it! If you’ve already got your APIC added, edit the node and you can find the same box to check.

 

Cisco strongly recommends each ACI fabric have three APICs. Since each APIC must be able to control the entire network if necessary, each APIC has a complete view of the network. Polling them all results in a lot of duplication of work and potentially duplicate alerts. You have a choice in how you approach monitoring of these devices:

  1. Add all three APICs to monitoring but enable API-based ACI polling (the checkbox) for only one controller.
    a. Pros: efficient for the APICs and efficient for NPM.
    b. Cons: if the controller you’re doing API-based polling on goes down, you’ll see the APIC is down, but you’ll lose visibility into the control plane until you fix it or enable API-based polling for another controller.
  2. Add all three APICs to monitoring and enable API-based ACI polling for all three controllers.
    a. Pros: control plane monitoring works, even if one or two of the three APICs go down.
    b. Cons: NPM has to poll the same data three times. APICs have to provide the same data three times. You will get duplicate alerts and reporting data unless you’re careful to write your alerts in consideration of the duplicate data. More on this in a future post.

 

Our recommendation is to do #1, but either way will work.
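
If you do go with the second option, the duplicate data problem looks roughly like this. The Python sketch below collapses readings from several APICs into one entry per distinguished name. The tuple layout, the APIC names, the scores, and the choice to keep the lowest score are all assumptions for illustration; this is not how NPM stores or alerts on the data.

```python
def dedupe_by_dn(readings):
    """Collapse (apic, dn, health_score) readings into one entry per DN.

    When several APICs report the same object, keep the lowest score so the
    worst reported state is the one that survives.
    """
    merged = {}
    for apic, dn, score in readings:
        if dn not in merged or score < merged[dn][1]:
            merged[dn] = (apic, score)
    return merged


# Hypothetical readings: three APICs reporting the same tenant, one reporting another.
readings = [
    ("apic1", "uni/tn-Tenant3", 95),
    ("apic2", "uni/tn-Tenant3", 95),
    ("apic3", "uni/tn-Tenant3", 80),
    ("apic1", "uni/tn-Tenant1", 100),
]

for dn, (apic, score) in dedupe_by_dn(readings).items():
    print(dn, score, "reported by", apic)
```

Keying on the distinguished name rather than the APIC is the important part: it is the one identifier every controller shares for the same object.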

 

The API-based polling runs over TLS. If you have a valid cert on your controllers, everything will add fine and you’ll be good to go. If you have a self-signed cert, you will receive a warning about it and you’ll have to accept the risk or replace it with a properly signed cert before proceeding. You do have a real cert on your APIC, right?

 

Once you complete the add node wizard, navigate on over to Node Details for one of your APICs with API-based polling enabled. You can click along with me right now on the Online Demo. On the left side, you’ll see two new views: Members and Map. Let’s look at Members first.

 

 

The Members view shows all of the logical components we have discovered. This includes Tenants, Application Profiles, and EndPoint Groups. It also includes the APIC’s view of the physical components: leaf switches and spine switches.

 

 

This uses the framework’s List View, which is a polished way to deal with large lists. You can do multilevel filtering on the left, as well as sort and search. The list contains the name of the component (example: Tenant3), the type of component (example: Tenant), and the distinguished name (example: uni/tn-Tenant3). On the right, we see the health score. Let’s talk about that.
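
As a quick aside on the distinguished name column: a name like uni/tn-Tenant3 is a slash-separated path whose segments carry a short type prefix. The Python sketch below splits one apart. The function is hypothetical, the longer example name is invented, and it deliberately ignores distinguished names that contain bracketed values with slashes inside them.

```python
def parse_dn(dn):
    """Split a distinguished name into (prefix, name) pairs, one per segment."""
    parts = []
    for segment in dn.split("/"):
        # Split on the first hyphen only, so names containing hyphens survive.
        prefix, sep, name = segment.partition("-")
        parts.append((prefix, name) if sep else (segment, ""))
    return parts


print(parse_dn("uni/tn-Tenant3"))
# Invented example of a deeper path: tenant, application profile, endpoint group.
print(parse_dn("uni/tn-Tenant3/ap-Shop/epg-Web"))
```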

 

Since the controller has visibility into all components and their relationships, for the first time, part of the network infrastructure is in a position to accurately assess its health. Cisco ACI does this by assigning a health score. The health score is an integer from 0 to 100, where 100 is perfectly healthy and less than 100... isn’t. The health score takes into consideration both parents and descendants in the ACI model. You can check out the exact formula here. Since health scores represent status, they’re polled at the status interval in NPM. As always, you can adjust this interval. All of this data is polled via Cortex, incidentally, our new polling framework that you previously saw powering PerfStack Real-Time Polling.

 

Health scores will be colored red, yellow, or green according to thresholds. There are thresholds on the APIC already for this that determine what color that score is in the APIC GUI. To stay consistent, NPM learns the thresholds from the APIC and applies those. If you customize the thresholds on the APIC, NPM will learn and apply the new threshold settings.
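
The coloring logic itself is simple. Here is a minimal Python sketch of mapping a score to a color; the function name and the default threshold values (90 and 60) are placeholders made up for this example, since the real values are whatever NPM learns from your APIC.

```python
def health_color(score, warning=90, critical=60):
    """Map a health score to a color using two placeholder thresholds."""
    if score < critical:
        return "red"
    if score < warning:
        return "yellow"
    return "green"


for score in (100, 75, 10):
    print(score, health_color(score))

# Customized thresholds change the color for the same score.
print(75, health_color(75, warning=70, critical=40))
```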

 

You can click on a health score to get the history in the PerfStack dashboard:

 

 

Thanks to this being in PerfStack, it’s easy to start correlating other metrics about the APIC, leaf switches, and spine switches. It gets more interesting when you start correlating to end node availability, latency, and other data NPM has. If you own other modules on the Orion Platform, you can correlate that data too; for example, application counters, database wait time, IOPS, logs, and all the rest. Seeing all this data normalized on the same shared timeline is powerful for troubleshooting. If a health score is in bad shape and you think the issue is on the controller, it’s time to log in to the APIC itself. The APIC can tell you what is causing the score to be what it is and has a bunch of additional ways to troubleshoot.

 

Returning to the sub-view menu on the left, let’s check out the Map tab.

 

When you first open the map, you’re only going to see the APIC in the center. To get more on the map, select the APIC. On the right side, the inspector panel will open. Here you can check the box next to related entities and press Add at the bottom to add them to the map. You can use this method to continue to spider through your ACI environment. This works well for creating a map of a small ACI environment or of a specific section of a larger ACI environment, like a tenant or an app. Once you’ve got a map you like, you can select to Save as a group in the top right. From that point forward, you can navigate to that group and press the Map tab to see the map again. Here’s an example of one I saved in my lab:

 

 

Pretty slick! One important note: the APIC GUI already has some capability to map an ACI environment. In talking to NPM users who run ACI environments, I frequently heard that they would like to grant read-only access via a common platform for folks who don’t have access to the APIC directly, like NOC engineers. This accomplishes that goal and lets you correlate and visualize with all of the other data currently available in Orion Maps.

 

Next Steps

 

To upgrade now, customers with NPM under active maintenance can head over to the Customer Portal and download NPM 12.4. Thanks to the improved Orion Installer, upgrade is faster than ever with centralized upgrade of additional polling engines. Once you’re installed, add those ACI nodes and reply here to let us know how it’s working for you!
