
Geek Speak

10 Posts authored by: ghostinthenet

Challenge

 

In previous weeks, we talked about application-aware monitoring, perspective monitoring and agent/responder meshes to get a decentralized view of how our network is functioning.

 

With our traditional network monitoring system (NMS), we have a device-level and interface-level view. That's becoming less and less true as modern software breaks the mould of tradition, but it's still the core of its functionality. Now we have added perspective monitoring and possibly some agent/responder monitoring to the mix. How do we correlate these so that we have a single source of meaningful information?

 

Maybe We Don't?

 

Describing the use of the phrase "Single Pane of Glass" (SPoG) in product presentations as "overused" is an understatement. The idea of bringing everything back to a single view has been the holy grail of product interface design for some time. This makes sense, as long as all of that information is relevant to what we need at the time. With our traditional NMS, that SPoG is usually the dashboard that tells us whether the network is operating at baseline levels or not.

 

Perspective monitoring and agent/responder meshes can gather a lot more data on what's going on in the network as a whole. We have the option of feeding all that directly into the NMS, but is that where we're going to get the best perspective?

 

Data Visualization

 

We're living in a world of big data. The more we get, the less likely it becomes that we will be able to consume it in a meaningful way. Historically, we have searched for the relevant information in our network and filtered out what isn't immediately relevant. Big data is teaching us that it's all relevant, at least when taken as a whole.

 

Enter log aggregators and data visualization systems. Most of the information that we're getting from our decentralized tools can be captured in such a way as to feed these systems. Instead of just feeding it into the NMS, we have the option of collecting all of this data into custom visuals. These can give us a single view not only of where the network is experiencing chronic problems, but of where we need to adjust our baselines.

 

Whether we're looking at Elastic Stack, Splunk, Tableau or other tools, the potential to capture the gestalt of our network's data and present it usefully is worthwhile.

 

Which?

 

What if there's something in all this that indicates unacceptable performance or a failure? Yes, that should raise alerts in our NMS.

 

This isn't an either/or thing. It's a complementary approach. There's no reason why the data from our various agents and probes can't feed both. Depending on the tool that's used, the information can even be forwarded directly from the visualizer, simplifying the collection process.
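
As a rough illustration of that dual feed, a probe script can ship each result to both places in one pass. This is only a sketch in Python; the Elasticsearch index, hostnames and field names are placeholders rather than a prescription for any particular product.

```python
import json
import logging
import logging.handlers
import time
import urllib.request

# Placeholder endpoints -- substitute your own collector addresses.
ELASTIC_URL = "http://elastic.example.net:9200/netprobe-results/_doc"
NMS_SYSLOG = ("nms.example.net", 514)

log = logging.getLogger("netprobe")
log.setLevel(logging.INFO)
log.addHandler(logging.handlers.SysLogHandler(address=NMS_SYSLOG))

def ship_result(result: dict) -> None:
    """Send one probe result to both the visualization stack and the NMS."""
    # 1. Index the full record for visualization (Elasticsearch document API).
    req = urllib.request.Request(
        ELASTIC_URL,
        data=json.dumps(result).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req, timeout=5)

    # 2. Relay a condensed line to the NMS over syslog so existing alerting still fires.
    log.info("probe=%s target=%s rtt_ms=%.1f loss_pct=%.1f",
             result["probe"], result["target"], result["rtt_ms"], result["loss_pct"])

if __name__ == "__main__":
    ship_result({"probe": "branch-01", "target": "crm.example.net",
                 "rtt_ms": 42.3, "loss_pct": 0.0, "ts": time.time()})
```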

 

The Whisper in the Wires

 

Depending on what we’re looking for, there’s more than one tool for the job. Traditionally, we’re observing the network for metrics that fall outside of our baselines, particularly when those have catastrophic impact to operations. This is essential for timely response to immediate problems. Moving forward, we also want a bird’s eye view of how our applications and links are behaving, which may require a more flexible tool.

 

Has anyone else looked at implementing data visualization tools to complement their NMS dashboards?

Last week, we talked about monitoring the network from different perspectives. By looking at how applications perform from different points in the network, we get an approximation of the users' experience. Unfortunately, most of those tools are short on the details surrounding why there's a problem or are limited in what they can test.

On one end of our monitoring spectrum, we have traditional device-level monitoring. This is going to tell us everything we need to know that is device-specific. On the other end, we have the application-level monitoring discussed in the last couple of weeks. Here, we're going to approximate a view of how the end users see their applications performing. The former gives us a hardware perspective and the latter gives us a user perspective. The perspective of the network as a whole still lies somewhere in between.

Using testing agents and responders on the network at varying levels can provide that intermediate view. They allow us to test against all manner of traffic, factoring in network latency and its variation (jitter).

Agents and Responders

Most enterprise network devices have built-in functions for initiating and responding to test traffic. These allow us to test and report on the latency of each link from the device itself. Cisco and Huawei have IP Service Level Agreement (SLA) processes. Juniper has Real-Time Performance Monitoring (RPM) and HPE has its Network Quality Analyzer (NQA) functions, just to list a few examples. Once configured, we can read the data from them via Simple Network Management Protocol (SNMP) and track their health from our favourite network monitoring console.
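
As a rough sketch of the collection side, something like the following Python (using the pysnmp library) could poll the latest round-trip time from a Cisco IP SLA entry. The device address, community string and entry number are placeholders, and the OID shown is taken from CISCO-RTTMON-MIB (rttMonLatestRttOperCompletionTime); confirm it against your platform's MIB before relying on it.

```python
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

# Assumed values for illustration only.
DEVICE = "192.0.2.1"
COMMUNITY = "public"
SLA_ENTRY = 10
LATEST_RTT_OID = f"1.3.6.1.4.1.9.9.42.1.2.10.1.1.{SLA_ENTRY}"  # verify against the MIB

def poll_ip_sla_rtt() -> int:
    """Fetch the most recent round-trip time (ms) reported by an IP SLA entry."""
    error_indication, error_status, _, var_binds = next(
        getCmd(SnmpEngine(),
               CommunityData(COMMUNITY),
               UdpTransportTarget((DEVICE, 161), timeout=2, retries=1),
               ContextData(),
               ObjectType(ObjectIdentity(LATEST_RTT_OID)))
    )
    if error_indication or error_status:
        raise RuntimeError(error_indication or error_status.prettyPrint())
    return int(var_binds[0][1])

if __name__ == "__main__":
    print(f"Latest RTT for SLA entry {SLA_ENTRY}: {poll_ip_sla_rtt()} ms")
```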

Should we be in the position of having an all-Cisco shop, we can have a look at SolarWinds' IP SLA Monitor and VoIP and Network Quality Manager products to simplify setting things up. Otherwise, we're looking at a more manual process if our vendor doesn't have something similar.

Levels

Observing test performance at different levels gives us reports of different granularity. By running tests at the organization, site and link levels, we can start with the big-picture metrics and work our way down to specific problems.

Organization

Most of these will be installed at the edge devices or close to them. They will perform edge-to-edge tests against a device at the destination organization or cloud hosting provider. There shouldn't be too many of these tests configured.

Site

Site-to-site tests will be configured close to the WAN links and will monitor overall connectivity between sites. The point of these tests is to give a general perspective on intersite traffic, so they shouldn't be installed directly on the WAN links. Depending on our organization, there could be none of these or a large number.

Link

Each network device has a test for each of its routed links to other network devices to measure latency. This is where the largest number of tests are configured, but is also where we are going to find the most detail.

Caveats

Agent and responder testing isn't passive. There's always the potential for unwanted problems caused by implementing the tests themselves.

Traffic

Agent and responder tests introduce traffic to the network for purposes of testing. While that traffic shouldn't be significant enough to cause impact, there's always the possibility that it will. We need to keep an eye on the interfaces and queues to be sure that there isn't any significant change.

Frequency and Impact

Running agents and responders on the network devices themselves is going to consume additional CPU cycles. Network devices as a whole are not known for having a lot of processing capacity, so the frequency of these tests may need to be adjusted to factor that in.

Processing Delay

Related to the previous paragraph, most networking devices aren't going to be performing these tests quickly. The results from these tests may require a bit of a "fudge factor" at the analysis stage to account for this.

The Whisper in the Wires

Having a mesh of agents and responders at the different levels can provide point-in-time analysis of latencies and soft failures throughout the network. But, it needs to be managed carefully to avoid having negative impacts to the network itself.

Thanks to Thwack MVP byrona for spurring some of my thinking on this topic.

Is anyone else building something along these lines?

Last week we talked about application-aware monitoring. Rather than placing our focus on the devices and interfaces, we discussed getting data that approximates our users' experiences. These users are going to be distributed around the organization at the very least. They may even be scattered around the Internet, depending on the scope of our application. We need to examine application performance from different perspectives to get a complete picture.

Any way we look at it, we're going to need active remote probes/agents to accomplish what we're looking for. Those should be programmable to emulate application behaviour, so that we can get the most relevant data. At the very least, having something that can measure basic network performance from any point on the network is necessary. There are a few options.

NetPath

Last week, I was invited to Tech Field Day 12 as a delegate and had the opportunity to sit in on the first session of Networking Field Day 13 as a guest. Coincidentally, SolarWinds was the first presenter. Even more coincidentally, they were showing off the NetPath feature of Network Performance Monitor (NPM) 12.

This product, while not yet fully programmable to emulate specific applications, provides detailed hop-by-hop analysis from any point at which an agent/probe can be placed. In addition, it maintains a performance history for those times when we get notification of a problem well after the fact. For those of you working with NPM 12, I'm going to recommend you have a very close look at NetPath as a beginning for this sort of monitoring.

One downside of the NetPath probes is the requirement to have a Windows Professional computer running at each agent location. This makes it a heavier and more costly option, but well worth it for the information that it provides. Hopefully, the SolarWinds folks will look into lightweight options for the probe side of NetPath in the future. We're only at 1.0, so there's a lot of room for growth and development.

Looking at lighter, though less full-featured options, we have a few. They're mostly roll-your-own solutions, but this adds flexibility at the cost of ease.

Lightweight VMs and ARM Appliances

If there's a little bit of room on a VM somewhere, that's enough space for a lightweight VM to be installed. Regular application performance probes can be run from these and report directly to a monitoring station via syslog or SNMP traps. These custom probes can even be controlled remotely by executing them via SSH.

In the absence of VM space, the same sort of thing can be run from a small ARM computer, like a Raspberry Pi. The probe device itself can even be powered by the on-board USB port of another networking device nearby.
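
To give a sense of how light these probes can be, here's a minimal Python sketch that times an application-level fetch and reports the result to the monitoring station over syslog. The target URL and collector address are placeholders; the script could be run from cron on the VM or Raspberry Pi, or triggered remotely over SSH.

```python
import logging
import logging.handlers
import time
import urllib.request

# Hypothetical target and collector -- adjust for your environment.
TARGET_URL = "https://intranet.example.net/login"
MONITOR_STATION = ("monitor.example.net", 514)

log = logging.getLogger("app-probe")
log.setLevel(logging.INFO)
log.addHandler(logging.handlers.SysLogHandler(address=MONITOR_STATION))

def probe_once() -> None:
    """Time one application-level fetch and report it upstream via syslog."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(TARGET_URL, timeout=10) as resp:
            status = resp.status
    except OSError as exc:
        log.warning("target=%s status=error detail=%s", TARGET_URL, exc)
        return
    elapsed_ms = (time.monotonic() - start) * 1000
    log.info("target=%s status=%d response_ms=%.1f", TARGET_URL, status, elapsed_ms)

if __name__ == "__main__":
    probe_once()  # schedule via cron, or invoke remotely over SSH, at the desired interval
```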

Going back to NetPath for a moment, one option for SolarWinds is to leverage Windows Embedded and/or Windows IoT as a lightweight option for NetPath probes. This is something I think would be worth having a look at.

On-device Containers

A few networking companies (Cisco's ISR 4K line, for example) have opened up the ability to run small custom VMs and containers on the device itself. This extends the availability of agents/probes to locations where there are no local compute resources available.

Built-in Router/Switch Functions

Thwack MVP byrona had a brilliant idea with his implementation of IP SLA in Cisco routers and having Orion collect the statistics, presumably via SNMP. This requires no additional hardware and minimal administrative overhead. Just set up the IP SLA process and read the statistics as they're generated.

The Whisper in the Wires

NetPath is looking like a very promising approach to monitoring from different points of view. For most other solutions, we're unfortunately still at the roll-your-own stage. Still, we're seeing some promising solutions on the horizon.

What are you doing to get a look at your application performance from around the network?

Over the next few posts, I'm going to explore some newer thinking in network monitoring. We start by designing centralized management stations and remote agents, but is this sufficient? We look at things from a device and interface perspective and establish baselines of how the network should operate. This works, but is something of a one-size-fits-all solution. Where do we look when we want more than this?

 

A Little Bit of SDN

 

Over the last few years, Software Defined Networking (SDN) has generated a lot of buzz. Ask 10 different people what it's all about and you'll likely get 10 different answers, none of which is really incorrect. It's a term whose definition tends to be too broad to be immediately useful. Still, SDN maintains mind share because the components people tend to associate with it are desirable. These include, amongst others, centralized management and/or control, programmable configuration management and application-aware networking. This last one is of most immediate interest for our current topic.

 

The network's performance as it relates to the key applications running on it is immediately relevant to the business. This isn't just the performance of the application across a single device. This is a look at the gestalt of the application's performance across the entire network. This allows detection of problems and degraded performance where they matter most. Drilling down to the specific devices and interfaces that have an impact can come later.

 

Network Tomography

 

Recently, I ran across a new term, or at least it was a new term to me: Network Tomography. This describes the gathering of a network's characteristics from endpoint data. The "tomography" part of the term comes from medical technologies like Magnetic Resonance Imaging (MRI), where the internal characteristics of an object are derived from the outside. Network tomography isn't really tomography, but the term conveys the meaning fairly well. The basic idea is to detect loss or delay over a path by using active probes from the endpoints and recording the results.

 

Monitoring loss and delay over a path is a beginning. Most application performance issues are going to be covered by this approach. We can track performance from many locations in the network and still report the results centrally, giving us a more complete picture of how the business is using the network.
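
For a flavour of what such an endpoint probe might look like, here's a small Python sketch that approximates path delay and loss by timing TCP connections from wherever it runs. It's an approximation only, standing in for proper active probes, and the target, port, counts and pacing are arbitrary choices for the example.

```python
import socket
import statistics
import time

def tcp_path_sample(host: str, port: int = 443, count: int = 10) -> dict:
    """Approximate path delay and loss from this endpoint by timing TCP connects.

    A sketch: TCP handshake time stands in for a dedicated probe, and a failed
    or timed-out connect is counted as loss.
    """
    rtts, lost = [], 0
    for _ in range(count):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=2):
                rtts.append((time.monotonic() - start) * 1000)
        except OSError:
            lost += 1
        time.sleep(0.5)  # pace the probes so we do not add meaningful load
    return {
        "target": host,
        "sent": count,
        "loss_pct": 100.0 * lost / count,
        "avg_ms": statistics.mean(rtts) if rtts else None,
        "jitter_ms": statistics.pstdev(rtts) if len(rtts) > 1 else 0.0,
    }

if __name__ == "__main__":
    print(tcp_path_sample("www.example.com"))
```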

 

Application Metrics

 

If we're going to look at network performance from the applications' view, we'll need a fingerprint of what those applications do. Many pieces of software use a single network access method, such as Hypertext Transfer Protocol (HTTP), and can be tracked easily. Others use many different data access methods and will need a more complex definition. Either way, we need to monitor all of the components in order to recognize where the problems lie. Getting these details may be as simple as speaking to the vendor, or it may require some packet analysis. Regardless, if we have problems with one aspect of an application's communications but not another, the experience is still sub-par and we may not know why.
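
As a simple illustration of a fingerprint in practice, a check like the following Python sketch times every component an application depends on, so a slow database doesn't hide behind a healthy web front end. The component names and endpoints are invented for the example; real definitions would come from the vendor or from packet analysis.

```python
import socket
import time

# Hypothetical application fingerprint: each component the application depends on,
# expressed as a reachable TCP endpoint.
APP_COMPONENTS = {
    "web-frontend": ("app.example.net", 443),
    "api": ("api.example.net", 8443),
    "database": ("db.example.net", 5432),
}

def check_components(components=APP_COMPONENTS, timeout=3.0) -> dict:
    """Time a TCP connect to every component so no single slow piece goes unnoticed."""
    results = {}
    for name, (host, port) in components.items():
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[name] = round((time.monotonic() - start) * 1000, 1)
        except OSError:
            results[name] = None  # unreachable or timed out
    return results

if __name__ == "__main__":
    for component, ms in check_components().items():
        print(f"{component}: {'unreachable' if ms is None else f'{ms} ms'}")
```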

 

The Whisper in the Wires

 

We've all run into frustrated users claiming that "the network is slow" and have torn our hair out trying to find out the specifics of what that means. Ultimately, technical details aside, that is exactly what it means. We're just not used to looking at it that way.

 

What do we need to take this into practice? Smarter agents? Containers or VMs that allow us to either locally or remotely test application performance? How do we automate this? I'll be giving my own perspective over the next few weeks, but would like to hear your thoughts on it in the meantime.

A network monitor uses many mechanisms to get data about the status of its devices and interconnected links. In some cases, the monitoring station and its agents collect data from devices. In others, the devices themselves report information to the station and agents. Our monitoring strategy should use the right means to get the information we need with the least impact to the network's performance. Let's have a look at the tools available to us.

Pull Methods

We'll start with pull methods, where the network monitoring station and agents query devices for relevant information.

SNMP

SNMP, the Simple Network Management Protocol, has been at the core of query-based network monitoring since the early 1990s. It began with a simple method for accessing a standard body of information called a Management Information Base or MIB. Most equipment and software vendors have embraced and extended this body to varying degrees of consistency.

Further revisions (SNMPv2c, SNMPv2u and SNMPv3) came along in the late 1990s and early 2000s. These respectively added some bulk information retrieval functionality and improved privacy and security.

SNMP information is usually polled from the monitoring station at five-minute intervals. This allows the station to compile trend data for key device metrics. Some of these metrics will be central to device operation: CPU use, free memory, uptime, &c. Others will deal with the connected links: bandwidth usage, link errors and status, queue overflows, &c.

We need to be careful when setting up SNMP queries. Many networking devices don't have a lot of spare processor cycles for handling SNMP queries, so we should minimize the frequency and volume of retrieved information. Otherwise, we risk impact to network performance just by our active monitoring.

Query Scripts

SNMP is an older technology and the information that we can retrieve through it can be limited. When we need information that isn't available through a query, we need to resort to other options. Often, scripted access to the device's command-line interface (CLI) is the simplest method. Utilities like expect, or scripting languages like Python or Go, allow the necessary data to be extracted by filtering the CLI output.
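
As a hedged example of the approach, the following Python sketch uses the Netmiko library to pull a five-minute CPU average from an IOS-style CLI. The device details are placeholders, and the parsing assumes Cisco-style "show processes cpu" output; other platforms will need their own filters.

```python
import re
from netmiko import ConnectHandler

# Placeholder device details for illustration only.
DEVICE = {
    "device_type": "cisco_ios",
    "host": "192.0.2.10",
    "username": "monitor",
    "password": "REDACTED",
}

def get_cpu_five_minute() -> int:
    """Scrape the five-minute CPU average from the CLI."""
    with ConnectHandler(**DEVICE) as conn:
        output = conn.send_command("show processes cpu | include five")
    match = re.search(r"five minutes?: (\d+)%", output)
    if not match:
        raise ValueError(f"unexpected CLI output: {output!r}")
    return int(match.group(1))

if __name__ == "__main__":
    print(f"5-minute CPU average: {get_cpu_five_minute()}%")
```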

Like SNMP, we need to be careful about taxing the devices we're querying. CLI output is usually much more detailed than an SNMP query and requires more work on the part of the device to produce it.

Push Methods

Push methods are the second group of information gathering techniques. With these, the device is sending the information to the monitoring station or its agents without first being asked.

SNMP

SNMP has a basic push model where the device sends urgent information to the monitoring station and/or agents as events occur. These SNMP traps cover most changes in most categories that we want to know about right away. For the most part, they trigger on fixed occurrences: interface up/down, routing protocol peer connection status, device reboot, &c.

RMON

RMON, Remote Network MONitoring, was developed as an extension to SNMP. It puts more focus on the information flowing across the device than on the device itself and is most often used to define conditions under which an SNMP trap should be sent. Where SNMP will send a trap when a given event occurs, RMON can have more specific triggers based on more detailed information. If we're concerned about CPU spikes, for example, we can have RMON send a trap when CPU usage goes up too quickly.

Syslog

Most devices will send a log stream to a remote server for archiving and analysis. By tuning what is sent at the device level, operational details can be relayed to the monitoring station in near real time. The trick is to keep this information filtered at the transmitting device so that only the relevant information is sent.

Device Scripting

Some devices, particularly the Linux-based ones, can run scripts locally and send the output to the monitoring server via syslog or SNMP traps. Cisco devices, for example, can use the Tool Command Language (TCL) or Embedded Event Manager (EEM) applets to provide this function.

Your Angle

Which technologies are you considering for your network monitoring strategies? Are you sticking with the old tried and true methods and nothing more, or are you giving some thought to something new and exciting?

Getting Started

The network monitoring piece of network management can be a frightening proposition. Ensuring that we have the information we need is an important step, but it's only the first of many. There's a lot of information out there and the picking and choosing of it can be an exercise in frustration.

A Story

I remember the first time I installed an intrusion detection system (IDS) on a network. I had the usual expectations of a first-time user. I would begin with shining a spotlight on denizens of the seedier side of the network as they came to my front door. I would observe with an all-seeing eye and revel in my newfound awareness. They would attempt to access my precious network and I would smite their feeble attempts with.... Well, you get the idea.

 

It turns out there was a lot more to it than I expected and I had to reevaluate my position. Awareness without education doesn't help much. My education began when I realized that I had failed to trim down the signatures that the IDS was using. The floodgates had opened, and my logs were filling up with everything that had even a remote possibility of being a security problem. Data was flowing faster than I could make sense of it. I had the information I needed and a whole lot more, but no more understanding of my situation than I had before. I won't even get into how I felt once I considered that this data was only a single device's worth.

 

After a time, I learned to tune things so that I was only watching for the things I was most concerned about. This isn't an unusual scenario when we're just getting started with monitoring. It's our first jump into the pool and we often go straight to the deep end, not realizing how easy it is to get in over our heads. We only realize later on that we need to start with the important bits and work our way up.

The Reality

Most of us are facing the task of monitoring larger interconnected systems. We get data from many sources and attempt to divine meaning out of the deluge. Sometimes the importance of what we're receiving is obvious and relevant, e.g. a message with critical priority telling us that a device is almost out of memory. In other cases, the applicability of the information isn't as obvious. It just becomes useful material when we find out about a problem through other channels.

 

That obvious and relevant information is the best place to start. When the network is on the verge of a complete meltdown, those messages are almost always going to show up first. The trick is in getting them in time to do something about them.

Polling

Most network monitoring installations begin with polling devices for data. This may start with pinging the device to make sure it's accessible. Next comes testing the connections to the services on the device to make sure that none of them have failed. Querying the device's well-being with Simple Network Management Protocol (SNMP) usually accompanies this too. What do these have in common? The management station is asking the network devices, usually at five-minute intervals, how things are going. This is essential for collecting data for analysis and building a picture of how the network behaves when everything is running. For critical problems, something more is needed.
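
Before moving on, a bare-bones polling cycle might look something like this Python sketch: one ICMP reachability check and a TCP connect to each expected service, repeated every five minutes. The inventory and the ping flags (Linux-style) are assumptions for the example.

```python
import socket
import subprocess
import time

# Hypothetical inventory: device address plus the TCP services we expect on it.
DEVICES = {
    "core-sw1": ("192.0.2.2", [22, 443]),
    "edge-rtr1": ("192.0.2.3", [22]),
}

POLL_INTERVAL = 300  # the customary five minutes

def poll_once() -> None:
    for name, (address, ports) in DEVICES.items():
        # Reachability: one ICMP echo via the system ping (Linux-style flags assumed).
        alive = subprocess.run(["ping", "-c", "1", "-W", "2", address],
                               stdout=subprocess.DEVNULL).returncode == 0
        # Service checks: confirm each expected TCP service still answers.
        failed = []
        for port in ports:
            try:
                with socket.create_connection((address, port), timeout=3):
                    pass
            except OSError:
                failed.append(port)
        print(f"{name}: reachable={alive} failed_services={failed}")

if __name__ == "__main__":
    while True:
        poll_once()
        time.sleep(POLL_INTERVAL)
```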

Alerting

This is where syslog and SNMP traps come into play. This data is actively sent from the monitored devices as events occur. There is no waiting for five-minute intervals to find out that the processor has suddenly spiked to 100% or that a critical interface has gone down. The downside is that there is usually a lot more information presented than is immediately necessary. This is the same kind of floodgate scenario I ran into in my earlier example. Configuring syslog to send messages at the "error" level and above is an important sanity-saving measure. SNMP traps are somewhat better for this, as they report on actual events instead of every little thing that happens on the device.
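
Filtering is best done at the sending device, but as an illustration of the severity arithmetic, here's a small Python sketch of a collector that decodes the syslog PRI field and keeps only error-and-above messages. The listening port is arbitrary and the "forward" step is just a print.

```python
import re
import socketserver

ERROR = 3  # syslog severities 0-3 are emergency, alert, critical, error

class SeverityFilterHandler(socketserver.BaseRequestHandler):
    """Drop anything less urgent than 'error' before it reaches the on-call eyes."""

    def handle(self):
        data = self.request[0].decode(errors="replace")
        match = re.match(r"<(\d+)>", data)   # RFC 3164/5424 PRI field
        if not match:
            return
        severity = int(match.group(1)) % 8   # PRI = facility*8 + severity
        if severity <= ERROR:
            print(data.strip())              # or hand off to the alerting pipeline

if __name__ == "__main__":
    with socketserver.UDPServer(("0.0.0.0", 5514), SeverityFilterHandler) as server:
        server.serve_forever()
```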

The Whisper in the Wires

Ultimately, network monitoring is about two things:

 

  1. Knowing where the problems are before anyone else knows they're there and being able to fix them.

  2. Having all of the trend data to understand where problems are likely to be in the future. This provides the necessary justification to restructure or add capacity before they become a problem.

 

The first of these is the most urgent one. When we're building our monitoring systems, we need to focus on the critical things that will take our networks down first. We don't need to get sidetracked by the pretty pictures and graphs... at least not until that first bit is done. Once that's covered, we can worry about the long view.

 

The first truth of RFC 1925, "The Twelve Networking Truths," is that it has to work. If we start with that as our beginning point, we're off to a good start.

In previous discussions about increasing the effectiveness of monitoring, it has been pointed out that having more eyes on the data will yield more insight into the problem. While this is true, part of the point of SIEM is to have more automated processes correlating the data so that the expense of additional human observation can be avoided. Still, no automated process quite measures up to the more flexible and intuitive observations of human beings. What if we looked at a hybrid approach that didn't require hiring additional security analysts?

 

In addition to the SIEM’s analytics, what if we took the access particulars of each user in a group and published a daily summary of what (generally) was accessed, where from, when and for how long to their immediate team and management? Such a summary could have a mechanism to anonymously flag access anomalies. In addition, the flagging system could have an optional comment on why this is seen as an abnormal event. e.g. John was with me at lunch at the time and couldn’t have accessed anything from his workstation.
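
To make the idea concrete, the summary generator could be as simple as the following Python sketch. The record format and field names are invented for illustration; in practice they would come from whatever the SIEM has already normalized.

```python
from collections import defaultdict

# Illustrative access records, assumed to be pre-normalized by the SIEM.
access_records = [
    {"user": "jsmith", "resource": "finance-share", "source": "10.1.4.22",
     "start": "2014-12-01 09:02", "minutes": 35},
    {"user": "jsmith", "resource": "crm", "source": "10.1.4.22",
     "start": "2014-12-01 13:10", "minutes": 110},
]

def daily_summary(records) -> str:
    """Roll access records up into the kind of per-user digest a team could review."""
    by_user = defaultdict(list)
    for rec in records:
        by_user[rec["user"]].append(rec)
    lines = []
    for user, recs in sorted(by_user.items()):
        total = sum(r["minutes"] for r in recs)
        resources = ", ".join(sorted({r["resource"] for r in recs}))
        sources = ", ".join(sorted({r["source"] for r in recs}))
        lines.append(f"{user}: {len(recs)} sessions, {total} min total; "
                     f"resources: {resources}; from: {sources}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(daily_summary(access_records))
```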

 

Would something like this make the security analysis easier by having eyes with a vested interest in the security of their work examining the summaries? Would we be revealing too much about the target systems and data? Are we assuming that there is sufficient interest on the part of the team to even bother reading such a summary?

 

Thinking more darkly, is this a step onto a slippery slope of creating an Orwellian work environment? Or… is this just one more metre down a slope we’ve been sliding down for a long time?

When implementing a SIEM infrastructure, we’re very careful to inventory all of the possible vectors of attack for our critical systems, but how carefully do we consider the SIEM itself and its logging mechanisms in that list?

 

For routine intrusions, this isn’t really a consideration. The average individual doesn’t consider the possibility of being watched unless there is physical evidence (security cameras, &c) to remind them, so few steps are taken to hide their activities… if any.

 

For more serious efforts, someone wearing a black hat is going to do their homework and attempt to mitigate any mechanisms that will provide evidence of their activities. This can range from simple things like…

 

  • adding a static route on the monitored system to direct log aggregator traffic to a null destination
  • adding an outbound filter on the monitored system or access switch that blocks syslog and SNMP traffic

 

… to more advanced mechanisms like …

 

  • installing a filtering tap to block or filter syslog, SNMP and related traffic
  • filtering syslog messages to hide specific activity

 

Admittedly, these things require administrator-level or physical access to the systems in question, which is likely to trigger an event in the first place, but we also can’t dismiss the idea that some of the most significant security threats originate internally. I also look back to my first post about logging sources and wonder if devices like L2 access switches are being considered as potential vectors. They're not in the routing path, but they can certainly have ACLs applied to them.
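
One modest counter-measure, sketched below in Python, is to watch for log sources that suddenly go quiet, since silence can itself be the symptom of a suppressed feed. The threshold and the alerting hook are placeholders; in practice this would hang off the log collector's receive path.

```python
import time

MAX_SILENCE = 600  # seconds without a message before we worry

last_seen = {}  # source address -> time of last received log message

def record_message(source: str) -> None:
    """Call this from the log collector whenever a message arrives from a source."""
    last_seen[source] = time.time()

def check_for_silenced_sources(expected_sources) -> list:
    """Return the sources (including L2 access switches) that have gone quiet."""
    now = time.time()
    return [src for src in expected_sources
            if now - last_seen.get(src, 0) > MAX_SILENCE]

if __name__ == "__main__":
    record_message("10.0.0.5")
    quiet = check_for_silenced_sources(["10.0.0.5", "10.0.0.6"])
    print("silenced or never-seen sources:", quiet)
```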

 

I don’t wear a black hat, and I’m certain that the things I can think of are only scratching the surface of possible internal attacks on the SIEM infrastructure.

 

So, before I keep following this train of thought and start wearing a tin foil hat, let me ask these questions:

 

Are we adequately securing and monitoring the security system and supporting infrastructure?

If so, what steps are we taking to do so?

How far do we take this?

Security management and response systems are often high-profile investments that occur only when the impact of IT threats to the business is fully appreciated by management. At least in the small and midmarket space, this understanding only rarely happens before the pain of a security breach, and even then enlightenment comes only after repeated exposure. When it does, it's amazing how seriously the matter is taken and how quickly a budget is established. Until this occurs, however, the system is often seen as a commodity purchase rather than an investment in an ongoing business-critical process.

 

Unfortunately, before the need is realized, there is often little will on the part of the business to take some action. In many cases, organizations are highly resistant to even a commodity approach because they haven't yet suffered a breach. One might think that these cases are in the minority, but as many as 60% of businesses either have an outdated "We have a firewall, so we're safe!" security strategy or no security strategy at all.
[Source: Cisco Press Release: New Cisco Security Study Shows Canadian Businesses Not Prepared For Security Threats - December 2014]

 

Obviously, different clients will be at varying stages of security self-awareness, with some a bit further along than others. For the ones that have nothing, they need to be convinced that a security strategy is necessary. For others, they need to be persuaded that a firewall or other security appliance is only a part of the necessary plan and not the entirety of it. No matter where they stand, the challenge is in convincing them of the need for a comprehensive policy and management process before they are burned by an intrusion and without appearing to use scare tactics.

 

What approaches have you taken to ensure that the influencers and decision makers appreciate the requirements before they feel the pain?

Good morning, Thwack!

 

I'm Jody Lemoine. I'm a network architect specializing in the small and mid-market space... and for December 2014, I'm also a Thwack Ambassador.

 

While researching the ideal sweet spot for SIEM log sources, I found myself wondering where and how far one should go for an effective analysis. I've seen logging depth discussed a great deal, but where are we with sources?

 

The beginning of a SIEM system's value is its ability to collect logs from multiple systems into a single view. Once this is combined with an analysis engine that can correlate these and provide a contextual view, the system can theoretically pinpoint security concerns that would otherwise go undetected. This, of course, assumes that the system is looking in all of the right places.

 

A few years ago, the top sources for event data were firewalls, application servers and database servers. Client computers weren't high on the list, presumably (and understandably) because of the much larger amount of aggregate data that would need to be collected and analyzed. Surprisingly, IDS/IPS and NAS/SAN logs were even lower on the scale. [Source: Information Week Reports - IT Pro Ranking: SIEM - June 2012]

 

These priorities suggested a focus on detecting incidents that involve standard access via established methods: the user interface via the firewall, the APIs via the application server, and the query interface via the database server. Admittedly, these were the most probable sources for incidents, but the picture was hardly complete. Without the IDS/IPS and NAS/SAN logs, any intrusion outside of the common methods wouldn't even be a factor in the SIEM system's analysis.

 

We've now reached the close of 2014, two and a half years later. Have we evolved in our approach to SIEM data sources, or have the assumptions of 2012 stood the test of years? If they have, is it because these sources have been sufficient or are there other factors preventing a deeper look?
