
Geek Speak


All of us who have spent any time in the IT field have had to deal with patching at some point. It's a necessary evil. Why an evil? Well, if you've had to deal with patches, then you know it can be a major pain. When I hear words like SCCM or Patch Tuesday, I cringe, especially if I'm in charge of patch management. We all love Microsoft (ahem), but let's be honest, they have more patches than any other software vendor in this galaxy! VMware has its patching, Linux machines get patched, but Windows servers involve some heavy lifting when it comes to patching. Most of my memories of staying up past midnight to do IT work revolve around patching, and again, it's not something that everybody jumps to volunteer for. While it's definitely not riveting work, it is crucial to the security of your servers, network devices, desktops, <plug in system here>. Most software vendors, such as Microsoft, are good about pushing out up-to-date patches for their systems; however, there are other types of systems whose patches we as IT staff have to go out and pull down from the vendor's site, which adds more complexity to patching.


My question is: what are you doing to manage your organization's patching? Are you using SCCM, WSUS, or some other type of patch management? Or are you still out there banging away at manually patching your systems? Hopefully not, but maybe you aren't a full-blown enterprise. I'm curious, because to me patching is the most mundane and painful process out there, especially if you are doing it manually.

Security management and response systems are often high-profile investments that occur only when the impact of IT threats to the business is fully appreciated by management. At least in the small and midmarket space, this understanding rarely happens before the pain of a security breach, and even then enlightenment comes only after repeated exposure. When it does, it's amazing how seriously the matter is taken and how quickly a budget is established. Until this occurs, however, the system is often seen as a commodity purchase rather than an investment in an ongoing business-critical process.


Unfortunately, before the need is realized, there is often little will on the part of the business to take some action. In many cases, organizations are highly resistant to even a commodity approach because they haven't yet suffered a breach. One might think that these cases are in the minority, but as many as 60% of businesses either have an outdated "We have a firewall, so we're safe!" security strategy or no security strategy at all.
[Source: Cisco Press Release: New Cisco Security Study Shows Canadian Businesses Not Prepared For Security Threats - December 2014]


Obviously, different clients will be at varying stages of security self-awareness, with some a bit further along than others. Those that have nothing need to be convinced that a security strategy is necessary at all. Others need to be persuaded that a firewall or other security appliance is only part of the necessary plan, not the entirety of it. No matter where they stand, the challenge is convincing them of the need for a comprehensive policy and management process before they are burned by an intrusion, and doing so without appearing to use scare tactics.


What approaches have you taken to ensure that the influencers and decision makers appreciate the requirements before they feel the pain?

Good morning, Thwack!


I'm Jody Lemoine. I'm a network architect specializing in the small and mid-market space... and for December 2014, I'm also a Thwack Ambassador.


While researching the ideal sweet spot for SIEM log sources, I found myself wondering where and how far one should go for an effective analysis. I've seen logging depth discussed a great deal, but where are we with sources?


The beginning of a SIEM system's value is its ability to collect logs from multiple systems into a single view. Once this is combined with an analysis engine that can correlate these and provide a contextual view, the system can theoretically pinpoint security concerns that would otherwise go undetected. This, of course, assumes that the system is looking in all of the right places.
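
Getting a source into that single view is usually the mechanical part. As a minimal sketch (the collector hostname and port are assumptions, and this presumes rsyslog on the sending host), pointing a Linux server's syslog stream at a central collector is a two-liner:

echo '*.* @@siem.example.local:514' > /etc/rsyslog.d/60-siem.conf
service rsyslog restart

The harder question, and the one this post is really about, is which sources deserve a place in that view at all.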


A few years ago, the top sources for event data were firewalls, application servers, and database servers. Client computers weren't high on the list, presumably (and understandably) because of the much larger amount of aggregate data that would need to be collected and analyzed. Surprisingly, IDS/IPS and NAS/SAN logs were even lower on the scale. [Source: Information Week Reports - IT Pro Ranking: SIEM - June 2012]


These priorities suggested a focus on detecting incidents that involve standard access via established methods: the user interface via the firewall, the APIs via the application server, and the query interface via the database server. Admittedly, these were the most probable sources for incidents, but the picture was hardly complete. Without the IDS/IPS and NAS/SAN logs, any intrusion outside of the common methods wouldn't even be a factor in the SIEM system's analysis.


We've now reached the close of 2014, two and a half years later. Have we evolved in our approach to SIEM data sources, or have the assumptions of 2012 stood the test of years? If they have, is it because these sources have been sufficient or are there other factors preventing a deeper look?

As we wrap up the fourth installment of my Help Desk Adventure Series (I'm going to trademark that), I've described the journey from building out a help desk to defining SLAs and incorporating workflow automation. Looking back, I think the one resource left out of this discussion was time. This resource is finite, difficult to find, and often buried under other daily tasks. Ever hear of the 80/20 rule? It's a slightly modified idea taken from the Pareto Principle: we as IT professionals spend 80% of our time on trivial tasks and 20% innovating (and providing some real impact). In some of my positions, it might as well have been the 99/1 rule (or the 100/20 rule, in which I spent nights and weekends on the innovation part).


How did my team and I get the time to build out a new help desk system? I made a few "bold statements" to management to help get this off the ground and created some daily team exercises. Again - support from the upper echelons is critical!


  • The help desk, and related SLA and workflow creation, was considered an IT project with time allotments. We could budget team time towards the help desk and could give it priority in the queue.
  • The team would cover for one another when other issues arose. For example, I would spend time working issues on the call center floor so that a member of the team could focus on building a workflow. Broken concentration kills innovation.
  • I used the term "technical debt" quite often: this means that if we put off a help desk now, we pay for it with more work later. We wanted to pay off our operational debt quickly and efficiently.
  • A morning "red zone" meeting would be held at the start of the day. We'd review the backlog of work to complete and determine what we wanted to get done that day. It was also a great time to figure out how we could best help each other with various daily tasks, and communicate progress.


Knowing that it's very difficult to carve out time for any new work, I'm curious if you have any other tips to add to my list? How have you managed to free up time for your help desk creation, updates, workflows, or just general tasks that make your help desk better?

In my last post, I took a look at the DNS protocol with tcpdump, and as it turns out, you can do some really useful stuff with the embedded protocol decoders. So, how far can we take troubleshooting with tcpdump? Well, pretty far; but in troubleshooting you have to decide whether the fastest resolution will come from the tools you have to hand or from grabbing the capture and using something better. Whilst you *can* definitely do some analysis of HTTP, as we'll see, once the initial handshake is complete it gets messy, real quick.

ASCII - Still a thing

The most useful switch for debugging HTTP is -A, which decodes the traffic in ASCII format, which kinda makes it human readable. To kick off a capture on our server we run:

[~] # tcpdump -i eth0 -pnA port 80

Capture 0.PNG

For sanity's sake, I've snipped out the initial handshake. After a few packets we can see the client's request, the ACK, and the server's response. Probably the most interesting parts are highlighted in yellow:

  • The HTTP GET request from the client (GET / HTTP/1.1)
  • The server HTTP 200 return code (HTTP/1.1 200 OK)
  • The Content-encoding (gzip)
  • The content type returned (text/html)

Sorry, I've got no HEAD

Various other headers are displayed, but they're not usually that useful. Beyond that, it's HTML:

Capture 1.PNG

But that doesn't look anything like HTML. There are no lovely <HEAD> or <HTML> tags. The clue is in the client and server headers. Whilst the session is not encrypted, with gzip compression enabled, for a human it may as well be. You can't see the conversation between the client and server once the TCP and HTTP parameters are established. However, we can divine the following:

  1. The URL the client requested
  2. The server was happy to accept the request
  3. The parameters/session features enabled (such as gzip compression)
  4. But not much else

Somewhere in this packet exchange, a redirect sends the client to a different TCP port. However, from the tcpdump alone we can't see that. There *may* be an arcane way of getting tcpdump to decompress gzip on the fly, but I'll be darned if I can figure it out. As a workaround, you could disable compression in your browser, or use a CLI web tool such as cURL. However, changing things just to troubleshoot is never a good idea, and that wouldn't help if your problem is with gzip compression itself.
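
If you do go the cURL route, a rough sketch might look like the following (the address is just an example); the first request asks the server not to compress at all, while the second lets cURL negotiate gzip and decode it for you:

[~] # curl -v -H 'Accept-Encoding: identity' http://192.0.2.10/
[~] # curl -v --compressed http://192.0.2.10/ -o /dev/null

The -v output also shows the request and response headers, which lines up nicely with what we saw in the tcpdump capture.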

404 - Not Found

Another example shows a less than healthy connection:

Capture 2.PNG

This time, the client is requesting reallybadfile.exe, so the server returns a 404 Not Found error. Random clients attempting to request executables is, of course, a sign of virus or other malicious activity. Many firewalls can filter this stuff out at the edge, but this is a job best suited to a load balancer or Application Delivery Controller (a posh load balancer).

If you are just interested in the negative status codes, you of course can just pipe the output to grep:

[~] # tcpdump -i eth0 -pnA port 80 | grep '404\|GET'

Capture 3.PNG

This is an especially quick and dirty method; of course you could pipe in multiple status codes, or use egrep and a regex, but from the CLI you run a pretty big risk of missing something.
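
For example, a slightly broader (and still quick and dirty) filter that also catches a few server-side errors might be:

[~] # tcpdump -i eth0 -pnA port 80 | egrep 'GET |HTTP/1\.1 (404|500|503)'

It works, but as soon as the list of interesting codes grows, you are better off capturing to disk and filtering properly afterwards.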

(insert Wireshark pun)

Sometimes it's best to admit defeat and make a permanent capture for analysis elsewhere. To do this we use the -w switch to write the packets to disk. The verbose switch is also helpful here, as it reports the number of packets received during the capture, so you know you've actually caught something.

[~] # tcpdump -i eth0 -pnAv port 80 -w httpcapture.pcap

Capture 3.PNG

Then the session can be analysed in Wireshark with minimal effort. Just grab the file any way you can and open it; the built-in decoders will work their magic. Once loaded, go to Analyze and Follow TCP Stream. This will show you a cleaned-up version of the capture we took at the beginning, but with the payload still encoded.
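
As an aside, if you would rather stay on the CLI, Wireshark's command-line sibling tshark can do the same follow-the-stream trick against the saved file (stream 0 here is an assumption; pick whichever stream you are chasing):

[~] # tshark -q -r httpcapture.pcap -z follow,tcp,ascii,0

Either way, the payload itself is still gzip-encoded at this point.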


No problem: just go and find the HTTP response packet, dig through it, and the payload should be in the clear:

Capture 6.PNG

And here, now highlighted in yellow, we can see a tiny piece of JavaScript that redirects my HTTP session to another location. By not using an HTTP 3xx status code to redirect the session, it became much harder (if not impossible) to realise what was going on using tcpdump alone. With some elite-ninja regex skills and a perfect recollection of the HTTP protocol, maybe you could have figured this out without bringing in the Wireshark heavy artillery. However, for mere mortals such as myself, it's just a case of knowing when to bring in the really big hammer.

So, that's my little treatise on deep packet analysis, with some practical applications. Please let me know your thoughts in the comments, along with any tips you can share about tcpdump, Wireshark, or any of the other excellent tools out there.

Day in, day out, network engineers and admins tackle many complex issues, and often these issues stem directly from an IP conflict. While IP conflicts might seem like one of the simplest problems that can occur, they can actually be quite troublesome. How often do you find yourself spending an excessive amount of time troubleshooting a network issue, only to find that the cause was an IP conflict?


We all know what an IP conflict is and the ways it can be triggered. The simplest is the manual assignment of static IP addresses. A misconfigured DHCP server is another cause, where more than one similarly configured DHCP server hands out overlapping IP addresses. IP conflicts can also occur if you have multiple wireless access points or other network devices with an embedded DHCP server turned on by default. DHCP servers also cause conflicts when a statically assigned IP coincides with an address that has already been handed out by the server. And the most frequent cause nowadays, with more devices constantly entering and leaving the network: when a node such as a virtual machine reconnects after an extended period in stand-by or hibernate mode, a conflict is likely if the address configured on that system has since been assigned to another system that is already on the network.


With IP conflicts arising in so many different ways across the network, the difficulty lies in troubleshooting and recognizing that the cause of a particular network issue is an IP conflict in the first place. Moreover, it's even more difficult and time consuming to identify and physically locate the devices in conflict.


The Pain in Troubleshooting IP Conflicts

Recent survey results revealed that IT professionals spend, on average, 16% of their total time troubleshooting IP problems. One basic and important criterion for efficient IP address management is avoiding IP duplication. IP conflicts pose a threat because they cause connectivity issues for the systems in conflict, and these issues are difficult to troubleshoot because the affected systems experience erratic connectivity. The administrator often ends up looking in a variety of places for the cause before finally identifying the culprit as IP duplication. And, of course, the seriousness of the issue differs depending on whether the conflicting system is a simple workstation or a critical application server.


For example, say someone brings a notebook into work and plugs it into the network. Soon thereafter, all remote connections to the accounting server go down. Not knowing why, IT starts to investigate the problem. First they reboot the remote access server. Next, they change the switch port and network cables. Then, as a last resort, they try unplugging all devices from the switch, and after that the problem just goes away. Problem solved: another device connected to the switch was causing the problem. In this scenario, IT only started looking for an IP conflict late in the game, and eventually found the outside computer plugged into the network to be the culprit. Consider all the time spent troubleshooting this issue: many hours are lost, and the business likely suffers downtime.
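
A quick duplicate-address check from any Linux host on the segment can shorten a hunt like this considerably. A minimal sketch (the interface and address are examples) using the duplicate address detection mode of iputils arping:

arping -D -c 2 -I eth0 10.0.10.25

If another machine answers for that address, its MAC address shows up in the reply, which at least tells you what to go looking for on the switch.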


As a network engineer or admin, you’re often under the gun to resolve issues with the least amount of downtime. But hey, you don’t have to be…there’s always a peaceful resolution. So, how do you avoid excessive troubleshooting time or at least limit it? Do you use spreadsheets or an IP management tool?

Note: This post originally appeared in Information Week: Network Computing

Why do we hear of new security breaches so frequently? Make sure your organization follows these best practices and considers these important questions to protect itself.

Three big things have been happening with great frequency of late: earthquakes, volcanoes, and data breaches, most of the latter involving point-of-sale (PoS) systems and credit card information. While I'm certainly curious about the increase in earthquakes and volcanic activity, I simply do not understand the plethora of PoS breaches.

The nature and extent of the breach at Target a year ago should have been a wake-up call to all retailers and online stores that accept credit card payments. I get the feeling that it was not, but I'm not here to point fingers in hindsight. I do, however, want to call your attention to what you are, or are not, learning from these incidents, and how those lessons are being applied and leveraged within your own organization.

Lessons from Target, et al.
Let's revisit the Target breach. In short, it happened because vendor credentials were compromised and subsequently used to inject malware onto Target's systems. At the time, a number of security professionals also suggested that the retailer was likely not the only target (no pun intended).

As a result, three actions should have occurred immediately in every organization around the globe:

  • An audit of every account repository throughout the organization to disable or eliminate unused accounts, ensure active accounts were properly secured, and determine whether any existing accounts showed evidence of compromise
  • A full malware scan on every system, including explicit checks for the specific malware identified on the Target systems
  • A reevaluation of network connectivity, with these questions in mind:
    • How could a service vendor's credentials be used to access our PoS network?
    • Which of our networks are connected to which networks?
    • How are they connected?
    • Do firewalls exist where they should?

And yet, in the weeks following the Target announcement, a litany of big-name retailers, including Neiman Marcus, Michaels, Sally Beauty Supply, P.F. Chang's, Goodwill Industries, and Home Depot, reported breaches that occurred around the same time as, or after, the Target breach was disclosed.

If you haven't done the three things listed above in your organization, go do them right now!

Patching is a no-brainer
Then there was Heartbleed, perhaps the most heavily publicized vulnerability in the history of network computing. Who hasn't heard about Heartbleed? It was a threat with an immediately available and simple-to-deploy patch. Most organizations deployed the patch immediately (or at least took their OpenSSL devices off the Internet).

And yet, despite this, Community Health Systems managed to give up 4.5 million customer healthcare records to Chinese hackers in an attack that started a week after the Heartbleed announcement. Now, while we might forgive the April attack, this theft actually continued through June! To date, this is the only known major exploit of that vulnerability. (And yet, there are still a quarter-million unpatched devices on the Internet!)
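
Checking your own exposure takes only a few minutes. A rough sketch (the hostname is an example, and this assumes a reasonably recent nmap with the ssl-heartbleed script installed):

openssl version
nmap -p 443 --script ssl-heartbleed www.example.com

The affected OpenSSL releases were 1.0.1 through 1.0.1f; anything on 1.0.1g or later, or rebuilt with heartbeats disabled, is in the clear.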

What is your plan for ensuring highly critical security updates are deployed to your devices as soon as possible -- and if not, protecting those devices from known threats?

When is compliance not compliant?
The final aspect of all of this is the alleged value of our compliance regulations, which raises some interesting questions. For example, what good comes from the PCI-DSS regulations in the face of so many breaches? Is this a failure of the compliance standards to actually require the things that matter? Is this a case of businesses fudging the compliance audits? Finally, where are the teeth in PCI-DSS for organizations that fail to be compliant?

And how responsible is management? Perhaps the most infuriating thing about the Home Depot incident is the recent report that management had been warned for years that there were known vulnerabilities, and yet did nothing.

Is your management resistant to acting responsibly about data security? Do you have a plan for changing this resistance?

The bottom line is this: Don't be the next story in this long train of disasters. Go check your systems, networks, accounts, and employees. Most of all, learn from the tribulations of others.

In my last post I looked at how flags can pull useful information out of a packet that we might otherwise struggle to see. This time, we're going to use tcpdump to look into the actual applications.

The first application I'm going to look at is the humble Domain Name System (DNS), the thing that needs to work flawlessly before any other application can get out of bed. Because DNS lookups are typically handled down in the OS IP stack, a packet capture is often the only way to get any visibility.

The NS in a Haystack

In my scenario, I suspect that something is amiss with DNS, but I'm not sure what. To pick up just DNS traffic, we only need to capture with a simple filter:

[~] # tcpdump -i eth0 -pn port 53


In the above example, even with minimal options selected, we can see some really useful information. The built-in decoder pulls out the transaction ID from the client (equivalent to a session ID), the query type (A record), and the FQDN we are looking for. What is unusual in this example is that we can see not one but two queries, about 5 seconds apart. Given that we are filtering on all port 53 traffic, we should have seen a reply. It would appear that my local DNS proxy for some reason failed to respond. The client-side resolver timed out and tried the Google Public DNS. This may be a one-time event, but it certainly bears monitoring. If the client configuration has an unresponsive or unreliable DNS server as its first port of call, at the very least this will manifest as a frustrating browsing experience.
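
Once you suspect this sort of resolver flakiness, a quick way to compare servers (a sketch; substitute your own resolver for the first address) is to query each one directly and compare the reported query times:

[~] # dig @192.168.1.1 www.google.co.uk A | grep 'Query time'
[~] # dig @8.8.8.8 www.google.co.uk A | grep 'Query time'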

Selection of the Fittest (and Fastest)

Selection of DNS servers is pretty important; I hadn't realised that my test Linux box was using Google as a secondary resolver. Whilst it is reliable, it's actually four hops and a dozen milliseconds further away than my ISP's service. When your broadband is as crappy as mine, every millisecond counts.

Anyway, as you can see, Google returns eight A records for google.co.uk; any of them should be fine.

Another thing to look for is what happens when we make an invalid query, or there is no valid response:


In this case we get an NXDomain (non-existent domain) error. This one is an obvious typo on my part, but if we turn up the logging with the very verbose (-vv) switch, the response is still interesting:

[~] # tcpdump -i eth0 -pnvv port 53


Highlighted above is the SOA (start of authority) record for the domain .ac.uk; this is as far as the server was able to chase the referral before it got the NXDomain response.

Edit - Contributor JSwan pointed out a small mistake; I've fixed the version below.

Whilst a bunch of stuff is revealed with very verbose enabled, not all of it is useful. One thing to look at is the IP time to live (TTL); this is decremented by each router along the way, so comparing it against the usual starting values (64, 128 or 255) gives a rough idea of how many hops the packet has made since leaving the source. If this number is low, it can be an indicator of routing problems or high latency (I did say it wasn't very useful!).

Furthermore, the DNS protocol's own TTL can be seen highlighted in yellow, after the serial number in date format. The DNS TTL specifies how long the client (or referring server) should cache the record before checking again. For static services such as mail, TTLs can be 24 hours or more. However, for dynamic web services this can be as low as 1 second. TTLs that low are not a great idea; they generate HUGE amounts of DNS traffic, which can snowball out of control. The moral is: make sure the TTLs you are getting (or setting) are appropriate to your use case. If you fail over to your backup data centre with a DNS TTL of a week, it will be a long time before all the caches are flushed.
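
You don't need tcpdump to check a record's TTL, of course; dig prints it (in seconds) as the second column of the answer section:

[~] # dig +noall +answer www.google.co.uk A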

As JSwan points out in the comments, if you use the very very verbose switch (-vvv), for A records tcpdump will display the DNS TTL in hours, minutes and seconds:
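
For reference, that capture would be taken with something along the lines of:

[~] # tcpdump -i eth0 -pnvvv port 53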


Apparently Google has a very short TTL. Interestingly, tcpdump doesn't print the DNS TTL for an NXDOMAIN result, although it is still visible in the capture.


Why is capturing in context important?

Imagine trying to troubleshoot connectivity for a network appliance. You have configured IP addressing, routing, and DNS, but it still cannot dial home to its cloud service. Unless the vendor has been really thorough in documenting their error messages, a simple DNS fault can leave you stumped. Once again, tcpdump saves the day, even on non-TCP traffic. The built-in protocol decoder gives vital clues as to what may be borking an apparently simple communication path.

In my next and final blog of this series, I’m going to look at another common protocol, HTTP.

Now that the IT team had established a help desk and defined SLAs for various requests and workflows, it was time to get proactive. Perhaps the data stored in the help desk could be used to build a sort of analytics engine for making better decisions? This idea came from one of the IT team members, as he saw certain trends that the rest of us really couldn't.


For example:

  • Which activities are the most frequent, or the most time consuming, and is there a correlation?
  • Could we find ways to proactively automate a number of frequently occurring requests, such as password resets?
  • Was there a way to extrapolate our hardware replacement rates to build a reasonably accurate forecast for next quarter (or next year) budget numbers?


It turned out that a number of workflows could be created to solve or remediate requests, with our help desk system acting as the front-end interface. Password requests ended up being the most common, so we gave non-IT supervisor staff the ability to issue time-sensitive password resets for their pods (their teams) once per day. The resets were still audited to ensure no funny business took place, but this alleviated a fair bit of pain when a call staff member forgot his or her password right as they were about to begin a shift. Interestingly enough, we found out that many employees had been getting around this by sharing a password or asking a supervisor to override the login for our VoIP system. As such, our delegation of password resets closed this loop and gave us further visibility into the problem.


What sort of workflows have you built, or wish you could build, around the help desk system in place? How would you use workflows to offload some of the "manual, heavy lifting" off you or your team?

A Guide to Navigating Around the Deadly Icebergs that Threaten Virtualized Databases

When it comes to virtualized workloads, databases can be in a size class all to themselves. This and other factors can lead to several unique challenges that on the surface might not seem all that significant, but in reality can quickly sink a project or implementation if not given proper consideration. If you're put in charge of navigating such a virtual ocean-liner-sized workload through these iceberg-infested waters, you'll want to make sure you're captain of the Queen Mary and not the Titanic. How do you do that? The first step is understanding where the icebergs are.

I recently had a conversation with Thomas LaRock, president of the Professional Association for SQL Server (PASS), who also happens to be one of our Head Geeks here at SolarWinds, to get his input on this very topic. Here's what I learned:


CPU & Memory Allocation

First, don't treat a database like a file server when it comes to configuration and virtual resource allocation. Typical configurations allow over-allocation of both memory and CPU. However, CPU allocation shouldn't be more than 1.5-2 times the number of logical cores you have. When it comes to memory, don't over-allocate at all if you can help it; stay at no more than about 80 percent of the host's physical memory. As memory utilization gets near 100 percent, you may not have enough resources left to even reboot the host. If you do push your systems, make sure that you not only have a virtualization monitoring tool, but that you actively use it.
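
To put rough numbers on that (a purely hypothetical host, not a recommendation): with 16 logical cores and 256 GB of RAM, the guidance above works out to a ceiling of roughly 16 x 1.5 = 24 vCPUs (at most 16 x 2 = 32) allocated across all the VMs on that host, and no more than about 256 x 0.80 = 205 GB of memory committed to guests.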


High Availability Strategy

Most virtualization admins use snapshots and vMotion (Thomas' preferred option) as the primary approach to address high availability concerns. On the Windows side specifically, clustering and availability groups are also common. While either technology can be effective, they probably shouldn't be used together. An example of why not: a database VM gets vMotioned to another host as a result of a performance problem, but the availability group just sees that as a server instance no longer responding. If you do use both, make sure that you don't allow automatic vMotion, so there is always an operator in the mix to (hopefully) prevent problems; otherwise bad things can happen.


To Monster VM or Not?

You might wonder if an easy way to overcome the challenges of virtualized databases is simply to allocate one of VMware's or Hyper-V's "monster VMs" to the database instance and solve problems by throwing hardware at them. However, a better approach is to put database VMs on a mixed-use host that includes a range of production, development, test, and other workload types. The rationale is that if you have a host with nothing but one or more database servers running on it, you have no options if you accidentally run out of resources. With a typical mixed-use host, you're less likely to be simultaneously hammering one resource type, and if you do start to hit a resource bottleneck, the impact of turning off a development or test VM to provide short-term resources will typically be less than shutting down a production database.


Taking these considerations and tips into account can help make sure your virtualized databases stay afloat a long time rather than being lost due to a potentially avoidable iceberg on the maiden voyage.

If you're looking for additional information on virtualizing databases, SolarWinds has a number of great white papers available online, including "5 Risks for Databases on VMware" and "Monitoring Database Performance on VMware."


Note: This post originally appeared on VMBlog at http://vmblog.com/archive/2014/10/23/would-you-rather-captain-the-queen-mary-or-the-titanic.aspx#.VGUUxfnF8uK


In an earlier post, I discussed the path to successfully implementing a help desk system for a mid-sized organization that was new to the idea. Luckily, the hard work of my team and the positive reinforcement received from management helped sell the idea further. In fact, I selected a handful of power users - you know, the folks that are "friends of IT," out there trying to get their work done - to test the help desk system. I also let it leak that they were using a very cool new system that gave them additional access to IT resources. Those not using it were curious and actively sought us out to put in those "cool new tickets" instead of using the old email system. It was a type of reverse psychology exercise. :-)


Now that users were actively engaged and entering data into the system via tickets, we had some more tough decisions to make around Service Level Agreements or SLAs. The top brass wanted to see results, after all, and those are easiest to swallow if wrapped in some simple numbers. It is hard, however, to define SLAs when all of your data is new and there is not much historical information to work with.


A few things we learned along the way:


  1. Allowing user-defined ticket priorities to dictate which SLAs we could be held against was a mistake. Everyone feels like their ticket is high priority and deserves the tightest SLA. In the end, our internal IT team learned how to prioritize tickets as they entered the queue. The user-defined priority was left in to make people feel better. :-)
  2. We ended up mimicking some SLAs from our vendors; if our server vendor offered a 4-hour replacement window, we would match that value (or go slightly above it, to allow for break/fix time after the replacement arrived).
  3. Having historical metadata wrapped around ticket objects - such as knowing how many times a specific phone had stopped working - gave us increased confidence to make actionable decisions. This let us go far beyond our "standard" 4-hour SLA, because we could quickly move aging hardware into a replacement queue, a discard queue, or a "fix it" queue. Hardware now told us stories about itself.
  4. Being able to show our SLAs in hard numbers provided valuable protection against fallible human memory. It also pointed out the weak spots where we needed improvement, training, or additional skill sets.


With that said, I'm curious how you offer Service Level Agreements to your help desk users. Is it a general tiered system (high, medium, low) or something more granular? How did you pick your SLA time tables?

In my last post I talked through some of the reasons why mastering tcpdump is useful. Building on the previous example, in this post I'll focus on using TCP flags for troubleshooting.

Even with our cleaned-up filter, we can still see quite a lot of traffic that we don't care about. When troubleshooting connectivity issues, the first packet is the hardest, especially when you start involving firewalls. As I'm sure you will recall, a TCP packet flagged with SYN is the first one sent when a client tries to establish a layer-7 connection to a remote host. On a LAN this is simple, but when the destination is on a different network, there is more to go wrong, with the inevitable NAT and routing.

Troubleshooting approaches differ, but I prefer to jump on the target console and work backwards, as traffic is usually only policed on ingress. We need to know whether our packet is reaching the target. From there we can look down the stack (at routing and firewalling) or up the stack to the application itself.

This is where a working knowledge of TCP flags and a copy of tcpdump is helpful. 

Signalling with Flags

Each TCP packet contains the source and destination ports, the sequence and acknowledgement numbers, as well as a series of flags (or control bits) which indicate one or more properties of the packet. Each flag is a single bit flipped on or off. The below diagram shows the fields in a TCP packet. The source and destination ports are 16-bit integers (which is where the maximum of 65535 comes from), but if you look at offset 96, bits 8 through 15 are the individual flags: SYN, ACK, RST, and so on.

Capture 0.PNG

When a connection is set up between a client and a server, the first three packets will carry the SYN and ACK flags. If you look at them in any packet analyser, it'll look something like this:

Client -> Server (SYN bit set)

Server -> Client (SYN and ACK bits set)

Client -> Server (ACK bit set)

To make sure we are actually getting the traffic, we want to see the first packet in the connection handshake. To capture packets where only the SYN flag is set, we use the following command from the server console:

[~] # tcpdump -i eth0 -n 'tcp[tcpflags] & (tcp-syn) != 0 and port not 22'

The above tcp option filters on specific packet fields. In this case we are using tcpdump’s built-in shorthand to look at the bits associated with TCP SYN (the ‘Flags [S]’ in yellow). 

The other thing we are introducing is using ‘single’ quotes on the filter. This prevents the shell from trying to interpret anything within brackets.

In the below example we can see three packets sent ~500ms apart (also highlighted in yellow). Bear in mind that almost everything on the wire has been filtered out; we are only hearing one side of the conversation.


Three packets with the same source and destination ports, transmitted 500ms apart, give us a clue as to what is happening. This is the typical behaviour of a client connection that received no response, assumed the packet was lost in transit, and tried twice more.

What does the flag say?

Having taken this capture on the server, we know the client's packets are reaching it, so it's unlikely that an intermediate firewall is causing the problem. My hunch is that the server is not completing the connection handshake for some reason. A quick and dirty check is to look for TCP reset packets: the host's blunt way of refusing a connection attempt. Hosts respond with a TCP reset when there is no application listening; the lights are on, but no-one is home.

[~] # tcpdump -i eth0 -n 'tcp[tcpflags] & (tcp-rst) != 0 and port not 22'


I took the above captures a few minutes apart, but for every TCP SYN there is a TCP reset from the server. Whether the server is actually listening on the target destination port (3333) on any interface is easily confirmed with:

[~] # netstat -na | grep 3333

If no results are returned, the service ain't running. When taking captures from a firewall, you should expect different behaviour: in 99% of cases, if a packet doesn't match a policy it will be dropped silently, without an acknowledgement (ACK) or reset (RST) packet.
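
Incidentally, on distributions where netstat has given way to the iproute2 tools, the equivalent listening-socket check would be something like:

[~] # ss -lnt | grep 3333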

These are not the flags you are looking for

With the tcpflags option we can pipe in additional matches. For example, we can look for all traffic where the tcp-syn and tcp-ack flags are not set to 0.

tcpdump -i eth0 -n 'tcp[tcpflags] & (tcp-syn|tcp-ack) != 0 and port not 22'

However, everything with SYN or ACK set doesn't constitute much of a filter; you are going to be picking up a lot of traffic.

Rather than just filtering on source and destination ports, why do I care about TCP flags? Because I'm looking for behaviour rather than specific sources or destinations. If you are troubleshooting back-to-back LAN PCs, you wouldn't bother with all the vexillology. However, with hosts on either side of a firewall, you can't take much for granted. When firewalls and dynamic routing are involved, traffic may cross NAT zones or arrive on an unexpected interface.

It’s easy to switch to source/destination filtering once you've “found” the flows you are looking for; I try and avoid making assumptions when troubleshooting.

In my next post I’ll dig into the payload of two common protocols to see what we can learn about layer 7 using only tcpdump.

It's the best of times. It's the worst of times. Well... not quite! However, it certainly is an age of technology offering immediate ROI, sky-high cost savings, and even a little magic that can add to an organization's bottom line. It's also a time when new technology wreaks havoc on data delivery when it's implemented without considering the additional traffic load it adds to the network. Come to think of it, global IP traffic is expected to increase 8-fold before the end of 2015. All of this makes it trickier to deliver data to the cloud, to a remote site, or even just out of the edge router.


When network engineers need to police and drop unwanted traffic, prioritize business traffic, and ensure data delivery, the answer is QoS, or Quality of Service. QoS can provide preferential treatment to desired traffic within your LAN, at the network edge, and even over the WAN if the ISP respects your QoS markings. ISPs have always used QoS to support their own (preferred) services or to offer better chances of delivery at a premium. While 'end-to-end QoS' in its real sense (from a system in your LAN, over the WAN, across peered links and multiple autonomous systems, to an endpoint sitting thousands of miles away) is challenging, it's wise to use QoS to ensure that your data at least reaches the PE device without packet loss, jitter, or errors.


Alright, now comes the fun part: implementing Cisco QoS! Some network engineers and SMBs are wary of implementing QoS for fear of breaking something that already works. But fear not; here is some help for beginners getting started with Cisco QoS and its design and implementation strategies.


QoS Design and Implementation:

QoS design consists of 3 strategies:

  • Best Effort: Default design with no differentiation or priority for any traffic. All traffic works under the best effort.
  • IntServ: A signaling protocol such as RSVP is used to tell routers along a path that an application or service needs QoS. Bandwidth is reserved for the application and cannot be re-allocated, even when that application is not in use.
  • DiffServ: The most widely used option. Allows a user to group traffic packets into classes and provide a desired level of service.


The choices for QoS implementation range from the traditional CLI and MQC to AutoQoS. For a beginner, the easiest approach is to start with a DiffServ design strategy and use Cisco's MQC (Modular QoS CLI) for implementation. MQC-based QoS configuration involves:

  • Class-Maps: Used to match and classify your traffic into groups, say web, peer-to-peer, business-critical, or however you think it should be classified. Traffic is classified into class-maps using match statements.
  • Policy-Maps: Describes the action to be taken on the traffic classified using class-maps. Actions can be to limit the bandwidth used by a class, queue the traffic, drop it, set a QoS value, and so forth.
  • Service-Policy: The last stage is to attach the policy-map to an interface on whose traffic you wish to perform the QoS actions defined earlier. The actions can be set to act on either Ingress or Egress traffic.

MQC QoS structure.png

Now, I would like to show you a sample configuration that puts unwanted traffic and a business application into two different classes and sets their priorities using IP precedence.


Creating class-maps to group traffic based on user requirements:

Rtr(config)#class-map match-any unwanted

Rtr(config-cmap)#match protocol ftp

Rtr(config-cmap)#match protocol gnutella

Rtr(config-cmap)#match protocol kazaa2


Rtr(config)#class-map videoconf

Rtr(config-cmap)#match protocol rtp


Associating the class-map to a policy and defining the action to be taken:

Rtr(config)#policy-map business

Rtr(config-pmap)#class unwanted

Rtr(config-pmap-c)#set precedence 0

Rtr(config-pmap)#class videoconf

Rtr(config-pmap-c)#set precedence 5


Assigning the policy to an interface:

Rtr(config)#interface Se0/0

Rtr(config-if)#service-policy output business


QoS Validation:

The next thought after implementation should be how to make sure the QoS policies you created are actually working: are they dropping the traffic they are supposed to, or are they affecting the performance of your business applications?


This is where Cisco's Class-Based QoS MIB, better known as CBQoS, steps in. SNMP-capable monitoring tools can collect information from the CBQoS MIB to report on pre- and post-policy statistics for every QoS policy on a device. CBQoS reports help determine the volume of traffic dropped or queued and confirm that the classification and marking of traffic are working as expected.
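
If you want a quick spot check from the router itself, before (or alongside) SNMP polling, the per-class counters for the policy applied earlier can be viewed with:

Rtr#show policy-map interface Se0/0

This lists each class, its match counters, and any drops, which is usually enough to confirm that traffic is being classified the way you intended.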


Well, that completes the basics of QoS using MQC and some implementation ideas for your network. While we covered QoS configuration using classification and marking in this post, there are more options, such as congestion management, congestion avoidance, and shaping, which we have not explored because they can be complex when you're starting out. Once you have the hang of QoS configuration using MQC, be sure to explore all the options for classifying and marking traffic before your first QoS implementation.


Good luck creating a better network!


I've held a number of different positions at companies of varying sizes, but one clearly stands out in my mind. Several years ago, I had the distinct pleasure of managing a very sharp IT team for a mid-sized call center. This was the first time I had ever officially managed a team, having traditionally been a very hands-on tech guy.


When I began my adventures in this role, the company had rapidly grown from a small operation into an environment with hundreds of staff members handling calls each shift. It also meant that the status quo for reporting and tracking issues (sending emails to a distribution list) would have to go; it simply didn't scale and had no way of providing the rich metadata that a help desk ticket brings to problem resolution.


I was challenged with fighting a system that was very simple for the end users but mostly worthless for IT. Broaching the subject of a ticketing system was met with tales of woe and claims that it "wouldn't work," based on past attempts. I felt, however, that introducing a help desk ticketing system simply required a few core principles to be successful:


  1. A simple process with as much abstraction away from "technical jargon" as possible.
  2. Buy-in and participation from the top echelons of the company; if the top brass were on board, their subordinates would follow suit.
  3. An empowered IT staff that could influence the look, feel, selection, and direction of the help desk system without fear of retribution or being iced out as wrong or "stupid."
  4. And, probably most important of all, faster resolution times for issues that went through the help desk, thanks to the synergies from related metadata (tracking hardware, software, and historical trends).

These were my four ideas, and I'll share how things went in my next blog post.


What about your ideas? Do you have a story to share that explains how you got your team on-board a new help desk system, and if so - how did you do it? :-)

Can something exist without being perceived?

No, this blog hasn't taken a turn into the maddening world of metaphysics. Not yet.


I'm talking about event and performance logging, naturally. In the infrastructure profession (well, I think it's a vocation, but I'll get to that in a later post), we're conditioned to set up logging for all of our systems. It's usually for the following reasons:

  1. You were told to do it.
  2. Security told you to do it.


So you dutifully, begrudgingly configure your remote log hosts, or you deploy your logging agents, or you do some other manner of configuration to enable logging to satisfy a requirement. And then you go about your job. Easy.

But what about that data? Where does it go? And what happens to it once it's there? Do you use any tools to exploit that data? Or does it just consume blocks on a spinning piece of rust in your data center? Have I asked enough rhetorical questions yet?
*  *  *
The pragmatic engineer seeks to acquire knowledge of all of her or his systems, and in times of service degradation or outage, such knowledge can reduce downtime. But knowledge of a system typically requires an understanding of "normal" performance. And that understanding can only come from the analysis of collected events and performance data.

If you send your performance data, for example, to a logging system that is incapable of presenting and analyzing that data, then what's the point of logging in the first place? If you can't put that data to work, and exploit the data to make informed decisions about your infrastructure, what's the point? Why collect data if you have no intent (or capacity) to use it?

Dashboarding for Fun and Profit (but mostly for Fun)

One great way to make your data meaningful is to present it in the only way that those management-types know: dashboards. It's okay if you just rolled your eyes. The word "dashboard" was murdered by marketing in the last 10 years. And what a shame, because we all stare at a dashboard while we're driving to and from work, and we likely don't realize how powerful it is to have all of the information we need to make decisions about how we drive right in front of us. The same should be true for your dashboards at work.

So here are a few tips for you, dear readers, to guide you in the creation of meaningful dashboards:

  1. Present the data you need, not the data you want. It's easy to start throwing every metric you have available at your dashboard. And most tools will allow you to do so. You certainly won't get an error that says, "dude, lay off the metrics." But just because you can display certain metrics, doesn't mean you should. For example, CPU and memory % utilization are dashboard stalwarts. Use them whenever you need a quick sense of health for a device. But do you really need to display your disk queue length for every system on the main dashboard? No.
  2. Less is more. Be selective not only in the types of data you present, but also in the quantity of data you present. Avoid filling every pixel with a gauge or bar chart; these aren't Victorian works, and horror vacui does not apply here. When you develop a dashboard, you're crossing into the realm of information architecture and design. Build your spaces carefully.
  3. Know your audience. You'll recall that I called out the "management-types" when talking about the intended audience for your dashboards. That was intentional. Hard-nosed engineers are often content with function over form; personally, I'll take a shell with grep, sed, and awk and I can make /var/log beg for mercy. But The Suits want form over function. So make the data work for them.
  4. Think services, not servers. When you spend 8 hours a day managing hosts and devices, you tend to think about the infrastructure as a collection of servers, switches, storage, and software. But dashboards should focus on the services that these devices, when cooperating, provide. Again, The Suits don't care if srvw2k8r2xcmlbx01 is running at 100% CPU; they care that email for the Director's office just went down.


Don't ignore the dashboard functionality of your monitoring solution just because you're tired of hearing your account rep say "dashboard" so many times that the word loses all meaning. When used properly, and with a little bit of work on your part, a dashboard can put all of that event and performance data to work.


Note: This post originally appeared at eager0.com.

