In my last post I looked at how flags can pull useful information out of packets that we might otherwise struggle to see. This time, we’re going to use tcpdump to look into the actual applications.
The first application I'm going to look at is the humble Domain Name System (DNS), the thing that needs to work flawlessly before any other application can get out of bed. Because DNS lookups are typically embedded in the OS IP stack, a packet capture is often the only way to get any visibility.
The NS in a Haystack
In my scenario, I suspect that something is amiss with DNS, but I'm not sure what. So to pick up just DNS traffic, we only need to capture with a simple filter:
[~] # tcpdump -i eth0 -pn port 53
In the above example, even with minimal options selected, we can see some really useful information. The built-in decoder pulls out the transaction ID from the client (equivalent to a session ID), the query type (A record) and the FQDN we are looking for. What is unusual in this example is that we can see not one, but two queries, about 5 seconds apart. Given that we are filtering on all port 53 traffic, we should have seen a reply. It would appear that my local DNS proxy (172.16.10.1) for some reason failed to respond. The client-side resolver timed out and tried the Google Public DNS. This may be a one-time event, but it certainly bears monitoring. If the client configuration has an unresponsive or unreliable DNS server as its first port of call, at the very least this will manifest in a frustrating browsing experience.
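To make it clearer where those fields come from, here is a minimal Python sketch of the kind of decoding described above: it reads the transaction ID, the query/response bit, the query type and the name from a raw DNS payload. It is not tcpdump's own decoder. The function name, the small type table and the sample transaction ID (0x1a2b) are all made up for illustration, and it assumes the question name is uncompressed, which is normal for the first name in a message.

```python
import struct

# Small, illustrative subset of DNS record types (not exhaustive).
QTYPES = {1: "A", 2: "NS", 5: "CNAME", 6: "SOA", 15: "MX", 28: "AAAA"}


def parse_dns_question(payload: bytes) -> dict:
    """Decode the DNS header and the first question from a raw UDP payload."""
    # The header is six big-endian 16-bit fields:
    # ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT.
    txid, flags, qdcount, _an, _ns, _ar = struct.unpack("!6H", payload[:12])
    if qdcount < 1:
        raise ValueError("message carries no question")

    # The name is a run of length-prefixed labels ending in a zero byte.
    labels = []
    pos = 12
    while True:
        length = payload[pos]
        pos += 1
        if length == 0:
            break
        labels.append(payload[pos:pos + length].decode("ascii"))
        pos += length

    qtype, _qclass = struct.unpack("!2H", payload[pos:pos + 4])
    return {
        "id": txid,
        "is_response": bool(flags & 0x8000),  # QR bit: 0 = query, 1 = reply
        "name": ".".join(labels),
        "qtype": QTYPES.get(qtype, str(qtype)),
    }


# Hand-built A query for google.co.uk with a hypothetical transaction ID.
sample_query = (
    struct.pack("!6H", 0x1A2B, 0x0100, 1, 0, 0, 0)
    + b"\x06google\x02co\x02uk\x00"
    + struct.pack("!2H", 1, 1)
)
decoded = parse_dns_question(sample_query)
```

Pairing each query's ID with a later message that has `is_response` set is, roughly, how you would spot an unanswered query like the one in the capture.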
Selection of the Fittest (and Fastest)
Selection of DNS servers is pretty important; I hadn't realised that my test Linux box was using Google as a secondary resolver. Whilst it is reliable, it’s actually four hops and a dozen milliseconds further away than my ISP service. When your broadband is as crappy as mine, every millisecond counts.
Anyway, as you can see, Google returns eight A records for google.co.uk; any of them should be fine.
Another thing to look for is what happens when we make an invalid query, or there is no valid response:
In this case we get an NXDomain (non-existent domain) error. This case is an obvious typo on my part, but if we turn up the logging with the very verbose (vv) switch the response is still interesting:
[~] # tcpdump -i eth0 -pnvv port 53
Highlighted above is the SOA (start of authority) record for the domain .ac.uk; this is as far as the server was able to chase the referral before it got the NXDomain response.
Edit - Contributor JSwan pointed out a small mistake; I've fixed the version below.
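For reference, the NXDomain status lives in the low four bits (the RCODE) of the 16-bit flags word in the DNS header. The sketch below decodes that word in Python; the function name and the returned keys are my own, and the table only covers the basic response codes.

```python
# Basic DNS response codes; extended codes are left out of this sketch.
RCODES = {
    0: "NoError",
    1: "FormErr",
    2: "ServFail",
    3: "NXDomain",
    4: "NotImp",
    5: "Refused",
}


def decode_flags(flags: int) -> dict:
    """Split the 16-bit DNS header flags word into a few useful fields."""
    rcode = flags & 0x000F
    return {
        "is_response": bool(flags & 0x8000),          # QR bit
        "authoritative": bool(flags & 0x0400),        # AA bit
        "recursion_desired": bool(flags & 0x0100),    # RD bit
        "recursion_available": bool(flags & 0x0080),  # RA bit
        "rcode": RCODES.get(rcode, str(rcode)),
    }


# 0x8183 = response, recursion desired and available, RCODE 3.
nxdomain_reply = decode_flags(0x8183)
```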
Whilst a bunch of stuff is revealed with very verbose enabled, not all of it is useful. One thing to look for is the IP time to live (TTL). Each router decrements it by one, so comparing it with the sender's likely starting value (commonly 64, 128 or 255) shows how many hops the packet has made since leaving the source. If the number is low, it can be an indicator of routing problems or high latency (I did say it wasn't very useful!).
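As a rough illustration of reading the IP TTL, the Python sketch below guesses the hop count by assuming the sender started from one of the common initial values (64, 128 or 255). That assumption is a heuristic of mine rather than anything tcpdump reports, and a sender using an unusual starting TTL will throw the estimate off.

```python
# Initial TTLs commonly used by operating systems and network devices.
COMMON_INITIAL_TTLS = (64, 128, 255)


def estimate_hops(observed_ttl: int) -> int:
    """Estimate hops travelled, assuming a common initial TTL."""
    if not 1 <= observed_ttl <= 255:
        raise ValueError("IP TTL must be between 1 and 255")
    # Pick the smallest common starting value the packet could have had.
    initial = min(t for t in COMMON_INITIAL_TTLS if t >= observed_ttl)
    return initial - observed_ttl


hops_for_ttl_57 = estimate_hops(57)  # assumes the sender started at 64
```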
Furthermore, the DNS protocol-specific TTL can be seen highlighted in yellow, after the serial number in date format. The DNS TTL specifies how long the client (or referring server) should cache the record before checking again. For static services such as mail, TTLs can be 24 hours or more. However, for dynamic web services this can be as low as 1 second. TTLs that low are not a great idea; they generate HUGE amounts of DNS traffic which can snowball out of control. The moral is, make sure that the TTLs you are getting (or setting) are appropriate to your use case. If you fail over to your backup data centre with a DNS TTL of a week, it will be a long time before all the caches are flushed.
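To put some numbers on the TTL trade-off, here is a toy Python model of a single caching client that looks up the same name once a second for a day, and only goes upstream when its cached answer has expired. The function name and the once-a-second workload are assumptions for illustration; real resolvers are more sophisticated.

```python
def count_upstream_queries(ttl_seconds: int, lookup_times) -> int:
    """Count lookups that miss the cache and have to go upstream."""
    expires_at = None
    upstream = 0
    for now in lookup_times:
        # Cache miss: nothing cached yet, or the cached record has expired.
        if expires_at is None or now >= expires_at:
            upstream += 1
            expires_at = now + ttl_seconds
    return upstream


one_day = range(86400)  # one lookup per second for 24 hours

queries_ttl_1s = count_upstream_queries(1, one_day)
queries_ttl_5m = count_upstream_queries(300, one_day)
queries_ttl_24h = count_upstream_queries(86400, one_day)
```

In this model a 1 second TTL sends every single lookup upstream, whereas a 24 hour TTL needs just one query for the whole day. The flip side is that the cached answer can be up to a full TTL out of date after a change.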
As JSwan points out in the comments, if you use the very very verbose switch (-vvv), for A records tcpdump will display the DNS TTL in hours, minutes and seconds:
Apparently Google has a very short TTL. Interestingly, tcpdump doesn't print the DNS TTL for an NXDOMAIN result, although it is still visible in the capture.
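If you are reading raw TTL values in seconds elsewhere (a zone file, say), a small helper makes them easier to eyeball. This Python sketch renders seconds as hours, minutes and seconds; the function name and the exact output style are mine, and tcpdump's own formatting may differ.

```python
def format_ttl(seconds: int) -> str:
    """Render a DNS TTL in seconds as a compact h/m/s string."""
    if seconds < 0:
        raise ValueError("TTL cannot be negative")
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    parts = []
    if hours:
        parts.append(f"{hours}h")
    if minutes:
        parts.append(f"{minutes}m")
    # Always show something, even for a TTL of zero.
    if secs or not parts:
        parts.append(f"{secs}s")
    return "".join(parts)


five_minutes = format_ttl(300)
one_day_ttl = format_ttl(86400)
```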
Why is capturing in context important?
Imagine trying to troubleshoot connectivity for a network appliance. You have configured IP addressing, routing and DNS, yet it cannot dial home to its cloud service. Unless the vendor has been really thorough in documenting their error messages, a simple DNS fault can leave you stumped. Once again, tcpdump saves the day, even on non-TCP traffic. The built-in protocol decoder gives vital clues as to what may be borking an otherwise simple communication.
In my next and final blog of this series, I’m going to look at another common protocol, HTTP.