
Geek Speak


I am fascinated that, after more than twenty years, the networking industry still deploys firewalls in most of our networks exactly the way it did back in the day. And for many networks, that's it. The reality is that the attack vectors today are different from what they were twenty years ago, and we now need something more than just edge security.

 

The Ferro Doctrine

Listeners to the Packet Pushers podcast may have heard the inimitable Greg Ferro expound on the idea that firewalls are worthless at this point because the cost of purchasing, supporting, and maintaining them exceeds the cost to the business of any data breach that might occur without them. To some extent, Greg has a point. After a breach has been cleaned up and the costs of investigation, fines, and compensatory actions have been taken into account, the numbers in many cases do seem to be quite close. With that in mind, if you're willing to bet on not being breached for a period of time, it might actually be a money-saving strategy to just wait and hope. There's a little more to this than meets the eye, however.

 

Certificate of Participation

It's all very well to argue that a firewall would not have prevented a breach (or delayed it any longer than it already took for a company to be breached), but I'd hate to be the person trying to make that argument to my shareholders, or (in the U.S.) the Securities and Exchange Commission or the Department of Health and Human Services, to pick a couple of random examples. At least if you have a firewall, you get to claim, "Well, at least we tried." As a parallel, imagine that two friends have their bicycles stolen from the local railway station where they had left them. One friend used a chain and padlock to secure their bicycle, but the other just left their bicycle there because the thieves can cut through the chain easily anyway. Which friend would you feel more sympathy for? The chain and padlock at least raised the barrier to entry so that only thieves with bolt cutters could get through.

 

The Nature Of Attacks

Greg's assertion that firewalls are not needed does have a subtle truth to it -- if it's coupled with the idea that some kind of port-based filtering at the edge is still necessary. But perhaps it doesn't need to be stateful and, typically, expensive. What if edge security were implemented on the existing routers using (by definition, stateless) access control lists instead? The obvious initial reaction might be to think, "Ah, but we must have session state!" Why? When's the last TCP sequence prediction attack you heard of? Maybe that was a long time ago because we have stateful firewalls, but maybe it's also because the attack surface has changed.

 

Once upon a time, firewalls protected devices from attacks on open ports, but I would posit that the majority of attacks today are focused on applications accessed via a legitimate port (e.g. tcp/80 or tcp/443), and thus a firewall does little more than increment a few byte and sequence counters as an application-layer attack is taking place. A quick glance at the OWASP 2017 Top 10 List release candidate shows the wide range of ways in which applications are being assaulted. (I should note that this release candidate, RC1, was rejected, but it's a good example of what's at stake even if some specifics change when it's finally approved.)

 

If an attack takes place using a port which the firewall will permit, how is the firewall protecting the business assets? Some web application security might help here too, of course.

 

Edge Firewalls Only Protect The Edge

Another change which has become especially prevalent in the last five years is the idea of using distributed security (usually firewalls!) to move the enforcement point down toward the servers. Once upon a time, it was sometimes necessary to do this simply because centralized firewalls did not scale well enough to cope with the traffic they were expected to handle. The obvious solution is to have more firewalls and place them closer to the assets they are being asked to protect.

 

Host-based firewalls are perhaps the ultimate in distributed firewalls, and whether implemented within the host or at the host edge (e.g. within a vSwitch or equivalent within a hypervisor), flows within a data center environment can now be controlled, preventing the spread of attacks between hosts. VMware's NSX is probably the most commonly seen implementation of a microsegmentation solution, but whether using NSX or another solution, the key to managing so many firewalls is to have a front end where policy is defined, then let the system figure out where to deploy which rules. It's all very well spinning up a Juniper cSRX (an SRX firewall implemented as a container), for example, on every virtualization host, but somebody has to configure the firewalls, and that's a task that, if performed manually, would rapidly spiral out of control.

 

Containers bring another level of security angst too since they can communicate with each other within a host. This has led to the creation of nanosegmentation security, which controls traffic within a host, at the container level.

 

Distributed firewalls are incredibly scalable because every new virtualization host can have a new firewall, which means that security capacity expands at the same rate as the compute capacity. Sure, licensing costs likely grow at the same rate as well, but it's the principle that's important.

 

Extending the distributed firewall approach to end-user devices isn't a bad idea either. Imagine how the spread of a worm like WannaCry could have been limited if the user host firewalls could have been configured to block SMB while the worm was rampant within a network.

 

Trusted Platforms

In God we trust; all others must pay cash. For all the efforts we make to secure our networks and applications, we are usually also making the assumption that the hardware on which our network and computers run is secure in the first place. After the many NSA revelations, I think many have come to question whether this is actually the case. To that end, trusted platforms have become available, where components and software are monitored all the way from the original manufacturer through to assembly, and the hardware/firmware is designed to identify and warn about any kind of tampering that may have been attempted. There's a catch here, which is that the customer always has to decide to trust someone, but I get the feeling that many people would believe a third-party company's claims of non-interference over a government's. If this is important to you, there are trusted compute platforms available, and now even some trusted network platforms with similar chain-of-custody procedures in place to help ensure legitimacy.

 

There's Always Another Tool

The good news is that security continues to be such a hot topic that there is no shortage of options when it comes to adding tools to your network (and there are many I have chosen not to mention here for the sake of brevity). There's no perfect security architecture, and whatever tools are currently running, there's usually another that could be added to fill a hole in the security stance. Many tools, at least the inline ones, add latency to the packet flows; it's unavoidable. In an environment where transaction speed is critical (e.g. high-speed trading), what's the trade-off between security and latency?

 

Does this mean that we should give up on in-depth security and go back to ACLs? I don't think so. However, a security posture isn't something that can be created once then never updated. It has to be a dynamic strategy that is updated based on new technologies, new threats, and budgetary concerns. Maybe at some point, ACLs will become the right answer in a given situation. It's also not usually possible to protect against every known threat, so every decision is going to be a balance between cost, staffing, risk, and exposure. Security will always be a best effort given the known constraints.

 

We've come so far since the early firewall days, and it looks like things will continue changing, refreshing, and improving going forward as well. Today's security is not your mama's security architecture, indeed.

I'm not aware of an antivirus product for network operating systems, but in many ways, our routers and switches are just as vulnerable as a desktop computer. So, why don't we all protect them in the same way as our compute assets? In this post, I'll look at some basic tenets of securing the network infrastructure that underpins the entire business.

 

Authentication, authorization, and accounting (AAA)

Network devices intentionally leave themselves open to user access, so controlling who can get past the login prompt (authentication) is a key part of securing devices. Once logged in, it's important to control what a user can do (authorization). Ideally, what the user does should also be logged (accounting).

 

Local accounts are bad, mkay?

Local accounts (those created on the device itself) should be limited solely to backup credentials that allow access when the regular authentication service is unavailable. The password should be complex and changed regularly. In highly secure networks, access to the password should be restricted (kind of a "break glass for password" concept). Local accounts don't automatically disable themselves when an employee leaves, and far too often, I've seen accounts still active on devices for users who left the company years ago, with some of those accessible from the internet. Don't do it.

 

Use a centralized authentication service

If local accounts are bad, then the alternative is to use an authentication service like RADIUS or TACACS. Ideally, those services should, in turn, defer authentication to the company's existing authentication service, which, in most cases, is Microsoft Active Directory (AD) or a similar LDAP service. This not only makes it easier to manage who has access in one place, but by using things like AD groups, it's possible to determine not just who is allowed to authenticate successfully, but what access rights they will have once logged in. The final, perhaps obvious, benefit is that granting a user access in one place (AD) implicitly grants them access to all network devices.

 

The term process

A term (termination) process defines the list of steps to be taken when an employee leaves the company. While many of the steps relate to HR and payroll, the network team should also have a well-defined term process to help ensure that after a network employee leaves, things such as local fallback admin passwords are changed, or perhaps SNMP read/write strings are changed. The term process should also include disabling the employee's Active Directory account, which will also lock them out of all network devices because we're using an authentication service that authenticates against AD. It's magic! This is a particularly important process to have when an employee is terminated by the company, or may for any other reason be disgruntled.

 

Principle of least privilege

One of the basic security tenets is the principle of least privilege, which, in basic terms, says: don't give people access to things unless they actually need it; default to giving no access at all. The same applies to network device logins, where users should be mapped to the privilege group that allows them to meet their (job) goals, while not granting permissions to do anything for which they are not authorized. For example, a NOC team might need read-only access to all devices to run show commands, but they likely should not be making configuration changes. If that's the case, one should ensure that the NOC AD group is mapped to have only read-only privileges.

 

Command authorization

Command authorization is a long-standing security feature of Cisco's TACACS+, and while sometimes painful to configure, it can allow granular control of issued commands. It's often possible to configure command filtering within the network OS configuration, often by defining privilege levels or user classes at which a command can be issued, and using RADIUS or TACACS to map the user to that group or user class at login. One company I worked for created a "staging" account on Juniper devices, which allowed the user to enter configuration mode and enter commands, and allowed the user to run commit check to validate the configuration, but did not allow an actual commit to make the changes active on the device. This provided a safe environment in which to validate proposed changes without the risk of the user forgetting to add check to their commit statement. Juniper users: tell me I'm not the only one who ever did that, right?

 

Command accounting

This one is simple: log everything that happens on a device. More than once in the past, we have found the root cause of an outage by checking the command logs on a device and confirming that, contrary to the claimed innocence of the engineer concerned, they actually did log in and make a change (without change control either, naturally). In the wild, I see command accounting configured on network devices far less often than I would have expected, but it's an important part of a secure network infrastructure.

 

Network time protocol (NTP)

It's great to have logs, but if the timestamps aren't accurate, it's very difficult to align events from different devices to analyze a problem. Every device should use NTP to ensure it has an accurate clock. Additionally, I advise choosing one time zone for all devices—servers included—and sticking to it. Configuring each device with its local time zone sounds like a good idea until, again, you're trying to put those logs together, and suddenly it's a huge pain. Typically, I lean towards UTC (Coordinated Universal Time, despite the letters being in the wrong order), mainly because it does not implement summer time (daylight saving time), so it's consistent all year round.

 

Encrypt all the things

Don't allow telnet to the device if you can use SSH instead. Don't run an HTTP server on the device if you can run HTTPS instead. Basically, if it's possible to avoid using an unencrypted protocol, that's the right choice. Don't just enable the encrypted protocol; go back and disable the unencrypted one. If you can run SSHv2 instead of SSHv1, you know what to do.

 

Password all the protocols

Not all protocols implement passwords perfectly, with some treating them more like SNMP strings. Nonetheless, consider using passwords (with MD5 at a minimum, or something stronger where the protocol supports it) on any network protocols that support them, e.g., OSPF, BGP, EIGRP, NTP, VRRP, HSRP.

 

Change defaults

If I catch you with SNMP strings of public and private, I'm going to send you straight to the principal's office for a stern talking to. Seriously, this is so common and so stupid. It's worth scanning servers as well for this; quite often, if SNMP is running on a server, it's running the defaults.

 

Control access sources

Use the network operating system's features to control who can connect to the devices in the first place. This may take the form of a simple access list (e.g., a vty access-class in Cisco speak) or could fall within a wider Control Plane Policing (CoPP) policy, where the control for any protocol can be implemented. Access Control Lists (ACLs) aren't in themselves secure, but they are another step to overcome for any bad actor wishing to illicitly connect to the devices. If there are bastion management devices (aka jump boxes), perhaps make only those devices able to connect. Restrict from where SNMP commands can be issued. This all applies doubly for any internet-facing devices, where such protections are crucial. Don't allow management connections to a network device on an interface with a public IP. Basically, protect yourself at the IP layer as well as with passwords and AAA.

 

Ideally, all devices would be managed using their dedicated management ports, accessed through a separate management network. However, not everybody has the funding to build an out-of-band management network, and many are reliant on in-band access.

 

Define security standards and audit yer stuff

It's really worth creating a standard security policy (with reference configurations) for the network devices, and then periodically auditing the devices against it. If a device goes out of compliance, is that a mistake, or did somebody intentionally weaken the device's security posture? Either way, just because a configuration was implemented once, it would be risky to assume it has remained in place ever since, so a regular check is worthwhile.

 

Remember why

Why are we doing all of this? The business runs over the network. If the network is impacted by a bad actor, the business can be impacted in turn. These steps are one part of a layered security plan; by protecting the underlying infrastructure, we help maintain the availability of the applications. Remember the security CIA triad (Confidentiality, Integrity, and Availability)? The steps I have outlined above, along with many more I could mention, help maintain network availability and ensure that the network is not compromised. This means that we have a higher level of trust that the data we entrust to the network transport is not being siphoned off or altered in transit.

 

What steps do you take to keep your infrastructure secure?

Whatever business one might choose to examine, the network is the glue that holds everything together. Whether the network is the product (e.g. for a service provider) or simply an enabler for business operations, it is extremely important for the network to be both fast and reliable.

 

IP telephony and video conferencing have become commonplace, taking communications that previously required dedicated hardware and phone lines and moving them to the network. I have also seen many companies mothball their dedicated Storage Area Networks (SANs) and move toward Network Attached Storage, using iSCSI and NFS for data mounts. I also see applications utilizing cloud-based storage provided by services like Amazon's S3, which also depend on the network to move the data around. Put simply, the network is critical to modern companies.

 

Despite the importance of the network, many companies seem to have only a very basic understanding of their own network performance even though the ability to move data quickly around the network is key to success. It's important to set up monitoring to identify when performance is deviating from the norm, but in this post, I will share a few other thoughts to consider when looking at why network performance might not be what people expect it to be.

 

MTU

MTU (Maximum Transmission Unit) determines the largest frame of data that can be sent over an ethernet interface. It's important because every frame that's put on the wire contains overhead; that is, data that is not the actual payload. A typical ethernet interface might default to a physical MTU of around 1518 bytes, so let's look at how that might compare to a system that offers an MTU of 9000 bytes instead.

 

What's in a frame?

A typical TCP datagram has overhead like this:

 

  • Ethernet header (14 bytes)
  • IPv4 header (20 bytes)
  • TCP header (usually 20 bytes, up to 60 if TCP options are in play)
  • Ethernet Frame Check Sum (4 bytes)

 

That's a total of 58 bytes. The rest of the frame can be data itself, so that leaves 1460 bytes for data. The overhead for each frame represents just under 4% of the transmitted data.

 

The same frame with a 9000 byte MTU can carry 8942 bytes of data with just 0.65% overhead. Less overhead means that the data is sent more efficiently, and transfer speeds can be higher. Enabling jumbo frames (frames larger than 1500 bytes) and raising the MTU to 9000 if the hardware supports it can make a huge difference, especially for systems moving a lot of data around the network, such as the Network Attached Storage.
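Treating the MTU figures as whole frame sizes, as the description above does, the arithmetic is easy to play with. Here's a quick back-of-the-envelope sketch in Python using the header sizes listed earlier (no TCP options, no extra encapsulations):

    # Frame overhead comparison, assuming 14B Ethernet + 20B IPv4 + 20B TCP + 4B FCS.
    OVERHEAD = 14 + 20 + 20 + 4  # bytes of non-payload data per frame

    def payload_and_overhead(frame_size):
        """Return (payload bytes, overhead as a % of the frame) for a given frame size."""
        payload = frame_size - OVERHEAD
        return payload, OVERHEAD / frame_size * 100

    for frame in (1518, 9000):
        payload, pct = payload_and_overhead(frame)
        print(f"{frame}-byte frame: {payload} bytes of payload, {pct:.1f}% overhead")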

 

What's the catch?

Not all equipment supports a high MTU because it's hardware dependent, although most modern switches I've seen can handle 9000-byte frames reasonably well. Within a data center environment, large MTU transfers can often be achieved successfully, with positive benefits to applications as a result.

 

However, Wide Area Networks (WANs) and the internet are almost always limited to 1500 bytes, and that's a problem because those 9000-byte frames won't fit into 1500 bytes. In theory, a router can break large packets up into appropriately sized smaller chunks (fragments) and send them over links with a reduced MTU, but many firewalls are configured to block fragments, many routers won't fragment at all (and can't when the Don't Fragment bit is set), and fragmentation pushes the burden onto the receiver, which has to hold all the fragments until the last one arrives and reassemble the packet before it can be processed. The solution to this is PMTUD (Path MTU Discovery). When a packet doesn't fit on a link without being fragmented, the router can send an ICMP message back to the sender saying, It doesn't fit; the MTU is... Great! Unfortunately, many firewalls have not been configured to allow those ICMP messages back in, for a variety of technical or security reasons, but with the ultimate result of breaking PMTUD. One way around this is to use one ethernet interface on a server for traffic internal to a data center (like storage) using a large MTU, and another interface with a smaller MTU for all other traffic. Messy, but it can help if PMTUD is broken.

 

Other encapsulations

The ethernet frame encapsulations don't end there. Don't forget there might be an additional 4 bytes required for VLAN tagging over trunk links, VXLAN encapsulation (around 50 bytes), MPLS labels (4 bytes each), and maybe even GRE (a 4-byte header plus a new outer IP header). I've found that despite the slight increase in the ratio of overhead to data, 1460 bytes is a reasonably safe MTU for most environments, but it's very dependent on exactly how the network is set up.

 

Latency

I had a complaint one time that while file transfers between servers within the New York data center were nice and fast, when the user transferred the same file to the Florida data center (basically going from near the top to the bottom of the Eastern coast of the United States) transfer rates were very disappointing, and they said the network must be broken. Of course, maybe it was, but the bigger problem without a doubt was the time it took for an IP packet to get from New York to Florida, versus the time it takes for an IP packet to move within a data center.

 

AT&T publishes a handy chart showing its current U.S. network latencies between pairs of cities. The New York to Orlando route currently shows a latency of 33ms, which is about what we were seeing on our internal network as well. Within a data center, I can move data in a millisecond or less, which is 33 times faster. What many people forget is that when using TCP, it doesn't matter how much bandwidth is available between two sites. A combination of end-to-end latency and congestion window (CWND) size will determine the maximum throughput for a single TCP session.
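That relationship is easy to put numbers on: a single TCP session's throughput is roughly bounded by window size divided by round-trip time. A small sketch, using the 33ms figure above and a 64 KB window purely as an illustrative value (roughly the classic limit without TCP window scaling):

    def max_tcp_throughput(window_bytes, rtt_seconds):
        """Rough upper bound on a single TCP session: window / RTT, in bits per second."""
        return window_bytes * 8 / rtt_seconds

    window = 64 * 1024  # illustrative: a 64 KB window
    for rtt_ms in (1, 33):
        bps = max_tcp_throughput(window, rtt_ms / 1000)
        print(f"RTT {rtt_ms:>2} ms -> at most {bps / 1e6:.1f} Mbps per session")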

 

TCP session example

If it's necessary to transfer 100,000 files from NY to Orlando, which is faster:

 

  1. Transfer the files one by one?
  2. Transfer ten files in parallel?

 

It might seem that the outcome would be the same because a server with a 1G connection can only transfer 1Gbps, so whether you have one stream at 1Gbps or ten streams at 100Mbps, it's the same result. But actually, it isn't, because the latency between the two sites will effectively limit the maximum bandwidth of each file transfer's TCP session. Therefore, to maximize throughput, it's necessary to utilize multiple parallel TCP streams (an approach taken very successfully for FTP/SFTP transfers by the open source FileZilla tool). It's also one reason that tools like those from Aspera, which additionally sidestep TCP with their own UDP-based transport, can move data faster than a regular Windows file copy.

 

The same logic also applies to web browsers, which typically will open five or six parallel connections to a single site if there are sufficient resource requests to justify it. Of course, each TCP session requires a certain amount of overhead for connection setup: usually a three-way handshake, and if the session is encrypted, a certificate or similar exchange to deal with as well. Another optimization that is available here is pipelining.

 

Pipelining

Pipelining uses a single TCP connection to issue multiple requests back to back. The building block for this in HTTP is the persistent connection, signaled by the Connection: keep-alive header and the default behavior in HTTP/1.1. This asks the destination server to keep the TCP connection open after completing the HTTP request in case the client has another request to make. Being able to do this allows the transfer of multiple resources with only a single TCP connection overhead (or, more accurately, one connection overhead per parallel connection). Given that a typical web page may make many tens of calls to the same site (50+ is not unusual), this efficiency stacks up quite quickly. There's another benefit too, and that's the avoidance of TCP slow start.
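In application code, getting that benefit is often just a matter of reusing a client object. As a small illustration (not from the original post), Python's requests library reuses the underlying TCP and TLS connection across requests made through a Session; the URLs below are placeholders:

    import requests

    # A Session reuses the underlying TCP (and TLS) connection across requests,
    # which is the same win as HTTP keep-alive: one connection setup, many requests.
    urls = [f"https://example.com/asset/{i}" for i in range(50)]  # placeholder URLs

    with requests.Session() as session:
        for url in urls:
            response = session.get(url)
            print(url, response.status_code, len(response.content))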

 

TCP slow start

TCP is a reliable protocol. If a datagram (packet) is lost in transit, TCP can detect the loss and resend the data. To protect itself against unknown network conditions, however, TCP starts off each connection being fairly cautious about how much data it can send to the remote destination before getting confirmation back that each sent datagram was received successfully. With each successful loss-free confirmation, the sender exponentially increases the amount of data it is willing to send without a response, increasing the value of its congestion window (CWND). Packet loss causes CWND to shrink again, as does an idle connection during which TCP can't tell if network conditions changed, so to be safe it starts from a smaller number again. The problem is, as latency between endpoints increases, it takes progressively longer for TCP to get to its maximum CWND value, and thus longer to achieve maximum throughput. Pipelining can allow a connection to reach maximum CWND and keep it there while pushing multiple requests, which is another speed benefit.
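A toy model makes the latency effect visible. The sketch below ignores ssthresh, delayed ACKs, and loss, and assumes an initial window of ten 1460-byte segments (a common modern default), just to show how the ramp-up time scales with RTT:

    def rtts_to_reach(target_bytes, initial_cwnd=10 * 1460):
        """Round trips for slow start to grow CWND from ~10 segments to the target."""
        cwnd, rounds = initial_cwnd, 0
        while cwnd < target_bytes:
            cwnd *= 2          # CWND roughly doubles each RTT during slow start
            rounds += 1
        return rounds

    target = 1_000_000  # ~1 MB in flight needed to fill a fast, long-distance path
    rounds = rtts_to_reach(target)
    for rtt_ms in (1, 33):
        print(f"RTT {rtt_ms:>2} ms: ~{rounds} round trips, ~{rounds * rtt_ms} ms before full speed")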

 

Compression

I won't dwell on compression other than to say that it should be obvious that transferring compressed data is faster than transferring uncompressed data. For proof, ask any web browser or any streaming video provider.

 

Application vs network performance

Much of the TCP tuning and optimization that can take place is a server OS/application layer concern, but I mention it because even on the world's fastest network, an inefficiently designed application will still run inefficiently. If there is a load balancer front-ending an application, it may be able to do a lot to improve performance for a client by enabling compression or Connection: keep-alive, for example, even when an application does not.

 

Network monitoring

In the network itself, for the most part, things just work. And truthfully, there's not much one can do to make it work faster. However, the network devices should be monitored for packet loss (output drops, queue drops, and similar). One of the bigger causes of this is microbursting.

 

Microbursting

Modern servers are often connected using 10Gbps ethernet, which is wonderful except they are often over-eager to send out frames. Data is prepared and buffered by the server, then BLUURRRGGGGHH it is spewed at the maximum rate into the network. Even if this burst of traffic is relatively short, at 10Gbps it can fill a port's frame buffer and overflow it before you know what's happened, and suddenly the latter datagrams in the communication are being dropped because there's no more space to receive them. Any time the switch can't move frames from the input port to the output port at least as fast as they are arriving, the input buffer comes into play and is at risk of being overfilled. These are called microbursts because a lot of data is sent over a very short period. Short enough, in fact, for it to be highly unlikely that it will ever be identifiable in the interface throughput statistics that we all like to monitor. Remember, an interface running at 100% for half the time and 0% for the rest will likely show up as running at 50% capacity in a monitoring tool. What's the solution? MOAR BUFFERZ?! No.
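Before getting to why more buffers aren't the answer, it helps to put rough numbers on how quickly a buffer fills during a burst. All the figures in this sketch are illustrative assumptions (buffer size, speeds, and the 10G-into-1G scenario), not measurements from any particular switch:

    # Illustrative numbers only: a 10G sender bursting toward a 1G-attached receiver
    # through a small shared buffer, a classic microburst scenario.
    ingress_bps = 10e9
    egress_bps = 1e9
    buffer_bytes = 1e6  # assume 1 MB of usable buffer

    fill_rate_Bps = (ingress_bps - egress_bps) / 8   # net bytes queued per second
    time_to_drop = buffer_bytes / fill_rate_Bps      # seconds until the buffer overflows
    print(f"Buffer overflows after ~{time_to_drop * 1000:.2f} ms of burst")

    # Meanwhile, an averaged utilization counter barely notices:
    burst_bytes = ingress_bps / 8 * time_to_drop
    print(f"That burst is ~{burst_bytes / 1e6:.1f} MB, a rounding error in a 5-minute average")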

 

Buffer bloat

I don't have space to go into detail here, so let me point you to a site that explains buffer bloat, and why it's a problem. The short story is that adding more buffers in the path can actually make things worse because it actively works against the algorithms within TCP that are designed to handle packet loss and congestion issues.

 

Monitor capacity

It sounds obvious, but a link that is fully utilized will lead to slower network speeds, whether through higher delays via queuing, or packet loss leading to connection slowdowns. We all monitor interface utilization, right? I thought so.

 

The perfect network

There is no perfect network, let's be honest. However, having an understanding not only of how the network itself (especially latency) can impact throughput, but also of the way the network is used by the protocols running over it, might help with the next complaint that comes along. Optimizing and maintaining network performance is rarely a simple task, but given the network's key role in the business as a whole, the more we understand, the more we can deliver.

 

While not a comprehensive guide to all aspects of performance, I hope that this post might have raised something new, confirmed what you already know, or just provided something interesting to look into a bit more. I'd love to hear your own tales of bad network performance reports, application design stupidity, crazy user/application owner expectations (usually involving packets needing to exceed the speed of light) and hear how you investigated and hopefully fixed them!

It sounds obvious, perhaps, but without configurations, our network, compute, and storage environments won't do very much for us. Configurations develop over time as we add new equipment, change architectures, improve our standards, and deploy new technologies. The sum of knowledge within a given configuration is quite high. Despite that, many companies still don't have any kind of configuration management in place, so in this article, I will outline some reasons why configuration management is a must, and look at some of the benefits that come with having it.

 

Recovery from total loss

As THWACK users, I think we're all pretty technically savvy, yet if I were to ask right now if you had an up-to-date backup of your computer and its critical data, what would the answer be? If your laptop's hard drive died right now, how much data would be lost after you replaced it?

 

Our infrastructure devices are no different. Every now and then a device will die without warning, and the replacement hardware will need to have the same configuration that the (now dead) old device had. Where's that configuration coming from?

 

Total loss is perhaps the most obvious reason to have a system of configuration backups in place. Configuration management is an insurance policy against the worst eventuality, and it's something we should all have in place, whether through home-grown scripts or a dedicated configuration management product.

 

 

At a minimum, having the current configuration safely stored on another system is of value. Some related thoughts on this:

 

  • Make sure you can get to the backup system when a device has failed.
  • Back up / mirror / help ensure redundancy of your backup system.
  • If "rolling your own scripts," make sure that, say, a failed login attempt doesn't overwrite a valid configuration file (he said, speaking from experience). In other words, some basic validation is required to make sure that the script output is actually a configuration file and not an error message.

 

Archives

Better than a copy of the current configurations, a configuration archive tracks all -- or some number of -- the previous configurations for a device.

 

An archive gives us the ability to see what changes occurred to the configuration and when. If a device doesn't support configuration rollback natively, it may be possible to create a kind of rollback script based on the difference between the two latest configurations. If the configuration management tool (or other systems) can react to SNMP traps to indicate a configuration change, the archive can be kept very current by triggering a grab of the configuration as soon as a change is noted.

 

Further, home-grown scripts or configuration management products can easily identify device changes and generate notifications and alerts when changes occur. This can provide an early warning of unauthorized configurations or changes made outside scheduled maintenance windows.
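A home-grown version of that change detection can be as simple as diffing the newest archived copy against the previous one; the file names below are hypothetical:

    import difflib
    from pathlib import Path

    def config_changes(previous, current):
        """Return a unified diff between two archived configurations (empty string if identical)."""
        old = Path(previous).read_text().splitlines(keepends=True)
        new = Path(current).read_text().splitlines(keepends=True)
        return "".join(difflib.unified_diff(old, new, fromfile=previous, tofile=current))

    diff = config_changes("archive/core1_previous.cfg", "archive/core1_latest.cfg")
    if diff:
        print("Configuration changed outside a known maintenance window?")
        print(diff)  # in practice, email or page this to the team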

 

Compliance / Audit

Internal Memo

We need confirmation that all your devices are sending their syslogs to these seventeen IP addresses.

 

-- love from, Your Friendly Internal Security Group xxx

"Putting the 'no' in Innovation since 2003"

 

A request like this can be approached in a couple of different ways. Without configuration management, it's necessary to log in to each device and check the syslog server configuration. With a collection of stored configurations, however, checking this becomes a matter of processing configurations files. Even grepping them could extract the necessary information. I've written my own tools to do the same thing, using configuration templates to allow support for the varying configuration stanzas used by different flavors of vendor and OS to achieve the same thing.
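Even the grep approach can be dressed up with a few lines of script. In the sketch below, the directory layout and the required server list are invented for illustration, and the matching is Cisco-IOS-flavored; other vendors and OSes need their own pattern, which is exactly where per-platform templates come in:

    from pathlib import Path

    REQUIRED_SYSLOG_SERVERS = {"192.0.2.11", "192.0.2.12"}   # illustrative addresses

    for config_file in Path("backups").glob("*.cfg"):
        # Take the last word of every "logging" line. Extra matches (e.g.
        # "logging buffered 4096") are harmless, because we only check
        # whether the required servers are missing.
        configured = {
            line.split()[-1]
            for line in config_file.read_text().splitlines()
            if line.startswith("logging ")
        }
        missing = REQUIRED_SYSLOG_SERVERS - configured
        if missing:
            print(f"{config_file.stem}: missing syslog servers {sorted(missing)}")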

 

Some tools — SolarWinds NCM is one of them — can also compare the latest configuration against a configuration snippet and report back on compliance. This kind of capability makes configuration audits extremely simple.

 

Even without a security group making requests, the ability to audit configurations against defined standards is an important capability to have. Having discussed the importance of configuration consistency, it seems like a no-brainer to want a tool of some sort to help ensure that the carefully crafted standards have been applied everywhere.

 

Pushing configuration to devices

I'm never quite sure whether the ability to issue configuration commands to devices falls under automation or configuration management, but I'll mention it briefly here since NCM includes this capability. I believe I've said in a previous Geek Speak post that it's abstractions that are most useful to most of us. I don't want to write the code to log into a device and deal with all the different prompts and error conditions. Instead, I'd much rather hand off to a tool that somebody else wrote and say, Send this. Lemme know how it goes. If you have the ability to do that and you aren't the one who has to support it, take that as a win. And while you're enjoying the golden trophy, give some consideration to my next point.

 

Where is your one true configuration source?

Why do we fall into the trap of using hardware devices as the definitive source of each configuration? Bearing in mind that most of us claim that we're working toward building a software-defined network of some sort, it does seem odd that the configuration sits on the device. Why does it not sit in a database or other managed repository, with the device programmed from the latest approved configuration in that repo?

 

Picture this for example:

 

  • Configurations are stored in a git repo
  • Network engineers fork the repo so they have a local copy
  • When a change is required, the engineer makes the necessary changes to their fork, then issues a pull request back to the main repo.
  • Pull requests can be reviewed as part of the Change Control process, and if approved, the pull-request is accepted and merged into the configuration.
  • The repo update triggers the changes to be propagated to the end device

 

Such a process would give us a configuration archive with a complete (and commented) audit trail for each change made. Additionally, if the device fails, the latest configuration is in the git repo, not on the device, so by definition, it's available for use when setting up the replacement device. If you're really on the ball, it may be possible to do some form of integration testing/syntax validation of the change prior to accepting the pull request.
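The last step in that list (the repo update triggering deployment) might be as simple as a CI job along the lines of the sketch below. The file naming convention, credentials, and inventory handling are assumptions, and this version merges configuration rather than doing the full replace discussed next:

    import subprocess
    from pathlib import Path
    from netmiko import ConnectHandler

    # Which device configs changed in the merge that triggered this job?
    changed = subprocess.run(
        ["git", "diff", "--name-only", "HEAD~1", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.split()

    for path in changed:
        if not path.endswith(".cfg"):
            continue
        hostname = Path(path).stem                  # assumes files are named <hostname>.cfg
        config_lines = Path(path).read_text().splitlines()

        conn = ConnectHandler(device_type="cisco_xr", host=hostname,
                              username="deploy", password="secret")
        conn.send_config_set(config_lines)          # note: this merges, it doesn't replace
        conn.commit()                               # IOS XR: activate the candidate config
        conn.disconnect()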

 

There are some gotchas with this, not the least of which is that going from a configuration diff to something you can safely deploy on a device may not be as straightforward as it first appears. That said, thanks to commands like Junos' load replace and load override and IOS XR's commit replace, such things are made a little easier.

 

The point of this is not really to get into the implementation details, but more to raise the question of how we think about network device configurations in particular. Compute teams get it; using tools like Puppet and Chef to build and maintain the state of a server OS, it's possible to rebuild an identical server. The same applies to building images in Docker: the configuration doesn't live only inside the image, because it's defined in the Dockerfile, from which an identical image can be rebuilt at any time. So why not network devices, too? I'm sure you'll tell me, and I welcome it.

 

Get. Configuration. Management. Don't risk being the person everybody feels pity for after their hard drive crashes.

As a network engineer, I don't think I've ever had the pleasure of having every device configured consistently in a network. But what does that even mean? What is consistency when we're potentially talking about multiple vendors and models of equipment?

 

There Can Only Be One (Operating System)

 

Claim: For any given model of hardware there should be one approved version of code deployed on that hardware everywhere across an organization.

 

Response: And if that version has a bug, then all your devices have that bug. This is the same basic security paradigm that leads us to have multiple firewall tiers comprising different vendors for extra protection against bugs in one vendor's code. I get it, but it just isn't practical. The reality is that it's hard enough upgrading device software to keep up with critical security patches, let alone doing so while maintaining multiple versions of code.

Why do we care? Because different versions of code can behave differently. Default command options can change between versions; previously unavailable options and features are added in new versions. Basically, having a consistent revision of code running means that you have a consistent platform on which to make changes. In most cases, that is probably worth the relatively rare occasions on which a serious enough bug forces an emergency code upgrade.

 

Corollary: The approved code version should be changing over time, as necessitated by feature requirements, stability improvements, and critical bugs. To that end, developing a repeatable method by which to upgrade code is kind of important.

 

Consistency in Device Management

 

Claim: Every device type should have a baseline template that implements a consistent management and administration configuration, with specific localized changes as necessary. For example, a template might include:

 

  • NTP / time zone
  • Syslog
  • SNMP configuration
  • Management interface ACLs
  • Control plane policing
  • AAA (authentication, authorization, and accounting) configuration
  • Local account if AAA authentication server fails*

 

(*) There are those who would argue, quite successfully, that such a local account should have a password unique to each device. The password would be extracted from a secure location (a break glass type of repository) on demand when needed and changed immediately afterward to prevent reuse of the local account. The argument is that if a shared password is compromised, every device is left exposed. I agree, and I tip my hat to anybody who successfully implements this.

 

Response: Local accounts are for emergency access only because we all use a centralized authentication service, right? If not, why not? Local accounts for users are a terrible idea, and have a habit of being left in place for years after a user has left the organization.

 

NTP is a must for all devices so that syslog/SNMP timestamps are synced up. Choose one timezone (I suggest UTC) and implement it on your devices worldwide. Using a local time zone is a guaranteed way to mess up log analysis the first time a problem spans time zones; whatever time zone makes the most sense, use it, and use it everywhere. The same time zone should be configured in all network management and alerting software.

 

Other elements of the template are there to make sure that the same access is available to every device. Why wouldn't you want to do that?

 

Corollary: Each device and software version could have its own limitations, so multiple templates will be needed, adapted to the capabilities of each device.
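As a minimal illustration of the template idea, here's one way a Cisco-IOS-flavored baseline might be rendered with Jinja2. The template body and variable values are invented for the example, and per the corollary above, each device family would get its own variant:

    from jinja2 import Template

    BASELINE = Template("""\
    clock timezone UTC 0
    ntp server {{ ntp_server }}
    logging host {{ syslog_server }}
    snmp-server community {{ snmp_community }} RO {{ mgmt_acl }}
    """)

    # Site-specific values plugged into the common baseline.
    print(BASELINE.render(
        ntp_server="192.0.2.10",
        syslog_server="192.0.2.11",
        snmp_community="n0tPublic",
        mgmt_acl="MGMT-ONLY",
    ))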

 

Naming Standards

 

Claim: Pick a device naming standard and stick with it. If it's necessary to change it, go back and change all the existing devices as well.

 

Response: I feel my hat tipping again, but in principle this is a really good idea. I did work for one company where all servers were given six-letter dictionary words as their names, a policy driven by the security group who worried that any kind of semantically meaningful naming policy would reveal too much to an attacker. Fair play, but having to remember that the syslog servers are called WINDOW, BELFRY, CUPPED, and ORANGE is not exactly friendly. Particularly in office space, it can really help to be able to identify which floor or closet a device is in. I personally lean toward naming devices by role (e.g. leaf, access, core, etc.) and never by device model. How many places have switches called Chicago-6500-01 or similar? And when you upgrade that switch, what happens? And is that 6500 a core, distribution, access, or maybe a service-module switch?

 

Corollary: Think the naming standard through carefully, including giving thought to future changes.

 

Why Do This?

 

There are more areas that could and should be consistent. Maybe consider things like:

 

  • an interface naming standard
  • standard login banners
  • routing protocol process numbers
  • vlan assignments
  • CDP/LLDP
  • BFD parameters
  • MTU (oh my goodness, yes, MTU)

 

But why bother? Consistency brings a number of obvious operational benefits.

 

  • Configuring a new device using a standard template means a security baseline is built into the deployment process
  • Consistent administrative configuration reduces the number of devices which, at a critical moment in troubleshooting, turn out to be inaccessible
  • Logs and events are consistently and accurately timestamped
  • Things work, in general, the same way everywhere
  • Every device looks familiar when connecting
  • Devices are accessible, so configurations can be backed up into a configuration management tool, and changes can be pushed out, too
  • Configuration audit becomes easier

 

The only way to know if the configurations are consistent is to define a standard and then audit against it. If things are set up well, such an audit could even be automated. After a software upgrade, run the audit tool again to help ensure that nothing was lost or altered during the process.

 

What does your network look like? Is it consistent, or is it, shall we say, a product of organic growth? What are the upsides -- or downsides -- to consistency like this?

You may be wondering why, after creating four blog posts encouraging non-coders to give it a shot, select a language, and break down a problem into manageable pieces, I would now say to stop. The answer is simple, really: not everything is worth automating (unless, perhaps, you are operating at a similar scale to somebody like Amazon).

 

The 80-20 Rule

 

Here's my guideline: figure out what tasks take up the majority (i.e. 80%) of your time in a given time period (in a typical week, perhaps). Those are the tasks where making the time investment to develop an automated solution is most likely to see a payback. The other 20% are usually much worse candidates for automation, where the cost of automating them likely outweighs the time savings.

 

As a side note, the tasks that take up the time may not necessarily be related to a specific work request type. For example, I may spend 40% of my week processing firewall requests, and another 20% processing routing requests, and another 20% troubleshooting connectivity issues. In all of these activities, I spend time identifying what device, firewall zone, or VRF various IP addresses are in, so that I can write the correct firewall rule, or add routing in the right places, or track next-hops in a traceroute where DNS is missing. In this case, I would gain the most immediate benefits if I could automate IP address research.

 

I don't want to be misunderstood; there is value in creating process and automation around how a firewall request comes into the queue, for example, but the value overall is lower than for a tool that can tell me lots of information about an IP address.

 

That Seems Obvious

 

You'd think that it was intuitive that we would do the right thing, but sometimes things don't go according to plan:

 

Feeping Creatures!

 

Once you write a helpful tool or an automation, somebody will come back and say, Ah, what if I need to know X information too? I need that once a month when I do the Y report. As a helpful person, it's tempting to immediately try and adapt the code to cover every conceivable corner case and usage example, but having been down that path, I counsel against doing so. It typically makes the code unmanageably complex due to all the conditions being evaluated and, worse, it goes firmly against the 80-20 rule above. Feeping Creatures is a Spoonerism referring to Creeping Features, i.e. an ever-expanding feature list for a product.

 

A Desire to Automate Everything

 

There's a great story in What Do You Care What Other People Think? (Richard Feynman) that talks about Mr. Frankel, who had developed a system using a suite of IBM machines to run the calculations for the atomic bomb that was being developed at Los Alamos.

 

"Well, Mr. Frankel, who started this program, began to suffer from the computer disease that anybody who works with computers now knows about. [...] Frankel wasn't paying any attention; he wasn't supervising anybody. [...] (H)e was sitting in a room figuring out how to make one tabulator automatically print arctangent X, and then it would start and it would print columns and then bitsi, bitsi, bitsi, and calculate the arc-tangent automatically by integrating as it went along and make a whole table in one operation.

 

Absolutely useless. We had tables of arc-tangents. But if you've ever worked with computers, you understand the disease -- the delight in being able to see how much you can do. But he got the disease for the first time, the poor fellow who invented the thing."

 

It's exciting to automate things or to take a task that previously took minutes, and turn it into a task that takes seconds. It's amazing to watch the 80% shrink down and down and see productivity go up. It's addictive. And so, inevitably, once one task is automated, we begin looking for the next task we can feel good about, or we start thinking of ways we could make what we already did even better. Sometimes the coder is the source of creeping features.

 

It's very easy to lose touch with the larger picture and the need to stay focused on tasks that will generate measurable gains. I've fallen foul of this myself in the past, and have been delighted, for example, with a script I spent four days writing, which pulled apart log entries from a firewall and ran all kinds of analyses on them, allowing you to slice the data any which way and generate statistics. Truly amazing! The problem is, I didn't have a use for most of the stats I was able to produce, and actually, I could have fairly easily worked out the most useful ones in Excel in about 30 minutes. I got caught up in being able to do something, rather than actually needing to do it.

 

And So...

 

Solve A Real Problem

 

Despite my cautions above, I maintain that the best way to learn to code is to find a real problem that you want to solve and try to write code to do it. Okay, there are some cautions to add here, not the least of which is to run tests and confirm the output. More than once, I've written code that seemed great when I ran it on a couple of lines of test data, but then when I ran it on thousands of lines of actual data, I discovered oddities in the input data, or found that the loop processing all the data was carelessly reusing variables, or similar. Just like I tell my kids with their math homework, sanity check the output. If a script claims that a 10Gbps link was running at 30Gbps, maybe there's a problem with how that figure is being calculated.

 

Don't Be Afraid to Start Small

 

Writing a Hello World! script may feel like one of the most pointless activities you may ever undertake, but for a total beginner, it means something was achieved and, if nothing else, you learned how to output text to the screen. The phrase, "Don't try to boil the ocean," speaks to this concept quite nicely, too.

 

Be Safe!

 

If your ultimate aim is to automate production device configurations or orchestrate various APIs to dance to your will, that's great, but don't start off by testing your scripts in production. Use device VMs where possible to develop interactions with different pieces of software. I also recommend starting by working with read commands before jumping right in to the potentially destructive stuff. After all, after writing a change to a device, it's important to know how to verify that the change was successful. Developing those skills first will prove useful later on.

 

Learn how to test for, detect, and safely handle errors that arise along the way, particularly the responses from the devices you are trying to control. Sanitize your inputs! If your script expects an IPv4 address as an input, validate that what you were given is actually a valid IPv4 address. Add your own business rules to that validation if required (e.g. a script might only work with 10.x.x.x addresses, and all other IPs require human input). The phrase Garbage in, garbage out, is all too true when humans provide the garbage.
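Python's standard library makes the IPv4 example above almost free; the 10.0.0.0/8 business rule in this sketch is just the example rule mentioned in the paragraph:

    import ipaddress

    def validate_input(value):
        """Reject anything that isn't a valid IPv4 address inside 10.0.0.0/8."""
        try:
            address = ipaddress.IPv4Address(value)
        except ipaddress.AddressValueError:
            raise ValueError(f"{value!r} is not a valid IPv4 address")
        if address not in ipaddress.ip_network("10.0.0.0/8"):
            raise ValueError(f"{address} is outside 10.0.0.0/8; please handle manually")
        return address

    print(validate_input("10.20.30.40"))        # fine
    try:
        validate_input("192.0.2.1")
    except ValueError as err:
        print(err)                              # outside 10.0.0.0/8; please handle manually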

 

Scale Out Carefully

 

To paraphrase a common saying, automation allows you to make mistakes on hundreds of devices much faster than you could possibly do it by hand. Start small with a proof of concept, and demonstrate that the code is solid. Once there's confidence that the code is reliable, it's more likely to be accepted for use on a wider scale. That leads neatly into the last point:

 

Good Luck Convincing People

 

It seems to me that everybody loves scripting and automation right up to the point where it needs to be allowed to run autonomously. Think of it like the Google autonomous car: for sure, the engineering team was pretty confident that the code was fairly solid, but they wouldn't let that car out on the highway without human supervision. And so it is with automation; when the results of some kind of process automation can be reviewed by a human before deployment, that appears to be an acceptable risk from a management team's perspective. Now suggest that the human intervention is no longer required, and that the software can be trusted, and see what response you get.

 

A coder I respect quite a bit used to talk about blast radius, or what's the impact of a change beyond the box on which the change is taking place? Or what's the potential impact of this change as a whole? We do this all the time when evaluating change risk categories (is it low, medium, or high?) by considering what happens if a change goes wrong. Scripts are no different. A change that adds an SNMP string to every device in the network, for example, is probably fairly harmless. A change that creates a new SSH access-list, on the other hand, could end up locking everybody out of every device if it is implemented incorrectly. What impact would that have on device management and operations?

 

However...

 

I really recommend giving programming a shot. It isn't necessary to be a hotshot coder to have success (trust me, I am not a hotshot coder), but having an understanding of coding will, I believe, positively impact other areas of your work. Sometimes a programming mindset can reveal ways to approach problems that didn't show themselves before. And while you're learning to code, if you don't already know how to work in a UNIX (Linux, BSD, MacOS, etc.) shell, that would be a great stretch goal to add to your list!

 

I hope that this mini-series of posts has been useful. If you do decide to start coding, I would love to hear back from you on how you got on, what challenges you faced and, ultimately, if you were able to code something (no matter how small) that helped you with your job!

In this post, part of a miniseries on coding for non-coders, I thought it might be interesting to look at a real-world example of breaking a task down for automation. I won't be digging hard into the actual code but instead looking at how the task could be approached and turned into a sequence of events that will take a sad task and transform it into a happy one.

 

The Task - Deploying a New VLAN

 

Deploying a new VLAN is simple enough, but in my environment it means connecting to around 20 fabric switches to build the VLAN. I suppose one solution would be to use an Ethernet fabric that had its own unified control plane, but ripping out my Cisco FabricPath™ switches would take a while, so let's just put that aside for the moment.

 

When a new VLAN is deployed, it almost always also requires that a layer 3 (IP) gateway with HSRP be created on the routers, and that the VLAN be trunked from the fabric edge to the routers. If I can automate this process, for every VLAN I deploy, I can avoid logging in to 22 devices by hand, and I can also hopefully complete the task significantly faster.

 

Putting this together, I now have a list of three main steps I need to accomplish:

 

  1. Create the VLAN on every FabricPath switch
  2. Trunk the VLAN from the edge switches to the router
  3. Create the L3 interface on the routers, and configure HSRP

 

Don't Reinvent the Wheel

 

Much in the same way that one uses modules when coding to avoid rewriting something that has been created already, I believe that the same logic applies to automation. For example, I run Cisco Data Center Network Manager (DCNM) to manage my Ethernet fabric. DCNM has the capability to deploy changes (it calls them Templates) to the fabric on demand. The implementation of this feature involves DCNM creating an SSH session to the device and configuring it just like a real user would. I could, of course, implement the same functionality for myself in my language of choice, but why would I? Cisco has spent time making the deployment process as bulletproof as possible; DCNM recognizes error messages and can deal with them. DCNM also has the logic built in to configure all the switches in parallel, and in the event of an error on one switch, to either roll back that switch alone or all switches in the change. I don't want to have to figure all that out for myself when DCNM already does it.

 

For the moment, therefore, I will use DCNM to deploy the VLAN configurations to my 20 switches. Ultimately it might be better if I had full control and no dependency on a third-party product, but in terms of achieving the goal rapidly, this works for me. To assist with trunking VLANs toward the routers, in my environment the edge switches facing the routers have a unique name structure, so I was also able to tweak the DCNM template so that if it detects that it is configuring one of those switches, it also adds the VLANs to the trunked list on the relevant router uplinks. Again, that's one less task I'll have to do in my code.

 

Similarly, to configure the routers (IOS XR-based), I could write a Python script based on the Paramiko SSH library, or use the Pexpect library to launch ssh and control the program's actions based on what it sees in the session. Alternatively, I could use NetMiko, which already understands how to connect to an IOS XR router and interact with it. The latter choice seems preferable, if for no other reason than to speed up development.

 

Creating the VLAN

 

DCNM has a REST API through which I can trigger a template deployment. All I need is a VLAN number and an optional description, and I can feed that information to DCNM and let it run. First, though, I need the list of devices on which to apply the configuration template. This information can be retrieved using another REST API call. I can then process the list, apply the VLAN/Description to each item and submit the configuration "job." After submitting the request, assuming success, DCNM will return the JobID that was created. That's handy because it will be necessary to keep checking the status of that JobID afterward to see if it succeeded. So here are the steps so far:

 

  • Get VLAN ID and VLAN Description from user
  • Retrieve list of devices to which the template should be applied
  • Request a configuration job
  • Request job status until it has some kind of resolution (Success, Failed, etc)

 

Sound good? Wait; the script needs to log in as well. In the DCNM REST API, that means authenticating to a particular URL, receiving a token (a string of characters), then using that token as a cookie in all future requests within that session. Also, as a good citizen, the script should log out after completing its requests, so the list now reads (a rough sketch of the login/logout bookends follows the list):

  • Get VLAN ID and VLAN Description from user
  • Authenticate to DCNM and extract session token
  • Retrieve list of devices to which the template should be applied
  • Request a configuration job
  • Request job status until it has some kind of resolution (Success, Failed, etc)
  • Log out of DCNM
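To make that flow a little more concrete, here is a rough Python sketch of the login/logout bookends using the requests library. The endpoint paths, the JSON key name, and the cookie name below are hypothetical placeholders standing in for whatever the DCNM REST documentation actually specifies:

import requests

DCNM = "https://dcnm.example.com"        # placeholder DCNM address

def dcnm_login(username, password):
    """Authenticate and return the session token string."""
    resp = requests.post(DCNM + "/rest/logon",              # hypothetical path
                         auth=(username, password),
                         verify=False)
    resp.raise_for_status()
    return resp.json()["token"]                             # hypothetical key name

def dcnm_logout(token):
    """End the session politely."""
    requests.post(DCNM + "/rest/logout",                    # hypothetical path
                  cookies={"token": token}, verify=False)   # token sent back as a cookie

token = dcnm_login("admin", "secret")
try:
    # ...retrieve the device list, submit the template job, poll the job ID...
    pass
finally:
    dcnm_logout(token)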

 

That should work for the VLAN creation, but I'm also missing a crucial step: sanitizing and validating the inputs provided to the script. I need to ensure, for example, that:

 

  • VLAN ID is in the range 1-4094, but for legacy Cisco purposes perhaps, does not include 1002-1005
  • VLAN Description must be 63 characters or less, and the rules I want to apply will only allow [a-z], [A-Z], [0-9], dash [-] and underscore [_]; no spaces or other odd characters

 

Maybe the final list looks like this then:

 

  • Get VLAN ID and VLAN Description from user
  • Confirm that VLANID and VLAN Description are valid
  • Authenticate to DCNM and extract session token
  • Retrieve list of devices to which the template should be applied
  • Request a configuration job
  • Request job status until it has some kind of resolution (Success, Failed, etc)
  • Log out of DCNM

 

Configuring IOS XR

 

In this example, I'll use Python+NetMiko to do the hard work for me. My inputs are going to be:

 

  • IPv4 Subnet and prefix length
  • IPv6 Subnet and prefix length
  • VLAN ID
  • L3 Interface Description

 

As before, I will sanity check the data provided to ensure that the subnets are valid. I have found that IOS XR's configuration for HSRP, while totally logical and elegantly hierarchical, is a bit of a mouthful to type out, so it's great to have a script take basic information like a subnet and apply some standard rules to it: the second IP is the HSRP gateway (e.g. .1 on a /24 subnet), the next address up (.2) goes on the A router, and .3 goes on the B router. For my HSRP group number, I use the VLAN ID. The subinterface number where I'll be configuring layer 3 will also match the VLAN ID, and with that information I can configure the HSRP BFD peering between the routers too. By applying some simple standardized templating of the configuration, I can take a bare minimum of information from the user and create configurations which would take much longer to produce manually and which quite often (based on my own experience) would contain mistakes.
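As a rough illustration of that templating idea, here's a short Python 3 sketch using the standard library's ipaddress module to derive the addresses and numbers from nothing more than a subnet and a VLAN ID. The addressing rules are the ones described above; the function name and subnet are just examples:

import ipaddress

def derive_l3_params(subnet, vlan_id):
    """Turn a subnet string like '10.20.30.0/24' into our standard values."""
    net = ipaddress.ip_network(subnet)
    return {
        "hsrp_vip":   str(net[1]),     # first usable address, e.g. .1
        "router_a":   str(net[2]),     # e.g. .2
        "router_b":   str(net[3]),     # e.g. .3
        "prefixlen":  net.prefixlen,
        "hsrp_group": vlan_id,         # group number reuses the VLAN ID
        "subif":      vlan_id,         # subinterface number reuses the VLAN ID
    }

print(derive_l3_params("10.20.30.0/24", 100))
# {'hsrp_vip': '10.20.30.1', 'router_a': '10.20.30.2', 'router_b': '10.20.30.3',
#  'prefixlen': 24, 'hsrp_group': 100, 'subif': 100}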

 

The process then might look like this:

 

  • Get IPv4 subnet, IPv6 subnet, VLAN ID and L3 interface description from user
  • Confirm that IPv4 subnet, IPv6 subnet, VLANID and interface description are valid
  • Generate templated configuration for the A and B routers
  • Create session to A router and authenticate
  • Take a snapshot of the configuration
  • Apply changes (check for errors)
  • Assuming success, logout
  • Rinse and repeat for B router

 

Breaking Up is Easy

 

Note that the sequences of actions above have been created without requiring any coding. Implementation can come next, in the preferred language, but without a clear idea of where we're going, especially as a new coder, the project is likely to go wrong very quickly.

 

For implementation, I now have a list of tasks which I can attack, to some degree, separately from one another; each one is a kind of milestone. Looking at the DCNM process again:

 

  • Get VLAN ID and VLAN Description from user

 

Perhaps this data comes from a web page, but for the purposes of my script, I will assume that these values are provided as arguments to the script. For reference, an argument is anything that comes after the name of the script when you type it on the command line; e.g. in the command sayhello.py John, the program sayhello.py would see one argument with a value of John.
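As a minimal sketch (the script name and values here are just examples), reading those two arguments in Python looks something like this:

import sys

# usage:  python create_vlan.py 100 WEB-SERVERS
if len(sys.argv) != 3:
    sys.exit("usage: create_vlan.py <vlan-id> <vlan-description>")

vlan_id = sys.argv[1]      # "100" -- arguments always arrive as strings
vlan_desc = sys.argv[2]    # "WEB-SERVERS"
print("Creating VLAN", vlan_id, "with description", vlan_desc)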

 

  • Confirm that VLANID and VLAN Description are valid

 

This sounds like a perfect opportunity to write a function/subroutine which can take a VLAN ID as its own argument, and will return a boolean (true/false) value indicating whether or not the VLAN ID is valid. Similarly, a function could be written for the description, either to enforce the allowed characters by removing anything that doesn't match, or simply to validate whether what's provided meets the criteria. These may be useful in other scripts later too, so writing a simple function now may save time later on.
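One possible shape for those functions in Python is sketched below; the 1002-1005 exclusion and the allowed-character rule come straight from the requirements listed earlier, and the function names are just my own choices:

import re

def valid_vlan_id(vlan_id):
    """Return True if vlan_id is an integer 1-4094, excluding 1002-1005."""
    try:
        vlan = int(vlan_id)
    except (TypeError, ValueError):
        return False
    return 1 <= vlan <= 4094 and not 1002 <= vlan <= 1005

def valid_vlan_description(description):
    """Return True if description is 1-63 characters of [A-Za-z0-9_-] only."""
    return bool(re.fullmatch(r"[A-Za-z0-9_-]{1,63}", description or ""))

assert valid_vlan_id("100") and not valid_vlan_id(1003)
assert valid_vlan_description("WEB-SERVERS") and not valid_vlan_description("web servers")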

 

  • Authenticate to DCNM and extract session token
  • Retrieve list of devices to which the template should be applied
  • Request a configuration job
  • Request job status until it has some kind of resolution (Success, Failed, etc)
  • Log out of DCNM

 

These five actions are all really the same kind of thing. For each one, some data will be sent to a REST API, and something will be returned to the script by the REST API. The process of submitting to the REST API only requires a few pieces of information:

 

  • What kind of HTTP request is it? GET / POST / etc.?
  • What is the URL?
  • What data needs to be sent, if any, to the URL?
  • How to process the data returned. (What format is it in?)

 

It should be possible to write some functions to handle GET and POST requests so that it's not necessary to repeat the HTTP request code every time it's needed. The idea is not to repeat code multiple times if it can be more simply put in a single function and called from many places. This also means that fixing a bug in that code only requires it to be fixed in one place.
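For example, a single helper along these lines (a sketch only; the base URL, paths, and cookie name are placeholders) would let every step above become a one-line call:

import requests

DCNM = "https://dcnm.example.com"    # placeholder

def dcnm_request(method, path, token=None, payload=None):
    """Send a GET/POST to the REST API and return the decoded JSON body."""
    resp = requests.request(
        method,
        DCNM + path,
        json=payload,                                   # ignored when None
        cookies={"token": token} if token else None,    # session token as a cookie
        verify=False,
    )
    resp.raise_for_status()                             # raise on any 4xx/5xx error
    return resp.json()

# devices = dcnm_request("GET", "/rest/inventory/devices", token=token)            # hypothetical path
# job = dcnm_request("POST", "/rest/config/jobs", token=token, payload={...})      # hypothetical path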

 

For the IOS XR configuration, each step can be processed in a similar fashion, creating what are hopefully more manageable chunks of code to create and test.

 

Achieving Coding Goals

 

I really do believe that coders sometimes want to jump right into the coding itself before taking the time to think through how the code might actually work, and what the needs will be. In the example above, I've taken a single large task (create a VLAN on 20 devices and configure two attached routers with an L3 interface and HSRP), which might seem rather daunting at first, and broken it down into smaller functional pieces so that a) it's clearer how the code will work, and in what order; and b) each small piece of code is now a more achievable task. I'd be interested to know whether you, as a reader, feel that the task lists, while perhaps daunting in length, seem more achievable from a coding perspective than the project headline alone. To me, at least, they absolutely are.

 

I said I wouldn't dig into the actual code, and I'll keep that promise. Before I end, though, here's a thought to consider: when is it right to code a solution, and when is it not? I'll be taking a look at that in the next, and final, article in this miniseries.

You've decided it's time to learn how to code, so the next step is to find some resources and start programming your first masterpiece. Hopefully, you've decided that my advice on which language to choose was useful, and you're going to start with either Python, Go or PowerShell. There are a number of ways to learn, and a number of approaches to take. In this post, I'll share my thoughts on different ways to achieve success, and I'll link to some learning resources that I feel are pretty good.

 

How I Began Coding

 

When I was a young lad, my first foray into programming was using Sinclair BASIC on a Sinclair ZX81 (which in the United States was sold as the Timex Sinclair 1000). BASIC was the only language available on that particular powerhouse of computing excellence, so my options were limited. I continued by using BBC BASIC on the Acorn BBC Micro Model B, where I learned to use functions and procedures to avoid repetition of code. On the PC I got interested in what could be accomplished by scripting in MS-DOS. On Macintosh, I rediscovered a little bit of C (via MPW). When I was finally introduced to NetBSD, things got interesting.

 

I wanted to automate activities that manipulated text files, and UNIX is just an amazing platform for that. I learned to edit text in vi (aka vim, these days) because it was one tool that I could pretty much guarantee was installed on every installation I got my hands on. I began writing shell scripts which looped around calling various instantiations of text processing utilities like grep, sed, awk, sort, uniq, fmt and more, just to get the results I wanted. I found that often, awk was the only tool with the power to extract and process the data I needed, so I ended up writing more and more little awk scripts to fill in. To be honest, some of the pipelines I was creating for my poor old text files were tricky at best. Finally, somebody with more experience than me looked at it and said, Have you considered doing this in Perl instead?

 

Challenge accepted! At that point, my mission became to create the same functionality in Perl as I had created from my shell scripts. Once I did so, I never looked back. Those and other scripts that I wrote at the time are still running. Periodically, I may go back and refactor some code, or extract it into a module so I can use the same code in multiple related scripts, but I have fully converted to using a proper scripting language, leaving shell scripts to history.

 

How I Learned Perl

 

With my extensive experience with BASIC and my shallow knowledge of C, I was not prepared to take on Perl. I knew what strings and arrays were, but what was a hash? I'd heard of references but didn't really understand them. In the end—and try not to laugh because this was in the very early days of the internet—I bought a book (Learn Perl in 21 Days), and started reading. As I learned something, I'd try it in a script, I'd play with it, and I'd keep using it until I found a problem it didn't solve. Then back to the book, and I'd continue. I used the book more as a reference than as a true training guide (I don't think I read much beyond about Day 10 in a single stretch; after that, it was on an as-needed basis).

 

The point is, I did not learn Perl by working through a series of 100 exercises on a website. Nor did I learn Perl by reading through the 21 Days book, and then the ubiquitous Camel book. I can't learn by reading theory and then applying it. And in any case, I didn't necessarily want to learn Perl as such; what I really wanted was to solve my text processing problems at that time. And then as new problems arose, I would use Perl to solve those, and if I found something I didn't know how to do, I'd go back to the books as a reference to find out what the language could do for me. As a result, I did not always do things the most efficient way, and I look back at my early code and think, Oh, yuck. If I did that now I'd take a completely different approach. But that's okay, because learning means getting better over time and — this is the real kicker — my scripts worked. This might matter more if I were writing code to be used in a high-performance environment where every millisecond counts, but for my purposes, "It works" was more than enough for me to feel that I had met my goals.

 

In my research, I stumbled across a great video which put all of that more succinctly than I did:

 

Link: How to Learn to Code - YouTube

 

In the video, (spoiler alert!) CheersKevin states that you don't want to learn a language; you want to solve problems, and that's exactly it. My attitude is that I need to learn enough about a language to be dangerous, and over time I will hone that skill so that I'm dangerous in the right direction, but my focus has always been on producing an end product that satisfies me in some way. To that end, I simply cannot sit through 30 progressive exercises teaching me to program a poker game simulator bit by bit. I don't want to play poker; I don't have any motivation to engage with the problem.

 

A Few Basics

 

Having said that you don't want to learn a language, it is nonetheless important to understand the ways in which data can be stored and some basic code structure. Here are a few things I believe it's important to understand as you start programming, regardless of which language you choose to learn:

 

scalar variable: a way to store a single value, e.g. a string (letters/numbers/symbols), a number, a pointer to a memory location, and so on.

array / list / collection: a way to store an (ordered) list of values, e.g. a list of colors ("red", "blue", "green") or (1, 1, 2, 3, 5, 8).

hash / dictionary / lookup table / associative array: a way to store data by associating a unique key with a value, e.g. the key might be "red", and the value might be the HTML hex value for that color, "#ff0000". Many key/value pairs can be stored in the same object, e.g. colors=("red"=>"#ff0000", "blue"=>"#0000ff", "green"=>"#00ff00").

zero-based numbering: the number (or index) of the first element in a list (array) is zero; the second element is 1, and so on. Each element in a list is typically accessed by putting the index (the position in the list) in square brackets after the name. In our previously defined array colors=("red", "blue", "green"), the elements are colors[0]="red", colors[1]="blue", and colors[2]="green".

function / procedure / subroutine: a way to group a set of commands together so that the whole block can be called with a single command. This avoids repetition within the code.

objects, properties and methods: an object can have properties (which are information about, or characteristics of, the object) and methods (properties which execute a function when called). The properties and methods are usually accessed using dot notation. For example, I might have an object mycircle which has a property called radius; this would be accessed as mycircle.radius. I could then have a method called area which calculates the area of the circle (πr²) based on the current value of mycircle.radius; the result would be accessed as mycircle.area(), where parentheses are conventionally used to indicate that this is a method rather than a property.

 

All three languages here (and indeed most other modern languages) use data types and structures like the above to store and access information. It's therefore important to have at least a basic understanding of them before diving in too far. This is in some ways the same logic as gaining an understanding of IP before trying to configure a router; each router may have a different configuration syntax for routing protocols and IP addresses, but they're all fundamentally configuring IP ... so it's important to understand IP!
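As a quick illustration, here are those same building blocks written out in Python (any of the three languages would do; only the syntax changes):

colors_list = ["red", "blue", "green"]        # array / list (zero-based)
colors_hex = {"red": "#ff0000",               # hash / dictionary: key => value
              "green": "#00ff00",
              "blue": "#0000ff"}

print(colors_list[0])        # zero-based numbering: prints "red"
print(colors_hex["blue"])    # lookup by key: prints "#0000ff"

def area_of_circle(radius):  # function: group commands to avoid repetition
    return 3.14159 * radius ** 2

class Circle:                # an object with a property and a method
    def __init__(self, radius):
        self.radius = radius           # property
    def area(self):                    # method
        return 3.14159 * self.radius ** 2

c = Circle(2)
print(c.radius, c.area())    # dot notation: property vs method call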

 

Some Training Resources

 

This section is really the impossible part, because we all learn things in different ways, at different speeds, and have different tolerances. However, I will share some resources which I have either personally found useful, or which others have recommended as being among the best:

 

Python

 

 

The last course is a great example of learning in order to accomplish a goal, although perhaps only useful to network engineers as the title suggests. Kirk is the author of the NetMiko Python Library and uses it in his course to allow new programmers to jump straight into connecting to network devices, extracting information and executing commands.

 

Go

 

Go is not, as I think I indicated previously, a good language for a total beginner. However, if you have some experience of programming, these resources will get you going fairly quickly:

 

 

As a relatively new, and still changing, language, Go does not have a wealth of training resources available. However, there is a strong community supporting it, and the online documentation is a good resource even though it's more a statement of fact than a learning experience.

 

PowerShell

 

 

Parting Thoughts

 

Satisfaction with learning resources is so subjective, it's hard to be sure if I'm offering a helpful list or not, but I've tried to recommend courses which have a reputation for being good for complete beginners. Whether these resources appeal may depend on your learning style and your tolerance for repetition. Additionally, if you have previous programming experience you may find that they move too slowly or are too low level; that's okay because there are other resources out there aimed at people with more experience. There are many resources I haven't mentioned which you may think are amazing, and if so I would encourage you to share those in the comments because if it worked for you, it will almost certainly work for somebody else where other resources will fail.

 

Coincidentally a few days ago I was listening to Scott Lowe's Full Stack Journey podcast (now part of the Packet Pushers network), and as he interviewed Brent Salisbury in Episode 4, Brent talked about those lucky people who can simply read a book about a technology (or in this case a programming language) and understand it, but his own learning style requires a lot of hands-on, and the repetition is what drills home his learning. Those two categories of people are going to succeed in quite different ways.

 

Since it's fresh in my mind, I'd also like to recommend listening to Episode 8 with Ivan Pepelnjak. As I listened, I realized that Ivan had stolen many of the things I wanted to say, and said them to Scott late in 2016. In the spirit that everything old is new again, I'll leave you with some of the axioms from RFC1925 (The Twelve Networking Truths), one of Ivan's favorites, which seem oddly relevant to this post, and to the art of programming too:

 

     (6a)  (corollary). It is always possible to add another
           level of indirection.
     (8)   It is more complicated than you think.
     (9)   For all resources, whatever it is, you need more.
     (10)  One size never fits all.
     (11)  Every old idea will be proposed again with a different
           name and a different presentation, regardless of whether
           it works.
     (11a) (corollary). See rule 6a.

To paraphrase a lyric from Hamilton, Deciding to code is easy; choosing a language is harder. There are many programming languages that are good candidates for any would-be programmer, but selecting the one that will be most beneficial to each individual need is a very challenging decision. In this post, I will attempt to give some background on programming languages in general, as well as examine a few of the most popular options and attempt to identify where each one might be the most appropriate choice.

 

Programming Types and Terminology

 

Before digging into any specific languages, I'm going to explain some of the properties of programming languages in general, because these will contribute to your decision as well.

 

Interpreted vs Compiled

Interpreted

 

An interpreted language is one where the interpreter reads the script and generates machine-level instructions on the fly. When an interpreted program is run, it's actually the language interpreter that is running, with the script as its input; its output is the hardware-specific bytecode (i.e. machine code). The advantages of interpreted languages are that they are typically quick to edit and debug, but they are also slower to run because the conversion to bytecode has to happen in real time. Distributing a program written in an interpreted language effectively means distributing the source code.

 


Compiled

 

A compiled language is one where the script is processed by the language compiler and turned into an executable file containing the machine-specific bytecode. It is this output file that is run when the script is executed. It isn't necessary to have the language installed on the target machine to execute bytecode, so this is the way most commercial software is created and distributed. Compiled code runs quickly because the hard work of determining the bytecode has already been done, and all the target machine needs to do is execute it.

 


 

Strongly Typed vs Weakly Typed

What is Type?

 

In programming languages, type is the concept that each piece of data is of a particular kind. For example, 17 is an integer. "John" is a string. 2017-05-07 10:11:17.112 UTC is a time. The reason languages like to keep track of type is so they can determine how to react when operations are performed on that data.

 

As an example, I have created a simple program where I assign a value of some sort to a variable (a place to store a value), imaginatively called x. My program looks something like this:


x = 6
print x + x

I tested my script and changed the value of x to see how each of the five languages would process the answer. It should be noted that putting a value in quotes (") implies that the value is a string, i.e. a sequence of characters. "John" is a string, but there's no reason "678" can't be a string too. The values of x are listed across the top, and the table shows the result of adding x to x:

 

x =          6     "6"    "six"     "6six"      "six6"
Perl         12    12     0         12          0
Bash         12    12     0         *error*     0
Python       12    66     sixsix    6six6six    six6six6
Ruby         12    66     sixsix    6six6six    six6six6
PowerShell   12    66     sixsix    6six6six    six6six6

 

Weakly Typed Languages

Why does this happen? Perl and Bash are weakly (or loosely) typed; that is, while they understand what a string is and what an integer is, they're pretty flexible about how those are used. In this case, Perl and Bash made a best-effort guess at whether to treat the strings as numbers or strings; although the value 6 was defined in quotes (and quotes mean a string), the determination was that in the context of a plus sign, the program must be trying to add numbers together. Python and Ruby, on the other hand, respected "6" as a string and decided that the intent was to concatenate the strings, hence the answer of 66.

 

The flexibility of the weak typing offered by a language like Perl is both a blessing and a curse. It's great because the programmer doesn't have to think about what data type each variable represents, and can use them anywhere and let the language determine the right type to use based on context. It's awful because the programmer doesn't have to think about what data type each variable represents, and can use them anywhere. I speak from bitter experience when I say that the ability to (mis)use variables in this way will, eventually, lead to the collapse of civilization. Or worse, unexpected and hard-to-track-down behavior in the code.

 

That Bash error? Bash for a moment pretends to have strong typing and dislikes being asked to add variables whose value begins with a number but is not a proper integer. It's too little, too late if you ask me.

 

Strongly Typed Languages

In contrast, Python and Ruby are strongly typed languages (as are C and Go). In these languages, adding two numbers means adding two integers (or floating point numbers, aka floats), and concatenating strings requires two or more strings. Any attempt to mix and match the types will generate an error. For example, in Python:


>>> a = 6
>>> b = "6"
>>> print a + b
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'int' and 'str'

Strongly typed languages have the advantage that accidentally adding the wrong variable to an equation, for example, will not be permitted if the type is incorrect. In theory, this reduces errors and encourages a more explicit programming style. It also ensures that the programmer is clear that the value of an int(eger) will never have decimal places. On the other hand, sometimes it's a real pain to have to convert variables from one type to another to use their values in a different context.

PowerShell appears to want to pretend to be strongly typed, but a short test reveals some scary behavior. I've included a brief demonstration at the end in the section titled Addendum: PowerShell Bonus Content.

 

Dynamic / Statically Typed

 

There's one more twist to the above definitions. Even when a language enforces type in its operations, it may still allow a variable to change its type at any time. For example, it is just fine in Perl to initialize a variable with an integer, then give it a new value which is a string:


$a = 1;
$a = "hello";

Dynamic typing is typically a property of interpreted languages, presumably because they have more flexibility to change memory allocations at runtime. Compiled languages, on the other hand, tend to be statically typed; if a variable is defined as a string, it cannot change later on.
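Python is a handy way to see both halves of this at once: it is strongly typed but dynamically typed, so the short sketch below is perfectly legal right up until the point where the types are mixed:

a = 1
a = "hello"          # rebinding the same name to a different type is fine (dynamic)
print(type(a))       # <class 'str'>

try:
    print(1 + "hello")
except TypeError as err:
    print("still strongly typed:", err)   # mixing int and str is refused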

 

Modules / Includes / Packages / Imports / Libraries

 

Almost every language has some system whereby its functionality can be expanded by installing and referencing code written by somebody else. For example, Perl does not have SSH support built in, but there is a Net::SSH module which can be installed and used. Modules are the easiest way to avoid reinventing the wheel and allow us to ride on the back of somebody else's hard work. Python has packages, Ruby has modules (commonly distributed in a format called a "gem"), and Go has packages. These expansion systems are critical to writing good code; it's not a failure to use them, it's common sense.
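In Python, for example, using somebody else's work is usually just a package installation (e.g. pip install requests on the command line) followed by an import; the small sketch below fetches a web page with the third-party requests package and formats the result with the built-in json module:

import json        # ships with Python (standard library)
import requests    # third-party package, installed separately with pip

resp = requests.get("https://www.example.com")
print(resp.status_code)                          # e.g. 200
print(json.dumps({"status": resp.status_code}))  # {"status": 200}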

 

Choosing a Language

 

With some understanding of type, modules and interpreted/compiled languages, now it's time to figure out how to choose the best language. First, here's a quick summary of the most common scripting languages:

 

Language     Compiled / Interpreted   Type               Static / Dynamic   Expansion
Perl         Interpreted              Weak               Dynamic            Modules
Python       Interpreted              Strong             Dynamic            Packages
Ruby         Interpreted              Strong             Dynamic            Modules
PowerShell   Interpreted              It's complicated   Dynamic            Modules
Go           Compiled                 Strong             Static             Packages

 

I've chosen not to include Bash mainly because I consider it to be more of a wrapper than a fully fledged scripting language suitable for infrastructure tasks. Okay, okay. Put your sandals down. I know how amazing Bash is. You do, too.

 

Perl


 

Ten years ago I would have said that Perl (version 5.x, definitely not v6) was the obvious option. Perl is flexible, powerful, has roughly eleventy-billion modules written for it, and there are many training guides available. Perl's regular expression handling is exemplary and it's amazingly simple and fast to use. Perl has been my go-to language since I first started using it around twenty-five years ago, and when I need to code in a hurry, it's the language I use because I'm so familiar with it. With that said, for scripting involving IP communications, I find that Perl can be finicky, inconsistent and slow. Additionally, vendor support for Perl (e.g. providing a module for interfacing with their equipment) has declined significantly in the last 5-10 years, which also makes Perl less desirable. Don't get me wrong; I doubt I will stop writing Perl scripts in the foreseeable future, but I'm not sure that I could, in all honesty, recommend it for somebody looking to control their infrastructure with code.

 

Python


It probably won't be a surprise to learn that for network automation, Python is probably the best choice of language. I'm not entirely clear why people love Python so much, and why even the people who love Python seem stuck on v2.7 and are avoiding the move to v3.0. Still, Python has established itself as the de facto standard for networking automation. Many vendors provide Python packages, and there is a strong and active community developing and enhancing packages. Personally, I have had problems adjusting to the use of whitespace (indent) to indicate code block hierarchy, and it makes my eyes twitch that a block of code doesn't end with a closing brace of some kind, but I know I'm in the minority here. Python has a rich library of packages to choose from, but just like Perl, it's important to choose carefully and find a modern, actively supported package. If you think that semicolons at the end of lines and braces surrounding code make things look horribly complicated, then you will love Python. A new Python user really should learn version 3, but note that v3 code is not backward compatible with v2.x, and it may be important to check the availability of relevant vendor packages in a Python3-compatible form.

 

Ruby


 

Oh Ruby, how beautiful you are. I look at Ruby as being like Python, but cleaner. Ruby is three or four years younger than Python, and borrows parts of its syntax from languages like Perl, C, Java, Python, and Smalltalk. At first, I think Ruby can seem a little confusing compared to Python, but there's no question that it's a terrifically powerful language. Coupled with Rails (Ruby on Rails) on a web server, Ruby can be used to quickly create database-driven web applications, for example. I think there's almost a kind of snobbery surrounding Ruby, where those who prefer Ruby look down on Python almost like it's something used by amateurs, whereas Ruby is for professionals. I suspect there are many who would disagree with that, but that's the perception I've detected. However, for network automation, Ruby has not got the same momentum as Python and is less well supported by vendors. Consequently, while I think Ruby is a great language, I would not recommend it at the moment as a network automation tool. For a wide range of other purposes though, Ruby would be a good language to learn.

 

PowerShell


PowerShell – that Microsoft thing – used to be just for Windows, but now it has been ported to Linux and MacOS as well. PowerShell has garnered strong support from many Windows system administrators since its release in 2009 because of the ease with which it can interact with Windows systems. PowerShell excels at automation and configuration management of Windows installations. As a Mac user, my exposure to PowerShell has been limited, and I have not heard about it being much use for network automation purposes. However, if compute is your thing, PowerShell might just be the perfect language to learn, not least because it's native in Windows Server 2008 onwards. Interestingly, Microsoft is trying to offer network switch configuration within PowerShell, and released its Open Management Infrastructure (OMI) specification in 2012, encouraging vendors to use this standard interface to which PowerShell could then interface. As a Windows administrator, I think PowerShell would be an obvious choice.

 

Go


 

Go is definitely the baby of the group, and with its first release in 2012, the only one of these languages created in this decade! Go is an open source language developed by Google, and is still mutating fairly quickly with each release as new functionality is added. This is good because things that are perceived as missing are frequently added in the next release. It's bad because not all code will be forward compatible (i.e. will run in the next version). As Go is so new, the number of packages available for use is much more limited than for Ruby, Perl, or Python. This is obviously a potential downside because it may mean doing more work for one's self.

 

Where Go wins, for me, is on speed and portability. Because Go is a compiled language, the machine running the program doesn't need to have Go installed; it just needs the compiled binary. This makes distributing software incredibly simple, and also makes Go pretty much immune to anything else the user might do on their platform with their interpreter (e.g. upgrade modules, upgrade the language version, etc). More to the point, it's trivial to get Go to cross-compile for other platforms; I happen to write my code on a Mac, but I can (and do) compile tools into binaries for Mac, Linux, and Windows and share them with my colleagues. For speed, a compiled language should always beat an interpreted language, and Go delivers that in spades. In particular, I have found that Go's HTTP(S) library is incredibly fast. I've written tools relying on REST API transactions in both Go and Perl, and Go versions blow Perl out of the water. If you can handle a strongly, statically typed language (it means some extra work at times) and need to distribute code, I would strongly recommend Go. The vendor support is almost non-existent, however, so be prepared to do some work on your own.

 

Conclusions

 

There is a lot to consider when choosing a language to learn, and I feel that this post may only scrape the surface of all the potential issues to take into account. Unfortunately, sometimes the issues may not be obvious until a program is mostly completed. Nonetheless, my personal recommendations can be summarized thus:

 

  • For Windows automation: PowerShell
  • For general automation: Python (easier), or Go (harder, but fast!)

 

If you're a coder/scripter, what would you recommend to others based on your own experience? In the next post in this series, I'll look at ways to learn a language, both in terms of approach and some specific resources.

 

Addendum: PowerShell Bonus Content

 

In the earlier table where I showed the results from adding x + x, PowerShell behaves perfectly. However, when I started to add int and string variable types, it was not so good:


PS /> $a = 6
PS /> $b = "6"
PS /> $y = $a + $b
PS /> $y
12

In this example, PowerShell just interpreted the string 6 as an integer and added it to 6. What if I do it the other way around and try adding an integer to a string?


PS /> $a = "6"
PS /> $b = 6
PS /> $y = $a + $b
PS /> $y
66

This time, PowerShell treated both variables as strings; whatever type the first variable is, that's what gets applied to the other. In my opinion that is a disaster waiting to happen. I am inexperienced with PowerShell, so perhaps somebody here can explain to me why this might be desirable behavior because I'm just not getting it.

How should somebody new to coding get started learning a language and maybe even automating something? Where should they begin?

 

It's probably obvious that there are a huge number of factors that go into choosing a particular language to learn. I'll look at that particular issue in the next post, but before worrying about which arcane programming language to choose, maybe it's best to take a look at what programming really is. In this post, we'll consider whether it's going to be something that comes naturally to you, or require a very conscious effort.

 

If you're wondering what I mean by understanding programming rather than understanding a language, allow me to share an analogy. When I'm looking for, say, a network engineer with Juniper Junos skills, I'm aware that the number of engineers with Cisco skills outnumbers those with Juniper skills perhaps at a ratio of 10:1 based on the résumés that I see. So rather than looking for who can program in Cisco IOS and who can program in Junos OS, I look for engineers who have an underlying understanding of the protocols I need. The logic here is that I can teach an engineer (or they can teach themselves) how to apply their knowledge using a new configuration language, but it's a lot more effort to go back and teach them about the protocols being used. In other words, if an engineer understands, say, the theory of OSPF operation, applying it to Junos OS rather than IOS is simply a case of finding the specific commands that implement the design the engineer already understands. More importantly, learning the way a protocol is configured on a particular vendor's operating system is far less important than understanding what those commands are doing to the protocol.

 

Logical Building Blocks

 

Micro Problem: Multiply 5 x 4

 

Here's a relatively simple example of turning a problem into a logical sequence. Back in the days before complex instruction sets, many computer CPUs did not have a multiply function built in, and offered only addition and subtraction as native instructions. How can 5x4 be calculated using only addition or subtraction? The good news for anybody who has done common core math (a reference for the readers in the USA) is that it may be obvious that 5x4 is equivalent to 5+5+5+5. So how should that be implemented in code? Here's one way to do it:


answer = 0  // create a place to store the eventual answer
answer = answer + 5  // add 5
answer = answer + 5  // add 5 (2nd time)
answer = answer + 5  // add 5 (3rd time)
answer = answer + 5  // add 5  (4th time)

At the end of this process, answer should contain a value of 20, which is correct. However, this approach isn't very scalable. What if next time I need to know the answer to 5 x 25? I really don't want to have to write the add 5 line twenty-five times! More importantly, if the numbers being multiplied might be determined while the program is running, having a hard-coded set of additions is no use to us at all. Instead, maybe it's possible to make this process a little more generic by repeating the add command however many times we need to in some kind of loop. Thankfully there are ways to achieve this. Without worrying about exactly how the loop itself is coded, the logic of the loop might look something like this:


answer = 0
number_to_add = 5
number_of_times = 4
do the following commands [number_of_times] times:
  answer = answer + [number_to_add]
done

Hopefully that makes sense written as pseudocode. We define the number that we are multiplying (number_to_add), and how many times we need to add it to the answer (number_of_times), and the loop will execute the addition the correct number of times, giving us the same answer as before. Now, however, to multiply different pairs of numbers, the addition loop never needs to change. It's only necessary to change number_to_add and number_of_times.

 

This is a pretty low level example that doesn't achieve much, but understanding the logic of the steps taken is something that can then be implemented across multiple languages:

 

[Figure: the multiplication loop implemented in several different languages]
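For example, here is the same loop written out in Python; the loop body never changes, only number_to_add and number_of_times do:

answer = 0
number_to_add = 5
number_of_times = 4

for _ in range(number_of_times):
    answer = answer + number_to_add   # repeat the addition number_of_times times

print(answer)   # 20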

 

I will add (before somebody else comments!) that there are other ways to achieve the same thing in each of these languages. The point I'm making is that by understanding a logical flow, it's possible to implement what is quite clearly the same logical sequence of steps in multiple different languages in order to get the result we wanted.

 

Macro Problem: Automate VLAN Creation

 

Having looked at some low level logic, let's look at an example of a higher level construct to demonstrate that the ability to follow (and determine) a logical sequence of steps applies all the way up to higher levels as well.

 

In this example, I want to define a VLAN ID and a VLAN Name, and have my script create that VLAN on a list of devices. At the very highest level, my theoretical sequence might look like this:


Login to router
Create VLAN
Logout

After some more thought, I realize that I need to do those steps on each device in turn, so I need to create some kind of loop:


do the following for each switch in the list (s1, s2 ...  sN ):
  Login to router 
  Create VLAN
  Logout
done

It occurs to me that before I begin creating the VLANs, I ought to confirm that it doesn't exist on any of the target devices already, and if it does, I should stop before creating it. Now my program logic begins to look like this:


do the following for each switch in the list (s1, s2 ...  sN ):
  Login to router
  Check if VLAN exists
  Logout
  IF the chosen VLAN exists on this device, THEN stop!
  ENDIF
done

do the following for each switch in the list (s1, s2 ...  sN ):
  Login to router
  Create VLAN
  Logout
done

The construct used to determine whether to stop or not is referred to as an if/then/else clause. In this case, IF the VLAN exists, THEN stop (ELSE, implicitly, keep on running).

Each step in the sequence above can then be broken down into smaller parts and analyzed in a similar way. For example:


Login to router
IF login failed THEN:
| log an error
| stop
ELSE:
| log success
ENDIF

Boolean (true/false) logic is the basis for all these sequences, and multiple conditions can be tested simultaneously and even nested within other clauses. For example, I might expand the login process to cope with a RADIUS failure:


Login to router
IF login failed THEN:
| IF error message was "RADIUS server unavailable" THEN:
| | attempt login using local credentials
| ELSE:
| | log an error
| | stop
| ENDIF
ELSE:
| log success
ENDIF
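To show that this kind of logic translates directly into real code, here is a runnable toy version in Python; the login() function below is just a stand-in that simulates a RADIUS outage so the fallback path gets exercised:

def login(method):
    """Pretend to log in; simulate the RADIUS servers being unreachable."""
    if method == "radius":
        return (False, "RADIUS server unavailable")
    return (True, "")

ok, error = login("radius")
if not ok:
    if "RADIUS server unavailable" in error:
        ok, error = login("local")               # fall back to local credentials
    else:
        raise SystemExit("login failed: " + error)
if ok:
    print("login succeeded")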

 

So What?

 

The so what here is that if following the kind of nested logic above seems easy, then in all likelihood coding will too. Much of the effort in coding is figuring out how to break a problem down into the right sequence of steps, the right loops, and so forth. In other words, the correct flow of events. To be clear, choosing a language is important too, but without a grasp of the underlying sequence of steps necessary to achieve a goal, expertise in a programming language isn't going to be very useful.

 

My advice to a newcomer taking the first steps down the Road to Code (cheese-y, I know), is that it's good to look at the list of tasks that would ideally be automated and see if they can be broken down into logical steps, and then break those steps down even further until there's a solid plan for the approach. Think about what needs to happen in what order. If information is being used in a particular step, where did that information come from? Start thinking about the problems with a methodical, programming mindset.

 

In this series of posts, it's obviously not going to be possible to teach anybody how to code, but instead I'll be looking at how to select a linguistic weapon of choice, how to learn to code, ways to get started and build on success, and finally to offer some advice on when coding is the right solution.

The story so far:

 

  1. It's Not Always The Network! Or is it? Part 1 -- by John Herbert (jgherbert)
  2. It's Not Always The Network! Or is it? Part 2 -- by John Herbert (jgherbert)
  3. It's Not Always The Network! Or is it? Part 3 -- by Tom Hollingsworth (networkingnerd)
  4. It's Not Always The Network! Or is it? Part 4 -- by Tom Hollingsworth (networkingnerd)
  5. It's Not Always The Network! Or is it? Part 5 -- by John Herbert (jgherbert)
  6. It's Not Always The Network! Or is it? Part 6 -- by Tom Hollingsworth (networkingnerd)
  7. It's Not Always The Network! Or is it? Part 7 -- by John Herbert (jgherbert)

 

As 2016 draws to a close, Amanda finds that there's always still room for the unexpected. Here's the eighth—and final—installment, by Tom Hollingsworth (networkingnerd).

 

The View From The Top: James (CEO)

 

The past year has been interesting to say the least. We had a great year overall as a company and hit all of our sales goals. Employee morale seems to be high and we're ready to push forward with some exciting new initiatives as soon as we get through the holidays. I think one thing that has really stood out to me as a reason for our success is the way in which our IT staff has really started shining.

 

Before, I just saw IT as a cost center of the business. They kept asking for more budget to buy things that were supposed to make the business run faster and better, but we never saw that. Instead, we saw the continual issues that kept popping up that caused our various departments to suffer delays and, in some cases, real work stoppages. I knew that I had to make a change and get everyone on board.

 

Bringing Amanda into a leadership position was one of the best decisions I could have made. She took the problematic network and really turned it around. She took the source of all our problems and made it the source of all the solutions to them. Her investment in the right tools really helped speed along resolution time on the major issues we faced.

 

I won't pretend that all the problems in this business will ever go away. But I think I'm starting to see that developing the right people along the way can do a great job of making those problems less impactful to our business.

 

The View From The Trenches: Amanda (Sr Network Manager)

 

Change freezes are the best time of the year. No major installations or work mean maintenance tickets only ... and a real chance for us all to catch our breath. This year was probably one of the most challenging that I've ever had in IT. Getting put in a leadership role was hard. I couldn't keep my head down and plug away at the issues. I had to keep everyone in the loop and keep working toward finding ways to fix problems and keep the business running at the same time.

 

One thing that helped me more than I could have ever realized was getting the right tools in place. Too often in the past, I found myself just guessing at solutions to issues based on my own experiences. As soon as I faced a problem that I hadn't seen before, my level of capability was reset to zero and I had to start from scratch. By getting the tools that gave me the right information about the problems, I was able to reduce the time it took to get things resolved. That made the stakeholders happy. And when I shared those tools with other IT departments, they were able to resolve issues quickly as well, which meant the network stopped getting blamed for every little thing that went wrong.

 

I think in the end my biggest lesson was that IT needs to support the business. Sales, accounting, and all the other areas of this company all have a direct line into the bottom line. IT has always been more about providing support at a level that's hard to categorize. I know that James and the board would always groan when we asked for more budget to do things, but we did it because we could see the challenges that needed to be solved. By finding a way to equate those challenges to business issues and framing the discussion around improving processes and making us more revenue, I think James has finally started to realize how important it is for IT to be a part of the bigger picture.

 

That's not to say there aren't challenges today. I've already seen how we need to have some proper change control methods around here. My networking team has already implemented these ideas, and I plan on getting the CTO to pass them around to the other departments as well. Another thing that I think is critical based on my workload is getting the various teams here to train across roles. I saw it first hand when James would call me for a network issue that ended up being a part of the storage or virtualization team. I learned a lot about those technologies as I helped troubleshoot. They aren't all that different from what we do. I think a little cross training for every team would go a long way in helping us pinpoint issues when they come up instead of dumping the problem on the nearest friendly face.

 

The View From The Middle

 

James called Amanda to his office. She went in feeling hopeful and looking forward to the new year. James and Amanda sat down with one of the other Board members to discuss some items related to Amanda's desire to cross-train the departments, as well as improving change controls and documentation. James waited until Amanda had gone through her list of discussion items. Afterwards, he opened with, "These are some great ideas Amanda, and I know you want to bring them to the CTO. However, I just got word from him that he's going be moving on at the end of the year to take a position in a different company. You're one of the first people outside the Board to know."

 

Amanda was a bit shocked by this. She had no idea the CTO was ready to move on. She said, "That's great for him! Leaves us in a bit of a tough spot though. Do you have someone in mind to take his spot? Mike has been here for quite a while and would make a great candidate." James chuckled as he glanced over at the Board member in the room. He offered, "I told you she was going to suggest Mike. You owe me $5."

 

James turned back to Amanda and said, "I know that Mike has been here for quite a while. He's pretty good at what he does but I don't think he's got what it takes to make it as the CTO. He's still got that idea that the storage stuff is the most important part of this business. He can't see past the end of his desk sometimes." James continued, "No, I think we're going to be opening up applications for the CTO position outside the company. There are some great candidates out there that have some experience and ideas that could be useful to the company."

 

Amanda nodded her head in agreement with James's idea.

 

James then said, "However, that doesn't fix our problem of going without a CTO in the short term. We need someone that has proven that they have visibility across the IT organization; that they can respond well to problems and get them fixed while also having the ability to keep the board updated on the situation."

 

James grinned widely as he slid a folder across the table to Amanda. He said, "That's why the board and I want you to step in as the Interim CTO until we can finish interviewing candidates. Those are some big shoes to fill, but you have our every confidence. You also have the support of the IT department heads. After the way you helped them with the various problems throughout the year, they agreed that they would like to work with you for the time being. We’ll get some professional development scheduled for you as soon as possible. If you’re going to be overseeing the CTO’s office for now, we want to help you succeed with the kind of training that you’ll need. It’s not something you get fixing networks every day, but you’ll find it useful in your new role when dealing with the other department heads."

 

Amanda was speechless. It took her a few moments to find her own words. She thanked James profusely. She said, "Thank you for this! I think it's going to be quite the challenge but I know that I can help you keep the IT department working while you interview for a new CTO. I won't let you down."

 

James replied, "That's exactly what I wanted to hear. And I fully expect to see your application in the pile as well. There's nothing stopping us for taking that "interim" title away if you're the right person for the job. Show us what you're capable of and we'll evaluate you just like the other candidates. Your experience so far shows that you've got a lot of the talents that we're looking for."

 

As Amanda stood up to leave with her new title and duties, James called after her, "Thanks for being a part of our team Amanda. You've done a great job around here and helped show me that it's not always the network."

The story so far:

 

  1. It's Not Always The Network! Or is it? Part 1 -- by John Herbert (jgherbert)
  2. It's Not Always The Network! Or is it? Part 2 -- by John Herbert (jgherbert)
  3. It's Not Always The Network! Or is it? Part 3 -- by Tom Hollingsworth (networkingnerd)
  4. It's Not Always The Network! Or is it? Part 4 -- by Tom Hollingsworth (networkingnerd)
  5. It's Not Always The Network! Or is it? Part 5 -- by John Herbert (jgherbert)
  6. It's Not Always The Network! Or is it? Part 6 -- by Tom Hollingsworth (networkingnerd)

 

What happens when your website goes down on Black Friday? Here's the seventh installment, by John Herbert (jgherbert).

 

The View From Above: James, CEO

 

It's said, somewhat apocryphally, that Black Friday is so called because it's the day where stores sell so much merchandise and make so much money that it finally puts them 'in the black' for the year. In reality, I'm told it stems from the terrible traffic on the day after Thanksgiving which marks the beginning of the Christmas shopping season. Whether it's high traffic or high sales, we are no different from the rest of the industry in that we offer some fantastic deals to our consumer retail customers on Black Friday through our online store. It's a great way for us to clear excess inventory, move less popular items, clear stock of older models prior to a new model launch, and to build brand loyalty with some simple, great deals.

 

The preparations for Black Friday began back in March as we looked ahead to how we would cope with the usual huge influx of orders both from an IT perspective and in terms of the logistics of shipping so many orders that quickly. We brought in temporary staff for the warehouse and shipping operations to help with the extra load, but within the head office and the IT organization it's always a challenge to keep anything more than a skeleton staff on call and available, just because so many people take the Friday off as a vacation day.

 

I checked in with Carol, our VP of Consumer Retail, about an hour before the Black Friday deals went live. She confirmed that everything was ready, and the online store update would happen as planned at 8AM. Traffic volumes to the web site were already significantly increased (over three times our usual page rate) as customers checked in to see if the deals were visible yet, but the systems appeared to be handling this without issue and there were no problems being reported. I thanked her and promised to call back just after 8AM for an  initial update.

 

When I called back at about 8.05AM, Carol did not sound happy. "Within a minute of opening up the site, our third party SLA monitoring began alerting that the online store was generating errors some of the time, and for the connections that were successful, the Time To First Byte (how long it takes to get the first response content data back from the web server) is varying wildly." She continued, "It doesn't make sense; we built new servers since last year's sale, we have a load balancer in the path, and we're only seeing about 10% higher traffic than last year, and we had no trouble then." I asked her who she had called, and I was relieved to hear that Amanda had been the first to answer and was pulling in our on call engineers from her team and others to cover load balancing, storage, network, database, ecommerce software, servers, virtualization and security. This would be an all hands on deck situation until it was resolved, and time was not on the team's side. Heaven only knows how much money we were losing in sales every minute the site was not working for people.

 

The View From The Trenches: Amanda (Sr Network Manager)

 

So much for time off at Thanksgiving! Black Friday began with a panicked call from Carol about problems with the ecommerce website; she said that they had upgraded the servers since last year, so she was convinced that it had to be the network that was overloaded and that this was what was causing the problems. I did some quick checks in SolarWinds and confirmed that there were not any link utilization issues, so it really had to be something else. I told Carol that I would pull together a team to troubleshoot, and I set about waking up engineers across a variety of technical disciplines so we could make sure that everybody was engaged.

 

I asked the team to gather a status on their respective platforms and report back to the group. The results were not promising:

  • Storage: no alerts
  • Network: no alerts
  • Security: no alerts relating to capacity (e.g. session counts / throughput)
  • Database: no alerts, CPU and memory a little higher than normal but not resource-limited.
  • Load Balancing: No capacity issues showing.
  • Virtualization: All looks nominal.
  • eCommerce: "The software is working fine; it must be the network."

 

I had also asked for a detailed report on the errors showing up with our SLA measurement tool so we knew what our customers might be seeing. Surprisingly, rather than outright connection failures, the tool reported receiving a mixture of 504 (Gateway Timeout) errors and TCP resets after the request was sent. That information suggested that we should look more closely at the load balancers, as a 504 error occurs when the load balancer can't get a response from the back end servers in a reasonable time period. As for the hung sessions, that was less clear. Perhaps there was packet loss between the load balancer and those servers causing sessions to time out?

 

The load balancer engineers dug into the VIP statistics and were able to confirm that they did indeed see incrementing 504 errors being generated, but they didn't have a root cause yet. They also revealed that of the 10 servers behind the ecommerce VIP, one of them was taking fewer sessions over time than the others, although its peak concurrent session load was roughly the same as the other servers'. We ran more tests against the website ourselves but were only able to see 504 errors, never a hung/reset session. We decided therefore to focus on the 504 errors that we could replicate. The client-to-VIP communication was evidently working fine, because after a short delay the 504 error was returned to us without any problems, so I asked the engineers to focus on the communication between the load balancer and the servers.

 

Packet captures of the back end traffic confirmed the strange behavior. Many sessions were establishing without problem, while others worked but with a large time to first byte. Others still got as far as completing the TCP handshake, sending the HTTP request, then would get no response back from the server. We captured again, this time including the client-side communication, and we were able to confirm that these unresponsive sessions were the ones responsible for the 504 error generation. But why were the sessions going dead? Were the responses not getting back for some reason? Packet captures on the server showed that the behavior we had seen was accurate; the server was not responding. I called on the server hardware, virtualization and ecommerce engineers to do a deep dive on their systems to see if they could find a smoking gun.

 

Meanwhile the load balancer engineers took captures of TCP sessions to the one back end server which had the lower total session count. They were able to confirm that the TCP connection was established ok, the request was sent, then after about 15 seconds the web server would send back a TCP RST and kill the connection. This was different behavior from the other servers, so there were clearly two different problems going on. The ecommerce engineer looked at the logs on the server and commented that their software was reporting trouble connecting to the application tier, and the hypothesis was that when that connection failed, the server would generate a RST. But again, why? Packet captures of the communication to the app tier showed an SSL connection being initiated, then as the client sent its certificate to the server, the connection would die. One of my network engineers, Paul, was the one who figured out what might be going on. "That sounds a bit like something I've seen when you have a problem with BGP route exchange," he said. "The TCP connection might come up, then as soon as the routes start being sent, it all breaks. When that happens, it usually means we have an MTU problem in the communication path which is causing the BGP update packets to be dropped."

 

Sure enough, once we started looking at MTU and comparing the ecommerce servers to one another, we discovered that the problem server had a much larger MTU than all the others. Presumably when it sent the client certificate, it maxed out the packet size, which caused the packet to be dropped. We could figure out why later, but for now, tweaking the MTU to match the other servers resolved that issue and let us focus back on the 504 errors which the other engineers were looking at.
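
On a Linux host, the comparison and the temporary fix amount to something like the commands below; the interface name and the 1500-byte value are assumptions for illustration, since the right numbers depend on the build.

# Check the configured MTU on each server (interface name is an assumption)
ip link show eth0 | grep -i mtu

# Bring the odd server back in line with its peers (1500 is an illustrative value)
ip link set dev eth0 mtu 1500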

 

Thankfully, the engineers were working well together, and they had jointly come up with a theory. They explained that the web servers ran Apache, and used something called prefork. The idea is that rather than waiting for a connection to come in before forking a process to handle its communication, Apache creates some processes ahead of time and uses those for new connections because they're already waiting. The configuration specifies how many processes should be pre-forked (hence the name), the maximum number of processes that can be forked, and how many spare processes to keep over and above the number of active, connected processes. They pointed out that completing a TCP handshake does not mean Apache is ready for the connection, because that's handled by the TCP/IP stack before being handed off to a process. They added that they actually used TCP offload, so that whole exchange was taking place on the NIC, not even on the server CPU itself.

 

So what if the session load meant that the Apache forking process could not keep up with the number of inbound sessions? TCP/IP would connect regardless, but only those sessions able to find a forked process could continue to be processed. The rest would wait in a queue for a free process, and if one could not be found in time, the load balancer would decide that the connection was dead and would issue a 504. When they checked the Apache configuration, however, not only was the number of preforked processes low, but the maximum was nowhere near where we would have expected it to be, and the number of 'spare' processes was set to only 5. The end result was that when there was a burst of traffic, we quickly hit the maximum number of processes on the server, so new connections were queued. Some connections got lucky and were attached to a process before timing out; others were not so lucky. The heavier the load, the worse the problem got. When there was a lull in traffic, the server caught up again, but with only 5 spare processes ready to go, the next burst meant connections were delayed again while new processes were forked. I had to shake my head at how they must have figured this out.
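
To give a sense of what they were looking at, the relevant knobs live in Apache's mpm_prefork configuration. The block below is only a sketch with illustrative numbers, not our production values:

# Illustrative Apache 2.4 mpm_prefork tuning; the values are examples only
<IfModule mpm_prefork_module>
    # Processes forked at startup, ready before the first burst of traffic
    StartServers          50
    # Never let the idle pool drop below this many processes during a lull
    MinSpareServers       25
    # Keep a healthy idle pool ready for the next burst
    MaxSpareServers       75
    # Hard ceiling on the number of worker processes
    ServerLimit          512
    # Maximum simultaneous connections Apache will service
    MaxRequestWorkers    512
</IfModule>

The spare-server settings are the ones that matter for bursty traffic: they control how many processes are sitting idle and ready when the next wave of connections arrives, which was exactly where our configuration fell short.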

 

Their plan of attack was to increase the maximum process count and the spare process count on one server at a time. We'd lose a few active sessions as each server was restarted, but avoiding those 504 errors would be worth it. They started on the changes, and within 10 minutes we had confirmed that the errors had disappeared.

 

I reported back to Carol and to James that the issues had been resolved, and when I got off the phone with them, I asked the team to look at two final issues:

 

  1. Why did we not see any session RST problems when we tested the ecommerce site ourselves; and
  2. Why did PMTUD (Path MTU Discovery) not automatically work around the MTU problem on the app tier connection?

 

It took another thirty minutes, but finally we had answers. The security engineer had been fairly quiet on the call so far, but he was able to answer the second question. There was a firewall between the web tier and the app tier, and the firewall had an MTU matching the other servers. However, it was also configured not to allow through, nor to generate, the ICMP messages indicating an MTU problem. We had shot ourselves in the foot by blocking the very mechanism which would have detected the MTU issue and worked around it! For the RST issue, one of my engineers came up with the answer again. He pointed out that while we were using the VPN to connect to the office, our browsers had to use the web proxy to access the Internet, and thus our ecommerce site (another Security rule!). The proxy made all our sessions appear to come from a single source IP address, and through bad luck if nothing else, the load balancer had chosen one of the 9 working servers, then kept using that same server because it was configured with session persistence (sometimes known as 'sticky' sessions).
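
Our firewall is a commercial appliance with its own syntax, but for illustration, the allowance we were missing is roughly the equivalent of the iptables rule below, which lets the "fragmentation needed" messages that PMTUD depends on pass through:

# PMTUD relies on ICMP type 3, code 4 (fragmentation needed) reaching the sender;
# this is an illustrative iptables equivalent of the allowance our appliance was missing
iptables -A FORWARD -p icmp --icmp-type fragmentation-needed -j ACCEPT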

 

I'm proud to say we managed to get all this done within an hour. Given some of the logical leaps necessary to figure this out, I think the whole team deserve a round of applause. For now though, it's back to turkey leftovers, and a hope that I can enjoy the rest of the day in peace.

 

 

>>> Continue to the conclusion of this story in Part 8

The story so far:

 

  1. It's Not Always The Network! Or is it? Part 1 -- by John Herbert (jgherbert)
  2. It's Not Always The Network! Or is it? Part 2 -- by John Herbert (jgherbert)
  3. It's Not Always The Network! Or is it? Part 3 -- by Tom Hollingsworth (networkingnerd)
  4. It's Not Always The Network! Or is it? Part 4 -- by Tom Hollingsworth (networkingnerd)
  5. It's Not Always The Network! Or is it? Part 5 -- by John Herbert (jgherbert)

 

Things always crop up when you least expect them, don't they? Here's the sixth installment, by Tom Hollingsworth (networkingnerd).

 

The View From Above: James, CEO

 

One of the perks of being CEO is that I get to eat well. This week was no exception, and on Tuesday night I found myself at an amazing French restaurant with the Board of Directors. The subject of our recent database issues came up, and the rest of the Board expressed how impressed they were with the CTO's organization, in particular the technical leadership and collaboration shown by Amanda. It's unusual that they get visibility of an individual in that way, so she has clearly made a big impact. Other IT managers have also approached me and told me how helpful she is; I think she has a great career ahead of her here. As dessert arrived and the topic of conversation moved on, I felt my smartwatch buzzing as a text message came in. I glanced down at my wrist and turned pale at the first lines of the message on the screen:

 

URGENT! We have a security breach...

 

I excused myself from the table and made a call to find out more. The news was not good. Apparently, we had been sent a message saying that our customer data had been obtained, and that it would be made available on the black market if we didn't pay a pretty large sum of money. It made no sense; we have some of the best security tools out there, and we follow all those compliance programs to the letter. At least, I thought we did. How did this data get out? More to the point, would we be able to avoid paying the ransom? And even if we paid it, would the data be sold anyway? If this gets out, the damage to our reputation alone will cause us to lose new business, and I dread to think how many of our affected customers won't trust us with their data any more. The security team couldn't answer my questions, so I hung up and made another call, this time to Amanda.

 

 

The View From The Trenches: Amanda (Sr Network Manager)

 

I used to flinch every time I picked up phone calls from James. Now I can't help but wonder what problem he wants me to solve next. I must admit that I'm learning a lot more about the IT organization around here and it's making my ship run a lot tighter. We're documenting more quickly and anticipating the problems before they happen, and we have the Solarwinds tools to thank for a large portion of that. So I was pretty happy to answer a late evening call from James earlier this week, but this call was different. The moment he started speaking I knew something bad had happened, but I wasn't expecting to hear that our customer data had been stolen and was being ransomed. How far did this go? Did they just take customer data, or have they managed to extract the whole CRM database?

 

It's one thing to be fighting a board implementing bad ideas, but fighting hackers? This is huge! We're about to be in for a lot of bad press, and James is going to be spending a lot of time apologizing and hoping we don't lose all our customers. James told me that I am part of the Rapid Response Team being set up by Vince, the Head of IT Security, and that I have the authority to do whatever I need to do to help them find out how to get this fixed. James said he's willing to pay the ransom if the team is unable to track down the breach, but he's worried that unless we find the source, he'll just be asked to pay again a week later. I grabbed my keys and drove to the office.

 

I had barely sat down at my desk when Vince ran into my office. He was panting as he fell into one of my chairs, and breathlessly explained the problem in more detail. The message from the hacker included an attachment - a 'sample' containing a lot of sensitive customer data, including credit card numbers and social security numbers. The hacker wanted thousands of dollars in exchange for not selling it on the black market, and there was a deadline of just two days. I asked Vince if he had verified the contents of the attachment. He nodded his head slowly. "There's no question about it. Somebody has access to our data."

 

I asked Vince when the last firewall audit had happened. Thankfully, Vince said that his team audited the firewalls about once a month to make sure all the rules were accurate. I smiled to myself that we finally had someone in IT who knew the importance of regular checkups. Vince told me that he kept things up to date just in case he had to pull together a PCI audit. I told him to put the firewalls on the back burner and think about how the data could have been exfiltrated. He told me he wasn't sure on that one. I asked if he had any kind of monitoring tool like the ones I used on the network. He told me that he had a Security Incident and Event Management (SIEM) tool budgeted for next year. Isn't that always the way? I told him it was time we tried something out to get some data about this breach fast. We only had a couple of days before the hacker's deadline, so we needed to get some idea of what was going on, and quickly.

 

While the security engineers on the Rapid Response team continued their own investigations, Vince and I downloaded the Solarwinds Log and Event Manager (LEM) trial and installed it on my Solarwinds server. It only took an hour to get up and running. We pointed it at our servers and other systems and had it start collecting log data. We decided to create some rules for basic things, like best practices, to help us sort through the mountain of data we just started digesting. Vince and I worked to put in the important stuff, like our business policies about access rights and removable media, as well as telling the system to start looking for any strange file behavior.

 

As we let the system do its thing for a bit, I asked Vince if the hacker could have emailed the files out of the network. He smiled and told me he didn't think that was possible because they had just finished installing Data Loss Prevention (DLP) systems a couple of months ago. It had caught quite a few people in accounting sending social security numbers in plain text emails, so Vince was sure that anything like that would have been caught quickly. I was impressed that Vince clearly knew what he was doing. He only took over as Head of IT Security about nine months back, and it seems like he has been transforming the team and putting in just the right processes and tools. His theory was that it was some kind of virus that was sending the data out over a covert channel. Being in networking, I often hear things being blamed on the latest virus of the week, so I reserved my judgement until we knew more. All we could do now was wait while LEM did its thing, and the other security engineers continued their efforts as well. By this time it was well after midnight, and I put on a large pot of coffee.

 

When morning came and people started to come into work, we looked at the results from the first run at the data. Vince noted a few systems which needed to be secured to fall completely within PCI compliance rules. There was nothing major found, though; just a couple of little configurations that were missed. As we scrolled down the list though, Vince found a potential smoking gun. LEM had identified a machine in sales that had some kind of unknown trojan. On the same screen, the software offered the option to isolate the machine until it could be fixed. We both agreed that it needed to be done, so we removed the network connectivity for the machine through the LEM interface until we could send a tech down to remove the virus in person. More and more people were coming online now, so perhaps one of those systems would provide another possible cause.

 

We kept pushing through the data; we were now 18 hours into the two-day deadline. I was looking over the list of things we needed to check on when a new event popped up on the screen. I scrolled up to the top and read through it. A policy violation had occurred in our removable device policy rule. It looked like someone had unplugged a removable USB drive from their computer, and the system was powered off right after that. I checked the ID on the machine: it was one of the sales admins. I asked Vince if they had a way of tracking violations of the USB device policy. He told me that there shouldn't have been any violations as they had set a group policy in AD to prevent USB drives from being usable. I asked him about this machine in particular. Vince knitted his eyebrows together as he thought about the machine. He told me he was sure that it was covered too, but we both decided to walk down and take a look at it anyway.

 

We booted up the machine, and everything looked fine as it did the usual POST and came up to the Windows login screen. Wait, though; the background for the login screen was wrong. We have a corporate image on our machines with the company logo as the wallpaper. It wasn't popular but it also prevented incidents with more colorful pictures ... like the one I was looking at right now. Wow. Somehow this user had figured out how to change their wallpaper. I wondered what else this could mean. Vince and I spent an hour combing through the system. There were lots of non-standard things we found; lots of changes that shouldn't have been possible with our group policies (including the USB device policy), and the browser history of the user was clean. Not just clean from a perspective of sites visited, but completely cleared. Vince and I started to think that this system's user was someone we wanted to chat with.
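
As an aside, one common way a "no USB storage" group policy is enforced is by disabling the Windows USB storage driver, so a quick sanity check on any machine is to look at the USBSTOR service's Start value and compare it against a known-good system; a Start value of 4 means the driver is disabled, 3 means it loads on demand.

REM Quick check of the USB storage driver state (Start=4 disabled, Start=3 load on demand)
reg query HKLM\SYSTEM\CurrentControlSet\Services\USBSTOR /v Start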

 

I called James and told him we had a couple of possibilities to check out. He asked us to get back to him quickly; he had notified the rest of the Board, and they were pushing to hear that we had a solution as soon as possible. Vince and I returned to my office and I scanned the SIEM tool for any new events while Vince contacted one of his team to arrange to have the suspect computer removed and re-imaged. Five minutes in, another event popped up. The same suspect system that had violated the group policy had triggered an event for the insertion of a USB drive. I printed out the event, and Vince and I hurried back to the sales office to find out who had turned the computer on. We found the user hard at work, typing away; until, that is, we walked up to his desk. A flurry of mouse clicks later, he was back at his desktop. Vince asked him if he had anything plugged into his computer that wasn't supposed to be there. The user, a young man called Josh, said that he didn't. Vince showed him the event printout showing a USB drive being plugged in to the computer, but Josh shook his head and said that he didn't know what that was all about.

 

Vince wasn't having any of it. He started asking the sales admin all about the unauthorized changes on the machine that violated the group policies in place on the network. The sales admin didn't have an answer. He started looking around and stammering a bit as he tried to explain it. Finally, Vince said that he'd had enough. It was obvious something was going on, and he wanted to get to the bottom of it. He told Josh to step away from the computer. Josh stood up and moved to the side, and Vince sat down at the computer, clicking around the system and looking for anything out of place. He glanced at the report from the Solarwinds SIEM tool, which showed that the drive had been mounted in a folder location rather than assigned a drive letter. As soon as he started clicking through the folder structure, Josh got visibly nervous. He kept inching closer to the chair and looked like he was about to grab the keyboard. When Vince clicked into the folder structure of the drive, his eyes got wide. Josh's head dropped and he stared resolutely at the carpet.

 

The post-mortem after that was actually pretty easy. Josh was the hacker who had stolen the information from our database. He had stored a huge amount of customer records on the USB drive and was adding more every day. He must have hit on the idea to ask us to pay for the records as a ransom, and he might have even been planning on selling them even if we paid up, although we'll never know. Vince's team analyzed the hard drive and found the exploits Josh had used to elevate his privileges enough to reverse the group policies that prevented him from reading and copying the customer data. We later found those privilege escalations in the mountain of data the SIEM collected. If we'd only had this kind of visibility before, we might have avoided this whole situation.

 

James came down to deal with the issue personally. Josh was pretty much frog-marched into a conference room, with James following close behind. The door slammed shut, and the ensuing muffled shouting gave me some uncomfortable flashbacks to the day that my predecessor, Paul, was fired. Then Sam from Human Resources arrived with two of our attorneys from Legal in tow, and half an hour later Josh was being escorted from the building. I'm not privy to exactly what the attorneys had Josh sign, but apparently he won't be making any noise about what he did.

 

From my perspective, I've built a really good relationship with the security team now, and of course, they've asked to keep Solarwinds Log and Event Manager. LEM paid for itself many times over this week, and there's no question that at some point it will help us avoid another crisis. For now though, James told Vince and me to take the rest of the week off. I'm not going to argue; I need some sleep!

 

 

>>> Continue reading this story in Part 7

The story so far:

 

  1. It's Not Always The Network! Or is it? Part 1 -- by John Herbert (jgherbert)
  2. It's Not Always The Network! Or is it? Part 2 -- by John Herbert (jgherbert)
  3. It's Not Always The Network! Or is it? Part 3 -- by Tom Hollingsworth (networkingnerd)
  4. It's Not Always The Network! Or is it? Part 4 -- by Tom Hollingsworth (networkingnerd)

 

Easter is upon the team before they know it, and they're being pushed to make a major software change. Here's the fifth installment, by John Herbert (jgherbert).

 

The View From Above: James (CEO)

 

Earlier this week we pushed a major new release of our supply chain management (SCM) platform into production internally. The old version simply didn't have the ability to track and manage our inventory flows and vendor orders as efficiently as we wanted, and the consequence has been that we've missed completing a few large orders in their entirety because we were waiting for critical components to be delivered. Despite the importance of this upgrade to our reputation for on-time delivery (not to mention all the other cost savings and cashflow benefits we can achieve by managing our inventory on a near real-time basis), the CTO had been putting it off for months because the IT teams were holding back on giving the OK. Finally the Board of Directors had had enough of the CTO's pushback; as a group we agreed that there had been more than enough time for testing, and the directive was issued that unless there were documented faults or errors in the system, IT should proceed with the new software deployment within the month.

 

We chose to deploy the software over the Easter weekend. That's usually a quieter time for our manufacturing facilities, as many of our customers close down for the week leading up to Easter. I heard grumbling from the employees about having to work on Easter, but there's no way around it. The software has to launch, and we have to do whatever we need to do to make that happen, even if that means missing the Easter Bunny.

 

The deployment appeared to go smoothly, and the CTO was pleased to report to the Board on Monday morning that the supply chain platform had been upgraded successfully over the weekend. He reported that testing had been carried out from every location, and every department had provided personnel to test their top 10 or so most common activities after the upgrade so that we would know immediately if a mission-critical problem had arisen. Thankfully, every test passed with flying colors, and the software upgrade was deemed a success. And so it was, until Tuesday morning when we started seeing some unexplained performance issues, and things seemed to be getting worse as the day progressed.

 

The CTO reported that he had put together a tiger team to start troubleshooting, and opened an ongoing outage bridge. This had the Board's eyes on it, and he couldn't fail now. I asked him to make sure Amanda was on that team; she has provided some good wins for us recently, and her insight might just make the difference. I certainly hope so.

 

The View From The Trenches: Amanda (Sr Network Manager)

 

With big network changes I've always had a rule for myself that just because the change window has finished successfully, it doesn't mean the change was a success, regardless of what testing we might have done. I tend to wait a period of time before officially calling the change a success, all the while crossing my fingers for no big issues to arise. Some might call that paranoia, and perhaps they are right, but it's a technique that has kept me out of trouble over time. This week has provided another case study for why my rule has a place when we make more complex changes.

 

Obviously I knew about the change over the Easter weekend; I had the pleasure of being in the office watching over the network while the changes took place. Solarwinds NPM made that pretty simple for me; no red means a quiet time, and since there were no specific reports of issues, I really had nothing to do. On Monday the network looked just fine as well (not that anybody was asking), but by Tuesday afternoon it was clear that there were problems with the new software, and the CTO pulled me into a war room where a group of us were tasked with finding the cause of the performance issues being reported with the new application.

 

There didn't seem to be a very clear pattern to the performance issues, and reports were coming in from across the company. On that basis we agreed to eliminate the wide area network (WAN) from our investigations, except at the common points, e.g. the WAN ingress to our main data center. The server team was convinced it had to be a network performance issue, but when I got them to do some ping tests from the application servers to various components of the application and the data center, responses were coming back in 1 or 2ms. NPM also still showed the network as clean and green, but experience has taught me not to dismiss any potential cause until we can disprove it by finding what the actual problem is, so I shared that information cautiously but left the door open for it to still be a network issue that simply wasn't showing in these tests.

 

One of the server team suggested perhaps it was an MTU issue. A good idea, but when we issued some pings with large payloads to match the MTU of the server interface, everything worked fine. MTU was never really a likely cause--if we had MTU issues, you'd have expected the storage to fail early on--but there's no harm in quickly eliminating it, and that's what we were able to do. We double checked interface counters looking for drops and errors in case we had missed something in the monitoring, but those were looking clean too. We looked at the storage arrays themselves as a possible cause, but checking Solarwinds Storage Resource Monitor we confirmed that there were no active alerts, there were no storage objects indicating performance issues like high latency, and there were no capacity issues, thanks to Mike using the capacity planning tool when he bought this new array!
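
For the record, a large-payload test on a Linux host is just a ping with the don't-fragment bit set; the payload size below assumes a standard 1500-byte interface MTU, and the target address is a placeholder.

# 1472 bytes of ICMP payload + 28 bytes of headers = a full 1500-byte frame;
# -M do sets the don't-fragment bit so any MTU mismatch shows up as a failure
ping -M do -s 1472 -c 5 10.0.0.10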

 

We asked the supply chain software support expert about the software's dependencies. He identified the key dependencies as the servers the application ran on, the NFS mounts to the storage arrays and the database servers. We didn't know about the database servers, so we pulled in a database admin and began grilling him. We discovered pretty quickly that he was out of his depth. The new software had required a shift from Microsoft SQL Server to an Oracle database. This was the first Oracle instance the DB team had ever stood up, and while they were very competent monitoring and administering SQL Server, the admin admitted somewhat sheepishly that he really wasn't that comfortable with Oracle yet, and had no idea how to see if it was the cause of our problems. This training and support issue is something we'll need to work on later, but what we needed right then and there was some expertise to help us look into Oracle performance. I was already heading to the Solarwinds website because I remembered that there was a database tool, and I was hopeful that it would do what we needed.

 

I checked the page for Solarwinds' Database Performance Analyzer (DPA), and it said: "Response Time Analysis shows you exactly what needs fixing - whether you are a database expert or not." That sounded perfect given our lack of Oracle expertise, so I downloaded it and began the installation process. It wasn't long before I had DPA monitoring our Oracle database transactions (checking them every second!) and starting to populate data and statistics. Within an hour it became clear what the problem was; DPA identified that the main cause of the performance problems was on database updates, where entire tables were being locked rather than using a more granular lock, like row-level locking. Update queries were being forced to wait while the previous query executed and released its lock on the table, and the latency in response was having a knock-on effect on the entire application. We had not noticed this over the weekend because the transaction loads were so low outside normal business hours that the problem never raised its head. But why didn't it happen on Monday? On a hunch I dug into NPM and looked at the network throughput for the application servers. As I had suspected, the Monday after Easter showed the servers handling about half the traffic that hit them on the Tuesday. At a guess, a lot of people took a 4-day weekend, and when they returned to work on Tuesday, that tipped the scales on the locking/blocking issue.
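
To illustrate the difference DPA was pointing at, compare an explicit whole-table lock with Oracle's default row-level behavior; the table and column names below are made up for the example, not taken from the SCM schema.

-- Hypothetical example: an explicit table lock serializes every writer on the table...
LOCK TABLE scm_orders IN EXCLUSIVE MODE;
UPDATE scm_orders SET status = 'SHIPPED' WHERE order_id = 42;
COMMIT;

-- ...whereas a plain update takes row-level locks and only blocks writers touching the same rows
UPDATE scm_orders SET status = 'SHIPPED' WHERE order_id = 42;
COMMIT;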

 

While we discussed this discovery, our supply chain software expert had been tapping away on his laptop. "You're not going to believe this," he said. "It turns out we are not the first people to find this problem. The vendor posted a HotFix for the query code about a week after this release came out, but I just checked, and we definitely do not have that HotFix installed. I don't know how we missed it, but we can get it installed overnight while things are quiet, and maybe we'll get lucky." I checked my watch; I couldn't believe it was 7.30PM already. We really couldn't get much more done that night anyway, so we agreed to meet at 9AM and monitor the results of the application of the HotFix.

The next morning we met as planned, and watched nervously as the load ramped up while each time zone came online. By 1PM we had hit a peak load exceeding Tuesday's peak, and not a single complaint had come in. Solarwinds DPA now indicated that the blocking issue had been resolved, and there were no other major alerts to deal with. Another bullet dodged, though this one was a little close for comfort. We prepared a presentation for the Board explaining the issues (though we tried not to throw the software expert under the bus for missing the HotFix), and presented a list of lessons learned / actions, which included:

 

  • Set up a proactive post-change war-room for major changes
  • Monitor results daily for at least one week for changes to key business applications
  • Provide urgent Oracle training for the database team (the accelerated schedule driven by the Board meant this did not happen in time)
  • Configure DPA to monitor our SQL Server installations too

 

We wanted to add another bullet saying "Don't be bullied by the Board of Directors into doing something when we know we aren't ready yet", but maybe that's a message best left for the Board to mull on for itself. Ok, we aren't perfect, but we can get better each time we make mistakes, so long as we're honest with ourselves about what went wrong.

 

 

>>> Continue reading this story in Part 6

The story so far:

 

  1. It's Not Always The Network! Or is it? Part 1 -- by John Herbert (jgherbert)
  2. It's Not Always The Network! Or is it? Part 2 -- by John Herbert (jgherbert)
  3. It's Not Always The Network! Or is it? Part 3 -- by Tom Hollingsworth (networkingnerd)

 

The holidays are approaching, but that doesn't mean a break for the network team. Here's the fourth installment of the story, by Tom Hollingsworth (networkingnerd).

 

The View From Above: James (CEO)

 

I'm really starting to see a turnaround in IT. Ever since I put Amanda in charge of the network, I'm seeing faster responses to issues and happier people internally. Things aren't being put on the back burner until we yell loud enough to get them resolved. I just wish we could get the rest of the organization to understand that.

 

Just today, I got a call from someone claiming that the network was running slow again when they tried to access one of their applications. I'm starting to think that "the network is slow" is just code to get my attention after the unfortunate situation with Paul. I decided to try and do a little investigation of my own. I asked this app owner if this had always been a problem. It turns out that it started a week ago. I really don't want to push this off on Amanda, but a couple of my senior IT managers are on vacation and I don't have anyone else I can trust. But I know she's going to get to the bottom of it.

 

 

The View From The Trenches: Amanda (Sr Network Manager)

 

Well, that should have been expected. At least James was calm and polite. He even told me that he'd asked some questions about the problem and got some information for me. I might just make a good tech out of the CEO after all!

 

James told me that he needed my help because some of the other guys had vacation time they had to use. I know that we're on a strict change freeze right now, so I'm not sure who's getting adventurous. I hope I don't have to yell at someone else's junior admin. I decided I needed to do some work to get to the bottom of this. The app in question should be pretty responsive. I figured I'd start with the most basic of troubleshooting - a simple ping. Here's what I found out:

 

icmp_seq=0 time=359.377 ms

icmp_seq=1 time=255.485 ms

icmp_seq=2 time=256.968 ms

icmp_seq=3 time=253.409 ms

icmp_seq=4 time=254.238 ms

 

Those are terrible response times! It's like the server is on the other side of the world. I pinged other routers and devices inside the network to make sure the response times were within reason. A quick check of other servers confirmed that response times were in the single digits, not even close to the bad app. With response times that high, I was almost certain that something was wrong. Time to make a phone call.

 

Brett answered when I called the server team. I remember we brought him on board about three months ago. He's a bit green, but I was told he's a quick learner. I hope someone taught him how to troubleshoot slow servers. Our conversation started off as well as could be expected. I told him what I had found and that the ping time was abnormal. He said he'd check on it and call me back. I decided to go to lunch and then check in on him when I finished. That should give him enough time to get a diagnosis. After all, it's not like the whole network was down this time, right?

 

I got back from lunch and checked in on Brett The New Guy. When I walked in, he was massaging his temples behind a row of monitors. When I asked what was up, he sighed heavily and replied, "I don't know for sure. I've been trying to get into the server ever since you called. I can communicate with vCenter, but trying to console into the server takes forever. It just keeps timing out."

 

I told Brett that the high ping time probably meant that session setup was taking forever, and any lost packets would just make the problem worse. I started talking through things at Brett's desk. Could it be something simple? What about the other virtual machines on that host? Are they all having the same problem?

 

Brett shrugged his shoulders. His response, "I'm not sure? How do I find out where they are?"

 

I stepped around to his side of the desk and found a veritable mess. Due to the way the VM clusters were set up, there was no way of immediately telling which physical host contained which machines. They were just haphazardly thrown into resource pools named after comic book characters. It looked like this app server belonged to "XMansion" but there were a lot of other servers under "AsteroidM". I rolled my eyes at the fact that my network team had strict guidelines about naming things so we could find them at a glance, yet the server team could get away with this. I reminded myself that Brett wasn't to blame and kept digging.

 

It took us nearly an hour before we even found the server. In El Paso, TX. I didn't even know we had an office in El Paso. Brett was able to get his management client to connect to the server in El Paso and saw that it contained exactly one VM - The Problem App Server. We looked at what was going on and figured that it would work better if we moved it back to the home office where it belonged. I called James to let him know we fixed the problem and that he should check with the department head. James told me to close the ticket in the system since the problem was fixed.
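
For anyone curious, a vMotion like that can be kicked off from PowerCLI with something like the following; the vCenter, VM, and host names here are placeholders, not our real inventory.

# PowerCLI sketch of moving the VM back to a host at HQ; all names are placeholders
Connect-VIServer vcenter.example.com
Move-VM -VM 'problem-app-server' -Destination (Get-VMHost 'hq-esxi-01.example.com')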

 

I hung up Brett's phone. Brett spun his chair back to his wall of monitors and put a pair of headphones on his head. I could hear some electronic music blaring away at high volume. I tapped Brett on the shoulder and told him, "We're not done yet. We need to find out why that server was halfway across the country."

 

Brett stopped his music and we dug into the problem. I told Brett to take lots of notes along the way. As we unwound the issues, I could see the haphazard documentation and architecture of the server farm was going to be a bigger problem to solve down the road. This was just the one thing that pointed it all out to us.

 

So, how does a wayward VM wind up in the middle of Texas? It turns out that the app was one of the first ones ever virtualized. It had been running on an old server that was part of a resource pool called "SavageLand". That pool only had two members: the home server for the app and the other member of the high availability pair. That HA partner used to be here in the HQ, but when the satellite office in El Paso was opened, someone decided to send the HA server down there to get things up and running. Servers had been upgraded and moved around since then, but no one documented what had happened. The VMs just kept running. When something would happen to a physical server, HA allowed the machines to move and keep working.

 

The logs showed that last week, the home server for the app had a power failure. It rebooted about ten minutes later. HA decided to send the app server to the other HA partner in El Paso. The high latency was being caused by a traffic trombone. The network traffic was going to El Paso, but the resources the server needed to access were back here at the HQ. So the server had to send traffic over the link between the two offices, listen for the response, and then send it back over the link. Traffic kept bouncing back and forth between the two offices, which saturated the link. I was shocked that the link was even fast enough to support the failover in the first place; according to Brett's training manuals, it barely met the minimum. We were both amused that the act of failing the server over to the backup caused more problems than just waiting for the old server to come back up.

 

Brett didn't know enough about the environment to know all of this. And he didn't know how to find the answers. I made a mental note to talk to James about this at the next department meeting after everyone was back from vacation. I hoped they had some kind of documentation for that whole mess. Because if they didn't, I was pretty sure I knew where I could find something to help them out.

 

 

>>> Continue reading this story in Part 5
