
Geek Speak

SomeClown

The Full Stack Engineer

Posted by SomeClown Mar 9, 2017

One of the hot topics in the software engineering world right now is the idea of what is being called the “full stack developer.” In this ecosystem, a full stack developer is someone who, broadly speaking, is able to write applications that encompass the front-end interface, the core logic, and the database backend. In other words, where it is common for engineers to specialize in one area or another, becoming very proficient in one or two languages, a full stack engineer is proficient in many languages, systems, and modalities so that they can, given the time, implement an entire non-trivial application from soup to nuts.

 

In the network engineering world, this might be analogous to an engineer who has worked in many roles over a lifetime and has developed skills across the board in storage, virtualization, compute, and networking. Such an engineer has likely worked in mid-level organizations where silos are not as prevalent or necessary as in larger organizations. Many times these are the engineers who become senior-level architects in their organizations, or eventually move into some type of consulting role, working for many organizations as a strategist. This is what I’ll call the full stack IT engineer.

 

While the skills and background needed to get to this place in your career likely put you into the upper echelon of your cohort, there can be some pitfalls. The first is the risk of ending up with a skillset that is very broad, but not very deep. Being able to talk about a very wide scope and scale of the IT industry, based on honest, on-the-ground experience, is great, but it also becomes difficult to maintain a deep level of skill in every area of IT. In the IT industry, however, I do not see this as a weakness per se. If you’ve gotten to this level of skill and experience, you are hopefully in a more consultative role, and aren’t being called on to put hands on keyboard daily. The value you bring at this level is that of experience, and the ability to see the whole chess board without getting bogged down in any one piece.

 

The other pitfall along the road to becoming a full stack engineer is the often overlooked aspect of training, whether on the job or on your own. If you are not absolutely dedicated to your craft, you will never, quite frankly, get to this level in your career. You’re going to be doing a daily job, ostensibly focused on less than the full spectrum of technologies. While you may move into those areas later, how do you learn today? And when you do move into other technologies, how do you keep the skills of today fresh for tomorrow? Honestly, the only way you’ll get there is to study, study, study pretty much all of the time. You have to become a full-time student and develop a true passion for learning. Basically, the whole stack has to become your focus, and if you’re only seeing part of it at work, you have to find the rest at home.

 

What does all of this have to do with SolarWinds and PerfStack? Simple: troubleshooting using a—wait for it—full stack solution is going to expose you to how the other half lives. Since PerfStack allows, and encourages, dashboards (PerfStacks) to be passed along as troubleshooting ensues, you should have some great visibility into the challenges and remedies that other teams see. If you’re a virtualization engineer and get handed a problem to work, presumably the network team, data center team, facilities, and probably storage have all had a hand in ascertaining where the problem does or does not lie. Pay attention to that detail, study it, and ask questions as you get the opportunity. Make time to ask what certain metrics mean, and why one metric is more important than another. Ask the voice guys whether jitter or latency is worse for their systems, or the storage guys why IOPS matter. Ask the VM team why virtual machine mobility generally needs 10ms or less of link latency, or why they want stretched Layer 2 between data centers.

 

It may seem banal to equate the full stack IT engineer with a troubleshooting product (even one as great as PerfStack), but the reality is that you have to take advantage of everything that is put in front of you if you want to advance your career. You’re going to be using these tools on a regular basis anyhow, so why not take advantage of what you have? Sure, learn the tool for what it’s designed for, and learn your job to the best of your ability, but also look for opportunities like these to advance your career and become more valuable both to your current team and to the next team you’re on, whether at your current company or a new one down the road.

It was a busy week for service disruptions and security breaches. We had Amazon S3 showing us that, yes, the cloud can go offline at times. We found out that our teddy bears may be staging an uprising. And we learned that Uber has decided to use technology and data to continue operating illegally in cities and towns worldwide. Not a good week for those of us who enjoy having data safe, secure, and available.

 

So, here's a handful of links from the intertubz I thought you might find interesting. Enjoy!

 

Data from connected CloudPets teddy bears leaked and ransomed, exposing kids' voice messages

The ignorance (or hubris) of the CloudPets CEO is on full display here. I am somewhat surprised that anyone could be this naive with regard to security issues these days.

 

Yahoo CEO Loses Bonus Over Security Lapses

Speaking of security breaches, Yahoo is in the news again. You might think that losing $2 million USD would sting a bit, but considering the $40 million she gets for running Yahoo into the ground I think she will be okay for the next few years, even with living in the Valley.

 

Hackers Drawn To Energy Sector's Lack Of Sensors, Controls

I'd like to think that someone, somewhere in our government, is actively working to keep our grid safe. Otherwise, it won't be long before we start to see blackouts as a result of some bored teenager on a lonely summer night.

 

Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region

Thanks to a typo, the Amazon S3 service was brought to a halt for a few hours last week. In the biggest piece of irony, the popular website Is It Down Right Now? Website Down or Not? was, itself, down as a result. There's a lot to digest with this outage, and it deserves its own post at some point.

 

How Uber Deceives the Authorities Worldwide

I didn't wake up looking for more reasons to dislike how Uber is currently run as a business, but it seems that each week they reach a new low.

 

Thirteen thousand, four hundred, fifty-five minutes of talking to get one job

A bit long, but worth the read as it helps expose the job hiring process and all the flaws in the current system used by almost every company. I've written about bad job postings before, as well as how interviews should not be a trivia contest, so I enjoyed how this post took a deeper look.

 

If the Moon Were Only 1 Pixel - A tediously accurate map of the solar system

Because I love things like this and I think you should, too.

 

Just a reminder that the cloud can, and does, go offline from time to time:

2017-03-06_11-15-52.jpg

 

Last week, Amazon Web Services S3 storage in the East region went offline for a few hours. Since then, AWS has published a summary review of what happened. I applaud AWS for their transparency, and I know that they will use this incident as a lesson to make things better going forward. Take a few minutes to read the review and then come back here. I'll wait.

 

Okay, so it's been a few days since the outage. We've all had some time to reflect on what happened. And some of us have decided that now is the time to put on our Hindsight Glasses and run down a list of lingering questions and comments regarding the outage.

 

Let's break this down!

 

"...we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years."

This, to me, is the most inexcusable part of the outage. Anyone who does business continuity planning will tell you that annual checks are needed on such playbooks. You cannot just wave that away with, "Hey, we've grown a lot in the past four years and so the playbook is out of date." Nope. Not acceptable.

 

"The servers that were inadvertently removed supported two other S3 subsystems."

The engineers were working on a billing system, and they had no idea that those billing servers would impact a couple of key S3 subsystems. Which raises the question, "Why are those systems related?" Great question! This reminds me of the age-old debate regarding dedicated versus shared application servers. Shared servers sound great until one person needs a reboot, right? No wonder everyone is clamoring for containers these days. Another few years and mainframes will be under our desks.

 

"Unfortunately, one of the inputs to the command was entered incorrectly, and a larger set of servers was removed than intended."

But the command was accepted as valid input, which means the code doesn't have any check to make certain that the command was indeed valid. This is the EXACT scenario that resulted in Jeffrey Snover adding the -WhatIf and -Confirm parameters to PowerShell. I'm a coding hack, and even I know the value in sanitizing your inputs. This isn't just something to prevent SQL injection. It's also to make certain that, as a cloud provider, you don't delete a large number, or percentage, of servers by accident.
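To make the idea concrete, here's a minimal Python sketch of the kind of guard rail I mean: validate the input, cap how much capacity can be removed in one step, and support a "what if" style dry run. The function, fleet names, and 5% threshold are my own illustrative assumptions, not AWS's actual tooling.

```python
# Hypothetical sketch of a guarded capacity-removal command (not AWS's real tool).
# It validates input, caps how much can be removed at once, and supports a
# "what if" style dry run before anything destructive happens.

MAX_REMOVAL_FRACTION = 0.05  # assumption: never remove more than 5% in one step


def remove_capacity(fleet, count, dry_run=True):
    """Remove `count` servers from `fleet`, with sanity checks."""
    if not isinstance(count, int) or count <= 0:
        raise ValueError(f"count must be a positive integer, got {count!r}")

    if count > len(fleet) * MAX_REMOVAL_FRACTION:
        raise ValueError(
            f"refusing to remove {count} of {len(fleet)} servers; "
            f"exceeds the {MAX_REMOVAL_FRACTION:.0%} safety cap"
        )

    victims = fleet[:count]
    if dry_run:
        print(f"[what-if] would remove {len(victims)} servers: {victims}")
        return []

    print(f"removing {len(victims)} servers: {victims}")
    return victims


# Example: a fat-fingered "500" instead of "5" is rejected instead of executed.
fleet = [f"s3-index-{i:03d}" for i in range(1000)]
remove_capacity(fleet, 5, dry_run=True)   # safe preview
# remove_capacity(fleet, 500)             # raises ValueError
```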

 

"While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly."

So, they never asked themselves "What if?" along with the question "Why?" These are my favorite questions to ask when designing, building, or modifying systems. The 5 Whys is a great tool for finding the root cause, and asking "what if" helps you build better systems that avoid the need for root cause reviews in the first place.

 

"We will also make changes to improve the recovery time of key S3 subsystems."

Why wasn't this a thing already? I cannot understand how AWS would get to the point that it would not have high availability already built into their systems. My only guess here is that building such systems costs more, and AWS isn't interested in things costing more. In the race to the bottom, corners are cut, and you get an outage every now and then.

 

"...we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3."

The AWS dashboard for the East Region was dependent upon the East Region being online. Just let that sink in for a bit. Hey, AWS, let me know if you need help with monitoring and alerting. We'd be happy to help you get the job done.

 

"Other AWS services in the US-EAST-1 Region that rely on S3 for storage...were also impacted while the S3 APIs were unavailable."

Many companies that rely on AWS to be up and running were offline. My favorite example: the popular website Is It Down Right Now? Website Down or Not? was itself down as a result of the outage. If you migrate your apps to the cloud, you need to take responsibility for availability. Otherwise, you run the risk of being down with no way to get back up.

 

Look, things happen. Stuff breaks all the time. The reason this was such a major event is because AWS has done amazing work in becoming the largest cloud provider on the planet. I'm not here to bury AWS, I'm here to highlight the key points and takeaways from the incident to help you make things better in your shop. Because if AWS, with all of its brainpower and resources, can still have these flaws, chances are your shop might have a few, too. 

I have been talking about the complexity of resolving performance issues in modern data centers, particularly about how it is a multi-dimensional problem and how virtualization significantly increases the number of dimensions for performance troubleshooting. My report of having been forced to use Excel to coordinate the data brought some interesting responses. Excel is, indeed, a very poor tool for consolidating performance data.

 

I have also written in other places about management tools that are focused on the data they collect, rather than helping to resolve issues. What I really like about PerfStack is the ability to use the vast amount of data in the various SolarWinds tools to identify the source of performance problems.

 

The central idea in PerfStack is to gain insights across all of the data that is gathered by the various SolarWinds products. Importantly, PerfStack allows the creation of ad hoc collections of performance data. Performance graphs for multiple objects and multiple resource types can be stacked together to identify correlation. My favorite part was adding multiple different performance counters from the different layers of infrastructure to a single screen. This is where I had the Excel flashback, only here the consolidation is done programmatically. No need for me to make sure the time series match up. I loved that the performance graphs re-drew in real time as new counters were added. Even better, the re-draw was fast enough that counters could be added on the off chance that they were relevant. When they are not relevant, they can simply be removed. The hours I wasted building Excel graphs translate into minutes of building a PerfStack workspace.

 

I have written elsewhere about systems management tools that get too caught up in the cool data they gather. These tools typically have fixed dashboards that give pretty overviews. They often cram as much data as possible into one screen. What I tend to find is that these tools are inflexible about the way the data is combined. The result is a dashboard that is good at showing that everything is, or is not, healthy, but does not help a lot with resolving problems. The dynamic nature of the PerfStack workspace lends itself to getting insight out of the data and helping identify the root cause of problems. Being able to quickly assemble the data on the load on a hypervisor and the VM operating system, as well as the application statistics, speeds troubleshooting. The ability to quickly add performance counters for the other application dependencies lets you pinpoint the cause of the issue. It may be that the root cause is a domain controller that is overloading its CPU, while the symptom is a SharePoint server that is unresponsive.

 

PerfStack allows very rapid discovery of issue causes. The value of PerfStack will vastly increase as it is rolled out across the entire SolarWinds product suite.

 

You can see the demonstrations of PerfStack that I saw at Tech Field Day on Vimeo: NetPath here and SAM here.

As IT professionals, we have a wide variety of tools at our disposal for any given task. The same can be said for the attackers behind the increasing strength and number of DDoS attacks. The latest trend of hijacked IoT devices, as seen with the Mirai botnet, deserves a lot of attention because of its prevalence and ability to scale, mostly due to a lack of security and basic protections. This is the fault of both manufacturers and consumers. However, DDoS attacks at scale are not really a new thing, because malware-infected zombie botnets have been around for a while. Some fairly old ones are still out there, and attackers don’t forget their favorites.

 

One of the largest attacks in 2016 came in October, and measured in at 517 Gbps. This attack was not a complex, application-layer hack or a massive DNS reflection, but a flood generated by malware that has been around for more than two years, called Spike. Spike is commonly associated with x86 Linux-based devices (often routers with unpatched vulnerabilities), and is able to generate large amounts of application-layer HTTP traffic. While Mirai and other IoT botnets remained top sources of DDoS traffic in 2016, they were not alone.

 

Complexity

 

The complexity of these attacks continues to evolve. What used to be simple volumetric flooding of UDP traffic has moved up the stack over time. Akamai reports that between Q4 2015 and Q4 2016 there was a 6% increase in infrastructure layer attacks (Layers 3 and 4), and a 22% increase in reflection-based attacks. At the same time, while overall web application attacks decreased, there was a 33% increase in SQLi attacks.

 

Application layer attacks are increasingly difficult to mitigate because of their ability to mimic real user behavior. They are harder to identify, and often carry larger payloads. They are also frequently combined with other lower-level attacks for variety and a larger attack surface. This requires vigilance on the part of those responsible for the infrastructure we rely on, to protect against all possible attack vectors.

 

Source

 

Not surprising is the fact that China and the United States are the primary sources of DDoS attacks, with China dominating Q1, Q2, and Q3 of 2016. The United States “beat” China in Q4, spiking to 24% of global DDoS traffic for that quarter. The increase in the number of source IP addresses here is dramatic, with the U.S. numbers leaping from about 60K in Q3 to 180K in Q4. This is largely suspected to be due to a massive increase in IoT (Mirai) botnet sources. Black Friday sales, perhaps?

 

While attacks evolve, becoming larger and more complex, some simple, tried-and-true methods of disrupting the internet can still be useful. Old tools can become new again. Reports from major threat centers consistently show that Conficker is still one of the most prevalent malware variants in the wild, and it has been around since 2008.

 

Malware is often modeled after real biological viruses, like the common cold, and it is not easily eliminated. A handful of infected machines can re-populate and re-infect thousands of others in short order, and this is what makes total elimination a near impossibility.

 

There is no vaccine for malware, but what about treating the symptoms?

 

A concerted effort is required to combat the looming and very real threat these DDoS attacks pose. Manufacturers of infrastructure products, consumer IoT devices, and mobile phones, along with service providers, enterprise IT organizations, and even the government, are on the case. Each must actively do their part to defend against, protect from, and identify sources of malware to slow the pace of this growing problem.

 

The internet is not entirely broken, but it is vulnerable to the exponential scale of the DDoS threat.

By Joe Kim, SolarWinds Chief Technology Officer

 

It’s time to stop treating data as a commodity and create a secure and reliable data recovery plan by following a few core strategies.

 

1. Establish objectives

 

Establish a Recovery Point Objective (RPO) that determines how much data loss is acceptable. Understanding acceptable risk levels can help establish a baseline understanding of where DBAs should focus their recovery efforts.

 

Then, work on a Recovery Time Objective (RTO) that shows how long the agency can afford to be without its data.

 

2. Understand differences between backups and snapshots

 

There’s a surprising amount of confusion about the differences between database backups, server tape backups, and snapshots. For instance, many people have a misperception that a storage area network (SAN) snapshot is a backup, when it’s really only a set of data reference markers. Remember that a true backup, either on- or off-site, is one in which data is securely stored in the event that it needs to be recovered.

 

3. Make sure those backups are working

 

Although many DBAs will undoubtedly insist that their backups are working, the only way to know for sure is to test the backups by doing a restore. This will provide assurance that backups are running — not failing — and highly available.

 

4. Practice data encryption

 

DBAs can either encrypt the database backup file itself, or encrypt the entire database. That way, if someone takes a backup, they won’t be able to access the information without a key. DBAs must also ensure that if a device is lost or stolen, the data stored on the device remains inaccessible to users without proper keys.
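As a hedged illustration of the first option (encrypting the backup file itself), here's a minimal Python sketch using the third-party cryptography package. The file names and in-memory key handling are placeholder assumptions; in a real deployment the key would live in a secrets manager or HSM.

```python
# Minimal sketch: encrypt a database backup file at rest using symmetric
# encryption (Fernet, from the third-party "cryptography" package).
# File paths and key storage here are illustrative assumptions only.
from cryptography.fernet import Fernet

# In practice the key would live in a secrets manager or HSM, not in a variable.
key = Fernet.generate_key()
fernet = Fernet(key)

with open("agency_db.bak", "rb") as f:
    plaintext = f.read()

with open("agency_db.bak.enc", "wb") as f:
    f.write(fernet.encrypt(plaintext))

# Restore path: without the key, the .enc file is useless to whoever takes it.
with open("agency_db.bak.enc", "rb") as f:
    restored = fernet.decrypt(f.read())
assert restored == plaintext
```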

 

5. Monitor and collect data

 

Combined with network performance monitoring and other analysis software, real-time monitoring and real-time data collection can improve performance, reduce outages, and maintain network and data availability.

 

Real-time collection of information can be used to do proper data forensics. This will make it easier to track down the cause of an intrusion, which can be detected through monitoring.

 

Monitoring, database analysis, and log and event management can help DBAs understand if something is failing. They’ll be able to identify potential threats through things like unusual queries or suspected anomalies. They can compare the queries to their historical information to gauge whether or not the requests represent potential intrusions.

 

6. Test, test, test

 

If you’re managing a large database, there’s simply not enough space or time to restore and test it every night. DBAs should test a random sampling taken from their databases. From this information, DBAs can gain confidence that they will be able to recover any database they administer, even if that database is in a large pool. If you’re interested in learning more, check out this post, which gets into further detail on database sampling.
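A minimal sketch of that sampling approach is below, assuming a hypothetical restore_and_verify() helper that wraps whatever restore-and-integrity-check procedure your platform uses; the 10% sample size is also an assumption.

```python
# Sketch: nightly restore test against a random sample of databases.
# restore_and_verify() is a hypothetical stand-in for your platform's
# restore + integrity-check procedure; the 10% sample size is an assumption.
import random


def restore_and_verify(db_name: str) -> bool:
    """Placeholder: restore the latest backup of db_name to a scratch
    server and verify it. Returns True on success."""
    ...
    return True


databases = [f"db_{i:03d}" for i in range(250)]
sample = random.sample(databases, k=max(1, len(databases) // 10))

failures = [db for db in sample if not restore_and_verify(db)]
if failures:
    print(f"Restore test FAILED for: {failures}")
else:
    print(f"Restore test passed for {len(sample)} sampled databases.")
```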

 

Data is quickly becoming a truly precious asset to government agencies, so it is critical to develop a sound data recovery plan.

 

Find the full article on our partner DLT’s blog, Technically Speaking.

I’ve long held the belief that for any task there are correct approaches and incorrect ones. When I was small, I remember being so impressed by the huge variety of parts my father had in his tool chest. Once, I watched him repair a television remote control, one that had shaped and tapered plastic buttons. The replacement from RCA/Zenith, I believe at the time, cost upwards of $150. He opened the broken device, determined that the problem was that the tongue on the existing button had broken, and rather than epoxy the old one back together, he carved and buffed an old bakelite knob into the proper shape, attached it in place of the original one, and ultimately, the final product looked and performed as if it were the original. It didn’t even look different than it had. This, to me, was the ultimate accomplishment. Almost as the Hippocratic Oath dictates, above all, do no harm. It was magic.

 

When all you have is a hammer, everything is a nail, right? But that sure is the wrong approach.

 

Today, my favorite activity outside work is building and maintaining guitars. When I began doing this, I didn’t own some critical tools. For example, an entire series of needle files and crown files is appropriate for the shaping and repair of frets on the neck; while not a very expensive purchase, no other tools would do for the task at hand. The correct Allen wrench is necessary for adjusting the torsion rod in the neck. The ideal soldering iron is critical for proper wiring of pickups, potentiometers, and the jack. And of course, when sanding, a variety of grades is also necessary, not to mention a selection of paints, brushes, stains, and lacquers.

 

The same can be said of DevOps. Programming languages are designed for specific purposes, and there have been many advances in the past few years that point to what a given scripting task may require. Many use Bash, batch files, or PowerShell for their tasks. Others may choose PHP or Ruby on Rails, while still others choose Python as their scripting tool. Today, it is my belief that no one tool can accommodate every action necessary to perform these tasks. There are nuances to each language, but one thing is certain: many tasks require a collaborative conversation between these tools. To accomplish them, the ideal tooling will likely call functions back and forth across scripting languages. And while some bits of custom code are required here and there, currently that's the best way to approach the situation, given that many tools don't yet exist in packaged form. The DevOps engineer, then, needs to write and maintain these bits of code to help ensure that they are accurate each time they are called upon.
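As a small, hedged example of that cross-language glue, here's a Python sketch that shells out to a Bash script and a PowerShell command and checks their exit codes. The script path and the PowerShell one-liner are hypothetical placeholders; the point is the pattern of capturing output and surfacing failures rather than silently ignoring them.

```python
# Sketch: Python as the "glue" layer calling other scripting tools.
# The script path and PowerShell command are hypothetical placeholders.
import subprocess


def run(cmd):
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"{cmd[0]} failed: {result.stderr.strip()}")
    return result.stdout.strip()


# A Bash script that gathers facts about a Linux host (placeholder name).
facts = run(["bash", "./collect_facts.sh"])

# A PowerShell one-liner on a Windows box (placeholder command).
services = run(["powershell", "-Command",
                "Get-Service | Where-Object Status -eq 'Running'"])

print(facts)
print(services)
```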

 

As was correctly pointed out in comments on my previous post, I need to stress that these custom pieces of code must be tested before each use, to help ensure that any changes that have taken place within the infrastructure are accounted for each time the scripts are set to task.

 

I recommend that anyone who is in DevOps get comfortable with these and other languages, and learn which do the job best so that DevOps engineers become more adept at facing challenges.

 

At some point, there will be automation tools with slick GUI interfaces that address many or even all of these needs as they arise. But for the moment, I advise learning, utilizing, and customizing scripting tools. In the future, when those packaged tools do become available, the question is: will they surpass the automation you’ve already built through DevOps? I cannot predict.

As you spend more time in security, you start to understand that keeping up with the latest trends is not easy. Security is a moving target, and many organizations simply can’t keep up. Fortunately for us, Cisco releases an annual security report that can help us out in this regard. You can find this year's report, as well as past reports, here. In this post, I wanted to share a few highlights that illustrate why I believe security professionals should be aware of these reports.

 

Major findings

A nice feature of the Cisco 2017 Annual Cybersecurity Report is the quick list of major findings. This year, Cisco notes that the three leading exploit kits -- Angler, Nuclear, and Neutrino -- are vanishing from the landscape. This is good to know, because we might be spending time and effort looking for these popular attacks while other, lesser-known exploit kits start working their way into the network. Based on Cisco’s findings, most companies are using several security vendors with more than five security products in their environment, and only about half of the security events received in a given day are reviewed. Of that number, 28% are deemed legitimate, and fewer than half of those are remediated. We’re having a hard time keeping up, and our time needs to be spent on live targets, not threats that are no longer prevalent.

 

Gaining a view to adversary activity

In the report's introduction, Cisco covers the strategies that adversaries use today. These include taking advantage of poor patching practices, social engineering, and malware delivery through legitimate online content, such as advertising. I personally feel that you can't defend your network properly unless you know how you’re being attacked. I suppose you could look at it this way. Here in the United States, football is one of the most popular sports. It’s common practice for a team to study films of their opponents before playing them. This allows them to adjust their offensive and defensive game plan ahead of time. The same should be true for security professionals. We should be prepared to adjust to threats, and reviewing Cisco’s security report is similar to watching those game films.

 

In the security report, Cisco breaks down the most commonly observed malware by the numbers. It also discusses how attackers pair remote access malware with exploits in deliverable payloads. Some of what I gleaned from the report shows that the methods being used are the same as what was brought out in previous reports, with some slight modifications.

 

My take

From my point of view, the attacks are sophisticated, but not in a way that’s earth shattering. What I get from the report is that the real issue is that there are too many alerts from too many security devices, and security people can't sort through them efficiently. Automation is going to play a key role in security products. Until our security devices are smart enough to distinguish noise from legitimate attacks, we’re not going to be able to keep up. However, reading reports like this can better position our security teams to look in the right place at the right time, cutting down on some of the breaches we see. So, to make a long story short, be sure to read up on the Cisco Annual Security report. It’s written well, loaded with useful data, and helps security professionals stay on top of the security landscape.

In our pursuit of Better IT, I bring you a post on how important data is to functional teams and groups. Last week we talked about anti-patterns in collaboration, covering things like data mine-ing and other organizational dysfunctions. In this post, we'll talk about the role that shared data, information, visualizations, and analytics play in helping ensure your teams can avoid all those missteps from last week.

 

Data! Data! Data!

These days we have data. Lots and lots of data. Even Big Data, data so important we capitalize it! As much as I love my data, we can't solve problems with just raw data, even if we enjoy browsing through pages of JSON or log data. That's why we have products like Network Performance Monitor (NPM), Server & Applications Monitor (SAM), and Database Performance Analyzer (DPA) to help us collect and parse all that data. Each of those products collects specialized metrics and provides visualizations that help specialized sysadmins leverage that data. These administrators probably don't think of themselves as data professionals, but they are. They choose which data to collect, which levels to be alerted on, and which to report upon. They are experts in this data and they have learned to love it all.

Shared Data about App and Infrastructure Resources

Within the SolarWinds product solutions, data about the infrastructure and application graph is collected and displayed on the Orion Platform. This means that cross-team admins share the same set of resources and components, and the data about their metrics. Now we have PerfStack, with features for cross-team collaboration via data. We can see the entities we want to analyze, then see all the other entities related to them. This is what I call the Infrastructure and Application Graph, which I'll be writing about later. After choosing entities, we can discover the metrics available for each of them and choose the ones that make the most sense to analyze, based on the troubleshooting we are doing now.

 

PerfstackAnalysis.png

 

Metrics Over Time

 

Another data feature that's critical to analyzing infrastructure issues is the ability to see data *over time*. It's not enough to know how CPU is doing right now; we need to know what it was doing earlier today, yesterday, last week, and maybe even last month, on the same day of the month. By having a view into the status of resources over time, we can intelligently make sense of the data we are seeing today. End-of-month processing going on? Now we know why there might be a slight spike in CPU pressure.
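A minimal sketch of that kind of over-time comparison is below, assuming a CSV export of historical CPU samples from your monitoring tool. The file name and column names are assumptions for illustration only.

```python
# Sketch: compare the current hour's CPU against the same hour on previous
# days to answer "is this normal for this time of day / end of month?"
# Assumes a CSV export with "timestamp" and "cpu_pct" columns (assumption).
import pandas as pd

df = pd.read_csv("cpu_history.csv", parse_dates=["timestamp"]).set_index("timestamp")

hourly = df["cpu_pct"].resample("1h").mean()
current = hourly.iloc[-1]

# Baseline: the same hour of day across the rest of the history.
same_hour = hourly[hourly.index.hour == hourly.index[-1].hour]
baseline = same_hour.iloc[:-1].mean()

print(f"current: {current:.1f}%  baseline for this hour: {baseline:.1f}%")
if current > baseline * 1.5:
    print("CPU is well above its usual level for this time of day.")
```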

 

Visualizations and Analyses

 

The beauty of PerfStack is that by choosing these entities and metrics, we can easily build data visualizations of the metrics and overlay them to discover correlations and causes. We can then interact with the information we now have by working with the data or the visualizations. By overlaying the data, we can see how the statuses of resources are impacting each other. This collaboration around the data means we are performing "team troubleshooting" instead of silo-based "whodunits." We can find the issue, which until now might have been hiding in data in separate products.

perfstackComplex.png

Actions

 

So we've gone from data to information to analysis in just minutes. Another beautiful feature of PerfStack is that once we've built the analyses that show our troubleshooting results, we can copy the URL, send it off to team members, and they can see the exact same analysis -- complete with visualizations -- that we saw. If we've done similar troubleshooting before and saved projects, we might be doing this in seconds.

Save Project.png

This is often hours, if not days, faster than how we did troubleshooting in our previous siloed, data mine-ing approach to application and infrastructure support. We accomplished this by having quick and easy access to shared information that united differing views of our infrastructure and application graph.

 

Data -> Information -> Visualization -> Analysis -> Action

 

It all starts with the data, but we have to love the data into becoming actions. I'm excited about this data-driven workflow in keeping applications and infrastructure happy.

Needles and haystacks have a storied past, though never in a positive sense. Troubleshooting network problems comes as close as any to the process to which that pair alludes. Sometimes we just don't know what we don't know, and that leaves us with a problem: how do we find the information we're looking for when we don't know what we're looking for?

 

Picture1.png

The geeks over at SolarWinds obviously thought the same thing and decided to do something to make life easier for those hapless souls frequently finding themselves tail over teakettle in the proverbial haystack; that product is PerfStack.

 

PerfStack is a really cool component of the Orion Platform as of the new 12.1 release. In a nutshell, it allows you to find all sorts of bits of information that you're already monitoring and view them all in one place for easy consumption. Rather than going from this page to that, one IT discipline to another, or ticket to ticket, PerfStack gives you the freedom to mix and match, to see only the bits pertinent to the problem at hand, whether those are in the VoIP systems, wireless, applications, or network. Who would have thought that would be useful, and why haven't we thought of that before?

 

Picture2.png

 

In and of themselves, those features would be a welcome addition to the Orion suite--or any monitoring suite, for that matter--but SolarWinds took it one step further and designed PerfStack in such a way that you can create your own "PerfStacks" on the fly, as well as pass them around for other people to use. Let's face it: having a monitoring solution with a lot of canned reporting, stuff that just works right out of the box, is a great thing, but having the flexibility to create your own reports at a highly granular level is infinitely better. Presumably you know your environment better than the team at SolarWinds, or me, or anyone else. You shouldn't be forced into a modality that doesn't fit your needs.

 

Passing dashboards ("PerfStacks") around to your fellow team members, or whomever, is really a key feature here. Often we have a great view into the domain we operate within, whether that's virtualization, applications, networking, or storage, but we don't have the ability to share that with other people. That's certainly the case with point products, but even when we are all sharing the same tools, it hasn't historically been as smooth a process as it could be. That's unfortunate, but PerfStack goes a long way toward breaking through that barrier.

 

Picture3.png

 

There are additional features of PerfStack that bear mentioning: real-time updates to dashboards without redrawing the entire screen, saving of dashboards, importing new polling targets and events in real time, and so on. I will cover those details next time, but what we've talked about so far should be enough to show the value of the product. SolarWinds doesn't seem to believe in tiny rollouts. They've come out of the gates fast and strong with this update, and with good reason. It really is a great and useful product that will change the way you look at monitoring and troubleshooting.

Back in the office this week and excited for the launch of PerfStack. If you haven't heard about PerfStack yet, you should check out the webcast tomorrow: PerfStack Livecast

 

As usual, here's a handful of links from the intertubz I thought you might find interesting. Enjoy!

 

Cloudflare Coding Error Spills Sensitive Data

A nice reminder about how you are responsible for securing your data, not someone else. Although Cloudflare® was leaking data, a company such as 1Password was not affected because they were encrypting their data in more than one way. In short, 1Password assumed that SSL/TLS *can* fail, and took responsibility to secure their data, rather than relying on someone else to do that for them. We should all be mindful about how we treat our data.

 

Microsoft Invests in Real-time Maps for Drones, and Someday, Flying Cars

Can we skip autonomous cars and go right to flying cars? Because that would be cool with me. And Microsoft® is doing their part to make sure we won't need to use Apple® Maps with our flying cars.

 

Expanding Fact Checking at Google

Nice to see this effort underway. I'm not a fan of crowdsourced entities such as Wikipedia, as they have inherent issues with veracity. It would be good for everyone if we could start verifying data posted online as fact (versus opinion, or just fake).

 

Wikipedia Bots Spent Years Fighting Silent, Tiny Battles With Each Other

Did I mention I wasn't a fan of Wikipedia? As if humans arguing over facts aren't bad enough, someone thought it was a good idea to create bots to do the job instead.

 

Perspective

Besides the issue with fact checking, the internet is also a cesspool of misery. Perspective is an attempt to use Machine Learning to help foster better (or nicer) conversations online. I'm curious to see how this project unfolds.

 

Microsoft Surface: NSA Approves Windows 10 Tablets for Classified Work

Interesting to note here that this is only for devices manufactured by Microsoft, and not other vendors such as HP® or Dell®. What's more interesting to note is how Microsoft continues to make progress in areas of data security for both their devices and the hosted services (Azure®).

 

Alphabet's Waymo Alleges Uber Stole Self-Driving Secrets

I am simply amazed at how many mistakes Uber® can make as a company and still be in business.

 

The weather has been warm and Spring-like, so I decided to put the deck furniture out. So now, if it snows two feet next week, you know who to blame:

deck.jpg

I’ve discussed the idea that performance troubleshooting is a multi-dimensional problem and that virtualization adds more dimensions. Much of the time it is sufficient to look at the layers independently. The cause of a performance problem may be obvious in an undersized VM or an overloaded vSphere cluster. But sometimes you need to correlate the performance metrics across multiple layers. Worst of all is when the problem is intermittent; apparently random application slowdowns are the hardest to troubleshoot. The few times that I have needed to do this correlation, I have always had a sinking feeling. I know that I am going to end up gathering a lot of performance logs from different tools. Then I am going to need to identify the metrics that are important and usually graph them together. That feeling deepens when I know I need to get the data from Windows Perfmon, the vSphere client, the SAN, and maybe a network monitor into a single set of graphs.

 

My go-to tool for consolidating all this data is still Microsoft Excel, mostly because I have a heap of CSV files and want a set of graphs. Consolidating this data has a few challenges. The first is getting consistent start and finish times for the sample data. The CSV files are generated from separate tools, and the time stamps may even be in different time zones. Usually looking at one or two simple graphs identifies the problem time window. Once we know when to look, we can trim each CSV file to the time range we want. Then there are challenges with getting consistent intervals for the graphs. Some tools log every minute and others every 20 minutes. On occasion, I have had to re-sample the data to the lowest time resolution just to get everything on one graph. That graph also needs to have sensible scales, meaning applying scaling to the CSV values before we graph them. I’m reminded how much I hate having to do this kind of work and how much it seems like something that should be automated.
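That normalization is exactly the kind of thing a short script can do. Here's a hedged pandas sketch of the workflow; the file names, column names, time zones, and problem window are assumptions standing in for whatever your tools export.

```python
# Sketch: consolidate performance CSVs from different tools onto one
# time axis. File names, column names, and time zones are assumptions.
import pandas as pd


def load(path, time_col, value_col, tz):
    df = pd.read_csv(path, parse_dates=[time_col]).set_index(time_col)
    # Normalize everything to UTC so the series line up.
    df.index = df.index.tz_localize(tz).tz_convert("UTC")
    # Resample to a common 5-minute interval (the lowest useful resolution).
    return df[value_col].resample("5min").mean()


combined = pd.concat(
    {
        "vm_cpu_pct": load("perfmon.csv", "Time", "CPU %", "US/Eastern"),
        "host_cpu_pct": load("vsphere.csv", "Timestamp", "Usage", "UTC"),
        "san_latency_ms": load("san.csv", "time", "latency_ms", "UTC"),
    },
    axis=1,
)

# Trim to the problem window (illustrative dates) and eyeball the correlation.
window = combined.loc["2017-02-28 09:00":"2017-02-28 12:00"]
print(window.describe())
window.plot(subplots=True)  # requires matplotlib
```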

 

Usually, when I’m doing this I am an external consultant brought in to deal with a high visibility issue. Senior management is watching closely and demanding answers. Usually, I know the answer early and spend hours putting together the graph that proves the answer. If the client had a good set of data center monitoring tools and well-trained staff, they would not need me. It troubles me how few organizations spend the time and effort in getting value out of monitoring tools.

 

I have been building this picture of the nightmare of complex performance troubleshooting for a reason. Some of you have guessed it: PerfStack will be a great tool for avoiding exactly this problem. Seeing an early demo of PerfStack triggered memories. Not good memories.

Ensure your IoT devices are secure, or face the consequences. That’s the message being sent to some hardware manufacturers by the Federal Trade Commission. In the aftermath of the ever-increasing number of attacks perpetrated by compromised IoT devices like routers and cameras, the Federal Trade Commission’s Bureau of Consumer Protection has targeted companies such as TRENDnet, ASUS, and more recently, D-Link.

 

TRENDnet

Back in 2013, the FTC settled its very first action against a manufacturer of IP-enabled consumer products, TRENDnet. TRENDnet’s SecurView cameras were widely used by consumers for a wide range of purposes including home security and baby monitors. By their product name alone, these products were seemingly marketed as “secure." The FTC accused TRENDnet of a number of issues, including:

 

  • Failing to use reasonable security to design and test its software
  • Failing to secure camera passwords
  • Transmitting user login credentials in the clear
  • Storing consumers’ login information in clear, readable text on their mobile devices

 

In January of 2012, a hacker exposed these flaws and made them public, resulting in almost 700 live feeds being posted and freely available on the internet. These included babies sleeping in their cribs.

 

ASUS

Once again the FTC fired a shot across the bow at manufacturers of consumer IoT devices when they leveled a complaint against ASUSTek Computer, Inc. This time, the security of their routers was questioned. ASUS had marketed their consumer line of routers with claims they would “protect computers from any unauthorized access, hacking, and virus attacks” and “protect the local network against attacks from hackers.” However, the FTC found several flaws in the ASUS products, including:

 

  • Easily exploited security bugs in the router’s web-based control panel
  • Allowing consumers to set and retain the default login credentials on every router (admin/admin)
  • Vulnerable cloud storage options AiCloud and AiDisk that exposed consumers’ data and personal information to the internet.

 

In 2014, hackers used these and other vulnerabilities in ASUS routers to gain access to over 12,900 consumers’ storage devices.

 

D-Link

Now, in 2017, the FTC has targeted D-Link Corporation, a well-known manufacturer of consumer and SMB/SOHO networking products. This complaint alleges that D-Link has “failed to take reasonable steps to secure its routers and Internet Protocol (IP) cameras, potentially compromising sensitive consumer information including live video and audio feeds from D-Link IP cameras.”

The FTC complaint goes on to describe how D-Link promoted the security of its devices with marketing and advertising citing “easy to secure” and “advanced network security," but then outlines several issues:

 

  • Hard-coded login credentials (guest/guest) in D-Link camera software
  • Software vulnerable to command injection that could enable remote control of consumer devices
  • Mishandling of a private code signing key, which was openly available on a public website for six months
  • User login credentials stored in clear, readable text on mobile devices

 

The severity of an exposed and vulnerable router is amplified by the fact that it is a home network’s primary means of defense. Once compromised, everything behind that router is potentially exposed to the hacker and, as the FTC emphasizes, computers, smartphones, IP cameras, and IP-enabled appliances could all be attacked as a result.

 

The DDoS Landscape

According to Akamai’s quarterly State of the Internet report, DDoS attacks continue to flourish and evolve as a primary means of attacking both consumers and businesses. In Q4 2016, there was a 140% increase in attacks greater than 100 Gbps, a 22% increase in reflection-based attacks, and a 6% increase in Layer 3 and 4 attacks. At the application layer, a 44% increase in SQLi attacks was observed over the same period. These examples are more evidence that these types of attacks are moving ever upward in the stack.

 

Not surprisingly, the United States continues to be the largest source of these attacks, accounting for approximately 28% of global web application attacks in Q4 2016. As IoT devices continue to proliferate at exponential rates, and companies like TRENDnet, ASUS, and D-Link fail to secure them, these numbers may only increase.

 

There is hope, however, that organizations like the FTC can send a strong message to device manufacturers in the upcoming months as they continue to identify and hold accountable the companies that fail to protect consumers, and the rest of us, from exposed and vulnerable devices.

 

Do you feel the FTC and FCC (or other government organizations) should be more or less involved in the enforcement of IoT security?

DevOps, the practice of creating code internally to streamline the administrative processes that fall under the sysadmin's purview, is still emerging within IT departments across the globe. These tasks have traditionally revolved around the mechanical functions of the sysadmin's role. However, another whole category of administration is now becoming far more vital to the role of the sysadmin, and that’s the deployment of applications and their components within the environment.

 

Application development is undergoing a big change. The use of approaches like microservices and containers, a relatively new paradigm about which I’ve spoken before, makes the deployment of these applications very different. Now, a sysadmin must be more responsive to the needs of the business to get these bits of code and/or containers into production far more rapidly. As a result, the sysadmin needs to have tools in place so as to respond as precisely, actively, and consistently as possible. The code for the applications is now being delivered so dynamically that it must be deployed, or rolled back, just as rapidly.

 

When I worked at VMware, I was part of an SDDC group whose main goal was assisting the rollouts of massive deployments (SAP, JD Edwards, etc.) to an anywhere/anytime type of model. This was DevOps in the extreme. Our expert code jockeys were tasked with writing custom code at each deployment. While this was vital to the goals of many organizations, today the tools exist to do these tasks in a more elegant manner.

 

So, what tools would an administrator require to push out or roll back these applications in a potentially seamless manner? There are tools that will roll out applications, say from your VMware vCenter to whichever VMware infrastructure you have in your server farms, but there are also ways to leverage that same VMware infrastructure to deploy outbound to AWS, Azure, or hybrid, non-VMware infrastructures. A great example is the wonderful Platform9, which exists as a separate panel within vCenter and allows the admin to push out full deployments to wherever the management platform is deployed.

 

There are other tools, like Mesos, which help orchestrate Docker-style container deployments. This is the kind of tooling administrators are hoping for when it comes to Docker administration.

 

But the microservices piece of the puzzle has yet to be solved. As a result, the sysadmin is currently under the gun for the same type of automation toolset. For today, and for the foreseeable future, DevOps holds the key. We need not only to deploy these parts and pieces to the appropriate places, but also to ensure that they're tracked and that they can be pulled back should that be required. So, what key components are critical? Version tracking, lifecycle, permissions, and specific locations must be maintained.

I imagine that what we’ll be seeing in the near future are standardized toolsets that leverage orchestration elements for newer application paradigms. For the moment, we will need to rely on our own code to assist us in the management of the new ways in which applications are being built.

By Joe Kim, SolarWinds Chief Technology Officer

 

While it’s essential to have a website that is user-friendly, it’s equally important to make sure that the backend technologies that drive that site are working to deliver a fast and fluid performance. In short, good digital citizen engagement combines reliability, performance, and usability to help connect governments and citizens.

 

Assuming you’ve already developed a streamlined user interface (UI), it’s time to start centering your attention on the behind-the-scenes processes that will help you build and maintain that connection. Here are three strategies to help you achieve this ultimate ideal of form and function.

 

Closely monitor application performance

 

Slow or unresponsive applications can undermine federal, state, and local government’s efforts to use online solutions to connect with citizens. What’s the incentive for constituents to use their government’s digital platform if a site is slow and doesn’t easily or quickly lead to information that answers their questions? They might as well pick up the phone or (shudder) pay a visit to their local office.

 

Monitoring application performance is imperative to helping ensure that your digital platforms remain a go-to resource for citizens. Use application monitoring solutions to track and analyze performance and troubleshoot potential issues. The data you collect can help you identify the causes of problems, allowing you to address them quickly and, ideally, with minimal impact to site performance.

 

Log and manage events

 

The feds need to take care to plug any potential holes, and part of that effort entails logging and managing events. Events can range from a new user signing up to receive emails about local information, to potential intrusions or malware designed to collect personal information or compromise website performance.

 

Putting steps in place to monitor all types of website events and requests will help you identify, track, and respond to potential incidents before they do lasting damage. You can monitor and manage unauthorized users, failed login attempts, and other events, and even mitigate internal threats by changing user privileges.
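As a small illustration of the failed-login piece, here's a hedged Python sketch that counts failures per source address in an auth log and flags anything over a threshold. The log path, line format, and threshold are assumptions; a log and event management product would do this (and far more) continuously.

```python
# Sketch: flag source addresses with repeated failed logins in an auth log.
# Log path, line format, and threshold are assumptions for illustration.
import re
from collections import Counter

THRESHOLD = 10
pattern = re.compile(r"Failed password for .* from (\d+\.\d+\.\d+\.\d+)")

failures = Counter()
with open("/var/log/auth.log") as log:
    for line in log:
        match = pattern.search(line)
        if match:
            failures[match.group(1)] += 1

for ip, count in failures.most_common():
    if count >= THRESHOLD:
        print(f"{ip}: {count} failed logins -- review or block")
```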

 

Test site performance

 

Once application monitoring and log and event management processes are in place, continue to test your website’s performance to ensure that it delivers an optimal and user-friendly experience.

 

The goal is to identify any slow web services that may be impacting that experience. Use web performance monitoring solutions to identify infrastructure issues and various site elements that could be causing latency issues.

 

Your site should be tested from multiple devices, locations, and browsers to provide your users with a fast and reliable experience. These tests should be done proactively and on a regular basis to help ensure consistently optimal performance that delivers on the promise of true citizen engagement.
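A minimal sketch of that kind of proactive check is below, using Python's requests library; the URLs and the two-second latency budget are placeholder assumptions, and a web performance monitoring product would run the same checks from many locations and browsers.

```python
# Sketch: a simple recurring check of page response times against a budget.
# URLs and the 2-second budget are placeholder assumptions.
import requests

BUDGET_SECONDS = 2.0
pages = [
    "https://example.gov/",
    "https://example.gov/services",
    "https://example.gov/contact",
]

for url in pages:
    try:
        resp = requests.get(url, timeout=10)
        elapsed = resp.elapsed.total_seconds()
        status = "OK" if resp.ok and elapsed <= BUDGET_SECONDS else "SLOW/ERROR"
        print(f"{status:10s} {url} -> {resp.status_code} in {elapsed:.2f}s")
    except requests.RequestException as exc:
        print(f"DOWN       {url} -> {exc}")
```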

 

Remember that as you strive to achieve that promise, it’s important to invest in the appropriate backend solutions and processes to power your efforts. They’re just as important as the surface details. Without them, you run the risk of having your site be just another pretty face that citizens may find wanting.

 

Find the full article on GovLoop.
