Geek Speak

I'm in Austin this week to film more SolarWinds Lab episodes and conduct general mischief. Lucky for me Delta had a MUCH better Monday this week than last week.

 

Anyway, here are the items I found most amusing from around the Internet. Enjoy!

 

Bungling Microsoft singlehandedly proves that golden backdoor keys are a terrible idea

If you have ever wanted to run Linux on your Surface, now is your chance! I'm looking at you, adatole.

 

Millions of Cars Vulnerable to Remote Unlocking Hack

This is why I drive a 2008 Jeep. It's like the Battlestar Galactica: too old to be hacked by new technology.

 

Scammers sneak into customer support conversations on Twitter

I've noticed a handful of fake support accounts popping up lately, so this is a good reminder to be careful out there.

 

Best Fighter Jet In History Grounded By Bees

No, seriously. And I dig the part where they decided not to just kill the bees, but to relocate them. Nice touch.

 

Walmart and the Multichannel Trap

Wonderful analysis of where Walmart is headed, and why Amazon is likely to run all retail in the future. After living through what Walmart did to Vlasic 20 years ago, this seems fitting to me.

 

Networking Needs Information, Not Data

Nice post that echoes what I've been saying to data professionals for a few years now. There is a dearth of data analysts in the world right now. You could learn enough about data analytics in a weekend to impact your career for the next 20 years, but only if you get started.

 

Even Michael Phelps knows it's football season, right, kong.yang?


Data is an incredibly important asset. In fact, data is the MOST important asset for any company, anywhere. Unfortunately, many continue to treat data as an easily replaced commodity.

 

But we’re not talking about a database administrator’s (DBA) iTunes library. We’re talking highly sensitive and important data that can be lost or compromised.

 

It’s time to stop treating data as a commodity. We need to create a secure and reliable data recovery plan. And we can get that done by following a few core strategies.

 

Here are the six easy steps you can take to prevent data loss.

 

Build a Recovery Plan

Novice DBAs think of backups as the starting point for protecting against data loss. Experienced senior DBAs know that the real starting point is building the recovery plan.

 

The first thing to do here is to establish a Recovery Point Objective (RPO) that determines how much data loss is acceptable. Understanding acceptable risk levels can help establish a baseline understanding of where DBAs should focus their recovery efforts. Then, work on a Recovery Time Objective (RTO) that shows how long the business can afford to be without its data. Is a two-day restore period acceptable, or does it have to be 15 minutes?

 

Finally, remember that “high availability” and “disaster recovery” are different. A DBA managing three nodes with data flowing between each may assume that if something happens to one node the other two will still be available. But an error in one node will undoubtedly get replicated across all of them. You better have a recovery plan in place when this happens.

 

If not, then you should consider having an updated resume.

 

Understand That Snapshots != Database Backups

There’s a surprising amount of confusion about the differences between database backups, server tape backups, and snapshots. Many administrators have a misperception that a storage area network (SAN) snapshot is good enough as a database backup, but that snapshot is only a set of data reference markers. The same issue exists with VM snapshots as well. Remember that a true backup is one that allows you to recover your data to a transactionally consistent view at a specific point in time.

 

Also consider the backup rule of three, where you save three copies of everything, in two different formats, and with one off-site backup. Does this contain hints of paranoia? Perhaps. But it also perfectly illustrates what constitutes a backup, and how it should be done.

 

Make Sure the Backups Are Working

There is only one way to know if your backups are working properly, and that is to try doing a restore. This will provide assurance that backups are running -- not failing -- and highly available. This also gives you a way to verify if your recovery plan is working and meeting your RPO and RTO objectives.
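As a rough illustration, here is a minimal sketch of what an automated restore test might look like in Python. The RPO and RTO values, the backup timestamp, and the restore_tool command are hypothetical placeholders; in practice the timestamp would come from your backup catalog and the restore would use your own tooling.

```python
import datetime
import subprocess
import time

# Hypothetical values -- substitute your own RPO/RTO targets and restore command.
RPO = datetime.timedelta(minutes=15)       # maximum tolerable data loss
RTO = datetime.timedelta(hours=2)          # maximum tolerable restore time
LATEST_BACKUP_TIME = datetime.datetime(2016, 8, 15, 3, 0)   # from your backup catalog
RESTORE_COMMAND = ["restore_tool", "--target", "test_server", "--latest"]

def verify_backup():
    # RPO check: how stale is the newest backup?
    staleness = datetime.datetime.now() - LATEST_BACKUP_TIME
    if staleness > RPO:
        print("FAIL: newest backup is %s old, which exceeds the RPO of %s" % (staleness, RPO))

    # RTO check: actually restore to a test server and time it.
    start = time.monotonic()
    result = subprocess.run(RESTORE_COMMAND)
    elapsed = datetime.timedelta(seconds=time.monotonic() - start)
    if result.returncode != 0:
        print("FAIL: restore command returned %d" % result.returncode)
    elif elapsed > RTO:
        print("FAIL: restore took %s, which exceeds the RTO of %s" % (elapsed, RTO))
    else:
        print("OK: restore completed in %s" % elapsed)

if __name__ == "__main__":
    verify_backup()
```

Run something like this on a schedule against a test server, and failures become visible long before you actually need the backups.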

 

Use Encryption

Data at rest on the server should always be encrypted, and that protection should extend to the database backups as well. There are a couple of options for this: DBAs can either encrypt the database backup file itself, or encrypt the entire database. That way, if someone takes a backup, they won't be able to access the information without a key.
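For the encrypt-the-backup-file option, here is a minimal sketch using Python's cryptography library. The file name is made up, and a real deployment would more likely use the database engine's native backup encryption together with a proper key management system; the point is simply that the backup is useless without the key.

```python
from cryptography.fernet import Fernet  # pip install cryptography

# A minimal illustration of encrypting a backup file before it leaves the server.
# Key management is the hard part: keep the key in a vault/KMS, never next to the backup.
key = Fernet.generate_key()
cipher = Fernet(key)

with open("sales_db.bak", "rb") as f:           # hypothetical backup file name
    ciphertext = cipher.encrypt(f.read())

with open("sales_db.bak.enc", "wb") as f:
    f.write(ciphertext)
```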

 

DBAs must also ensure that if a backup device is lost or stolen, the data stored on the device remains inaccessible to anyone without the proper keys. Drive-level encryption tools like BitLocker can be useful in this capacity.

 

Monitor and Collect Data

Real-time data collection and real-time monitoring should also be used to help protect data. Combined with network monitoring and other analysis software, data collection and monitoring will improve performance, reduce outages, and maintain network and data availability.

 

Collection of data in real-time allows administrators to perform proper data analysis and forensics, making it easier to track down the cause of an intrusion, which can also be detected through monitoring. Together with log and event management, DBAs have the visibility to identify potential threats through unusual queries or suspected anomalies. They can then compare the queries to their historical information to gauge whether or not the requests represent potential intrusions.
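To make the "compare against historical information" idea concrete, here is a toy sketch. The logins and query counts are invented; a real implementation would pull both the baseline and the current figures from your log and event management platform.

```python
import statistics

# Hypothetical hourly query counts per login, pulled from your monitoring history.
history = {"app_user": [1200, 1150, 1300, 1275, 1190], "etl_user": [40, 38, 45, 42, 41]}
current = {"app_user": 1250, "etl_user": 900}   # this hour's observed counts

for login, observed in current.items():
    baseline = history[login]
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    # Flag anything more than three standard deviations above the historical mean.
    if observed > mean + 3 * stdev:
        print("ALERT: %s issued %d queries this hour (baseline %.0f +/- %.0f)"
              % (login, observed, mean, stdev))
```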

 

Test, Test, Test

This is assuming a DBA has already tested backups, but let’s make it a little more interesting. Let’s say a DBA is managing an environment with 3,000 databases. It’s impossible to restore them every night; there’s simply not enough space or time.

 

In this case, DBAs should take a random sampling of their databases to test. Shoot for a sample size that gives you 95 percent confidence in the result, with a small margin of error (much like a political poll). From this information DBAs can gain confidence that they will be able to recover any database they administer, even if that database is in a large pool. If you're interested in learning more, check out this post, which gets into further detail on database sampling.
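For the sampling math itself, a short sketch follows. It uses the standard sample-size formula with a finite-population correction; the 95 percent confidence level and 5 percent margin of error are assumptions you should tune to your own risk tolerance, and the database list is obviously made up.

```python
import math
import random

def sample_size(population, confidence_z=1.96, margin=0.05, p=0.5):
    """Sample size for a given confidence level (z=1.96 ~ 95%) and margin of error,
    with a finite-population correction."""
    n0 = (confidence_z ** 2) * p * (1 - p) / (margin ** 2)
    return math.ceil(n0 / (1 + (n0 - 1) / population))

# Hypothetical list of databases -- in practice, pull this from your inventory.
databases = ["db_%04d" % i for i in range(3000)]

n = sample_size(len(databases))          # ~341 for 3,000 databases
tonight_sample = random.sample(databases, n)
print("Restore-test %d of %d databases tonight" % (n, len(databases)))
```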

 

Summary

Data is your most precious asset. Don’t treat it like it’s anything but that. Make sure no one is leaving server tapes lying around cubicles, practice the backup rule of three, and, above all, develop a sound data recovery plan.

The term "double bind" refers to an instance where a person receives two or more conflicting messages, each of them negating the other. Addressing one message creates a failure in the other, and vice-versa. A double bind is an unsolvable puzzle resulting in a no-win situation.

 

As a federal IT professional, you’ve probably come across a double bind or two in your career, especially in regard to your network and applications. The two depend on each other but, often, when something fails, it’s hard to identify which one is at fault.

 

Unlike a true double bind, though, this puzzle actually has a two-step solution:

 

Step 1: Check out your network

 

Throughout history, whenever things slow down, the first reaction of IT pros and end-users alike has been to blame the network. So let’s start there – even though any problems you might be experiencing may not be the poor network’s fault.

 

You need to monitor the overall performance of your network. Employ application-aware monitoring and deep-packet inspection to identify mission-critical applications that might be creating network issues. This can help you figure out if the issue is a network or application problem. If it's a network problem, you'll be able to quickly identify and resolve it.

 

What if it’s not the network? That’s where step two comes in.

 

Step 2: Monitor your application stack

 

Federal agencies have become reliant upon hundreds of applications. Each of these applications is responsible for different functions, but they also work together to form a central nervous system that, collectively, keeps things running. Coupled with a backend infrastructure, this application stack forms a critical yet complex system in which it can be difficult to identify a malfunctioning application.

 

Solving this challenge requires cultivating an application-centric view of your entire application stack, which includes not just the applications themselves, but all the components that help them operate efficiently, including the systems, storage, hypervisors, and databases that make up the infrastructure.

 

Consolidating management of your infrastructure internally and maintaining control of your application stack can help. Maintaining an internal level of involvement and oversight, even over cloud-based resources, is important and this approach gives you the control you need to more easily pinpoint and quickly address problems.

 

Of course, the best way to remediate issues is to never have them at all, but that’s not entirely possible. What you can do is mitigate the chances for problems by weighing performance against financial considerations before making changes to your network or applications.

 

For example, for many Defense Department agencies, the move to the cloud model is driven primarily by the desire for cost savings. While that’s certainly a benefit, you cannot discount the importance of performance when it comes to compute, storage and networking technologies, which are just as important.

 

These early considerations – combined with a commitment to network monitoring and a complete application stack view – can save you tons of money, time and trouble. Not to mention keeping you out of some serious binds.

 

Find the full article on Defense Systems.

Capacity Planning 101

The objective of Capacity Planning is to adequately anticipate current and future capacity demand (resource consumption requirements) for a given environment. This helps to accurately evaluate demand growth, identify growth drivers and proactively trigger any procurement activities (purchase, extension, upgrade etc.).

 

Capacity planning is based primarily on two items. The first is analyzing historical data to obtain organic consumption and growth trends. The second is predicting the future by analyzing the pipeline of upcoming projects, also taking into consideration migrations and hardware refreshes. IT and the business must work hand-in-hand to ensure that any upcoming projects are well known in advance.

 

The Challenges with Capacity Planning or “the way we’ve always done it”

 

Manual capacity planning by running scripts here and there, exporting data, compiling data, and leveraging Excel formulas can work. However, it is limited by the time you have available, and it comes at the expense of focusing on higher-priority issues.

 

The time spent manually parsing, reconciling, and reviewing data can be nothing short of a huge challenge, if not a waste of time. The larger an environment grows, the larger the dataset will be and the longer it will take to prepare capacity reports. And the more manual the work is, the more it is prone to human error. While it's safe to assume that any person with Excel skills and a decent set of instructions can generate capacity reports, the question remains about their accuracy. It's also important to point out that new challenges have emerged for those who like manual work.

 

Space saving technologies like deduplication and compression have complicated things. What used to be a fairly simple calculation of linear growth based on growth trends and YoY estimates is now complicated by non-linear aspects such as compression and dedupe savings. Since both compression and deduplication ratios are dictated by the type of data as well as the specifics of the technology (see in-line vs. at-rest deduplication, as well as block size), it becomes extremely complicated to factor this into a manual calculation process. Of course, you could “guesstimate” compression and/or deduplication factors for each of your servers. But the expected savings can also fail to materialize for a variety of reasons.
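To illustrate how much the ratio moves the answer, here is a toy runway calculation. Every number in it is hypothetical, and real arrays rarely apply one uniform reduction ratio across all data, which is exactly why guesstimating is risky.

```python
# A rough sketch of why data-reduction ratios change a capacity forecast.
# All numbers are hypothetical.
capacity_tb = 200.0          # usable array capacity
used_tb = 120.0              # currently consumed
growth_tb_per_month = 5.0    # organic growth from historical trending
reduction_ratio = 1.8        # assumed dedupe+compression ratio (varies by data type!)

months_raw = (capacity_tb - used_tb) / growth_tb_per_month
months_reduced = (capacity_tb - used_tb / reduction_ratio) / (growth_tb_per_month / reduction_ratio)

print("Runway without data reduction: %.1f months" % months_raw)        # 16 months
print("Runway assuming %.1fx reduction: %.1f months" % (reduction_ratio, months_reduced))  # ~48 months
```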

 

Typical mistakes in capacity management and capacity planning involve space reclamation activities at the storage array level, or rather, the lack of awareness of and activity around them. Monitoring storage consumption at the array level without relating it to the way storage has been provisioned at the hypervisor level may result in discrepancies. For example, not running Thin Provisioning Block Space Reclamation (through the VMware VAAI UNMAP primitive) on VMware environments may lead some individuals to believe that a storage array is reaching critical capacity levels while in fact a large portion of the allocated blocks is no longer active and can be reclaimed.

 

Finally, in manual capacity planning, any attempt to run "What-If" scenarios (adding n VMs with a given usage profile for a new project) is a wild guess at best. Even with the best intentions and focus, you are likely to end up either with an under-provisioned environment and resource pressure, or with an over-provisioned environment and idle resources. While the latter is preferable, it is still a waste of money that might have been invested elsewhere.
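For contrast, here is the kind of simple what-if arithmetic a planning engine automates. The cluster headroom and per-VM profile below are invented, and a real tool would also model HA reservations, hypervisor overhead, growth trends, and data reduction.

```python
# A toy "what-if": can the cluster absorb a new project of N VMs with a given profile?
# All capacity numbers are hypothetical and ignore HA reservations, overhead, etc.
cluster_free = {"vcpu": 96, "ram_gb": 768, "storage_tb": 40.0}
profile = {"vcpu": 4, "ram_gb": 16, "storage_tb": 0.5}   # per-VM usage profile
new_vms = 30

for resource, free in cluster_free.items():
    needed = profile[resource] * new_vms
    verdict = "fits" if needed <= free else "DOES NOT fit"
    print("%-10s need %-8s free %-8s -> %s" % (resource, needed, free, verdict))
```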

 

Capacity Planning – Doing It Right

 

As we’ve seen above, the following factors can cause incorrect capacity planning:

  • Multiple sources of data collected in different ways
  • Extremely large datasets to be processed/aggregated manually
  • Manual, simplistic data analysis
  • Key technological improvements not taken into account
  • No simple way to determine effects of a new project into infrastructure expansion plans

 

Additionally, all of the factors above are also prone to human errors.

 

Because the task of processing data manually is nearly impossible and also highly inefficient, precious allies such as SolarWinds Virtualization Manager are required to identify real-time issues, bottlenecks, and potential noisy neighbors, as well as wasted resources. Once these wasted resources are reclaimed, capacity planning can provide a better evaluation of the actual estimated growth in your environment.

 

Capacity planning activities are not just about looking into the future, but also about managing the environment as it is now. The link between Capacity Planning and Capacity Reclamation activities is crucial. Just as you want to keep your house tidy before planning an extension or improving it with new furniture, the same needs to be done with your virtual infrastructure.

 

Proper capacity planning should factor in the following items:

  • Central, authoritative data source (all the data is collected by a single platform)
  • Automated data aggregation and processing through software engine
  • Advanced data analysis based on historical trends and usage patterns
  • What-If scenarios engine for proper measurement of upcoming projects
  • Capacity reclamation capabilities (Managing VM sprawl)

 

Conclusion

 

Enterprises must consider whether capacity planning done "the way we've always done it" is adding any value to their business or is instead the Achilles heel of their IT strategy. Because of its criticality, capacity planning should not be treated as a recurring manual data collection and aggregation chore assigned to "people who know Excel". Instead, it should be run as a central, authoritative function that measures current usage, warns about potential issues, and provides key insights to plan future investments in time.

I have learned many lessons over the course of my career since college. I started out in computers back in 1986, at the dawn of the "clone" era. I cut my teeth and my hands by installing chips on memory cards and motherboards, replacing 5.25" floppy discs, and dealing with the jumble of cables on dual 5MB Winchester hard drives that cost and weighed an arm and a leg.

 

I moved on to networking with Novell and Banyan Vines, then Microsoft, onward into Citrix WinFrame/MetaFrame, and then into VMware and virtualization, adding storage technologies and ultimately cloud into that paradigm. I'm always tinkering, learning, and growing in my abilities. One of my key goals is to never stop learning.

 

Along the way, I’ve had the opportunity to undertake amazing projects, meet wonderful people, and be influenced by some of the most phenomenal people in the industry. It’s been almost a master’s degree in IT. I’ve actually had the opportunity to teach a couple classes at DePaul University in Business Continuity to the Master’s program in IS. What a great experience that was!

 

I’ve been contemplating my friends on The Geek Whisperers, (@Geek_Whisperers) as I listen to their podcast regularly. One of the things that they do regularly is ask their interviewee in some manner, what advice they’d offer someone just coming up in the industry, or at other times, the question is, what lesson have you taken most to heart or what mistake have you made that you would caution against to anyone who might be willing to follow your advice. This question, phrased in whatever manner, is a wonderful exercise in introspection.

 

There are many mistakes I've made in my career. Most, if not all, have proven to be learning experiences. I find this to be probably the single most important lesson anyone can learn. Mistakes are inevitable. Admit to them, be honest and humble about them, and most of all, learn from them. Nobody expects you not to make mistakes. My old boss used to say, "If you're going to screw up, do it BIG!" By this, he meant that you should push your boundaries, try to do significant things outside the box, and, above all, make positive change. I would say that this is some of the best advice I could give as well.

 

Humility is huge. The fine line between knowing the answer to a question, and acting as a jerk to prove it is what I consider to be the “Credibility Line.” I’ve been in meetings wherein a participant has given simply wrong information, and has fought to stay with that point to the point of belligerence. These are people with whom you’d much rather not fight. But, if the point itself is critical, there are ways in which you can prove that you’re in possession of the right information without slamming that other person. I once took this approach: A salesperson stated that a particular replication technology was Active/Active. I knew that it was Active/Passive. When he stated his point, I simply said, OK, let’s white-board the solution as it stands in Active/Active, acting as if I believed him to be right. With the white-boarding of the solution I proved that he’d been wrong, we all apologized and moved on. Nobody was made to feel less-than, all saved face, and we all moved forward.

 

This is not to say that it was the best approach, but simply one that had value in this particular situation. I felt that I'd handled it with some adept facility, and defused what could have been a very distracting argument.

 

One thing I always try to keep in mind is that we all have our own agendas. While I may have no difficulty acknowledging when and where I'm wrong, others may find that to be highly discouraging. If you were to choose to make someone look foolish, you'd look just as foolish or worse. Often the most difficult thing to do is to size up the person with whom you're speaking, in an effort to determine their personality type and motivations. Playing it safe and allowing them to show their cards and their personality, rather than making assumptions, is always a good policy. This goes for anyone with whom you may be speaking, from coworkers to customers, from superiors to peers. Assumption can lead to very bad things. Sometimes the best way to answer a question is with another question.

 

To summarize, my advice is this:

  • Push yourself.
  • Be humble.
  • Exploit your strengths.
  • Listen before answering.
  • Try not to assume, rather clarify.
  • Defer to others (your customers and your peers) as appropriately as possible.
  • Learn from your mistakes.
  • Above all, be kind.

 

One last piece of advice: leverage social media. Blog, tweet, and do these things as well and as artfully as you can. Follow the same rules that I've stated above, as your public profile is how people will know you, and remember: once on the web, always on the web.


Back in April, we helped you get ready for summer by offering some suggestions for summer reading. It ended up being one of the most read posts of the year, with nearly 2,300 views and over two dozen comments.

 

We already knew that THWACK was a community of avid readers, but the level of interest and nature of the comments showed us that you had a real thirst for Geek-recommended and Geek-approved sources of information.

 

So now, with the end of summer in sight, we thought we would create a companion to the "SolarWinds Summer Fun Reading list" to help you get back in the work mindset.

 

We wanted to collect a required reading list for the IT professional and budding monitoring engineer, so our choices reflect books that have stood the test of time in terms of skills, philosophy, and ideas that have deep relevance in the world of IT.

 

These picks are designed to help you get up to speed with some of the foundational concepts and history of IT generally, and monitoring specifically, including processes, technology, tips, tricks, and more. These are the things we at SolarWinds believe any IT pro worth their salt should know about or know how to do.

 

  • The Mythical Man-Month: Essays on Software Engineering, by Frederick P. Brooks Jr.
  • The Practice of System and Network Administration; The Practice of Cloud System Administration, by Thomas Limoncelli
  • The Psychology of Computer Programming; An Introduction to General Systems Thinking, by Gerald M. Weinberg
  • Accidental Empires: How the Boys of Silicon Valley Make Their Millions, Battle Foreign Competition, and Still Can't Get a Date, by Robert X. Cringely
  • Linux for Dummies, by Emmett Dulaney
  • Design Patterns: Elements of Reusable Object-Oriented Software, by Gamma, Helm, Johnson, and Vlissides
  • Network Warrior, by Gary A. Donahue
  • The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win, by Gene Kim and Kevin Behr
  • Liars and Outliers: Enabling the Trust that Society Needs to Thrive, by Bruce Schneier
  • The Clean Coder: A Code of Conduct for Professional Programmers, by Robert Martin
  • Commodore: A Company on the Edge, by Brian Bagnall
  • The Inmates Are Running the Asylum: Why High Tech Products Drive Us Crazy and How to Restore the Sanity, by Alan Cooper
  • In Search of Stupidity: Over Twenty Years of High Tech Marketing Disasters, by Merrill R. (Rick) Chapman
  • Bricklin on Technology, by Dan Bricklin

 

Meanwhile, we would be remiss if we didn’t mention the SolarWinds-specific titles that are out in the world:

 

As your summer tan fades, we hope this list will help you feel like you are ready to brush the sand from your feet, trade your shorts for a trusty pair of cargo pants, and return to the data center with new skills and renewed passion!

I'm back from the SQL Saturday Baton Rouge event. I've been doing SQL Saturday events for over six years now, and I still enjoy them as if it were my first time. It takes a special kind of geek to be willing to give up a weekend day for extra learning, but those of us in the SQL community know the value these events have. The ability to mingle with a few hundred like-minded folks all looking to connect, share, and learn is worth every minute, even on a weekend.

 

Anyway, here are the items I found most amusing from around the Internet. Enjoy!

 

Frequent Password Changes Is a Bad Security Idea

I never thought about passwords in this way, but now that I have, I can't help but think how much better off we will be when the machines take over and everything is a retina scan.

 

The Mechanics of Sprinting: Play a minigame to test your reaction time

Olympic sprinters average just under 200 milliseconds of reaction time. Time to see how you rate. There are tests for rowing and the long jump, with others coming soon. I can't stop trying to set a new personal best, and neither will you.

 

Facebook Wants To Clean Up The Clickbait

And they are going to do it using this one weird trick.

 

Apple is launching an invite-only bug bounty program

Nice to see Apple finally taking security seriously. But, true to form, even this program is being done with a touch of exclusivity, as opposed to other programs that have been in existence for years.

 

Only 9% of America Chose Trump and Clinton as the Nominees

Because I love it when people can tell a story with data visualizations, here's one that I think everyone will find interesting.

 

Delta’s Tech Meltdown Causes Hundreds of Flight Delays, Cancellations

I fly Delta frequently, so the outage they experienced hit home for me, not just because I'm a customer but because I'm a data professional who knows how hardware can fail when there is a power outage or power surge. This is also a good lesson in BCP planning, as well as a reminder that HA != DR.

 

This candle is only available in stores you've never heard about:

 


This past Monday morning Delta suffered a disruption to their ticketing systems. While the exact root cause has yet to be announced, I did find mention here that the issue was related to a switchgear, a piece of equipment that allows for power failover. It's not clear to me right now if Delta or Georgia Power is responsible for maintaining the switchgear, but something tells me that right now a DBA is being blamed for it anyway.

 

The lack of facts hasn't stopped the armchair architects from taking to the internet over the past 24 hours in an effort to point out all the ways that Delta failed. I wanted to wait until facts came out about the incident before offering my opinion, but that's not how the internet works.

 

So, here's my take on where we stand right now, with little to no facts at my disposal.

 

HA != DR

I've had to explain to more than one manager in my career that there is a big difference between high availability (HA) and disaster recovery (DR). Critics yesterday mentioned that Delta should have had geo-redundancy in place to avoid this outage. But without facts it's hard to say that such redundancy would have solved the issue. Once I heard about it being power related, I thought about power surges, hardware failures, and data corruption. You know what happens to highly available data that is corrupted? It becomes corrupted data everywhere, that's what. That's why we have DR planning, for those cases when you need to restore your data to the last known good point in time.

 

This Was a BCP Exercise

Delta was back online about six hours after the outage was first reported. Notice I didn't say they were "back to normal". With airlines it takes days to get everything and everyone back on schedule. But the systems were back online, in no small part thanks to some heroic efforts by the IT staff at Delta. This was not about HA or DR; this was about business continuity. At some point a decision was made on how best to move forward, on how to keep the business moving despite suffering a freak power outage involving a highly specialized piece of equipment (the switchgear). From what I can tell, without facts, it would seem the BCP planning at Delta worked rather well, especially when you consider that Southwest recently had to wait 12 hours to reboot their systems due to a bad router.

 

Too Big To Failover

Most recovery sites are not built to handle all of the regular workload; they are designed to handle just the minimum necessary for business to continue. Even if failover is an option, many times the issue isn't with the failover (that's the easy part), the issue is with the fallback to the original primary systems. The amount of data involved may be so cumbersome that a six-hour outage is preferable to the 2-3 days it might take to fail back. It is quite possible this outage was so severe that Delta was at a point where they were too big to failover. And while it is easy to just point to the cloud and yell "geo-redundancy" at the top of your lungs, the reality is that such a design costs money. Real money.

 

Business Decisions

If you are reading this and thinking "Delta should have foreseen everything you mentioned above and built what was needed to avoid this outage," then you are probably someone who has never sat down with the business side and worked through a budget. I have no doubt that Delta has the technical aptitude to architect a 21st century design, but the reality of legacy systems, volumes of data, and near real-time response rates on a global scale puts that price tag into the hundreds of millions of dollars. While that may be chump change to a high-roller such as yourself, for a company (and industry) with thin margins the idea of spending that much money is not appealing. That's why things get done in stages, a little bit at a time. I bet the costs for this outage, estimated in the tens of millions of dollars, are still less than the cost of the infrastructure upgrades needed to have all of their data systems rebuilt.

 

Stay Calm and Be Nice

If you've ever seen the Oscar-snubbed classic movie "Roadhouse", you know the phrase "be nice". I have read a lot of coverage of the outage since yesterday and one thing that has stood out to me is how professional the entire company has been throughout the ordeal. The CEO even made this video in an effort to help people understand that they are doing everything they can to get things back to normal. And, HE WASN'T EVEN DONE YET, as he followed up with THIS VIDEO. How many other CEOs put their face on an outage like this? Not many. With all the pressure on everyone at Delta, this attitude of staying calm and being nice is something that resonates with me.

 

The bottom line here, for me, is that everything I read about this makes me think Delta is far superior to their peers when it comes to business continuity, disaster recovery, and media relations.

 

Like everyone else, I am eager to get some facts about what happened to cause the outage, and would love to read the post-mortem on this event if it ever becomes available. I think the lessons that Delta learned this week would benefit everyone who has had to spend a night in a data center keeping their systems up and running.


The Public Cloud

Posted by arjantim Aug 9, 2016

A couple of years ago nobody really thought of the public cloud (although that might be different in the US), but things change, quickly. Since the AWS invasion of the public cloud space we've seen a lot of competitors try to win their share in this lucrative market. Lucrative is a well-chosen word here, as most of the businesses getting into this market take a big leap of faith, and most of them have to take losses for the first couple of years. But why should the public cloud be of any interest to you, and what are the things you need to think about? Let's take a plane and fly over to see what the public cloud has to offer, and whether it will take over the complete datacenter or just parts of it.

 

Most companies have only one purpose, and that is to make more money than they spend… And where prices are under pressure there is really only one thing to do: cut costs. A lot of companies see the public cloud as cutting costs, as you're only paying for the resources you use, and not for all the other stuff that is also needed to run your own "private cloud". And because of this they think the public cloud is cheaper than rebuilding their datacenters every 5 years or so.

 

To be honest, in a lot of ways these companies are right. Moving certain workloads to the public cloud will certainly help cut costs, and the public cloud might also make a great test/dev environment. The thing is, you need to determine the best public cloud strategy per company, and it might even be necessary to do it per department (in particular cases). But saying everything will be in the public cloud is a bridge too far for many companies… at the moment.

 

A lot of companies are already running loads of workloads in the public cloud, without even really realizing it. Microsoft Office 365 (and in particular Outlook) is one of the examples where a lot of companies use the public cloud, sometimes without really looking into the details or whether it is allowed by law. Yes, that's right: when going public you need to think about what can and can't be put in the cloud. Some companies are prohibited by national law from putting certain parts of their data in a public cloud, so make sure to look into everything before telling your company or customer to go public.

 

Most companies choose a gentle path toward the public cloud, and choose the right workloads to go public. This is the right way to do it if you're an established company with your own way of working, but then again, you need to think not only about your own preferences but also about the laws your company needs to follow.

 

In my last post on Private Cloud I mentioned the DART framework, as I think it is an important tool to go cloud (private at first, but Public also). In this post on Public Cloud I want to go for the SOAR framework.

 

Security - In a Public Cloud environment it is really important to secure your data. IT should make sure the public part(s) as well as the private part(s) are well secured and all data is safe. Governance, compliance, and more should be well thought of, and re-thought at every step of the way.

 

Optimization - the IT infrastructure is a key component in a fast-changing world. As I already mentioned, a lot of companies are looking to do more for less to get more profit. IT should be an enabler for the business, not some sort of fire brigade.

 

Automation - is the key to faster deployments. It's the foundation for continuous delivery and other DevOps practices. Automation enforces consistency across your development, testing, and production environments, and ensures you can quickly orchestrate changes throughout your infrastructure: bare metal servers, virtual machines, cloud, and container deployments. In the end, automation is a key component of optimization.

 

Reporting - is a misunderstood IT trade. Again, it is tightly connected with optimization but also with automation. For me, reporting is only possible with the right monitoring tools. If you want to be able to do the right reporting you need to have a "big brother" in your environment. Getting the right reports from public and private is important, and with those reports the company can further fine-tune the environment.

 

There is so much more to say, but I'll leave it at this for now. I really look forward to the comments, and I know there is no single "right" explanation for private, public, or hybrid cloud, but I think we need to help our companies understand the strength of the cloud. Help them sort out what kind to use and how. We're here to help them use IT as IT is meant to be, regardless of the name we give it. See you next time, and in the comments!

The Internet of Things (IoT) offers the promise of a more connected and efficient military, but Defense Department IT professionals are having a hard time turning that promise into reality. They’re deterred by the increasing demands and security vulnerabilities of more connected devices.

 

That hasn’t stopped defense agencies from exploring and investing in mobility and next-generation technology, including IoT devices. One of the points in the Defense Information Systems Agency’s 2015 – 2020 Strategic Plan specifically calls out the agency’s desire to “enable warfighter capabilities from a sovereign cyberspace domain, focused on speed, agility, and access.” The plan also notes “mobile devices…continue to transform our operational landscape and enable greater mission effectiveness through improved communication, access, information sharing, data analytics – resulting in more rapid response times.”

 

It’s a good thing the groundwork for IoT was laid a few years ago, when administrators were working on plans to fortify their networks against an onslaught of mobile devices. Perhaps unbeknownst to them, they had already begun implementing and solidifying strategies that can now serve as a good foundation for managing IoT’s unique set of challenges.

 

Tiny devices, big problems

 

The biggest challenge is the sheer number of devices that need to be considered. It's not just a few smartphones; with IoT, there is an explosion of potentially thousands of tiny devices with different operating systems, all pumping vast amounts of data through already overloaded networks.

Many of these technological wonders were developed primarily for convenience, with security as an afterthought. There’s also the not insignificant matter of managing bandwidth and latency issues that the plethora of IoT devices will no doubt introduce.

 

Making the IoT dream an automated reality

 

These issues can be addressed through strategies revolving around monitoring user devices, managing logs and events, and using encrypted channels – the things that administrators hopefully began implementing in earnest when the first iPhones began hitting their networks.

 

Administrators will need to accelerate their device tracking efforts to new levels. Device tracking will help identify users and devices and create watch lists, and the challenge will be the number of new devices. And while log and event management software will still provide valuable data about potential attacks, the attack surface and potential vulnerabilities will increase exponentially with the introduction of a greater number of devices and network access points.

 

More than ever, managers will want to complement these efforts with network automation solutions, which can correct issues as they arise. This creates a much more streamlined atmosphere for administrators to manage, making it easier for them to get a handle on everything that touches the network.

 

A reluctance to automate will not work in a world where everything, from the tablets at central command to the uniforms on soldiers’ bodies, will someday soon be connected. It’s now time for federal IT administrators to build off their BYOD strategies to help the Defense Department realize DISA’s desire for a highly connected and mobilized military.

 

Find the full article on Defense Systems.

It seems like you can't talk to anyone in IT without hearing about "software-defined" something these days. Ever since Software-Defined Networking (SDN) burst on the scene, it's the hot trend. My world of storage is just as bad: It seems as if every vendor is claiming to sell "Software-Defined Storage" without much clarity about what exactly it is. Is SDS just the latest cloudy buzzword or does it have a real meaning?

 

Wikipedia, that inerrant font of all human knowledge, defines Software-Defined Storage (SDS) to be "policy-based provisioning and management of data storage independent of the underlying hardware." It goes on to talk about abstraction, automation, and commodity hardware. I can get behind that definition. Wikipedia also contrasts SDS with mere "Software-Based Storage", which pretty much encompasses "all storage" these days!

 

I've fought quite a few battles about what is (and isn't) "Software-Defined Storage", and I've listened to more than enough marketers twisting the term, so I think I can make some informed statements.

 

#1: Software-Defined Storage Isn't Just Software-Based Storage

 

Lots of marketers are slapping the "software-defined" name on everything they sell. I recently talked to one who, quite earnestly, insisted that five totally different products were all "SDS", including a volume manager, a scale-out NAS, and an API-driven cloud storage solution! Just about the only thing all these products have in common is that they all contain software. Clearly, "software" isn't sufficient for "SDS".

 

It would be pointless to try to imagine a storage system, software-defined or otherwise, that doesn't rely on software for the majority of its functionality.

 

#2: Commodity Hardware Isn't Sufficient For SDS Either

 

Truthfully, all storage these days is primarily software. Even the big fancy arrays from big-name vendors are based on the same x86 PC hardware as my home lab. That's the rise of commodity hardware for you. And it's not just x86: SAS and SATA, PCI Express, Ethernet, and just about every other technology used in storage arrays is common to PC's and servers too.

 

Commodity hardware is a great way to improve the economics of storage, but software running on x86 isn't sufficiently differentiated to be called "SDS" either.

 

#3: Data Plane + Control Plane = Integration and Automation

 

In the world of software-defined networking, you'll hear a lot about "separating the data plane from the control plane." We can't make the exact same analogy for storage since it's a fundamentally different technology, but there's an important conceptual seed here: SDN is about programmability and centralized control, and this architecture allows such a change. Software-defined storage should similarly allow centralization of control. That's what "policy-based provisioning and management" and "independent of the underlying hardware" are all about.

 

SDS, like SDN, is about integration and automation, even if the "control plane/data plane" concept isn't exactly the same.

 

#4: SDS Is Bigger Than Hardware

 

SDN was invented to transcend micro-management of independent switches, and SDS similarly must escape from the confines of a single array. The primary challenge in storage today is scalability and flexibility, not performance or reliability. Abstraction of storage software from underlying hardware doesn't just mean being able to use different hardware; abstraction also means being able to span devices, swap components, and escape from the confines of a "box."

 

SDS ought to allow storage continuity even as hardware changes.

 

My Ideal Software-Defined Storage Solution

 

Here's what my ideal SDS solution looks like:

  1. A software platform for storage that virtualizes and abstracts underlying components
  2. A scalable solution that can grow, shrink, and change according to the needs of the application and users
  3. API-driven control, management, provisioning, and reporting that allows the array to "disappear" in an integrated application platform

 

Any storage solution that meets these requirements is truly software-defined and will deliver transformative benefits to IT. We've already seen solutions like this (VSAN, Amazon S3, Nutanix), and they are all more notable for what they deliver to applications than for the similarity or differences between their underlying components. Software-defined storage really is a different animal.
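To show what "API-driven control" and policy-based provisioning feel like from the application side, here is a hypothetical sketch using Python's requests library. The endpoint, payload fields, and policy names are all invented; every SDS platform exposes its own API, so treat this as an illustration of the pattern rather than any particular product.

```python
import requests  # pip install requests

# A minimal sketch: ask for capacity by policy, not by array/LUN.
# Endpoint and payload are hypothetical -- real SDS platforms each have their own API.
payload = {
    "name": "orders-db-volume",
    "size_gb": 500,
    "policy": {"replicas": 2, "tier": "flash", "snapshots": "hourly"},
}
resp = requests.post("https://sds.example.local/api/v1/volumes", json=payload, timeout=30)
resp.raise_for_status()
print("Provisioned volume id:", resp.json().get("id"))
```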

I’m probably going to get some heat for this, but I have to get something off my chest. At Cisco Live this year, I saw a technology that was really flexible, with amazing controllability potential, and just cool: PoE-based LED lighting. Rather than connecting light fixtures to mains power and controlling them via a separate control network, it’s all one cable. Network and power, with the efficiency of solid-state LED lighting, with only one connection. However, after several vendor conversations, I can’t escape the conclusion that the idea is inherently, well… dumb.

 

Okay, Not Dumb, Just Math

 

Before Cree®, Philips®, or any of the other great companies with clever tech in the Cisco® Digital Ceiling Pavilion get out their pitchforks, I have to offer a disclaimer: this is just my opinion. But it is the opinion of an IT engineer who also does lots of electrical work at home, automation, and, in a former life, network consulting for a commercial facilities department. I admit I may be biased, and I'm not doing justice to features like occupancy and efficiency analytics, but the problem I can't get past is the high cost of PoE lighting. It's a regression to copper cable and, worse, at least as shown at Cisco Live, ridiculous switch overprovisioning.

 


First, the obvious: the cost of pulling copper. We’re aggressively moving clients to ever-faster WLANs both to increase flexibility and decrease network wiring costs. With PoE lighting, each and every fixture and bulb has its own dedicated CAT-3+ cable running hub-and-spoke back to an IT closet. Ask yourself this question: do you have more workers or bulbs in your environment? Exactly. Anyone want to go back to the days of thousands of cables in dozens of thick bundles?  (Image right: The aftermath of only two dozen fixtures.)

 

Second, and I’m not picking on Cisco here, is the per port cost of using enterprise switches as wall plugs. UPNP is a marvelous thing. A thousand-plus watts per switch is remarkable, and switch stacking makes everything harmonious and redundant. Everyone gets a different price of course, but the demo switch at Cisco Live was a Catalyst 3850 48 Port UPOE, and at ~$7,000, that’s $145/port. Even a 3650 at ~ $4000 comes to $84 to connect a single light fixture.

 

It’s not that there’s anything inherently wrong with this idea, and I would love to have more Energy Wise Catalysts in my lab, but this is overkill. Cisco access switches are about bandwidth, and PoE LEDs need little. As one vendor in the pavilion put it, “… and bandwidth for these fixtures and sensors is stupid simple. It could work over dial-up, no problem.” It’s going to be tough to sell IT budget managers enterprise-grade stackable switches with multi-100 gig backplanes for that.

 

And $84/port is just a SWAG at hardware costs. Are you going to put a rack of a dozen Catalysts directly on mains power? Of course not. You’re going to add in UPS to protect your enterprise investment. (One of the touted benefits of PoE lighting is stand-by.) The stated goal by most of the vendors was to keep costs under $100/port, and that’s going to be a challenge when you include cable runs, IT closets, switches, and UPS. Even then, $100/port?
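The per-port math is simple enough to sanity-check yourself. The sketch below uses the approximate list prices quoted above and an invented fixture count, and it ignores cabling, closets, and UPS entirely.

```python
# Back-of-the-napkin per-port cost math from the post (list prices are approximate).
def per_port(switch_price, ports=48):
    return switch_price / ports

print("Catalyst 3850 48-port UPOE: $%.0f/port" % per_port(7000))   # ~$145
print("Catalyst 3650 48-port PoE:  $%.0f/port" % per_port(4000))   # ~$84

# Scale it out: a floor with 500 fixtures, each needing its own port and cable run.
fixtures = 500
print("Switch hardware alone for %d fixtures: $%.0f" % (fixtures, fixtures * per_port(4000)))
```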

 

Other Considerations

 

There are a couple of other considerations, like Cat 3+ cable efficiency at high power. As you push more power over tiny network cables, more of it is lost along the way, and at a certain output per port, overall PoE system efficiency drops below that of AC-powered LEDs. There's also an IPAM management issue, with each fixture getting its own IP. That adds DHCP, and more subnets to wrangle, without adding much in terms of management. Regardless of how you reach each fixture, you'll still have to name, organize, and otherwise manage how they're addressed. Do you really care whether you manage them by IP or by a self-managing low-power mesh?

 

DC Bus for the Rest of Us

 

What this initiative really highlights is that just as we’re in the last gasps of switched mobile carrier networks, and cable television provided in bundles via RF, we need to move past the most basic concept of AC mains lighting to the real opportunity of DC lighting. Instead of separate Ethernet runs, or hub-and-spoke routed 120VAC Romex, the solution for lighting is low voltage DC busses with an overlay control network. It’s the low voltage and efficient common DC transformation that’s the real draw.

 

Lighting would evolve into universally powered, addressable nodes, daisy-chained together with a tap-able cable supplying 24-48VDC from common power supplies. In a perfect world, the lighting bus would also support a data channel, but then you get into the kind of protectionist vendor shenanigans that stall interoperability. What seems to be working for lighting, and IoT in general, is a more future-proof and replaceable control system, like wireless IPv6 networks today, rather than whatever comes next later.

 

Of course, on the other hand, if a manufacturer starts shipping nearly disposable white-label PoE switches that aren't much smarter than mid-spans, mated to shockingly inexpensive and thin cables, then maybe PoE lighting has a brighter future.

 

What do you think? Besides “shockingly” not being the worst illumination pun in this post?

The Rio Olympics start this week, which means one thing: around-the-clock reports on the Zika virus. If we don't get at least one camera shot of an athlete freaking out after a mosquito bite, then I'm going to consider this event a complete disaster.

 

Here are the items I found most amusing from around the Internet. Enjoy!

 

#nbcfail hashtag on Twitter

Because I enjoy reading about the awful broadcast coverage from NBC and I think you should, too. 

 

Apple taps BlackBerry talent for self-driving software project, report says

Since they did so well at BlackBerry, this bodes well for Apple.

 

Parenting In The Digital Age

With my children hitting their teenage years, this is the stuff that scares me the most.

 

Microsoft's Windows NT 4.0 launched 20 years ago this week

Happy Birthday! Where were you when NT 4.0 launched in 1996? I'm guessing some of you unlucky ones were on support that night. Sorry.

 

Larry Ellison Accepts the Dare: Oracle Will Purchase NetSuite

First Larry says that the cloud isn't a thing. Then he says he invented the cloud. And now he overspends for NetSuite. With that kind of background he could run for President. Seriously though, this purchase shows just how far behind Oracle is with the cloud.

 

This Guy Hates Traffic... So He's Building a Flying Car

Flying cars! I've been promised this for years! Forget the Tesla, I will line up to buy one of these.

 

ACHIEVEMENT UNLOCKED! Last weekend I found all 4 IKEA references made in the Deadpool movie! What, you didn't know this was a game?


When there are application performance issues, most IT teams focus on the hardware, after blaming and ruling out the network, of course. If an application is slow, the first thought is to add hardware to combat the problem. Agencies have spent millions throwing hardware at performance issues without a good understanding of the true bottlenecks slowing down an application.

 

But a recent survey on application performance management by research firm Gleanster LLC reveals that the database is the No. 1 source of issues with performance. In fact, 88 percent of respondents cite the database as the most common challenge or issue with application performance.

 

Understanding that the database is often the cause of application performance issues is just the beginning; knowing where to look and what to look for is the next step. There are two main challenges to trying to identify database performance issues:

 

There are a limited number of tools that assess database performance. Tools normally assess the health of a database (is it working, or is it broken?), but don’t identify and help remediate specific database performance issues.

 

Database monitoring tools that do provide more information don’t go much deeper. Most tools send information in and collect information from the database, with little to no insight about what happens inside the database that can impact performance.

 

To successfully assess database performance and uncover the root cause of application performance issues, IT pros must look at database performance from an end-to-end perspective.

 

The application performance team should be performing wait-time analysis as part of regular application and database maintenance. This is a method that determines how long the database engine takes to receive, process, fulfill and return a request for information. A thorough wait-time analysis looks at every level of the database and breaks down each step to the millisecond.

 

The next step is to look at the results, then correlate the information and compare. Maybe the database spends the most time writing to disk; maybe it spends more time reading memory. Understanding the breakdown of each step helps determine where there may be a slowdown and, more importantly, where to look to identify and fix the problem.

 

We suggest that federal IT shops implement regular wait-time analysis as a baseline of optimized performance. The baseline can help with change management. If a change has been implemented, and there is a sudden slowdown in an application or in the database itself, a fresh analysis can help quickly pinpoint the location of the performance change, leading to a much quicker fix.
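Here is a minimal sketch of that baseline comparison. The wait categories and millisecond totals are invented, and in practice they would come from your database engine's wait statistics as collected by your monitoring tool.

```python
# Compare today's wait-time profile to an established baseline and flag big jumps.
baseline_ms = {"disk_write": 4200, "memory_read": 1800, "lock_wait": 600, "network": 300}
today_ms    = {"disk_write": 4400, "memory_read": 1900, "lock_wait": 9500, "network": 310}

for wait_type, base in sorted(baseline_ms.items()):
    now = today_ms.get(wait_type, 0)
    change = (now - base) / base * 100.0
    flag = "  <-- investigate" if change > 50 else ""
    print("%-12s baseline=%6dms today=%6dms (%+.0f%%)%s" % (wait_type, base, now, change, flag))
```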

 

Our nearly insatiable need for faster performance may seem like a double-edged sword. On one hand, optimized application performance means greater efficiency; on the other hand, getting to that optimized state can seem like an expensive, unattainable goal.

 

Knowing how to optimize performance is a great first step toward staying ahead of the growing need for instantaneous access to information.

 

Find the full article on Government Computer News.

What is VM sprawl?

VM sprawl is defined as a waste of resources (compute: CPU cycles and RAM consumption) as well as storage capacity due to a lack of oversight and control over VM resource provisioning. Because of its uncontrolled nature, VM sprawl has adverse effects on your environment’s performance at best, and can lead to more serious complications (including downtime) in constrained environments.

 

VM Sprawl and its consequences

Lack of management and control over the environment will cause VMs to be created in an uncontrolled way. This means not only the total number of VMs in a given environment, but also how resources are allocated to these VMs. You could have a large environment with minimal sprawl, but a smaller environment with considerable sprawl.

 

Here are some of the factors that cause VM sprawl (a rough sketch of how these candidates can be flagged follows the list):

 

  • Oversized VMs: VMs which were allocated more resources than they really need. Consequences:
    • Waste of compute and/or storage resources
    • Over-allocation of RAM will cause ballooning and swapping to disk if the environment falls under memory pressure, which will result in performance degradation
    • Over-allocation of virtual CPU will cause high co-stops, which means that the more vCPUs a VM has, the more it needs to wait for CPU cycles to be available on all the physical cores at the same moment. The more vCPUs a VM has, the less likely it is that all the cores will be available at the same time
    • The more RAM and vCPU a VM has, the higher the RAM overhead required by the hypervisor.

 

  • Idle VMs: VMs up and running, not necessarily oversized, but unused and showing no activity. Consequences:
    • Waste of compute and/or storage resources, plus RAM overhead at the hypervisor level
    • Resources wasted by idle VMs may impact CPU scheduling and RAM allocation while the environment is under contention
    • Powered-off VMs and orphaned VMDKs eat up storage capacity
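Below is a rough sketch of how idle, powered-off, and oversized candidates might be flagged from collected metrics. The VM names, thresholds, and figures are all hypothetical and would come from your monitoring platform in practice.

```python
# A rough sketch of flagging idle and oversized VMs from collected metrics.
vms = {
    "web01":   {"vcpu": 8, "avg_cpu_pct": 3.0, "ram_gb": 32, "avg_ram_pct": 12.0, "powered_on": True},
    "batch07": {"vcpu": 2, "avg_cpu_pct": 0.1, "ram_gb": 8,  "avg_ram_pct": 2.0,  "powered_on": True},
    "old-app": {"vcpu": 4, "avg_cpu_pct": 0.0, "ram_gb": 16, "avg_ram_pct": 0.0,  "powered_on": False},
}

for name, m in vms.items():
    if not m["powered_on"]:
        print("%s: powered off -- reclaim its disks or archive it" % name)
    elif m["avg_cpu_pct"] < 1.0 and m["avg_ram_pct"] < 5.0:
        print("%s: idle -- candidate for decommissioning" % name)
    elif m["vcpu"] >= 4 and m["avg_cpu_pct"] < 10.0:
        print("%s: oversized -- consider removing vCPUs" % name)
```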

 

 

How to Manage VM sprawl

Controlling and containing VM sprawl relies on process and operational aspects. The former covers how one prevents VM sprawl from happening, while the latter covers how to tackle sprawl that happens regardless of controls set up at the process level.

 

Process

On the process side, IT should define standards and implement policies:

 

  • Role-Based Access Control, which defines roles and permissions for who can do what. This will greatly help reduce the creation of rogue VMs and snapshots.
  • Define VM categories and acceptable maximums: while not all VMs can fit in one box, standardizing on several VM categories (application, database, etc.) will help filter out bizarre or oversized requests. Advanced companies with self-service portals may want to restrict/categorize what VMs can be created by which users or business units.
  • Challenge any oversized VM request and demand justification for potentially oversized VMs.
  • Allocate resources based on real utilization. You can propose a policy where a VM's resources are monitored for 90 days, after which IT can adjust the allocation if the VM is undersized or oversized.
  • Implement policies on snapshot lifetimes and track snapshot creation requests if possible.

 

In certain environments where VMs and their allocated resources are chargeable, you should contact your customers to let them know that a VM needs to be resized or was already resized (based on your policies and rules of engagement) to ensure they are not billed incorrectly. It is worthwhile to formalize how VM sprawl management activities will be carried out, and to agree with stakeholders on pre-defined downtime windows that will allow you to seamlessly carry out any right-sizing activities.

 

Operational

Even with the controls above, sprawl can still happen. It can be caused by a variety of factors. For example, you could have a batch of VMs provisioned for one project, but while they passed through the process controls, they can sit idle for months eating up resources because the project could end up being delayed or cancelled and no one informed the IT team.

 

In VMware environments where storage is thin provisioned at the array level, and where Storage DRS is enabled on datastore clusters it’s also important to monitor the storage consumption at the array level. While storage capacity will appear to be freed up at the datastore level after a VM is moved around or deleted, it will not be released on the array and this can lead to out-of-storage conditions. A manual triggering of the VAAI Unmap primitive will be required, ideally outside of business hours, to reclaim unallocated space. It’s thus important to have, as a part of your operational procedures, a capacity reclamation process that is triggered regularly.

 

The usage of virtual infrastructure management tools with built-in resource analysis and reclamation capabilities, such as SolarWinds Virtualization Manager, is a must. By leveraging software capabilities, these tedious analysis and reconciliation tasks are no longer required, and dashboards present IT teams with immediately actionable results.

 

Conclusion

Even with all the good will in the world, VM sprawl will happen. Although you may have the best policies in place, your environment is dynamic, and in the rush of day-to-day IT operations you just can't keep an eye on everything. And this is coming from a guy whose team successfully recovered 22 TB of space previously occupied by orphaned VMDKs earlier this year.
