
Geek Speak

2,042 posts

Dropping into an SSH session and running esxtop on an ESXi host can be a daunting task! With well over 300 metrics available, esxtop can throw numbers and percentages at sysadmins all day long, but without a clear understanding of what they mean, they are of little use when troubleshooting. Below are a handful of metrics that I find useful when analyzing performance issues with esxtop.




Usage (%USED) - CPU is usually not the bottleneck when it comes to performance issues within VMware, but it is still a good idea to keep an eye on the average usage of both the host and the VMs that reside on it. High CPU usage on a VM may indicate a need for more vCPUs, or be a sign that something has gone awry within the OS. Chronic high CPU usage on the host may indicate the need for more resources, either additional cores or more ESXi hosts within the cluster.


Ready (%RDY) - CPU Ready is a very important metric that is brought up in nearly every blog post dealing with VMware and performance. Simply put, CPU Ready measures the amount of time that the VM is ready to run on physical CPUs but is waiting for the ESXi CPU scheduler to find the time to do so. Normally this is caused by other VMs competing for the same resources. VMs with a high %RDY will see degraded performance. A high value may indicate the need for more physical cores, or can sometimes be solved by removing unneeded vCPUs from VMs that do not require more than one.


Co-Stop (%CSTP) - Similar to Ready, Co-Stop measures the amount of time the VM was delayed by the ESXi CPU scheduler. The difference is that Co-Stop only applies to VMs with multiple vCPUs, whereas %RDY also applies to VMs with a single vCPU. A high number of VMs with a high Co-Stop may indicate the need for more physical cores within your ESXi host, too high a consolidation ratio, or quite simply too many multi-vCPU VMs.




Active (%ACTV) - Just as it’s a good idea to monitor the average CPU usage on both hosts and VMs, the same goes for active memory. Although we cannot necessarily use this metric for right-sizing because of the way it is calculated, it can be used to see which VMs are actively and aggressively touching memory pages.


Swapping (SWR/s,SWW/s,SWTGT,SWCUR) - Memory swapping is a very important metric to watch. Essentially, if we see these values anywhere above 0, it means that memory pages are being swapped to the swap file that is created upon VM power on. Instead of serving those pages from RAM, we are using much slower disk to do so. If we see swapping occur, we may be in the market for more memory on our physical hosts, or looking to migrate certain VMs to other hosts with free physical RAM.


Balloon (MEMCTLGT) - Ballooning isn’t necessarily a bad sign for memory consumption, but it can definitely be used as an early warning for swapping. When a value is reported for ballooning, it means that the host cannot satisfy the VM’s memory requirements from free memory and is reclaiming unused memory from other virtual machines. Once the balloon driver has reclaimed all it can, swapping is the next logical step, which can be very detrimental to performance.
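The swap and balloon guidance above can be written down as a small triage check. This is only a minimal sketch in Python, assuming the counters have already been collected into a dictionary. The key names (`balloon_mb`, `swap_read_mb_s`, `swap_write_mb_s`, `swap_current_mb`) are hypothetical labels chosen for illustration, not esxtop field names.

```python
def memory_pressure(sample):
    """Classify one VM's memory state from balloon and swap counters.

    `sample` is a dict using hypothetical keys (not real esxtop fields):
    balloon_mb, swap_read_mb_s, swap_write_mb_s, swap_current_mb.
    Missing keys are treated as 0.
    """
    swap_io = (
        sample.get("swap_read_mb_s", 0) > 0
        or sample.get("swap_write_mb_s", 0) > 0
    )
    if swap_io:
        # Pages are moving to or from the swap file right now.
        return "swapping"
    if sample.get("swap_current_mb", 0) > 0:
        # Pages sit in the swap file, but there is no swap I/O at the moment.
        return "swapped"
    if sample.get("balloon_mb", 0) > 0:
        # Early warning: the host is reclaiming memory via the balloon driver.
        return "ballooning"
    return "ok"


if __name__ == "__main__":
    print(memory_pressure({"balloon_mb": 512}))
```

Checking swap before balloon reflects the ordering in the text: ballooning is the warning, swapping is the problem.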




Latency (DAVG, GAVG, KAVG, QAVG) - When it comes to monitoring disk I/O, latency is king. Within a virtualized environment there are many different places where latency may occur, though: an I/O leaves the VM and passes through the VMkernel, the HBA, and the storage array. To help understand total latency we can look at the following metrics.

  • KAVG – This is the amount of time that the I/O spends within the VMkernel.
  • QAVG – This is the amount of time that the I/O spends waiting in the VMkernel queue. It is already counted as part of KAVG.
  • DAVG – This is the amount of time the I/O takes to leave the HBA, get to the storage array, and return.
  • GAVG – We can think of GAVG (Guest Average) as the sum of DAVG and KAVG (QAVG being part of KAVG): essentially the total latency as seen by the applications within the VM.


As you might be able to determine, a high QAVG/KAVG can most certainly be the result of too small a queue depth on your HBA, or possibly your host is far too busy and VMs need to be migrated to other hosts. A high DAVG (>20ms) normally indicates an issue with the storage array itself: either it is incorrectly configured or it is too busy to handle the load.
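As a rough illustration of that reasoning, the sketch below maps latency readings to the likely place to look. The 20 ms DAVG limit comes from the paragraph above. The 2 ms KAVG limit is an assumption (a commonly quoted rule of thumb, not a figure from this post), and the function name is hypothetical.

```python
DAVG_LIMIT_MS = 20.0  # threshold quoted in the post
KAVG_LIMIT_MS = 2.0   # assumption: rule-of-thumb value, not from the post


def diagnose_latency(davg_ms, kavg_ms):
    """Return a list of likely problem areas for one device's latency sample."""
    findings = []
    if davg_ms > DAVG_LIMIT_MS:
        # Time is being lost outside the host.
        findings.append("storage array or fabric")
    if kavg_ms > KAVG_LIMIT_MS:
        # Time is being lost inside the VMkernel, typically while queued.
        findings.append("queue depth or busy host")
    return findings or ["no latency problem detected"]


if __name__ == "__main__":
    print(diagnose_latency(davg_ms=25.0, kavg_ms=0.5))
```

Both checks run independently because a host can suffer from a slow array and a shallow queue at the same time.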




Dropped packets (DRPTX/DRPRX) - As far as network performance goes, there are only a couple of metrics that we can monitor at the host level. DRPTX and DRPRX track the packets that are dropped on the transmit and receive side respectively. When we begin to see these metrics go above 1, we may conclude that network utilization is very high and that we need to increase bandwidth, either out of the host or possibly somewhere along the path the packets are taking.


As I mentioned earlier, there are over 300 metrics within esxtop; the above are simply the core ones I use when troubleshooting performance. Certainly, a third-party monitoring solution can help you baseline your environment and use these stats to greater advantage by summarizing them in more visually appealing ways. For this week I’d love to hear about some of your real-life situations. When was there a time where you noticed a metric was “out of whack”, and what did you do to fix it? What are some of your favorite performance metrics that you watch, and why? Do you use esxtop, or do you have a favorite third-party solution you like to use?


Thanks for reading!

Historically, cyber security methods closely mirrored physical security – focused primarily on the perimeter and preventing access from the outside. As threats advanced, both have added layers, requiring access credentials or permission to access rooms and systems, and additional defensive layers continued to be added for further protection.


Today, however, the assumption is that everything is accessible: no layer is secure and, at some point, an intruder will get in, or is already in. What does this mean for the federal IT pro? Does it mean traditional security models are insufficient?


On the contrary; it means that as attacks and attackers get more sophisticated, traditional security models become one piece of a far greater security strategy made up of processes and tools that provide layers to enhance an agency’s security posture.


A layered approach


Agencies must satisfy federal compliance requirements, and the Risk Management Framework (RMF) was created to help. That said, meeting federal compliance does not mean you’re 100 percent secure; it’s simply one—critical—layer.


The next series of layers that federal IT pros should consider are those involved in network operations. Change monitoring, alerting, backups and rollbacks are useful, as are configuration management tools.


A network configuration management tool will help you create a standard, compliant configuration and deploy it across your agency. In fact, a good tool will let you create templates.


Automation is key and a configuration management tool will help you keep up with changes automatically; it will let you change your configuration template based on new NIST NVD recommendations and get those changes out quickly to ensure all devices maintain compliance.


In addition to a network configuration tool, federal IT pros should consider layering in the following tools to enhance security:


Patch management. Patch management is critical to ensuring all software is up to date, and all vulnerabilities covered. Look for a patch management tool that is automated and supports custom applications, as many agencies have unique needs and unique applications.


Traffic analysis. A traffic analyzer will tell you, at any given time, who is talking to whom, who is using which IP address, and who is sending what to whom. This is vital information. Particularly in the case of a threat, where you need to conduct forensics, a traffic analysis tool is your best weapon.


Security information and event management. Log and event management tools bring all the other pieces together to allow federal IT pros to see the entire environment (the bigger picture), to correlate information, and to make connections that reveal threats that may not have been visible before.


The ideal solution is to build on what you already have; use what works and keep adding. Create layers of security within every crevice of your environment. The more you can enhance your visibility, the more you know, the harder it will be for attackers to get through and the greater your chances of dramatically reducing risk will be.


Find the full article on Defense Systems.

Talk about disruptive innovation. Pokemon GO has put Nintendo back on the map as it has skyrocketed into the stratosphere of top mobile apps in just a week's time. The mobile game is a venture between Nintendo and Niantic, Inc., an independent entity in the Alphabet set of companies. It launched with much fanfare last week and generated critical mass and out-of-the-stratosphere velocity in terms of user adoption and game play. And herein lies the IT aspect.


There were noticeable hiccups in Quality-of-Service (QoS), such that even though it was a Top Charts game, it had a 3.5-star rating in the Google Play Store. Essentially, the launch and overwhelming response created a situation where the elastic supply of cloud powered by Niantic's Alphabet parent company could not meet the demands of the rabidly active user base. An example was the experience of our very own James Honey, who was attempting to create an account for his youngest. He went through the web portal and created an account when it didn't time out. It sent him a URL to verify his information to create his child's account, which also timed out. So he clicked on the customer support button, which sent an email, but the mean time to resolution is 48 hours. Interestingly enough, he's still waiting for the verification and approval of the account for his child more than 80 hours after he first started with the app download.


In closing, it circles back to things that IT pros already know all too well: (1) No matter how much planning and preparation goes into a production launch, it happens and IT pros have to remediate it efficiently and effectively; (2) Hybrid IT is reality as the app lifecycle is now spanning the developer's platform, running across distributed systems in the cloud(s), and is being consumed on someone's local mobile platform; and (3) the rate of change and scale of that change is ever growing over time and yet IT pros still have to deliver the CIO's SLA.


Do you think those three things are apropos? Let me know below in the comment section.

Cisco Live! begins in just a few days. In fact, even as you read this, my colleague, Cara Prystowski, is winging her way northwest from Austin to begin the process of setting up our brand new booth. (While I can't share pictures with you yet, trust me, it is a thing of wonder and beauty and I can't wait to see people's faces when they lay eyes on it for the first time.) Following close behind her is Head Geek™ Patrick Hubbard, to make sure that the 16 demo systems are all up and running so that we can show the latest and greatest that SolarWinds has to offer.


But Cisco Live (or #CLUS, as it's known on social media) is about more than a bunch of vendors hawking their wares on the trade show floor. Here's what I'll be looking forward to:


First and foremost, YOU!! If you are coming to #CLUS, please stop by booth 1419 and say hello. We'll all be there, and the best part of each day is meeting people who share our passion for all things monitoring (regardless of which tools you use).


For me, personally, that also means connecting with my fellow Ohio-based networking nerds. We even have our own hashtag: #CLUSOH, and I expect to tirelessly track them down like the Pink Panther detective namesake.


#CLUS also is the first time we can introduce a familiar face with a new role. Destiny Bertucci is a veteran at SolarWinds (she's employee #13), a veteran of our convention circuit, and our newest Head Geek!! Destiny is uniquely and eminently qualified to be part of the Head Geek team, and all of SolarWinds is excited to see what comes from this next chapter in her career.


So with 3 Head Geeks, not to mention the rest of our amazing staff in-booth (all technical, not a salesperson in sight!!) I am excited to tell our story, and share all the amazing new features in NPM 12, not to mention NCM, NTA, SRM, and the rest of the lineup.


As mentioned earlier, our new booth is amazing. It features multiple demo stations, two video walls, and a vivid LED-infused design that underscores the SolarWinds style and focus. For those of us in the booth, it's functional and comfortable. For folks visiting us, it's eye-catching and distinctive.


Along with the new design comes new buttons, stickers, and convention swag. This includes SolarWinds branded socks. YES, SOCKS!! There is an underground #SocksOfCLUS conversation on Twitter, and I am proud to say we will be representing with the best of them. Meanwhile, the buttons and stickers that have become a sought-after collectible at these shows feature all new messages.


Any convention would be a waste of time if one didn't hit at least a few talks, seminars, and keynotes. While much of my time is committed to being in the booth, I'm looking forward to attending "Architecture of Network Management Tools" and "Enterprise IPv6 Deployment," among other sessions.


Of course, I would be remiss if I didn't mention the SolarWinds session! Patrick and I will be presenting "Hybrid IT: The Technology Nobody Asked For" on Tuesday at 4:30pm in the Think Tank. The response so far has been fantastic, but there is still room available. We hope you will join us if you are in the neighborhood.


Despite our heavy commitment to the booth and our personal growth, all three of the Head Geeks will be carving out time to stop by Lauren Friedman's section of the Cisco© area to film some segments for Engineers Unplugged. Because whiteboards and unicorns!


Finally, and I almost hate to mention this because it's already sold out, so it's kind of a tease (but I will anyway): we're hosting our first convention-based (mini) SolarWinds User Group (SWUG) down at the Mandalay Bay Shark Reef. As always, SWUGs are a great way for us to meet our most passionate customers, but more than that, it's a way for customers within a particular area to meet each other and share ideas, brainstorm problems, and build community beyond the electronic boundaries of THWACK.


Obviously, there will be more, including social media scavenger hunts and Kilted Monday. But this should give you a sense of the range and scale of the Cisco Live experience. If you can't make it this year, I suggest you start saving and/or pestering your managers to help you make it out next time.


You know that we'll be there waiting to see you!

Recently, when hearing of the AWS outage due to weather in the Sydney data center, I began thinking about High Availability (HA), and the whole concept of “Build for Failure.” It made me wonder about the true meaning of HA. In the case of AWS, as Ben Kepes correctly stated on a recent Speaking in Tech podcast, a second data center in, for example, Melbourne would have provided failover capacity, which would have alleviated a high degree of consternation.


The following is a multi-level conversation about High Availability, so I thought that I’d break it up into some sections: Server level, storage level and cloud data center level.


Remember, Fault Tolerance (FT) is not HA. FT means that the application being hosted remains at 100% uptime, regardless of the types of faults experienced. HA means that the application can endure a fault on some levels with rapid recovery from downtime, including potentially little to no downtime. FT, particularly in networking and virtual environments, involves a mirrored device always sitting in standby mode, actively receiving simultaneous changes to the app, storage, etc., which will take over should the primary device encounter a fault of some sort.


Server-level HA, certainly the oldest IT segment in which we’ve been struggling with availability, has been addressed in a number of ways. Initially, when we realized that a single server was never going to meet the requirement (and typically this referred to a mission-critical app or database), we decided clustering would be our first approach. By building systems in which a pair (or a larger number) of servers operated as tandem devices to enhance uptime and grant a level of stability to the application being serviced, we addressed some of the initial issues on platforms vulnerable to patching and other kinds of downtime.


Issues in a basic cluster had to do with things like high availability in the storage, networking failover from a pair to a single host, etc. For example, what would happen in a scenario in which a pair of servers were tied together in a cluster, each with its own internal storage, and one went down? If a single host went down unexpectedly, the storage could become “out of sync” and data loss could ensue. This “Split Brain” is precisely what we’re hoping to avoid. If you lose consistency in your transactional database, a rebuild can often fix it, but it takes precious resources away from day-to-day operations. Even worse, there could be unrecoverable data loss, which can only be repaired with a restore. Assuming that the restore is flawless, how many transactions and how much time were lost during the restore, and from what recovery point were the backups made? So many potential losses here. Microsoft introduced the “Quorum Drive” concept into its clustering software, which made it possible to avoid “Split Brain” data and brought some cache coherency to an X86 SQL cluster. That helped quite a bit, but still didn’t really resolve the issue.
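One way to picture how a quorum drive prevents split brain is as a voting problem: every node and the shared quorum disk hold one vote, and a partition may stay online only if it can reach a strict majority. The Python sketch below is a deliberately simplified model written for illustration. It is not Microsoft's actual algorithm, and the function names are hypothetical.

```python
def has_quorum(votes_held, total_votes):
    """A partition may keep running only with a strict majority of all votes."""
    return votes_held > total_votes // 2


def surviving_partitions(partitions, total_votes):
    """Return the names of partitions that are allowed to stay online.

    `partitions` maps a partition name to the number of votes it can reach
    (its nodes, plus one if it owns the quorum disk).
    """
    return [
        name
        for name, votes in partitions.items()
        if has_quorum(votes, total_votes)
    ]


if __name__ == "__main__":
    # Two nodes plus a quorum disk: three votes in total.
    # The interconnect fails and node A holds the disk.
    print(surviving_partitions({"A": 2, "B": 1}, total_votes=3))
```

Without the disk vote, a two-node cluster that loses its interconnect has one vote on each side and neither can prove a majority, which is the tie the quorum drive is there to break.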


To me, it’s no wonder that so many applications that could easily have been placed onto X86 platforms took so long to get there. Mainframes and robust Unix systems, which do cost much to maintain and stand up, had so much more viability in the enterprise, particularly for mission-critical and high-transaction apps. Note that there are, of course, other clustering tools, for example Veritas Cluster manager, which made clustering within any application cluster a more consistent and actually quite a bit more robust process.


Along comes virtualization on the X86 level. Clustering happened in its own way: HA was achieved through features like Distributed Resource Scheduling, and as the data typically sat on shared disk, the consistency of the data could be ensured. We were also afforded a far more robust way to stand up much larger and more discrete applications, with tasks like adding processors, disk, and memory requiring no more than a reboot of individual virtual machines within the cluster that made up the application.


This was by no means a panacea, but for the first time, we’d been given the ability to address inherent stability issues on X86. The flexibility of vMotion allowed the backing infrastructure to handle the higher availability of the VM within the app cluster itself, which removed the internal cluster’s sheer reliance on hardware in network, compute, and storage. Initially, the quorum drive, which needed to be a raw device mapping in VMware, disappeared, making pure Microsoft SQL clusters more difficult, but as versions of vSphere moved on, these Microsoft clusters became truly viable.


Again, VMware has the ability to support a Fault Tolerant environment for truly mission-critical applications. FT has specific requirements, such as doubling the storage onto a different storage volume and doubling the CPU, memory, and VM count on a different host, because it involves mirrored devices, whereas HA doesn’t follow that paradigm.


In my next posting, I plan to address Storage as it relates to HA, storage methodologies, replication, etc.


The Actuator - July 6th

Posted by sqlrockstar Employee Jul 6, 2016

This edition comes to you from London where I am on vacation. I'm hopeful that #Brexit won't cause any issues getting in or out of the UK, or any issues while we are playing tourist. And I'm certainly not going to let a vacation get in the way of delivering the latest links for you to consume.


So, here is this week's list of things I find amusing from around the Internet. Enjoy!


Ransomware takes it to the next level: Javascript

Another article on ransomware which means another set of backups I must take, and so should you.


When the tickets just keep coming

Yeah, it's a lot like that.


IT must ditch ‘Ministry of No’ image to tackle shadow IT

OMG I never thought about this angle but now I need t-shirts that say "Ministry of NO" on them.


Happy 60th Birthday Interstate Highway System! We Need More Big-Bang Projects Like You

Once upon a time I was a huge fan of everything interstate. A huge 'roadgeek', I couldn't get enough knowledge about things like wrong road numbers, weird exit signs, and ghost roads like the Lincoln Highway and Route 66. Happy Birthday!


Oracle Loses $3 Billion Verdict for Ditching HP Itanium Chip

In related news, Larry Ellison has cut back on the number of islands he is thinking of buying this year.


Microsoft pays woman $10,000 over forced Windows 10 upgrade

Now *everyone* is going to expect money in exchange for not understanding the need for backups of your business critical data.


Apple Granted Patent For Phone Camera Disabling Device

If this technology comes to market it is going to wind up in the courts, which makes me think it was invented by lawyers at Apple.


Looking forward to more sightseeing with Lego Daughter the next two weeks:



The public sector frequently provides services and information via websites, and it’s important that these websites are up and running properly. And that’s not just for citizen-facing websites. Federal IT managers face the same challenge with internal sites such as intranets and back-end resource sites.


So what can federal IT pros do to keep ahead of the challenge, catch critical issues before they impact the user, and keep external and internal sites running at optimal performance?


The answer is three-fold:


  1. Monitor key performance metrics on the back-end infrastructure that supports the website.
  2. Track customer experience and front-end performance from the outside.
  3. Integrate back- and front-end information to get a complete picture.


Performance monitoring


Federal IT pros understand the advantages of standard performance monitoring, but monitoring in real time is just not enough. To truly optimize internal and external site performance, the key is to have performance information in advance.


This advance information is best gained by establishing a baseline, then comparing activity to that standard. With a baseline in place, a system can be configured to alert when activity strays from the baseline. Troubleshooting can then start immediately, and the root cause can be uncovered before it impacts customers. By anticipating an impending usage spike that will push capacity limits, the IT team can be proactive and avoid a slowdown.
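A baseline-driven alert of the kind described above could be sketched as follows. This is one simple approach among many, assuming the baseline is the mean and standard deviation of historical samples and that "straying" means falling more than a chosen number of standard deviations away. The function names and the default tolerance of 3 are illustrative assumptions, not taken from any particular monitoring product.

```python
from statistics import mean, stdev


def build_baseline(history):
    """Summarize historical samples as (average, standard deviation).

    Needs at least two samples, because stdev() is undefined for one.
    """
    return mean(history), stdev(history)


def strays_from_baseline(value, baseline, tolerance=3.0):
    """True when `value` is more than `tolerance` deviations from the average."""
    average, spread = baseline
    return abs(value - average) > tolerance * spread


if __name__ == "__main__":
    # Hypothetical response times in milliseconds from previous weeks.
    baseline = build_baseline([100, 110, 90, 105, 95])
    for sample in (108, 130):
        print(sample, strays_from_baseline(sample, baseline))
```

A real deployment would likely keep separate baselines per hour of day or day of week, since normal traffic on a Monday morning differs from a Sunday night.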


That historical baseline will also help allocate resources more accurately and enable capacity planning. Capacity planning analysis lets IT managers configure the system to send an alert based on historical analysis.


Automation is also a critical piece of performance monitoring. If the site crashes over the weekend, automated tools can restart it and send an alert when it’s back up so the team can start troubleshooting.


End-user experience monitoring


Understanding the customer experience is a critical piece of ensuring optimal site performance. Let’s say the back-end performance looks good, but calls are coming in from end-users that the site is slow. Ideally, IT staff would be able to mimic a user’s experience, from wherever that user is located, anywhere around the world. This allows the team to isolate the issue to a specific location.


It is important to note that federal IT pros face a unique challenge in monitoring the end-user experience. Many monitoring tools are cloud based, and therefore will not work within a firewall. If this is the case, be sure to find something that works inside the firewall that will monitor internal and external sites equally.


Data integration


The ultimate objective is to bring all this information together to provide the visibility across the front- and back-end alike, to know where to start looking for any anomaly, no matter where it originates.


The goal is to improve visibility in order to optimize performance. The more data IT pros can muster, the greater their power to optimize performance and provide customers with the optimal experience.


Find the full article on Government Computer News.


Hard to monitor

Posted by networkautobahn Jul 3, 2016

With monitoring, we try to achieve end-to-end visibility for our services. So everything that business-critical applications depend on needs to be watched. For the usual suspects like switches, servers, and firewalls, we have great success with that. But in all environments you have these black spots on the map that nobody is taking care of. There are two main categories of reasons why something is not monitored: the organisational (not my department) and the technical.




Not my Department Problem

In IT, the different departments sometimes look only after the devices that they are responsible for. Nobody has established a view over the complete infrastructure. That silo mentality ends up with a lot of finger pointing and ticket ping pong. Even more problematic are devices that are under the control of a 3rd party vendor or non-IT people. For example, the power supply of a building is the responsibility of the facility management. In the mindset of the facility management, monitoring has a completely different meaning from the one we have in IT. We have built up fully redundant infrastructures. We have put a lot of money and effort into making sure that every device has a redundant power supply, only to find that it ends up in a single power cord that goes to a single diesel power generator that was built in the 1950s. The facility management's monitoring consists of going to the generator twice a day and taking a look at the front panel of the machine.




And then you have the technical problems that can be a reason why something is not monitored. Here are some examples of why it is sometimes hard to implement monitoring from a technical perspective.

Ancient devices: Like the diesel power generator mentioned above, there are old devices that come from an era without any connectors that can be used for monitoring. Or it is a very old Unix or host machine. I have found all sorts of tech that was still important for a specific task. If it can't be decommissioned because a needed application or task still depends on it, then we have to find a way to monitor it. Normally we would connect with SNMP or an agent. If the device supports neither, we can try to watch the service that is delivered through the device, or add an extra sensor that can be monitored. In the case of the power generator, maybe we cannot watch the generator directly, but we can insert a device such as a UPS that can be watched over SNMP and shows the current power output.

With an intelligent PDU in every rack you can achieve even more granularity on the power consumption of your components. Often all the components of a rack have been replaced roughly every two years, but the rack and the power strip have been in use for 10+ years. The same is true for the cooling systems. There are additional sensor bars available that feed your monitoring with data in case the cooling plant cannot deliver it. With good monitoring you can react before something happens.




Another case is passive technologies like CWDM/DWDM or antennas. These can only be monitored indirectly, through other components that are capable of proper monitoring. With GBICs that have an active measurement / DDI interface you have access to real-time data that can be fed into the monitoring. Once you have this data in your monitoring you have a baseline and know what the attenuation across your CWDM/DWDM fibres should look like.

As a final thought, try to take a step back and figure out what is needed so that your services can run. Think in all directions and take nothing as given. Include everything you can think of, from climate and power to all the dependencies on storage, network, and applications. With that in mind, take a look at the monitoring and check whether you cover everything.
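To illustrate the optical baseline idea for passive CWDM/DWDM links mentioned above: if the transceivers at both ends report transmit and receive power, the loss across the passive path is simply the difference between the two. The Python sketch below assumes both readings are in dBm and have already been collected. The function names and the 2 dB margin are hypothetical choices for the example.

```python
def attenuation_db(tx_dbm, rx_dbm):
    """Loss across the link: launch power at one end minus received power at the other."""
    return tx_dbm - rx_dbm


def link_degraded(tx_dbm, rx_dbm, baseline_db, margin_db=2.0):
    """True when the current loss exceeds the recorded baseline by more than the margin."""
    return attenuation_db(tx_dbm, rx_dbm) - baseline_db > margin_db


if __name__ == "__main__":
    # Hypothetical readings: baseline loss of 7.5 dB recorded at installation.
    print(link_degraded(tx_dbm=1.0, rx_dbm=-7.0, baseline_db=7.5))
    print(link_degraded(tx_dbm=1.0, rx_dbm=-11.0, baseline_db=7.5))
```

Comparing against a per-link baseline rather than a fixed limit matters here, because a long span and a short patch naturally have very different normal losses.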

In previous posts, I've talked about the importance of having a network of trusted advisors. I've also discussed the importance of honing your DART-SOAR skills. Now I'd like us to explore one of those soft and cloudy topics that every IT professional deals with, but is reluctant to address directly. And that is the business of breaking down personal silos of inefficiency, particularly as it pertains to IT knowledge and expertise.


As an IT professional, I tend to put all the pressure of knowing and doing everything on myself, aka Team Me. I've been rewarded for this behavior, but it has also proven to be ineffective at times. This is because the incentives could influence me to not seek help from anyone outside the sphere of me.


The majority of my struggle was trust-related. I thought that discussing something I knew little or nothing about would be a sign of weakness. Oh, how naïve my green, professional self was. This modus operandi did harm to me, my team, and my organization because its inefficiencies created friction where there didn’t need to be any.


It wasn’t until I turned Me into We that I started truly owning IT. By believing in its core tenet and putting it into practice, it opened doors to new communities, industry friends, and opportunities. I was breaking down silos by overcoming the restrictions that I placed on myself. I was breaking my mold, learning cool new stuff, and making meaningful connections with colleagues who eventually became friends.


It reminds me of my WoW days. I loved playing a rogue and being able to pick off opponents in PvP Battlegrounds. But I had to pick my battles, because though I could DPS the life out of you, I didn’t have the skills to self-heal over time, or tank for very long. So engagements had to be fast and furious. It wasn't until I started running in a team with two druids (a tank and a healer), that we could really start to own our PvP competition. My PvP teammates also played rogues and shared their tips and tricks, which included Rogue communities with game play strategies. As a result, I really learned how to optimize my DPS and my other unique set of skills toward any given goal.


Do you stand on the IT front and try to win alone? Have you found winning more gratifying when you win as a team? Let me know in the comment section below.

In my previous post, I reviewed the 5 Infrastructure Characteristics that will be included as a part of a good design. The framework is laid out in the great work IT Architect: Foundations in the Art of Infrastructure Design. In this post, I’m going to continue that theme by outlining the 4 Considerations that will also be a part of that design.


While the Characteristics could also be called “qualities” and can be understood as a list of ways by which the design can be measured or described, Considerations could be viewed as the box that defines the boundaries of the design. Considerations set things like the limits and scope of the design, as well as explain what the architect or design team will need to be true of the environment in order to complete the design.


Design Considerations

I like to think of the four considerations as the four walls that create the box that the design lives in. When I accurately define the four different walls, the design that goes inside of it is much easier to construct. There are fewer “unknowns” and I leave myself less exposed to faults or holes in the design.


Requirements – Although they’re all very important, I would venture to say that Requirements is the most important consideration. “Requirements” is a list, either identified directly by the customer/business or teased out by the architect, of things that must be true about the delivered infrastructure. Some examples listed in the book are a particular Service Level Agreement metric that must be met (like uptime or performance) or governance or regulatory compliance requirements. Other examples I’ve seen could be usability/manageability requirements dictating how the system(s) will be interfaced with or a requirement that a certain level of redundancy must be maintained. For example, the configuration must allow for N+1, even during maintenance.


Constraints – Constraints are the considerations that determine how much liberty the architect has during the design process. Some projects have very little in the way of constraints, while others are extremely narrow in scope once all of the constraints have been accounted for. Examples of constraints from the book include budgetary constraints or the political/strategic choice to use a certain vendor regardless of other technically possible options. More examples that I’ve seen in the field include environmental considerations like “the environment is frequently dusty and the hardware must be able to tolerate poor environmentals” and human resource constraints like “it must be able to be managed by a staff of two.”


Risks – Risks are the architect’s tool for vetting a design ahead of time and showing the customer/business the potential technical shortcomings of the design imposed by the constraints. They also allow the architect to show the impact of certain possibilities outside the control of either the architect or the business. A technical risk could be that N+1 redundancy actually cannot be maintained during maintenance due to budgetary constraints. In this case, the risk is that a node fails during maintenance and puts the system into a degraded (and vulnerable) state. A less technical risk might be that the business is located within a few hundred yards of a river and flooding could cause a complete loss of the primary data center. When risks are purposely not mitigated in the design, listing them shows that the architect thought through the scenario, but due to cost, complexity, or some other business justification, the choice has been made to accept the risk.
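The N+1 risk described above comes down to simple arithmetic, and writing it out can make the exposure easier to show to the business. Here is a minimal sketch; the function name and node counts are hypothetical, and it assumes identical nodes with capacity counted in whole nodes.

```python
def tolerable_failures(total_nodes, nodes_required, in_maintenance=0):
    """Return how many additional node failures the cluster can absorb.

    Assumes identical nodes. nodes_required is the number of nodes needed
    to carry the full workload (the "N" in N+1).
    """
    available = total_nodes - in_maintenance
    return available - nodes_required


# Hypothetical cluster: 4 nodes, 3 needed to carry the load (N = 3, N+1 = 4).
print("Normal operation:", tolerable_failures(4, 3))
print("One node in maintenance:", tolerable_failures(4, 3, in_maintenance=1))
```

When the result drops to zero during maintenance, the design no longer holds N+1 in that window, which is exactly the kind of risk worth listing.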


Assumptions – For lack of a better term, an assumption is a C.Y.A. statement. Listing assumptions in a design shows the customer/business that the architect has identified a certain component of the big picture that will come into play but is not specifically addressed in the design (or is not technical in nature). A fantastic example listed in the book is an assumption that DNS infrastructure is available and functioning. I’m not sure if you’ve tried to do a VMware deployment recently, but pretty much everything beyond ESXi will fail miserably if DNS isn’t properly functioning. Although a design may not include specifications for building a functioning DNS infrastructure, it will certainly be necessary for many deployments. Calling it out here ensures that it is taken care of in advance (or in the worst case, the architect doesn’t look like a goofball when it isn’t available during the install!).
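An assumption like "DNS is available and functioning" is also easy to verify before install day. The following is a rough sketch of a pre-flight check using Python's standard `socket` module; the function name `unresolved_names` and the injectable `resolver` parameter are my own illustration, not something from the book.

```python
import socket


def unresolved_names(hostnames, resolver=socket.getaddrinfo):
    """Return the hostnames that fail forward DNS resolution."""
    failures = []
    for name in hostnames:
        try:
            resolver(name, None)
        except OSError:  # socket.gaierror is a subclass of OSError
            failures.append(name)
    return failures
```

Calling `unresolved_names()` with the list of host names the deployment depends on returns the ones that do not resolve; an empty list means the forward-lookup part of the assumption holds. It does not check reverse lookups, which many deployments also need.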


If you work these four Considerations (and the 5 Characteristics I detailed in my previous post) into any design documentation you’re putting together, you’re sure to have a much more impressive design. Also, if you’re interested in working toward design-focused certifications, many of these topics will come into play. Specifically, if VMware certification is of interest to you, VCIX/VCDX work will absolutely involve learning these factors well. Good luck on your future designs!


The Actuator - June 29th

Posted by sqlrockstar Employee Jun 29, 2016

Well, Britain has voted to leave the EU. I have no idea why, or what that means other than my family vacation to London next month just got a whole lot cheaper.


Anyway, here is this week's list of things I find amusing from around the Internet. Enjoy!


EU Proposal Seeks To Adjust To Robot Workforce

Maybe this is why the UK wanted to leave, because they don't want their robots to be seen as "electronic persons with specific rights and obligations."


Real-time dashboards considered harmful

This is what adatole and I were preaching about recently. To me, a dashboard should compel me to take action. Otherwise it is just noise.


Many UK voters didn’t understand Brexit, Google searches suggest

I won't pretend to know much about what it means, either. I'm hoping there will be a "#Brexit for Dummies" book available soon.


UK Must Comply With EU Privacy Law, Watchdog Argues

A nice example of how the world economy, and corporate business, is more global than people realize. Just because Britain wants to leave the EU doesn't mean they won't still be bound by EU rules should they wish to remain an economic partner.


Hacking Uber – Experts found dozen flaws in its services and app

Not sure anyone needed more reasons to distrust Uber, but here you go.


History and Statistics of Ransomware

Every time I read an article about ransomware I take a backup of all my files to an external drive because as a DBA I know my top priority is the ability to recover.


Blade Runner Futurism

If you are a fan of the movie, or sci-fi movies in general, set aside the time to read through this post. I like how the film producers tried to predict things like the cost of a phone call in the future.


Here's a nice reminder of the first step in fixing any issue:



The Pareto Principle


The Pareto principle, also known as the 80-20 principle, says that 20% of the issues will cause you 80% of the headaches. This principle is also known as The Law of the Vital Few. In this post, I'll describe how the Pareto principle can guide your work to provide maximum benefit. I'll also describe a way to question the information at hand using a technique known as 5 Whys.


The 80-20 rule states that when you address the top 20% of your issues, you'll remove 80% of the pain. That is a bold statement. You need to judge its accuracy yourself, but I've found it to be uncannily accurate.


The implications of this principle can take a while to sink in. On the positive side, it means you can make a significant impact if you address the right problems. On the down side, if you randomly choose what issues to work on, it's quite likely you're working on a low-value problem.


Not quite enough time


When I first heard of the 80-20 rule I was bothered by another concern: What about the remaining problems? You should hold high standards and strive for a high-quality network, but maintaining the illusion of a perfect network is damaging. If you feel that you can address 100% of the issues, there's no real incentive to prioritize. I heard a great quote a few months back:


     "To achieve great things, two things are needed; a plan, and not quite enough time." - Leonard Bernstein


We all have too much to do, so why not focus our efforts on the issues that will produce the most value? This is where having Top-N reports from your management system is really helpful. Sometimes you need to see the full list of issues, but only occasionally. More often, this restricted view of the top issues is a great way to get started on your Pareto analysis.
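If your management system can only export raw counts, a Top-N view with a cumulative share is easy to build yourself. Here is a minimal sketch; the issue names and counts are made up for illustration, and `pareto_top` is a hypothetical helper rather than a feature of any particular tool.

```python
def pareto_top(issue_counts, threshold=0.8):
    """Return the smallest set of top issues whose counts reach `threshold`
    of the total, as (name, count, cumulative_share) tuples."""
    total = sum(issue_counts.values())
    if total == 0:
        return []
    ranked = sorted(issue_counts.items(), key=lambda kv: kv[1], reverse=True)
    selected, running = [], 0
    for name, count in ranked:
        running += count
        selected.append((name, count, running / total))
        if running / total >= threshold:
            break
    return selected


# Hypothetical alert counts per issue, as might be exported from a monitoring tool.
alerts = {
    "link-flap site A": 120,
    "high CPU core-sw1": 40,
    "disk full nms01": 15,
    "bgp reset edge2": 10,
    "fan warning sw7": 8,
    "ntp drift": 7,
}
for name, count, share in pareto_top(alerts):
    print(f"{name:20s} {count:4d}  cumulative {share:.0%}")
```

In this made-up data set, two of the six issues account for 80% of the alerts, so those two are where the effort should go first.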


3G WAN and the 80-20 rule


A few years back, I was asked to design a solution for rapid deployment warehouses in remote locations. After an analysis of the options I ran a trial using a 3G-based WAN. We ran some controlled tests, cutting over traffic for 15 minutes, using some restrictive QoS policies. The first tests failed with a saturated downlink.


When I analyzed the top-talkers report for the site I saw something odd. It seemed that 80% of the traffic to the site was print traffic. It didn't make any sense to me, but the systems team verified that the shipping label printers use an 'inefficient' print driver.


At this point I could have ordered WAN optimizers to compress the files, but we did a 5 Whys analysis instead. Briefly, '5 Whys' is a problem solving technique that helps you identify the true root cause of issues.


  • Why is the bandwidth so high? - Printer traffic taking 80% of bandwidth
  • Why is printer traffic such a high percentage? - High volume of large transactions
  • Why is the file size so large? - Don't know - oh yeah we use PostScript (or something)
  • Why can't we use an alternative print format? - We can, let's do it, yay, it worked!
  • Why do we need to ask 5 whys? - We don't, you can stop when you solve the problem


The best form of WAN optimization is to suppress or redirect the demand. We don't all have the luxury of a software engineer to modify their code and reduce bandwidth, but in this case it was the most elegant solution. We were able to combine a trial, reporting, top-N and deep analysis with a flexible team. The result was a valuable trial and a great result.




Here's a quick summary of what I covered in this post:


  • The 80/20 principle can help you get real value from your efforts.
  • Top-N reports are a great starting point to help you find that top 20%.
  • The 5 Whys principle can help you dig deeper into your data and choose the most effective actions.


Of course a single example doesn't prove the rule.  Does this principle ring true for you, or perhaps you think it is nonsense? Let me know in the comments.

Let’s face it! We live in a world now where we are seeing a heavy reliance on software instead of hardware. With Software Defined Everything popping up all over the place we are seeing traditional hardware-oriented tasks being built into software – this provides an extreme amount of flexibility and portability in how we choose to deploy and configure various pieces of our environments.


With this software management layer taking hold of our virtualized datacenters we are going through a phase where technologies such as private and hybrid cloud are now within our grasp.  As the cloud descends upon us there is one key player that we need to focus on – the automation and orchestration that quietly executes in the background, the key component to providing the flexibility, efficiency, and simplicity that we as sysadmins are expected to provide to our end users.


To help drive home the importance of, and our reliance on, automation, let’s take a look at a simple task – that of deploying a VM. When we do this in the cloud, mainly public, it’s just a matter of swiping a credit card, providing some information such as a name and network configuration, waiting a few minutes/seconds and away we go. Our end users can have a VM set up almost instantaneously!


The ease of use and efficiency of the public cloud, as in the above scenario, is putting extended pressure on IT within their respective organizations – we are now expected to create, deliver and maintain these flexible, cloud-like services within our businesses, and do so with the same efficiency and simplicity that cloud brings to the table. Virtualization certainly provides a decent starting point for this, but it is automation and orchestration that will take us to the finish line.


So how do we do it?


Within our enterprise I think we can all agree that we don’t simply just create a VM and call it “done”! There are many other steps that come after we power up that new VM. We have server naming to contend with, and networking configuration (IP, DNS, Firewall, etc). We have monitoring solutions that need to be configured in order to properly monitor and respond to outages and issues that may pop up, and I’m pretty certain we will also want to include our newly created VM within some sort of backup or replication job in order to protect it. With more and more software vendors exposing public APIs we are now living in a world where it’s possible to tie all of these different pieces of our datacenter together.
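To make the idea of tying these pieces together a little more concrete, here is a minimal sketch of what an orchestration workflow for a new VM might look like. Every step function below is a placeholder that only records what it would do; in a real environment each one would call the relevant vendor's API, and all names and addresses here are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class VMRequest:
    name: str
    ip: str
    log: list = field(default_factory=list)


# Each step is a placeholder: a real one would call the relevant vendor API.
def create_vm(req):
    req.log.append(f"created VM {req.name}")


def configure_network(req):
    req.log.append(f"assigned {req.ip} and registered DNS for {req.name}")


def add_monitoring(req):
    req.log.append(f"added {req.name} to monitoring")


def add_to_backup(req):
    req.log.append(f"added {req.name} to backup job")


PIPELINE = [create_vm, configure_network, add_monitoring, add_to_backup]


def provision(req, steps=PIPELINE):
    """Run every step in order; stop and report if one fails."""
    for step in steps:
        try:
            step(req)
        except Exception as exc:
            req.log.append(f"FAILED at {step.__name__}: {exc}")
            return False
    return True


request = VMRequest(name="app01", ip="10.0.0.25")
ok = provision(request)
print("\n".join(request.log))
```

Keeping the steps in a plain list is a deliberate choice in this sketch: adding a firewall rule or a replication job later means adding one function to the list rather than rewriting the workflow.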


Automation and orchestration don’t stop at just creating VMs either – there’s room for them throughout the whole VM life cycle. The concept of the self-healing datacenter comes to mind – having scripts and actions performed automatically by monitoring software in an effort to fix issues within your environment as they occur – this is all made possible by automation.


So with this I think we can all conclude that automation is a key player within our environments, but the question always remains: should I automate task x? Meaning, will the time savings and benefits of creating the automation outweigh the effort and resources it will take to create the process? So with all this in mind I have a few questions: Do you use automation and orchestration within your environment? If so, what tasks have you automated thus far? Do you have a rule of thumb that dictates when you will automate a certain task? Believe it or not, there are people within this world that are somewhat against automation, whether it be in fear of their jobs or simply not adapting – how do you help “push” these people down the path of automation?

Government information technology administrators long have been trained to keep an eye out for the threats that come from outside their firewalls. But what if the greatest threats actually come from within?


According to a federal cybersecurity survey we conducted last year, that is a question that many government IT managers struggle to answer. In fact, a majority of the 200 respondents said they believe malicious insider threats are just as damaging as malicious external threats.


The threat of a careless user storing sensitive data on a USB drive left on a desk can raise just as much of a red flag as an anonymous hacker. Technology, training and policies must be consistently deployed, and work together, to ensure locked-down security.




Manual network monitoring is no longer feasible, and respondents identified tools pertaining to identity and access management, intrusion prevention and detection, and security incident and event management or log management as “top tier” tools to prevent internal and external threats.


Each solution offers continuous and automatic network monitoring, and alerts. Problems can be traced to individual users and devices, helping identify the root cause of potential insider threats. Most importantly, administrators can address potential issues far more quickly.


However, tools are just that—tools. They need to be supported with proper procedures and trained professionals who understand the importance of security and maintaining constant vigilance. 




According to the survey, 53 percent of respondents claim careless and untrained insiders are the largest threat at federal agencies, while 35 percent stated “lack of IT training” is a key barrier to insider threat detection. IT personnel should be trained on technology protocols and the latest government security initiatives and policies and receive frequent and in-depth information on agency-specific initiatives that could impact or change the way security is handled throughout the organization.


All employees should be aware of the dangers and costs of accidental misuse of agency information or rogue devices. Forty-seven percent of survey respondents stated employee or contractor computers were the most at-risk sources for data loss. Human error often can prove far more dangerous than explicit intent.




When it comes to accidental or careless insider threats, 56 percent of survey respondents were somewhat confident in their security policies, while only 31 percent were “very confident.” 


Agency security policies, combined with federal policies, serve as a security blueprint and are therefore extremely important. They should plainly outline the agency’s overall security approach and include specific details such as authorized users and use of acceptable devices.


As one of the survey respondents said: “Security is a challenge, and the enemy is increasingly sophisticated.” More and more, the enemy attacks from all fronts—externally and internally. Federal IT managers clearly need to be prepared to combat the threat using their own three-pronged attack of technology, training and policies.


Find the full article on Signal.

The short answer to the question in the title is NO, backup speed and restore speed are no longer related.  There are a number of reasons why this is the case.


Let's go back in time to understand the historical reasons behind this question. Historically, backup was a batch process that was sent to a serial device. Various factors led to the commonly used rule of thumb that restores took 50% to 100% longer than the full backup that created them. This started with the fact that a restore began by reading the entire full backup, which at a minimum would take the same amount of time as creating the full backup. Then multiple incremental backups had to be read, each of which added time to the restore due to the time involved in loading multiple tapes. (It wasn't that long ago that all backups were to tape.) Also, because backups were sent to tape, it was not possible to do the kind of parallel processing that today's restores are capable of.
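For reference, the historical rule of thumb works out as follows. This is only a sketch of the old tape-era estimate, using a hypothetical 8-hour full backup; it is not a way to predict restore times on modern systems.

```python
def restore_time_range(full_backup_hours):
    """Estimate a restore window using the historical rule of thumb
    that a restore took 50% to 100% longer than the full backup."""
    return (full_backup_hours * 1.5, full_backup_hours * 2.0)


low, high = restore_time_range(8)  # hypothetical 8-hour full backup
print(f"Estimated restore: {low:.0f} to {high:.0f} hours")
```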


The first reason why backup and restore speed are no longer related is actually negative. Today's backups are typically sent to a device that uses deduplication. While deduplication comes with a lot of benefits, it also can come with one challenge. The "dedupe tax," as it's referred to, is the difference between a device's I/O speed with and without deduplication. Depending on how dedupe is done, backup can be much faster than restore and vice versa.


The second -- and perhaps more important -- reason why backup and restore speed are unrelated is that backups and restores don't always use the same technology any more. Where historically both backups and restores were a batch process that simply copied everything from A to B, today's backups and restores can actually be very different from each other. A restore may not even happen, for example. If someone uses a CDP or near-CDP product, a "restore" may consist of pointing the production app to the backup version of that app until the production version of that app can be repaired. Some backup software products also have the ability to do a "reverse restore" that identifies the blocks or files that have been corrupted and only transfers and overwrites those blocks or files. That would also be significantly faster than a traditional restore.


One thing hasn't changed: the only way you can know the speed at which a restore will run is to test it.  Sometimes the more things change the more they stay the same.
