
We are happy to announce the arrival of the brand-new SolarWinds NAT Lookup free tool.

 

We know how frustrating it can be when your users can’t even get through the firewall to their own network. SolarWinds NAT Lookup simplifies the network address translation lookup process to help get your users past firewall translation issues, prevent overlapping policies that cause incorrect translations, and troubleshoot address issues effectively.

 

So, what exactly can you do with this new addition to the SolarWinds free tool catalog?

 

  1. Search an IP address across one or multiple Palo Alto firewalls.

     Quickly perform an IP lookup on one or multiple Palo Alto firewalls to verify IP translation information.

  2. Troubleshoot and verify NAT policy configuration.

     Render complete lists of all NAT policies that apply to a searched address to spot overlapping policies, cross-reference the policy configuration to live session traffic, and see the order of overlapping NAT policies.

 

  3. See live session traffic per firewall for the translated address.

     Gain insight into live session traffic per firewall for each translated address to help ensure the policy configuration matches observed behavior and performance.

 

  4. Export information for record keeping.

     Keep historical records of each executed search by exporting the policy configurations from the tool into CSV format.

 

Additional Features:

  • Removes the need for users to have direct access to firewalls
  • Allows easy distribution to other IT groups instead of granting direct access to your sensitive firewalls
  • Helps users identify dedicated translated addresses

How do you plan to use SolarWinds NAT Lookup? We’d love to hear from you, so feel free to post your use cases in the comments below.

 

If you’d like to get the nitty-gritty, in-depth info about NAT Lookup and how you can make the most of it, check out this article.

Synthetic user monitoring is a technique that simulates user transactions—or common paths on websites—so the administrator can watch for performance issues. These transactions are meant to represent how a user might be experiencing the site. For instance, is a potential customer getting an error when they add an item to their cart? Is a specific page loading slowly or not loading at all? These are things that can affect your bottom line and result in unplanned fire drills.

 

Synthetic user monitoring should not be confused with Real User Monitoring. Real User Monitoring captures and analyzes transactions from real users on a site. It helps you understand the load times your pages deliver to browsers in your users’ actual locations.

 

These approaches provide different perspectives on web performance. Each has its benefits, but today—in honor of the release of Web Performance Monitor 3.0—we’re going to focus on situations when synthetic user monitoring is a good choice.

 

Find Performance Issues Before They Cause Problems for Your Users

IT infrastructure monitoring tools are great at telling you if a server or a service is up or down, but users might still be frustrated even if these things look OK. Synthetic user experience monitoring tools let you see if an overall transaction is working (can a user purchase something from your site?) or if a certain step is having trouble (when I click “buy” my payment processing is hanging). Once you’re alerted, you can go into troubleshooting mode with the specifics of what your users are seeing to minimize the impact. Plus, you can continuously run these tests from multiple locations to ensure things are working where your users are. 
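
To make the idea concrete, here’s a minimal sketch of a single-URL synthetic check written in Python with the third-party requests library. The URL and the five-second threshold are hypothetical placeholders, and a product like Web Performance Monitor drives full browser transactions rather than one HTTP call:

import requests  # third-party HTTP client library

def check_page(url, timeout=10):
    """Fetch a page the way a simple synthetic probe would and report pass/fail."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # treat HTTP 4xx/5xx responses as failures
    except requests.RequestException as err:
        return False, str(err)
    # Flag slow responses even when the status code looks healthy
    if response.elapsed.total_seconds() > 5:
        return False, 'page loaded but took too long'
    return True, 'ok'

ok, detail = check_page('https://www.example.com/cart')  # hypothetical transaction step
print('PASS' if ok else 'FAIL', detail)

Run on a schedule from several locations, a check like this is what alerts you to a broken step before your users start calling.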

 

Benchmark Your Site’s Performance to Identify Areas for Improvement

As mentioned, synthetic user experience monitoring tools can watch your websites from multiple locations at frequencies of your choice. Seeing this data over time can help you identify areas to optimize going forward. Waterfall charts can be particularly helpful to pinpoint performance bottlenecks over time.

 

Monitor the Performance of Critical SaaS Applications From Inside Your Firewall

Most companies rely on third-party SaaS applications to run some aspects of their business. For instance, your sales team may be using a SaaS CRM solution to drive and track their daily activities. It’s critical to know if your coworkers are having issues getting what they need. While you don’t own the app, you’re the one they’ll come to when they have issues. A common scenario is setting up a transaction to make sure a valid user can log in successfully and be alerted if it fails.

 

Knowing about failures or performance issues before your users can save you time and frustration. Synthetic user experience monitoring can help when it comes to websites and web-based applications. How have you used it? Comment below and let us know.

It’s easy to recognize problems in Ruby on Rails, but finding each problem’s source can be a challenging task. A problem due to an unexpected event could result in hours of searching through log files and attempting to reproduce the issue. Poor logs will leave you searching, while a helpful log can assist you in finding the cause right away.

 

Ruby on Rails applications automatically create and maintain the basic text logs for each environment, such as development, staging, and production. You can easily format and add extra information to the logs using open-source logging libraries, such as Lograge and Fluentd. These libraries effectively manage small applications, but as you scale your application across many servers, developers need to aggregate logs to troubleshoot problems across all of them.

 

In this tutorial, we will show you how Ruby on Rails applications handle logging natively. Then, we’ll show you how to send the logs to SolarWinds® Papertrail™. This log management solution enables you to centralize your logs in the cloud and provides helpful features like fast search, alerts, and more.

 

Ruby on Rails Native Logging

Ruby offers a built-in logging system. To use it, simply include the following code snippet in your environment.rb (development.rb/production.rb). You can find environments under the config directory of the root project.

config.logger = Logger.new(STDOUT)

Or you can include the following in an initializer:

Rails.logger = Logger.new(STDOUT)

By default, each log is created under #{Rails.root}/log/ and the log file is
named after the environment in which the application is running. The default format gives basic information that includes the date/time of log generation and description (message or exception) of the log.

D, [2018-08-31T14:12:44.116332 #28944] DEBUG -- : Debug message
I, [2018-08-31T14:12:44.117330 #28944]  INFO -- : Test message
F, [2018-08-31T14:12:44.118348 #28944] FATAL -- : Terminating application, raised unrecoverable error!!!
F, [2018-08-31T14:12:44.122350 #28944] FATAL -- : Exception (something bad happened!):

Each log line also includes the severity, otherwise known as the log level. The log levels enable you to filter the logs when outputting them or when monitoring for problems such as errors or fatal conditions. The available log levels are :debug, :info, :warn, :error, :fatal, and :unknown. These are converted to uppercase when output in the log file.

 

Formatting Logs Using Lograge

The default logging in Ruby on Rails during development or in production can be noisy, as you can see below. It also records a limited amount of
information for each page view.

I, [2018-08-31T14:37:44.588288 #27948]  INFO -- : method=GET path=/ format=html controller=Rails::WelcomeController action=index status=200 duration=105.06 view=51.52 db=0.00 params={'controller'=>'rails/welcome', 'action'=>'index'} headers=#<ActionDispatch::Http::Headers:0x046ab950> view_runtime=51.52 db_runtime=0

Lograge adds extra detail and uses a format that is less human readable, but more useful for large-scale analysis through its JSON output option. JSON makes it easier to search, filter, and summarize large volumes of logs. The discrete fields facilitate the process of searching through logs and filtering for the information you need.

I, [2018-08-31T14:51:54.603784 #17752]  INFO -- : {'method':'GET','path':'/','format':'html','controller':'Rails::WelcomeController','action':'index','status':200,'duration':104.06,'view':51.99,'db':0.0,'params':{'controller':'rails/welcome','action':'index'},'headers':'#<ActionDispatch::Http::Headers:0x03b75520>','view_runtime':51.98726899106987,'db_runtime':0}

In order to configure Lograge in a Ruby on Rails app, you need to follow some simple steps:

Step 1: Find the Gemfile under the project root directory and add the following gem.

gem 'lograge'

Step 2: Enable Lograge in each relevant environment (development, production, staging) or in the initializer. You can find all those environments under the config directory of your project. To find the initializer, open up the config directory of your project.

# config/initializers/lograge.rb
# OR
# config/environments/production.rb
Rails.application.configure do
  config.lograge.enabled = true
end

Step 3: If you’re using Rails 5’s API-only mode and inherit from ActionController::API, you must define it as the controller base class that Lograge will patch:

# config/initializers/lograge.rb
Rails.application.configure do
  config.lograge.base_controller_class = 'ActionController::API'
end

With Lograge, you can include additional attributes in log messages, like user ID or request ID, host, source IP, etc. You can read the Lograge documentation to get more information.


Here’s a simple example that captures three attributes:

class ApplicationController < ActionController::Base
  before_action :append_info_to_payload

  def append_info_to_payload(payload)
    super
    payload[:user_id] = current_user.try(:id)
    payload[:host] = request.host
    payload[:source_ip] = request.remote_ip
  end
end

To write these three attributes to the log, add the following block to your environment configuration (production.rb/development.rb).

config.lograge.custom_options = lambda do |event|
  event.payload
end

Troubleshoot Problems Faster Using Papertrail

Papertrail is a popular cloud-hosted log management service that integrates with many different logging libraries. It makes it easier to centralize all your Ruby on Rails log management in the cloud and to track activity in real time, so you can identify and troubleshoot issues in production applications faster.

 

Papertrail provides numerous features for handling Ruby on Rails log files, including:

 

Instant log visibility: Papertrail provides fast search and team-wide access. It also provides analytics reporting and webhook monitoring, which can typically be set up in less than a minute.

 

Aggregate logs: Papertrail aggregates logs across your entire deployment, making them available from a single location. It provides you with an easy way to access logs, including application logs, database logs, Apache logs, and more.


 

Tail and search logs: Papertrail lets you tail logs in real time from
multiple devices. With the help of advanced searching and filtering tools, you can quickly troubleshoot issues in a production environment.

 

Proactive alert notifications: Almost every application has critical events
that require human attention. That’s precisely why alerts exist. Papertrail gives you the ability to receive alerts via email, Slack, Librato®, PagerDuty, or any custom HTTP webhooks of your choice.


 

Log archives: You can load the Papertrail log archives into third-party utilities, such as Redshift or Hadoop.

 

Logs scalability: With Papertrail, you can scale your log volume and desired searchable duration.

 

Encryption: For your security, Papertrail supports optional TLS encryption
and certificate-based destination host verification.

Configuring Ruby on Rails to Send Logs to Papertrail

It’s an easy task to get started with Papertrail. If you already have log files,
you can send them to Papertrail using Nxlog or remote_syslog2. These utilities monitor the log files and send new logs to Papertrail. Next, we’ll show you how to send events asynchronously from Ruby on Rails using the remote_syslog_logger.

Add the remote_syslog_logger to your Gemfile. If you are not using a Gemfile, run the following script:

$ gem install remote_syslog_logger

Change the environment configuration file to log via remote_syslog_logger. This is almost always in config/environment.rb (to affect all environments) or config/environments/<environment name>.rb, such as config/environments/production.rb (to affect only a specific environment). Update the host and port to the ones given to you in your Papertrail log destination settings.

config.logger = RemoteSyslogLogger.new('logsN.papertrailapp.com', XXXXX)

It’s that simple! Your logs should now be sent to Papertrail.

 

Papertrail is designed to help you troubleshoot customer problems, resolve error messages, improve slow database queries, and more. It gives you analytical tools to help identify and resolve system anomalies and potential security issues. Learn more about how Papertrail can give you frustration-free log management in the cloud, and sign up for a trial or the free plan to get started.

Slow websites on your mobile device are frustrating when you’re trying to look up something quickly. When a page takes forever to load, it’s often due to a spotty network connection or a website that is overly complicated for a phone. Websites that load many images or videos can also eat up your data plan. Most people have a monthly cap on the amount of data they can use, and it can be expensive to pay an overage fee or upgrade your plan.

 

Can switching to a different browser app truly help websites load faster and use less data? We’ll put the most popular mobile browsers to the test to see which is the fastest and uses the least data. Most people use their phone’s default browser app, like Safari on iPhone or Chrome on Android. Other browsers, like Firefox Focus and Puffin, claim to be better at saving data. Let’s see which one comes out on top in our testing.

 

How We Benchmark

We’ll look specifically at page-load performance by testing three popular websites with different styles of content. The first will be the Google home page, which should load quickly as Google designed it to be fast. Next, we’ll measure the popular social media website Reddit. Lastly, we’ll test BuzzFeed, a complex website with many ads and trackers.

 

To conduct these tests, we’ll use an Apple iPhone 7. (We may look at other phones such as Android in future articles.) We’ll use the browsers with default settings and clear any private browsing data so cached data won’t change the results.

 

Since we don’t have access to the browser developer tools we’d typically have on a desktop, we’ll need to use a different technique. One way is to time how long it takes to download the page, but some websites preload data in the background to make your next click load faster. From the user’s perspective, this shouldn’t count toward the page-load time because it happens behind the scenes. A better way is to record a video of each page loading. We can then play them back and see how long each took to load all the visible content.

 

To see how much data each browser used, we’ll use something called a “proxy server” to monitor the phone’s connections. Normally, phones load data directly through the cellular carrier’s LTE connection or through a router’s Wi-Fi connection. A proxy server acts like a man in the middle, letting us count how much data passes between the website and the phone. It also lets us see which websites it loaded data from and even the contents of the data.

 

We’ll use the proxy server software called Fiddler. This tool also enables us to decrypt the HTTPS connection to the website and spy on exactly which data is being sent. We configured it for iOS by installing a root certificate on our phone, which the computer can use to decrypt the data. Fiddler terminates the SSL connection with the external website, then encrypts the data to our phone using its own root certificate. It allows us to see statistics on which sites were visited, which assets were loaded, and more.



 

The Puffin browser made things more challenging because we were unable to see the contents of pages after installing the Fiddler root certificate. It’s possible Puffin uses a technique called certificate pinning. Nevertheless, we were still able to see the number of bytes being sent over the connection to our phone and which servers it connected to.

 

Which Browser Has the Best Mobile Performance?

Here are the results of measuring the page-load time for each of the mobile browsers against our three chosen websites. Faster page load times are better.

Browser          Google.com   Reddit.com   Buzzfeed.com
Safari           3.48s        5.50s        8.67s
Chrome           1.03s        4.93s        5.93s
Firefox          1.89s        3.47s        3.50s
Firefox Focus    2.67s        4.90s        5.70s
Puffin           0.93s        2.20s        2.40s

 

The clear winner in the performance category is Puffin, which loaded pages about twice as fast as most other browsers. Surprisingly, it even loaded Google faster than Chrome. Puffin claims the speed is due to a proprietary compression technology. Most modern browsers support gzip compression, but it’s up to site operators to enable it. Puffin can compress all content by passing it through its own servers first. It can also downsize images and videos so they load faster on mobile.

 

Another reason Puffin was so much faster is because it connected to fewer hosts. Puffin made requests to only 14 hosts, whereas Safari made requests to about 50 hosts. Most of those extra hosts are third-party advertisement and tracking services. Puffin was able to identify them and either remove them from the page or route calls through its own, faster servers at cloudmosa.net.

Puffin                            Safari
vid.buzzfeed.com: 83              img.buzzfeed.com: 51
google.com: 9                     www.google-analytics.com: 16
www.google.com: 2                 www.buzzfeed.com: 14
en.wikipedia.org: 2               tpc.googlesyndication.com: 9
pointer2.cloudmosa.net: 2         securepubads.g.doubleclick.net: 7
data.flurry.com: 2                pixiedust.buzzfeed.com: 7
www.buzzfeed.com: 2               vid.buzzfeed.com: 6
pivot-ha2.cloudmosa.net: 1        cdn-gl.imrworldwide.com: 6
p40-buy.itunes.apple.com: 1       www.facebook.com: 6
gd11.cloudmosa.net: 1             sb.scorecardresearch.com: 3
gd10.cloudmosa.net: 1             cdn.smoot.apple.com: 3
gd9.cloudmosa.net: 1              pagead2.googlesyndication.com: 3
collector.cloudmosa.net: 1        video-player.buzzfeed.com: 3
www.flashbrowser.com: 1           gce-sc.bidswitch.net: 3
                                  secure-dcr.imrworldwide.com: 3
                                  connect.facebook.net: 3
                                  events.redditmedia.com: 3
                                  s3.amazonaws.com: 2
                                  thumbs.gfycat.com: 2
                                  staticxx.facebook.com: 2
                                  id.rlcdn.com: 2
                                  i.redditmedia.com: 2
                                  googleads.g.doubleclick.net: 2
                                  videoapp-assets-ak.buzzfeed.com: 2
                                  c.amazon-adsystem.com: 2
                                  buzzfeed-d.openx.net: 2
                                  pixel.quantserve.com: 2
                                  … 20 more omitted

 

It’s great Puffin was able to load data so quickly, but it raises some privacy questions. Any users of this browser are giving CloudMosa access to their entire browsing history. While Firefox and Chrome let you opt out of sending usage data, Puffin does not. In fact, it’s not possible to turn this tracking off without sacrificing the speed improvements. The browser is supported by ads, although its privacy policy claims it doesn’t keep personal data. Each user will have to decide if he or she is comfortable with this arrangement.

 

Which Browser Uses the Least Mobile Data?

Now let’s look at the amount of data each browser uses. Again, we see surprising results:

Browser          Google.com   Reddit.com   Buzzfeed.com
Safari           0.82MB       2.89MB       4.22MB
Chrome           0.81MB       2.91MB       5.46MB
Firefox          0.82MB       2.62MB       3.15MB
Firefox Focus    0.79MB       2.61MB       3.13MB
Puffin           0.54MB       0.17MB       42.2MB

 

Puffin was the clear leader for loading google.com and it dominated reddit.com by a factor of 10. It claims it saved 97% of data usage on reddit.com.



 

However, Puffin lost on buzzfeed.com by a factor of 10. In Fiddler, we saw that it made 83 requests to vid.buzzfeed.com. It appears it was caching video data in the background so videos would play faster. While doing so saves the user time, it ends up using way more data. On a cellular plan, this approach could quickly eat up a monthly cap.

 

As a result, Firefox Focus came in the lead for data usage on buzzfeed.com. Since Firefox Focus is configured to block trackers by default, it was able to load the page using the least amount of mobile data. It was also able to avoid making requests to most of the trackers listed in the Buzzfeed section above. In fact, if we take away Puffin, Firefox Focus came in the lead consistently for all the pages. If privacy is important, Firefox Focus could be a great choice for you.

 

How to Test Your Website Performance

Looking at the three websites we tested, we see an enormous difference in page-load time and in the amount of data used. This matters because higher page-load times are correlated with higher bounce rates and even lower online purchases.

 

Pingdom® makes it even easier to test your own website’s performance with page speed monitoring. It gives you a report card showing how your website compares with others in terms of load time and page size.

To get a better idea of your customer’s experience, you can see a film strip showing how content on the page loads over time. Below, we can see that Reddit takes about two seconds until it’s readable. If we scroll over, we’d see it takes about six seconds to load all the images.

The SolarWinds® Pingdom® solution also allows us to dive deeper into a timeline view showing exactly which assets were loaded and when. The timeline view helps us see if page assets are loading slowly because of network issues or their size, or because third parties are responding slowly. The view will give us enough detail to go back to the engineering team with quantifiable data.

Pingdom offers a free version that gives you a full speed report and tons of actionable insights. The paid version also gives you the filmstrip, tracks changes over time, and offers many more website monitoring tools.

 

Conclusion

The mobile browser you choose can make a big difference in terms of page-load time and data usage. We saw that the Puffin browser was able to load pages much faster than the default Safari browser on an Apple iPhone 7. Puffin also used less data to load some, but not all, pages. However, for those who care about privacy and saving data on their mobile plan, Firefox Focus may be your best bet.

 

Because mobile performance is so important for customers, you can help improve your own website using the Pingdom page speed monitoring solution. This tool will give you a report card to share with your team and specific actions you can take to make your site faster.

We’re no strangers to logging from Docker containers here at SolarWinds® Loggly®. In the past, we’ve demonstrated different techniques for logging individual Docker containers. But while logging a handful of containers is easy, what happens when you start deploying dozens, hundreds, or thousands of containers across different machines?

In this post, we’ll explore the best practices for logging applications deployed using Docker Swarm.

Intro to Docker Swarm

Docker Swarm is a container orchestration and clustering tool from the creators of Docker. It allows you to deploy container-based applications across a number of computers running Docker. Swarm uses the same command-line interface (CLI) as Docker, making it more accessible to users already familiar with Docker. And as the second most popular orchestration tool behind Kubernetes, Swarm has a rich ecosystem of third-party tools and integrations.

A swarm consists of manager nodes and worker nodes. Managers control how containers are deployed, and workers run the containers. In Swarm, you don’t interact directly with containers, but instead define services that define what the final deployment will look like. Swarm handles deploying, connecting, and maintaining these containers until they meet the service definition.

For example, imagine you want to deploy an Nginx web server. Normally, you would start an Nginx container on port 80 like so:

$ docker run --name nginx --detach --publish 80:80 nginx

With Swarm, you instead create a service that defines what image to use, how many replica containers to create, and how those containers should interact with both the host and each other. For example, let’s deploy an Nginx image with three containers (for load balancing) and expose it over port 80.

$ docker service create --name nginx --detach --publish 80:80 --replicas 3 nginx

When the deployment is done, you can access Nginx using the IP address of any node in the Swarm.


To learn more about Docker services, see the services documentation.

 

The Challenges of Monitoring and Debugging Docker Swarm

Besides the existing challenges in container logging, Swarm adds another layer of complexity: an orchestration layer. Orchestration simplifies deployments by taking care of implementation details such as where and how containers are created. But if you need to troubleshoot an issue with your application, how do you know where to look? Without comprehensive logs, pinpointing the exact container or service where an error occurred can become an operational nightmare.

On the container side, nothing much changes from a standard Docker environment. Your containers still send logs to stdout and stderr, which the host Docker daemon accesses using its logging driver. But now your container logs include additional information, such as the service that the container belongs to, a unique container ID, and other attributes auto-generated by Swarm.

Consider the Nginx example. Imagine one of the containers stops due to a configuration issue. Without a monitoring or logging solution in place, the only way to know this happened is by connecting to a manager node using the Docker CLI and querying the status of the service. And while Swarm automatically groups log messages by service using the docker service logs command, searching for a specific container’s messages can be time-consuming because it only works when logged in to that specific host.

 

How Docker Swarm Handles Logs

Like a normal Docker deployment, Swarm has two primary log destinations: the daemon log (events generated by the Docker service), and container logs (events generated by containers). Swarm doesn’t maintain separate logs, but appends its own data to existing logs (such as service names and replica numbers).

The difference is in how you access logs. Instead of showing logs on a per-container basis using docker logs <container name>, Swarm shows logs on a per-service basis using docker service logs <service name>. This aggregates and presents log data from all of the containers running in a single service. Swarm differentiates containers by adding an auto-generated container ID and instance ID to each entry.

For example, the following message was generated by the second container of the nginx_nginx service, running on swarm-client1.

# docker service logs nginx_nginx
nginx_nginx.2.subwnbm15l3f@swarm-client1 | 10.255.0.2 - - [01/Jun/2018:22:21:11 +0000] "GET / HTTP/1.1" 200 612 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0" "-"

To learn more about the logs command, see the Docker documentation.

 

Options for Logging in Swarm

Since Swarm uses Docker’s existing logging infrastructure, most of the standard Docker logging techniques still apply. However, to centralize your logs, each node in the swarm will need to be configured to forward both daemon and container logs to the destination. You can use a variety of methods such as Logspout, the daemon logging driver, or a dedicated logger attached to each container.

 

Best Practices to Improve Logging

To log your swarm services effectively, there are a few steps you should take.

 

1. Log to STDOUT and STDERR in Your Apps

Docker automatically forwards all standard output from containers to the built-in logging driver. To take advantage of this, applications running in your Docker containers should write all log events to STDOUT and STDERR. If you try to manage log files from within your application instead, you risk losing crucial data about your deployment.

 

2. Log to Syslog or JSON

Syslog and JSON are two of the most commonly supported logging formats, and Docker is no exception. Docker stores container logs as JSON files by default, but it includes a built-in driver for logging to Syslog endpoints. Both JSON and Syslog messages are easy to parse, contain critical information about each container, and are supported by most logging services. Many container-based loggers such as Logspout support both JSON and Syslog, and Loggly has complete support for parsing and indexing both formats.

 

3. Log to a Centralized Location

A major challenge in cluster logging is tracking down log files. Services could be running on any one of several different nodes, and having to manually access log files on each node can become unsustainable over time. Centralizing logs lets you access and manage your logs from a single location, reducing the amount of time and effort needed to troubleshoot problems.

One common solution for container logs is dedicated logging containers. As the name implies, dedicated logging containers are created specifically to gather and forward log messages to a destination such as a syslog server. Dedicated containers automatically collect messages from other containers running on the node, making setup as simple as running the container.

 

Why Loggly Works for Docker Swarm

Normally you would access your logs by connecting to a master node, running docker service logs <service name>, and scrolling down to find the logs you’re looking for. Not only is this labor-intensive, but it’s slow because you can’t easily search, and it’s difficult to automate with alerts or create graphs. The more time you spend searching for logs, the longer problems go unresolved. This also means creating and maintaining your own log centralization infrastructure, which can become a significant project on its own.

Loggly is a log aggregation, centralization, and parsing service. It provides a central location for you to send and store logs from the nodes and containers in your swarm. Loggly automatically parses and indexes messages so you can search, filter, and chart logs in real-time. Regardless of how big your swarm is, your logs will be handled by Loggly.

 

Sending Swarm Logs to Loggly

The easiest way to send your container logs to Loggly is with Logspout. Logspout is a container that automatically routes all log output from other containers running on the same node. When deploying the container in global mode, Swarm automatically creates a Logspout container on each node in the swarm.

 

To route your logs to Loggly, provide your Loggly Customer Token and a custom tag, then specify a Loggly endpoint as the logging destination.

# docker service create --name logspout --mode global --detach --volume=/var/run/docker.sock:/var/run/docker.sock --volume=/etc/hostname:/etc/host_hostname:ro -e SYSLOG_STRUCTURED_DATA="<Loggly Customer Token>@41058 tag=\"<custom tag>\"" gliderlabs/logspout syslog+tcp://logs-01.loggly.com:514

You can also define a Logspout service using Compose.

# docker-compose-logspout.yml
version: "3"
networks:
  logging:
services:
  logspout:
    image: gliderlabs/logspout
    networks:
      - logging
    volumes:
      - /etc/hostname:/etc/host_hostname:ro
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      SYSLOG_STRUCTURED_DATA: "<Loggly Customer Token>@41058"
      tag: "<custom tag>"
    command: syslog+tcp://logs-01.loggly.com:514
    deploy:
      mode: global

Use docker stack deploy to deploy the Compose file to your swarm. <stack name> is the name that you want to give to the deployment.

# docker stack deploy --compose-file docker-compose-logspout.yml <stack name>

As soon as the deployment is complete, messages generated by your containers start appearing in Loggly.

Configuring Dashboards and Alerts

Since Swarm automatically appends information about the host, service, and replica to each log message, we can create Dashboards and Alerts similar to those for a single-node Docker deployment. For example, Loggly automatically breaks down logs from the Nginx service into individual fields.


We can create Dashboards that show, for example, the number of errors generated on each node, as well as the container activity level on each node.


Alerts are useful for detecting changes in the status of a service. If you want to detect a sudden increase in errors, you can easily create a search that scans messages from a specific service for error-level logs.


You can select this search from the Alerts screen and specify a threshold. For example, this alert triggers if the Nginx service logs more than 10 errors over a 5-minute period.


Conclusion

While Swarm can add a layer of complexity over a typical Docker installation, logging it doesn’t have to be difficult. Tools like Logspout and Docker logging drivers have made it easier to collect and manage container logs no matter where those containers are running. And with Loggly, you can easily deploy a complete, cluster-wide logging solution across your entire environment.


Papertrail for Python Logs


When you’re troubleshooting a problem or tracking down a bug in Python, the first place to look for clues related to server issues is in the application log files.

 

Python includes a robust logging module in the standard library, which provides a flexible framework for emitting log messages. This module is widely used by various Python libraries and is an important reference point for most programmers when it comes to logging.

 

The Python logging module provides a way for applications to configure different log handlers and provides a standard way to route log messages to these handlers. As the Python.org documentation notes, there are four basic classes defined by the Python logging module: Loggers, Handlers, Filters, and Formatters. We’ll provide more details on these below.

 

Getting Started with Python Logs

There are a number of important steps to take when setting up your logs. First, you need to ensure logging is enabled in the applications you use. You also need to categorize your logs by name so they are easy to maintain and search. Naming the logs makes it easier to search through large log files, and to use filters to find the information you need.

 

To send log messages in Python, request a logger object. It should have a unique name to help filter and prioritize how your Python application handles various messages. We are also adding a StreamHandler to print the log on our console output. Here’s a simple example:

import logging

logging.basicConfig(handlers=[logging.StreamHandler()])
log = logging.getLogger('test')
log.error('Hello, world')

This outputs:

ERROR:test:Hello, world

This message consists of three fields. The first, ERROR, is the log level. The second, test, is the logger name. The third field, “Hello, world”, is the free-form log message.

 

Most problems in production are caused by unexpected or unhandled issues. In Python, such problems generate tracebacks where the interpreter tries to include all important information it could gather. This can sometimes make the traceback a bit hard to read, though. Let’s look at an example traceback. We’ll call a function that isn’t defined and examine the error message.

def test():
    nofunction()

test()

Which outputs:

Traceback (most recent call last):
  File '<stdin>', line 1, in <module>
  File '<stdin>', line 2, in test
NameError: global name 'nofunction' is not defined

This shows the common parts of a Python traceback. The error message is usually at the end of the traceback. It says “nofunction is not defined,” which is what we expected. The traceback also includes the lines of all stack frames that were touched when this error occurred. Here we can see that it occurred in the test function on line two. Stdin means standard input and refers to the console where we typed this function. If we were using a Python source file, we’d see the file name here instead.

 

Configuring Logging

You should configure the logging module to direct messages to go where you want them. For most applications, you will want to add a Formatter and a Handler to the root logger. Formatters let you specify the fields and timestamps in your logs. Handlers let you define where they are sent. To set these up, Python provides a nifty factory function called basicConfig.

import logging

logging.basicConfig(format='%(asctime)s %(message)s',
                    handlers=[logging.StreamHandler()])
logging.debug('Hello World!')

By default, Python will output uncaught exceptions to your system’s standard error stream. Alternatively, you could add a handler to the excepthook to send any exceptions through the logging module. This gives you the flexibility to provide custom formatters and handlers. For example, here we log our exceptions to a log file using the FileHandler:

import logging
import sys

logger = logging.getLogger('test')
fileHandler = logging.FileHandler('errors.log')
logger.addHandler(fileHandler)

def my_handler(type, value, tb):
    logger.exception('Uncaught exception: {0}'.format(str(value)))

# Install exception handler
sys.excepthook = my_handler

# Throw an error
nofunction()

Which results in the following log output:

$ cat errors.log
Uncaught exception: name 'nofunction' is not defined
None

In addition, you can filter logs by configuring the log level. One way to set the log level is through an environment variable, which gives you the ability to customize the log level in the development or production environment. Here’s how you can use the LOGLEVEL environment variable:

$ export LOGLEVEL='ERROR'
$ python
>>> import logging
>>> logging.basicConfig(handlers=[logging.StreamHandler()])
>>> logging.debug('Hello World!')  # prints nothing
>>> logging.error('Hello World!')
ERROR:root:Hello World!
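
The session above sets LOGLEVEL but configures logging by hand. To actually honor the variable in your code, read it from the environment when configuring logging. A minimal sketch:

import logging
import os

# Fall back to INFO when LOGLEVEL is not set in the environment
level = os.environ.get('LOGLEVEL', 'INFO').upper()
logging.basicConfig(level=level, handlers=[logging.StreamHandler()])
logging.error('Hello World!')  # emitted whenever LOGLEVEL is ERROR or more verbose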

Logging from Modules

Modules intended for use by other programs should only emit log messages. These modules should not configure how log messages are handled. A standard logging best practice is to let the Python application importing and using the modules handle the configuration.

Another standard best practice to follow is that each module should use a logger named like the module itself. This naming convention makes it easy for the application to distinctly route various modules and helps keep the log code in the module simple.

You need just two lines of code to set up logging using the named logger. Once you do this in Python, __name__ contains the full name of the current module, so this will work in any module. Here’s an example:

import logging

log = logging.getLogger(__name__)

def do_something():
    log.debug('Doing something!')
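
On the application side, two more lines decide how every module’s messages are handled. A minimal sketch, where mymodule is a hypothetical file containing the code above:

# app.py: the importing application decides handlers, format, and level
import logging
import mymodule  # hypothetical module defining do_something()

logging.basicConfig(level=logging.DEBUG,
                    format='%(name)s %(levelname)s %(message)s')
mymodule.do_something()  # prints: mymodule DEBUG Doing something!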

Analyzing Your Logs with Papertrail

Python applications on a production server contain millions of lines of log entries. Command line tools like tail and grep are often useful during the development process. However, they may not scale well when analyzing millions of log events spread across multiple servers.

 

Centralized logging can make it easier and faster for developers to manage a large volume of logs. By consolidating log files onto one integrated platform, you can eliminate the need to search for related data that is split across multiple apps, directories, and servers. Also, a log management tool can alert you to critical issues, helping you more quickly identify the root cause of unexpected errors, as well as bugs that may have been missed earlier in the development cycle.

 

For production-scale logging, a log management tool such as SolarWinds® Papertrail™ can help you better manage your data. Papertrail is a cloud-based platform designed to handle logs from any Python application, including Django servers.

 

The Papertrail solution provides a central repository for event logs. It helps you consolidate all of your Python logs using syslog, along with other application and database logs, giving you easy access all in one location. It offers a convenient search interface to find relevant logs. It can also stream logs to your browser in real time, offering a “live tail” experience. Check out the tour of the Papertrail solution’s features.


Papertrail is designed to help minimize downtime. You can receive alerts via email, or send them to Slack, Librato, PagerDuty, or any custom HTTP webhooks of your choice. Alerts are also accessible from a web page that enables customized filtering. For example, you can filter by name or tag.


 

Configuring Papertrail in Your Application

There are many ways to send logs to Papertrail depending on the needs of your application. You can send logs through journald, log files, Django, Heroku, and more. We will review the syslog handler below.

 

Python can send log messages directly to Papertrail with the Python SysLogHandler. Just set the endpoint to the log destination shown in your Papertrail settings. You can optionally format the timestamp or set the log level as shown below.

import logging
import socket
import sys
from logging.handlers import SysLogHandler

syslog = SysLogHandler(address=('logsN.papertrailapp.com', XXXXX))
format = '%(asctime)s YOUR_APP: %(message)s'
formatter = logging.Formatter(format, datefmt='%b %d %H:%M:%S')
syslog.setFormatter(formatter)

logger = logging.getLogger()
logger.addHandler(syslog)
logger.setLevel(logging.INFO)

def my_handler(type, value, tb):
    logger.exception('Uncaught exception: {0}'.format(str(value)))

# Install exception handler
sys.excepthook = my_handler

logger.info('This is a message')
nofunction()  # log an uncaught exception

Conclusion

Python offers a well-thought-out framework for logging that makes it simple to enable and manage your log files. Getting started is easy, and a number of tools baked into Python automate the logging process and help ensure ease of use.

Papertrail adds even more functionality and tools for diagnostics and analysis, enabling you to manage your Python logs on a centralized cloud server. Quick to set up and easy to use, Papertrail consolidates your logs on a safe and accessible platform. It simplifies your ability to search log files, analyze them, and then act on them in real time—so that you can focus on debugging and optimizing your applications.

Learn more about how Papertrail can help optimize your development ecosystem.

Have you ever wondered what happens when you type an address into your browser? The first step is the translation of a domain name (such as pingdom.com) to an IP address. Resolving domain names is done through a series of systems and protocols that make up the Domain Name System (DNS). Here we’ll break down what DNS is, and how it powers the underlying infrastructure of the internet.

 

What is DNS?

Traffic across the internet is routed by an identifier called an IP address. You may have seen IP addresses before. IPv4 addresses are a series of four numbers under 256, separated by periods (for example: 123.45.67.89).

 

IP addresses are at the core of communicating between devices on the internet, but they are hard to memorize and can change often, even for the same service. To get around these problems, we give names to IP addresses. For example, when you type https://www.pingdom.com into your web browser, it translates that name into an IP address, which your computer then uses to access a server that ultimately responds with the contents of the page that your browser displays. If a new server is put into place with a new IP address, that name can simply be updated to point to the new address.

 

These records are stored in the name server for a given name, or “zone,” in DNS parlance. These zones can include many different records and record types for the base name and subdomains in that zone.

 

The internet is decentralized, designed to withstand failure, and not rely on a single source of truth. DNS is built for this environment using recursion, which enables DNS servers to talk to each other to find the answer for a request. Each server is more authoritative than the last, until it reaches one of 13 “root” servers that are globally maintained as the definitive source for other DNS servers.

 

Anatomy of a DNS Request

When you type in “pingdom.com” to your browser and hit enter, your browser doesn’t directly ask the web servers for that page. First, a multi-step interaction with DNS servers must happen to translate pingdom.com into an IP address that is useable for establishing a connection and routing traffic. Here’s what that interaction looks like:

  1. The recursive DNS server requests pingdom.com from a DNS root server. The root server replies with the .com TLD name server’s IP address.
  2. The recursive DNS server requests pingdom.com from the .com TLD name server. The TLD name server replies with the authoritative name server for pingdom.com.
  3. The recursive DNS server requests pingdom.com from the pingdom.com nameserver. The nameserver replies with the IP address from the A record for pingdom.com, and this IP address is returned to the client.
  4. The client requests pingdom.com using the web server’s IP address that was just resolved (see the sketch after this list).
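
In code, all of these steps hide behind a single resolver call. This minimal Python sketch asks the operating system’s resolver, which performs or delegates the recursion described above, for the addresses behind a name:

import socket

# The OS resolver walks the chain described above (with caching along the way)
for family, socktype, proto, canonname, sockaddr in socket.getaddrinfo(
        'pingdom.com', 443, proto=socket.IPPROTO_TCP):
    print(sockaddr[0])  # a resolved IP address for pingdom.com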

 

In subsequent requests, the recursive name server will have the IP address for pingdom.com.

This IP address is cached for a period of time determined by the pingdom.com nameserver. This value is called the time-to-live (TTL) for that domain record. A high TTL for a domain record means that local DNS resolvers will cache responses for longer and give quicker responses. However, making changes to DNS records can take longer due to the need to wait for all cached records to expire. Conversely, domain records with low TTLs can change much more quickly, but DNS resolvers will need to refresh their records more often.

 

Not Just for the Web

The DNS protocol is for anything that requires a decentralized name, not just the web. To differentiate between various types of servers registered with a nameserver, we use record types. For example, email servers are part of DNS. If a domain name has an MX record, it is signaling that the address associated with that record is an email server.

 

Some of the more common record types you will see are:

  • A Record – used to point names directly at IPv4 addresses. This is used by web browsers.
  • AAAA Record – used to point names directly at IPV6 addresses. This is used by web browsers when a device has an IPv6 network.
  • CNAME Record – also known as the Canonical Name record and is used to point web domains at other DNS names. This is common when using platforms as a service such as Heroku or cloud load balancers that provide an external domain name rather than an IP address.
  • MX Record – as mentioned before, MX records are used to point a domain to mail servers.
  • TXT Record – arbitrary information attached to a domain name. This can be used to attach validation or other information about a domain name as part of the DNS system. Each domain or subdomain can have one record per type, with the exception of TXT records.
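
For illustration, here’s how you might query a specific record type and inspect its TTL in Python. This is a minimal sketch using the third-party dnspython package (pip install dnspython); the domain is a placeholder:

import dns.resolver  # third-party dnspython package

# Fetch the MX (mail server) records for a domain
answers = dns.resolver.resolve('example.com', 'MX')
print('TTL:', answers.rrset.ttl)  # how long resolvers may cache this answer
for record in answers:
    print(record.preference, record.exchange)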

 

DNS Security and Privacy

There are many parts to resolving a DNS request, and these parts are subject to security and privacy issues. First, how do we verify that the IP address we requested is actually the one on file with the domain’s root nameserver? Attacks exist that can disrupt the DNS chain, providing false information back to the client or triggering denial of service attacks upon sites. Untrusted network environments are vulnerable to man-in-the-middle attacks that can hijack DNS requests and provide back false results.

 

There is ongoing work to enhance the security of DNS with the Domain Name System Security Extensions (DNSSEC). This is a combination of new records, public-key cryptography, and establishing a chain of trust with DNS providers to ensure domain records have not been tampered with. Some DNS providers today offer the ability to enable DNSSEC, and its adoption is growing as DNS-based attacks become more prevalent.

 

DNS requests are also typically unencrypted, which allows attackers and observers to pry into the contents of a DNS request. This information is valuable, and your ISP or recursive zone provider may be providing this information to third parties or using it to track your activity. Furthermore, it may or may not contain personally identifiable information like your IP address, which can be correlated with other tracking information that third parties may be holding.

 

There are a few ways to help protect your privacy with DNS and prevent this sort of tracking:

 

1. Use a Trusted Recursive Resolver

Using a trusted recursive resolver is the first step to ensuring the privacy of your DNS requests. For example, the Cloudflare DNS service at https://1.1.1.1 is a fast, privacy-centric DNS resolver. Cloudflare doesn’t log IP addresses or track the requests you make against it at any time.

 

2. Use DNS over HTTPS (DoH)

DoH is another way of enhancing your privacy and security when interacting with DNS resolvers. Even when using a trusted recursive resolver, man-in-the-middle attacks can alter the returned contents back to the requesting client. DNSSEC offers a way to fix this, but adoption is still early, and relies on DNS providers to enable this feature.

 

DoH secures this at the client to DNS resolver level, enabling secure communication between the client and the resolver. The Cloudflare DNS service offers DNS over HTTPS, further enhancing the security model that their recursive resolver provides. Keep in mind that the domain you’re browsing is still available to ISPs thanks to Server Name Indication, but the actual contents, path, and other parts of the request are encrypted.
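
As an illustration, Cloudflare’s resolver also exposes a JSON API over HTTPS. This minimal Python sketch, using the third-party requests library and a placeholder domain, resolves a name without emitting a traditional plaintext DNS query:

import requests

# Query Cloudflare's DNS-over-HTTPS endpoint instead of plain UDP port 53
resp = requests.get(
    'https://cloudflare-dns.com/dns-query',
    params={'name': 'example.com', 'type': 'A'},
    headers={'accept': 'application/dns-json'},
)
for answer in resp.json().get('Answer', []):
    print(answer['data'])  # a resolved IP address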

 

Even without DoH, you can still have a more private internet experience. Firefox recently switched over to using the Cloudflare DNS resolver for all requests by default. At this time, DoH isn’t enabled by default unless you are using the nightly build.

 

Monitoring DNS Problems

DNS is an important part of your site’s availability because a problem with it can cause a complete outage. DNS has been known to cause outages due to BGP attacks, TLD outages, and other unexpected issues. It’s important that your uptime or health check script includes DNS lookups.

Using SolarWinds® Pingdom®, we can monitor for DNS problems using the uptime monitoring tool. Here we will change the DNS record for a domain and show you how the Pingdom tool responds. Once you have an uptime check added in Pingdom, click the “Reports” section, and “Uptime” under that section, then go to your domain of interest. Under the “Test Result Log” tab for an individual domain’s uptime report, hover over the failing entry to see why a check failed.

This tells us that for our domain, we have a “Non-recoverable failure in name resolution.” This lets us know to check our DNS records. After we fix the problem, our next check succeeds:

Pingdom gives us a second set of eyes to make sure our site is still up as expected.

 

Curious to learn more about DNS? Check out our post on how to test your DNS configuration. You can also learn more about Pingdom uptime monitoring.

For an infrastructure to be healthy, there must be good monitoring. The team should have a monitoring infrastructure that speeds up and facilitates the verification of problems, following the line of prevention, maintenance, and correction. SolarWinds® AppOptics™ was created with the purpose of helping monitoring teams control infrastructure, including Linux monitoring.

 

Monitoring Overview

It is critical that a technology team prepare for any situation that occurs in their environment. The purpose of monitoring is to be aware of changes in the environment so problems can be solved with immediate action. A good monitoring history and proper attention to it can also let you suggest environmental improvements based on the charts. For example, if a server shows elevated memory usage over a period of time, you can purchase more memory, or investigate the cause of the abnormal behavior before the environment becomes unavailable.

 

Monitoring indexes can be used for various purposes, such as application availability for a given number of users, tool deployment tracking, operating system update behavior, purchase requests, and exchanges or hardware upgrades. Each point of use depends on your deployment purpose.

 

Linux servers have historically been difficult to monitor because most of the tools on the market serve other platforms. In addition, some IT professionals cannot get monitoring working properly on these servers, so when a disaster occurs, it is difficult to identify what happened.

 

Constant monitoring of servers and services used in production is critical for company environments. Server failures in virtualization, backup, firewalls, and proxies can directly affect availability and quality of service.

 

The Linux operating system offers basic monitoring facilities for more experienced administrators, but when it comes to monitoring, real-time reports are needed for immediate action. You cannot count on an experienced system administrator always being available to access the servers, or on one person covering every monitoring task by hand.

 

In the current job market, it is important to remember that Linux specialists are rare, and their availability is limited. There are cases where an expert administrator can only act on a server when the problem has been long-standing. Training teams to become Linux experts can be expensive and time-consuming, with potentially low returns.

 

Metrics used for monitoring

  1. CPU – It is crucial to monitor CPU, as it can reach a high utilization rate and temperature. A CPU can have multiple cores, but an application may be directed to only one of those cores, pushing that single core to dangerous utilization levels.

  2. Load – This indicates how heavily the CPU is being used: how much work is executing, how much is waiting to run, and how that demand has evolved over time.

  3. Disk Capacity and IO – Disk capacity is especially important when it comes to image servers, files, and VMs, as it can directly affect system shutdown, corrupt the operating system, or cause extreme IO slowness. Along with disk monitoring, it’s possible to plan for an eventual change or addition of a disk, and to verify the behavior of a disk that demonstrates signs of hardware failure.

  4. Network – When it comes to DNS, DHCP, firewall, file server, and proxy, it is extremely important to monitor network performance as input and output of data packets. With network performance logs, you can measure the utilization of the card, and create a plan to suit the application according to the use of the network.

  5. Memory – Monitoring memory alongside the other components helps you catch conditions that can bring a system to an immediate stop, such as memory exhaustion or a single application consuming most of the available memory.

  6. Swap – This is virtual memory created by the system and allocated to disk to be used when necessary. Its high utilization can indicate that the amount of memory for the server is insufficient.

With this information from Linux systems, you can have good monitoring and a team that can act immediately on downtime that can paralyze critical systems.
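
To show how these metrics map to code, here’s a minimal Python sketch that reads each one using the third-party psutil package (pip install psutil); the thresholds worth alerting on will depend on your environment:

import os
import psutil  # third-party cross-platform system metrics library

print('CPU percent:', psutil.cpu_percent(interval=1))           # 1. CPU
print('Load averages:', os.getloadavg())                        # 2. load over 1/5/15 min
print('Disk used percent:', psutil.disk_usage('/').percent)     # 3. disk capacity
net = psutil.net_io_counters()
print('Net bytes sent/recv:', net.bytes_sent, net.bytes_recv)   # 4. network
print('Memory used percent:', psutil.virtual_memory().percent)  # 5. memory
print('Swap used percent:', psutil.swap_memory().percent)       # 6. swap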

 

 

Monitoring with AppOptics

AppOptics is a web-based monitoring tool that enables you to set up a real-time monitoring environment, create alerts by email, and work with thresholds and monitoring history. You can also create monitoring levels with profiles of the equipment to be monitored, and set up simple monitoring views that can notify a specialist or open a ticket for immediate action when needed.

 

This tool can also be an ally of an ITIL/COBIT team, which can use the reports to justify scheduled and unscheduled stops, and clarify systems that historically have problems. It can also be used to justify the purchase of new equipment, software upgrades, or the migration of a system that no longer meets the needs of a company.

 

AppOptics can be installed in major Linux distributions such as Red Hat, CentOS, Ubuntu, Debian, Fedora, and Amazon Linux. Its deployment is easy, fast, and practical.

 

 

Installing the AppOptics Agent on the Server

Before you start, you’ll need an account with AppOptics. If you don’t already have one, you can create a demo account which will give you 14 days to try the service, free of charge. Sign up here.

 

First, to allow AppOptics to aggregate the metrics from the server, you will need to install the agent on all instances. To do this, you’ll need to reference your AppOptics API token when setting up the agent. Log in to your AppOptics account and navigate to the Infrastructure page.

 

Locate the Add Host button, and click on it. It should look similar to the image below.

Fig. 2. AppOptics Host Agent Installation

 

You can follow a step-by-step guide on the Integration page, where there are Easy Install and Advanced options for users. I used an Ubuntu image in the AWS Cloud, but this will work on almost any Linux server.

 

Note: Prior to installation of the agent, the bottom of the dialog below will not contain the success message.

 

Copy the command from the first box, and then SSH into the server and run the Easy Install script.

 

Fig. 3. Easy Install Script to Add AppOptics Agent to a Server

 

When the agent installs successfully, you should be presented with the following message in your terminal. The “Confirm successful installation” box on the AppOptics agent screen should look similar to the above, with a white-on-blue check mark. You should also see “Agent connected.”

 

Fig. 4. Installing the AppOptics Agent on your Linux Instance

 

After installation, you can start configuring the dashboard for monitoring on the server. Click on the hostname link in the Infrastructure page, or navigate to the Dashboards page directly, and then select the Host Agent link to view the default dashboard provided by AppOptics.

 

Working with the Host Agent Dashboard

The default Host Agent Dashboard provided by AppOptics offers many of the metrics discussed earlier, related to the performance of the instance itself, and should look similar to the image below.

 

Fig. 6. Default Host Agent Dashboard

 

One common pattern is to create dashboards for each location you want to monitor. Let’s use “Datacenter01” for our example. Head to Dashboards and click the Create a New Dashboard button.

 

You can choose the type of monitoring display (Line, Stacked, or Big Number), and then choose what you want to monitor, such as CPU Percent, Swap, or Load. In addition, within the dashboard, you can select how long to monitor a group of equipment, or set it to be monitored indefinitely.

 

Fig. 8. Custom Dashboard to View Linux System Metrics

 

Metrics – You can select existing metrics and create new composite metrics according to what you want to monitor in the operating system.

 

Alerts – Create alerts for the operating system, including the conditions that trigger an alert and how long to wait before issuing a new one.

 

Integrations – You can add host agent plug-ins to support application monitoring.

 

 

Conclusion

Monitoring your Linux servers is critical as they represent the basis of your infrastructure. You need to know immediately when there is a sudden change in CPU or memory usage that could affect the performance of your applications. AppOptics has a range of ready-made tools, customizable monitoring panels, and reports that are critical for investigating chronic infrastructure problems. Learn more about AppOptics infrastructure monitoring and try it today with a free 14-day trial.

The world may run on coffee, but it’s the alarm clock that gets us out of bed. It operates on a simple threshold. You set the time that’s important to you and receive an alert when that variable is true.

 

Like your alarm clock, today’s tooling for web service alerting often operates on simple thresholds, but unlike with your clock, there is a wide variety of metrics and it’s not as clear which should trigger an alert. Until we have something better than thresholds, engineers have to carefully weigh which metrics are actionable, how they are being measured, and what thresholds correspond to real-world problems.

 

Measure the Thing You Care About

 

In practice, this arguably simple process of reasoning about what you are monitoring, and how you are monitoring it, is rarely undertaken. More often, our metric choices and threshold values are guided by our preexisting tools. Hence, if our tools cannot measure latency, we do not alert on latency. This practice of letting our tools guide our telemetry content is an anti-pattern which results in unreliable problem detection and alerting.

 

Effective alerting requires metrics that are reliably actionable. You have to start by reasoning about the application and/or infrastructure you want to monitor. Only then can you choose and implement collection tools that get you the metrics you’re actually interested in, like queue size, DB roundtrip times, and inter-service latency.

 

One Reliable Signal

 

Effective alerting requires a singular, reliable telemetry signal, to which every collector can contribute. Developing and ensuring a reliable signal can be difficult, but it is orders of magnitude simpler than building out multiple disparate monitoring systems and trying to make them agree with each other, as in the many shops that, for example, alert from one system like Nagios and troubleshoot from another like Ganglia.

 

It’s arguably impossible to make multiple, fallible systems agree with each other in every case. They may usually agree, but every false positive or false negative undermines the credibility of both systems. Further, multiple systems rarely improve because it’s usually impossible to know which system was at fault when they disagree. Did the alerting system send a bogus alert or is there a problem with the data in the visualization system? If false positives arise from a single telemetry system, you simply iterate and improve that system.

 

Alert Recipient == Alert Creator

 

Crafting effective alerts involves knowing how your systems work. Each alert should trigger in the mind of its recipient an actionable cognitive model that describes how the production environment is being threatened. How does the individual piece of infrastructure that fired this alert affect the application? Why is this alert a problem?

 

Only engineers who understand the systems and applications we care about have the requisite knowledge to craft alerts that describe actionable threats to those systems and applications. Therefore effective alerting requires that the recipients of alerts be able to craft those alerts.

 

Push Notifications as Last Resort

 

Emergencies force context switches. They interrupt workflow and destroy productivity. Many alerts are necessary, but very few of them should be considered emergencies. At AppOptics, most of our alerts are delivered to group chat. We find this is a timely notification medium that doesn't interrupt productivity. Further, group chat allows everyone to react to an alert together, in a group context, rather than individually from an email inbox or pager. This helps us avoid redundant troubleshooting effort and keeps everyone synchronized on problem resolution.

 

Effective alerting requires an escalation system that can communicate problems in a way that is not interrupt-driven. There are myriad examples in other industries like healthcare and security systems, where, when every alert is interrupt-driven, human beings quickly begin to ignore the alerts. Push notifications should be a last resort.

 

Alerting is Hard

 

Effective alerting is a deceptively hard problem, which represents one of the biggest challenges facing modern operations engineers. A careful balance needs to be struck between the needs of the systems and the needs of the humans tending those systems.

How do PHP logging frameworks fare when pushed to their limits? This analysis can help us decide which option is best for our PHP applications. Performance, speed, and reliability are important for logging frameworks because we want the best performance out of our application and to minimize loss of data.

Our goals for these benchmark tests are to measure the time different frameworks need to process a large number of log messages across various logging handlers, and to determine which frameworks are more reliable at their limits (dropping few or no messages).

The frameworks we tried are:

  • native PHP logging (error_log and syslog built-in functions)
  • KLogger
  • Apache Log4php
  • Monolog

All of these frameworks use synchronous or “blocking” calls, as PHP functions typically do. The web server execution waits until the function/method call finishes in order to continue. As for the handlers: error_log, KLogger, Log4php, and Monolog can write log messages to a text file, while error_log/syslog, Log4php, and Monolog can send messages to the local system logger. Finally, only Log4php and Monolog allow remote syslog connections.

NOTE: The term syslog can refer to various things. In this article, it may mean the PHP function of the same name, the local system logger daemon (e.g., syslogd), or a remote syslog server (e.g., rsyslog).

Application and Handlers

For this framework benchmark, we built a PHP CodeIgniter 3 web app with a controller for each logging mechanism. Controller methods echo the microtime difference before and after logging, which is useful for manual tests. Each controller method call has a loop that writes 10,000 INFO log messages in the case of file handlers (except error_log, which can only produce E_ERROR messages), or 100,000 INFO messages to syslog. This lets us stress the logging system without over-burdening the web server request handler.
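
As an illustration of the pattern, a controller method for the error_log case might look like the hypothetical sketch below; the class and message text are ours, and the real code lives in the repository linked in the note that follows.

<?php
// Hypothetical sketch of a benchmark controller (CodeIgniter 3): time a burst
// of log writes and echo the elapsed time, as described above.
class Native extends CI_Controller
{
    public function error_log()
    {
        $start = microtime(true);
        for ($i = 0; $i < 10000; $i++) {
            error_log("benchmark message $i");   // destination set by error_log in php.ini
        }
        echo microtime(true) - $start;           // seconds elapsed for 10,000 messages
    }
}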

NOTE: You may see the full app source code at https://github.com/jorgeorpinel/php-logging-benchmark

 

For the local handlers, first we tested writing to local files and kept track of the number of written logs in each test. We then tested the local system logger handler (which uses the /dev/log UNIX socket by default) and counted the number of logs syslogd wrote to /var/log/syslog.

As for the “remote” syslog server, we set up rsyslog on the system and configured it to accept both TCP and UDP logs, writing them to /var/log/messages. We recorded the number of logs there to determine whether any of them were dropped.

Fig. 1 System Architecture – Each arrow represents a benchmark test.

Methodology

We ran the application locally on Ubuntu with Apache (and mod-php). First, each Controller/method was “warmed up” by requesting that URL with curl, which ensures the PHP source is already precompiled when we run the actual framework benchmark tests. Then we used ApacheBench to stress test the local web app with 100 or 10 serial requests (file or syslog, respectively). For example:

ab -ln 100 localhost:8080/Native/error_log

ab -ln 10 localhost:8080/Monolog/syslog_udp

The total number of log calls in each test was 1,000,000 (each method). We gathered performance statistics from the tool’s report for each Controller/method (refer to figure 1).

Please note that in normal operation, actual drop rates should be much smaller, if there are any at all.

Hardware and OS

We ran both the sample app and the tests on an AWS EC2 micro instance: a 64-bit Ubuntu 16.04 Linux box with an Intel(R) Xeon(R) CPU @ 2.40GHz, 1 GiB of memory, and an 8 GB SSD.

Native tests

The “native” controller uses a couple of PHP built-in error handling functions. It has two methods: one that calls error_log, which is configured in php.ini to write to a file, and one that calls syslog to reach the system logger. Both functions are used with their default parameters.
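
For reference, the two native calls look roughly like this (both with default parameters, as in our tests):

<?php
// error_log: writes to the destination configured by the error_log directive in php.ini.
error_log('Something happened');

// syslog: sends to the local system logger (e.g., syslogd) via the /dev/log socket.
openlog('myapp', LOG_PID, LOG_USER);    // optional; syslog() opens a connection implicitly
syslog(LOG_INFO, 'Something happened');
closelog();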

error_log to file

By definition, no log messages can be lost by this method as long as the web server doesn’t fail. Its performance when writing to file will depend on the underlying file system and storage speed. Our test results:

error_log (native PHP file logger)
  • Requests per sec: 23.55 [#/sec] (mean)
  • Time per request: 42.459 [ms] (mean) ← divide by 10,000 logs written per request ≈ 0.0042 ms per message
NOTE: error_log can also be used to send messages to system log, among other message types.

syslog

Using error_log when error_log = syslog in php.ini, or simply using the syslog function, we can reach the system logger. This is similar to using the logger command in Linux.

syslog (native PHP system logger)
  • Requests per sec: 0.25 [#/sec] (mean)
  • Time per request: 4032.164 [ms] (mean) ← divide by 100,000 logs sent per request ≈ 0.0403 ms per message

This is typically the fastest way to reach the system logger, and syslogd is at least as robust as the web server, so no messages should be dropped (none were in our tests). Another advantage of the system logger is that it can be configured both to write to a file and to forward logs over the network.

KLogger test

KLogger is a “simple logging class for PHP” whose first stable release appeared in 2014. It can only write logs to file, but its simplicity helps its performance. KLogger is PSR-3 compliant: it implements the LoggerInterface.
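
Basic usage looks like the sketch below, assuming KLogger was installed via Composer (the directory path is illustrative):

<?php
require 'vendor/autoload.php';

use Katzgrau\KLogger\Logger;
use Psr\Log\LogLevel;

// Writes date-stamped log files (e.g., log_2018-08-09.txt) into the given directory.
$logger = new Logger(__DIR__ . '/logs', LogLevel::INFO);
$logger->info('User logged in', ['user_id' => 42]);   // PSR-3 message plus context array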

KLogger (simple PHP logging class)
  • Requests per sec: 14.11 [#/sec] (mean)
  • Time per request: 70.848 [ms] (mean) ← divide by 10,000 = 0.0070848 ms per message
NOTE: This GitHub fork of KLogger allows local syslog usage as well. We did not try it.

Log4php tests

Log4php, first released in 2010, is one of the suite of loggers that Apache provides for several popular programming languages. Logging to file, it turns out to be a speedy contender, at least on Apache; running the application on Apache probably helps its performance. In local tests using PHP's built-in server (the php -S command), it was actually the slowest contender!

Log4php (Apache PHP file logger)
  • Requests per sec: 18.70 [#/sec] (mean) × 10k ≈ 187k messages per second
  • Time per request: 53.470 [ms] (mean) ÷ 10k ≈ 0.0053 ms per message

As for sending to syslog, it was actually our least performant option, but not by far:

Log4php to syslog:
  • Local syslog socket: 0.08 ms per log, 0% dropped
  • Syslog over TCP/IP: around 24 ms per log, 0% dropped
  • Syslog over UDP/IP: 0.07 ms per log, 0.15% dropped

Some of the advantages Log4php has, which may offset its lack of performance, are Java-like XML configuration files (same as other Apache loggers, such as the popular log4j), six logging destinations, and three message formats.

NOTE: Remote syslog over TCP, however, doesn't seem to be well supported at this time. We had to use the general-purpose LoggerAppenderSocket, which was really slow, so we only ran 100,000 messages.

Monolog tests

Monolog, like KLogger, is PSR-3 compliant; and, like Log4php, it's a full logging framework that can send logs to files, sockets, email, databases, and various web services. It was first released in 2011.

Monolog features many integrations with popular PHP frameworks, making it a popular alternative. Monolog beat its competitor Log4php in our tests, but it is still neither the fastest nor the most reliable of the options, although it's probably one of the easiest for web developers.
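
Here is a sketch of the three handler setups we benchmarked; the file path, port, and channel name are illustrative:

<?php
require 'vendor/autoload.php';

use Monolog\Logger;
use Monolog\Handler\StreamHandler;
use Monolog\Handler\SyslogHandler;
use Monolog\Handler\SyslogUdpHandler;

$log = new Logger('benchmark');
$log->pushHandler(new StreamHandler('/var/log/app.log', Logger::INFO)); // file handler
// $log->pushHandler(new SyslogHandler('benchmark'));                   // local system logger
// $log->pushHandler(new SyslogUdpHandler('127.0.0.1', 514));           // remote syslog over UDP

$log->info('INFO benchmark message');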

Monolog (full PHP logging framework)
  • Requests per sec: 4.93 [#/sec] (mean) × 10k messages
  • Time per request: 202.742 [ms] (mean) ÷ 10k ≈ 0.0203 ms per message

Monolog over Syslog:

Monolog over syslog:
  • UNIX socket: 0.062 ms per log, less than 0.01% dropped
  • TCP: 0.06 ms per log, 0.29% dropped
  • UDP: 0.079 ms per log, 0% dropped

Now let's take a look at graphs that summarize and compare the results above. These charts show the tradeoff between faster, more limited, lower-level native logging methods and less performant but full-featured frameworks:

Local File Performance Comparison

Fig 2. Time per message written to file [ms/msg]

Local Syslog Performance and Drop Rates

Log handler or “appender” names vary from framework to framework. For native PHP, we just use the syslog function (KLogger doesn't support syslog); in Log4php, it's a class called LoggerAppenderSyslog; and in Monolog it's called SyslogHandler.

Fig 3. Time per message sent to syslogd via socket [ms/msg]

Fig 4. Drop rates to syslogd via socket [%]

 

Remote Syslog Performance and Drop Rates

The appenders are LoggerAppenderSocket in Log4php, SocketHandler and SyslogUdpHandler for Monolog.

To measure the drop rates, we leveraged the $RepeatedMsgReduction config param of rsyslog, which collapses identical messages into a single one and a second message with the count of further repetitions. In the case of Log4php, since the default message includes a timestamp that varies in every single log, we forwarded the logs to SolarWinds® Loggly® (syslog setup in seconds) and used a filtered, interactive log monitoring dashboard to count the total logs received.

TCP

Fig 5. Time per message sent via TCP to rsyslog

Fig 6. Drop rates to rsyslog (TCP) [%]

UDP

Fig 7. Time per message sent on UDP to rsyslog
Fig 8. Drop rates to rsyslog (UDP)

Conclusion

Each logging framework is different, and while each could be the best fit for specific projects, our recommendations are as follows. Nothing beats the performance of native syslog for system admins who know their way around syslogd or syslog-ng daemons, or who forward logs to a cloud service such as Loggly. If what's needed is a simple yet powerful way to log locally to files, KLogger offers PSR-3 compliance and is almost as fast as native error_log, although Log4php does seem to edge it out when the app runs on Apache. For a more complete framework, Monolog seems to be the most well-rounded option, particularly for remote logging via TCP/IP.

 

After deciding on a logging framework, your next big decision is choosing a log management solution. Loggly provides unified log analysis and monitoring for all your servers in a single place. You can configure your PHP servers to forward syslog to Loggly or simply use Monolog’s LogglyHandler, which is easy to set up in your app’s code. Try Loggly for free and take control over your PHP application logs.

What are some common problems that can be detected with the handy router logs on Heroku? We'll explore them and show you how to address them quickly and easily with Heroku monitoring from SolarWinds Papertrail.

 

One of the first cloud platforms, Heroku is a popular platform as a service (PaaS) that has been in development since June 2007. It allows developers and DevOps specialists to easily deploy, run, manage, and scale applications written in Ruby, Node.js, Java, Python, Clojure, Scala, Go, and PHP.

 

To learn more about Heroku, head to the Heroku Architecture documentation.

 

Intro to Heroku Logs

Logging in Heroku is modular, similar to gathering system performance metrics. Logs are time-stamped events that can come from any of the processes running in all application containers (Dynos), system components, or backing services. Log streams are aggregated and fed into the Logplex—a high-performance, real-time system for log delivery into a single channel.

 

Run-time activity, as well as dyno restarts and relocations, can be seen in the application logs, which include output from application code deployed on Heroku, from services like the web server or the database, and from the app's libraries. Scaling, load, and memory usage metrics, among other structural events, can be monitored with system logs, which collect messages about actions taken by the Heroku platform infrastructure on behalf of your app. These are two of the most common types of logs available on Heroku.

 

To fetch logs from the command line, we can use the heroku logs command. More details on this command, such as output format, filtering, or ordering logs, can be found in the Logging article of Heroku Devcenter.

$ heroku logs 2019-09-16T15:13:46.677020+00:00 app[web.1]: Processing PostController#list (for 208.39.138.12 at 2010-09-16 15:13:46) [GET] 2018-09-16T15:13:46.677902+00:00 app[web.1]: Rendering post/list 2018-09-16T15:13:46.698234+00:00 app[web.1]: Completed in 74ms (View: 31, DB: 40) | 200 OK [http://myapp.heroku.com/] 2018-09-16T15:13:46.723498+00:00 heroku[router]: at=info method=GET path='/posts' host=myapp.herokuapp.com' fwd='204.204.204.204' dyno=web.1 connect=1ms service=18ms status=200 bytes=975   # © 2018 Salesforce.com. All rights reserved.

Heroku Router Logs

Router logs are a special case of logs that exist somewhere between the app logs and the system logs—and are not fully documented on the Heroku website at the time of writing. They carry information about HTTP routing within Heroku Common Runtime, which manages dynos isolated in a single multi-tenant network. Dynos in this network can only receive connections from the routing layer. These routes are the entry and exit points of all web apps or services running on Heroku dynos.

 

Tail router only logs with the heroku logs -tp router CLI command.

$ heroku logs -tp router 2018-08-09T06:24:04.621068+00:00 heroku[router]: at=info method=GET path='/db' host=quiet-caverns-75347.herokuapp.com request_id=661528e0-621c-4b3e-8eef-74ca7b6c1713 fwd='104.163.156.140' dyno=web.1 connect=0ms service=17ms status=301 bytes=462 protocol=https 2018-08-09T06:24:04.902528+00:00 heroku[router]: at=info method=GET path='/db/' host=quiet-caverns-75347.herokuapp.com request_id=298914ca-d274-499b-98ed-e5db229899a8 fwd='104.163.156.140' dyno=web.1 connect=1ms service=211ms status=200 bytes=3196 protocol=https 2018-08-09T06:24:05.002308+00:00 heroku[router]: at=info method=GET path='/stylesheets/main.css' host=quiet-caverns-75347.herokuapp.com request_id=43fac3bb-12ea-4dee-b0b0-2344b58f00cf fwd='104.163.156.140' dyno=web.1 connect=0ms service=3ms status=304 bytes=128 protocol=https 2018-08-09T08:37:32.444929+00:00 heroku[router]: at=info method=GET path='/' host=quiet-caverns-75347.herokuapp.com request_id=2bd88856-8448-46eb-a5a8-cb42d73f53e4 fwd='104.163.156.140' dyno=web.1 connect=0ms service=127ms status=200 bytes=7010 protocol=https   # Fig 1. Heroku router logs in the terminal

Heroku routing logs always start with a timestamp and the “heroku[router]” source/component string, followed by a specially formatted message. This message begins with “at=info”, “at=warning”, or “at=error” (the log level), and can contain up to 14 other detailed fields (a small parsing sketch follows this list), such as:

  • Heroku error “code” (optional) – Heroku-specific error codes that complement the HTTP status codes; present for all errors and warnings, and for some info messages
  • Error “desc” (optional) – Description of the error, paired with the codes above
  • HTTP request “method”, e.g., GET or POST – May be related to some issues
  • HTTP request “path” – URL location for the request; useful for knowing where to look in the application code
  • HTTP request “host” – Host header value
  • The Heroku HTTP Request ID – Can be used to correlate router logs to application logs
  • HTTP request “fwd” – X-Forwarded-For header value
  • Which “dyno” serviced the request – Useful for troubleshooting specific containers
  • “connect” – Time (ms) spent establishing a connection to the web server(s)
  • “service” – Time (ms) spent proxying data between the client and the web server(s)
  • HTTP response code or “status” – Quite informative in case of issues
  • Number of “bytes” transferred in total for this web request
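
To make the format concrete, here's a small PHP sketch that splits a router log line into its key=value fields; it assumes double-quoted values, and the sample line is abridged:

<?php
// Parse a Heroku router log line's key=value pairs into an associative array.
function parseRouterLine(string $line): array
{
    // Values may be quoted (fwd="1.2.3.4") or bare (status=200).
    preg_match_all('/(\w+)=("([^"]*)"|\S*)/', $line, $matches, PREG_SET_ORDER);
    $fields = [];
    foreach ($matches as $m) {
        $fields[$m[1]] = isset($m[3]) ? $m[3] : $m[2];
    }
    return $fields;
}

$line = 'at=info method=GET path="/posts" dyno=web.1 connect=1ms service=18ms status=200 bytes=975';
$f = parseRouterLine($line);
echo "{$f['status']} in {$f['service']} from dyno {$f['dyno']}\n";   // 200 in 18ms from dyno web.1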

 

Common Problems Observed with Router Logs

Examples in this article are manually color-coded. Typical ways to address the issues shown below are also provided for context.

 

Common HTTP Status Codes

404 Not Found Error

Problem: Error accessing nonexistent paths (regardless of HTTP method):

2018-07-30T17:10:18.998146+00:00 heroku[router]: at=info method=POST path='/saycow' host=heroku-app-log.herokuapp.com request_id=e5634f81-ec54-4a30-9767-bc22365a2610 fwd='187.220.208.152' dyno=web.1 connect=0ms service=15ms status=404 bytes=32757 protocol=https 2018-07-27T22:09:14.229118+00:00 heroku[router]: at=info method=GET path='/irobots.txt' host=heroku-app-log.herokuapp.com request_id=7a32a28b-a304-4ae3-9b1b-60ff28ac5547 fwd='187.220.208.152' dyno=web.1 connect=0ms service=31ms status=404 bytes=32769 protocol=https

Solution: Implement or change those URL paths in the application or add the missing files.

500 Server Error

Problem: There’s a bug in the application:

2018-07-31T16:56:25.885628+00:00 heroku[router]: at=info method=GET path='/' host=heroku-app-log.herokuapp.com request_id=9fb92021-6c91-4b14-9175-873bead194d9 fwd='187.220.247.218' dyno=web.1 connect=0ms service=3ms status=500 bytes=169 protocol=https

Solution: The application logs have to be examined to determine the cause of the internal error in the application’s code. Note that HTTP Request IDs can be used to correlate router logs against the web dyno logs for that same request.

Common Heroku Error Codes

 

Other problems commonly detected by router logs can be explored in the Heroku Error Codes. Unlike HTTP codes, these error codes are not standard and only exist in the Heroku platform. They give more specific information on what may be producing HTTP errors.

H14 – No web dynos running

Problem: App has no web dynos setup:

2018-07-30T18:34:46.027673+00:00 heroku[router]: at=error code=H14 desc='No web processes running' method=GET path='/' host=heroku-app-log.herokuapp.com request_id=b8aae23b-ff8b-40db-b2be-03464a59cf6a fwd='187.220.208.152' dyno= connect= service= status=503 bytes= protocol=https

Notice that the above case is an actual error message, which includes both Heroku error code H14 and a description. HTTP 503 means “service currently unavailable.”

Note that Heroku router error pages can be customized. These apply only to errors where the app doesn't respond to a request (e.g., 503).

Solution: Use the heroku ps:scale command (e.g., heroku ps:scale web=1) to start the app's web server(s).

 

H12 – Request timeout

Problem: There’s a request timeout (app takes more than 30 seconds to respond):

2018-08-18T07:11:15.487676+00:00 heroku[router]: at=error code=H12 desc='Request timeout' method=GET path='/sleep-30' host=quiet-caverns-75347.herokuapp.com request_id=1a301132-a876-42d4-b6c4-a71f4fe02d05 fwd='189.203.188.236' dyno=web.1 connect=1ms service=30001ms status=503 bytes=0 protocol=https

Error code H12 indicates the app took over 30 seconds to respond to the Heroku router.

Solution: Code that requires more than 30 seconds must run asynchronously (e.g., as a background job) in Heroku. For more info read Request Timeout in the Heroku DevCenter.
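
One common pattern, sketched below with hypothetical names and assuming a Redis add-on with the Predis client, is to enqueue the slow work and respond immediately, leaving the heavy lifting to a worker dyno:

<?php
require 'vendor/autoload.php';

// Enqueue the slow work instead of doing it during the web request.
$redis = new Predis\Client(getenv('REDIS_URL'));                    // Redis connection URL from config
$redis->rpush('jobs:reports', json_encode(['report_id' => 123]));   // hypothetical queue and payload

http_response_code(202);   // "Accepted": a worker dyno picks the job up from the queue
echo json_encode(['status' => 'queued']);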

H18 – Server Request Interrupted

Problem: The Application encountered too many requests (server overload):

2018-07-31T18:52:54.071892+00:00 heroku[router]: sock=backend at=error code=H18 desc='Server Request Interrupted' method=GET path='/' host=heroku-app-log.herokuapp.com request_id=3a38b360-b9e6-4df4-a764-ef7a2ea59420 fwd='187.220.247.218' dyno=web.1 connect=0ms service=3090ms status=503 bytes= protocol=https

Solution: This problem may indicate that the application needs to be scaled up, or the app performance improved.

H80 – Maintenance mode

Problem: Maintenance mode generates an info router log with Heroku code H80:

2018-07-30T19:07:09.539996+00:00 heroku[router]: at=info code=H80 desc='Maintenance mode' method=GET path='/' host=heroku-app-log.herokuapp.com request_id=1b126dca-1192-4e98-a70f-78317f0d6ad0 fwd='187.220.208.152' dyno= connect= service= status=503 bytes= protocol=https

Solution: Disable maintenance mode with heroku maintenance:off.

 

Papertrail

Papertrail™ is a cloud log management service designed to aggregate Heroku app logs, text log files, and syslogs, among many others, in one place. It helps you to monitor, tail, and search logs via a web browser, command-line, or an API. The Papertrail software analyzes log messages to detect trends, and allows you to react instantly with automated alerts.

 

The Event Viewer is a live aggregated log tail with auto-scroll, pause, search, and other unique features. Everything in log messages is searchable, and new logs still stream in real time in the event viewer when searched (or otherwise filtered). Note that Papertrail reformats the timestamp and source in its Event Viewer to make it easier to read.

Fig 2. The Papertrail Event Viewer.

Provisioning Papertrail on your Heroku apps is extremely easy: run heroku addons:create papertrail from the terminal. (See the Papertrail article in Heroku's DevCenter for more info.) Once set up, the add-on can be opened from the Heroku app's dashboard (Resources section) or with heroku addons:open papertrail in the terminal.

 

Troubleshooting Routing Problems Using Papertrail

A great way to examine Heroku router logs is by using the Papertrail solution. It’s easy to isolate them in order to filter out all the noise from multiple log sources: simply click on the “heroku/router” program name in any log message, which will automatically search for “program:heroku/router” in the Event Viewer:

Fig 3. Tail of Heroku router logs in Papertrail, 500 app error selected. © 2018 SolarWinds. All rights reserved.

 

Monitor HTTP 404s

How do you know that your users are finding your content, and that it's up to date? 404 Not Found errors are what a client receives when the URL's path is not found. Examples would be a misspelled file name or a missing app route. We want to make sure these types of errors remain uncommon; otherwise, users are either walking into dead ends or seeing irrelevant content in the app!

 

With Papertrail, setting up an alert to monitor the amount of 404s returned by your app is easy and convenient. One way to do it is to search for “status=404” in the Event Viewer, and then click on the Save Search button. This will bring up the Save Search popup, along with the Save & Setup Alert option:

Fig 4. Save a log search and set up an alert with a single action © 2018 SolarWinds. All rights reserved.

 

The following screen gives us the alert delivery options, such as email, Slack messages, push notifications, or even publishing all matching events as a custom metric for application performance management tools such as AppOptics™.

Troubleshoot 500 errors quickly

Fig 5. HTTP 500 Internal Server Error from herokuapp.com. © 2018 Google LLC. All rights reserved.

 

Let's say an HTTP 500 error is happening on your app after a deploy. A great feature of Papertrail is that it makes the request_id in log messages clickable. Simply click it, or copy and search it in the Event Viewer, to find all the app logs for the request causing the internal problem, along with the detailed error message from your application's code.

 

Conclusion

Heroku router logs are the glue between web traffic and (sometimes intangible) errors in your application code. It makes sense to give them special focus when monitoring a wide range of issues, because they often indicate customer-facing problems that we want to avoid or address ASAP. Add the Papertrail add-on to Heroku to get more powerful ways to monitor router logs.

 

Sign up for a 30-day free trial of Papertrail and start aggregating logs from all your Heroku apps and other sources. You may learn more about the Papertrail advanced features in its Heroku Dev Center article.

Look back at almost any online technology business 10, or even five years ago, and you'd see a clear distinction between what the CTO and CMO did in their daily roles. The former would oversee the building of technology and products, while the latter would drive the marketing that brought in the customers to use said technology. In short, the two took care of two very different sides of the same coin.

 

Marketing departments traditionally measure their success against KPIs such as the number of conversions a campaign brought in versus the cost of running it. Developers measure their performance on how quickly and effectively they develop new technologies.

 

Today, companies are shifting focus toward a customer-centric approach, where customer experience and satisfaction are paramount. After all, how your customers feel about your products can make or break a business.

Performance diagnostic tools can help you optimize a slow web page but won’t show you whether your visitors are satisfied.

So where do the classic stereotypes that engineers only care about performance and marketers only care about profit fit into the customer-centric business model? The answer is they don't: in a business where each department works against the same metric, improving the customer experience, having separate KPIs is as redundant as a trap door in a canoe.

 

The only KPI that matters is “are my customers happy?”

 

Developers + Marketing KPIs = True

With technology being integral to any online business, marketers are now in a position to gather so much data, in such detail, that we are on the front line when it comes to gauging the satisfaction and experience of our customers. We can see what path a visitor took on our website, how long they took to complete their journey, and whether they achieved what they set out to do.

 

Armed with this, we stand in a position to influence the technologies developers build and use.

 

Support teams, no longer confined to troubleshooting customer problems, have become Customer Success teams that directly impact how developers build products, armed with firsthand data from their customers.

 

So as the lines blur between departments, it shouldn’t come as a surprise that engineering teams should care about marketing metrics. After all, if a product is only as effective as the people who use it, engineers build better products and websites when they know how customers intend to use them.

 

Collaboration is King

“How could engineers possibly make good use of marketing KPIs?” you might ask. After all, the two are responsible for separate ends of your business but can benefit from the same data.

 

Take a vital page on your business’s website: it’s not the fastest page on the net but its load time is consistent and it achieves its purpose: to convert your visitors to customers. Suddenly your bounce rate has shot up from 5% to 70%.

Ask an engineer to troubleshoot the issue and they might tell you that the page isn't efficient. It takes 2.7 seconds to load, 0.7 seconds over the commonly cited two-second benchmark, and some of the files on your site are huge.

 

Ask a marketer the same question and they might tell you that the content is sloppy, making the purpose of the page unclear, that the colors are off-brand, and that an important CTA is missing.

 

Even though both have been looking at the same page, they’ve come to two very different results, but the bottom line is that your customer doesn’t care about what went wrong. What matters is that the issue is identified and solved, quickly.

 

Unified Metrics Mean Unified Monitoring

Having unified KPIs across the various teams in your organization means they should all draw their data from the same source: a single, unified monitoring tool.

 

For businesses where the customer comes first, a new breed of monitoring is evolving that offers organizations this unified view, centered on how your customer experiences your site: Digital Experience Monitoring. Or, seeing as everything we do is digital, how about we just call it Experience Monitoring?

With Digital Experience Monitoring, your marketers and your engineering teams can follow a customer's journey through your site, see how they navigated it, and see where and why interest became a sale or a lost opportunity.

 

Let's go back to our previous example: both your marketer and your engineer will see that although your bounce rate skyrocketed, the page load time and size stayed consistent. What they might also see is that the onboarding flow you implemented, which coincides with the bounce rate spike, is confusing your customers, meaning they leave frustrated and unwilling to convert.

 

Digital Experience Monitoring gives a holistic view of your website’s health and helps you answer questions like:

  • Where your visitors come from
  • When they visit your site
  • What they visit and the journey they take to get there
  • How your site's performance impacts your visitors

By giving your internal teams access to the same metrics, you foster greater transparency across your organization which leads to faster resolution of issues, a deeper knowledge of your visitors and better insights into what your customers love about your products.

 

Pingdom’s Digital Experience Monitoring, Visitor Insights, bridges the gap between site performance and customer satisfaction, meaning you can guess less and know more about how your visitors experience your site.

Page load time is inversely related to page views and conversion rates. While this is probably not a controversial statement, as the causality is intuitive, there is empirical data from industry leaders such as Amazon, Google, and Bing to back it, reported in High Scalability and O'Reilly's Radar, for example.

 

As web technology has become much more complex over the last decade, the issue of performance has remained a challenge as it relates to user experience. Fast forward to 2018, and UX is identified as a key requirement for business success by CIOs and CDOs.

 

In today’s growing ecosystem of competing web services, the undeniable reality remains that performance impacts business and it can represent a major competitive (dis)advantage. Whether your application relies on AWS, Azure, Heroku, Salesforce, Cloud Foundry, or any other SaaS platform, consider these five tips for monitoring SaaS services.

 

1. Realize the Importance of Monitoring

In case we haven’t established that app performance is critical for business success, let’s look at research done in the online retail sector.

 

“E-commerce sites must adopt a zero-tolerance policy for any performance issues that will impact customer experience [in order to remain competitive]” according to Retail Systems Research. Their conclusion is that performance management must shift from being considered an IT issue to being a business matter.

 

We can take this concept into more specific terms, as stated in our article series on Building a SaaS Service for an Unknown Scale. “Treat scalability and reliability as product features; this is the only way we can build a world-class SaaS application for unknown scale.”

Data from Measuring the Business Impact of IT Through Application Performance (2015).

 

End users have come to expect very fast, real-time-like interaction with most software, regardless of the system complexities behind the scenes. This means that commercial applications and SaaS services need to be built and integrated with performance in mind at all times. And so, knowing how to measure their performance from day one is paramount. Logs extend application performance monitoring (APM) by giving you deeper insights into the causes of performance problems as well as application errors that can cause user experience problems.

 

2. Incorporate a Monitoring Strategy Early On

In today’s world, planning for your SaaS service’s successful adoption to take time (and thus worrying about its performance and UX later) is like selling 100 tickets to a party but only beginning preparations on the day of the event. Needless to say, such a plan is prone to produce disappointed customers, and it can even destroy a brand. Fortunately, with SaaS monitoring solutions like SolarWinds® Loggly®, it’s not time-consuming or expensive to implement monitoring.

 

In fact, letting scalability become a bottleneck is the first of the Six Critical SaaS Engineering Mistakes to Avoid we published some time ago. We recommend defining realistic adoption goals and scenarios in early project stages, and mapping them into performance, stress, and capacity testing. To run these tests, you'll need to be able to monitor specific app traffic, errors, user engagement, and other metrics that tech and business teams need to define together.

 

A good place to start is with the Four Golden Signals described by Google’s Monitoring Distributed Systems book chapter: Latency, Traffic, Errors, and Saturation. Finally, and most importantly from the business perspective, your key metrics can be used as service level indicators (SLI), which are measures of the service level provided to customers.

 

Based on your SLIs and adoption goals, you'll be able to establish service level objectives (SLOs) so your ops team can target specific availability levels (uptime and performance). And, as a SaaS service provider, you should plan to offer service level agreements (SLAs). SLAs are contracts with your clients that specify what happens if you fail to meet non-functional requirements; their terms are based on your SLOs but can, of course, be negotiated with each client. SLIs, SLOs, and SLAs are the basis for successful site reliability engineering (SRE).

Apache Preconfigured Dashboards in Loggly can help you watch SLOs in a single click.
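
As a back-of-the-envelope illustration of how an SLO becomes an operational target, an availability objective translates directly into a monthly error budget:

<?php
// Turn an availability SLO into a monthly downtime budget.
$slo             = 0.999;          // 99.9% availability target
$minutesPerMonth = 30 * 24 * 60;   // 43,200 minutes in a 30-day month
$errorBudget     = (1 - $slo) * $minutesPerMonth;
printf("Allowed downtime: %.1f minutes per month\n", $errorBudget);   // 43.2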

 

For a seamless understanding among tech and business leadership, key performance indicators (KPIs) should be identified for the various business stakeholders. KPIs should then be mapped to the performance metrics that compose each SLA (so they can be monitored). Defining a matrix of KPIs vs. metrics vs. area of business impact as part of the business documentation is a good option. For example, web conversion rate could map to page load time and number of outages, and impacts sales.

 

Finally, don’t forget to consider and plan for governance: roles and responsibilities around information (e.g., ownership, prioritization, and escalation rules). The RACI model can help you establish a clear matrix of which team is responsible, accountable, consulted, and informed when there are unplanned events emanating from or affecting business technology.

 

3. Have Application Logging as a Code Standard

Tech leadership should realize that the main function of logging begins after the initial development is complete. Good logging serves multiple purposes:

  1. Improving debugging during development iterations
  2. Providing visibility for tuning and optimizing complex processes
  3. Understanding and addressing failures of production systems
  4. Business intelligence

“The best SaaS companies are engineered to be data-driven, and there’s no better place to start than leveraging data in your logs.” (From the last of our SaaS Engineering Mistakes)

 

Best practices for logging is a topic that’s been widely written about. For example, see our article on best practices for creating logs. Here are a few guidelines from that and other sources:

  • Define logging goals and criteria to decide what to log. (Logging absolutely everything produces noise and is needlessly expensive.)
  • Log messages should contain data, context, and description. They need to be digestible (structured in a way that both humans and machines can read them).
  • Ensure that log messages are appropriate in severity using standard levels such as FATAL, ERROR, WARN, INFO, DEBUG, TRACE (See also Syslog facilities and levels).
  • Avoid side effects on the code execution. Particularly, don’t let logging halt your app by using non-blocking calls.
  • External systems: log all data that leaves your application and all data that comes in.
  • Use a standard log message format with clear key-value pairs and/or consider a known text standard format like JSON. (See figure 4 below.)
  • Support distributed logging: centralize logs to a shareable, searchable platform such as Loggly. (A short Monolog sketch follows this list.)
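
As a sketch of the last two guidelines using Monolog (the endpoint hostname is hypothetical), a handler can emit structured JSON and ship it to a central syslog platform:

<?php
require 'vendor/autoload.php';

use Monolog\Logger;
use Monolog\Handler\SyslogUdpHandler;
use Monolog\Formatter\JsonFormatter;

$log     = new Logger('checkout');
$handler = new SyslogUdpHandler('logs.example.com', 514);   // hypothetical central syslog endpoint
$handler->setFormatter(new JsonFormatter());                // machine-readable key/value output
$log->pushHandler($handler);

// Message plus structured context; severity chosen to match the event.
$log->warning('Payment gateway slow', ['roundtrip_ms' => 840, 'gateway' => 'acme']);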


Loggly automatically parses several log formats you can navigate with the Fields Explorer.

 

Every stage in the software development life cycle can be enriched by logs and other metrics. Implementation, integration, staging, and production deployment (especially rolling deploys) will particularly benefit from monitoring such metrics appropriately.

 

Logs constitute valuable data for your tech team, and invaluable data for your business. Now that you have rich information about the app generated in real time, think about ways to put it to good use.

 

4. Automate Your Monitoring Configuration

Modern applications are deployed using infrastructure as code (IaC) techniques, which replace fragile server configuration with systems that can be easily torn down and restarted. If your team has made undocumented changes to servers and is too scared to shut them down, they are essentially “pet” servers.

 

If you manually deploy monitoring configuration on a per-server basis, then you have the potential to lose visibility when servers stop or when you add new ones. If you treat monitoring as something to be automatically deployed and configured, then you’ll get better coverage for less effort in the long run. This becomes even more important when testing new versions of your infrastructure or code, and when recovering from outages. Tools like Terraform, Ansible, Puppet, and CloudFormation can automate not just the deployment of your application but the monitoring of it as well.

Monitoring tools typically have system agents that can be installed on your infrastructure to begin streaming metrics into their service. In the case of applications built on SaaS platforms, there are convenient integrations that plug into well-known ecosystems. For example, Loggly streams and centralizes logs as metrics, and supports dozens of out-of-box systems, including the Amazon Cloudwatch and Heroku PaaS platforms.

 

5. Use Alerts on Your Key Metrics

Monitoring solutions like Loggly can alert you to changes in your SLIs over time, such as your error rate. They can help you visually identify the types of errors that occur and when they start. This will help identify root causes and fix problems faster, minimizing the impact on user experience.

Loggly Chart of application errors split by errorCode.

 

Custom alerts can be created from saved log searches, which act as key metrics of your application’s performance. Loggly even lets you integrate alerts to incident management systems like PagerDuty and OpsGenie.

Adding an alert from a Syslog error log search in Loggly.

 

In conclusion, monitoring your SaaS service performance is very important because it significantly impacts your business’ bottom line. This monitoring has to be planned for, applied early on, and instrumented for all the stages in the SDLC.

 

Additionally, we explained how and why correct logging is one of the best sources of key metrics for your monitoring goals during development and production of your SaaS service. Proper logging on an easy-to-use platform such as Loggly will also help your business harness invaluable intel in real time. You can leverage these streams of information for tuning your app, improving your service, and discovering new revenue models.

 

Sign up for a free 14-day trial of SolarWinds Loggly to start doing logging right today, and move your SaaS business into the next level of performance control and business intelligence.

Let's dream for a while: imagine your databases. All of them are running fast and smooth. There are no critical issues, no warnings. All requests are handled immediately, and the response time is practically immeasurable. Sounds like database nirvana, doesn't it? Now, let's face reality. You've resolved all the critical issues of the database, but people still report slowdowns. Everything looks good at first glance, but your sixth sense tells you something bad is happening under the surface. You could start shooting in the dark and hope to hit the target, or you can gather more information about what's going on inside the database and make a single, surgically precise cut to solve the problem.

 

We’ve got good news for you. SolarWinds has a new tool called SQL Plan Warnings. For the first time, you can inspect the list of queries that have warnings without spending hours on manual and labor-intensive work. Oh, and we almost forgot to mention—this tool is available for you right now for free.

Why do we believe the free SQL Plan Warnings tool can help you improve your databases? Well, the SQL Server optimizer often comes up with bad plans carrying warnings. These can cause increased resource consumption, increased wait time, and unnecessary end-user or customer angst. For these reasons, a database professional should look at them. But we don't always have the time or resources to do so.

 

SQL Plan Warnings free tool at a glance:

  • Gives you unique visibility into plan warnings that are easily overlooked and can affect query performance
  • Sorts all warnings by consumed CPU time, elapsed time, or executions
  • Filters results by warning type or by specific keywords
  • Lets you investigate plan warnings, query text, or the complete query plan in a single click
  • Requires no installation; just download the tool and run it
  • Runs on Microsoft Windows and macOS

 

And what can SQL Plan Warnings check for you?

  • Spill to TempDB – Sorts that are larger than estimated can spill to disk via TempDB. This can dramatically slow down queries. There are two similar warnings that fall into this category.
  • No join predicates – Query does not properly join tables/objects, which can cause Cartesian products and slow queries.
  • Implicit conversion – A column of data is being converted to another data type, which can cause a query to not use an index.
  • Missing indexes – SQL Server is telling us there is an index that may help performance.
  • Missing column statistics – If statistics are missing, it can lead to bad decisions by the optimizer.
  • Lookup warning – An index is being used, but it's not a covering index, and a visit back to the table is required to complete the query.

 

The free SQL Plan Warnings tool brings a fresh new feature to your database management capabilities and gives you another tool to improve query performance. Download it here today and be another step closer to our dream—everything in a database running fast and smooth with no critical issues and no warnings.

When development started on NGINX in 2002, the goal was to develop a web server which would be more performant than Apache had been up to that point. While NGINX may not offer all of the features available in Apache, its default configuration can handle approximately four times the number of requests per second while using significantly less memory.

 

While switching to a web server with better performance seems like a no-brainer, it’s important that you have a monitoring solution in place to ensure that your web server is performing optimally, and that users who are visiting the NGINX-hosted site receive the best possible experience. But how do we ensure that the experience is as performant as expected for all users?

 

Monitoring!

 

This article is meant to assist you in putting together a monitoring plan for your NGINX deployments. We’ll look at what metrics you should be monitoring, why they are important, and putting a monitoring plan in place using SolarWinds® AppOptics™.

 

Monitoring is a Priority

 

As engineers, we all understand and appreciate the value that monitoring provides. In the age of DevOps, however, when engineers are responsible for both the engineering and deployment of solutions into a production environment, monitoring is often relegated to the list of things we plan to do in the future. In order to be the best engineers we can be, monitoring should be the priority from day one.

 

Accurate and effective monitoring allows us to test the efficiency of our solutions and helps identify and troubleshoot inefficiencies and other potential problems. Once the solution requires operational support, monitoring lets us ensure the application is running efficiently and alerts us when things go wrong. An effective monitoring plan should help identify problems before they start, allowing engineers to resolve issues proactively instead of purely reactively.

 

Specific Metrics to Consider with NGINX

 

Before we can develop a monitoring plan, we need to know what metrics are available for monitoring, understand what they mean, and how we can use them. There are two distinct groups of metrics we should be concerned with—metrics related to the web server itself, and those related to the underlying infrastructure.

 

While a highly performant web server like NGINX may be able to handle more requests and traffic, it is vital that the machine hosting the web server has the necessary resources as well. Each metric represents a potential limit to the performance of your application. Ultimately, you want to ensure your web server and underlying infrastructure are able to operate efficiently without approaching those limits.

 

NGINX Web Server-specific Metrics

 

  • Current Connections
    Indicates the number of active and waiting client connections with the server. This may include actual users and automated tasks or bots.
  • Current Requests
    Each connection may be making one or more requests to the server. This number indicates the total count of requests coming in.
  • Connections Processed
    This shows the number of connections that have been accepted and handled by the server. Dropped connections can also be monitored. (A stub_status scraping sketch follows this list.)
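
These counters come from NGINX's stub_status module. As a quick sketch, assuming stub_status is exposed at /nginx_status on localhost, they can be scraped and parsed like this:

<?php
// Read and parse NGINX stub_status output. A typical response looks like:
//   Active connections: 291
//   server accepts handled requests
//    16630948 16630948 31070465
//   Reading: 6 Writing: 179 Waiting: 106
$status = file_get_contents('http://127.0.0.1/nginx_status');

preg_match('/Active connections:\s+(\d+)/', $status, $active);
preg_match('/\s(\d+)\s+(\d+)\s+(\d+)/', $status, $counters);   // accepts, handled, requests

echo "Active connections: {$active[1]}\n";
echo "Accepted: {$counters[1]}, handled: {$counters[2]}, requests: {$counters[3]}\n";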

 

Infrastructure-specific Metrics

  • CPU Usage
    An indication of the processing usage of the underlying machine. This should be measured as utilization across all cores, if using a multi-core machine.
  • Memory Usage
    Measurement of the memory currently in use on the machine.
  • Swap Usage
    Swap is what the host machine uses when it runs out of memory or if the memory region has been unused for a period of time. It is significantly slower, and is generally only used in an emergency. When an application begins using swap space, it’s usually an indicator that something is amiss.
  • Network Bandwidth
    Similar to traffic, this is a measurement of information flowing in and out of the machine. Both sustained and peak throughput are worth monitoring here.
  • Disk Usage
    Even if the web server is not physically storing files on the host machine, space is required for logging, temporary files, and other supporting files.
  • Load
    Load is a performance metric that combines many of the other metrics into a single number. A common rule of thumb is that the load on the machine should be less than the number of processing cores (see the sketch after this list).
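
That rule of thumb is easy to script; here's a tiny PHP sketch assuming a Linux host with the nproc utility available:

<?php
// Compare the 1-minute load average against the number of CPU cores.
$load  = sys_getloadavg()[0];
$cores = (int) trim(shell_exec('nproc'));

echo $load <= $cores
    ? "OK: load {$load} is within {$cores} cores\n"
    : "WARN: load {$load} exceeds {$cores} cores\n";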

 

Let’s look at how to configure monitoring on your instances with AppOptics, along with building a dashboard which will show each of those metrics.

 

Installing the AppOptics Agent on the Server

 

Before you start, you’ll need an account with AppOptics. If you don’t already have one, you can create a demo account, which will give you 14 days to try the service, free of charge.

 

The first thing to do to allow AppOptics to aggregate the metrics from the server is install the agent on all instances. To do this, you’ll need to reference your AppOptics API token when setting up the agent. Log in to your AppOptics account and navigate to the Infrastructure page.

 

Locate the Add Host button, and click on it. It should look similar to the image below.

 

Fig. 2. AppOptics Host Agent Installation

 

I used the Easy Install option when setting up the instances for this article. Ensure that Easy Install is selected, and select your Linux distribution. I used an Ubuntu image in the AWS Cloud, but this will work on almost any Linux server.

 

Note: Prior to installation of the agent, the bottom of the dialog below will not contain the success message.

 

Copy the command from the first box, and then SSH into the server and run the Easy Install script.

 

Fig. 3. Easy Install Script to Add AppOptics Agent to a Server

 

When the agent installs successfully, you should be presented with the following message in your terminal. The “Confirm successful installation” box on the AppOptics agent screen should look similar to the above, with a white-on-blue check mark. You should also see “Agent connected.”

 

Fig. 4. Installing the AppOptics Agent on your NGINX Instance

 

Configuring the AppOptics Agent

 

With the agent installed, the next step is to configure NGINX to report metrics to the agent. Navigate back to the Infrastructure page, Integrations tab, and locate the NGINX plugin.

 

Note: Prior to enabling the integration, the “enabled” checkbox won’t be marked.

 

Fig. 5. NGINX Host Agent Plugin

 

Click on the plugin, and the following panel will appear. Follow the instructions in the panel, click Enable Plugin, and your metrics will start flowing from the server into AppOptics.

 

Fig. 6. NGINX Plugin Setup

 

When everything is configured, either click on the NGINX link in the panel’s Dashboard tab, or navigate to the Dashboards page directly, then select the NGINX link to view the default dashboard provided by AppOptics.

 

Working With the NGINX Dashboard

 

The default NGINX dashboard provided by AppOptics displays many of the web server performance metrics we discussed earlier, and it should look similar to the image below.

 

Fig. 8. Default AppOptics Dashboard

 

Now we need to add some additional metrics to get a full picture of the server’s performance. Unfortunately, you can’t modify the default dashboard, but it’s easy to create a copy and add metrics of your own. Start by clicking the Copy Dashboard button at the top of the screen.

 

Give your custom dashboard a name. For this example, I’m monitoring an application called Retwis, so I’m calling mine “NGINX-Retwis.” It’s also helpful to select the “Open dashboard on completion” option, so you don’t have to go looking for the dashboard after it’s created.

 

Let’s do some customization. First, we want to ensure we’re only monitoring the instances we need to, which we do by filtering the chart or dashboard. You can find out more about setting up these filters in the documentation for Dynamic Tags.

 

With our sources filtered, we can add some additional metrics. Let’s look at CPU Usage, Memory Usage, and Load. Click on the Plus button at the bottom right of the dashboard. For CPU Usage and Memory Usage, we’ll add a Stacked chart for each. Click on the Stacked icon.

 

Fig. 10. Create New Chart

 

In the Metrics search box, type “CPU” and press Enter. A selection of available metrics will appear below. I’m going to select system.cpu.utilization, but your selection may differ depending on the infrastructure you’re using. Select the checkbox next to the appropriate metric, then click Add Metrics to Chart. You can add multiple metrics to the chart by repeating the same process, but we’ll stick with one for now.

 

If you click on Chart Attributes, you can change the scale of the chart, adjust the Y-axis label, and even link it to another dashboard to show more detail for a specific metric. When you’re done, click on the green Save button, and you’ll be returned to your dashboard, with the new chart added. Repeat this for Memory Usage. I chose the “system.mem.used” metric.

 

For Load, I’m going to use the Big Number chart type and select the system.load.1_rel metric. When you’re done, your dashboard should look similar to the image below.

 

Fig. 11. Custom Dashboard to View NGINX Metrics

 

Pro tip: You can move charts around by hovering over a chart, then clicking and dragging the three dots that appear at its top. The menu icon at the top right of each chart lets you edit or delete it, along with other chart-related options.

 

Beyond Monitoring

 

Once you have a monitoring plan in place and functioning, the next step is to determine baseline metrics for your application and set up alerts that trigger when significant deviations occur. Traffic is a useful baseline to establish and watch. A significant drop in traffic may indicate a problem preventing clients from reaching the service. A significant increase may reflect growing demand, which could call for added capacity, or it may signal a cyberattack that calls for defensive measures.
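 

As a simple illustration of that kind of baseline-deviation alerting, here’s a hedged Python sketch. The baseline, tolerance, and function name are hypothetical placeholders; in practice, you’d express this logic as an AppOptics alert rather than run it yourself.

def check_traffic(current_rps: float, baseline_rps: float,
                  tolerance: float = 0.5) -> str:
    """Compare the current request rate against an established baseline.

    tolerance=0.5 flags anything more than 50% above or below baseline;
    both values here are illustrative, not recommendations.
    """
    deviation = (current_rps - baseline_rps) / baseline_rps
    if deviation <= -tolerance:
        return "ALERT: traffic drop - clients may be unable to reach the service"
    if deviation >= tolerance:
        return "ALERT: traffic spike - add capacity or investigate a possible attack"
    return "OK: traffic within the expected range"

# Example: baseline of 400 requests/sec, currently observing 90
print(check_traffic(current_rps=90, baseline_rps=400))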

 

Because your NGINX server is a customer-facing part of your infrastructure, monitoring it is critical. You need to know immediately when a sudden change in traffic or connections could impact the rest of your application or website. AppOptics provides an easy way to monitor your NGINX servers, and it typically takes only a few minutes to get started. Learn more about AppOptics infrastructure monitoring and try it today with a free 14-day trial.
