Welcome!

Agile Computing Authors: Elizabeth White, Liz McMillan, Darren Anstee, Pat Romanski, James Pathman

Blog Feed Post

10 Things We Forgot to Monitor

10 Things We Forgot to Monitor

After publishing Ben’s blog post about “Memory Monitoring with LXC” we realized there is a lot of interest in articles about Monitoring. I got in touch with Jehiah Czebotar, Head of Engineering at bitly, and asked him if we could republish his blog post about things bitly forgot to monitor.

Jehiah originally published his blog post on the bitly engeneering blog. You can find Jehiah on twitter and on his personal page. Definitely check out his “Personal Annual Reports“.


There is always a set of standard metrics that are universally monitored (Disk Usage, Memory Usage, Load, Pings, etc). Beyond that, there are a lot of lessons that we’ve learned from operating our production systems that have helped shape the breadth of monitoring that we perform at bitly.

One of my favorite all-time tweets is from @DevOps_Borat

What follows is a small list of things we monitor at bitly that have grown out of those (sometimes painful!) experiences, and where possible little snippets of the stories behind those instances.

1 – Fork Rate

We once had a problem where IPv6 was intentionally disabled on a box via options ipv6 disable=1 and alias ipv6 off in /etc/modprobe.conf. This caused a large issue for us: each time a new curl object was created, modprobe would spawn, checking net-pf-10 to evaluate IPv6 status. This fork bombed the box, and we eventually tracked it down by noticing that the process counter in /proc/stat was increasing by several hundred a second. Normally you would only expect a fork rate of 1-10/sec on a production box with steady traffic.

2 – flow control packets

TL;DR; If your network configuration honors flow control packets and isn’t configured to disable them, they can temporarily cause dropped traffic. (If this doesn’t sound like an outage, you need your head checked.)

$ /usr/sbin/ethtool -S eth0 | grep flow_control
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0

Note: Read this to understand how these flow control frames can cascade to switch-wide loss of connectivity if you use certain Broadcom NIC’s. You should also trend these metrics on your switch gear. While at it, watch your dropped frames.

3 – Swap In/Out Rate

It’s common to check for swap usage above a threshold, but even if you have a small quantity of memory swapped, it’s actually the rate it’s swapped in/out that can impact performance, not the quantity. This is a much more direct check for that state.

4 – Server Boot Notification

Unexpected reboots are part of life. Do you know when they happen on your hosts? Most people don’t. We use a simple init script that triggers an ops email on system boot. This is valuable to communicate provisioning of new servers, and helps capture state change even if services handle the failure gracefully without alerting.

5 – NTP Clock Offset

If not monitored, yes, one of your servers is probably off. If you’ve never thought about clock skew you might not even be running ntpd on your servers. Generally there are 3 things to check for. 1) That ntpd is running, 2) Clock skew inside your datacenter, 3) Clock skew from your master time servers to an external source.

We use check_ntp_time for this check.

6 – DNS Resolutions

Internal DNS – It’s a hidden part of your infrastructure that you rely on more than you realize. The things to check for are 1) Local resolutions from each server, 2) If you have local DNS servers in your datacenter, you want to check resolution, and quantity of queries, 3) Check availability of each upstream DNS resolver you use.

External DNS – It’s good to verify your external domains resolve correctly against each of your published external nameservers. At bitly we also rely on several CC TLD’s and we monitor those authoritative servers directly as well (yes, it’s happened that all authoritative nameservers for a TLD have been offline).

7 – SSL Expiration

It’s the thing everyone forgets about because it happens so infrequently. The fix is easy, just check it and get alerted with enough timeframe to renew your SSL certificates.

define command{
    command_name    check_ssl_expire
    command_line    $USER1$/check_http --ssl -C 14 -H $ARG1$
}
define service{
    host_name               virtual
    service_description     bitly_com_ssl_expiration
    use                     generic-service
    check_command           check_ssl_expire!bitly.com
    contact_groups          email_only
    normal_check_interval   720
    retry_check_interval    10
    notification_interval   720
}

8 – DELL OpenManage Server Administrator (OMSA)

We run bitly split across two data centers, one is a managed environment with DELL hardware, and the second is Amazon EC2. For our DELL hardware it’s important for us to monitor the outputs from OMSA. This alerts us to RAID status, failed disks (predictive or hard failures), RAM Issues, Power Supply states and more.

9 – Connection Limits

You probably run things like memcached and mysql with connection limits, but do you monitor how close you are to those limits as you scale out application tiers?

Related to this is addressing the issue of processes running into file descriptor limits. We make a regular practice of running services with ulimit -n 65535 in our run scripts to minimize this. We also set Nginx worker_rlimit_nofile.

10 – Load Balancer Status.

We configure our Load Balancers with a health check which we can easily force to fail in order to have any given server removed from rotation.We’ve found it important to have visibility into the health check state, so we monitor and alert based on the same health check. (If you use EC2 Load Balancers you can monitor the ELB state from Amazon API’s).

Various Other things to watch

New entries written to Nginx Error Logs, service restarts (assuming you have something in place to auto-restart them on failure), numa stats, new process core dumps (great if you run any C code).


We want to thank Jehiah for making his original article available to our readers. Is there something missing on this list in your opportunity? Let us know in the comments!

Codeship – A hosted Continuous Deployment platform for web applications

Read the original blog entry...

More Stories By Manuel Weiss

I am the cofounder of Codeship – a hosted Continuous Integration and Deployment platform for web applications. On the Codeship blog we love to write about Software Testing, Continuos Integration and Deployment. Also check out our weekly screencast series 'Testing Tuesday'!

@ThingsExpo Stories
The vision of a connected smart home is becoming reality with the application of integrated wireless technologies in devices and appliances. The use of standardized and TCP/IP networked wireless technologies in line-powered and battery operated sensors and controls has led to the adoption of radios in the 2.4GHz band, including Wi-Fi, BT/BLE and 802.15.4 applied ZigBee and Thread. This is driving the need for robust wireless coexistence for multiple radios to ensure throughput performance and th...
Internet of @ThingsExpo, taking place November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 19th Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The Internet of Things (IoT) is the most profound change in personal and enterprise IT since the creation of the Worldwide Web more than 20 years ago. All major researchers estimate there will be tens of billions devices - comp...
The many IoT deployments around the world are busy integrating smart devices and sensors into their enterprise IT infrastructures. Yet all of this technology – and there are an amazing number of choices – is of no use without the software to gather, communicate, and analyze the new data flows. Without software, there is no IT. In this power panel at @ThingsExpo, moderated by Conference Chair Roger Strukhoff, panelists will look at the protocols that communicate data and the emerging data analy...
“We're a global managed hosting provider. Our core customer set is a U.S.-based customer that is looking to go global,” explained Adam Rogers, Managing Director at ANEXIA, in this SYS-CON.tv interview at 18th Cloud Expo, held June 7-9, 2016, at the Javits Center in New York City, NY.
According to Forrester Research, every business will become either a digital predator or digital prey by 2020. To avoid demise, organizations must rapidly create new sources of value in their end-to-end customer experiences. True digital predators also must break down information and process silos and extend digital transformation initiatives to empower employees with the digital resources needed to win, serve, and retain customers.
Smart Cities are here to stay, but for their promise to be delivered, the data they produce must not be put in new siloes. In his session at @ThingsExpo, Mathias Herberts, Co-founder and CTO of Cityzen Data, will deep dive into best practices that will ensure a successful smart city journey.
Why do your mobile transformations need to happen today? Mobile is the strategy that enterprise transformation centers on to drive customer engagement. In his general session at @ThingsExpo, Roger Woods, Director, Mobile Product & Strategy – Adobe Marketing Cloud, covered key IoT and mobile trends that are forcing mobile transformation, key components of a solid mobile strategy and explored how brands are effectively driving mobile change throughout the enterprise.
The Jevons Paradox suggests that when technological advances increase efficiency of a resource, it results in an overall increase in consumption. Writing on the increased use of coal as a result of technological improvements, 19th-century economist William Stanley Jevons found that these improvements led to the development of new ways to utilize coal. In his session at 19th Cloud Expo, Mark Thiele, Chief Strategy Officer for Apcera, will compare the Jevons Paradox to modern-day enterprise IT, e...
What happens when the different parts of a vehicle become smarter than the vehicle itself? As we move toward the era of smart everything, hundreds of entities in a vehicle that communicate with each other, the vehicle and external systems create a need for identity orchestration so that all entities work as a conglomerate. Much like an orchestra without a conductor, without the ability to secure, control, and connect the link between a vehicle’s head unit, devices, and systems and to manage the ...
In this strange new world where more and more power is drawn from business technology, companies are effectively straddling two paths on the road to innovation and transformation into digital enterprises. The first path is the heritage trail – with “legacy” technology forming the background. Here, extant technologies are transformed by core IT teams to provide more API-driven approaches. Legacy systems can restrict companies that are transitioning into digital enterprises. To truly become a lea...
DevOps at Cloud Expo, taking place Nov 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 19th Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The widespread success of cloud computing is driving the DevOps revolution in enterprise IT. Now as never before, development teams must communicate and collaborate in a dynamic, 24/7/365 environment. There is no time to wait for long dev...
Cloud computing is being adopted in one form or another by 94% of enterprises today. Tens of billions of new devices are being connected to The Internet of Things. And Big Data is driving this bus. An exponential increase is expected in the amount of information being processed, managed, analyzed, and acted upon by enterprise IT. This amazing is not part of some distant future - it is happening today. One report shows a 650% increase in enterprise data by 2020. Other estimates are even higher....
What are the new priorities for the connected business? First: businesses need to think differently about the types of connections they will need to make – these span well beyond the traditional app to app into more modern forms of integration including SaaS integrations, mobile integrations, APIs, device integration and Big Data integration. It’s important these are unified together vs. doing them all piecemeal. Second, these types of connections need to be simple to design, adapt and configure...
In his general session at 18th Cloud Expo, Lee Atchison, Principal Cloud Architect and Advocate at New Relic, discussed cloud as a ‘better data center’ and how it adds new capacity (faster) and improves application availability (redundancy). The cloud is a ‘Dynamic Tool for Dynamic Apps’ and resource allocation is an integral part of your application architecture, so use only the resources you need and allocate /de-allocate resources on the fly.
19th Cloud Expo, taking place November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, will feature technical sessions from a rock star conference faculty and the leading industry players in the world. Cloud computing is now being embraced by a majority of enterprises of all sizes. Yesterday's debate about public vs. private has transformed into the reality of hybrid cloud: a recent survey shows that 74% of enterprises have a hybrid cloud strategy. Meanwhile, 94% of enterpri...
SYS-CON Events announced today that CDS Global Cloud, an Infrastructure as a Service provider, will exhibit at the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. CDS Global Cloud is an IaaS (Infrastructure as a Service) provider specializing in solutions for e-commerce, internet gaming, online education and other internet applications. With a growing number of data centers and network points around the world, ...
Information technology is an industry that has always experienced change, and the dramatic change sweeping across the industry today could not be truthfully described as the first time we've seen such widespread change impacting customer investments. However, the rate of the change, and the potential outcomes from today's digital transformation has the distinct potential to separate the industry into two camps: Organizations that see the change coming, embrace it, and successful leverage it; and...
Major trends and emerging technologies – from virtual reality and IoT, to Big Data and algorithms – are helping organizations innovate in the digital era. However, to create real business value, IT must think beyond the ‘what’ of digital transformation to the ‘how’ to harness emerging trends, innovation and disruption. Architecture is the key that underpins and ties all these efforts together. In the digital age, it’s important to invest in architecture, extend the enterprise footprint to the cl...
There are several IoTs: the Industrial Internet, Consumer Wearables, Wearables and Healthcare, Supply Chains, and the movement toward Smart Grids, Cities, Regions, and Nations. There are competing communications standards every step of the way, a bewildering array of sensors and devices, and an entire world of competing data analytics platforms. To some this appears to be chaos. In this power panel at @ThingsExpo, moderated by Conference Chair Roger Strukhoff, Bradley Holt, Developer Advocate a...
SYS-CON Events announced today that LeaseWeb USA, a cloud Infrastructure-as-a-Service (IaaS) provider, will exhibit at the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. LeaseWeb is one of the world's largest hosting brands. The company helps customers define, develop and deploy IT infrastructure tailored to their exact business needs, by combining various kinds cloud solutions.