Welcome!

Web 2.0 Authors: Yeshim Deniz, Pat Romanski, Elizabeth White, Liz McMillan, Peter Dyer

Blog Feed Post

10 Things We Forgot to Monitor

10 Things We Forgot to Monitor

After publishing Ben’s blog post about “Memory Monitoring with LXC” we realized there is a lot of interest in articles about Monitoring. I got in touch with Jehiah Czebotar, Head of Engineering at bitly, and asked him if we could republish his blog post about things bitly forgot to monitor.

Jehiah originally published his blog post on the bitly engeneering blog. You can find Jehiah on twitter and on his personal page. Definitely check out his “Personal Annual Reports“.


There is always a set of standard metrics that are universally monitored (Disk Usage, Memory Usage, Load, Pings, etc). Beyond that, there are a lot of lessons that we’ve learned from operating our production systems that have helped shape the breadth of monitoring that we perform at bitly.

One of my favorite all-time tweets is from @DevOps_Borat

What follows is a small list of things we monitor at bitly that have grown out of those (sometimes painful!) experiences, and where possible little snippets of the stories behind those instances.

1 – Fork Rate

We once had a problem where IPv6 was intentionally disabled on a box via options ipv6 disable=1 and alias ipv6 off in /etc/modprobe.conf. This caused a large issue for us: each time a new curl object was created, modprobe would spawn, checking net-pf-10 to evaluate IPv6 status. This fork bombed the box, and we eventually tracked it down by noticing that the process counter in /proc/stat was increasing by several hundred a second. Normally you would only expect a fork rate of 1-10/sec on a production box with steady traffic.

2 – flow control packets

TL;DR; If your network configuration honors flow control packets and isn’t configured to disable them, they can temporarily cause dropped traffic. (If this doesn’t sound like an outage, you need your head checked.)

$ /usr/sbin/ethtool -S eth0 | grep flow_control
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0

Note: Read this to understand how these flow control frames can cascade to switch-wide loss of connectivity if you use certain Broadcom NIC’s. You should also trend these metrics on your switch gear. While at it, watch your dropped frames.

3 – Swap In/Out Rate

It’s common to check for swap usage above a threshold, but even if you have a small quantity of memory swapped, it’s actually the rate it’s swapped in/out that can impact performance, not the quantity. This is a much more direct check for that state.

4 – Server Boot Notification

Unexpected reboots are part of life. Do you know when they happen on your hosts? Most people don’t. We use a simple init script that triggers an ops email on system boot. This is valuable to communicate provisioning of new servers, and helps capture state change even if services handle the failure gracefully without alerting.

5 – NTP Clock Offset

If not monitored, yes, one of your servers is probably off. If you’ve never thought about clock skew you might not even be running ntpd on your servers. Generally there are 3 things to check for. 1) That ntpd is running, 2) Clock skew inside your datacenter, 3) Clock skew from your master time servers to an external source.

We use check_ntp_time for this check.

6 – DNS Resolutions

Internal DNS – It’s a hidden part of your infrastructure that you rely on more than you realize. The things to check for are 1) Local resolutions from each server, 2) If you have local DNS servers in your datacenter, you want to check resolution, and quantity of queries, 3) Check availability of each upstream DNS resolver you use.

External DNS – It’s good to verify your external domains resolve correctly against each of your published external nameservers. At bitly we also rely on several CC TLD’s and we monitor those authoritative servers directly as well (yes, it’s happened that all authoritative nameservers for a TLD have been offline).

7 – SSL Expiration

It’s the thing everyone forgets about because it happens so infrequently. The fix is easy, just check it and get alerted with enough timeframe to renew your SSL certificates.

define command{
    command_name    check_ssl_expire
    command_line    $USER1$/check_http --ssl -C 14 -H $ARG1$
}
define service{
    host_name               virtual
    service_description     bitly_com_ssl_expiration
    use                     generic-service
    check_command           check_ssl_expire!bitly.com
    contact_groups          email_only
    normal_check_interval   720
    retry_check_interval    10
    notification_interval   720
}

8 – DELL OpenManage Server Administrator (OMSA)

We run bitly split across two data centers, one is a managed environment with DELL hardware, and the second is Amazon EC2. For our DELL hardware it’s important for us to monitor the outputs from OMSA. This alerts us to RAID status, failed disks (predictive or hard failures), RAM Issues, Power Supply states and more.

9 – Connection Limits

You probably run things like memcached and mysql with connection limits, but do you monitor how close you are to those limits as you scale out application tiers?

Related to this is addressing the issue of processes running into file descriptor limits. We make a regular practice of running services with ulimit -n 65535 in our run scripts to minimize this. We also set Nginx worker_rlimit_nofile.

10 – Load Balancer Status.

We configure our Load Balancers with a health check which we can easily force to fail in order to have any given server removed from rotation.We’ve found it important to have visibility into the health check state, so we monitor and alert based on the same health check. (If you use EC2 Load Balancers you can monitor the ELB state from Amazon API’s).

Various Other things to watch

New entries written to Nginx Error Logs, service restarts (assuming you have something in place to auto-restart them on failure), numa stats, new process core dumps (great if you run any C code).


We want to thank Jehiah for making his original article available to our readers. Is there something missing on this list in your opportunity? Let us know in the comments!

Codeship – A hosted Continuous Deployment platform for web applications

Read the original blog entry...

More Stories By Manuel Weiss

I am the cofounder of Codeship – a hosted Continuous Integration and Deployment platform for web applications. On the Codeship blog we love to write about Software Testing, Continuos Integration and Deployment. Also check out our weekly screencast series 'Testing Tuesday'!

@ThingsExpo Stories
The Internet of Things is tied together with a thin strand that is known as time. Coincidentally, at the core of nearly all data analytics is a timestamp. When working with time series data there are a few core principles that everyone should consider, especially across datasets where time is the common boundary. In his session at Internet of @ThingsExpo, Jim Scott, Director of Enterprise Strategy & Architecture at MapR Technologies, discussed single-value, geo-spatial, and log time series data. By focusing on enterprise applications and the data center, he will use OpenTSDB as an example t...
The industrial software market has treated data with the mentality of “collect everything now, worry about how to use it later.” We now find ourselves buried in data, with the pervasive connectivity of the (Industrial) Internet of Things only piling on more numbers. There’s too much data and not enough information. In his session at @ThingsExpo, Bob Gates, Global Marketing Director, GE’s Intelligent Platforms business, to discuss how realizing the power of IoT, software developers are now focused on understanding how industrial data can create intelligence for industrial operations. Imagine ...
Cultural, regulatory, environmental, political and economic (CREPE) conditions over the past decade are creating cross-industry solution spaces that require processes and technologies from both the Internet of Things (IoT), and Data Management and Analytics (DMA). These solution spaces are evolving into Sensor Analytics Ecosystems (SAE) that represent significant new opportunities for organizations of all types. Public Utilities throughout the world, providing electricity, natural gas and water, are pursuing SmartGrid initiatives that represent one of the more mature examples of SAE. We have s...
There is no doubt that Big Data is here and getting bigger every day. Building a Big Data infrastructure today is no easy task. There are an enormous number of choices for database engines and technologies. To make things even more challenging, requirements are getting more sophisticated, and the standard paradigm of supporting historical analytics queries is often just one facet of what is needed. As Big Data growth continues, organizations are demanding real-time access to data, allowing immediate and actionable interpretation of events as they happen. Another aspect concerns how to deliver ...
Scott Jenson leads a project called The Physical Web within the Chrome team at Google. Project members are working to take the scalability and openness of the web and use it to talk to the exponentially exploding range of smart devices. Nearly every company today working on the IoT comes up with the same basic solution: use my server and you'll be fine. But if we really believe there will be trillions of these devices, that just can't scale. We need a system that is open a scalable and by using the URL as a basic building block, we open this up and get the same resilience that the web enjoys.
The Internet of Things will greatly expand the opportunities for data collection and new business models driven off of that data. In her session at @ThingsExpo, Esmeralda Swartz, CMO of MetraTech, discussed how for this to be effective you not only need to have infrastructure and operational models capable of utilizing this new phenomenon, but increasingly service providers will need to convince a skeptical public to participate. Get ready to show them the money!
The Industrial Internet revolution is now underway, enabled by connected machines and billions of devices that communicate and collaborate. The massive amounts of Big Data requiring real-time analysis is flooding legacy IT systems and giving way to cloud environments that can handle the unpredictable workloads. Yet many barriers remain until we can fully realize the opportunities and benefits from the convergence of machines and devices with Big Data and the cloud, including interoperability, data security and privacy.
The 3rd International Internet of @ThingsExpo, co-located with the 16th International Cloud Expo - to be held June 9-11, 2015, at the Javits Center in New York City, NY - announces that its Call for Papers is now open. The Internet of Things (IoT) is the biggest idea since the creation of the Worldwide Web more than 20 years ago.
Cloud Expo 2014 TV commercials will feature @ThingsExpo, which was launched in June, 2014 at New York City's Javits Center as the largest 'Internet of Things' event in the world.
"People are a lot more knowledgeable about APIs now. There are two types of people who work with APIs - IT people who want to use APIs for something internal and the product managers who want to do something outside APIs for people to connect to them," explained Roberto Medrano, Executive Vice President at SOA Software, in this SYS-CON.tv interview at Cloud Expo, held Nov 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
SYS-CON Media announced that Splunk, a provider of the leading software platform for real-time Operational Intelligence, has launched an ad campaign on Big Data Journal. Splunk software and cloud services enable organizations to search, monitor, analyze and visualize machine-generated big data coming from websites, applications, servers, networks, sensors and mobile devices. The ads focus on delivering ROI - how improved uptime delivered $6M in annual ROI, improving customer operations by mining large volumes of unstructured data, and how data tracking delivers uptime when it matters most.
DevOps Summit 2015 New York, co-located with the 16th International Cloud Expo - to be held June 9-11, 2015, at the Javits Center in New York City, NY - announces that it is now accepting Keynote Proposals. The widespread success of cloud computing is driving the DevOps revolution in enterprise IT. Now as never before, development teams must communicate and collaborate in a dynamic, 24/7/365 environment. There is no time to wait for long development cycles that produce software that is obsolete at launch. DevOps may be disruptive, but it is essential.
Wearable devices have come of age. The primary applications of wearables so far have been "the Quantified Self" or the tracking of one's fitness and health status. We propose the evolution of wearables into social and emotional communication devices. Our BE(tm) sensor uses light to visualize the skin conductance response. Our sensors are very inexpensive and can be massively distributed to audiences or groups of any size, in order to gauge reactions to performances, video, or any kind of presentation. In her session at @ThingsExpo, Jocelyn Scheirer, CEO & Founder of Bionolux, will discuss ho...
We’re no longer looking to the future for the IoT wave. It’s no longer a distant dream but a reality that has arrived. It’s now time to make sure the industry is in alignment to meet the IoT growing pains – cooperate and collaborate as well as innovate. In his session at @ThingsExpo, Jim Hunter, Chief Scientist & Technology Evangelist at Greenwave Systems, will examine the key ingredients to IoT success and identify solutions to challenges the industry is facing. The deep industry expertise behind this presentation will provide attendees with a leading edge view of rapidly emerging IoT oppor...
“With easy-to-use SDKs for Atmel’s platforms, IoT developers can now reap the benefits of realtime communication, and bypass the security pitfalls and configuration complexities that put IoT deployments at risk,” said Todd Greene, founder & CEO of PubNub. PubNub will team with Atmel at CES 2015 to launch full SDK support for Atmel’s MCU, MPU, and Wireless SoC platforms. Atmel developers now have access to PubNub’s secure Publish/Subscribe messaging with guaranteed ¼ second latencies across PubNub’s 14 global points-of-presence. PubNub delivers secure communication through firewalls, proxy ser...
The 3rd International Internet of @ThingsExpo, co-located with the 16th International Cloud Expo - to be held June 9-11, 2015, at the Javits Center in New York City, NY - announces that its Call for Papers is now open. The Internet of Things (IoT) is the biggest idea since the creation of the Worldwide Web more than 20 years ago.
Connected devices and the Internet of Things are getting significant momentum in 2014. In his session at Internet of @ThingsExpo, Jim Hunter, Chief Scientist & Technology Evangelist at Greenwave Systems, examined three key elements that together will drive mass adoption of the IoT before the end of 2015. The first element is the recent advent of robust open source protocols (like AllJoyn and WebRTC) that facilitate M2M communication. The second is broad availability of flexible, cost-effective storage designed to handle the massive surge in back-end data in a world where timely analytics is e...
"There is a natural synchronization between the business models, the IoT is there to support ,” explained Brendan O'Brien, Co-founder and Chief Architect of Aria Systems, in this SYS-CON.tv interview at the 15th International Cloud Expo®, held Nov 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
The Internet of Things will put IT to its ultimate test by creating infinite new opportunities to digitize products and services, generate and analyze new data to improve customer satisfaction, and discover new ways to gain a competitive advantage across nearly every industry. In order to help corporate business units to capitalize on the rapidly evolving IoT opportunities, IT must stand up to a new set of challenges. In his session at @ThingsExpo, Jeff Kaplan, Managing Director of THINKstrategies, will examine why IT must finally fulfill its role in support of its SBUs or face a new round of...
The BPM world is going through some evolution or changes where traditional business process management solutions really have nowhere to go in terms of development of the road map. In this demo at 15th Cloud Expo, Kyle Hansen, Director of Professional Services at AgilePoint, shows AgilePoint’s unique approach to dealing with this market circumstance by developing a rapid application composition or development framework.