Welcome!

Agile Computing Authors: Jonathan Fries, William Schmarzo, John Esposito, Igor Drobiazko, Elizabeth White

Blog Feed Post

10 Things We Forgot to Monitor

10 Things We Forgot to Monitor

After publishing Ben’s blog post about “Memory Monitoring with LXC” we realized there is a lot of interest in articles about Monitoring. I got in touch with Jehiah Czebotar, Head of Engineering at bitly, and asked him if we could republish his blog post about things bitly forgot to monitor.

Jehiah originally published his blog post on the bitly engeneering blog. You can find Jehiah on twitter and on his personal page. Definitely check out his “Personal Annual Reports“.


There is always a set of standard metrics that are universally monitored (Disk Usage, Memory Usage, Load, Pings, etc). Beyond that, there are a lot of lessons that we’ve learned from operating our production systems that have helped shape the breadth of monitoring that we perform at bitly.

One of my favorite all-time tweets is from @DevOps_Borat

What follows is a small list of things we monitor at bitly that have grown out of those (sometimes painful!) experiences, and where possible little snippets of the stories behind those instances.

1 – Fork Rate

We once had a problem where IPv6 was intentionally disabled on a box via options ipv6 disable=1 and alias ipv6 off in /etc/modprobe.conf. This caused a large issue for us: each time a new curl object was created, modprobe would spawn, checking net-pf-10 to evaluate IPv6 status. This fork bombed the box, and we eventually tracked it down by noticing that the process counter in /proc/stat was increasing by several hundred a second. Normally you would only expect a fork rate of 1-10/sec on a production box with steady traffic.

2 – flow control packets

TL;DR; If your network configuration honors flow control packets and isn’t configured to disable them, they can temporarily cause dropped traffic. (If this doesn’t sound like an outage, you need your head checked.)

$ /usr/sbin/ethtool -S eth0 | grep flow_control
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0

Note: Read this to understand how these flow control frames can cascade to switch-wide loss of connectivity if you use certain Broadcom NIC’s. You should also trend these metrics on your switch gear. While at it, watch your dropped frames.

3 – Swap In/Out Rate

It’s common to check for swap usage above a threshold, but even if you have a small quantity of memory swapped, it’s actually the rate it’s swapped in/out that can impact performance, not the quantity. This is a much more direct check for that state.

4 – Server Boot Notification

Unexpected reboots are part of life. Do you know when they happen on your hosts? Most people don’t. We use a simple init script that triggers an ops email on system boot. This is valuable to communicate provisioning of new servers, and helps capture state change even if services handle the failure gracefully without alerting.

5 – NTP Clock Offset

If not monitored, yes, one of your servers is probably off. If you’ve never thought about clock skew you might not even be running ntpd on your servers. Generally there are 3 things to check for. 1) That ntpd is running, 2) Clock skew inside your datacenter, 3) Clock skew from your master time servers to an external source.

We use check_ntp_time for this check.

6 – DNS Resolutions

Internal DNS – It’s a hidden part of your infrastructure that you rely on more than you realize. The things to check for are 1) Local resolutions from each server, 2) If you have local DNS servers in your datacenter, you want to check resolution, and quantity of queries, 3) Check availability of each upstream DNS resolver you use.

External DNS – It’s good to verify your external domains resolve correctly against each of your published external nameservers. At bitly we also rely on several CC TLD’s and we monitor those authoritative servers directly as well (yes, it’s happened that all authoritative nameservers for a TLD have been offline).

7 – SSL Expiration

It’s the thing everyone forgets about because it happens so infrequently. The fix is easy, just check it and get alerted with enough timeframe to renew your SSL certificates.

define command{
    command_name    check_ssl_expire
    command_line    $USER1$/check_http --ssl -C 14 -H $ARG1$
}
define service{
    host_name               virtual
    service_description     bitly_com_ssl_expiration
    use                     generic-service
    check_command           check_ssl_expire!bitly.com
    contact_groups          email_only
    normal_check_interval   720
    retry_check_interval    10
    notification_interval   720
}

8 – DELL OpenManage Server Administrator (OMSA)

We run bitly split across two data centers, one is a managed environment with DELL hardware, and the second is Amazon EC2. For our DELL hardware it’s important for us to monitor the outputs from OMSA. This alerts us to RAID status, failed disks (predictive or hard failures), RAM Issues, Power Supply states and more.

9 – Connection Limits

You probably run things like memcached and mysql with connection limits, but do you monitor how close you are to those limits as you scale out application tiers?

Related to this is addressing the issue of processes running into file descriptor limits. We make a regular practice of running services with ulimit -n 65535 in our run scripts to minimize this. We also set Nginx worker_rlimit_nofile.

10 – Load Balancer Status.

We configure our Load Balancers with a health check which we can easily force to fail in order to have any given server removed from rotation.We’ve found it important to have visibility into the health check state, so we monitor and alert based on the same health check. (If you use EC2 Load Balancers you can monitor the ELB state from Amazon API’s).

Various Other things to watch

New entries written to Nginx Error Logs, service restarts (assuming you have something in place to auto-restart them on failure), numa stats, new process core dumps (great if you run any C code).


We want to thank Jehiah for making his original article available to our readers. Is there something missing on this list in your opportunity? Let us know in the comments!

Codeship – A hosted Continuous Deployment platform for web applications

Read the original blog entry...

More Stories By Manuel Weiss

I am the cofounder of Codeship – a hosted Continuous Integration and Deployment platform for web applications. On the Codeship blog we love to write about Software Testing, Continuos Integration and Deployment. Also check out our weekly screencast series 'Testing Tuesday'!

@ThingsExpo Stories
There is an ever-growing explosion of new devices that are connected to the Internet using “cloud” solutions. This rapid growth is creating a massive new demand for efficient access to data. And it’s not just about connecting to that data anymore. This new demand is bringing new issues and challenges and it is important for companies to scale for the coming growth. And with that scaling comes the need for greater security, gathering and data analysis, storage, connectivity and, of course, the...
The IoTs will challenge the status quo of how IT and development organizations operate. Or will it? Certainly the fog layer of IoT requires special insights about data ontology, security and transactional integrity. But the developmental challenges are the same: People, Process and Platform. In his session at @ThingsExpo, Craig Sproule, CEO of Metavine, will demonstrate how to move beyond today's coding paradigm and share the must-have mindsets for removing complexity from the development proc...
Manufacturers are embracing the Industrial Internet the same way consumers are leveraging Fitbits – to improve overall health and wellness. Both can provide consistent measurement, visibility, and suggest performance improvements customized to help reach goals. Fitbit users can view real-time data and make adjustments to increase their activity. In his session at @ThingsExpo, Mark Bernardo Professional Services Leader, Americas, at GE Digital, will discuss how leveraging the Industrial Interne...
Artificial Intelligence has the potential to massively disrupt IoT. In his session at 18th Cloud Expo, AJ Abdallat, CEO of Beyond AI, will discuss what the five main drivers are in Artificial Intelligence that could shape the future of the Internet of Things. AJ Abdallat is CEO of Beyond AI. He has over 20 years of management experience in the fields of artificial intelligence, sensors, instruments, devices and software for telecommunications, life sciences, environmental monitoring, process...
The IETF draft standard for M2M certificates is a security solution specifically designed for the demanding needs of IoT/M2M applications. In his session at @ThingsExpo, Brian Romansky, VP of Strategic Technology at TrustPoint Innovation, will explain how M2M certificates can efficiently enable confidentiality, integrity, and authenticity on highly constrained devices.
Increasing IoT connectivity is forcing enterprises to find elegant solutions to organize and visualize all incoming data from these connected devices with re-configurable dashboard widgets to effectively allow rapid decision-making for everything from immediate actions in tactical situations to strategic analysis and reporting. In his session at 18th Cloud Expo, Shikhir Singh, Senior Developer Relations Manager at Sencha, will discuss how to create HTML5 dashboards that interact with IoT devic...
We're entering the post-smartphone era, where wearable gadgets from watches and fitness bands to glasses and health aids will power the next technological revolution. With mass adoption of wearable devices comes a new data ecosystem that must be protected. Wearables open new pathways that facilitate the tracking, sharing and storing of consumers’ personal health, location and daily activity data. Consumers have some idea of the data these devices capture, but most don’t realize how revealing and...
SYS-CON Events announced today that Ericsson has been named “Gold Sponsor” of SYS-CON's @ThingsExpo, which will take place on June 7-9, 2016, at the Javits Center in New York, New York. Ericsson is a world leader in the rapidly changing environment of communications technology – providing equipment, software and services to enable transformation through mobility. Some 40 percent of global mobile traffic runs through networks we have supplied. More than 1 billion subscribers around the world re...
We’ve worked with dozens of early adopters across numerous industries and will debunk common misperceptions, which starts with understanding that many of the connected products we’ll use over the next 5 years are already products, they’re just not yet connected. With an IoT product, time-in-market provides much more essential feedback than ever before. Innovation comes from what you do with the data that the connected product provides in order to enhance the customer experience and optimize busi...
The increasing popularity of the Internet of Things necessitates that our physical and cognitive relationship with wearable technology will change rapidly in the near future. This advent means logging has become a thing of the past. Before, it was on us to track our own data, but now that data is automatically available. What does this mean for mHealth and the "connected" body? In her session at @ThingsExpo, Lisa Calkins, CEO and co-founder of Amadeus Consulting, will discuss the impact of wea...
trust and privacy in their ecosystem. Assurance and protection of device identity, secure data encryption and authentication are the key security challenges organizations are trying to address when integrating IoT devices. This holds true for IoT applications in a wide range of industries, for example, healthcare, consumer devices, and manufacturing. In his session at @ThingsExpo, Lancen LaChance, vice president of product management, IoT solutions at GlobalSign, will teach IoT developers how t...
A critical component of any IoT project is the back-end systems that capture data from remote IoT devices and structure it in a way to answer useful questions. Traditional data warehouse and analytical systems are mature technologies that can be used to handle large data sets, but they are not well suited to many IoT-scale products and the need for real-time insights. At Fuze, we have developed a backend platform as part of our mobility-oriented cloud service that uses Big Data-based approache...
In his session at @ThingsExpo, Chris Klein, CEO and Co-founder of Rachio, will discuss next generation communities that are using IoT to create more sustainable, intelligent communities. One example is Sterling Ranch, a 10,000 home development that – with the help of Siemens – will integrate IoT technology into the community to provide residents with energy and water savings as well as intelligent security. Everything from stop lights to sprinkler systems to building infrastructures will run ef...
Digital payments using wearable devices such as smart watches, fitness trackers, and payment wristbands are an increasing area of focus for industry participants, and consumer acceptance from early trials and deployments has encouraged some of the biggest names in technology and banking to continue their push to drive growth in this nascent market. Wearable payment systems may utilize near field communication (NFC), radio frequency identification (RFID), or quick response (QR) codes and barcodes...
Whether your IoT service is connecting cars, homes, appliances, wearable, cameras or other devices, one question hangs in the balance – how do you actually make money from this service? The ability to turn your IoT service into profit requires the ability to create a monetization strategy that is flexible, scalable and working for you in real-time. It must be a transparent, smoothly implemented strategy that all stakeholders – from customers to the board – will be able to understand and comprehe...
You deployed your app with the Bluemix PaaS and it's gaining some serious traction, so it's time to make some tweaks. Did you design your application in a way that it can scale in the cloud? Were you even thinking about the cloud when you built the app? If not, chances are your app is going to break. Check out this webcast to learn various techniques for designing applications that will scale successfully in Bluemix, for the confidence you need to take your apps to the next level and beyond.
SYS-CON Events announced today that Peak 10, Inc., a national IT infrastructure and cloud services provider, will exhibit at SYS-CON's 18th International Cloud Expo®, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. Peak 10 provides reliable, tailored data center and network services, cloud and managed services. Its solutions are designed to scale and adapt to customers’ changing business needs, enabling them to lower costs, improve performance and focus inter...
So, you bought into the current machine learning craze and went on to collect millions/billions of records from this promising new data source. Now, what do you do with them? Too often, the abundance of data quickly turns into an abundance of problems. How do you extract that "magic essence" from your data without falling into the common pitfalls? In her session at @ThingsExpo, Natalia Ponomareva, Software Engineer at Google, will provide tips on how to be successful in large scale machine lear...
You think you know what’s in your data. But do you? Most organizations are now aware of the business intelligence represented by their data. Data science stands to take this to a level you never thought of – literally. The techniques of data science, when used with the capabilities of Big Data technologies, can make connections you had not yet imagined, helping you discover new insights and ask new questions of your data. In his session at @ThingsExpo, Sarbjit Sarkaria, data science team lead ...
SYS-CON Events announced today that Fusion, a leading provider of cloud services, will exhibit at SYS-CON's 18th International Cloud Expo®, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. Fusion, a leading provider of integrated cloud solutions to small, medium and large businesses, is the industry's single source for the cloud. Fusion's advanced, proprietary cloud service platform enables the integration of leading edge solutions in the cloud, including cloud...