|By Manuel Weiss
|March 11, 2014 01:11 PM EDT
After publishing Ben’s blog post about “Memory Monitoring with LXC” we realized there is a lot of interest in articles about Monitoring. I got in touch with Jehiah Czebotar, Head of Engineering at bitly, and asked him if we could republish his blog post about things bitly forgot to monitor.
Jehiah originally published his blog post on the bitly engeneering blog. You can find Jehiah on twitter and on his personal page. Definitely check out his “Personal Annual Reports“.
There is always a set of standard metrics that are universally monitored (Disk Usage, Memory Usage, Load, Pings, etc). Beyond that, there are a lot of lessons that we’ve learned from operating our production systems that have helped shape the breadth of monitoring that we perform at bitly.
One of my favorite all-time tweets is from @DevOps_Borat
What follows is a small list of things we monitor at bitly that have grown out of those (sometimes painful!) experiences, and where possible little snippets of the stories behind those instances.
1 – Fork Rate
We once had a problem where IPv6 was intentionally disabled on a box via
options ipv6 disable=1 and
alias ipv6 off in
/etc/modprobe.conf. This caused a large issue for us: each time a new curl object was created,
modprobe would spawn, checking
net-pf-10 to evaluate IPv6 status. This fork bombed the box, and we eventually tracked it down by noticing that the process counter in
/proc/stat was increasing by several hundred a second. Normally you would only expect a fork rate of 1-10/sec on a production box with steady traffic.
2 – flow control packets
TL;DR; If your network configuration honors flow control packets and isn’t configured to disable them, they can temporarily cause dropped traffic. (If this doesn’t sound like an outage, you need your head checked.)
$ /usr/sbin/ethtool -S eth0 | grep flow_control
Note: Read this to understand how these flow control frames can cascade to switch-wide loss of connectivity if you use certain Broadcom NIC’s. You should also trend these metrics on your switch gear. While at it, watch your dropped frames.
3 – Swap In/Out Rate
It’s common to check for swap usage above a threshold, but even if you have a small quantity of memory swapped, it’s actually the rate it’s swapped in/out that can impact performance, not the quantity. This is a much more direct check for that state.
4 – Server Boot Notification
Unexpected reboots are part of life. Do you know when they happen on your hosts? Most people don’t. We use a simple init script that triggers an ops email on system boot. This is valuable to communicate provisioning of new servers, and helps capture state change even if services handle the failure gracefully without alerting.
5 – NTP Clock Offset
If not monitored, yes, one of your servers is probably off. If you’ve never thought about clock skew you might not even be running
ntpd on your servers. Generally there are 3 things to check for. 1) That
ntpd is running, 2) Clock skew inside your datacenter, 3) Clock skew from your master time servers to an external source.
We use check_ntp_time for this check.
6 – DNS Resolutions
Internal DNS – It’s a hidden part of your infrastructure that you rely on more than you realize. The things to check for are 1) Local resolutions from each server, 2) If you have local DNS servers in your datacenter, you want to check resolution, and quantity of queries, 3) Check availability of each upstream DNS resolver you use.
External DNS – It’s good to verify your external domains resolve correctly against each of your published external nameservers. At bitly we also rely on several CC TLD’s and we monitor those authoritative servers directly as well (yes, it’s happened that all authoritative nameservers for a TLD have been offline).
7 – SSL Expiration
It’s the thing everyone forgets about because it happens so infrequently. The fix is easy, just check it and get alerted with enough timeframe to renew your SSL certificates.
command_line $USER1$/check_http --ssl -C 14 -H $ARG1$
8 – DELL OpenManage Server Administrator (OMSA)
We run bitly split across two data centers, one is a managed environment with DELL hardware, and the second is Amazon EC2. For our DELL hardware it’s important for us to monitor the outputs from OMSA. This alerts us to RAID status, failed disks (predictive or hard failures), RAM Issues, Power Supply states and more.
9 – Connection Limits
You probably run things like memcached and mysql with connection limits, but do you monitor how close you are to those limits as you scale out application tiers?
Related to this is addressing the issue of processes running into file descriptor limits. We make a regular practice of running services with
ulimit -n 65535 in our run scripts to minimize this. We also set Nginx worker_rlimit_nofile.
10 – Load Balancer Status.
We configure our Load Balancers with a health check which we can easily force to fail in order to have any given server removed from rotation.We’ve found it important to have visibility into the health check state, so we monitor and alert based on the same health check. (If you use EC2 Load Balancers you can monitor the ELB state from Amazon API’s).
Various Other things to watch
New entries written to Nginx Error Logs, service restarts (assuming you have something in place to auto-restart them on failure), numa stats, new process core dumps (great if you run any C code).
We want to thank Jehiah for making his original article available to our readers. Is there something missing on this list in your opportunity? Let us know in the comments!
Read the original blog entry...
The essence of data analysis involves setting up data pipelines that consist of several operations that are chained together – starting from data collection, data quality checks, data integration, data analysis and data visualization (including the setting up of interaction paths in that visualization). In our opinion, the challenges stem from the technology diversity at each stage of the data pipeline as well as the lack of process around the analysis.
May. 25, 2016 08:30 PM EDT Reads: 1,194
The IoT is changing the way enterprises conduct business.
In his session at @ThingsExpo, Eric Hoffman, Vice President at EastBanc Technologies, discuss how businesses can gain an edge over competitors by empowering consumers to take control through IoT. We'll cite examples such as a Washington, D.C.-based sports club that leveraged IoT and the cloud to develop a comprehensive booking system. He'll also highlight how IoT can revitalize and restore outdated business models, making them profitable...
May. 25, 2016 08:15 PM EDT Reads: 2,528
A strange thing is happening along the way to the Internet of Things, namely far too many devices to work with and manage. It has become clear that we'll need much higher efficiency user experiences that can allow us to more easily and scalably work with the thousands of devices that will soon be in each of our lives.
Enter the conversational interface revolution, combining bots we can literally talk with, gesture to, and even direct with our thoughts, with embedded artificial intelligence, wh...
May. 25, 2016 08:00 PM EDT Reads: 1,871
The 19th International Cloud Expo has announced that its Call for Papers is open. Cloud Expo, to be held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, brings together Cloud Computing, Big Data, Internet of Things, DevOps, Containers, Microservices and WebRTC to one location.
With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding business opportunity. Submit y...
May. 25, 2016 06:00 PM EDT Reads: 1,884
SYS-CON Events announced today that MangoApps will exhibit at SYS-CON's 18th International Cloud Expo®, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY.
MangoApps provides modern company intranets and team collaboration software, allowing workers to stay connected and productive from anywhere in the world and from any device.
For more information, please visit https://www.mangoapps.com/.
May. 25, 2016 05:45 PM EDT Reads: 457
SYS-CON Events announced today that ContentMX, the marketing technology and services company with a singular mission to increase engagement and drive more conversations for enterprise, channel and SMB technology marketers, has been named “Sponsor & Exhibitor Lounge Sponsor” of SYS-CON's 18th Cloud Expo, which will take place on June 7-9, 2016, at the Javits Center in New York City, New York.
“CloudExpo is a great opportunity to start a conversation with new prospects, but what happens after the...
May. 25, 2016 05:00 PM EDT Reads: 870
Internet of @ThingsExpo, taking place November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with the 19th International Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world and ThingsExpo New York Call for Papers is now open.
May. 25, 2016 05:00 PM EDT Reads: 1,721
Designing IoT applications is complex, but deploying them in a scalable fashion is even more complex. A scalable, API first IaaS cloud is a good start, but in order to understand the various components specific to deploying IoT applications, one needs to understand the architecture of these applications and figure out how to scale these components independently.
In his session at @ThingsExpo, Nara Rajagopalan is CEO of Accelerite, will discuss the fundamental architecture of IoT applications, ...
May. 25, 2016 04:45 PM EDT Reads: 946
In his session at 18th Cloud Expo, Bruce Swann, Senior Product Marketing Manager at Adobe, will discuss how the Adobe Marketing Cloud can help marketers embrace opportunities for personalized, relevant and real-time customer engagement across offline (direct mail, point of sale, call center) and digital (email, website, SMS, mobile apps, social networks, connected objects).
Bruce Swann has more than 15 years of experience working with digital marketing disciplines like web analytics, social med...
May. 25, 2016 04:00 PM EDT Reads: 1,192
SYS-CON Events announced today that Enzu, a leading provider of cloud hosting solutions, will exhibit at SYS-CON's 18th International Cloud Expo®, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY.
Enzu’s mission is to be the leading provider of enterprise cloud solutions worldwide. Enzu enables online businesses to use its IT infrastructure to their competitive advantage. By offering a suite of proven hosting and management services, Enzu wants companies to foc...
May. 25, 2016 03:45 PM EDT Reads: 2,121
Customer experience has become a competitive differentiator for companies, and it’s imperative that brands seamlessly connect the customer journey across all platforms. With the continued explosion of IoT, join us for a look at how to build a winning digital foundation in the connected era – today and in the future.
In his session at @ThingsExpo, Chris Nguyen, Group Product Marketing Manager at Adobe, will discuss how to successfully leverage mobile, rapidly deploy content, capture real-time d...
May. 25, 2016 02:45 PM EDT Reads: 1,439
IoT generates lots of temporal data. But how do you unlock its value?
How do you coordinate the diverse moving parts that must come together when developing your IoT product?
What are the key challenges addressed by Data as a Service?
How does cloud computing underlie and connect the notions of Digital and DevOps
What is the impact of the API economy?
What is the business imperative for Cognitive Computing?
Get all these questions and hundreds more like them answered at the 18th Cloud Expo...
May. 25, 2016 02:15 PM EDT Reads: 2,130
As cloud and storage projections continue to rise, the number of organizations moving to the cloud is escalating and it is clear cloud storage is here to stay. However, is it secure?
Data is the lifeblood for government entities, countries, cloud service providers and enterprises alike and losing or exposing that data can have disastrous results.
There are new concepts for data storage on the horizon that will deliver secure solutions for storing and moving sensitive data around the world.
May. 25, 2016 02:00 PM EDT Reads: 1,143
What a difference a year makes. Organizations aren’t just talking about IoT possibilities, it is now baked into their core business strategy. With IoT, billions of devices generating data from different companies on different networks around the globe need to interact. From efficiency to better customer insights to completely new business models, IoT will turn traditional business models upside down. In the new customer-centric age, the key to success is delivering critical services and apps wit...
May. 25, 2016 01:45 PM EDT Reads: 959
SYS-CON Events announced today that 24Notion has been named “Bronze Sponsor” of SYS-CON's 18th Cloud Expo, which will take place on June 7-9, 2016, at the Javits Center in New York, New York.
24Notion is full-service global creative digital marketing, technology and lifestyle agency that combines strategic ideas with customized tactical execution. With a broad understand of the art of traditional marketing, new media, communications and social influence, 24Notion uniquely understands how to con...
May. 25, 2016 09:45 AM EDT Reads: 1,716
WebRTC is bringing significant change to the communications landscape that will bridge the worlds of web and telephony, making the Internet the new standard for communications. Cloud9 took the road less traveled and used WebRTC to create a downloadable enterprise-grade communications platform that is changing the communication dynamic in the financial sector.
In his session at @ThingsExpo, Leo Papadopoulos, CTO of Cloud9, will discuss the importance of WebRTC and how it enables companies to fo...
May. 25, 2016 04:45 AM EDT Reads: 2,444
SYS-CON Events announced today TechTarget has been named “Media Sponsor” of SYS-CON's 18th International Cloud Expo, which will take place on June 7–9, 2016, at the Javits Center in New York City, NY, and the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.
TechTarget is the Web’s leading destination for serious technology buyers researching and making enterprise technology decisions. Its extensive global networ...
May. 25, 2016 04:15 AM EDT Reads: 3,059
Korean Broadcasting System (KBS) will feature the upcoming 18th Cloud Expo | @ThingsExpo in a New York news documentary about the "New IT for the Future." The documentary will cover how big companies are transmitting or adopting the new IT for the future and will be filmed on the expo floor between June 7-June 9, 2016, at the Javits Center in New York City, New York. KBS has long been a leader in the development of the broadcasting culture of Korea. As the key public service broadcaster of Korea...
May. 25, 2016 04:00 AM EDT Reads: 1,723
With major technology companies and startups seriously embracing IoT strategies, now is the perfect time to attend @ThingsExpo 2016 in New York and Silicon Valley. Learn what is going on, contribute to the discussions, and ensure that your enterprise is as "IoT-Ready" as it can be! Internet of @ThingsExpo, taking place Nov 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 17th Cloud Expo and will feature technical sessions from a rock star conference faculty ...
May. 24, 2016 06:00 PM EDT Reads: 4,690
There are several IoTs: the Industrial Internet, Consumer Wearables, Wearables and Healthcare, Supply Chains, and the movement toward Smart Grids, Cities, Regions, and Nations. There are competing communications standards every step of the way, a bewildering array of sensors and devices, and an entire world of competing data analytics platforms. To some this appears to be chaos.
In this power panel at @ThingsExpo, moderated by Conference Chair Roger Strukhoff, panelists will discuss the vast to...
May. 24, 2016 04:00 PM EDT Reads: 2,388