Click here to close now.




















Welcome!

Agile Computing Authors: Gregor Petri, Liz McMillan, Samuel Scott, Continuum Blog, Elizabeth White

Blog Feed Post

10 Things We Forgot to Monitor

10 Things We Forgot to Monitor

After publishing Ben’s blog post about “Memory Monitoring with LXC” we realized there is a lot of interest in articles about Monitoring. I got in touch with Jehiah Czebotar, Head of Engineering at bitly, and asked him if we could republish his blog post about things bitly forgot to monitor.

Jehiah originally published his blog post on the bitly engeneering blog. You can find Jehiah on twitter and on his personal page. Definitely check out his “Personal Annual Reports“.


There is always a set of standard metrics that are universally monitored (Disk Usage, Memory Usage, Load, Pings, etc). Beyond that, there are a lot of lessons that we’ve learned from operating our production systems that have helped shape the breadth of monitoring that we perform at bitly.

One of my favorite all-time tweets is from @DevOps_Borat

What follows is a small list of things we monitor at bitly that have grown out of those (sometimes painful!) experiences, and where possible little snippets of the stories behind those instances.

1 – Fork Rate

We once had a problem where IPv6 was intentionally disabled on a box via options ipv6 disable=1 and alias ipv6 off in /etc/modprobe.conf. This caused a large issue for us: each time a new curl object was created, modprobe would spawn, checking net-pf-10 to evaluate IPv6 status. This fork bombed the box, and we eventually tracked it down by noticing that the process counter in /proc/stat was increasing by several hundred a second. Normally you would only expect a fork rate of 1-10/sec on a production box with steady traffic.

2 – flow control packets

TL;DR; If your network configuration honors flow control packets and isn’t configured to disable them, they can temporarily cause dropped traffic. (If this doesn’t sound like an outage, you need your head checked.)

$ /usr/sbin/ethtool -S eth0 | grep flow_control
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0

Note: Read this to understand how these flow control frames can cascade to switch-wide loss of connectivity if you use certain Broadcom NIC’s. You should also trend these metrics on your switch gear. While at it, watch your dropped frames.

3 – Swap In/Out Rate

It’s common to check for swap usage above a threshold, but even if you have a small quantity of memory swapped, it’s actually the rate it’s swapped in/out that can impact performance, not the quantity. This is a much more direct check for that state.

4 – Server Boot Notification

Unexpected reboots are part of life. Do you know when they happen on your hosts? Most people don’t. We use a simple init script that triggers an ops email on system boot. This is valuable to communicate provisioning of new servers, and helps capture state change even if services handle the failure gracefully without alerting.

5 – NTP Clock Offset

If not monitored, yes, one of your servers is probably off. If you’ve never thought about clock skew you might not even be running ntpd on your servers. Generally there are 3 things to check for. 1) That ntpd is running, 2) Clock skew inside your datacenter, 3) Clock skew from your master time servers to an external source.

We use check_ntp_time for this check.

6 – DNS Resolutions

Internal DNS – It’s a hidden part of your infrastructure that you rely on more than you realize. The things to check for are 1) Local resolutions from each server, 2) If you have local DNS servers in your datacenter, you want to check resolution, and quantity of queries, 3) Check availability of each upstream DNS resolver you use.

External DNS – It’s good to verify your external domains resolve correctly against each of your published external nameservers. At bitly we also rely on several CC TLD’s and we monitor those authoritative servers directly as well (yes, it’s happened that all authoritative nameservers for a TLD have been offline).

7 – SSL Expiration

It’s the thing everyone forgets about because it happens so infrequently. The fix is easy, just check it and get alerted with enough timeframe to renew your SSL certificates.

define command{
    command_name    check_ssl_expire
    command_line    $USER1$/check_http --ssl -C 14 -H $ARG1$
}
define service{
    host_name               virtual
    service_description     bitly_com_ssl_expiration
    use                     generic-service
    check_command           check_ssl_expire!bitly.com
    contact_groups          email_only
    normal_check_interval   720
    retry_check_interval    10
    notification_interval   720
}

8 – DELL OpenManage Server Administrator (OMSA)

We run bitly split across two data centers, one is a managed environment with DELL hardware, and the second is Amazon EC2. For our DELL hardware it’s important for us to monitor the outputs from OMSA. This alerts us to RAID status, failed disks (predictive or hard failures), RAM Issues, Power Supply states and more.

9 – Connection Limits

You probably run things like memcached and mysql with connection limits, but do you monitor how close you are to those limits as you scale out application tiers?

Related to this is addressing the issue of processes running into file descriptor limits. We make a regular practice of running services with ulimit -n 65535 in our run scripts to minimize this. We also set Nginx worker_rlimit_nofile.

10 – Load Balancer Status.

We configure our Load Balancers with a health check which we can easily force to fail in order to have any given server removed from rotation.We’ve found it important to have visibility into the health check state, so we monitor and alert based on the same health check. (If you use EC2 Load Balancers you can monitor the ELB state from Amazon API’s).

Various Other things to watch

New entries written to Nginx Error Logs, service restarts (assuming you have something in place to auto-restart them on failure), numa stats, new process core dumps (great if you run any C code).


We want to thank Jehiah for making his original article available to our readers. Is there something missing on this list in your opportunity? Let us know in the comments!

Codeship – A hosted Continuous Deployment platform for web applications

Read the original blog entry...

More Stories By Manuel Weiss

I am the cofounder of Codeship – a hosted Continuous Integration and Deployment platform for web applications. On the Codeship blog we love to write about Software Testing, Continuos Integration and Deployment. Also check out our weekly screencast series 'Testing Tuesday'!

@ThingsExpo Stories
As more intelligent IoT applications shift into gear, they’re merging into the ever-increasing traffic flow of the Internet. It won’t be long before we experience bottlenecks, as IoT traffic peaks during rush hours. Organizations that are unprepared will find themselves by the side of the road unable to cross back into the fast lane. As billions of new devices begin to communicate and exchange data – will your infrastructure be scalable enough to handle this new interconnected world?
While many app developers are comfortable building apps for the smartphone, there is a whole new world out there. In his session at @ThingsExpo, Narayan Sainaney, Co-founder and CTO of Mojio, will discuss how the business case for connected car apps is growing and, with open platform companies having already done the heavy lifting, there really is no barrier to entry.
WebRTC has had a real tough three or four years, and so have those working with it. Only a few short years ago, the development world were excited about WebRTC and proclaiming how awesome it was. You might have played with the technology a couple of years ago, only to find the extra infrastructure requirements were painful to implement and poorly documented. This probably left a bitter taste in your mouth, especially when things went wrong.
As more and more data is generated from a variety of connected devices, the need to get insights from this data and predict future behavior and trends is increasingly essential for businesses. Real-time stream processing is needed in a variety of different industries such as Manufacturing, Oil and Gas, Automobile, Finance, Online Retail, Smart Grids, and Healthcare. Azure Stream Analytics is a fully managed distributed stream computation service that provides low latency, scalable processing of streaming data in the cloud with an enterprise grade SLA. It features built-in integration with Azur...
Through WebRTC, audio and video communications are being embedded more easily than ever into applications, helping carriers, enterprises and independent software vendors deliver greater functionality to their end users. With today’s business world increasingly focused on outcomes, users’ growing calls for ease of use, and businesses craving smarter, tighter integration, what’s the next step in delivering a richer, more immersive experience? That richer, more fully integrated experience comes about through a Communications Platform as a Service which allows for messaging, screen sharing, video...
SYS-CON Events announced today that IceWarp will exhibit at the 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. IceWarp, the leader of cloud and on-premise messaging, delivers secured email, chat, documents, conferencing and collaboration to today's mobile workforce, all in one unified interface
The Internet of Things (IoT) is about the digitization of physical assets including sensors, devices, machines, gateways, and the network. It creates possibilities for significant value creation and new revenue generating business models via data democratization and ubiquitous analytics across IoT networks. The explosion of data in all forms in IoT requires a more robust and broader lens in order to enable smarter timely actions and better outcomes. Business operations become the key driver of IoT applications and projects. Business operations, IT, and data scientists need advanced analytics t...
With the proliferation of connected devices underpinning new Internet of Things systems, Brandon Schulz, Director of Luxoft IoT – Retail, will be looking at the transformation of the retail customer experience in brick and mortar stores in his session at @ThingsExpo. Questions he will address include: Will beacons drop to the wayside like QR codes, or be a proximity-based profit driver? How will the customer experience change in stores of all types when everything can be instrumented and analyzed? As an area of investment, how might a retail company move towards an innovation methodolo...
Too often with compelling new technologies market participants become overly enamored with that attractiveness of the technology and neglect underlying business drivers. This tendency, what some call the “newest shiny object syndrome,” is understandable given that virtually all of us are heavily engaged in technology. But it is also mistaken. Without concrete business cases driving its deployment, IoT, like many other technologies before it, will fade into obscurity.
Consumer IoT applications provide data about the user that just doesn’t exist in traditional PC or mobile web applications. This rich data, or “context,” enables the highly personalized consumer experiences that characterize many consumer IoT apps. This same data is also providing brands with unprecedented insight into how their connected products are being used, while, at the same time, powering highly targeted engagement and marketing opportunities. In his session at @ThingsExpo, Nathan Treloar, President and COO of Bebaio, will explore examples of brands transforming their businesses by t...
SYS-CON Events announced today that Micron Technology, Inc., a global leader in advanced semiconductor systems, will exhibit at the 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. Micron’s broad portfolio of high-performance memory technologies – including DRAM, NAND and NOR Flash – is the basis for solid state drives, modules, multichip packages and other system solutions. Backed by more than 35 years of technology leadership, Micron's memory solutions enable the world's most innovative computing, consumer,...
SYS-CON Events announced today that Pythian, a global IT services company specializing in helping companies leverage disruptive technologies to optimize revenue-generating systems, has been named “Bronze Sponsor” of SYS-CON's 17th Cloud Expo, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. Founded in 1997, Pythian is a global IT services company that helps companies compete by adopting disruptive technologies such as cloud, Big Data, advanced analytics, and DevOps to advance innovation and increase agility. Specializing in designing, imple...
Akana has announced the availability of the new Akana Healthcare Solution. The API-driven solution helps healthcare organizations accelerate their transition to being secure, digitally interoperable businesses. It leverages the Health Level Seven International Fast Healthcare Interoperability Resources (HL7 FHIR) standard to enable broader business use of medical data. Akana developed the Healthcare Solution in response to healthcare businesses that want to increase electronic, multi-device access to health records while reducing operating costs and complying with government regulations.
SYS-CON Events announced today that HPM Networks will exhibit at the 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. For 20 years, HPM Networks has been integrating technology solutions that solve complex business challenges. HPM Networks has designed solutions for both SMB and enterprise customers throughout the San Francisco Bay Area.
For IoT to grow as quickly as analyst firms’ project, a lot is going to fall on developers to quickly bring applications to market. But the lack of a standard development platform threatens to slow growth and make application development more time consuming and costly, much like we’ve seen in the mobile space. In his session at @ThingsExpo, Mike Weiner, Product Manager of the Omega DevCloud with KORE Telematics Inc., discussed the evolving requirements for developers as IoT matures and conducted a live demonstration of how quickly application development can happen when the need to comply wit...
The Internet of Everything (IoE) brings together people, process, data and things to make networked connections more relevant and valuable than ever before – transforming information into knowledge and knowledge into wisdom. IoE creates new capabilities, richer experiences, and unprecedented opportunities to improve business and government operations, decision making and mission support capabilities.
Explosive growth in connected devices. Enormous amounts of data for collection and analysis. Critical use of data for split-second decision making and actionable information. All three are factors in making the Internet of Things a reality. Yet, any one factor would have an IT organization pondering its infrastructure strategy. How should your organization enhance its IT framework to enable an Internet of Things implementation? In his session at @ThingsExpo, James Kirkland, Red Hat's Chief Architect for the Internet of Things and Intelligent Systems, described how to revolutionize your archit...
MuleSoft has announced the findings of its 2015 Connectivity Benchmark Report on the adoption and business impact of APIs. The findings suggest traditional businesses are quickly evolving into "composable enterprises" built out of hundreds of connected software services, applications and devices. Most are embracing the Internet of Things (IoT) and microservices technologies like Docker. A majority are integrating wearables, like smart watches, and more than half plan to generate revenue with APIs within the next year.
Growth hacking is common for startups to make unheard-of progress in building their business. Career Hacks can help Geek Girls and those who support them (yes, that's you too, Dad!) to excel in this typically male-dominated world. Get ready to learn the facts: Is there a bias against women in the tech / developer communities? Why are women 50% of the workforce, but hold only 24% of the STEM or IT positions? Some beginnings of what to do about it! In her Opening Keynote at 16th Cloud Expo, Sandy Carter, IBM General Manager Cloud Ecosystem and Developers, and a Social Business Evangelist, d...
In his keynote at 16th Cloud Expo, Rodney Rogers, CEO of Virtustream, discussed the evolution of the company from inception to its recent acquisition by EMC – including personal insights, lessons learned (and some WTF moments) along the way. Learn how Virtustream’s unique approach of combining the economics and elasticity of the consumer cloud model with proper performance, application automation and security into a platform became a breakout success with enterprise customers and a natural fit for the EMC Federation.