By Andreas Grabner
November 4, 2013 03:00 PM EST
I personally don't like the term "War Room" when describing a firefighting situation that many software companies have to deal with when systems go down or have problems. The way these war rooms typically play out is that key personnel (engineers, operations, business) are summoned into a room until the problem is solved. This was the case back with the Apollo 13 mission and still is now when we look at the famous Facebook war room from Dec 2012:
The War Room back then - And Now: Not a whole lot different
What's the problem with these pictures? There are a lot of people in the room who have no clue whether the problem at hand is actually something they can fix or are responsible for. All of these people are summoned without first figuring out who should look at the problem. Why is that? Because the collected "evidence" in the form of infrastructure monitoring data, log files, user complaints, etc., just shows symptoms but doesn't tell us anything about the actual impact and root cause of the issues:
Would you know whom to bring into a war room based on these "facts"? Would you want to be one of them?
Looking at the previous image, it is hard to tell which people need to get in a room. Do we just need an Ops guy to restart the process that consumes all of the CPU? Or do we need an application expert who sifts through log files? Do we need to contact our mobile solution provider because it is an actual problem in the third-party native mobile app? The typical MO is to simply call in everybody to figure out the root cause of the problem, which pulls critical resources from other important projects without even knowing whether these folks can actually help solve the problem. How can we change this? By asking the right questions first!
The 10 Real Questions to Ask
You don't need nice and shiny dashboards that show you an aggregated overview of Twitter statuses, infrastructure health or insight into slow application transactions. You need data to answer the following questions - whether it is presented in nice dashboards or log files doesn't really matter:
Having answers to these 10 questions avoids calling too many people into a war room and improves the handling of critical application problems
1. Is an individual user complaining?
Is it "just" the CEO that complains about a problem with your newly deployed internal app because a report doesn't work on his old IE6? Or is it "just" the end user in a remote location that still uses dial-up? Knowing whether a problem just happens for a single or a very small group of users is important to prioritize.
Analyzing the problem of the complaining user lets us assess whether it is a problem related to just "that" user, e.g., using an unsupported browser version, slow network connectivity, ...
2. Are "all" users impacted?
If a large number of users are impacted, you still need to know about it, even if no individual user really complains. Fixing problems that impact a large number of your users is critical.
Having evidence that a large number of people in a certain region, or using a certain browser or device, are impacted makes it easy to prioritize this issue
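To make the "single user or many users" distinction concrete, here is a minimal sketch that computes the share of impacted visits per region or browser. The record fields (`region`, `browser`, `response_ms`, `failed`) and the 3-second threshold are assumptions for illustration, not the format of any particular monitoring tool.

```python
from collections import Counter

def impacted_share(visits, dimension, threshold_ms=3000):
    """Return the share of slow or failed visits for each value of `dimension`."""
    totals = Counter()
    impacted = Counter()
    for visit in visits:
        key = visit[dimension]
        totals[key] += 1
        # A visit counts as impacted if it failed or exceeded the (assumed) threshold.
        if visit["failed"] or visit["response_ms"] > threshold_ms:
            impacted[key] += 1
    return {key: impacted[key] / totals[key] for key in totals}

# Hypothetical sample data.
visits = [
    {"region": "EU", "browser": "IE6", "response_ms": 9000, "failed": False},
    {"region": "EU", "browser": "Chrome", "response_ms": 800, "failed": False},
    {"region": "US", "browser": "Chrome", "response_ms": 700, "failed": False},
    {"region": "US", "browser": "IE6", "response_ms": 1200, "failed": True},
]

by_browser = impacted_share(visits, "browser")
print(by_browser)  # {'IE6': 1.0, 'Chrome': 0.0}
```

In this made-up sample every IE6 visit is impacted while Chrome is fine, which points at a browser-specific problem rather than a general outage.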
3. Is the problem in the application?
The next question, after knowing whether users are impacted or not, is whether the problem is in the application. This allows us to call in the application experts, architects and developers if needed. Looking at the performance distribution gives us an overview of where our hotspots really are:
Where are the performance and problem hotspots? Is it really the application? Or do we need to involve other teams?
4. Is there a problem in the delivery chain?
Modern web applications rely on a long list of services along the delivery chain that lie outside of our own data center. That includes CDNs, third-party services, ISPs or mobile networks. Knowing the status of these services and their impact on the end user performance of our own application allows us to decide whether to look into our own data center or to call up Akamai, Facebook & Co:
Do CDNs or other third-party services experience any performance issues and is that the root cause of our complaining users?
5. Is one uncritical transaction impacted?
When the error rate goes up - is it a critical transaction such as search? Or is it a rather uncritical one such as the Contact Page? Or is a bot causing lots of errors because it crawls through pages that do not exist anyway or that require authentication, and with that skews the overall error rate?
Analyzing which transactions drive the error rate may show you that they are not critical, because the errors are either caused by a bot or occur on pages that are not business critical
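The effect of bot traffic on the error rate can be illustrated with a short sketch. The request fields (`transaction`, `status`, `user_agent`) and the user-agent markers are assumptions; real bot detection is usually more involved than a substring check.

```python
from collections import defaultdict

# Assumption: a crude user-agent heuristic, good enough for illustration only.
BOT_MARKERS = ("bot", "crawler", "spider")

def is_bot(user_agent):
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)

def error_rates(requests, include_bots=False):
    """Return the error rate (HTTP status >= 400) per transaction."""
    counts = defaultdict(lambda: [0, 0])  # transaction -> [errors, total]
    for req in requests:
        if not include_bots and is_bot(req["user_agent"]):
            continue
        entry = counts[req["transaction"]]
        entry[1] += 1
        if req["status"] >= 400:
            entry[0] += 1
    return {name: errors / total for name, (errors, total) in counts.items()}

# Hypothetical sample data.
requests = [
    {"transaction": "search", "status": 200, "user_agent": "Mozilla/5.0"},
    {"transaction": "search", "status": 200, "user_agent": "Mozilla/5.0"},
    {"transaction": "contact", "status": 404, "user_agent": "ExampleBot/1.0"},
    {"transaction": "contact", "status": 404, "user_agent": "ExampleBot/1.0"},
    {"transaction": "contact", "status": 200, "user_agent": "Mozilla/5.0"},
]

with_bots = error_rates(requests, include_bots=True)
without_bots = error_rates(requests)
```

With the bot included, the contact page shows two errors out of three requests; without it, neither transaction has any errors, so there is nothing here that justifies a war room.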
6. Are critical transactions impacted?
What if your critical transactions are impacted, such as the landing page, login, search, or entering a ticket in your support system? These are critical transactions to you, your end users, or your colleagues who need to use the back office software for their daily tasks. If these are impacted you need to act fast. Therefore it is important to monitor these critical transactions for failure rate as well as performance. Problems in these transactions are more urgent than problems in transactions that are not vital to your business - and - you also know which subject matter experts to call:
Monitoring your critical transactions allows you to identify problems in those areas that are critical for your business
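Monitoring critical transactions for both failure rate and performance could look roughly like the following sketch. The transaction names, the limits, and the use of a 95th percentile response time are hypothetical choices, not values from the article.

```python
from dataclasses import dataclass

@dataclass
class Stats:
    failure_rate: float  # fraction of failed requests, 0.0 to 1.0
    p95_ms: float        # 95th percentile response time in milliseconds

# Assumption: limits you would agree on with the business owners.
CRITICAL_LIMITS = {
    "login": Stats(failure_rate=0.01, p95_ms=1500),
    "search": Stats(failure_rate=0.01, p95_ms=2000),
}

def violations(observed, limits=CRITICAL_LIMITS):
    """Return (transaction, metric) pairs where an observed value exceeds its limit."""
    alerts = []
    for name, limit in limits.items():
        stats = observed.get(name)
        if stats is None:
            continue  # no data for this transaction in the current window
        if stats.failure_rate > limit.failure_rate:
            alerts.append((name, "failure_rate"))
        if stats.p95_ms > limit.p95_ms:
            alerts.append((name, "response_time"))
    return alerts

# Hypothetical measurements for the current time window.
observed = {
    "login": Stats(failure_rate=0.05, p95_ms=900),
    "search": Stats(failure_rate=0.001, p95_ms=2500),
    "contact": Stats(failure_rate=0.30, p95_ms=4000),  # not critical, ignored
}

print(violations(observed))
# [('login', 'failure_rate'), ('search', 'response_time')]
```

Only transactions listed in the limits are checked, so a noisy but uncritical page such as the contact form does not raise an alert.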
7. Is the problem related to bad coding?
If application response time is getting slower, the first question is whether it is because of bad coding. Analyzing the performance hotspot down to the code level can tell you whether most of the time is spent in inefficient algorithms or in code that simply does not follow coding and architectural best practices:
Throwing thousands of exceptions to control program flow is not a good coding practice and also impacts performance
8. Does the infrastructure cause an issue?
What if it is not the app itself, but the app is running low on resources provided by the infrastructure? What if the CPU required to run the Garbage Collector is not available because the already overutilized machine also runs lots of other services? In that case it is time to think about the infrastructure - better distributing these applications and services or scaling the infrastructure:
Where does the memory shortage come from? Does it impact other processes on that machine? Which processes should be moved to a different machine?
9. Is the AppServer the issue?
Depending on the AppServer you are using, you have multiple configuration options to optimize it for your environment. The question remains whether the AppServer might be responsible for performance issues caused by an incorrect setting or a corrupt deployment. Resource pool sizing (threads, database connections, ...), security settings or logging options can all impact performance. If it turns out that the AppServer is the problem, contact your IBM, Oracle, Microsoft ... specialist:
A global synchronized logging feature of IBM's WebSphere caused this performance issue, which can be resolved through configuration settings in the AppServer
10. Is the problem in the virtual machine?
Leveraging virtual compute power - whether it is from your locally running VM server farm or from one of the cloud providers - provides lots of flexibility. But it can also be the reason for performance problems if the virtual machines are not properly sized or are battling for resources with other virtual machines on the same virtual server. Knowing the impact of virtualization on the application allows you to call in the VM experts and not the app developers to solve a problem:
Understanding what is going on in EC2, Azure or your VMware ESX Server allows you to figure out whether the virtualized environment is the root cause
Have an Answer to These Questions?
Now that you have an idea about the right questions to ask before you call a war room session together - or before you accept a call into such a scenario - you can start focusing on preventing these sessions. Whether you are a developer, architect or on the business side, make sure you have the real facts available in order to get through these situations as fast as possible by calling in the RIGHT people and giving them the RIGHT data to analyze.
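As a rough illustration of calling in only the right people, the answers to the questions above can be mapped to the teams mentioned in this post. The finding keys and team labels below are hypothetical names chosen for this sketch.

```python
# Assumption: each finding is a yes/no answer to one of the questions above.
ROUTING = [
    ("problem_in_application", "application experts, architects and developers"),
    ("problem_in_delivery_chain", "CDN or third-party provider"),
    ("infrastructure_short_on_resources", "infrastructure / Ops team"),
    ("appserver_misconfigured", "AppServer vendor specialist"),
    ("virtual_machine_contention", "VM experts"),
]

def who_to_call(findings):
    """Return only the teams whose area is implicated by the findings."""
    return [team for key, team in ROUTING if findings.get(key)]

findings = {
    "problem_in_application": False,
    "problem_in_delivery_chain": True,
    "virtual_machine_contention": True,
}

print(who_to_call(findings))
# ['CDN or third-party provider', 'VM experts']
```

Anything not answered with a clear "yes" keeps that team out of the room, which is the whole point of asking the questions first.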
Better than spending time in War Rooms, however, is to reduce the number of times these situations come up. If you want to learn more about this, check out some of the other blog posts we recently wrote such as Performance-focused DevOps or - in case you happen to be getting ready for the holiday shopping season - Verify Readiness in Test & Pre-Production.