|By Andreas Grabner||
|November 4, 2013 03:00 PM EST||
I personally don't like the term "War Room" when describing a firefighting situation that many software companies have to deal with when systems go down or have problems. The way these war rooms typically play out is that key personnel (engineers, operations, business) are summoned into a room until the problem is solved. This was the case back with the Apollo 13 mission and still is now when we look at the famous Facebook war room from Dec 2012:
The War Room back then - And Now: Not a whole lot different
What's the problem with these pictures? There are a lot of people in the room that have no clue whether the problem on hand is actually something they can fix or are responsible for. All of these people are summoned without first figuring out which people should look at the problem. Why is that? Because the collected "evidence" in the form of infrastructure monitoring data, log files, user complaints, etc., just shows symptoms but doesn't tell us anything about the actual impact and root cause of issues:
Would you know whom to bring into a war room based on these "facts"? Would you want to be one of them?
Looking at the previous image, it is hard to tell which people need to get in a room. Do we just need an Ops guy to restart the process that consumes all of the CPU? Or do we need an application expert that sifts through log files? Do we need to contact our mobile solution provider because it is an actual problem in the 3rd party mobile native app? The typical MO is to simply call-in everybody to figure out the root cause of the problem and with that pulling critical resources from other important projects without even knowing if these folks can actually help solving these problems. How can we change this? By asking the right questions first!
The 10 Real Questions to Ask
You don't need nice and shiny dashboards that show you an aggregated overview of twitter statuses, infrastructure health or insight into slow application transactions. You need data to answer the following questions - whether it is presented in nice dashboards or log files doesn't really matter:
Having answers to these 10 questions avoids calling too many people in a war room and improves handling of critical application problems
1. Is an individual user complaining?
Is it "just" the CEO that complains about a problem with your newly deployed internal app because a report doesn't work on his old IE6? Or is it "just" the end user in a remote location that still uses dial-up? Knowing whether a problem just happens for a single or a very small group of users is important to prioritize.
Analyzing the problem of the complaining user lets us assess whether it is a problem related to just "that" user, e.g, using an unsupported browser version, slow network connectivity,...
2. Are "all" users impacted?
If a large number of users are impacted but you may not have individuals that really complain about it you still need to know as it is very critical to you fix any problems that impact a large number of your users?
Having the evidence that a large number of people in a certain region, using a certain browser or a certain device makes it easy to prioritize this issue
3. Is the problem in the application?
The next question, after knowing whether users are impacted or not, is to figure out if the problem is in the application or not. This allows us to call in the application experts, architects and developers if needed. Looking at the performance distribution gives us an overview where our hotspots really are:
Where are the performance and problem hotspots? Is it really the application? Or do we need to involve other teams?
4. Is there a problem in the delivery chain?
Modern web applications rely on a long list of services along the delivery chain that lies outside of our own data center. That includes CDNs, third-party services, ISPs or mobile networks. Knowing the status of these services and their impact on end user performance of our own application allows us to answer whether to look into our own data center or calling up Akamai, Facebook & Co:
Do CDNs or other third-party services experience any performance issues and is that the root cause of our complaining users?
5. Is one uncritical transaction impacted?
When the error rate goes up - is it a critical transaction such as search? Or is it rather uncritical such as the Contact Page. Or is a BOT causing lots of errors because it crawls through pages that do not exist anyway or that require authentication and with that skews the overall error rate?
Analyzing which transactions drive the error rate may show you that these are not critical because either caused by a BOT or on pages that are not business critical
6. Are critical transactions impacted?
What if your critical transactions are impacted such as the landing page, login, search, or entering a ticket in your support system? These are critical transactions to you, your end users, or your colleagues that need to use the back office software for their daily tasks. If these are impacted you need to act fast. Therefore it is important to monitor these critical transactions on failure rate as well as performance. If these are impacted it is more important to act than other transactions that are not vital to your business - and - you also know which subject matter experts to call:
Monitoring your critical transactions allows you to identify problems on those areas that are critical for your business
7. Is the problem related to bad coding?
If application response time is getting slower, the first question is whether it is because of bad coding. Analyzing the performance hotspot to the code level can tell you whether most of the time is spent because of inefficient algorithms or just not following coding and architectural best practices:
Throwing thousands of exceptions to control program flow is not a good coding practice and also impacts performance
8. Does the infrastructure cause an issue?
What if it is not the app itself, but the app is running low on resources provided by the infrastructure? What if the CPU required to run the Garbage Collector is not available because the machine also runs lots of other services on an already over utilized machine? In that case it is time to think about the infrastructure - better distributing these applications and services or scaling the infrastructure:
Where does the memory shortage come from? Does it impact other processes on that machine? Which processes to move to a different machine?
9. Is the AppServer the issue?
Depending on the AppServer you are using you have multiple configuration options to optimize the usage for your environment. The question remains whether the AppServer might be responsible for performance issues caused by an incorrect setting or corrupt deployment. Correct resource pool (threads, database connection, ...) sizing, security settings or logging options can impact the performance. If it turns out that the AppServer is the problem contact your IBM, Oracle, Microsoft ... specialist:
A global synchronized logging feature of IBMs WebSphere caused this performance issue which can be resolved through configuration settings in the AppServer
10. Is the problem in the virtual machine?
Leveraging virtual compute power - whether it is from your local running VM server farm or running in one of the cloud providers - provides lots of flexibility. But it can also be the reason for performance problems if the virtual machines are not properly sized or are battling for resources with other virtual machines on the same virtual server. Knowing the impact of virtualization on the application allows you to call in the VM experts and not the app developers to solve a problem:
Understanding what is going in EC2, Azure or your VMware ESX Server allows you to figure out whether the virtualize environment is the root cause
Have an Answer to These Questions?
Now that you have an idea about the right questions to ask before you call a war room session together - or before you accept a call into such scenario, you can start focusing on preventing these sessions. Whether you are a developer, architect or on the business side, make sure you have the real facts available in order to get through these situations as fast as possible by calling in the RIGHT people and giving them the RIGHT data to analyze.
Better than spending time in War Rooms however is to prevent the number of times these situations come up. If you want to learn more about this check out some of the other blogs we recently wrote such as Performance-focused DevOps or - in case you happen to be getting ready for the holiday shopping season - Verify Readiness in Test & Pre-Production.
A strange thing is happening along the way to the Internet of Things, namely far too many devices to work with and manage. It has become clear that we'll need much higher efficiency user experiences that can allow us to more easily and scalably work with the thousands of devices that will soon be in each of our lives. Enter the conversational interface revolution, combining bots we can literally talk with, gesture to, and even direct with our thoughts, with embedded artificial intelligence, wh...
Jun. 27, 2016 07:30 AM EDT Reads: 979
SYS-CON Events announced today that Bsquare has been named “Silver Sponsor” of SYS-CON's @ThingsExpo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. For more than two decades, Bsquare has helped its customers extract business value from a broad array of physical assets by making them intelligent, connecting them, and using the data they generate to optimize business processes.
Jun. 26, 2016 05:00 PM EDT Reads: 1,211
Cloud computing is being adopted in one form or another by 94% of enterprises today. Tens of billions of new devices are being connected to The Internet of Things. And Big Data is driving this bus. An exponential increase is expected in the amount of information being processed, managed, analyzed, and acted upon by enterprise IT. This amazing is not part of some distant future - it is happening today. One report shows a 650% increase in enterprise data by 2020. Other estimates are even higher....
Jun. 26, 2016 05:00 PM EDT Reads: 1,302
Internet of @ThingsExpo, taking place November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 19th Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The Internet of Things (IoT) is the most profound change in personal and enterprise IT since the creation of the Worldwide Web more than 20 years ago. All major researchers estimate there will be tens of billions devices - comp...
Jun. 26, 2016 04:00 PM EDT Reads: 1,257
19th Cloud Expo, taking place November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, will feature technical sessions from a rock star conference faculty and the leading industry players in the world. Cloud computing is now being embraced by a majority of enterprises of all sizes. Yesterday's debate about public vs. private has transformed into the reality of hybrid cloud: a recent survey shows that 74% of enterprises have a hybrid cloud strategy. Meanwhile, 94% of enterpri...
Jun. 26, 2016 04:00 PM EDT Reads: 1,318
Connected devices and the industrial internet are growing exponentially every year with Cisco expecting 50 billion devices to be in operation by 2020. In this period of growth, location-based insights are becoming invaluable to many businesses as they adopt new connected technologies. Knowing when and where these devices connect from is critical for a number of scenarios in supply chain management, disaster management, emergency response, M2M, location marketing and more. In his session at @Th...
Jun. 26, 2016 03:00 PM EDT Reads: 982
It is one thing to build single industrial IoT applications, but what will it take to build the Smart Cities and truly society changing applications of the future? The technology won’t be the problem, it will be the number of parties that need to work together and be aligned in their motivation to succeed. In his Day 2 Keynote at @ThingsExpo, Henrik Kenani Dahlgren, Portfolio Marketing Manager at Ericsson, discussed how to plan to cooperate, partner, and form lasting all-star teams to change t...
Jun. 26, 2016 02:00 PM EDT Reads: 1,147
The cloud market growth today is largely in public clouds. While there is a lot of spend in IT departments in virtualization, these aren’t yet translating into a true “cloud” experience within the enterprise. What is stopping the growth of the “private cloud” market? In his general session at 18th Cloud Expo, Nara Rajagopalan, CEO of Accelerite, explored the challenges in deploying, managing, and getting adoption for a private cloud within an enterprise. What are the key differences between wh...
Jun. 26, 2016 01:00 PM EDT Reads: 806
The 19th International Cloud Expo has announced that its Call for Papers is open. Cloud Expo, to be held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, brings together Cloud Computing, Big Data, Internet of Things, DevOps, Digital Transformation, Microservices and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding business opportuni...
Jun. 26, 2016 12:00 PM EDT Reads: 1,290
Internet of @ThingsExpo, taking place November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with the 19th International Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world and ThingsExpo Silicon Valley Call for Papers is now open.
Jun. 26, 2016 12:00 PM EDT Reads: 1,110
There is little doubt that Big Data solutions will have an increasing role in the Enterprise IT mainstream over time. Big Data at Cloud Expo - to be held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA - has announced its Call for Papers is open. Cloud computing is being adopted in one form or another by 94% of enterprises today. Tens of billions of new devices are being connected to The Internet of Things. And Big Data is driving this bus. An exponential increase is...
Jun. 26, 2016 12:00 PM EDT Reads: 1,351
In his general session at 18th Cloud Expo, Lee Atchison, Principal Cloud Architect and Advocate at New Relic, discussed cloud as a ‘better data center’ and how it adds new capacity (faster) and improves application availability (redundancy). The cloud is a ‘Dynamic Tool for Dynamic Apps’ and resource allocation is an integral part of your application architecture, so use only the resources you need and allocate /de-allocate resources on the fly.
Jun. 26, 2016 12:00 PM EDT Reads: 1,089
In his keynote at 18th Cloud Expo, Andrew Keys, Co-Founder of ConsenSys Enterprise, provided an overview of the evolution of the Internet and the Database and the future of their combination – the Blockchain. Andrew Keys is Co-Founder of ConsenSys Enterprise. He comes to ConsenSys Enterprise with capital markets, technology and entrepreneurial experience. Previously, he worked for UBS investment bank in equities analysis. Later, he was responsible for the creation and distribution of life sett...
Jun. 26, 2016 11:00 AM EDT Reads: 1,118
Machine Learning helps make complex systems more efficient. By applying advanced Machine Learning techniques such as Cognitive Fingerprinting, wind project operators can utilize these tools to learn from collected data, detect regular patterns, and optimize their own operations. In his session at 18th Cloud Expo, Stuart Gillen, Director of Business Development at SparkCognition, discussed how research has demonstrated the value of Machine Learning in delivering next generation analytics to imp...
Jun. 26, 2016 09:45 AM EDT Reads: 634
There are several IoTs: the Industrial Internet, Consumer Wearables, Wearables and Healthcare, Supply Chains, and the movement toward Smart Grids, Cities, Regions, and Nations. There are competing communications standards every step of the way, a bewildering array of sensors and devices, and an entire world of competing data analytics platforms. To some this appears to be chaos. In this power panel at @ThingsExpo, moderated by Conference Chair Roger Strukhoff, Bradley Holt, Developer Advocate a...
Jun. 26, 2016 08:45 AM EDT Reads: 748
Cognitive Computing is becoming the foundation for a new generation of solutions that have the potential to transform business. Unlike traditional approaches to building solutions, a cognitive computing approach allows the data to help determine the way applications are designed. This contrasts with conventional software development that begins with defining logic based on the current way a business operates. In her session at 18th Cloud Expo, Judith S. Hurwitz, President and CEO of Hurwitz & ...
Jun. 25, 2016 03:00 PM EDT Reads: 1,622
SYS-CON Events announced today that ReadyTalk, a leading provider of online conferencing and webinar services, has been named Vendor Presentation Sponsor at the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. ReadyTalk delivers audio and web conferencing services that inspire collaboration and enable the Future of Work for today’s increasingly digital and mobile workforce. By combining intuitive, innovative tec...
Jun. 24, 2016 01:00 PM EDT Reads: 1,362
Amazon has gradually rolled out parts of its IoT offerings, but these are just the tip of the iceberg. In addition to optimizing their backend AWS offerings, Amazon is laying the ground work to be a major force in IoT - especially in the connected home and office. In his session at @ThingsExpo, Chris Kocher, founder and managing director of Grey Heron, explained how Amazon is extending its reach to become a major force in IoT by building on its dominant cloud IoT platform, its Dash Button strat...
Jun. 24, 2016 12:00 PM EDT Reads: 1,659
industrial company for a multi-year contract initially valued at over $4.0 million. In addition to DataV software, Bsquare will also provide comprehensive systems integration, support and maintenance services. DataV leverages advanced data analytics, predictive reasoning, data-driven diagnostics, and automated orchestration of remediation actions in order to improve asset uptime while reducing service and warranty costs.
Jun. 22, 2016 11:00 AM EDT Reads: 1,373
Vidyo, Inc., has joined the Alliance for Open Media. The Alliance for Open Media is a non-profit organization working to define and develop media technologies that address the need for an open standard for video compression and delivery over the web. As a member of the Alliance, Vidyo will collaborate with industry leaders in pursuit of an open and royalty-free AOMedia Video codec, AV1. Vidyo’s contributions to the organization will bring to bear its long history of expertise in codec technolo...
Jun. 19, 2016 12:45 PM EDT Reads: 1,279