By Andreas Grabner
November 4, 2013 03:00 PM EST
I personally don't like the term "War Room" when describing a firefighting situation that many software companies have to deal with when systems go down or have problems. The way these war rooms typically play out is that key personnel (engineers, operations, business) are summoned into a room until the problem is solved. This was the case back with the Apollo 13 mission and still is now when we look at the famous Facebook war room from Dec 2012:
The War Room back then - And Now: Not a whole lot different
What's the problem with these pictures? There are a lot of people in the room who have no clue whether the problem at hand is actually something they can fix or are responsible for. All of these people are summoned without first figuring out who should look at the problem. Why is that? Because the collected "evidence" in the form of infrastructure monitoring data, log files, user complaints, etc., just shows symptoms but doesn't tell us anything about the actual impact and root cause of issues:
Would you know whom to bring into a war room based on these "facts"? Would you want to be one of them?
Looking at the previous image, it is hard to tell which people need to get in a room. Do we just need an Ops guy to restart the process that consumes all of the CPU? Or do we need an application expert who sifts through log files? Do we need to contact our mobile solution provider because it is actually a problem in the third-party native mobile app? The typical MO is to simply call in everybody to figure out the root cause of the problem, pulling critical resources from other important projects without even knowing whether these folks can actually help solve it. How can we change this? By asking the right questions first!
The 10 Real Questions to Ask
You don't need nice and shiny dashboards that show you an aggregated overview of Twitter statuses, infrastructure health or insight into slow application transactions. You need data to answer the following questions - whether it is presented in nice dashboards or log files doesn't really matter:
Having answers to these 10 questions avoids calling too many people into a war room and improves the handling of critical application problems
1. Is an individual user complaining?
Is it "just" the CEO who complains about a problem with your newly deployed internal app because a report doesn't work on his old IE6? Or is it "just" the end user in a remote location who still uses dial-up? Knowing whether a problem happens only for a single user or a very small group of users is important for prioritization.
Analyzing the problem of the complaining user lets us assess whether it is a problem related to just "that" user, e.g., an unsupported browser version or slow network connectivity
2. Are "all" users impacted?
Even if no individual users are really complaining, you still need to know when a large number of users are impacted, because it is critical that you fix any problem that affects a large share of your user base.
Having evidence that a large number of people in a certain region, using a certain browser or a certain device are impacted makes it easy to prioritize this issue
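To make questions 1 and 2 concrete, here is a minimal Python sketch of how you might tell "one complaining user" apart from "everybody in a region or on a browser is impacted". The record layout (`user`, `region`, `browser`, `failed`) and the sample data are assumptions made for illustration; your monitoring or log data will look different.

```python
from collections import defaultdict

# Hypothetical request records; field names and values are illustrative only.
requests = [
    {"user": "u1", "region": "EU", "browser": "IE6", "failed": True},
    {"user": "u2", "region": "EU", "browser": "Chrome", "failed": False},
    {"user": "u3", "region": "US", "browser": "Chrome", "failed": False},
    {"user": "u4", "region": "EU", "browser": "IE6", "failed": True},
    {"user": "u5", "region": "US", "browser": "Firefox", "failed": False},
    {"user": "u1", "region": "EU", "browser": "IE6", "failed": True},
]

def failure_rate_by(records, dimension):
    """Return {dimension value: share of failed requests} for one field."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in records:
        key = record[dimension]
        totals[key] += 1
        if record["failed"]:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}

def impacted_users(records):
    """Return the set of users who saw at least one failed request."""
    return {record["user"] for record in records if record["failed"]}

by_browser = failure_rate_by(requests, "browser")
by_region = failure_rate_by(requests, "region")
affected = impacted_users(requests)
```

In this made-up data set every failure comes from IE6, which points at a browser-specific problem affecting a small group of users rather than an outage for everybody.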
3. Is the problem in the application?
The next question, after knowing whether users are impacted or not, is to figure out if the problem is in the application or not. This allows us to call in the application experts, architects and developers if needed. Looking at the performance distribution gives us an overview where our hotspots really are:
Where are the performance and problem hotspots? Is it really the application? Or do we need to involve other teams?
4. Is there a problem in the delivery chain?
Modern web applications rely on a long list of services along the delivery chain that lie outside of our own data center. These include CDNs, third-party services, ISPs and mobile networks. Knowing the status of these services and their impact on the end user performance of our own application allows us to decide whether to look into our own data center or call up Akamai, Facebook & Co:
Do CDNs or other third-party services experience any performance issues and is that the root cause of our complaining users?
5. Is one uncritical transaction impacted?
When the error rate goes up - is it a critical transaction such as search? Or is it a rather uncritical one such as the Contact Page? Or is a bot causing lots of errors because it crawls through pages that do not exist or that require authentication, and with that skews the overall error rate?
Analyzing which transactions drive the error rate may show you that the errors are not critical because they are either caused by a bot or occur on pages that are not business critical
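As a rough illustration of this kind of analysis, the following Python sketch counts errors per transaction with and without bot traffic. The entry layout, the `BOT_MARKERS` substring heuristic and the sample data are assumptions for this example only; real bot detection is usually more involved.

```python
from collections import Counter

# Assumption: a naive substring check on the User-Agent is good enough here.
BOT_MARKERS = ("bot", "crawler", "spider")

# Hypothetical access-log entries; field names are illustrative only.
entries = [
    {"transaction": "search", "status": 500, "user_agent": "Mozilla/5.0"},
    {"transaction": "contact", "status": 404, "user_agent": "Googlebot/2.1"},
    {"transaction": "contact", "status": 404, "user_agent": "Googlebot/2.1"},
    {"transaction": "login", "status": 200, "user_agent": "Mozilla/5.0"},
    {"transaction": "contact", "status": 500, "user_agent": "Mozilla/5.0"},
]

def is_bot(user_agent):
    """Return True if the User-Agent contains one of the bot markers."""
    lowered = user_agent.lower()
    return any(marker in lowered for marker in BOT_MARKERS)

def errors_by_transaction(log_entries, include_bots=False):
    """Count HTTP 4xx/5xx responses per transaction, optionally keeping bots."""
    counts = Counter()
    for entry in log_entries:
        if entry["status"] < 400:
            continue
        if not include_bots and is_bot(entry["user_agent"]):
            continue
        counts[entry["transaction"]] += 1
    return counts

with_bots = errors_by_transaction(entries, include_bots=True)
without_bots = errors_by_transaction(entries)
```

Comparing the two counters shows how much of the error rate is driven by bot traffic: in the sample data the Contact Page drops from three errors to one once the crawler is filtered out.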
6. Are critical transactions impacted?
What if your critical transactions are impacted, such as the landing page, login, search, or entering a ticket in your support system? These are critical transactions to you, your end users, or your colleagues who need to use the back office software for their daily tasks. If these are impacted you need to act fast. Therefore it is important to monitor the failure rate as well as the performance of these critical transactions. Acting on them is more important than acting on transactions that are not vital to your business - and you also know which subject matter experts to call:
Monitoring your critical transactions allows you to identify problems in those areas that are critical for your business
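A simple way to picture this is a per-transaction check of failure rate and average response time against limits you define. The sketch below is one possible shape for such a check; the transaction names, the threshold values and the sample measurements are all assumptions chosen for illustration, not recommendations.

```python
from statistics import mean

# Assumed limits per critical transaction; pick values that fit your business.
THRESHOLDS = {
    "login": {"max_failure_rate": 0.01, "max_avg_ms": 500},
    "search": {"max_failure_rate": 0.02, "max_avg_ms": 800},
}

# Hypothetical measurements: transaction -> list of (duration_ms, succeeded).
samples = {
    "login": [(300, True), (450, True), (400, False), (350, True)],
    "search": [(900, True), (1000, True)],
    "contact": [(5000, False)],  # not listed as critical, so it is ignored
}

def check_critical(measurements, thresholds):
    """Return (transaction, violated metric) pairs for critical transactions."""
    alerts = []
    for name, limits in thresholds.items():
        observed = measurements.get(name, [])
        if not observed:
            continue
        failed = sum(1 for _, succeeded in observed if not succeeded)
        failure_rate = failed / len(observed)
        avg_ms = mean(duration for duration, _ in observed)
        if failure_rate > limits["max_failure_rate"]:
            alerts.append((name, "failure_rate"))
        if avg_ms > limits["max_avg_ms"]:
            alerts.append((name, "response_time"))
    return alerts

alerts = check_critical(samples, THRESHOLDS)
```

Because each alert names the transaction and the violated metric, it also hints at whom to call: a login failure-rate alert goes to a different expert than a slow search.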
7. Is the problem related to bad coding?
If application response time is getting slower, the first question is whether it is because of bad coding. Analyzing the performance hotspot to the code level can tell you whether most of the time is spent because of inefficient algorithms or just not following coding and architectural best practices:
Throwing thousands of exceptions to control program flow is not a good coding practice and also impacts performance
8. Does the infrastructure cause an issue?
What if it is not the app itself, but the app is running low on resources provided by the infrastructure? What if the CPU required to run the Garbage Collector is not available because the already overutilized machine also runs lots of other services? In that case it is time to think about the infrastructure - distributing these applications and services better or scaling the infrastructure:
Where does the memory shortage come from? Does it impact other processes on that machine? Which processes to move to a different machine?
9. Is the AppServer the issue?
Depending on the AppServer you are using, you have multiple configuration options to optimize it for your environment. The question remains whether the AppServer might be responsible for performance issues caused by an incorrect setting or a corrupt deployment. Resource pool sizing (threads, database connections, ...), security settings and logging options can all impact performance. If it turns out that the AppServer is the problem, contact your IBM, Oracle, Microsoft ... specialist:
A global synchronized logging feature of IBM's WebSphere caused this performance issue, which can be resolved through configuration settings in the AppServer
10. Is the problem in the virtual machine?
Leveraging virtual compute power - whether it is from your locally running VM server farm or from one of the cloud providers - provides lots of flexibility. But it can also be the reason for performance problems if the virtual machines are not properly sized or are battling for resources with other virtual machines on the same virtual server. Knowing the impact of virtualization on the application allows you to call in the VM experts and not the app developers to solve a problem:
Understanding what is going on in EC2, Azure or your VMware ESX Server allows you to figure out whether the virtualized environment is the root cause
Have an Answer to These Questions?
Now that you have an idea about the right questions to ask before you call a war room session together - or before you accept a call into such a scenario - you can start focusing on preventing these sessions. Whether you are a developer, architect or on the business side, make sure you have the real facts available in order to get through these situations as fast as possible by calling in the RIGHT people and giving them the RIGHT data to analyze.
Better than spending time in War Rooms, however, is to reduce the number of times these situations come up. If you want to learn more about this, check out some of the other blogs we recently wrote such as Performance-focused DevOps or - in case you happen to be getting ready for the holiday shopping season - Verify Readiness in Test & Pre-Production.