|By Michael Kopp||
|November 4, 2012 09:00 AM EST||
An eCommerce site that crashes seven times during the Christmas season being down for up to five hours each time it crashes is a site that loses a lot of money and reputation. It happened to one of our customers who told this story at our annual performance conference earlier this month. Among the several reasons that led to these crashes I want to share more details on one of them that I see more often with other websites as well. Load balancers on a round-robin instead of least-busy can easily lead to app server crashes caused by heap memory exhaustion. Let's dig into some details on how to identify these problems and how to avoid it.
The Symptom: Crashing Tomcat Instances
The website is deployed on six Tomcats with three front-end Apache Web Servers. During peak load hours individual Tomcat instances started showing growing response times and a growing number of requests in the Tomcat processing queue. After a while these instances crashed due to out-of-memory exceptions and with that also brought down the rest of the site as load couldn't be handled any more with the remaining servers. Figure 1 shows the actual flow of transactions through the system highlighting unevenly distributed response time in the application servers and functional errors being reported on all tiers (red-colored server icon):
Even with equally distributed load (Round Robin Load Balancer Setting) one of the Tomcats spiked in Response Time Contribution before crashing
Once the App Server started rejecting incoming connections we can observe the first ripple effect of errors. We can see a very high number of exceptions in the database layer, exceptions thrown between application tiers with the web app responding with HTTP 500s:
Within 30 minutes the application serves 43000 pages with an HTTP 500 Response correlating to Exceptions in the Database and Inter-Tier Communication
The Root Cause: Inefficient Database Statements and Connection Pool Usage
The exceptions caught in the Database Layer (JDBC) were already a very good hint of the root cause of this problem. A closer look at the Exceptions shows that connection pools are exhausted, which causes problems in the different components of the application:
Exhausted Connection Pool causes Exceptions that impact Data Access Layer as well as Widget Rendering
Looking at the performance breakdown by application layer reveals how much performance impact connection pooling has on the overall transaction response time:
Due to the connection pool problem a single request had to wait 3.8s on average to obtain a connection from the pool
Now - it was not only the size of the pool that was the problem - but - several very inefficient database statements that took long to execute for certain business transactions of the application. This caused the application server to hold on to the connection longer than normal. As the load balancer was configured with Round Robin the app server still got additional requests served. Eventually - just by the random nature of incoming requests - one app server received several of these requests that executed these inefficient database calls. Once the connection pool was exhausted the application started throwing exceptions that ultimately also led to a crash of the JVM. Once the first app server crashed, it didn't take too long to take the other app servers down as well.
The Solution: Optimizing App and Load Balancer
The problem was fixed by looking at the slowest database statements and optimizing them for performance by, e.g, adding indices on the database or making the SQL statements more efficient. They also optimized the pool size to accommodate the expected load during peak hours.
They started by optimizing SQL Statements that took long to execute and those that got executed several times within the same transaction
They also changed the load balancer setting from Round-Robin to Least-Busy, which was the preferred setting from the LB vendor but had simply forgotten to configure in the production environment.
The Result: Site Has Not Been Down Since
Since they made the changes to the application and the load balancer the site has never gone down since. Now - the next holiday season is coming up and they are ready for the next seasonal spikes. Even though they are really confident that everything will work without any problems, they learned their lesson and are approaching performance proactively through proper load testing.
Next Steps: Proactive Performance Management
The lesson learned was that these problems could have been found prior to the holiday shopping season by doing proper load testing. They did load testing before but never encountered this problem because of two reasons: 1) they didn't test using expected peak volumes for long enough sessions and 2) didn't use a tool that simulated real customer behavior variations (too few scripts and the scripts were too simple) that tested their highly interactive web site.
Their strategy for proactive performance management is that they:
- Perform load tests on the production system during low traffic hours (2 a.m.-6 a.m.), accepting the risk of minor sales losses in case of a crash, versus major sales losses during the holiday shopping season.
- Multiply the hourly load test volume by 2.5 since their actual peaks are 10 hours long.
- Use a load testing service that uses real browsers in different locations around the U.S.
- Use an APM solution that identifies problems within the application while running the load test.
If you want to read more on common performance problems that are not found prior to moving to production check out my recent series of blogs: Supersized Content, Deployment Mistakes or Excessive Logging
- The Odd Couple: Marrying Agile and Waterfall
- Fanning the Flames of Agile
- Internet of @ThingsExpo Silicon Valley Call for Papers Now Open
- April and May 2014 Server and StorageIO Update newsletter
- MangoApps to Exhibit at Cloud Expo New York
- WSO2 Introduces Industry’s First Enterprise Identity Bus With the Launch of WSO2 Identity Server 5.0
- Last Chance to Register for LTE World Summit
- The Butterfly Effect Within IT
- The Business Challenges Impacting Digital Transformation
- Stay Current on the Internet of Things
- Setting the Bar for Agile Architecture
- New Relic Announces General Availability of Real-Time Analytics Platform New Relic Insights
- How to Get the Best From Virtual Employees
- Global Financial Firms Can Effectively Address Technology Risk Guidelines
- .CLUB Domain Name Extension Now Available for General Registration
- MapR Technologies Announces Upcoming June Conferences
- More Mainstream Businesses Depend on Open Source
- AMAG, HP, ImageWare Systems, March Networks and StrikeForce Discuss Security Solutions in SecuritySolutionsWatch.com Interviews
- F5 to Present at Upcoming Technology and Investor Conferences
- The Odd Couple: Marrying Agile and Waterfall
- Flexera Software’s InstallShield 2014 Release Introduces New Support of Cloud and Virtualised Installations, High-DPI Displays and Touch Devices, and Agile Development
- FlexNet Manager Suite Wins CODiE Award for Best Asset Management Solution - 4th CODiE Award for Flexera Software
- Fanning the Flames of Agile
- WSO2 Guest Speakers at WSO2Con Europe 2014 Will Examine Technology Developments and Best Practices Enabling the Connected Business
- The Top 150 Players in Cloud Computing
- Who Are The All-Time Heroes of i-Technology?
- Where Are RIA Technologies Headed in 2008?
- Success, Arrogance, Rise and Fall
- AJAX World RIA Conference & Expo Kicks Off in New York City
- The Top 250 Players in the Cloud Computing Ecosystem
- Personal Branding Checklist
- i-Technology Viewpoint: Attack of the Blogs
- Exclusive Q&A with Jeff Haynie, Co-Founder & CEO, Appcelerator
- Cloud People: A Who's Who of Cloud Computing
- Ulitzer Names the World's 30 Most Influential Cloud Computing Bloggers
- Web 2.0 News and Wrapping Up "Real-World AJAX" Seminar