| By Bill McColl | Article Rating: |
|
| January 4, 2012 05:15 AM EST | Reads: |
5,459 |
In big data computing, and more generally in all commercial highly parallel software systems, speed matters more than just about anything else. The reason is straightforward, and has been known for decades.
Put very simply, when it comes to massively parallel software of the kind need to handle big data, fast is both better AND cheaper. Faster means lower latency AND lower cost.
At first this may seem counterintuitive. A high-end sports car will be much faster than a standard family sedan, but the family sedan may be much cheaper. Cheaper to buy, and cheaper to run. But massively parallel software running on commodity hardware is a quite different type of product from a car. In general, the faster it goes, the cheaper it is to run.
Time Is Money
As has been noted many times in the history of computing, if you are a factor of 50x slower, then you will need 50x more nodes to run at the same speed (even assuming perfect parallelization), or your computation will need 50x more time. In either case, it will also be much more likely that you will experience at least one of your nodes crashing during a computation. This is not to argue that automatic fault tolerance and recovery should be ignored in the pursuit of speed, but rather that these two factors need to be carefully balanced. Good design in massively parallel systems is about achieving maximum speed along with the ability to recover from a given expected level of hardware failure, via checkpointing.
The key phrase here is "a given expected level of hardware failure". In certain types of peer-to-peer services which take advantage of idle PC capacity, it is necessary to assume that all machines are extremely unreliable and may go offline at any time. However, in a commercial big data cluster it may be reasonably asssumed that almost all machines will be available almost all of the time. This means that a much more optimistic point in the design space can be chosen, one which is designed much more for speed than for pathological failure scenarios.
The MapReduce model is an example of a model where speed has been sacrificed in a major way in order to achieve scalability on very unreliable hardware. As we have noted, while this is acceptable in certain types of free peer-to-peer services, it is much less acceptable in commercial big data systems deployed at scale.
Google, the inventors of the model, were the first to recognize the throughput and latency problems with the MapReduce model. To get the realtime performance they required, they recently replaced MapReduce in their Google Instant search engine.
The MapReduce model of Apache Hadoop is slow. In fact, it's very slow compared to, for example, the kinds of MPI or BSP clusters that have been routinely used in supercomputing for more than 15 years. On exactly the same hardware, MapReduce can be several orders of magnitude slower than MPI or BSP. By using MPI rather than MapReduce, HadoopBI gives customers the best possible big data solution, not only in terms of performance - massive throughput and extremely low latency - but also in terms of economics. HadoopBI is not just the fastest Big Data BI solution, it is also the cheapest at scale.
It's Free, But Is It Fast Enough?
Another frequently misunderstood element of big data economics concerns so-called "free" software. It has been argued by some that, since big data software needs to be run on many nodes, it is really important to have software that is free. Again this is an extreme oversimplification that ignores the dominant cost issues in big data economics. At large scale, software costs will in general be much smaller than hardware or cloud costs. And commercial software vendors should ensure that they are, if they want to stay in business.
Consider the following small-scale example. A company needs to process big data continuously in order to maximize competitive advantage. For simplicity, we will assume that the cost of running a single server (in-house or cloud) for one hour is $1, and that the company has a choice between two big data software systems - system A costs $1,000 per server and system B is free, but system A is 8x faster. Choosing system A, the company requires 5 servers, working continuously, to achieve the throughput required. However, if the company chooses system B, it will require 40 servers running continuously.
Simple arithmetic shows that within just six days, the initial cost of system A has been recovered, and from then on system A gives the company massive cost savings. Even if system A is only 2x or 3x faster and more efficient than system B, the initial cost will still be recovered in a matter of a few weeks.
The economic advantages of speed at scale are magnified even more in large-scale big data systems where, with volume licensing discounts, the payback time for super-fast software is even shorter.
The lesson of the above example is simple and very important. In parallel systems, speed at scale is king, as speed equates to efficiency, and efficiency equates to massive cost savings at scale. So, to be relevant for large scale production deployments, free parallel software has to be at least as fast and efficient as the best commercial software, otherwise the economics will be solidly against it. Some examples of free software, such as the Linux operating system, have achieved this goal. It remains to be seen whether this will also be the case with highly parallel big data software. In the meantime, it's important to remember that "free software is cheap, but fast software can be even cheaper".
Published January 4, 2012 Reads 5,459
Copyright © 2012 SYS-CON Media, Inc. — All Rights Reserved.
Syndicated stories and blog feeds, all rights reserved by the author.
More Stories By Bill McColl
Bill McColl left Oxford University to found Cloudscale. At Oxford he was Professor of Computer Science, Head of the Parallel Computing Research Center, and Chairman of the Computer Science Faculty. Along with Les Valiant of Harvard, he developed the BSP approach to parallel programming. He has led research, product, and business teams, in a number of areas: massively parallel algorithms and architectures, parallel programming languages and tools, datacenter virtualization, realtime stream processing, big data analytics, and cloud computing. He lives in Palo Alto, CA.
- Cloud People: A Who's Who of Cloud Computing
- Cloud Expo New York Speaker Profile: Dave Linthicum – Cloud Technology Partners
- Cloud Expo New York Speaker Profile: Jill T. Singer – Federal CIO Emeritus
- New Relic Q1 2013 Blazes Past Growth Targets and Reaches 40,000 Active Customer Accounts
- CollabNet and UC4 Announce General Availability of Joint Enterprise DevOps Platform
- How Can Green Web Hosting Benefit Your Business?
- Big Data Isn’t About the Database, It’s About the Application
- Session Topics: 12th Cloud Expo / Cloud Expo New York
- BEA Updates WebLogic SOA Portal for Web 2.0 Era
- UNIT4 Business Software: Three Retail Accounting Tips to Help Retailers Leverage the Cloud and Back Office Systems
- Cloud Expo NY: Best Practices for Architecting Your Cloud Infrastructure
- The Rise of the Thin Client
- Cloud People: A Who's Who of Cloud Computing
- Cloud Expo New York Speaker Profile: Dave Linthicum – Cloud Technology Partners
- Cloud Expo New York Speaker Profile: Jill T. Singer – Federal CIO Emeritus
- Enterasys Spotlights SDN's Impact on Traditional Networking in Upcoming Webinar
- New Relic Q1 2013 Blazes Past Growth Targets and Reaches 40,000 Active Customer Accounts
- CollabNet and UC4 Announce General Availability of Joint Enterprise DevOps Platform
- How Can Green Web Hosting Benefit Your Business?
- Big Data Isn’t About the Database, It’s About the Application
- Upcoming Bloomberg BNA Webinar Focuses on COPPA Compliance
- NASA's Twitter Account Wins Back-To-Back Shorty Awards
- Cloud Expo New York: Basics of SSD Technology and Its Use in Cloud
- Session Topics: 12th Cloud Expo / Cloud Expo New York
- The Top 150 Players in Cloud Computing
- Who Are The All-Time Heroes of i-Technology?
- Where Are RIA Technologies Headed in 2008?
- Success, Arrogance, Rise and Fall
- AJAX World RIA Conference & Expo Kicks Off in New York City
- Personal Branding Checklist
- The Top 250 Players in the Cloud Computing Ecosystem
- i-Technology Viewpoint: Attack of the Blogs
- Exclusive Q&A with Jeff Haynie, Co-Founder & CEO, Appcelerator
- Web 2.0 News and Wrapping Up "Real-World AJAX" Seminar
- Passing Parameters to Flex That Works
- i-Technology Viewpoint: It's Time to Take the Quotation Marks Off "Web 2.0"























