By Jerry Held
August 5, 2008 01:15 PM EDT
I believe that cloud computing will change the economics of business intelligence (BI) and enable a variety of new analytic data management projects and business possibilities. It does so by making the hardware, networking, security, and software needed to create data marts and data warehouses available on demand, with a pay-as-you-go approach to usage and licensing.
A computing cloud, such as the Amazon Elastic Compute Cloud, is composed of thousands of commodity servers running multiple virtual machine instances (VMs) of the applications hosted in the cloud. As customer demand for those applications changes, servers are added to the cloud or idled, and VMs are instantiated or terminated.
Cloud computing infrastructure differs dramatically from the infrastructure underlying most in-house data warehouses and data marts. There are no high-end servers with dozens of CPU cores, SANs, replicated systems, or proprietary data warehousing appliances available in the cloud. Therefore, a new DBMS software architecture is required to enable large volumes of data to be analyzed quickly and reliably on the cloud's commodity hardware. Recent DBMS innovations make this a reality today, and the best cloud DBMS architectures will include:
- Shared-nothing, massively parallel processing (MPP) architecture. In order to drive down the cost of creating a utility computing environment, the best cloud service providers use huge grids of identical (or similar) computing elements. Each node in the grid is typically a compute engine with its own attached storage. For a cloud database to successfully "scale out" in such an environment, it is essential that the database have a shared-nothing architecture utilizing the resources (CPU, memory, and disk) found in server nodes added to the cluster. Most databases popularly used in BI today have shared-everything or shared-storage architectures, which will limit their ability to scale in the cloud.
- Automatic high availability. Within a cloud-based analytic database cluster, node failures, node changes, and connection disruptions can occur. Given the vast number of processing elements within a cloud, these failures can be made transparent to the end user if the database has the proper built-in failover capabilities. The best cloud databases will replicate data automatically across the nodes in the cloud cluster, be able to continue running in the event of one or more node failures ("k-safety"), and be capable of restoring data on recovered nodes automatically, without DBA assistance. Ideally, the replicated data will be made "active" in different sort orders for querying to increase performance.
- Ultra-high performance. One of the game-changing advantages of the cloud is the ability to get an analytic application up quickly (without waiting for hardware procurement). However, there can be some performance penalty due to Internet connectivity speeds and the virtualized cloud environment. If the analytic performance is disappointing, the advantage is lost. Fortunately, the latest shared-nothing columnar databases are designed specifically for analytic workloads, and they have demonstrated dramatic performance improvements over traditional, row-oriented databases (as verified by industry experts, such as Gartner and Forrester, and by customer benchmarks). This software performance improvement, coupled with the hardware economies of scale provided by the cloud environment, results in a new economic model and competitive advantage for cloud analytics.
- Aggressive compression. Since cloud costs are typically driven by charges for processor and disk storage utilization, aggressive data compression will result in very large cost savings. Row-oriented databases can achieve compression factors of about 30% to 50%; however, the addition of necessary indexes and materialized views often swells databases to 2 to 5 times the size of the source data. Because the values within a column tend to be more similar and repetitive than the attributes within a row, columnar databases often achieve much higher levels of compression. They also don't require indexes. The result is normally a 4x to 20x reduction in the amount of storage needed by columnar databases and a commensurate reduction in storage costs.
- Standards-based connectivity. A number of special-purpose file systems developed for the cloud environment can provide high performance, but they lack the standard connectivity needed to support general-purpose business analytics. The broad base of analytic users will use existing commercial ETL and reporting software that depends on SQL, JDBC, ODBC, and other DBMS connectivity standards to load and query cloud databases. Therefore, it's imperative for cloud databases to support these connection standards to enable widespread use of analytic applications.
In short, a DBMS built for the cloud must be capable of:
- "Scaling out," as the cloud itself does
- Running fast without high-end or custom hardware
- Providing high availability in a fluid computing environment
- Minimizing data storage, transfer, and CPU utilization (to keep cloud computing fees low)
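To make the shared-nothing and k-safety ideas above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the design of any particular product: the function names (`home_node`, `replica_nodes`, `survives`, `tolerates_any_failures`) are hypothetical, and placing copies on the next k nodes in ring order is just one simple placement scheme chosen for the example. Each row is assigned to a node by hashing its key, so no node shares storage with another, and each row is copied to k additional nodes so that the cluster can lose up to k nodes without losing access to any data.

```python
import zlib
from itertools import combinations


def home_node(key, num_nodes):
    """Pick the node that owns a row by hashing its key (stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def replica_nodes(key, num_nodes, k):
    """Return the owner plus the next k nodes in ring order (k + 1 copies).

    Ring placement is an assumption made for this sketch only.
    """
    if not 0 <= k < num_nodes:
        raise ValueError("k must satisfy 0 <= k < num_nodes")
    first = home_node(key, num_nodes)
    return [(first + i) % num_nodes for i in range(k + 1)]


def survives(keys, num_nodes, k, failed):
    """True if every key still has at least one copy on a live node."""
    return all(
        any(node not in failed for node in replica_nodes(key, num_nodes, k))
        for key in keys
    )


def tolerates_any_failures(keys, num_nodes, k, failures):
    """Check every possible set of `failures` simultaneous node losses."""
    return all(
        survives(keys, num_nodes, k, set(combo))
        for combo in combinations(range(num_nodes), failures)
    )


if __name__ == "__main__":
    keys = ["order-%d" % i for i in range(1000)]
    # With k = 1 every row lives on two distinct nodes, so losing any
    # single node leaves one copy available.
    print(tolerates_any_failures(keys, num_nodes=5, k=1, failures=1))
```

Because each row is stored on k + 1 distinct nodes, any set of at most k failed nodes leaves at least one copy reachable. Losing every node that holds a given row, by contrast, makes that row unavailable until a node is recovered.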