|By Dana Gardner||
|January 13, 2014 08:30 AM EST||
If, as the adage goes, you should fight fire with fire then perhaps its equally justified to fight Big Data optimization requirements with -- Big Data.
It turns out that high-performing, cost-effective Big-Data processing helps to make the best use of dynamic storage resources by taking in all the relevant storage activities data, analyzing it and then making the best real-time choices for dynamic hybrid storage optimization.
In other words, Big Data can be exploited to better manage complex data and storage. The concept, while tricky at first, is powerful and, I believe, a harbinger of what we're going to see more of, which is to bring high intelligence to bear on many more services, products and machines.
To explore how such Big Data analysis makes good on data storage efficiency, BriefingsDirect recently sat down with optimized hybrid storage provider Nimble Storage to hear their story on the use of HP Vertica as their data analysis platform of choice. Yes, it's the same Nimble that last month had a highly successful IPO. The expert is Larry Lancaster, Chief Data Scientist at Nimble Storage Inc. in San Jose, California. The discussion is moderated by me, Dana Gardner, Principal Analyst at Interarbor Solutions.
Here are some excerpts:
Gardner: How do you use big data to support your hybrid storage optimization value?
Lancaster: At a high level, Nimble Storage recognized early, near the inception of the product, that if we were able to collect enough operational data about how our products are performing in the field, get it back home and analyze it, we'd be able to dramatically reduce support costs. Also, we can create a feedback loop that allows engineering to improve the product very quickly, according to the demands that are being placed on the product in the field.
Looking at it from that perspective, to get it right, you need to do it from the inception of the product. If you take a look at how much data we get back for every array we sell in the field, we could be receiving anywhere from 10,000 to 100,000 data points per minute from each array. Then, we bring those back home, we put them into a database, and we run a lot of intensive analytics on those data.
Once you're doing that, you realize that as soon as you do something, you have this data you're starting to leverage. You're making support recommendations and so on, but then you realize you could do a lot more with it. We can do dynamic cache sizing. We can figure out how much cache a customer needs based on an analysis of their real workloads.
We found that big data is really paying off for us. We want to continue to increase how much it's paying off for us, but to do that we need to be able to do bigger queries faster. We have a team of data scientists and we don't want them sitting here twiddling their thumbs. That’s what brought us to Vertica at Nimble.
Using Big Data
Gardner: It's an interesting juxtaposition that you're using big data in order to better manage data and storage. What better use of it? And what sort of efficiencies are we talking about here, when you are able to get that data in that massive scale and do these analytics and then go back out into the field and adjust? What does that get for you?
Lancaster: We have a very tight feedback loop. In one release we put out, we may make some changes in the way certain things happen on the back end, for example, the way NVRAM is drained. There are some very particular details around that, and we can observe very quickly how that performs under different workloads. We can make tweaks and do a lot of tuning.
Without the kind of data we have, we might have to have multiple cases being opened on performance in the field and escalations, looking at cores, and then simulating things in the lab.
It's a very labor-intensive, slow process with very little data to base the decision on. When you bring home operational data from all your products in the field, you're now talking about being able to figure out in near real-time the distribution of workloads in the field and how people access their storage. I think we have a better understanding of the way storage works in the real world than any other storage vendor, simply because we have the data.
Gardner: So it's an interesting combination of a product lifecycle approach to getting data -- but also combining a service with a product in such a way that you're adjusting in real time.
Lancaster: That’s right. We do a lot of neat things. We do capacity forecasting. We do a lot of predictive analytics to try to figure out when the storage administrator is going to need to purchase something, rather than having them just stumble into the fact that they need to provision for equipment because they've run out of space.
A lot of things that should have been done in storage from the very beginning that sound straightforward were simply never done. We're the first company to take a comprehensive approach to it. We open and close 80 percent of our cases automatically, 90 percent of them are automatically opened.
We have a suite of tools that run on this operational data, so we don't have to call people up and say, "Please gather this data for us. Please send us these log posts. Please send us these statistics." Now, we take a case that could have taken two or three days and we turn it into something that can be done in an hour.
That’s the kind of efficiency we gain that you can see, and the InfoSight service delivers that to our customers.
Gardner: Larry, just to be clear, you're supporting both flash and traditional disk storage, but you're able to exploit the hybrid relationship between them because of this data and analysis. Tell us a little bit about how the hybrid storage works.
Challenge for hard drives
Lancaster: At a high level, you have hard drives, which are inexpensive, but they're slow for random I/O. For sequential I/O, they are all right, but for random I/O performance, they're slow. It takes time to move the platter and the head. You're looking at 5 to 10 milliseconds seek time for random read.
That's been the challenge for hard drives. Flash drives have come out and they can dramatically improve on that. Now, you're talking about microsecond-order latencies, rather than milliseconds.
But the challenge there is that they're expensive. You could go buy all flash or you could go buy all hard drives and you can live with those downsides of each. Or, you can take the best of both worlds.
Then, there's a challenge. How do I keep the data that I need to access randomly in flash, but keep the rest of the data that I don't care so much about in a frequent random-read performance, keep that on the hard drives only, and in that way, optimize my use of flash. That's the way you can save money, but it's difficult to do that.
It comes down to having some understanding of the workloads that the customer is running and being able to anticipate the best algorithms and parameters for those algorithms to make sure that the right data is in flash.
We've built up an enormous dataset covering thousands of system-years of real-world usage to tell us exactly which approaches to caching are going to deliver the most benefit. It would be hard to be the best hybrid storage solution without the kind of analytics that we're doing.
Gardner: Then, to extrapolate a little bit higher, or maybe wider, for how this benefits an organization, the analysis that you're gathering also pertains to the data lifecycle, things like disaster recovery (DR), business continuity, backups, scheduling, and so forth. Tell us how the data gathering analytics has been applied to that larger data lifecycle equation.
Lancaster: You're absolutely right. One of the things that we do is make sure that we audit all of the storage that our customers have deployed to understand how much of it is protected with local snapshots, how much of it is replicated for disaster recovery, and how much incremental space is required to increase retention time and so on.
We have very efficient snapshots, but at the end of the day, if you're making changes, snapshots still do take some amount of space. So, learning exactly what is that overhead, and how can we help you achieve your disaster recovery goals.
We have a good understanding of that in the field. We go to customers with proactive service recommendations about what they could and should do. But we also take into account the fact that they may be doing DR when we forecast how much capacity they are going to need.
It is part of a larger lifecycle that we address, but at the end of the day, for my team it's still all about analytics. It's about looking to the data as the source of truth and as the source of recommendation.
We can tell you roughly how much space you're going to need to do disaster recovery on a given type of application, because we can look in our field and see the distribution of the extra space that would take and what kind of bandwidth you're going to need. We have all that information at our fingertips.
When you start to work this way, you realize that you can do things you couldn't do before. And the things you could do before, you can do orders of magnitude better. So we're a great case of actually applying data science to the product lifecycle, but also to front-line revenue and cost enhancement.
Gardner: How can you actually get that analysis in the speed, at the scale, and at the cost that you require?
Lancaster: To give you a brief history of my awareness of HP Vertica and my involvement around the product, I don’t remember the exact year, but it may have been eight years ago roughly. At some point, there was an announcement that Mike Stonebraker was involved in a group that was going to productize the C-Store Database, which was sort of an academic experiment at UC Berkeley, to understand the benefits and capabilities of real column store.
I was immediately interested and contacted them. I was working at another storage company at the time. I had a 20 terabyte (TB) data warehouse, which at the time was one of the largest Oracle on Linux data warehouses in the world.
They didn't want to touch that opportunity just yet, because they were just starting out in alpha mode. I hooked up with them again a few years later, when I was CTO at a company called Glassbeam, where we developed what's substantially an extract, transform, and load (ETL) platform.
By then, they were well along the road. They had a great product and it was solid. So we tried it out, and I have to tell you, I fell in love with Vertica because of the performance benefits that it provided.
When you start thinking about collecting as many different data points as we like to collect, you have to recognize that you’re going to end up with a couple choices on a row store. Either you're going to have very narrow tables and a lot of them or else you're going to be wasting a lot of I/O overhead, retrieving entire rows where you just need a couple fields.
That was what piqued my interest at first. But as I began to use it more and more at Glassbeam, I realized that the performance benefits you could gain by using HP Vertica properly were another order of magnitude beyond what you would expect just with the column-store efficiency.
That's because of certain features that Vertica allows, such as something called pre-join projections. We can drill into that sort of stuff more if you like, but, at a high-level, it lets you maintain the normalized logical integrity of your schema, while having under the hood, an optimized denormalized query performance physically on disk.
Now you might ask you can be efficient if you have a denormalized structure on disk. It's because Vertica allows you to do some very efficient types of encoding on your data. So all of the low cardinality columns that would have been wasting space in a row store end up taking almost no space at all.
What you find, at least it's been my impression, is that Vertica is the data warehouse that you would have wanted to have built 10 or 20 years ago, but nobody had done it yet.
Nowadays, when I'm evaluating other big data platforms, I always have to look at it from the perspective of it's great, we can get some parallelism here, and there are certain operations that we can do that might be difficult on other platforms, but I always have to compare it to Vertica. Frankly, I always find that Vertica comes out on top in terms of features, performance, and usability.
Gardner: When you arrived there at Nimble Storage, what were they using, and where are you now on your journey into a transition to Vertica?
Lancaster: I built the environment here from the ground up. When I got here, there were roughly 30 people. It's a very small company. We started with Postgres. We started with something free. We didn’t want to have a large budget dedicated to the backing infrastructure just yet. We weren’t ready to monetize it yet.
So, we started on Postgres and we've scaled up now to the point where we have about 100 TBs on Postgres. We get decent performance out of the database for the things that we absolutely need to do, which are micro-batch updates and transactional activity. We get that performance because the database lives on Nimble Storage.
I don't know what the largest unsharded Postgres instance is in the world, but I feel like I have one of them. It's a challenge to manage and leverage. Now, we've gotten to the point where we're really enjoying doing larger queries. We really want to understand the entire installed base of how we want to do analyses that extend across the entire base.
We want to understand the lifecycle of a volume. We want to understand how it grows, how it lives, what its performance characteristics are, and then how gradually it falls into senescence when people stop using it. It turns out there is a lot of really rich information that we now have access to to understand storage lifecycles in a way I don't think was possible before.
But to do that, we need to take our infrastructure to the next level. So we've been doing that and we've loaded a large number of our sensor data that’s the numerical data I have talked about into Vertica, started to compare the queries, and then started to use Vertica more and more for all the analysis we're doing.
Internally, we're using Vertica, just because of the performance benefits. I can give you an example. We had a particular query, a particularly large query. It was to look at certain aspects of latency over a month across the entire installed base to understand a little bit about the distribution, depending on different factors, and so on.
We ran that query in Postgres, and depending on how busy the server was, it took anywhere from 12 to 24 hours to run. On Vertica, to run the same query on the same data takes anywhere from three to seven seconds.
I anticipated that because we were aware upfront of the benefits we'd be getting. I've seen it before. We knew how to structure our projections to get that kind of performance. We knew what kind of infrastructure we'd need under it. I'm really excited. We're getting exactly what we wanted and better.
This is only a three node cluster. Look at the performance we're getting. On the smaller queries, we're getting sub-second latencies. On the big ones, we're getting sub-10 second latencies. It's absolutely amazing. It's game changing.
People can sit at their desktops now, manipulate data, come up with new ideas and iterate without having to run a batch and go home. It's a dramatic productivity increase. Data scientists tend to be fairly impatient. They're highly paid people, and you don’t want them sitting at their desk waiting to get an answer out of the database. It's not the best use of their time.
Gardner: Larry, is there another aspect to the HP Vertica value when it comes to the cloud model for deployment? It seems to me that if Nimble Storage continues to grow rapidly and scales that, bringing all that data back to a central single point might be problematic. Having it distributed or in different cloud deployment models might make sense. Is there something about the way Vertica works within a cloud services deployment that is of interest to you as well?
Lancaster: There's the ease of adding nodes without downtime, the fact that you can create a K-safe cluster. If my cluster is 16 nodes wide now, and I want two nodes redundancy, it's very similar to RAID. You can specify that, and the database will take care of that for you. You don’t have to worry about the database going down and losing data as a result of the node failure every time or two.
I love the fact that you don’t have to pay extra for that. If I want to put more cores or nodes on it or I want to put more redundancy into my design, I can do that without paying more for it. Wow! That’s kind of revolutionary in itself.
It's great to see a database company incented to give you great performance. They're incented to help you work better with more nodes and more cores. They don't have to worry about people not being able to pay the additional license fees to deploy more resources. In that sense, it's great.
We have our own private cloud -- that’s how I like to think of it -- at an offsite colocation facility. We do DR through Nimble Storage. At the same time, we have a K-safe cluster. We had a hardware glitch on one of the nodes last week, and the other two nodes stayed up, served data, and everything was fine.
Those kinds of features are critical, and that ability to be flexible and expand is critical for someone who is trying to build a large cloud infrastructure, because you're never going to know in advance exactly how much you're going to need.
If you do your job right as a cloud provider, people just want more and more and more. You want to get them hooked and you want to get them enjoying the experience. Vertica lets you do that.
You may also be interested in:
- MZI Healthcare Identifies Big Data Patient Productivity Gems Using HP Vertica
- Thought Leader Interview: HP's Global CISO Brett Wahlin on the future of Security and Risk
- Panel explains how CSC creates a tough cybersecurity posture against global threats
- Risk and complexity: Businesses need to get a grip
- HP Vertica General Manager Colin Mahony on the next generation of analytics platforms
- Advanced IT monitoring Delivers Predictive Diagnostics Focus to United Airlines
- CSC and HP team up to define the new state needed for comprehensive enterprise cybersecurity
- BYOD brings new security challenges for IT: Allowing greater access while protecting networks
- HP Vertica Architecture Gives Massive Performance Boost to Toughest BI Queries for Infinity Insurance
SYS-CON Events announced today that IBM Cloud Data Services has been named “Bronze Sponsor” of SYS-CON's 17th Cloud Expo, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. IBM Cloud Data Services offers a portfolio of integrated, best-of-breed cloud data services for developers focused on mobile computing and analytics use cases.
Oct. 4, 2015 01:00 PM EDT Reads: 509
As enterprises capture more and more data of all types – structured, semi-structured, and unstructured – data discovery requirements for business intelligence (BI), Big Data, and predictive analytics initiatives grow more complex. A company’s ability to become data-driven and compete on analytics depends on the speed with which it can provision their analytics applications with all relevant information. The task of finding data has traditionally resided with IT, but now organizations increasingly turn towards data source discovery tools to find the right data, in context, for business users, d...
Oct. 4, 2015 01:00 PM EDT Reads: 349
SYS-CON Events announced today that ProfitBricks, the provider of painless cloud infrastructure, will exhibit at SYS-CON's 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. ProfitBricks is the IaaS provider that offers a painless cloud experience for all IT users, with no learning curve. ProfitBricks boasts flexible cloud servers and networking, an integrated Data Center Designer tool for visual control over the cloud and the best price/performance value available. ProfitBricks was named one of the coolest Clo...
Oct. 4, 2015 01:00 PM EDT Reads: 663
Clearly the way forward is to move to cloud be it bare metal, VMs or containers. One aspect of the current public clouds that is slowing this cloud migration is cloud lock-in. Every cloud vendor is trying to make it very difficult to move out once a customer has chosen their cloud. In his session at 17th Cloud Expo, Naveen Nimmu, CEO of Clouber, Inc., will advocate that making the inter-cloud migration as simple as changing airlines would help the entire industry to quickly adopt the cloud without worrying about any lock-in fears. In fact by having standard APIs for IaaS would help PaaS expl...
Oct. 4, 2015 12:30 PM EDT Reads: 367
Learn how IoT, cloud, social networks and last but not least, humans, can be integrated into a seamless integration of cooperative organisms both cybernetic and biological. This has been enabled by recent advances in IoT device capabilities, messaging frameworks, presence and collaboration services, where devices can share information and make independent and human assisted decisions based upon social status from other entities. In his session at @ThingsExpo, Michael Heydt, founder of Seamless Thingies, will discuss and demonstrate how devices and humans can be integrated from a simple clust...
Oct. 4, 2015 12:00 PM EDT Reads: 599
“The Internet of Things transforms the way organizations leverage machine data and gain insights from it,” noted Splunk’s CTO Snehal Antani, as Splunk announced accelerated momentum in Industrial Data and the IoT. The trend is driven by Splunk’s continued investment in its products and partner ecosystem as well as the creativity of customers and the flexibility to deploy Splunk IoT solutions as software, cloud services or in a hybrid environment. Customers are using Splunk® solutions to collect and correlate data from control systems, sensors, mobile devices and IT systems for a variety of Ind...
Oct. 4, 2015 11:45 AM EDT Reads: 548
As more and more data is generated from a variety of connected devices, the need to get insights from this data and predict future behavior and trends is increasingly essential for businesses. Real-time stream processing is needed in a variety of different industries such as Manufacturing, Oil and Gas, Automobile, Finance, Online Retail, Smart Grids, and Healthcare. Azure Stream Analytics is a fully managed distributed stream computation service that provides low latency, scalable processing of streaming data in the cloud with an enterprise grade SLA. It features built-in integration with Azur...
Oct. 4, 2015 11:00 AM EDT Reads: 705
Apps and devices shouldn't stop working when there's limited or no network connectivity. Learn how to bring data stored in a cloud database to the edge of the network (and back again) whenever an Internet connection is available. In his session at 17th Cloud Expo, Bradley Holt, Developer Advocate at IBM Cloud Data Services, will demonstrate techniques for replicating cloud databases with devices in order to build offline-first mobile or Internet of Things (IoT) apps that can provide a better, faster user experience, both offline and online. The focus of this talk will be on IBM Cloudant, Apa...
Oct. 4, 2015 11:00 AM EDT Reads: 341
You have your devices and your data, but what about the rest of your Internet of Things story? Two popular classes of technologies that nicely handle the Big Data analytics for Internet of Things are Apache Hadoop and NoSQL. Hadoop is designed for parallelizing analytical work across many servers and is ideal for the massive data volumes you create with IoT devices. NoSQL databases such as Apache HBase are ideal for storing and retrieving IoT data as “time series data.”
Oct. 4, 2015 10:45 AM EDT Reads: 349
SYS-CON Events announced today that HPM Networks will exhibit at the 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. For 20 years, HPM Networks has been integrating technology solutions that solve complex business challenges. HPM Networks has designed solutions for both SMB and enterprise customers throughout the San Francisco Bay Area.
Oct. 4, 2015 09:00 AM EDT Reads: 540
Mobile messaging has been a popular communication channel for more than 20 years. Finnish engineer Matti Makkonen invented the idea for SMS (Short Message Service) in 1984, making his vision a reality on December 3, 1992 by sending the first message ("Happy Christmas") from a PC to a cell phone. Since then, the technology has evolved immensely, from both a technology standpoint, and in our everyday uses for it. Originally used for person-to-person (P2P) communication, i.e., Sally sends a text message to Betty – mobile messaging now offers tremendous value to businesses for customer and empl...
Oct. 4, 2015 08:30 AM EDT Reads: 150
Organizations already struggle with the simple collection of data resulting from the proliferation of IoT, lacking the right infrastructure to manage it. They can't only rely on the cloud to collect and utilize this data because many applications still require dedicated infrastructure for security, redundancy, performance, etc. In his session at 17th Cloud Expo, Emil Sayegh, CEO of Codero Hosting, will discuss how in order to resolve the inherent issues, companies need to combine dedicated and cloud solutions through hybrid hosting – a sustainable solution for the data required to manage I...
Oct. 4, 2015 08:00 AM EDT Reads: 379
SYS-CON Events announced today that MobiDev, a software development company, will exhibit at the 17th International Cloud Expo®, which will take place November 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. MobiDev is a software development company with representative offices in Atlanta (US), Sheffield (UK) and Würzburg (Germany); and development centers in Ukraine. Since 2009 it has grown from a small group of passionate engineers and business managers to a full-scale mobile software company with over 150 developers, designers, quality assurance engineers, project manage...
Oct. 4, 2015 04:00 AM EDT Reads: 659
The broad selection of hardware, the rapid evolution of operating systems and the time-to-market for mobile apps has been so rapid that new challenges for developers and engineers arise every day. Security, testing, hosting, and other metrics have to be considered through the process. In his session at Big Data Expo, Walter Maguire, Chief Field Technologist, HP Big Data Group, at Hewlett-Packard, will discuss the challenges faced by developers and a composite Big Data applications builder, focusing on how to help solve the problems that developers are continuously battling.
Oct. 4, 2015 04:00 AM EDT Reads: 322
SYS-CON Events announced today that Cloud Raxak has been named “Media & Session Sponsor” of SYS-CON's 17th Cloud Expo, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. Raxak Protect automates security compliance across private and public clouds. Using the SaaS tool or managed service, developers can deploy cloud apps quickly, cost-effectively, and without error.
Oct. 3, 2015 01:15 PM EDT Reads: 570
Who are you? How do you introduce yourself? Do you use a name, or do you greet a friend by the last four digits of his social security number? Assuming you don’t, why are we content to associate our identity with 10 random digits assigned by our phone company? Identity is an issue that affects everyone, but as individuals we don’t spend a lot of time thinking about it. In his session at @ThingsExpo, Ben Klang, Founder & President of Mojo Lingo, will discuss the impact of technology on identity. Should we federate, or not? How should identity be secured? Who owns the identity? How is identity ...
Oct. 3, 2015 11:00 AM EDT Reads: 396
SYS-CON Events announced today that Solgeniakhela will exhibit at SYS-CON's 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. Solgeniakhela is the global market leader in Cloud Collaboration and Cloud Infrastructure software solutions. Designed to “Bridge the Gap” between Personal and Professional Social, Mobile and Cloud user experiences, our solutions help large and medium-sized organizations dramatically improve productivity, reduce collaboration costs, and increase the overall enterprise value by bringing ...
Oct. 2, 2015 10:00 PM EDT Reads: 537
Sensors and effectors of IoT are solving problems in new ways, but small businesses have been slow to join the quantified world. They’ll need information from IoT using applications as varied as the businesses themselves. In his session at @ThingsExpo, Roger Meike, Distinguished Engineer, Director of Technology Innovation at Intuit, will show how IoT manufacturers can use open standards, public APIs and custom apps to enable the Quantified Small Business. He will use a Raspberry Pi to connect sensors to web services, and cloud integration to connect accounting and data, providing a Bluetooth...
Oct. 2, 2015 03:30 PM EDT Reads: 337
SYS-CON Events announced today that Micron Technology, Inc., a global leader in advanced semiconductor systems, will exhibit at the 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. Micron’s broad portfolio of high-performance memory technologies – including DRAM, NAND and NOR Flash – is the basis for solid state drives, modules, multichip packages and other system solutions. Backed by more than 35 years of technology leadership, Micron's memory solutions enable the world's most innovative computing, consumer,...
Oct. 2, 2015 07:00 AM EDT Reads: 553
Nowadays, a large number of sensors and devices are connected to the network. Leading-edge IoT technologies integrate various types of sensor data to create a new value for several business decision scenarios. The transparent cloud is a model of a new IoT emergence service platform. Many service providers store and access various types of sensor data in order to create and find out new business values by integrating such data.
Oct. 1, 2015 02:30 PM EDT Reads: 394