@DXWorldExpo: Blog Post

Using Big Data to Improve Online Buying and Selling | @BigDataExpo #BigData

How Etsy uses big data for ecommerce to put buyers and sellers in the best light

The next BriefingsDirect big data case study discussion explores how Etsy, a global e-commerce site focused on handmade and vintage items, uses data science to improve buyers’ and sellers’ discovery and shopping experiences.

We'll learn how mining big data at speed and volume helps Etsy define and distribute top trends, and allows those with specific interests to find items that will best appeal to them.

To learn more about leveraging big data in the e-commerce space, please join Chris Bohn aka “CB,” a Senior Data Engineer at Etsy, based in Brooklyn, New York. The discussion is moderated by me, Dana Gardner, Principal Analyst at Interarbor Solutions.

Here are some excerpts:

Gardner: Tell us about Etsy for those that aren’t familiar with it. I've heard it described as being like going through your grandmother's basement. Is that fair?

CB: Well, I hope it’s not as musty and dusty as my grandmother’s basement. The best way to describe it is that Etsy is a marketplace. We create a marketplace for sellers of handcrafted goods and the people who want to buy those goods.

We've been around for 10 years. We're the leader in this space and we went public in 2015. Just some quick little metrics. The total value of the merchandise sold on Etsy in 2014 was about $1.93 billion. We have about 1.5 million sellers and about 22 million buyers.

Gardner: That's an awful lot of stuff that’s being moved around. What does the big data and analytics role bring to the table?

CB: It’s all about understanding more about our customers, both buyers and sellers. We want to know more about them and make the buying experience easier for them. We want them to be able to find products more easily. Too much choice sometimes is no choice. You want to get them to the product they want to buy as quickly as possible.

We also want to know how people are different in their shopping habits across the geography of the world. There are some people in different countries that transact differently than we do here in the States, and big data lets us get some insight into that.

Gardner: Is this insight derived primarily from what they do via their clickstreams, what they're doing online? Or are there other ways that you can determine insights that you can then share among yourselves and also back to your users?

Data architecture

CB: I'll describe our data architecture a little bit. When Etsy started out, we had a monolithic Postgres database and we threw everything in there. We had listings, users, sellers, buyers, conversations, and forums. It was all in there, but we outgrew that really quickly, and so the solution to that was to shard horizontally.

Now we have many hundreds of sharded MySQL servers, horizontal. Then we decided that we needed to do some analytics on this stuff. So we scratched our heads. This was about five years ago. So we said, "Let’s just set up a Postgres server and we'll copy all the data from these shards into the Postgres server that we call BI server." And we got that done.

Then, we kind of scratched our heads and said, "Wait a minute. We just came full circle. We started with a monolithic database, then we went sharded, and now all the data is back monolithic."
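The round trip CB describes, splitting rows across shards by key and then copying every shard back into one store, can be sketched in a few lines. This is a minimal illustration only: the shard count, the CRC32 routing rule, and the names `shard_for`, `write_row`, and `copy_to_bi` are assumptions made for the example, not Etsy's actual scheme.

```python
import zlib

NUM_SHARDS = 4  # hypothetical; the interview only says "many hundreds"


def shard_for(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a user id to a shard index with a stable hash (assumed rule)."""
    return zlib.crc32(str(user_id).encode("utf-8")) % num_shards


def write_row(shards: list, user_id: int, row: dict) -> None:
    """Store the row only on the shard that owns this user id."""
    shards[shard_for(user_id, len(shards))][user_id] = row


def copy_to_bi(shards: list) -> dict:
    """Copy every shard back into one store, as the BI server did."""
    combined = {}
    for shard in shards:
        combined.update(shard)
    return combined


# Each shard is modeled as a plain dict keyed by user id.
shards = [{} for _ in range(NUM_SHARDS)]
for uid in range(10):
    write_row(shards, uid, {"user_id": uid})

bi_store = copy_to_bi(shards)
```

Every row lives on exactly one shard, so `bi_store` ends up holding all ten rows again, which is the "full circle" back to a monolithic copy.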

It didn't perform well, because it's hard to get the volume of big data into that database. A relational database like Postgres just isn’t designed to do analytic-type queries. Those are big aggregations, and Postgres, even though it is a great relational database, is really tailored for single-record lookup.

So we decided to get something else going on. About three-and-a-half years ago, we set about searching for the replacement to our monolithic business-intelligence (BI) database and looked at what the landscape was. There were a number of very worthy products out there, but we eventually settled on HPE Vertica for a number of reasons.

One of those is that it derives, in large part, from Postgres. Postgres has a Berkeley license. So companies could take it private. They can take that code and they don’t have to republish it out to the community, unlike other types of open source copyright agreements.

So we found out that the parser was right out of Postgres and all the date handling and typecasting stuff that is usually different from database to database was exactly spot-on the same between Vertica and Postgres. Also, data ingestion via the copy command, which is the best way to bulk-load data, is exactly the same in both, and it’s the same format.

We said, "This looks good, because we can get the data in quickly, and queries will probably not have to be edited much." So that's where we went. We experimented with it and we found exactly that. Queries would run unchanged, except they ran a lot faster and we were able to get the data in easily.

We built some data replication tools to get data from the shards and also some legacy Postgres databases that we had lying around for billing and got all that data into HPE Vertica.

Then, we built some tools that allowed our analysts to bring over custom tables they had created on that old BI machine. We were able to get up to speed really quickly with Vertica, and boom, we had an analytics database that we were able to hit the ground running with.

Gardner: And is the challenge for you about the variety of that data? Is it about the velocity that you need to move it in and out? Is it about simply volume that you just have so much of it, or a little of some of those?

All of the above

CB: It’s really all of those problems. Velocity-wise, we want our replication system to be eventually consistent, and we want it to be as near real-time as possible. There is a challenge in that, because you really start to get into micro-batching data in.

This is where we ended up having to pay off some technical debt, because years ago, disk storage was fairly pricey, and databases were designed to minimize storage. Practices grew up around that fact. So data would get deleted and updated. That's the policy that the early originators of Etsy followed when they designed the first database for it.

Eventually what we have got now is lossy data. If someone changes the description or the tags that are associated with a listing, the old ones go away. They are lost forever. And that's too bad, because if we had kept those, we could do analytics on a product that wasn’t selling for a long time and all of a sudden started selling. What changed? We would love to do analytics on that, but we can't do it because of the loss of data. That's one thing that we learned in this whole process.
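The alternative to overwriting is to append each change as a new version, so that the "what changed?" question stays answerable. The sketch below is a hypothetical illustration: the `ListingHistory` class, its method names, and the sample tags are invented for the example, and it assumes versions are appended in chronological order.

```python
from dataclasses import dataclass, field


@dataclass
class ListingHistory:
    """Keeps every version of a listing's tags instead of overwriting them."""

    # Each entry is (day the change was made, tags after the change).
    versions: list = field(default_factory=list)

    def update_tags(self, day: int, tags: list) -> None:
        # Append rather than replace, so older tags are never lost.
        self.versions.append((day, tuple(tags)))

    def tags_on(self, day: int) -> tuple:
        """Return the tags in effect on the given day (empty if none yet)."""
        current = ()
        for changed_on, tags in self.versions:
            if changed_on <= day:
                current = tags
        return current


history = ListingHistory()
history.update_tags(day=1, tags=["scarf", "wool"])
history.update_tags(day=30, tags=["scarf", "wool", "handmade gift"])
```

With both versions retained, `history.tags_on(10)` and `history.tags_on(30)` can be compared against sales on those days, which is the analysis CB says the original update-in-place design made impossible.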

But getting back to your question here about velocity and then also the volume of data, we have a lot of data from our production databases. We need to get it all into Vertica. We also have a lot of clickstream data. Etsy is a top 50 website, I believe, for traffic, and that generates a lot of clicks and that all gets put into Vertica.

We run big batch jobs every night to load that. It's important that we have that, because one of the biggest things that our analysts like to do is correlate clickstream data with our production data. Clickstream data doesn't have a lot of information about the user who is doing those clicks. It’s just information about their path through the site at that time.

To really get a value-add on that, you want to be able to join on your user details tables, so that you can know where this person lives, how old they are, or their buying history in the past. You need to be able to join those, too, and we do that in HPE Vertica.
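The kind of join CB describes, enriching anonymous click rows with user details and then aggregating, looks roughly like the following. This is only a sketch: it uses Python's built-in `sqlite3` in place of Vertica, and the `users` and `clicks` tables, their columns, and the sample rows are all made up for the illustration.

```python
import sqlite3

# In-memory SQLite stands in for the analytics database (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE clicks (user_id INTEGER, page TEXT);
    INSERT INTO users VALUES (1, 'US'), (2, 'DE'), (3, 'US');
    INSERT INTO clicks VALUES
        (1, 'listing'), (1, 'cart'), (2, 'listing'), (3, 'listing');
    """
)

# Join each click to its user's details, then aggregate per country.
clicks_by_country = conn.execute(
    """
    SELECT u.country, COUNT(*) AS clicks
    FROM clicks AS c
    JOIN users AS u ON u.user_id = c.user_id
    GROUP BY u.country
    ORDER BY u.country
    """
).fetchall()
```

The click rows alone carry only a user id and a page; the country dimension appears only after the join, which is why both data sets need to live in the same database.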

Gardner: CB, give us a sense about the paybacks, when you do this well, when you've architected, and when you've paid your technical debts, as you put it. How are your analysts able to leverage this in order to make your business better and make the experience of your users better?

CB: When we first installed Vertica, it was just a small group of analysts that were using it. Our analytics program was fairly new, but it just exploded. Everybody started to jump in on it, because all of a sudden, there was a database with which you could write good SQL, with a rich SQL engine, and get fantastic results quickly.

The results weren’t that different from what we were getting in the past, but they were just coming to us so fast, the cycle of getting information was greatly shortened. Getting result sets was so much better that it was like a whole different world. It’s like the Pony Express versus email. That’s the kind of difference it was. So everybody started jumping in on it.

More dashboards

Engineers who were adding new facets of the product wanted to have dashboards, more or less real time, so they could monitor what the thing was doing. For example, we added postage to Etsy, so that our sellers can have preprinted labels. We'd like to monitor that in real time to see how this is going. Is it going well or what?

That was something that took a long time to analyze before we got into big-data analytics. All of a sudden, we had Vertica and we could do that for them, and that pattern has repeated with other groups in the company.

We're doing different aspects of the site. All of a sudden, you have your marketing people, your finance people, saying, "Wow, I can run these financial reports that used to take days in literally seconds." There was a lot of demand. Etsy has about 750 employees and we have way more than 200 Vertica accounts. That shows you how popular it is.

One anecdotal story. I've been wanting to update Vertica for the past couple of months. The woman who runs our analytics team said, "Don't you dare. I have to run Q2 numbers. Everybody is working on this stuff. You have to wait until this certain week to be able to do that." It’s not just HPE Vertica, but big data is now relied on for so many things in the company.

Gardner: So the technology led to the culture. Many times we think it's the other way around, but having that ability to do those easy SQL queries and get information opened up people's imagination, but it sounds like it has gone beyond that. You have a data-driven company now.

CB: That's an astute observation. You're right. This is technology that has driven the culture. It's really changed the way people do their job at Etsy. And I hear that elsewhere also, just talking to other companies and stuff. It really has been impactful.

Gardner: Just for the sake of those of our readers who are on the operations side, how do you support your data infrastructure? Are you thinking about cloud? Are you on-prem? Are you split between different data centers? How does that work?

CB: I have some interesting data points there for you. Five-plus years ago, we started doing Hadoop stuff, and we started out spinning up Hadoop in Amazon Web Services (AWS).

We would run nightly jobs. We collected all of the search terms that were used and buying patterns and we fed these into MapReduce jobs. The output from that then went into MATLAB, and we would get a set of rules out of that, that then would drive our search engine, basically improving search.
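The shape of such a job can be shown with a toy, single-process version of the map, shuffle, and reduce phases. This is a simplified sketch, not Hadoop code: the function names and the sample queries are hypothetical, and a real job would distribute each phase across many nodes.

```python
from collections import defaultdict


def map_phase(query: str) -> list:
    """Emit a (term, 1) pair for every word in a search query."""
    return [(term, 1) for term in query.lower().split()]


def shuffle(pairs: list) -> dict:
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(term: str, counts: list) -> tuple:
    """Sum the counts collected for one term."""
    return term, sum(counts)


# Hypothetical search log for one night.
queries = ["Wool scarf", "vintage lamp", "wool hat"]
pairs = [pair for query in queries for pair in map_phase(query)]
term_counts = dict(reduce_phase(t, c) for t, c in shuffle(pairs).items())
```

Here `term_counts` records that "wool" appeared twice and every other term once; output of this kind is what would then feed a downstream rules step like the one CB mentions.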

Commodity hardware

We did that for a while and then realized we were spending a lot of money in AWS. It was many thousands of dollars a month. We said, "Wait a minute. This is crazy. We could actually buy our own servers. This is commodity hardware that this can run on, and we can run this in our own data center. We will get the data in faster, because there are bigger pipes." So that's what we did.

We created what we call Etsydoop, which has got 200+ nodes and we actually save a lot of money doing it that way. That's how we got into it.

We really have a bifurcated data analytics, big-data system. On the one hand, we have Vertica for doing ad hoc queries, because the analysts and the people out there understand SQL and they demand it. But for batch jobs, Hadoop rocks, and it's really, really good for that.

But the tradeoff is that those are hard jobs to write. Even a good engineer is not going to get it right every time, and for most analysts, it's probably a little bit beyond their reach to get down, roll up their sleeves, and get into actual coding and that kind of stuff.

But they're great at SQL, and we want to encourage exploration and discovering new things. We've discovered things about our business just by some of these analysts wildcatting in the database, finding interesting stuff, and then exploring it, and we want to encourage that. That's really important.

Gardner: CB, in getting to understand Etsy a little bit more, I saw that you have something called Top Trends and Etsy Finds, ways that you can help people with affinity for a product or a craft or some interest to pursue that. Did that come about as a result of these technologies that you have put in place, or did they have a set of requirements that they wanted to be able to do this and then went after you to try to accommodate it? How do you pull off that Etsy Finds capability?

CB: A lot of that is cross-architecture. Some of our production data is used to find that. Then, a lot of the hard crunching is done in Vertica to find that. Some of it is MapReduce. There's a whole mix of things that go into that.

I couldn't claim for Etsy Finds, for example, that it’s all big data. There are other things that go in there, but definitely HPE Vertica plays a role in that stuff.

I'll give you another example, fraud. We fingerprint a lot of our users digitally, because we have problems with resellers. These are people who are selling resold mass-produced stuff on Etsy. It's not huge, but it's an annoyance. Those products compete against really quality handmade products that our regular sellers sell in their shops.

Sometimes it’s like a game of Whack-a-Mole. You knock one of these guys down -- sometimes they're from the Far East or other parts of the world -- and as soon as you knock one down, another one pops up. Being able to capture them quickly is really important, and we use Vertica for that. We have a team that works just on that problem.

What's next?

Gardner: Thinking about the future, with this great architecture, with your ability to do things like fraud detection and affinity correlations, what's next? What can you do that will help make Etsy more impactful in its market and make your users more engaged?

CB: The whole idea behind databases and computing in general is just making things faster. When the first punch-card machines came out in the 1930s or whatever, the phone companies could do faster billing, because billing was just getting out of control. That’s where the roots of IBM lie.

As time went by, punch cards were slow and they wanted to go faster. So they developed magnetic tape, and then spinning rust disks. Now, we're into SSDs, the flash drives. And it’s the same way with databases and getting answers. You always want to get answers faster.

We do a lot of A/B testing. We have the ability to set the site so that maybe a small percentage of users get an A path through the site, and the others a B path, and there's control stuff on that. We analyze those results. This is how we test to see if this kind of button works better than this other one. Is the placement right? If we just skip this page, is it easier for someone to buy something?

So we do A/B testing. In the past, we've done it where we had to run the test, gather the data, and then comb through it manually. But now with Vertica, the turnaround time to iterate over each cycle of an A/B test has shrunk dramatically. We get our data from the clickstreams, which go into Vertica, and then the next day, we can run the A/B test results on that.
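Two pieces of that cycle lend themselves to a short sketch: placing a small share of users on the A path in a repeatable way, and comparing conversion rates once the clickstream data has landed. The hashing rule, the 10 percent default, the function names, and the sample events below are all assumptions for illustration; the interview does not describe Etsy's actual mechanism.

```python
import hashlib


def variant_for(user_id: str, experiment: str, percent_in_a: int = 10) -> str:
    """Deterministically place a user on the A or B path for an experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    return "A" if bucket < percent_in_a else "B"


def conversion_rates(events: list) -> dict:
    """Compute purchases / visits per variant from (variant, purchased) pairs."""
    visits = {}
    purchases = {}
    for variant, purchased in events:
        visits[variant] = visits.get(variant, 0) + 1
        purchases[variant] = purchases.get(variant, 0) + int(purchased)
    return {variant: purchases[variant] / visits[variant] for variant in visits}


# Hypothetical one-day sample: (variant shown, whether the visit ended in a purchase).
events = [
    ("A", True), ("A", False),
    ("B", False), ("B", False), ("B", True), ("B", False),
]
rates = conversion_rates(events)
```

Because the assignment depends only on the experiment name and user id, the same user sees the same path on every visit, and the rate comparison can be rerun each day as new data arrives. A real analysis would also check statistical significance before picking a winner.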

The next step is shrinking that even more. One of the themes that’s out there at the various big data conferences is streaming analytics. That's a really big thing. There is a new database out there called PipelineDB, a fork of Postgres. It allows you to create an event stream into Postgres.

You can then create a view and a window on top of that stream. Then you can pump your event data, like your clickstream data, and you can join the data in that window to your regular Postgres tables, which is really great, because we could get A/B information in real time. You set up a one minute turnaround as opposed to one day. I think that’s where a lot of things are going.
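The idea of a window over an event stream can be illustrated without any database at all. The sketch below is plain Python and is not PipelineDB's API: the `SlidingWindowCounter` class and its methods are invented for the example, and it assumes events arrive in timestamp order.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events per key over the most recent `window_seconds`."""

    def __init__(self, window_seconds: int = 60) -> None:
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, key) pairs, oldest first

    def add(self, timestamp: float, key: str) -> None:
        self.events.append((timestamp, key))

    def counts(self, now: float) -> dict:
        # Drop events that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()
        # Tally what remains by key, e.g. by A/B variant.
        result = {}
        for _, key in self.events:
            result[key] = result.get(key, 0) + 1
        return result


window = SlidingWindowCounter(window_seconds=60)
window.add(0.0, "A")
window.add(30.0, "B")
window.add(70.0, "A")
latest = window.counts(now=80.0)
```

At time 80 the event from time 0 has aged out of the 60-second window, so `latest` reflects only the two most recent events. That is the one-minute view CB contrasts with a next-day batch result.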

If you just look at the history of big data, MapReduce started about 10 years ago at Google, and that was batch jobs, overnight runs. Then, we started getting into the columnar stores to make databases like Vertica possible, and it’s really great for aggregation. That kicked it up to the next level.

Another thing is real-time analytics. It’s not going to replace any of these things, just like Vertica didn't replace Hadoop. They're complementary. Real-time streaming analytics will be complementary. So we're continuing to add these tools to our big data toolbox.

Gardner: It has compressed those feedback loops. If we provide that capability to an innovative, creative organization, the technology might drive the culture, and who knows what sort of benefits they will derive from that.

All plugged in

CB: That's very true. You touched earlier on how we do our infrastructure. I'm in data engineering, and we're responsible for making sure that our big databases are healthy and running right. But we also have our operations department. They're working on the actual pipes and hardware and making sure it’s all plugged in. It's tough to get all this stuff working right, but if you have the right people, it can happen.

I mentioned AWS earlier. The reason we were able to move off of that and save money is because we have the people who can do it. When you start using AWS extensively, what you're doing is you are paying for a very high priced but good IT staff at Amazon. If you have got a good IT staff of your own, you're probably going to be able to realize some efficiencies there, and that's really why we moved over. We do it all ourselves.

Gardner: Having it as a core competency might be an important thing moving forward.

CB: Absolutely. You have to stay on top of all this stuff. A lot is made of the word disruption, and you don't go knocking on disruption’s door; it usually knocks on yours. And you had better be agile enough to respond to it.

I'll give you an example that ties back into big data. One of the most disruptive things that has happened to Etsy is the rise of the smartphone. When Etsy started back in 2005, the iPhone wasn't around yet; it was still two years out. Then, it came on the scene, and people realized that this was a suitable device for commerce.

It’s very easy to just be complacent and oblivious to new technologies sneaking up on you. But we started seeing that there was more and more commerce being done on smartphones. We actually fell a little bit behind, as a lot of companies did five years ago. But our management made decisions to invest in mobile, and now 60 percent of our traffic is on mobile. That's turned around in the past two years and it has been pretty amazing.

Big data helps us with that, because we do a lot of crunching of what these mobile devices are doing. Mobile is not the best device maybe for buying stuff because of the form factor, but it is a really good device for managing your store, paying your Etsy bill, and doing that kind of stuff. So we analyzed all that and crunched it in big data.

Gardner: And big data allowed you to know when to make that strategic move and then take advantage of it?

CB: Exactly. There are all sorts of crossover points that happen with technology, and you have to monitor it. You have to understand your business really well to see when certain vectors are happening. If you can pick up on those, you're going to be okay.

More Stories By Dana Gardner

At Interarbor Solutions, we create the analysis and in-depth podcasts on enterprise software and cloud trends that help fuel the social media revolution. As a veteran IT analyst, Dana Gardner moderates discussions and interviews that get to the meat of the hottest technology topics. We define and forecast the business productivity effects of enterprise infrastructure, SOA and cloud advances. Our social media vehicles become conversational platforms, powerfully distributed via the BriefingsDirect Network of online media partners like ZDNet and IT-Director.com. As founder and principal analyst at Interarbor Solutions, Dana Gardner created BriefingsDirect to give online readers and listeners in-depth and direct access to the brightest thought leaders on IT. Our twice-monthly BriefingsDirect Analyst Insights Edition podcasts examine the latest IT news with a panel of analysts and guests. Our sponsored discussions provide a unique, deep-dive focus on specific industry problems and the latest solutions. This podcast equivalent of an analyst briefing session -- made available as a podcast/transcript/blog to any interested viewer and search engine seeker -- breaks the mold on closed knowledge. These informational podcasts jump-start conversational evangelism, drive traffic to lead generation campaigns, and produce strong SEO returns. Interarbor Solutions provides fresh and creative thinking on IT, SOA, cloud and social media strategies based on the power of thoughtful content, made freely and easily available to proactive seekers of insights and information. As a result, marketers and branding professionals can communicate inexpensively with self-qualifying readers/listeners in discrete market segments. BriefingsDirect podcasts hosted by Dana Gardner: Full turnkey planning, moderating, producing, hosting, and distribution via blogs and IT media partners of essential IT knowledge and understanding.
