|By Eugene Letuchy||
|August 6, 2008 09:00 AM EDT||
One of the things I like most about working at Facebook is the ability to launch products that are (almost) immediately used by millions of people. Unlike a three-guys-in-a-garage startup, we don't have the luxury of scaling out infrastructure to keep pace with user growth; when your feature's userbase will go from 0 to 70 million practically overnight, scalability has to be baked in from the start. The project I'm currently working on, Facebook Chat, offered a nice set of software engineering challenges.
Real-time presence notification
The most resource-intensive operation performed in a chat system is not sending messages. It is rather keeping each online user aware of the online-idle-offline states of their friends, so that conversations can begin.
The naive implementation of sending a notification to all friends whenever a user comes online or goes offline has a worst case cost of O(average friendlist size * peak users * churn rate) messages/second, where churn rate is the frequency with which users come online and go offline, in events/second. This is wildly inefficient to the point of being untenable, given that the average number of friends per user is measured in the hundreds, and the number of concurrent users during peak site usage is on the order of several millions.
Surfacing connected users' idleness greatly enhances the chat user experience but further compounds the problem of keeping presence information up-to-date. Each Facebook Chat user now needs to be notified whenever one of his/her friends
(a) takes an action such as sending a chat message or loads a Facebook page (if tracking idleness via a last-active timestamp) or
(b) transitions between idleness states (if representing idleness as a state machine with states like "idle-for-1-minute", "idle-for-2-minutes", "idle-for-5-minutes", "idle-for-10-minutes", etc.).
Note that approach (a) changes the sending a chat message / loading a Facebook page from a one-to-one communication into a multicast to all online friends, while approach (b) ensures that users who are neither chatting nor browsing Facebook are nonetheless generating server load.
Having a large-number of long-running concurrent requests makes the Apache part of the standard LAMP stack a dubious implementation choice. Even without accounting for the sizeable overhead of spawning an OS process that, on average, twiddles its thumbs for a minute before reporting that no one has sent the user a message, the waiting time could be spent servicing 60-some requests for regular Facebook pages. The result of running out of Apache processes over the entire Facebook web tier is not pretty, nor is the dynamic configuration of the Apache process limits enjoyable.
Distribution, Isolation, and Failover
Fault tolerance is a desirable characteristic of any big system: if an error happens, the system should try its best to recover without human intervention before giving up and informing the user. The results of inevitable programming bugs, hardware failures, et al., should be hidden from the user as much as possible and isolated from the rest of the system.
The way this is typically accomplished in a web application is by separating the model and the view: data is persisted in a database (perhaps with a separate in-memory cache), with each short-lived request retrieving only the parts relevant to that request. Because the data is persisted, a failed read request can be re-attempted. Cache misses and database failure can be detected by the non-database layers and either reported to the user or worked around using replication.
While this architecture works pretty well in general, it isn't as successful in a chat application due to the high volume of long-lived requests, the non-relational nature of the data involved, and the statefulness of each request.
For Facebook Chat, we rolled our own subsystem for logging chat messages (in C++) as well as an epoll-driven web server (in Erlang) that holds online users' conversations in-memory and serves the long-polled HTTP requests. Both subsystems are clustered and partitioned for reliability and efficient failover. Why Erlang? In short, because the problem domain fits Erlang like a glove. Erlang is a functional concurrency-oriented language with extremely low-weight user-space "processes", share-nothing message-passing semantics, built-in distribution, and a "crash and recover" philosophy proven by two decades of deployment on large soft-realtime production systems.
Glueing with Thrift
The secret for going from zero to seventy million users overnight is to avoid doing it all in one fell swoop. We chose to simulate the impact of many real users hitting many machines by means of a "dark launch" period in which Facebook pages would make connections to the chat servers, query for presence information and simulate message sends without a single UI element drawn on the page. With the "dark launch" bugs fixed, we hope that you enjoy Facebook Chat now that the UI lights have been turned on.
Apache Hadoop is emerging as a distributed platform for handling large and fast incoming streams of data. Predictive maintenance, supply chain optimization, and Internet-of-Things analysis are examples where Hadoop provides the scalable storage, processing, and analytics platform to gain meaningful insights from granular data that is typically only valuable from a large-scale, aggregate view. One architecture useful for capturing and analyzing streaming data is the Lambda Architecture, represent...
Mar. 25, 2017 08:45 PM EDT Reads: 5,929
My team embarked on building a data lake for our sales and marketing data to better understand customer journeys. This required building a hybrid data pipeline to connect our cloud CRM with the new Hadoop Data Lake. One challenge is that IT was not in a position to provide support until we proved value and marketing did not have the experience, so we embarked on the journey ourselves within the product marketing team for our line of business within Progress. In his session at @BigDataExpo, Sum...
Mar. 25, 2017 08:45 PM EDT Reads: 2,746
Things are changing so quickly in IoT that it would take a wizard to predict which ecosystem will gain the most traction. In order for IoT to reach its potential, smart devices must be able to work together. Today, there are a slew of interoperability standards being promoted by big names to make this happen: HomeKit, Brillo and Alljoyn. In his session at @ThingsExpo, Adam Justice, vice president and general manager of Grid Connect, will review what happens when smart devices don’t work togethe...
Mar. 25, 2017 06:15 PM EDT Reads: 2,560
SYS-CON Events announced today that Ocean9will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Ocean9 provides cloud services for Backup, Disaster Recovery (DRaaS) and instant Innovation, and redefines enterprise infrastructure with its cloud native subscription offerings for mission critical SAP workloads.
Mar. 25, 2017 05:15 PM EDT Reads: 1,925
In his session at @ThingsExpo, Eric Lachapelle, CEO of the Professional Evaluation and Certification Board (PECB), will provide an overview of various initiatives to certifiy the security of connected devices and future trends in ensuring public trust of IoT. Eric Lachapelle is the Chief Executive Officer of the Professional Evaluation and Certification Board (PECB), an international certification body. His role is to help companies and individuals to achieve professional, accredited and worldw...
Mar. 25, 2017 04:00 PM EDT Reads: 489
SYS-CON Events announced today that Technologic Systems Inc., an embedded systems solutions company, will exhibit at SYS-CON's @ThingsExpo, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Technologic Systems is an embedded systems company with headquarters in Fountain Hills, Arizona. They have been in business for 32 years, helping more than 8,000 OEM customers and building over a hundred COTS products that have never been discontinued. Technologic Systems’ pr...
Mar. 25, 2017 01:45 PM EDT Reads: 3,284
SYS-CON Events announced today that CA Technologies has been named “Platinum Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY, and the 21st International Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. CA Technologies helps customers succeed in a future where every business – from apparel to energy – is being rewritten by software. From ...
Mar. 25, 2017 01:30 PM EDT Reads: 1,705
The taxi industry never saw Uber coming. Startups are a threat to incumbents like never before, and a major enabler for startups is that they are instantly “cloud ready.” If innovation moves at the pace of IT, then your company is in trouble. Why? Because your data center will not keep up with frenetic pace AWS, Microsoft and Google are rolling out new capabilities In his session at 20th Cloud Expo, Don Browning, VP of Cloud Architecture at Turner, will posit that disruption is inevitable for c...
Mar. 25, 2017 01:15 PM EDT Reads: 2,039
SYS-CON Events announced today that Cloudistics, an on-premises cloud computing company, has been named “Bronze Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Cloudistics delivers a complete public cloud experience with composable on-premises infrastructures to medium and large enterprises. Its software-defined technology natively converges network, storage, compute, virtualization, and management into a ...
Mar. 25, 2017 12:45 PM EDT Reads: 1,874
The explosion of new web/cloud/IoT-based applications and the data they generate are transforming our world right before our eyes. In this rush to adopt these new technologies, organizations are often ignoring fundamental questions concerning who owns the data and failing to ask for permission to conduct invasive surveillance of their customers. Organizations that are not transparent about how their systems gather data telemetry without offering shared data ownership risk product rejection, regu...
Mar. 25, 2017 12:30 PM EDT Reads: 5,058
SYS-CON Events announced today that Loom Systems will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Founded in 2015, Loom Systems delivers an advanced AI solution to predict and prevent problems in the digital business. Loom stands alone in the industry as an AI analysis platform requiring no prior math knowledge from operators, leveraging the existing staff to succeed in the digital era. With offices in S...
Mar. 25, 2017 12:30 PM EDT Reads: 1,171
SYS-CON Events announced today that Interoute, owner-operator of one of Europe's largest networks and a global cloud services platform, has been named “Bronze Sponsor” of SYS-CON's 20th Cloud Expo, which will take place on June 6-8, 2017 at the Javits Center in New York, New York. Interoute is the owner-operator of one of Europe's largest networks and a global cloud services platform which encompasses 12 data centers, 14 virtual data centers and 31 colocation centers, with connections to 195 add...
Mar. 25, 2017 12:00 PM EDT Reads: 938
SYS-CON Events announced today that SoftLayer, an IBM Company, has been named “Gold Sponsor” of SYS-CON's 18th Cloud Expo, which will take place on June 7-9, 2016, at the Javits Center in New York, New York. SoftLayer, an IBM Company, provides cloud infrastructure as a service from a growing number of data centers and network points of presence around the world. SoftLayer’s customers range from Web startups to global enterprises.
Mar. 25, 2017 11:15 AM EDT Reads: 1,537
SYS-CON Events announced today that CrowdReviews.com has been named “Media Sponsor” of SYS-CON's 20th International Cloud Expo, which will take place on June 6–8, 2017, at the Javits Center in New York City, NY. CrowdReviews.com is a transparent online platform for determining which products and services are the best based on the opinion of the crowd. The crowd consists of Internet users that have experienced products and services first-hand and have an interest in letting other potential buyers...
Mar. 25, 2017 11:00 AM EDT Reads: 3,552
SYS-CON Events announced today that T-Mobile will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. As America's Un-carrier, T-Mobile US, Inc., is redefining the way consumers and businesses buy wireless services through leading product and service innovation. The Company's advanced nationwide 4G LTE network delivers outstanding wireless experiences to 67.4 million customers who are unwilling to compromise on ...
Mar. 25, 2017 10:45 AM EDT Reads: 2,091
SYS-CON Events announced today that Infranics will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Since 2000, Infranics has developed SysMaster Suite, which is required for the stable and efficient management of ICT infrastructure. The ICT management solution developed and provided by Infranics continues to add intelligence to the ICT infrastructure through the IMC (Infra Management Cycle) based on mathemat...
Mar. 25, 2017 10:00 AM EDT Reads: 2,925
SYS-CON Events announced today that SD Times | BZ Media has been named “Media Sponsor” of SYS-CON's 20th International Cloud Expo, which will take place on June 6–8, 2017, at the Javits Center in New York City, NY. BZ Media LLC is a high-tech media company that produces technical conferences and expositions, and publishes a magazine, newsletters and websites in the software development, SharePoint, mobile development and commercial UAV markets.
Mar. 25, 2017 09:15 AM EDT Reads: 4,218
SYS-CON Events announced today that Telecom Reseller has been named “Media Sponsor” of SYS-CON's 20th International Cloud Expo, which will take place on June 6–8, 2017, at the Javits Center in New York City, NY. Telecom Reseller reports on Unified Communications, UCaaS, BPaaS for enterprise and SMBs. They report extensively on both customer premises based solutions such as IP-PBX as well as cloud based and hosted platforms.
Mar. 25, 2017 08:30 AM EDT Reads: 2,049
"I think that everyone recognizes that for IoT to really realize its full potential and value that it is about creating ecosystems and marketplaces and that no single vendor is able to support what is required," explained Esmeralda Swartz, VP, Marketing Enterprise and Cloud at Ericsson, in this SYS-CON.tv interview at @ThingsExpo, held June 7-9, 2016, at the Javits Center in New York City, NY.
Mar. 25, 2017 08:00 AM EDT Reads: 4,078
In his keynote at @ThingsExpo, Chris Matthieu, Director of IoT Engineering at Citrix and co-founder and CTO of Octoblu, focused on building an IoT platform and company. He provided a behind-the-scenes look at Octoblu’s platform, business, and pivots along the way (including the Citrix acquisition of Octoblu).
Mar. 25, 2017 08:00 AM EDT Reads: 14,025