JULY 08, 2011
Real Time Analytics for Big Data: Facebook's New Realtime Analytics System
In this first post, I'd like to summarize the case study and consider some things that weren't mentioned in the summaries. This will lead to an architecture for building your own Realtime Analytics for Big Data that might be easier to implement, using Facebook's experience as a starting point and guide, as well as the experience gathered through recent work with a few GigaSpaces customers. The second post provides a summary of that new approach, as well as a pattern and a demo for building your own Real Time Analytics system.

The Business Drive for real time analytics: Time is money
The main drive for many of the innovations around realtime analytics has to do with competitiveness and cost, just as with most other advances. For example, during the past few years financial organizations realized that intraday risk analysis of their customers' portfolios translated into increased profit, as they could react faster to profit-and-loss events. The same applies to many online eCommerce and social sites. Knowing what your users are doing on your site in real time, and matching what they do with more targeted information, translates into better conversion rates and better user satisfaction, which means more money in the end. Todd provides similar reasoning to describe the motivation behind Facebook's new system:
Why now? Real time analytics goes mainstream
The massive transition to online and social applications makes it possible to track user patterns like never before. The quality of the data that providers track is closely tied to their business success: for example, e-commerce customers want to know what their friends think about products or services right in the middle of their shopping experience. If sites cannot keep up with their thousands of users in real time, they can lose those customers to sites that can. So while risk analytics in the financial industry was still a fairly small niche of the analytics market, the demand for real-time analytics in social, eCommerce, and SaaS applications brought real-time analytics into mainstream business, under massive load. No one has time for batch processing anymore.

Technology advancement
Newer infrastructures and technologies, such as tera-scale memory, NoSQL, parallel processing platforms, and cloud computing, provide new ways to process massive amounts of data in less time and at lower cost. Most current analytics systems weren't built to take advantage of these new technologies and capabilities, so they haven't been able to adapt to real-time requirements without massive changes.

Hadoop Map/Reduce doesn't fit the real time business
One of the hottest trends in the analytics space is the use of Hadoop, yet Hadoop's Map/Reduce model was designed for batch processing rather than low-latency work. Strong evidence for that can be seen in the moves of those who were known as the "poster children" for Map/Reduce: Google and Yahoo both moved away from it. Google has moved to Google Percolator. It is therefore not surprising that Facebook reached the same conclusion with regard to Hadoop.
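To make the batch-versus-realtime distinction concrete, here is a minimal sketch contrasting a Map/Reduce-style recount over the full event log with an incremental counter that is updated as each event arrives. All names here are my own illustration, not from any of the systems discussed:

```java
// Illustrative only: names and structure are my own, not from the post.
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class CounterStyles {

    // Batch style (Map/Reduce-like): recompute every counter from the
    // full event log. Cost grows with the size of the log, and results
    // are only as fresh as the last batch run.
    static Map<String, Long> batchCount(List<String> eventLog) {
        Map<String, Long> counts = new ConcurrentHashMap<>();
        for (String event : eventLog) {
            counts.merge(event, 1L, Long::sum);
        }
        return counts;
    }

    // Incremental style: update the affected counter as each event
    // arrives. Cost per event is O(1), so counters are always current.
    private final ConcurrentHashMap<String, AtomicLong> live = new ConcurrentHashMap<>();

    void onEvent(String event) {
        live.computeIfAbsent(event, k -> new AtomicLong()).incrementAndGet();
    }

    long current(String event) {
        AtomicLong counter = live.get(event);
        return counter == null ? 0L : counter.get();
    }
}
```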
Facebook's Real Time Analytics system
According to Todd, Facebook evaluated a fairly long list of alternatives, including MySQL, an in-memory cache, and Hadoop Hive/Map/Reduce. I highly recommend reading the full details in Todd's post. I tried to outline Facebook's architecture based on Todd's summary; the flow looks roughly like this:

- Every user activity triggers an asynchronous event through AJAX; this event is logged to a tail log using Scribe.
- Ptail is used to aggregate the different individual logs into a consolidated stream.
- The stream is batched into 1.5-second groupings by Puma, which stores each event batch in HBase.
- The real-time logs are kept for a certain period of time and then cleared from the system.

Obviously this description is a fairly simplistic view; the full details are provided in the original post.
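To make the flow concrete, here is a hypothetical, much-simplified sketch of the log-tail-batch portion of the pipeline: events are appended to a stream, drained every 1.5 seconds, and flushed to the store as a single batch. All class and method names are my own, not Facebook's code:

```java
// Illustrative only: a toy stand-in for the Scribe -> Ptail -> Puma flow,
// not Facebook's actual code. Class and method names are my own.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class MiniPuma {
    // Stands in for the consolidated stream that Ptail produces.
    private final BlockingQueue<String> stream = new LinkedBlockingQueue<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Stands in for Scribe: every user action is logged asynchronously.
    public void logEvent(String event) {
        stream.offer(event);
    }

    // Stands in for Puma: every 1.5 seconds, drain whatever arrived in
    // the window and write it to the store as a single batch.
    public void start() {
        scheduler.scheduleAtFixedRate(() -> {
            List<String> batch = new ArrayList<>();
            stream.drainTo(batch);
            if (!batch.isEmpty()) {
                flushToStore(batch); // in Facebook's case, HBase counters
            }
        }, 1500, 1500, TimeUnit.MILLISECONDS);
    }

    private void flushToStore(List<String> batch) {
        // Placeholder: a real implementation would increment counters
        // in the backing store here.
        System.out.println("flushed " + batch.size() + " events");
    }
}
```

Presumably the point of the 1.5-second batching is to trade a small amount of freshness for far fewer, larger writes against the storage tier.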
Evaluating the Facebook Architecture
Facebook's reasoning behind its technology evaluation seems sound for the most part, although there are some obvious concerns. Two things caught my attention:

Memory Counters
It sounds to me as if the evaluation was based on memcached. By default, memcached is not highly available, so a failure would result in loss of data. Obviously, that doesn't apply to other memory-based solutions such as In-Memory Data Grids (for example, GigaSpaces).

Cassandra vs HBase
The choice of HBase over Cassandra is notable, given that Cassandra was originally developed at Facebook.
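One reason HBase suits this counter-heavy workload is its built-in atomic counter support. Here is a minimal sketch, assuming an existing "metrics" table with a "counters" column family (both names are mine, not from the post):

```java
// Illustrative only: table and column names are assumptions, and the
// table is assumed to already exist.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class CounterStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "metrics");

        // Atomically bump the "likes" counter for one URL. HBase applies
        // the increment on the region server, so concurrent writers never
        // lose updates to a read-modify-write race.
        long newValue = table.incrementColumnValue(
                Bytes.toBytes("http://example.com/page"), // row key
                Bytes.toBytes("counters"),                // column family
                Bytes.toBytes("likes"),                   // qualifier
                1L);                                      // amount to add
        System.out.println("likes = " + newValue);
        table.close();
    }
}
```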
Eric Hauser's comment on the original post is indicative of how dynamic the NoSQL space is. I'd be interested to see how different the technology selection would be if it were made today.

The Architecture
There are a few common principles that drive the architecture for this type of system:
Facebook seems to follow those same principles in its architecture, while keeping the system running at scale at a fairly impressive rate. There are a few questions that remain open, as I don't have full visibility into their system:
Realtime.Analytics.next
Interestingly enough, I was recently asked to design a solution for a similar challenge in a voice-recording system: collect lots of data from various sources and process it in real time. The good news is that it's doable, as shown by Facebook's success. The better news is that it's actually fairly easy. We can add rich query capabilities, elastic scaling, database and platform neutrality, evolution of data, and more, without making things unnecessarily difficult. It's not so much that this is the next generation of realtime analytics as that Map/Reduce and the HBase approach used by Facebook are the previous generation. I'll describe how it can be done more simply in the next post.