Entries in analytics (8)

Sunday
Jul 16, 2023

Lessons Learned Running Presto at Meta Scale

Presto is a free, open source SQL query engine. We’ve been using it at Meta for the past ten years, and learned a lot while doing so. Running anything at scale - tools, processes, services - takes problem solving to overcome unexpected challenges. Here are four things we learned while scaling up Presto to Meta scale, and some advice if you’re interested in running your own queries at scale.

Scaling Presto rapidly to meet growing demands: What challenges did we face?

Deploying new Presto releases

Click to read more ...

Wednesday
Jan 22, 2020

Follower Clusters – 3 Major Use Cases for Syncing SQL & NoSQL Deployments

Follower clusters are a ScaleGrid feature that allows you to keep two independent database systems (of the same type) in sync. Unlike cloning or replication, this allows you to maintain an active, point-in-time copy of your production data. This extra cluster, known as a follower cluster, can be leveraged for multiple use cases, including analyzing, optimizing, and testing your application performance for MongoDB, MySQL, and PostgreSQL. In this blog post, we will cover the top three scenarios in which to leverage follower clusters for your application.
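As the post goes on to explain, the follower is kept current by importing data on a schedule rather than by streaming replication. Here is a minimal sketch of that idea; the helper names are hypothetical placeholders for whatever backup and restore tooling your databases provide, not ScaleGrid's actual mechanism:

```python
import time

SYNC_INTERVAL_SECONDS = 6 * 60 * 60  # hypothetical schedule: import every 6 hours


def take_snapshot(production_dsn: str) -> str:
    """Export a point-in-time snapshot of production (placeholder)."""
    return f"snapshot:{production_dsn}:{int(time.time())}"


def restore_snapshot(follower_dsn: str, snapshot: str) -> None:
    """Rebuild the follower cluster from the snapshot (placeholder)."""
    print(f"restoring {snapshot} onto {follower_dsn}")


def run_follower_sync(production_dsn: str, follower_dsn: str) -> None:
    # Unlike streaming replication, the follower is rebuilt from a
    # point-in-time snapshot on a fixed schedule. Between imports it is
    # fully writable, so you can run tests against it without touching
    # production; the next scheduled import simply resets it.
    while True:
        restore_snapshot(follower_dsn, take_snapshot(production_dsn))
        time.sleep(SYNC_INTERVAL_SECONDS)
```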

How Do Follower Clusters Differ From Replication?

Unlike a static clone, a follower cluster imports data on a set schedule, so it is always in sync with your production cluster. Here are a few critical ways in which it differs from replication:

Click to read more ...

Wednesday
Nov 16, 2016

The Story of Batching to Streaming Analytics at Optimizely

Our mission at Optimizely is to help decision makers turn data into action. This requires us to move data with speed and reliability. We track billions of user events, such as page views, clicks, and custom events, on a daily basis. Providing our customers with immediate access to key business insights about their users has always been our top priority. Because of this, we are constantly innovating on our data ingestion pipeline.

In this article, we will describe how we transformed our data ingestion pipeline from batching to streaming to provide our customers with real-time session metrics.

Motivations 

Unification. Previously, we maintained two data stores for different use cases: HBase was used for computing Experimentation metrics, whereas Druid was used for calculating Personalization results. These two systems were developed with distinct requirements in mind:

| Experimentation          | Personalization             |
|--------------------------|-----------------------------|
| Instant event ingestion  | Delayed event ingestion OK  |
| Query latency in seconds | Query latency in subseconds |
| Visitor-level metrics    | Session-level metrics       |

As our business requirements evolved, however, things quickly became difficult to scale. Maintaining a Druid + HBase Lambda architecture (see below) to satisfy these business needs became a technical burden for the engineering team. We needed a solution that reduces backend complexity and increases development productivity. More importantly, a unified counting infrastructure creates a generic platform for many of our future product needs.

Consistency. As mentioned above, the two counting infrastructures provide different metrics and computational guarantees. For example, Experimentation results show you the number of visitors who visited your landing page, whereas Personalization shows you the number of sessions instead. We want to bring consistent metrics to our customers and support both types of statistics across our products.
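To make the visitor-versus-session distinction concrete, here is a minimal sketch (not Optimizely's production code) that derives both metrics from the same event stream, assuming a hypothetical 30-minute inactivity timeout for sessionization:

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # hypothetical 30-minute inactivity timeout, in seconds


def count_metrics(events):
    """events: iterable of (visitor_id, timestamp) pairs, sorted by timestamp."""
    last_seen = {}               # visitor_id -> timestamp of last event
    sessions = defaultdict(int)  # visitor_id -> number of sessions
    for visitor_id, ts in events:
        prev = last_seen.get(visitor_id)
        # A gap longer than the timeout starts a new session.
        if prev is None or ts - prev > SESSION_TIMEOUT:
            sessions[visitor_id] += 1
        last_seen[visitor_id] = ts
    unique_visitors = len(last_seen)         # Experimentation-style metric
    total_sessions = sum(sessions.values())  # Personalization-style metric
    return unique_visitors, total_sessions


# The same stream yields two different answers: 2 visitors, 3 sessions.
events = [("a", 0), ("b", 10), ("a", 100), ("a", 100 + 31 * 60)]
print(count_metrics(events))  # (2, 3)
```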

Real-time results. Our session-based results are computed using MR jobs, which can be delayed by hours after the events are received. A real-time solution would provide our customers with a more up-to-date view of their data.

Druid + HBase

In our earlier posts, we introduced our backend ingestion pipeline and how we use Druid and MR to store transactional stats based on user sessions. One of the biggest benefits we get from Druid is low-latency results at query time. However, it does come with its own set of drawbacks. For example, since segment files are immutable, it is impossible to incrementally update the indexes. As a result, we are forced to reprocess user events within a given time window if we need to fix certain data issues, such as out-of-order events. In addition, we had difficulty scaling the number of dimensions and dimension cardinality, and queries spanning long periods of time became expensive.

On the other hand, we also use HBase for our visitor-based computation. We write each event into an HBase cell, which gives us maximum flexibility in terms of the kinds of queries we can support. When a customer needs to find out “how many unique visitors have triggered an add-to-cart conversion”, for example, we scan over the range of the dataset for that experiment. Since events are pushed into HBase (through Kafka) in near real time, the data generally reflects the current state of the world. However, our current table schema does not aggregate any of the metadata associated with each event. This metadata includes a generic set of information, such as browser type and geolocation details, as well as customer-specific tags used for customized data segmentation. The redundancy of this data prevents us from supporting a large number of custom segmentations, as it increases our storage cost and query scan time.
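To illustrate the shape of such a query, here is a sketch using the happybase HBase client; the host, table name, row-key layout, and column names are hypothetical, not our actual schema:

```python
import happybase  # Thrift-based HBase client

connection = happybase.Connection("hbase-thrift-host")  # hypothetical host
events = connection.table("events")                     # hypothetical table

# Hypothetical row-key layout: <experiment_id>#<timestamp>#<event_id>.
# HBase stores rows sorted by key, so all events for one experiment are
# contiguous and can be read with a single range scan.
experiment_id = b"exp42"
visitors = set()
for row_key, columns in events.scan(
    row_start=experiment_id + b"#",
    row_stop=experiment_id + b"#~",  # '~' sorts after the digits in timestamps
):
    if columns.get(b"event:type") == b"add_to_cart":
        visitors.add(columns[b"event:visitor_id"])

print(len(visitors), "unique visitors triggered an add-to-cart conversion")
```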

SessionDB 

Click to read more ...

Wednesday
Aug 13, 2014

Hamsterdb: An Analytical Embedded Key-value Store

In this post, I’d like to introduce you to hamsterdb, an Apache 2-licensed, embedded analytical key-value database library similar to Google's leveldb and Oracle's BerkeleyDB.

hamsterdb is not a new contender in this niche. In fact, hamsterdb has been around for over nine years. In that time it has grown dramatically, and its focus has shifted from a pure key-value store to an analytical database offering functionality similar to a column-store database.

hamsterdb is single-threaded and non-distributed, and users usually link it directly into their applications. hamsterdb offers a unique (at least, as far as I know) implementation of Transactions, as well as other features reminiscent of column-store databases, making it a natural fit for analytical workloads. It can be used natively from C/C++ and has bindings for Erlang, Python, Java, .NET, and even Ada. It is used in embedded devices and on-premise applications with millions of deployments, and also serves in cloud instances for caching and indexing.

hamsterdb has a unique feature in the key-value niche: it understands schema information. While most databases do not know or care what kind of keys are inserted, hamsterdb supports key types for binary keys...
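For intuition about why typed keys matter, here is a conceptual sketch in plain Python (not hamsterdb's API): when a store knows its keys are fixed-width integers, it can pack them without per-key length prefixes and compare them as raw bytes.

```python
import struct


def encode_uint64_key(value: int) -> bytes:
    # Big-endian packing preserves numeric order under plain bytewise
    # comparison, so a B-tree can use memcmp-style comparisons and
    # fixed-width slots instead of length prefixes and user callbacks.
    return struct.pack(">Q", value)


keys = sorted(encode_uint64_key(v) for v in (3, 2**40, 1))
assert keys == [encode_uint64_key(v) for v in (1, 3, 2**40)]
```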

Click to read more ...

Tuesday
Aug 28, 2012

Making Hadoop Run Faster

One of the challenges in processing data is that the speed at which we can input data is quite often much faster than the speed at which we can process it. This problem becomes even more pronounced in the context of Big Data, where the volume of data keeps growing and, with it, the need for more insights and thus more complex processing.

Batch Processing to the Rescue

Hadoop was designed to deal with this challenge in the following ways:

1. Use a distributed file system: This enables us to spread the load and grow our system as needed.

2. Optimize for write speed: To enable fast writes, the Hadoop architecture was designed so that writes are first logged and then processed, which allows fairly high write throughput.

3. Use batch processing (Map/Reduce) to balance the speed of the data feeds with the processing speed (see the sketch below).
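To make point 3 concrete, here is a minimal Hadoop Streaming word count sketched in Python; the file names are hypothetical, and Hadoop handles the shuffle and sort between the two scripts.

```python
#!/usr/bin/env python3
# mapper.py -- emits "word<TAB>1" for every word on stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sums the counts; Hadoop delivers reducer input sorted by key
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

A job like this is typically launched with the streaming jar, along the lines of: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /logs -output /counts.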

Batch Processing Challenges

Click to read more ...

Monday
Jul 18, 2011

Building your own Facebook Realtime Analytics System  

Recently, I was reading Todd Hoff's write-up on Facebook's real-time analytics system. As usual, Todd did an excellent job summarizing this video from Alex Himel, Engineering Manager at Facebook.

In this first post, I’d like to summarize the case study and consider some things that weren't mentioned in the summaries. This will lead to an architecture for building your own Real-Time Analytics system for Big Data that might be easier to implement, using Facebook's experience as a starting point and guide, as well as the experience gathered through recent work with a few GigaSpaces customers. The second post provides a summary of that new approach, as well as a pattern and a demo for building your own Real-Time Analytics system.

Click to read more ...

Tuesday
Mar 22, 2011

Facebook's New Realtime Analytics System: HBase to Process 20 Billion Events Per Day

Facebook did it again. They've built another system capable of doing something useful with ginormous streams of realtime data. Last time we saw Facebook release their New Real-Time Messaging System: HBase To Store 135+ Billion Messages A Month. This time it's a realtime analytics system handling over 20 billion events per day (200,000 events per second) with a lag of less than 30 seconds.
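As a quick sanity check, the per-day and per-second figures line up; the per-second number is evidently a rounded average:

```python
events_per_day = 20_000_000_000
seconds_per_day = 24 * 60 * 60  # 86,400
print(round(events_per_day / seconds_per_day))  # 231481, quoted as "200,000/sec"
```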

Alex Himel, Engineering Manager at Facebook, explains what they've built (video) and the scale required:

Social plugins have become an important and growing source of traffic for millions of websites over the past year. We released a new version of Insights for Websites last week to give site owners better analytics on how people interact with their content and to help them optimize their websites in real time. To accomplish this, we had to engineer a system that could process over 20 billion events per day (200,000 events per second) with a lag of less than 30 seconds. 

Alex does an excellent job with the presentation. Highly recommended. But let's take a little deeper look at what's going on...

Click to read more ...

Wednesday
Jun 10, 2009

Hive - A Petabyte Scale Data Warehouse using Hadoop

This post about using Hive and Hadoop for analytics comes straight from Facebook engineers.

Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook - both engineering and non-engineering. Apart from ad hoc analysis and business intelligence applications used by analysts across the company, a number of Facebook products are also based on analytics.

These products range from simple reporting applications, like Insights for the Facebook Ad Network, to more advanced kinds, such as Facebook's Lexicon product.

As a result, a flexible infrastructure that caters to the needs of these diverse applications and users, and that also scales up in a cost-effective manner with the ever-increasing amounts of data generated on Facebook, is critical. Hive and Hadoop are the technologies that we have used to address these requirements at Facebook.

Read the rest of the article on Engineering @ Facebook's Notes page