Give Meaning to 100 Billion Events a Day — The Shift to Redshift
Wednesday, February 27, 2019 at 9:03AM
HighScalability Team in BigData

This is a guest post by Alban Perillat-Merceroz, from the Analytics team at Teads.

In part one, we described our Analytics data ingestion pipeline, with BigQuery sitting as our data warehouse. However, having our analytics events in BigQuery is not enough. Most importantly, data needs to be served to our end-users.

TL;DR — Teads Analytics big picture

In this article, we will detail:

Data is in BigQuery, now what?

Back to where we stopped in the previous article. Our data is stored in raw and aggregated formats in BigQuery. Our data analysts can query it directly for exploration or debugging purpose, but SQL has its limit. That’s where some Scala code takes over in our architecture, in a component we call the Analytics Service.

Analytics Service overview

The Analytics Service is a rather central piece that performs some operations and business logic that would be too complex to model as SQL. It also aggregates BigQuery’s data together with other data sources to enrich it. To be more specific, these operations involve:

The Analytics Service serves two rather different use cases with the same code and infrastructure:

Because of BigQuery’s incompressible latency and additional processing involved to apply business rules, spreadsheet reports can take from 10 sec to a few hours to be computed. It all depends on the requested period and the complexity of the business rules implied by the requested data.

Technically, we offload processing of big reports as Spark jobs in order to isolate each process. Data marts are usually bigger than Spreadsheet reports and take more time to process.

Why Redshift?

Before we put Redshift in place, the Analytics Service processed Spreadsheet reports and SQL outputs that were loaded into an Infobright Enterprise Edition (IEE)* database.

*IEE is a proprietary fork of MySQL, modified for analytics (column-oriented). Although it was fine to use IEE to serve dashboards with Chartio (SaaS visualization app, that we use internally for Business Intelligence), we were reaching its limits, in terms of pricing and scalability.

We also had other data visualization needs within newly built web apps. On these apps, users can explore their data (e.g. a publisher can see his revenue, with many breakdowns: per hour, per device, per website, etc.). Low response time is a requirement as users expect an interactive experience. We set the requirement for any dashboard to load under 1s for these UIs.

With these two use cases in mind (Chartio for internal use and web apps for clients) we considered several options:

Option 1: BigQuery everywhere

We initially thought we could handle all this with BigQuery, by loading outputs of the Analytics Service back into BigQuery (instead of IEE). But doing this we ran into some limitations:

Option 2: Keep data marts in IEE for Chartio and consider caching solutions for web apps

In order to work around latency and concurrency issues of BigQuery we naturally looked into caching solutions and benchmarked Key-Value stores. The short-listed candidates were DynamoDB and BigTable.

Both offer great read performance, but the challenge is on the write side. With BigQuery as the primary data source, a cache miss would cost too much, hence the need to pre-compute and load everything into a cache. This solution would add a lot of work beforehand, and would not help us move away from IEE.

A simplistic comparison of the candidates based on subjective evaluations. Greater is better.

As cache invalidation is one of the two hard things in computer science, we spent a lot of time assessing alternatives. An out-of-the-box proposition was Redshift.

It wasn’t intuitive, but we came to a point where we considered it was a serious candidate to solve two problems at once.

Option 3: The counterintuitive choice of Redshift

Redshift was a natural choice to replace IEE (products are similar on paper) and serve as a data source for internal Chartio dashboards.

But could Redshift be a serious alternative to low latency Key-Value stores for our web apps needs?

Redshift could indeed help reduce BigQuery’s load coming from Chartio, and we were also really tempted to make it fit our web apps needs. This way, we could mutualize the implementation effort.

We were immediately seduced by this solution for web apps as it comes with two great advantages:

However, the scale of our data marts sits in between what the SQL world can handle and real Big Data. We have tables of a few billion rows and up to 600 GB of data per table. It’s enough for classic databases to start struggling, yet it’s the lower end of the spectrum for Redshift.

We still had the same two concerns though:

Concern #1 Query latency

Surprisingly, we found little information about how Redshift performs on relatively small data sets. All the benchmarks available at that time were focused on Big Data use cases. They naturally involved big clusters that would be overkill and economically challenging for our use case. These benchmarks were testing Redshift for large queries taking 10 to 30 seconds.

Without first-hand experience with small clusters and small optimized queries, we were incapable of knowing if Redshift could perform sub-second response times for our datasets.

Experience now tells us it is possible to serve an app with complex dashboards to a few concurrent users and keep latency under 500 ms, which is:

Here are a few practices and optimizations we follow to achieve this level of performance:

Redshift query duration (p90)

 

Concern #2 Query concurrency

Before taking a step into the unknown, we asked around people that had already tried to plug Redshift to a user-facing app. They all warned us and pointed out the concurrency limitations of Redshift. Concurrent queries limitations are similar or worse than BigQuery’s (50 slots maximum, 15 recommended).

We carried on anywayIt was a risky choice, but we bet on Redshift because:

Teads for Publisher, one of the web apps powered by Analytics

Make the most out of Redshift — An optimization journey

Unlike BigQuery, Redshift requires a lot of manual optimizations to perform at his best. AWS manages hardware and clustering, but you are still responsible for many database options inherited from Postgres. Most of the optimization is done at the table level with many options to choose from (column type and encoding, sort keys, primary and foreign key, etc.) as well as maintenance operations (vacuum, vacuum reindex, analyse).

The Redshift documentation gives a good overview of the best practices (hereherehere and here). Here is what works for us:

Sizing

It’s the first thing you need to do when creating a cluster. We use the smallest SSD instances (dc2.large) and to date we have 5 Redshift clusters (from 3 to 18 dc2.large nodes each). We also have an “archive” cluster to store cold data, that we use for backups and ad hoc queries (4 ds2.xlarge HDD instances). It’s really easy to resize a cluster, so there is no need to do any complex capacity planning here.

Columns encoding

Compression is first and foremost about performance but reducing table footprint is always welcome. All you need to know about encoding is in the documentation. After running some benchmarks on our data, the best configuration for us is:

Table example with columns encoding

 

In this example, we use the BYTEDICT encoding for the device column because we know the cardinality is 4 (mobile, desktop, tv, tablet), but we use LZO for the browser because there is a virtual infinity of user agents.

We’ve also compared general purpose algorithms LZO and ZSTANDARD. Although ZSTANDARD always gives the best compression ratio, it compromises on query performance. We end up never using it because of our focus on query performance. In this benchmark, DELTA encoding has even better read performance, but it is not compatible nor adapted to all types of data.

This chart is specific to our context, and only for illustration purpose. It gives a rough idea of the results of our early benchmarks comparing compression algorithms. x-axis is an index of query response time.

Sort keys

Sort keys are like indexes, it’s the most important performance lever. There are two types of sort keys:

Vacuums and Analyse

Redshift doesn’t sort data on insertion nor moves data during deletions. Vacuums are maintenance operations that will defragment and sort tables. It is mandatory to maintain optimal performance. The 3 most important operations are:

Scheduling vacuums is tricky as these are long, I/O intensive processes, and limited to a single concurrent vacuum operation per cluster. Vacuum Reindex operations are so demanding that we can’t do them on all tables every day, it would take more than 24 hours to process. It’s worth mentioning that during vacuums our clusters perform at about a third of their nominal capacities(which may be related to the fact we use small clusters).

At first, we tried to vacuum after every data load (every hour), to maximize performance. It did not prove to be a good idea, the servers were taking most of their time doing these costly operations, and performance reduction over the day without vacuum is negligible. Instead, we run a nightly custom script that loops over all tables in the cluster and performs these maintenance operations.

Queues & priorities

Redshift has advanced mechanisms to manage SLAs between users, but we kept all the default settings for now. The fact that we have a cluster per use-case simplifies this part.

We enabled Short Query Acceleration for a cluster that has a lot of concurrent queries (the one for Chartio). It automatically analyses queries and assigns shorter ones to a dedicated queue. It worked great to reduce the performance impact of larger queries.

Instance reservation

Like many AWS resourcesRedshift instances are eligible to the reservation mechanisms. Reserving instances is a tedious but necessary task to reduce your bill. Fortunately, data marts fall into the permanent type of instances so it’s relatively easy to forecast minimum usage.

Key learnings working with Redshift

The evolution of Redshift

Redshift is an evolving product. This article mostly reflects Redshift as of early 2017. Some features and simplification are regularly released.

Orchestration and scheduling

Each step of our process, from Dataflow ingestion to data mart processing into Redshift, needs to be tightly orchestrated. We use good old Jenkins and jobHistory (an internal tool) to achieve that.

Most of our jobs process an endless write-only stream of data. The base unit of time is an hour and jobHistory acts as a timekeeper for all our jobs.

Basically, each job will have to process 24 of these chunks every day. For each job, jobHistory knows which chunk has been processed, and which hasn’t. It also keeps a dependency graph of jobs, so that each successful job can trigger downstream operations.

For each job, Jenkins polls jobHistory every few minutes and asks for dates to process. Most of the time there is nothing to do and once every hour it triggers the processing of this chunk.

jobHistory UI, grey hours are chunks that will be processed once parent dependencies are processed

 

We tried more advanced alternatives to jobHistory, such as Cloud Composer(based on Apache Airflow), but it didn’t fit well enough our need and lacked maturity at the time.

Jenkins, jobHistory: the ingredients are simple. The key to the reliability of the chain resides in a few good practices:


Conclusion

Two years after our first tests, Redshift has become a central piece of our Analytics stack: a database to rule all our various data visualization needs, from self-serve data exploration on Chartio to apps with latency constraints.

It packs a simple SQL interface with good performance and scalability at a reasonable price. Even if our hybrid use case is exotic, latency is acceptable for web UIs and we are still well under concurrency limits.

We couldn’t think of a better way to show our results than a quick overview or our Data Visualization apps:

Here is one of our web apps
Chartio dashboard (example)

If you like large scale event processing and Analytics related challenges,give us a shout. We are actively looking for Engineers to build tomorrow’s architecture.

Many thanks to Benjamin Davy, Brice Jaglin, Christophe Oudar, Eric Pantera for their feedback on this article and Teads’ Analytics team for their contribution to the technology behind all this: Dimitri Ho, Nathan Marin, Olivier Abdesselam, Olivier Nouguier, Quentin Fernandez, Roch Dardie.

Article originally appeared on (http://highscalability.com/).
See website for complete article licensing information.