How TripleLift Built an Adtech Data Pipeline Processing Billions of Events Per Day
Monday, June 15, 2020 at 9:04AM
HighScalability Team

This is a guest post by Eunice Do, Data Engineer at TripleLift, a technology company leading the next generation of programmatic advertising.

What is the name of your system and where can we find out more about it?

The system is the data pipeline at TripleLift. TripleLift is an adtech company, and like most companies in this industry, we deal with high volumes of data on a daily basis. Over the last 3 years, the TripleLift data pipeline scaled from processing millions of events per day to processing billions. This processing can be summed up as the continuous aggregation and delivery of reporting data to users in a cost-efficient manner. In this article, we'll mostly be focusing on the current state of this multi-billion-event pipeline.

To follow the 5-year journey leading up to the current state, check out this talk on the evolution of the TripleLift pipeline by our VP of Engineering.

Why did you decide to build this system?

We needed a system that could:

How big is your system? Try to give a feel for how much work your system does.

Approximate system stats - circa April 2020

What is your in/out bandwidth usage?

The data pipeline aggregates event data collected by our Kafka cluster, which has an approximate I/O of 2.5GB/hr.

How fast are you growing?

We’ve experienced fortunate, rapid year-over-year growth for the past few years. The amount of data processed in our pipeline increased by 4.75x from 2016 to 2017, by 2.5x from 2017 to 2018, and by 3.75x from 2018 to 2019. That’s nearly 50x in 3 years!

How Is Your System Architected?

At a high level, our data pipeline runs batch processes with a flow that consists of:

  1. raw event collection and persistence
  2. denormalization and normalization of data via multiple levels of aggregation
  3. persistence or ingestion of aggregated data into various datastores
  4. exposure of ingested data in UIs and reporting tools for querying

We begin with the data collection process, in which raw event data is published to any of our 50+ Kafka topics. These events are consumed by Secor (an open-source consumer created by Pinterest) and written out to AWS S3 in Parquet format.
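
To make the collection step concrete, here is a minimal sketch of publishing an event to a Kafka topic with the kafka-python client. The topic name, event fields, and choice of client are illustrative assumptions, not the actual TripleLift collectors.

```python
import json
from kafka import KafkaProducer

# Minimal sketch of the event collection step. Broker addresses, the topic
# name, and the event schema are hypothetical placeholders.
producer = KafkaProducer(
    bootstrap_servers=["kafka-broker-1:9092"],          # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode(),  # serialize events as JSON
)

# A hypothetical auction event; real topics and schemas will differ.
event = {"auction_id": "abc-123", "placement_id": 42, "timestamp_ms": 1586901600000}
producer.send("auction-events", value=event)
producer.flush()  # block until the event is delivered
```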

We use Apache Airflow to facilitate the scheduling and dependency management necessary in both steps 2 and 3.
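
As a sketch of what that orchestration looks like, here is a minimal Airflow DAG that chains hypothetical denormalization, normalization, and ingestion tasks. The DAG id, schedule, task names, and use of placeholder operators are assumptions for illustration, not the actual TripleLift DAGs.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Sketch of the scheduling and dependency management described above.
dag = DAG(
    dag_id="reporting_pipeline",        # hypothetical DAG id
    start_date=datetime(2020, 4, 1),
    schedule_interval="@hourly",        # hypothetical schedule
    catchup=False,
)

# Placeholder tasks standing in for the real aggregation/ingestion operators.
denormalize = DummyOperator(task_id="denormalize_raw_events", dag=dag)
normalize_report_a = DummyOperator(task_id="normalize_report_a", dag=dag)
normalize_report_b = DummyOperator(task_id="normalize_report_b", dag=dag)
load_snowflake = DummyOperator(task_id="copy_into_snowflake", dag=dag)
load_druid = DummyOperator(task_id="ingest_into_druid", dag=dag)

# Step 2 fans out into report-specific rollups; step 3 fans out into datastores.
denormalize >> [normalize_report_a, normalize_report_b]
normalize_report_a >> [load_snowflake, load_druid]
normalize_report_b >> [load_snowflake, load_druid]
```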

Aggregation tasks are kicked off by Airflow via job submit requests to the Databricks API. The aggregations are run with Apache Spark on Databricks clusters. Data is first denormalized into wide tables by joining a multitude of raw event logs in order to paint a full picture of what occurred before, during, and after the auction for an ad slot. Denormalized logs are persisted to S3.
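
One way to express that kick-off in Airflow is the Databricks provider's DatabricksSubmitRunOperator, which wraps the job submit API. The sketch below is an assumption about the shape of such a task: the cluster spec, jar location, and class name are hypothetical, and the import path depends on the Airflow version (airflow.contrib.* on Airflow 1.x).

```python
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

# Sketch of submitting a Spark aggregation job to Databricks from Airflow.
# Cluster spec, jar path, and class name are hypothetical placeholders.
denormalize = DatabricksSubmitRunOperator(
    task_id="denormalize_raw_events",
    databricks_conn_id="databricks_default",
    new_cluster={
        "spark_version": "6.4.x-scala2.11",
        "node_type_id": "i3.xlarge",
        "num_workers": 8,
    },
    libraries=[{"jar": "s3://example-bucket/jars/aggregations.jar"}],
    spark_jar_task={"main_class_name": "com.example.DenormalizeJob"},
    dag=dag,  # the DAG from the sketch above
)
```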

After the denormalization tasks enter a success state, the Airflow scheduler kicks off their downstream normalization tasks. Each of these aggregations rolls the denormalized data up into a more narrowly scoped set of dimensions and metrics that fits the business context of a specific report. These final aggregations are persisted to S3 as well.
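
The production aggregations are written in Spark Scala (see the languages question below), but a PySpark sketch can illustrate the shape of such a rollup. The input path, column names, and metrics here are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

# Sketch only: paths, columns, and metrics are hypothetical, and the real
# jobs are written in Spark Scala rather than PySpark.
spark = SparkSession.builder.appName("normalize_report_a").getOrCreate()

denormalized = spark.read.parquet("s3://example-bucket/denormalized/dt=2020-04-01/")

# Roll the wide, denormalized table up to report-specific dimensions and metrics.
report = (
    denormalized
    .groupBy("publisher_id", "placement_id", F.to_date("auction_ts").alias("date"))
    .agg(
        F.count("*").alias("auctions"),
        F.sum("impressions").alias("impressions"),
        F.sum("revenue").alias("revenue"),
    )
)

report.write.mode("overwrite").parquet("s3://example-bucket/reports/report_a/dt=2020-04-01/")
```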

Upon the success of each final aggregation task, the Airflow scheduler kicks off its downstream persistence or ingestion tasks. One such task copies aggregated data into Snowflake, a data analytics platform that serves as the backend for our business intelligence tools. Another task ingests data into Imply Druid, a managed cloud solution consisting of a time-optimized, columnar datastore that supports ad hoc analytics queries over large datasets.
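
For the Snowflake path, one common pattern is to issue a COPY INTO from an external S3 stage using the snowflake-connector-python client. The sketch below assumes that pattern; the account, stage, table names, and credentials are hypothetical, and this is not necessarily how the actual task is implemented.

```python
import snowflake.connector

# Sketch of copying aggregated Parquet data from S3 into Snowflake.
# Account, warehouse, stage, and table names are hypothetical placeholders.
conn = snowflake.connector.connect(
    account="example_account",
    user="pipeline_user",
    password="...",          # use a secrets manager in practice
    warehouse="REPORTING_WH",
    database="REPORTING",
    schema="PUBLIC",
)

conn.cursor().execute("""
    COPY INTO report_a
    FROM @aggregated_reports_stage/report_a/dt=2020-04-01/
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""")
conn.close()
```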

Finally, step 4 is a joint effort between our business intelligence and data engineering teams. The primary places where aggregated data can be queried are our internal reporting APIs, Looker (which is backed by Snowflake), and Imply Pivot (a drag-and-drop analytics UI bundled into the Imply Druid solution).

What lessons have you learned?

Data decisions tend to have far reaching repercussions. For example:

What do you wish you would have done differently?

For a long time we didn’t have a clear approach to the scope of data we made accessible, or where we made it accessible.

We’ve since given more thought to our approach and taken measures such as implementing AWS S3 lifecycle rules, defining retention policies per queryable datasource, and designating reporting tools to handle either quick, investigative queries or large reports over long date ranges, but not both.
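
As an example of the lifecycle-rule measure, a minimal boto3 sketch that expires aged intermediate data might look like the following. The bucket name, prefix, and 90-day retention period are illustrative assumptions, not our actual policy.

```python
import boto3

# Sketch of an S3 lifecycle rule that expires old intermediate pipeline output.
# Bucket, prefix, and retention period are hypothetical placeholders.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-pipeline-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-denormalized-logs",
                "Filter": {"Prefix": "denormalized/"},
                "Status": "Enabled",
                "Expiration": {"Days": 90},
            }
        ]
    },
)
```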

How are you thinking of changing your architecture in the future?

We plan to supplement our batch pipeline with a real-time streaming application built on Kafka Streams. Kafka Streams was chosen after building proofs of concept with Spark Structured Streaming, Kafka Streams, and KSQL.

How do you graph network and server statistics and trends?

We use Prometheus to store application metrics. We then aggregate the metrics in Grafana dashboards and set alerts on those dashboards.
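
As a small illustration of that instrumentation, here is a sketch using the prometheus_client Python library to expose metrics for Prometheus to scrape. The metric names, labels, and port are hypothetical, not our actual metrics.

```python
from prometheus_client import Counter, Histogram, start_http_server

# Sketch only: metric names and labels are hypothetical placeholders.
EVENTS_PROCESSED = Counter(
    "pipeline_events_processed_total",
    "Events processed by the pipeline",
    ["topic"],
)
TASK_DURATION = Histogram(
    "pipeline_task_duration_seconds",
    "Duration of pipeline tasks",
    ["task"],
)

# Expose metrics on :8000/metrics for Prometheus to scrape.
start_http_server(8000)

with TASK_DURATION.labels(task="denormalize_raw_events").time():
    EVENTS_PROCESSED.labels(topic="auction-events").inc(1000)  # placeholder work
```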

How many people are in your team?

There are a total of 4 data engineers on the team, and we make up about a tenth of the entire engineering organization. The teams we often collaborate with are infrastructure, solutions, and data science. For example, we recently onboarded the data science team onto Airflow, and their model runs are now automated.

Which languages do you use to develop your system?

Python for Airflow, Spark Scala for the aggregations, and Java for some reporting tools.
