Monday, March 9, 2015

AppLovin: Marketing to Mobile Consumers Worldwide by Processing 30 Billion Requests a Day

This is a guest post from AppLovin's VP of engineering, Basil Shikin, on the infrastructure of its mobile marketing platform. Major brands like Uber, Disney, Yelp and Hotels.com use AppLovin's mobile marketing platform. It processes 30 billion requests a day and 60 terabytes of data a day.

AppLovin's marketing platform provides marketing automation and analytics for brands who want to reach their consumers on mobile. The platform enables brands to use real-time data signals to make effective marketing decisions across one billion mobile consumers worldwide.

Core Stats

  • 30 Billion ad requests per day

  • 300,000 ad requests per second, peaking at 500,000 ad requests per second

  • 5ms average response latency

  • 3 Million events per second

  • 60TB of data processed daily

  • ~1000 servers

  • 9 data centers

  • ~40 reporting dimensions

  • 500,000 metrics data points per minute

  • 1 PB Spark cluster

  • 15GB/s peak disk writes across all servers

  • 9GB/s peak disk reads across all servers

  • Founded in 2012, AppLovin is headquartered in Palo Alto, with offices in San Francisco, New York, London and Berlin.

 

Technology Stack

 

Third Party Services

Data Storage

  • Aerospike for user profile storage (see the write sketch after this list)

  • Vertica for aggregated statistics and real-time reporting

  • Aggregating 350,000 rows per second and writing to Vertica at 34,000 rows per second

  • Peak 12,000 user profiles per second written to Aerospike

  • MySQL for ad data

  • Spark for offline processing and deep data analysis

  • Redis for basic caching

  • Thrift for all data storage and transfers

  • Each data point replicated in 4 data centers

  • Each service is replicated in at least 2 data centers (at most 8)

  • Amazon Web Services used for long-term data storage and backups
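
As a concrete illustration of the profile-storage path, here is a minimal sketch of writing a user profile with the Aerospike Java client. The namespace, set, bin names, and TTL are hypothetical; AppLovin's actual schema is not public.

    import com.aerospike.client.AerospikeClient;
    import com.aerospike.client.Bin;
    import com.aerospike.client.Key;
    import com.aerospike.client.policy.WritePolicy;

    public class ProfileWriter {
        public static void main(String[] args) {
            // Connect to one seed node; the client discovers the rest of the cluster.
            AerospikeClient client = new AerospikeClient("aerospike-host", 3000);
            try {
                WritePolicy policy = new WritePolicy();
                policy.expiration = 30 * 24 * 60 * 60; // hypothetical 30-day TTL, in seconds

                // Hypothetical namespace "profiles", set "users", keyed by device id.
                Key key = new Key("profiles", "users", "device-1234");
                client.put(policy, key,
                        new Bin("lastSeen", System.currentTimeMillis()),
                        new Bin("segment", "travel"));
            } finally {
                client.close();
            }
        }
    }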

Core App And Services

  • Custom C/C++ Nginx module for high performance ad serving

  • Java for data processing and auxiliary services

  • PHP / Javascript for UI

  • Jenkins for continuous integration and deployment

  • Zookeeper for distributed locking

  • HAProxy and IPVS for high availability

  • Coverity for Java/C++ static code analysis

  • Checkstyle and PMD for PHP static code analysis

  • Syslog for a centralized log server in each DC

  • Hibernate for transaction-based services

Servers and Provisioning

  • Ubuntu

  • Cobbler for bare metal provisioning

  • Chef for configuring servers

  • Berkshelf for Chef dependencies

  • Docker with Test Kitchen for running infrastructure tests

  • A combination of software (IPVS, HAProxy) and hardware load balancers, with a plan to gradually move away from traditional hardware load balancers

 

Monitoring Stack

 

Server Monitoring

  • Icinga for all servers

  • ~100 custom Nagios plugins for deep server monitoring

  • 550 various probes per server

  • Graphite line format for metrics data (see the sketch after this list)

  • Grafana for displaying all graphs

  • PagerDuty for issue escalation

  • Smokeping for network mesh monitoring
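
Graphite's plaintext protocol is simply "path value timestamp", one metric per line, sent to carbon (port 2003 by default). A minimal sketch in Java, with a hypothetical host and metric name:

    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    public class GraphiteSender {
        public static void main(String[] args) throws Exception {
            long now = System.currentTimeMillis() / 1000; // Graphite expects epoch seconds
            try (Socket sock = new Socket("graphite-host", 2003);
                 Writer out = new OutputStreamWriter(sock.getOutputStream(), StandardCharsets.UTF_8)) {
                // One probe result per line: metric path, value, timestamp.
                out.write("dc1.web01.nginx.requests_per_sec 312000 " + now + "\n");
                out.flush();
            }
        }
    }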

Application Monitoring

  • VividCortex for MySQL monitoring

  • JSON /health endpoint on each service (see the sketch after this list)

  • Cross-DC database consistency monitoring

  • 9 4K 65” TVs for showing all graphs across the office

  • Statistical deviation monitoring

  • Fraudulent users monitoring

  • Third-party systems monitoring

  • Deployments are recorded in all graphs
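
The /health endpoint pattern is easy to reproduce; here is a minimal sketch using the JDK's built-in HTTP server. The JSON fields are assumptions; the point is a cheap, machine-readable status that monitoring can poll.

    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.lang.management.ManagementFactory;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;

    public class HealthEndpoint {
        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            server.createContext("/health", exchange -> {
                long uptimeSec = ManagementFactory.getRuntimeMXBean().getUptime() / 1000;
                String body = "{\"status\":\"ok\",\"uptimeSec\":" + uptimeSec + "}";
                byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().set("Content-Type", "application/json");
                exchange.sendResponseHeaders(200, bytes.length);
                try (OutputStream os = exchange.getResponseBody()) {
                    os.write(bytes);
                }
            });
            server.start();
        }
    }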

Intelligent Monitoring

  • Intelligent alerting system with a feedback loop: a system that can introspect anything can learn anything

  • Third-party stats about AppLovin are also monitored

  • Alerting is a cross-team exercise: developers, ops, business, data scientists are involved

 

Architecture Overview

 

General Considerations

  • Store everything in RAM

  • If it does not fit, save it to SSD

  • L2 cache level optimizations matter

  • Use the right tool for the job

  • Architecture allows swapping any component

  • Upgrade only if an alternative is 10x better

  • Write your own components if there is nothing suitable out there

  • Replicate important data at least 3x

  • Make sure every message can be re-played without data corruption

  • Automate everything

  • Zero-copy message passing (see the sketch after this list)
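
The post does not say how zero-copy is done here; on the JVM side of the stack, one standard route is FileChannel.transferTo(), which lets the kernel move bytes from the page cache straight to a socket (sendfile on Linux) without copying through user space. A minimal sketch with a hypothetical file and peer:

    import java.io.FileInputStream;
    import java.net.InetSocketAddress;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;

    public class ZeroCopySend {
        public static void main(String[] args) throws Exception {
            try (FileChannel file = new FileInputStream("/data/messages.log").getChannel();
                 SocketChannel peer = SocketChannel.open(new InetSocketAddress("peer-host", 9000))) {
                long pos = 0, size = file.size();
                while (pos < size) {
                    // transferTo() may move fewer bytes than requested; loop until done.
                    pos += file.transferTo(pos, size - pos, peer);
                }
            }
        }
    }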

Message Processing

  • Custom message processing system that guarantees message delivery

  • 3x replication for each message

  • Sending a message = writing to disk (see the sketch after this list)

  • Any service may fail, but no data are lost

  • Message dispatching system connects all components together, provides isolation and extensibility of the system

  • Cross-DC failure tolerance
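
AppLovin's message system is custom and unpublished, but the "sending a message = writing to disk" guarantee can be sketched as an append-only log that fsyncs before acknowledging. Length-prefixed records make replay straightforward. All names here are hypothetical:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class DurableLog {
        private final FileChannel log;

        public DurableLog(String path) throws IOException {
            log = FileChannel.open(Paths.get(path),
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                    StandardOpenOption.APPEND);
        }

        // Only acknowledge a send once the bytes are durable on disk.
        public synchronized void send(byte[] message) throws IOException {
            ByteBuffer buf = ByteBuffer.allocate(4 + message.length);
            buf.putInt(message.length); // length prefix allows replay record-by-record
            buf.put(message);
            buf.flip();
            while (buf.hasRemaining()) {
                log.write(buf);
            }
            log.force(false); // fsync before the caller sees success
        }
    }

In a real deployment each record would also be shipped to replicas in other data centers before the message counts as delivered.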

Ad Serving

  • Nginx is really fast: can serve an ad in less than a millisecond

  • Keep all ad serving data in memory: read only

  • jemalloc gave a 30% speed improvement

  • Use Aerospike for user profiles: less than 1ms to fetch a profile

  • Pre-compute all ad serving data on one box and dispatch across all servers

  • Torrents are used to propagate serving data across all servers. Using Torrents resulted in 83% network load drop on the originating server compared to HTTP-based distribution.

  • mmap is used to share ad serving data across nginx processes

  • xxHash is the fastest hash function with a low collision rate: 75x faster than SHA-1 for computing checksums

  • 5% of real traffic goes to staging environment

  • Ability to run 3 A/B tests at once (20%/20%/10% of traffic for three separate tests, 50% for control; see the bucketing sketch after this list)

  • A/B test results are available in regular reporting
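
A concurrent 20%/20%/10%/50% split is easy to make deterministic by hashing the user id into 100 slots, so every server assigns the same user to the same test with no coordination. This is a sketch under that assumption, not AppLovin's actual bucketing code:

    public class AbBuckets {
        // Map a user to one of three tests or control. String.hashCode() is
        // specified by the JLS, so all servers agree on the assignment.
        static String bucketFor(String userId) {
            int slot = Math.floorMod(userId.hashCode(), 100); // stable slot in 0..99
            if (slot < 20) return "test-A";  // 20%
            if (slot < 40) return "test-B";  // 20%
            if (slot < 50) return "test-C";  // 10%
            return "control";                // 50%
        }

        public static void main(String[] args) {
            System.out.println(bucketFor("device-1234"));
        }
    }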

 

Data Warehouse

  • All data are replicated

  • Running most reports takes under 2 seconds

  • Aggregation is key to fast reports over large amounts of data (see the rollup sketch after this list)

  • Non-aggregated data for the last 48 hours is usually enough to resolve most issues

  • 7 days of raw logs is usually enough for debugging

  • Some reports must be pre-computed

  • Always think multiple data centers: every data point goes to multiple locations

  • Backup in S3 for catastrophic failures

  • All raw data are stored in Spark cluster
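
The rollup idea behind sub-2-second reports: collapse raw events onto a coarse dimension key ahead of time, so queries scan thousands of aggregated rows instead of billions of raw ones. A toy sketch with hypothetical dimensions (hour, country, ad id):

    import java.util.HashMap;
    import java.util.Map;

    public class Rollup {
        // event = {hour, country, adId, impressions, clicks}
        public static Map<String, long[]> aggregate(Iterable<String[]> events) {
            Map<String, long[]> byKey = new HashMap<>();
            for (String[] e : events) {
                String key = e[0] + "|" + e[1] + "|" + e[2]; // coarse dimension key
                long[] agg = byKey.computeIfAbsent(key, k -> new long[2]);
                agg[0] += Long.parseLong(e[3]); // total impressions
                agg[1] += Long.parseLong(e[4]); // total clicks
            }
            return byKey;
        }
    }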

 

Team

 

Structure

  • 70 full-time employees

  • 15 developers (platform, ad serving, frontend, mobile)

  • 4 data scientists

  • 5 DevOps engineers

  • Engineers in Palo Alto, CA

  • Business in San Francisco, CA

  • Offices in New York, London and Berlin

Interaction

  • HipChat to discuss most issues

  • Asana for project-based communication

  • All code is reviewed

  • Frequent group code reviews

  • Quarterly company outings

  • Regular town hall meetings with CEO

  • All engineers (junior to CTO) write code

  • Interviews are tough: offers are really rare

Development Cycle

  • A developer, the business side, or the data science team comes up with an idea

  • The idea is reviewed and scheduled for execution at a Monday meeting

  • Feature is implemented in a branch; development environment is used for basic testing

  • A pull request is created

  • Code is reviewed and iterated upon

  • For big features, group code reviews are common

  • The feature gets merged to master

  • The feature gets deployed to staging with the next build

  • The feature gets tested on 5% of real traffic

  • Statistics are examined

  • If the feature is successful it graduates to production

  • The feature is closely monitored for a couple of days

Avoiding Issues

  • The system is designed to handle failure of any component

  • No failure of a single component can harm ad serving or data consistency

  • Omniscient monitoring

  • Engineers watch and analyze key business reports

  • High quality of code is essential

  • Some features take multiple code reviews and iterations before graduating

  • Alarms are triggered when:

    • Stats for staging differ from production (see the deviation sketch after this list)

    • FATAL errors occur on critical services

    • The error rate exceeds a threshold

    • Any irregular activity is detected

  • Data are never dropped

  • Most log lines can be easily parsed

  • Rolling back of any change is easy by design

  • After every failure: fix, make sure same thing does not happen in the future, and add monitoring
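
One way to implement the staging-vs-production alarm above: scale the staging rate up by its traffic share and alert when the projected value strays too far from production. The 25% tolerance is an illustrative threshold, not AppLovin's actual value:

    public class DeviationAlarm {
        static boolean shouldAlert(double stagingRate, double productionRate,
                                   double stagingShare) {
            double projected = stagingRate / stagingShare; // e.g. divide by 0.05
            double deviation = Math.abs(projected - productionRate)
                    / Math.max(productionRate, 1e-9);
            return deviation > 0.25;
        }

        public static void main(String[] args) {
            // A 5% staging slice at 10k req/s vs production at 300k req/s:
            // projected 200k is ~33% below production, so this alerts.
            System.out.println(shouldAlert(10_000, 300_000, 0.05)); // true
        }
    }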

 

Lessons Learned

 

Product Development

  • Being able to swap any component easily is key to growth

  • Failures drive innovative solutions

  • Staging environment is essential: always be ready to lose 5%

  • A/B testing is essential

  • Monitor everything

  • Build intelligent alerting system

  • Engineers should be aware of business goals

  • Business people should be aware of limitations of engineering

  • Make builds and continuous integration fast. Jenkins runs on 2 bare-metal servers with 32 CPU cores, 128 GB RAM, and SSDs

Infrastructure

  • Monitoring all data points is critical

  • Automation is important

  • Every component should support HA by design

  • Kernel optimizations can yield up to a 25% performance improvement

  • Process and IRQ balancing lead to another 20% performance improvement

  • Power-saving features degrade performance

  • Use SSDs as much as possible

  • When optimizing, profile everything. Flame graphs are great!

 


Reader Comments (7)

How do they load balance requests between the servers?

March 10, 2015 | Unregistered Commenterune

@une: We use a combination of software (ipvs, haproxy) and hardware load balancers. We plan on gradually moving away from traditional hardware load balancers.

March 10, 2015 | Unregistered CommenterBasil Shikin

How often do you release new features to production?
Can you describe more how your code reviews are performed, are they formal, informal, what is the focus.

March 11, 2015 | Unregistered CommenterWes

Hi Basil, great write up thanks.

I'm particularly interested in your "staging" environment right now as we are working at similar scale and figuring out how to manage this sort of environment.

Could you comment more about how "deep" it is? i.e. is it only your frontend/app servers that are replicated and "staging" still shares production databases? Or do you somehow have a whole separate environment right down to databases and other backend services?

If you do have any stateful services in staging, how do you manage issues like consistency/schema change/only partial data available? If you don't, how do you do integration testing/deployment for those services - especially things like log ingestion pipeline which seems pretty crucial to what you do?

Thanks!

March 12, 2015 | Unregistered CommenterPaul Banks

@Wes
We have a pretty busy release schedule. Usually 2-3 times a week.
We use GitHub Flow for code review and discussions. We focus on implementing the feature in the simplest way possible.

March 13, 2015 | Unregistered CommenterBasil Shikin

@Paul Banks
Our staging environment is pretty deep. We try to make sure every production service has a staging counterpart that gets 5-10% of real data.

We do need to share some state to make the staging environment comprehensive. Some tricky issues could be caught only with real data.
Since staging environment could run experimental code, we have made sure to put measures in place that allow easy separation of production and staging data.

This setup also helps our monitoring: most of the time production and staging environments are expected to behave in a similar manner. A deviation might point to a potential issue.

March 13, 2015 | Unregistered CommenterBasil Shikin

Hi, could you please explain simply how you handle metrics with 40 dimensions? We have a system with a similar number of dimensions, and I am trying to use HBase and Redis for 2 million metric points per minute with very few servers.

April 1, 2015 | Unregistered Commenterhalil
