Leveraging AWS to Build a Scalable Data Pipeline
Monday, June 8, 2015 at 10:06AM 
While at Netflix and LinkedIn, Siddharth "Sid" Anand wrote some great articles for High Scalability. Sid is back, this time as a Data Architect at Agari. The original article is here.
Data-rich companies (e.g. LinkedIn, Facebook, Google, and Twitter) have historically built custom data pipelines on bare metal in custom-designed data centers. To meet strict requirements on data security, fault tolerance, cost control, job scalability, and uptime, they need to closely manage their core technology. Like serving systems (e.g. web application servers and OLTP databases), which need to be up 24x7 to display content to users the world over, data pipelines need to be up and running in order to pick the most engaging and up-to-date content to display. In other words, the updated ranking models, fresh content recommendations, and the like that these pipelines produce are what make them an integral part of the end user's web experience.
Agari, a data-driven email security company, is no different in its demand for a low-latency, reliable, and scalable data pipeline. It must process a flood of inbound email and email-authentication metrics, analyze this data in a timely manner, often enriching it with third-party data and model-derived data, and publish its findings. One twist is that Agari, unlike the companies listed above, operates entirely in the cloud, specifically on AWS. This has turned out to be more of a boon than a disadvantage.
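To make the shape of such a pipeline concrete, here is a minimal sketch (not Agari's actual code) of one AWS-native stage: a worker that long-polls an SQS queue for notifications about newly arrived data, pulls the referenced object from S3, enriches each record, and publishes the result for downstream consumers. The queue URL, bucket name, and enrich_record function are hypothetical placeholders.

```python
# Minimal sketch of one pipeline stage (hypothetical names, not Agari's code).
# A worker polls SQS for "new data" notifications, fetches the referenced
# object from S3, enriches each record, and writes the result back to S3.
import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/inbound-metrics"  # placeholder
RESULTS_BUCKET = "example-pipeline-results"  # placeholder


def enrich_record(record):
    """Stand-in for enrichment with third-party and model-derived data."""
    record["enriched"] = True
    return record


def run_once():
    # Long-poll the queue so idle workers don't spin.
    resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                               MaxNumberOfMessages=10,
                               WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])  # e.g. {"bucket": ..., "key": ...}
        obj = s3.get_object(Bucket=body["bucket"], Key=body["key"])
        lines = obj["Body"].read().decode("utf-8").splitlines()
        records = [json.loads(line) for line in lines]

        enriched = [enrich_record(r) for r in records]

        # Publish findings for downstream consumers.
        s3.put_object(
            Bucket=RESULTS_BUCKET,
            Key="enriched/" + body["key"],
            Body="\n".join(json.dumps(r) for r in enriched).encode("utf-8"),
        )
        # Delete the message only after the output is durably stored, so a
        # crashed worker's message becomes visible again and is retried.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```

Deleting the SQS message only after the enriched output lands in S3 gives the stage at-least-once semantics: if a worker dies mid-batch, the message reappears after its visibility timeout and another worker picks it up.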
Below is one such data pipeline used at Agari.




Underutilization and segregation are the classic strategies for ensuring resources are available when work absolutely must get done. Keep a database on its own server so that when load spikes, another VM or a high-priority thread can't contend for its RAM, power, disk, or CPU. And when you really need fast, reliable networking, you don't rely on QoS; you keep a dedicated line.
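In AWS terms, that kind of segregation maps onto knobs like dedicated tenancy and provisioned IOPS. As a rough, hypothetical illustration (the AMI ID, instance type, and volume sizes are placeholders, not anything from Agari's setup), a database host with that sort of isolation might be launched like this:

```python
# Rough sketch of the "segregation" approach expressed in AWS terms
# (hypothetical AMI ID, instance type, and sizes; not from the original article).
import boto3

ec2 = boto3.client("ec2")

# A database host on single-tenant hardware with an EBS-optimized,
# provisioned-IOPS volume, so noisy neighbors can't eat its disk or CPU.
ec2.run_instances(
    ImageId="ami-12345678",               # placeholder AMI
    InstanceType="r3.2xlarge",
    MinCount=1,
    MaxCount=1,
    EbsOptimized=True,
    Placement={"Tenancy": "dedicated"},   # no other customers on the box
    BlockDeviceMappings=[{
        "DeviceName": "/dev/sdf",
        "Ebs": {"VolumeType": "io1", "Iops": 4000, "VolumeSize": 500},
    }],
)
```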




