Monday, February 20, 2017

Scaling @ HelloFresh: API Gateway

HelloFresh keeps growing every single day: our product is always improving, new ideas are popping up from everywhere, and our supply chain is becoming completely automated. All of this is amazing, but of course this constant growth brings many technical challenges.

Today I’d like to take you on a small journey that we went through to accomplish a big migration in our infrastructure that would allow us to move forward in a faster, more dynamic, and more secure way.

The Challenge

We’ve recently built an API Gateway, and now we had the complex challenge of moving our main (monolithic) API behind it — ideally without downtime. This would enable us to create more microservices and easily hook them into our infrastructure without much effort.

The Architecture

Our gateway is on the frontline of our infrastructure. It receives thousands of requests per day, and for that reason we chose Go to build it, because of its performance, simplicity, and elegant approach to concurrency.
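The gateway code itself isn't shown in the post, but the core idea is standard-library territory in Go. Here is a minimal, illustrative sketch; the upstream address and route are assumptions, not HelloFresh's actual configuration:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Forward everything under /api/ to the monolithic API. The upstream
	// address is illustrative; a real gateway adds routing, auth, rate
	// limiting, CORS and metrics around this core.
	upstream, err := url.Parse("http://main-api.service.consul:8080")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	http.Handle("/api/", proxy)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The standard library's ReverseProxy handles header rewriting and streaming; everything else a production gateway needs is layered around it.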

We already had many things in place that made this transition simpler. Some of them were:

Service Discovery and Client Side Load Balancing

We use Consul as our service discovery tool. Together with HAProxy, it lets us solve two of the main problems of moving to a microservice architecture: service discovery (automatically registering new services as they come online) and client-side load balancing (distributing requests across servers).
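How services get registered in Consul isn't described in the post; one common approach is for each service (or its provisioning scripts) to register itself with the local agent. A hedged sketch using the official Go client, github.com/hashicorp/consul/api, with illustrative names, ports, and endpoints:

```go
package main

import (
	"log"

	consul "github.com/hashicorp/consul/api"
)

func main() {
	// Talk to the local Consul agent (default address).
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Register this instance with a health check so HAProxy (kept in sync via
	// consul-template or Consul's DNS interface) can start routing to it.
	// Service name, port and endpoints are illustrative.
	reg := &consul.AgentServiceRegistration{
		ID:   "auth-service-1",
		Name: "auth-service",
		Port: 8080,
		Check: &consul.AgentServiceCheck{
			HTTP:     "http://localhost:8080/health",
			Interval: "10s",
			Timeout:  "1s",
		},
	}
	if err := client.Agent().ServiceRegister(reg); err != nil {
		log.Fatal(err)
	}
}
```

Once the service and its health check are in Consul, keeping the load balancer's backend list current becomes an automated concern rather than a manual one.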

Automation

Maybe the most useful tool in our arsenal was the automation of our infrastructure. We use Ansible to provision everything in our cloud, from a single machine to networking, DNS, CI machines, and so on. Importantly, we've implemented a convention: when creating a new service, the first thing our engineers tackle is writing the Ansible scripts for that service.

Logging and Monitoring

I like to say that anything that goes into our infrastructure should be monitored somehow. We have some best practices in place for how to properly log and monitor an application.

  • Dashboards around the office show how the system is performing at any given time.
  • For logging we use the ELK Stack, which allows us to quickly analyze detailed data about a service’s behavior.
  • For monitoring we love the combination of StatsD + Grafana. It is amazing what you can accomplish with these tools (see the sketch below).

Grafana dashboards give amazing insight into your performance metrics
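StatsD speaks a tiny plain-text protocol over UDP, so emitting metrics from a Go service needs nothing beyond the standard library. A minimal sketch; the daemon address and metric names are illustrative, not HelloFresh's actual ones:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// StatsD accepts plain-text datagrams of the form "name:value|type".
	conn, err := net.Dial("udp", "statsd.internal:8125")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	start := time.Now()
	// ... handle a request here ...
	elapsed := time.Since(start)

	fmt.Fprintf(conn, "gateway.requests:1|c\n")                                         // counter
	fmt.Fprintf(conn, "gateway.response_time:%d|ms\n", int64(elapsed/time.Millisecond)) // timer
}
```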

Understanding the current architecture

Even with all these tools in place we still had a hard problem to solve: understanding the current architecture and how to pull off a smooth migration. At this stage, we invested some time in refactoring our legacy applications to support our new gateway and the authentication service that would also be introduced in this migration (watch this space for another article on that - Ed).

Some of the problems we found:

  • While we can change our mobile apps, we have to assume people won’t update straight away. So we had to keep backwards compatibility — for example in our DNS — to ensure older versions didn’t stop working.
  • We had to analyze all routes available in our public and private APIs and register them in the gateway in an automated way.
  • We had to disable authentication from our main API and forward this responsibility to the auth service.
  • We had to ensure the security of the communication between the gateway and the microservices (a minimal sketch of one approach follows this list).
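The post doesn't say how gateway-to-service trust is implemented; one simple pattern is for the gateway to sign a header that downstream services verify, so services can assume authentication already happened. A hedged sketch of such a middleware in Go; the header names and shared secret are invented for illustration:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"log"
	"net/http"
)

// In practice the secret would be distributed via provisioning (Ansible) or a
// secret store; the value and header names here are purely illustrative.
var sharedSecret = []byte("replace-me")

// verifyGatewaySignature only lets through requests whose X-User-Id header was
// signed by the gateway, so the service can trust that auth already happened.
func verifyGatewaySignature(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		mac := hmac.New(sha256.New, sharedSecret)
		mac.Write([]byte(r.Header.Get("X-User-Id")))
		expected := hex.EncodeToString(mac.Sum(nil))

		if !hmac.Equal([]byte(expected), []byte(r.Header.Get("X-Gateway-Signature"))) {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/me", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello, " + r.Header.Get("X-User-Id")))
	})
	log.Fatal(http.ListenAndServe(":8081", verifyGatewaySignature(mux)))
}
```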

To solve the route import problem we wrote a script (in Go, again) to read our OpenAPI specification (aka Swagger) and create a proxy with the correct rules (rate limiting, quotas, CORS, etc.) for each resource of our APIs.
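The script itself isn't published in the article. As a rough sketch of the idea, a Swagger 2.0 document can be walked with nothing but encoding/json, emitting one gateway rule per path and method; the file name is illustrative and the real script would register routes with the gateway rather than print them:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Minimal subset of a Swagger 2.0 document: basePath plus paths -> operations.
type swaggerSpec struct {
	BasePath string                                 `json:"basePath"`
	Paths    map[string]map[string]json.RawMessage `json:"paths"`
}

func main() {
	f, err := os.Open("swagger.json") // illustrative file name
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var spec swaggerSpec
	if err := json.NewDecoder(f).Decode(&spec); err != nil {
		panic(err)
	}

	// The real script would register each route with the gateway together with
	// its rate limiting, quota and CORS settings; here we only print them.
	for path, operations := range spec.Paths {
		for method := range operations {
			if method == "parameters" { // path-level parameters, not an operation
				continue
			}
			fmt.Printf("register %s %s%s\n", method, spec.BasePath, path)
		}
	}
}
```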

To test the communication between the services we simply set up our whole infrastructure in a staging environment and ran our automated tests. I must say this was the most helpful thing we had during the migration process. We have a large suite of automated functional tests that helped us maintain the same contract that the main API was returning to our mobile and web apps.
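What such a contract test can look like in Go's testing package, hitting the staging environment through the gateway; the URL and response fields below are invented, the point is only that the status code and response shape are pinned down:

```go
package contract

import (
	"encoding/json"
	"net/http"
	"testing"
)

// TestRecipesContract is an illustrative functional test: it pins down the
// status code and the top-level response shape of one endpoint so the gateway
// migration cannot silently change the contract.
func TestRecipesContract(t *testing.T) {
	resp, err := http.Get("https://staging.example.com/api/recipes")
	if err != nil {
		t.Fatal(err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		t.Fatalf("expected 200, got %d", resp.StatusCode)
	}

	var body struct {
		Items []struct {
			ID   string `json:"id"`
			Name string `json:"name"`
		} `json:"items"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		t.Fatalf("response no longer matches the agreed contract: %v", err)
	}
}
```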

After we were reasonably sure that our setup worked in staging, we started to think about how to move it to production.

The first attempt

Spoiler alert: our first attempt at going live was pretty much a disaster. Even though we had a decent plan in place, we were definitely not ready to go live at that point. Let's walk through our initial plan step by step:

  • Deploy latest version of the API gateway to staging
  • Deploy the main API with changes to staging
  • Run the automated functional tests against staging
  • Run manual QA tests on staging website and mobile apps
  • Deploy latest version of the API gateway to live
  • Deploy the main API with changes to live
  • Run the automated functional tests against live
  • Run manual QA tests on live website and mobile apps
  • Beer

Everything went quite well on staging (at least according to our tests), but when we decided to go live we started to have some problems.

  1. Overload on the auth database: we underestimated the number of requests we would receive, causing our database to refuse connections
  2. Wrong CORS configuration: for some endpoints we configured the CORS rules incorrectly, causing requests from the browser to fail (a sketch of the kind of middleware involved follows this list)
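The misconfiguration details aren't given, so here is only a generic, hedged sketch of CORS handling in a Go gateway: both the headers on normal responses and the answer to the browser's OPTIONS preflight have to be right. The allowed origin and header lists are illustrative:

```go
package main

import (
	"log"
	"net/http"
)

// The allowed origin is illustrative; a real gateway would usually drive this
// per route from its configuration.
var allowedOrigin = "https://www.hellofresh.com"

func cors(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Access-Control-Allow-Origin", allowedOrigin)
		w.Header().Set("Access-Control-Allow-Methods", "GET, POST, PUT, DELETE, OPTIONS")
		w.Header().Set("Access-Control-Allow-Headers", "Authorization, Content-Type")

		if r.Method == http.MethodOptions {
			w.WriteHeader(http.StatusNoContent) // answer the preflight and stop
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/ping", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("pong"))
	})
	log.Fatal(http.ListenAndServe(":8080", cors(mux)))
}
```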

Because our database was being flooded with requests we had to roll back right away. Luckily, our monitoring was able to pinpoint that the problem occurred when requesting new tokens from the auth service.

The second attempt

We knew that we hadn't prepared well for our first deploy, so the first thing we did right after rolling back was hold a post-mortem. Here are some of the things we improved before trying again:

  • Prepare a blue-green deployment procedure. We created a replica of our live environment with the gateway already deployed, so all we needed to do when the time came was make one configuration change to bring this cluster online. We could roll back with the same simple change if necessary.
  • Gather more metrics from the current applications to help us size our machines correctly for the load. We used the data from the first attempt as a yardstick for the amount of traffic we expected, and ran load tests with Gatling to ensure we could comfortably accommodate that traffic.
  • Fix known issues with our auth service before going live. These included a problem with case-sensitivity, a performance issue when signing a JWT, and (as always) adding more logging and monitoring.

It took us around a week to finish all those tasks, and when we were done, the deployment went smoothly with no downtime. Even with the successful deployment we found some corner-case problems that the automated tests didn't cover, but we were able to fix them without a big impact on our applications.

The results

In the end, our architecture looked like this:

API Gateway Architecture

Main API

  • 10+ main API servers on High-CPU Large machines
  • MySQL instances run in a master-replica setup (3 replicas)

Auth service

  • 4 application servers
  • PostgreSQL instances run in a master-replica setup (2 replicas)
  • RabbitMQ cluster is used to asynchronously handle user updates (a sketch of the idea follows this list)
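The user-update flow isn't detailed in the post; as a hedged illustration, the monolith could publish an event that the auth service consumes asynchronously. A sketch using the github.com/streadway/amqp client, where the connection URL, exchange, and routing key are assumptions:

```go
package main

import (
	"encoding/json"
	"log"

	"github.com/streadway/amqp"
)

func main() {
	// Connection URL, exchange and routing key are illustrative; the exchange
	// is assumed to be declared already.
	conn, err := amqp.Dial("amqp://guest:guest@rabbitmq.internal:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	defer ch.Close()

	// The monolith publishes user changes; the auth service consumes them and
	// updates its own store, so the two stay in sync without a blocking call.
	body, _ := json.Marshal(map[string]string{"id": "42", "email": "user@example.com"})
	err = ch.Publish("users", "user.updated", false, false, amqp.Publishing{
		ContentType: "application/json",
		Body:        body,
	})
	if err != nil {
		log.Fatal(err)
	}
}
```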

API Gateway

  • 4 application servers
  • MongoDB instances run in a master-replica setup (4 replicas)

Miscellaneous

  • Ansible is used to execute commands in parallel on all machines. A deploy takes only seconds
  • Amazon CloudFront as the CDN/WAF
  • Consul + HAProxy as service discovery and client side load balancing
  • Statsd + Grafana to graph metrics across the system and alert on problems
  • ELK Stack for centralizing logs across different services
  • Concourse CI as our Continuous Integration tool

I hope you’ve enjoyed our little journey, stay tuned for our next article.


Reader Comments (16)

"thousands of request per day"

Surely a typo? And I don't mean "request" [sic], I mean "thousands".

February 20, 2017 | Unregistered CommenterJason

What exactly are those "Main API" servers? Are those the ones which run the different services? Or are those the old ones, which used to run the 10+ replicas of your monolithic main application?

February 21, 2017 | Unregistered CommenterDominik

Please explain: what was the use case of RabbitMQ again?

February 21, 2017 | Unregistered CommenterSankalp

When I first saw this article I was confused. The question "Who is HelloFresh and what do they do?" needed to be answered in the first paragraph. Please introduce yourself before getting into the technical weeds.
Thank you.

February 21, 2017 | Unregistered CommenterAngus

The icon in the infrastructure diagram above the load balancer was really confusing. Not all readers know that the icon refers to AWS Route 53 (a DNS service). All the other components had labels, so why not "DNS" for that icon?

At first I actually thought the icon was for AWS's API Gateway, because they have a product with the same name and you used an AWS icon.

February 21, 2017 | Unregistered Commenterjoonas.fi

Thousands of requests?

February 22, 2017 | Unregistered CommenterVrashabh

Hi Dominik,

Those Main API servers are our monolithic API. We are working on breaking it down into separate services, but we still have a lot of work to do.

February 22, 2017 | Registered CommenterItalo Vietro

Hi Sankalp,

We use RabbitMQ extensively on our platform. The idea is that the message broker helps us achieve a more decoupled architecture, where it is easy to extract new services and to scale.
In the particular example mentioned in the article, we use it to push user updates from the monolithic API to the auth service; this was essential for completing the migration of users to the new service.
In a future article, I will go into more detail about the auth service.

February 22, 2017 | Registered CommenterItalo Vietro

Hi Angus,

You are definitely right; I forgot to do an introduction to HelloFresh and what we do.
You can read a bit more about our engineering team here: https://engineering.hellofresh.com

Thank you for the feedback

February 22, 2017 | Registered CommenterItalo Vietro

Unless you are expecting to have millions of requests per day in the really short term, I'd say that this architecture is really overkill. All of this sounds great; the question is just: is it necessary?

February 22, 2017 | Unregistered CommenterAmel Musić

1. MySQL
2. PostgreSQL
3. MongoDB

Man, I'm wondering why Oracle is missing :) Sorry but it sounds really complicated.

February 22, 2017 | Unregistered CommenterVish

Thousands of request ???

IMHO this architecture sounds horribly complicated. I'd get rid of the microservices/API gateway/service discovery shit and simply put as many HTTPS endpoints as I want running on Jetty servers. Very easy to manage and scalable to billions of requests per day.

February 22, 2017 | Unregistered CommenterVish

Hi Guys, enjoyed the article.

Just wondering why you rolled your own, instead of using Tyk.io, which seems very similar, has the features you are looking for, is open source and Free, as in beer.

I work at Tyk, so the feedback would be great!

February 23, 2017 | Unregistered CommenterJames

Interesting article, but also curious why all the complexity for thousands of requests.

Also wondering why you chose to roll your own API Gateway vs using one of many existing (paid or free) solutions. Something like Kong, Tyk or even Amazon API Gateway.

February 24, 2017 | Unregistered CommenterRamin

Hey Guys,

To answer a couple of questions about the complexity: the MySQL database I mentioned belongs to an already existing legacy API, the PostgreSQL comes from another service that uses Postgres as its storage, and MongoDB is the only storage we have in the gateway (of course we could use Postgres or MySQL, but since we already had the cluster set up we just opted to use it). Having many data stores is not a problem for us; the ideal is that each service has its own, and the teams can choose whatever makes more sense for them.

About the thousands of requests: yes, it's not much, but this is only the main API; we have many other third-party applications, integrations, and supply chain applications that consume other parts of our platform. The reason we chose to go towards microservices is that it is easier for us to work in separate bounded contexts and to deploy independently.

And why did we choose to build our own gateway?
We tried out the Amazon API Gateway and Tyk, but since we had our own authentication provider, integrating it with the AWS Gateway was not ideal: we would have had to deal with Lambdas to add a custom auth provider, shipping metrics to Grafana would be a bit more complicated, and of course we would be locked in to a single provider.
Tyk didn't give us (at least at the time) the option to use our own auth provider; we had to use the built-in policies, user management, and ACLs, which was something we didn't want. I think the product today is very different, but that was the main reason at the time.

Also, with our own gateway we can version our route configuration files in git and keep a change log of them, which for us is extremely important.

March 2, 2017 | Registered CommenterItalo Vietro

Great article, congratulations! I'm wondering if you investigated Vamp (vamp.io) for adding granular canary releasing to your stack? We also use HAProxy, can integrate with Consul, and use ELK for metric aggregation. And of course much more related to the container and microservices space (e.g. Kubernetes and AWS ECS integration for unified deployments and autoscaling). Interested in hearing your feedback! (Disclaimer: I'm one of the founders of Vamp) Cheers, Olaf

March 23, 2017 | Unregistered CommenterOlaf Molenveld
