Monday, February 9, 2015

Vinted Architecture: Keeping a busy portal stable by deploying several hundred times per day

This is a guest post by Nerijus Bendžiūnas and Tomas Varaneckas of Vinted.

Vinted is a peer-to-peer marketplace to sell, buy and swap clothes. It allows members to communicate directly and has the features of a social networking service.

Started in 2008 as a small community for Lithuanian girls, it has developed into a worldwide project that serves over 7 million users in 8 different countries and keeps growing non-stop, handling over 200 million requests per day.

Stats

  • 7 million users and growing
  • 2.5 million monthly active users
  • 200 million requests / day (pageviews + API calls)
  • 930 million photos
  • 80 TB raw space for image storage
  • 60 TB HDFS raw space for analytics data
  • 70% of traffic comes from mobile apps (API requests)
  • 3+ hours / week time spent per member
  • 25 million listed items
  • Over 220 servers:
    • 47 for internal tools (vpn, chef, monitoring, development, graphing, build, backups, etc.)
    • 38 for the Hadoop ecosystem
    • 34 for image processing and storage
    • 30 for Unicorn and microservices
    • 28 for MySQL databases, including replicas
    • 19 for the Sphinx search
    • 10 for Resque background jobs
    • 10 for load balancing with Nginx
    • 6 for Kafka
    • 4 for Redis
    • 4 for email delivery

Technology Stack

  • Third-party services
  • Core app and services
  • Data warehouse stack

Servers and Provisioning

  • Mostly bare metal
  • Multiple data centers
  • CentOS
  • Chef for provisioning nearly everything
  • Berkshelf for managing Chef dependencies
  • Test Kitchen for running infrastructure tests on VMs
  • Ansible for rolling upgrades

Monitoring

  • Nagios for usual ops monitoring
  • Cacti for graphs and capacity planning
  • Graylog2 for app logs
  • Logstash for server logs
  • Kibana for querying logstash
  • Statsd for collecting real-time metrics from the app
  • Graphite for storing real-time metrics from the app
  • Grafana for creating beautiful dashboards with app metrics
  • Hubot for chat-based monitoring

Architecture overview

  • Bare metal servers are used as a cheaper and more powerful alternative to cloud providers.
  • Servers hosted in 3 different data centers across Europe and the US.
  • Nginx frontends route HTTP requests to Unicorn workers and do SSL termination.
  • Pacemaker is used to ensure high availability of Nginx servers.
  • Every country's Vinted portal has its own separate deployment and set of resources.
  • To increase machine utilization, services belonging to different portals may run alongside each other on a single bare metal server. Chef recipes take care of unique port allocation and other separation concerns.
  • MySQL sharding is avoided by having a separate database for each portal.
  • The largest portal already uses functional sharding, and the point where its single largest table no longer fits on one server is only months away.
  • Caching in the Rails app uses a custom mechanism with an L2 cache that prevents spikes when the main cache expires (a sketch of the fallback logic follows this list). Several caching strategies are used:
    • remote (in memcached)
    • local (in Unicorn worker memory)
    • semi-local (in Unicorn worker memory, falling back to memcached when the local cache expires).
  • Several microservices are built around the core rails app, all with a clear purpose, like sending iOS push notifications, storing and serving brand names, storing and serving hashtags.
  • Microservices can be either per-portal, per-datacenter or global.
  • Multiple instances of each microservice are running for high availability.
  • HTTP-based microservices are balanced by Nginx.
  • Custom feeds for each member are implemented using Redis.
  • Catalog pages are loaded from the Sphinx index rather than MySQL when filters are used (custom size, color, etc.).
  • Images are processed by a separate Rails app.
  • Processed images are stored in GlusterFS.
  • Images are cached only after the third hit, since the first few hits usually come from the uploader and the image may still be rotated.
  • Memcached instances are sharded using twemproxy.
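
The semi-local strategy is essentially a two-level fetch: check the Unicorn worker's in-process cache first, fall back to memcached, and only recompute when both miss. Below is a minimal sketch of that idea, assuming the Dalli memcached client; the class name, TTLs and keys are illustrative, not Vinted's actual code.

```ruby
require "dalli"

# Illustrative two-level ("semi-local") cache: a short-lived in-process store
# in front of memcached. All names and TTLs here are hypothetical.
class SemiLocalCache
  LOCAL_TTL  = 30   # seconds a value stays in the Unicorn worker's memory
  REMOTE_TTL = 300  # seconds a value stays in memcached

  def initialize(memcached = "127.0.0.1:11211")
    @remote = Dalli::Client.new(memcached)
    @local  = {} # { key => [expires_at, value] }
  end

  def fetch(key, &block)
    expires_at, value = @local[key]
    return value if expires_at && expires_at > Time.now

    # Local copy missing or expired: fall back to memcached, then to the block.
    value = @remote.fetch(key, REMOTE_TTL, &block)
    @local[key] = [Time.now + LOCAL_TTL, value]
    value
  end
end

cache = SemiLocalCache.new
popular_brands = cache.fetch("popular_brands") { %w[Nike Zara H&M] } # expensive lookup goes in the block
```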

Team

  • over 150 full-time employees
  • 30 developers (backend, frontend, mobile)
  • 5 site reliability engineers
  • Main HQ in Vilnius, Lithuania
  • Offices in the USA, Germany, France, the Czech Republic and Poland

How The Company Works

  • Nearly all information is open to all employees
  • Using GitHub for almost everything (including discussing company issues)
  • Real-time discussions in Slack
  • Everyone is free to participate in things that interest them
  • Autonomous teams
  • No seniority levels
  • Cross-functional development teams
  • No enforced process; teams decide how to manage themselves
  • Teams work on high-level problems and decide how to tackle them
  • Each team decides how, when and where they work
  • Teams hire for themselves when a need arises

Development Cycle

  • Developers branch from master.
  • Changes become pull requests in GitHub.
  • Jenkins runs Pronto to do static code and style checks, runs all tests on pull request branches and updates the GitHub pull request with a status.
  • Other developers review the change, add comments.
  • Pull requests may be updated multiple times with various fixes.
  • Jenkins runs all tests after each update.
  • Git history is cleaned up before merging with master to keep master history concise.
  • Accepted pull requests are merged into master.
  • Jenkins runs master build with all tests and triggers deployment jobs to roll out the new version.
  • Several minutes after the code reaches the master branch, it is running in production.
  • Pull requests are usually small.
  • Deployments happen ~300 times per day.

Avoiding Failures

Deploying hundreds of times per day does not mean everything is always broken, but keeping the site stable does require some discipline.

  • Deployments do not happen if tests are broken. There is no way to override this; master has to be green in order to deploy.
  • Automatic deployments are locked every night and during weekends.
  • Anyone can lock deployments manually from a Slack chat.
  • When deployments are locked, it is possible to "force" a deployment manually, from the Slack chat.
  • The team is very picky when it comes to reviewing code. The quality bar is set high. Tests are mandatory.
  • Before Unicorn reloads code, a warmup is run in production: several requests are made to various critical parts of the portal (see the sketch after this list).
  • If any one warmup request does not return 200 OK, the old code remains in place and the deployment fails.
  • Occasionally a problem is not caught by tests / warmup, and broken code is released to production.
  • Errors are streamed to the Graylog server.
  • Alarms are triggered if the error rate exceeds a threshold.
  • Error rate alarms and failed build notifications are reported instantly to a Slack chat.
  • All errors contain extra metadata: Unicorn worker host and pid, HTTP request details, git revision of the code that caused the problem, error backtrace.
  • Some types of "Fatal" errors are also reported directly to a Slack chat.
  • Each deployment log contains GitHub diff URL with all the changes.
  • If something breaks in production, it is easy to pinpoint the problem quickly due to a small changeset and instant error feedback.
  • Deployment details are reported to NewRelic, which makes it easy to spot newly introduced performance bottlenecks.
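
Conceptually, the warmup step boils down to hitting a handful of critical URLs on the freshly booted workers and refusing to switch over on anything other than a 200. A minimal sketch of that check; the paths, host and port are placeholders, not Vinted's actual warmup list.

```ruby
require "net/http"

# Hypothetical warmup check: request a few critical pages on the new workers
# and only allow the Unicorn reload if every one of them returns 200 OK.
WARMUP_PATHS = ["/", "/catalog", "/api/items", "/member/login"] # placeholders

def warmed_up?(host, port)
  WARMUP_PATHS.all? do |path|
    response = Net::HTTP.start(host, port, open_timeout: 5, read_timeout: 10) do |http|
      http.get(path)
    end
    response.code == "200"
  end
rescue StandardError
  false # any connection error counts as a failed warmup
end

abort "Warmup failed, keeping old code" unless warmed_up?("127.0.0.1", 8080)
```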

Reducing Time To Production

There is a big focus on both fast and stable releases; the team is always working to keep build and deployment times as short as possible.

  • Full test suite contains >7000 tests and runs in ~3 minutes.
  • New tests are added constantly.
  • Jenkins runs on a bare metal machine with 32 CPU cores, 128 GB of RAM and SSD drives.
  • Jenkins splits all tests into multiple chunks and runs them in parallel, aggregating the final results (see the sketch after this list).
  • Without parallelisation, the tests would take over an hour on an average machine.
  • Rails assets are pre-built in Jenkins for every commit on the master branch.
  • During deployment, the prebuilt assets are downloaded from Jenkins and uploaded to all target servers.
  • BitTorrent is used when uploading a release to target servers.
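
The comments below mention that the splitting is done with a slightly modified parallel_tests gem. In spirit, the chunking amounts to balancing spec files across N executors, roughly like this hypothetical sketch (the chunk count and the rspec command are assumptions):

```ruby
# Hypothetical balancing of spec files into chunks, similar in spirit to the
# parallel_tests gem: greedily assign the largest files to the emptiest chunk.
CHUNKS = 16 # assumed number of parallel Jenkins executors

spec_files = Dir.glob("spec/**/*_spec.rb").sort_by { |f| -File.size(f) }
chunks     = Array.new(CHUNKS) { { bytes: 0, files: [] } }

spec_files.each do |file|
  smallest = chunks.min_by { |c| c[:bytes] }
  smallest[:files] << file
  smallest[:bytes] += File.size(file)
end

chunks.each_with_index do |chunk, i|
  # Each chunk would run on its own executor, e.g.:
  #   bundle exec rspec #{chunk[:files].join(' ')}
  puts "chunk #{i}: #{chunk[:files].size} files (#{chunk[:bytes]} bytes)"
end
```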

Running live database migrations

We can change the structure of our production MySQL databases without downtime, in most cases even during peak hours.

  • Database migrations can happen during any deployment
  • Percona Toolkit's pt-online-schema-change is used to run ALTERs on a copy of the table while keeping it in sync with the original, and to swap the copy with the original when the change is done (an example invocation follows)
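
As an illustration of what such a migration looks like, the sketch below shells out to pt-online-schema-change with its documented --alter and --execute options; the column, database, table and host names are made up for the example.

```ruby
# Hypothetical online ALTER driven from Ruby. pt-online-schema-change builds a
# shadow copy of the table, keeps it in sync with triggers, then swaps the copy
# in place of the original.
alter_sql = "ADD COLUMN favourite_count INT NOT NULL DEFAULT 0" # example change

ok = system(
  "pt-online-schema-change",
  "--alter", alter_sql,
  "--host", "db-master.example.internal", # placeholder database host
  "--execute",                            # pt-osc refuses to run without --execute or --dry-run
  "D=vinted_portal,t=items"               # placeholder database and table
)

abort "Online schema change failed" unless ok
```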

Operations

Server Provisioning

  • Chef is used to provision nearly every aspect of our infrastructure (a recipe sketch follows this list).
  • The SRE team makes pull requests for infrastructure changes just like development teams do for code changes.
  • Jenkins verifies all Chef pull requests, running cross-dependency checks, Foodcritic, etc.
  • Chef code changes are reviewed in GitHub by the team.
  • Developers can make infrastructure change pull requests to Chef repository.
  • When pull requests are merged, Jenkins uploads changed Chef cookbooks and applies them in production.
  • There are plans to spawn DigitalOcean droplets to run Test Kitchen tests from Jenkins.
  • Jenkins itself is configured with Chef, not using the web UI.
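
Chef cookbooks are plain Ruby, and the per-portal separation mentioned in the architecture overview is a natural fit for a recipe loop. The sketch below is hypothetical: the attribute, template and service names are invented to show the shape of such a recipe, not taken from Vinted's cookbooks.

```ruby
# Hypothetical Chef recipe: give every portal sharing a machine its own
# Unicorn config and port. Attribute, template and service names are invented.
node["vinted"]["portals"].each_with_index do |portal, index|
  port = 3000 + index # trivial unique port allocation per portal

  template "/etc/unicorn/#{portal}.rb" do
    source "unicorn.rb.erb"
    variables(portal: portal, port: port, workers: 8)
    owner "deploy"
    group "deploy"
    mode "0644"
    notifies :restart, "service[unicorn-#{portal}]"
  end

  service "unicorn-#{portal}" do
    action [:enable, :start]
  end
end
```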

Monitoring

  • Cacti graphs server side metrics
  • Nagios checks for everything that can be up or down
  • Nagios health checks for some of our apps and services
  • Statsd / Graphite for tracking app metrics, like member logins, signups and database object creation (see the sketch after this list)
  • Grafana for nice dashboards
  • Grafana dashboards can be described and created dynamically using Chef
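
Feeding Statsd from the app is a one-liner per metric. A minimal sketch, assuming the statsd-ruby client gem; the metric names are invented examples of the kinds of counters mentioned above.

```ruby
require "statsd" # from the statsd-ruby gem (an assumption; any StatsD client works)

# Hypothetical app metrics. Statsd aggregates them and forwards them to
# Graphite, where the Grafana dashboards pick them up.
statsd = Statsd.new("localhost", 8125)

statsd.increment("members.logins")       # counter: a member logged in
statsd.increment("members.signups")      # counter: a member signed up
statsd.increment("models.item.created")  # counter: database object creation
statsd.timing("catalog.render_ms", 42)   # timer: catalog render time in ms
```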

Chat Ops

  • Hubot sits in Slack.
  • Slack rooms are subscribed to various event notifications via hubot-pubsub.
  • #deploy room shows information about deployments, with links to GitHub diffs, Jenkins console outputs, etc. Deployments can be triggered, locked and unlocked right from this room with simple chat commands.
  • #deploy-fail room shows only deployment failures.
  • #failboat room shows announcements about error rates and individual errors in production.
  • There are multiple #failboat-* rooms that give information about cron failures, stuck Resque jobs, Nagios warnings and NewRelic alerts.
  • Twice a day Graylog errors are crunched and cross-checked with GitHub, reports are generated and presented to the development team.
  • If someone mentions you in GitHub, Hubot gives you a link to the issue in a private message on Slack.
  • When an issue or pull request is created in GitHub, teams that are mentioned receive pings in their appropriate Slack rooms.
  • Graphite graphs can be queried and displayed in Slack chat via Hubot.
  • You can ask Hubot questions about the infrastructure, e.g. about the machines where a certain service is deployed. Hubot queries the Chef server to get the data.
  • Developers use Hubot notifications for certain events that happen in our app. For example, the customer support chatroom is automatically notified about possible fraud cases.
  • Hubot is integrated with the office Netatmo module and can tell everyone what the CO2, noise, temperature and humidity levels are.

Data warehouse stack

  • We ingest data from two sources:
    • Website event tracking data: impressions, clicks etc.
    • Database (MySQL) changes.
  • Most event tracking data is published to an HTTP endpoint from JavaScript/Android/iOS front-end apps. Some events are published to a local UDP socket by our core Ruby on Rails application. We chose UDP in order to avoid blocking the request.
  • There's a service that listens on the UDP socket for new events, buffers them and periodically emits them to a raw events topic in Kafka (sketched after this list).
  • A stream processing service consumes the raw events topic, validates the events, encodes them in Avro format and emits to event-specific topics.
  • We keep our Avro schemas of events in a separate centralized schema registry service.
  • Invalid events are placed into a separate Kafka topic for possible correction.
  • We use LinkedIn's Camus job to offload events from event-specific topics to HDFS incrementally. Time to Hadoop for an event is usually up to an hour.
  • Using Hive and Impala for ad-hoc queries and data analysis.
  • Working on a Scalding-based solution for data processing.
  • Reports run from our home-grown, OLAP-like reporting system written in Ruby.
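
The fire-and-forget UDP path is simple on both ends: the Rails process serializes an event and writes it to a local socket without waiting, and the listener drains the socket into a buffer that is flushed to Kafka in batches. A rough sketch using the standard library's UDPSocket; the port, batch size, topic name and the ruby-kafka gem on the flush side are assumptions rather than Vinted's actual setup.

```ruby
require "socket"
require "json"
require "kafka" # ruby-kafka gem, assumed here for the flush side

EVENTS_PORT = 5140 # hypothetical local port for tracking events

# --- Rails side: non-blocking, fire-and-forget event emission over UDP ---
def track_event(name, properties = {})
  payload = JSON.generate(event: name, properties: properties, at: Time.now.to_i)
  UDPSocket.new.send(payload, 0, "127.0.0.1", EVENTS_PORT)
rescue StandardError
  # Losing a tracking event must never break the web request.
end

track_event("item_impression", item_id: 123, portal: "lt")

# --- Listener side: buffer datagrams and periodically emit them to Kafka ---
socket = UDPSocket.new
socket.bind("127.0.0.1", EVENTS_PORT)

producer = Kafka.new(["kafka1:9092"]).producer # placeholder broker list
buffer   = []

loop do
  datagram, _sender = socket.recvfrom(64 * 1024)
  buffer << datagram
  next if buffer.size < 500 # flush in batches rather than per event

  buffer.each { |event| producer.produce(event, topic: "raw_events") }
  producer.deliver_messages
  buffer.clear
end
```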

Lessons Learned

Product development

  • Invest in code quality.
  • Cover absolutely everything with tests.
  • Release small changes.
  • Use feature switches to deploy unfinished features to production (a sketch follows this list).
  • Do not keep long-running branches of code when developing something.
  • Invest in a fast release cycle.
  • Build the product mobile-first.
  • Design the API well from the very beginning; it is difficult to change later.
  • Including sender hostname and app name in RabbitMQ messages can be a life saver.
  • Do not guess or make assumptions, do AB testing and check the data.
  • If you plan on using Redis for something big, think about sharding from the very beginning.
  • Never use forked and slightly modified Ruby gems if you plan on keeping your Rails up to date.
  • Make sharing your content through social networks as easy as possible.
  • Search engine optimization is a full-time job.
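
Feature switches don't need to be elaborate; something as small as a flag store that can be flipped at runtime already lets unfinished code ship dark. A hypothetical sketch backed by Redis (not Vinted's implementation), so a feature can be enabled per portal without a redeploy:

```ruby
require "redis"

# Hypothetical feature-switch helper: an unfinished feature ships behind a
# flag and is enabled per portal by flipping a Redis set, with no redeploy.
class FeatureSwitch
  def initialize(redis = Redis.new)
    @redis = redis
  end

  def enable(feature, portal)
    @redis.sadd("features:#{feature}", portal)
  end

  def disable(feature, portal)
    @redis.srem("features:#{feature}", portal)
  end

  def enabled?(feature, portal)
    @redis.sismember("features:#{feature}", portal)
  end
end

features = FeatureSwitch.new
features.enable(:new_catalog, "lt")

if features.enabled?(:new_catalog, "lt")
  # render the new, unfinished catalog for this portal only
else
  # everyone else keeps seeing the existing catalog
end
```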

Infrastructure and Operations

  • Deploy often to increase system stability. It may sound counter-intuitive at first.
  • Automate everything you can.
  • Monitor everything that can be down.
  • Network switch buffers matter.
  • Capacity planning is hard.
  • Think about HA from the start.
  • A RabbitMQ cluster is not worth the effort.
  • Measure and graph everything you can: HTTP response times, Resque queue sizes, Model persistence rates, Redis response times. You cannot improve what you cannot measure.

Data warehouse stack

  • There's a myriad of tools in the Hadoop ecosystem. It's hard to pick the right one.
  • If you are writing a user-defined function (UDF) for Hive, it's time to consider a non-SQL solution for data transformations.
  • The "Hadoop application" notion is vague. Oftentimes we need to glue our application components together manually.
  • Everyone writes their own workflow manager. And for a good reason.
  • Distributed system monitoring is tough. Anomaly detection and graphing helps.
  • The core Hadoop infrastructure (HDFS, YARN) is rock solid but the newer tools usually have unpleasant nuances.
  • Kafka is amazing.

Reader Comments (12)

Is this website about high scalability or how to waste servers? Do you really need 220 servers to serve ~2,300 req/sec?

February 10, 2015 | Unregistered CommenterPeter

They gave the breakdown of some of the server use there. It's Ruby on Rails, so that's why.

February 10, 2015 | Unregistered CommenterTim

Could you elaborate on "RabbitMQ cluster is not worth the effort"? Have you had problems getting it running?

February 11, 2015 | Unregistered CommenterIan

@Peter - likely, as with other high-availability sites, they could serve the traffic with fewer than ten servers.
As it seems from the article, the rest goes to (a) collecting and analysing the data; (b) monitoring and supervising the system; (c) ensuring high availability.
There are just 10 edge (cache/Nginx) servers (the 30 Unicorn servers are just a natural extension because of HA) over three regions. Given that traffic is not uniform (thus 200M req/day doesn't translate to 2,300 req/s), this seems moderate, as in any region a request may be handled by just one of several endpoint servers. Again, that's what you expect in an HA setup: 2+ servers for any potential source of incoming requests.

February 11, 2015 | Unregistered CommenterJustas Butkus

I have to agree with Peter. After recently reading the Stack Exchange article here, it's hard not to think that this setup is a bit wasteful of resources.

February 11, 2015 | Unregistered CommenterJon

Thanks for sharing.
It looks like you are using a large number of open source projects. I am curious about your decision: were the team members already familiar with most of the technologies at the beginning? If not, was it a bit overwhelming to ramp up on so many fronts?

February 11, 2015 | Unregistered CommenterTao

Hey guys,

Vaidotas, one of the Vinted engineers here.

I would like to comment on the server count. If you look at the details, you will see that a lot of servers are dedicated to analytics data gathering, and I feel that analytics data could be a separate topic.

If you look at the servers needed to run the Vinted app and everything it has, you see this:

34 for image processing and storage (930 million photos)
30 for Unicorn and microservices (200 million requests / day)
28 for MySQL databases, including replicas
19 for the Sphinx search (25 million listed items to index)
10 for Resque background jobs
10 for load balancing with Nginx (acts as a proxy with some caching; could be fewer, but we need them in the different countries we operate in).
4 for Redis

So that's 135 servers running the app infrastructure. Now keep in mind, everything needs to be highly available. That means a minimum of 2x of everything, even if you don't need/use it at the moment.

Hope that explains the server count a bit.

February 12, 2015 | Unregistered CommenterVaidotas

Also, what is your ops team size to maintain this kind of operation/infrastructure? You listed just 5 SREs. How about DevOps engineers, sysadmins and/or DBAs?

February 12, 2015 | Unregistered CommenterJon

Your Jenkins parallelisation setup seems interesting. How are you accomplishing that? Maybe another blog post dedicated to that soon?

February 12, 2015 | Unregistered CommenterTim

Jon, we have 5 site reliability engineers who maintain all infrastructure. We do everything from server bootstrapping to database administration. We heavily rely on automation :)

February 13, 2015 | Unregistered CommenterVaidotas

What I cannot really understand is how 200 M requests/day translates into real Internet traffic. StackExchange has 150 M requests and sits at position 53 in the Alexa rating; vinted.com is at position 38,763 but has a higher RPS.
The same goes for http://highscalability.com/blog/2014/11/10/nifty-architecture-tricks-from-wix-building-a-publishing-pla.html (Wix) - they claim 700 M requests/day, but surely cannot be compared in the Alexa rating with StackExchange.

Can anyone clarify?

February 19, 2015 | Unregistered CommenterAnton

@Tao, at the beginning there was only one developer (the CEO and co-founder) and the technology stack was a lot smaller. The answer to your question is NO.

When the project began to grow, some things changed. We hired more developers, more engineers, etc. We chose to hire the smartest ones. Some technologies were used because of experience with them, others were trial and error (especially the big data stuff). We were solving problems as we encountered them along the way. We still have problems we're solving. This is how new technologies appear. I believe that in a year we could write a different article on how we do things.

@Ian: RabbitMQ is great and we're using it, but when it comes to analytics events we hit some limitations and chose Kafka for event emission. The RabbitMQ cluster stuff is easy to break. As the documentation says, "RabbitMQ clustering does not tolerate network partitions well", and this is not as rare as we'd like it to be.

@Tim: We use a slightly modified version of https://github.com/grosser/parallel_tests

@Anton: I'm not sure what "transform" means, but we deliver more than 10 gigs of traffic to girls. Talking about Alexa, at the moment we have 8 different countries and, for example, one of ours (http://kleiderkreisel.de) is at position 7,468, so there is no single aggregated ranking. Another thing is that 70% of traffic comes from mobile apps, and I'm not sure Alexa counts that.

I hope, guys, that Vinted answered all your questions :)

February 21, 2015 | Unregistered CommenterNerijus Bendžiūnas
