Guest post by Thierry Schellenbach, Founder/CTO of Fashiolista.com, follow @tschellenbach on Twitter and Github
Fashiolista started out as a hobby project which we built on the side. We had absolutely no idea it would grow into one of the largest online fashion communities. The entire first version took about two weeks to develop and our feed implementation was dead simple. We’ve come a long way since then and I’d like to share our experience with scaling feed systems.
Feeds are a core component of many large startups such as Pinterest, Instagram, Wanelo and Fashiolista. At Fashiolista the feed system powers the flat feed, aggregated feed and the notification system. This article will explain the troubles we ran into when scaling our feeds and the design decisions involved with building your own solution. Understanding the basics of how these feed systems work is essential as more and more applications rely on them.
Furthermore we’ve open sourced Feedly, the Python module powering our feeds. Where applicable I’ll reference how to use it to quickly build your own feed solution.
The problem of scaling feed systems has been widely discussed, but let me start by clarifying the basics. The solutions are aimed at making a page like your Facebook news feed, Twitter stream or Fashiolista feed work under high traffic conditions. What all these systems have in common is that they show activities by the people you follow. In our case we based the activity object on this standard for activity streams. Examples of activities are “Thierry added an item to a list on Fashiolista” or “Tommaso tweeted a message”.
Pull, where the feed is gathered during reads
Push where all the feeds are pre computed during the writes.
Most real live applications will use a combination of these two approaches. The process of pushing an activity to all your followers is called a fanout.
The Feed system at Fashiolista went through three major redesigns. The first version worked on a PostgreSQL database, the second used Redis and the third and current version runs on Cassandra. To give you an understanding of when and why these solutions fall apart I’ll briefly cover a bit of history.
Our first setup simply queried a PostgreSQL database and looked something like this
select * from love where user_id in (...)
The most surprising thing was how robust this system was. We passed 1M loves and it kept on working, soon after we hit 5M loves and it still kept on working. Our bet was that it would break after 10M loves, but it just kept on running smoothly. It took some database tweaking but this simple system held up well into ~100M loves and 1M users. Around that time the performance of this solution started to fluctuate. In general it kept on working, but for some users the latency spiked to multiple seconds. After reading many articles on feed design we built the very first version of Feedly with Redis.
Our second approach stored a feed for every user in Redis. When you loved an item this activity was fanned out to all your followers. We used a few smart tricks to keep memory usage low, which I’ll cover in the next section. Redis was really easy to setup and maintain. We sharded across several Redis machines using Nydus and used Sentinel for automatic failovers. (Currently we recommend using Twemproxy instead of Nydus)
I think our history is quite representative of the process other companies go through. When the time comes for building your own feed system (hopefully using Feedly) there are a few important design decisions to consider.
Storing only the id vastly reduces your memory usage. However it also means another trip to your data store every time you load the feed. One factor to consider is how often you copy the data when denormalizing. It makes a huge difference if you are building a notification system or a news feed system. For a notification you usually notify 1 or 2 users for every action which occurs. However with a follow based feed systems the action might get copied to thousands of followers.
Furthermore the best choice really depends on your storage backend. With Redis you need to be careful about memory usage. Cassandra on the other hand has plenty of storage space, but is quite hard to use if you normalize your data.
For notification feeds and feeds built on Cassandra we recommend denormalizing your data. For feeds built on Redis you want to minimize your memory usage and keep your data normalized. Feedly allows you to pick which approach you prefer.
Silberstein et al. make things more interesting by looking at the consumer and producer pair. The basic intuition is that a push approach makes most sense when:
Redis however has a few limitations. All of your data needs to be stored in RAM which eventually becomes expensive. In addition there is no support for sharding built into Redis. This means that you have to roll your own system for sharding across nodes. (Twemproxy is a great option for this). Sharding across nodes is quite easy, but moving data when you add or remove nodes is a pain. You can work around these limitations by using Redis as a cache and falling back to your database. As soon as it becomes hard to fallback to the database I would consider moving from Redis to Cassandra.
The Cassandra Python ecosystem is still rapidly changing. Both CQLEngine and Python-Driver are excellent projects, but they needed a bit of forking to work together. If you choose Cassandra you need to be ready to invest time to learn about Cassandra and contribute to client libraries.
Feedly doesn’t make any of these choices for you. It’s a framework for building feed systems and leaves it up to you to determine what works best for your use case. For an introduction to Feedly have a look at the readme or this tutorial for building a Pinterest esque application. If you give it a try be sure to let us know if you encounter issues.
Note that you only need to solve this problem once you get millions of activities in your database. At Fashiolista the simple database solution got us to our first 100M loves and 1M users.
To learn more about feed design I highly recommend reading some of the articles which we based Feedly on: