Ask HighScalability: Facing scaling issues with news feeds on Redis. Any advice?
We just released a social section to our iOS app several days ago and we are already facing scaling issues with the users' news feeds.
We're basically using a fan-out-on-write (push) model for the users' news feeds (posts of people and topics they follow) and we're using Redis for this (backend is Rails on Heroku). However, our current 60,000 news feeds have ballooned our Redis store to almost 1GB in just a few days (it's growing way too fast for our budget). Currently we're storing the entire news feed for each user (post id, post text, author, icon url, etc.) and we cap the entries to 300 per feed.
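For concreteness, the write path looks roughly like this today (a simplified Ruby sketch; the key names and fields are illustrative, not our exact code):

    # Simplified fan-out-on-write: copy the full post payload into
    # every follower's feed list, capped at 300 entries.
    require 'redis'
    require 'json'

    REDIS = Redis.new
    FEED_CAP = 300

    def fan_out(post, follower_ids)
      payload = JSON.generate({ id: post[:id], text: post[:text],
                                author: post[:author], icon_url: post[:icon_url] })
      follower_ids.each do |uid|
        REDIS.lpush("feed:#{uid}", payload)          # newest first
        REDIS.ltrim("feed:#{uid}", 0, FEED_CAP - 1)  # keep only the newest 300
      end
    end

Since the full payload is duplicated into every follower's list, memory grows with followers x posts, which is what's eating our budget.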
I'm wondering if we should store just the post IDs of each user's feed in Redis and keep the rest of the post information somewhere else? Would love some feedback here. In that case, our iOS app would make an API call to our Rails app to retrieve a user's news feed. The Rails app would retrieve the news feed list (just post IDs) from Redis, and then it would need to query for the rest of the info for each post. Should we query our Postgres DB directly? That would be a lot of calls to our DB. Should we create another Redis store (so at least it's in memory) where we keep all of the posts from our DB and query that for the post information? Or should we forget Redis and go with MongoDB or Cassandra so we can have higher storage limits?
Thanks for your help in advance.
Reader Comments (28)
Don't hit a relational database directly if there's any kind of load; it will die. Use Redis to store IDs and keep the content in the file system or a caching subsystem. If you will have a constant stream of writes then you'll need to shard Redis. Possibly better is to use Cassandra so you don't have to worry about sizing or sharding.
Tumblr has a lot of good advice on handling feeds: http://highscalability.com/blog/2012/2/13/tumblr-architecture-15-billion-page-views-a-month-and-harder.html
Also take a look at DataSift: http://highscalability.com/blog/2011/11/29/datasift-architecture-realtime-datamining-at-120000-tweets-p.html
Prismatic: http://highscalability.com/blog/2012/7/30/prismatic-architecture-using-machine-learning-on-social-netw.html
Pinboard: http://highscalability.com/blog/2010/12/29/pinboardin-architecture-pay-to-play-to-keep-a-system-small.html
Facebook: http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
Old post: http://highscalability.com/blog/2007/11/6/quick-question-about-efficiently-implementing-facebook-news.html
Twitter of course, but I'm not sure what their current approach is.
If you're storing all of your newsfeed data in Redis, you're doing it wrong!
Redis is great at storing the structure of information, but it's expensive to run because the cost/gb is very high. Storing the individual news feed items in Redis isn't necessary; you only need to know which newsfeed items correspond to which newsfeed. Fortunately, RDBMSs are generally VERY inexpensive in terms of cost/gb, though they are very expensive for aggregate lookups (which Redis does very well and inexpensively).
I'd recommend storing the newsfeed data in a traditional MySQL/PostgreSQL db (pk'ed on the ID of the newsfeed item) and just pointing the individual newsfeed lists in redis at those IDs. You can then use Memcached (generally pretty cheap) to cache requests to the RDBMS.
For now (since newsfeeds for you are capped at 300 items and all items will eventually expire as they're pushed off the end), I'd say that you can do this: start storing IDs instead of the posts, prefixed with some kind of "ID prefix flag". If you fetch a newsfeed item, check to see whether it begins with the prefix flag. If it does, it's an ID and you should grab it from the DB. If not, you already have the newsfeed item.
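A minimal sketch of that mixed read path, assuming a Rails app where Post is your ActiveRecord model and "id:" is the prefix flag (both names are just for illustration):

    # Old entries are full JSON blobs; new entries are "id:<post_id>".
    # REDIS is a redis-rb client (Redis.new).
    ID_PREFIX = "id:"

    def feed_items(user_id)
      REDIS.lrange("feed:#{user_id}", 0, 29).map do |entry|
        if entry.start_with?(ID_PREFIX)
          post_id = entry[ID_PREFIX.length..-1]
          # Read-through cache (Memcached via Rails.cache) in front of the DB.
          Rails.cache.fetch("post:#{post_id}") { Post.find(post_id).as_json }
        else
          JSON.parse(entry)  # legacy entry: the item is already inline
        end
      end
    end

Once the legacy entries have aged out, you can batch the DB lookups with a single Post.where(id: ids) instead of per-ID finds.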
OP here. If we decided to store just IDs (and not the content) in Redis, then where would we cache the content? Would we store the content in Redis too, or in Memcached? Or is there another option?
Also, here are some links I collected for a post on the same subject:
http://news.ycombinator.com/item?id=2391623 - How to Build a Fast News Feed in Redis (and Rails)
http://blog.waxman.me/how-to-build-a-fast-news-feed-in-redis
http://www.quora.com/What-are-best-practices-for-building-something-like-a-News-Feed?q=news+feeds
http://www.quora.com/What-are-the-scaling-issues-to-keep-in-mind-while-developing-a-social-network-feed
http://mysqldba.blogspot.com/2011/04/building-facebook-feed-like-system-on.html
Dave: Store it in the cheapest datastore you have access to. You're only storing a blob of data associated with a key at that point, where the key is either a unique ID or some monotonically increasing ID. You can use Redis to perform caching against that database, but it's probably more expensive than using Memcached. Either way, bear in mind that you're only storing one copy of each post, rather than one copy per feed that it ends up in. That should be a lot easier on your datastore.
Also bear in mind that even if you delete the ID after 300 newsfeed items, you can't delete the content object from your database because you don't know if there are any other newsfeeds that reference it. At that point, it might just make more sense to preserve newsfeed items indefinitely. The other options are a.) to run a cron that tries to garbage collect old newsfeed items or b.) to have an expiration on newsfeed IDs and the corresponding objects, so that you just ignore/delete expired IDs in Redis and garbage collect content items from your database that have an expiry in the past.
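A sketch of option (b), with the content keys given a TTL (shown in Redis for simplicity; the same idea works in any store with expiries, and the numbers are made up):

    # REDIS is a redis-rb client (Redis.new).
    CONTENT_TTL = 30 * 24 * 60 * 60  # 30 days, illustrative

    def store_post(post_id, payload)
      # Content expires on its own; no cron or reference counting needed.
      REDIS.setex("post:#{post_id}", CONTENT_TTL, payload)
    end

    def read_feed(user_id)
      ids = REDIS.lrange("feed:#{user_id}", 0, -1)
      ids.map { |id| REDIS.get("post:#{id}") }.compact  # nil = expired, skip it
    end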
The answer will be highly dependent on your use cases, ability, and resource availability.
Consider using Redis for the PK's (as suggested by, well, everyone) and using the simplest possible stores for your data behind a caching layer or two.
For example, MogileFS behind Varnish behind nginx would allow you to put content at /feeds/content/substr(md5(pk), 0, 4)/pk in MogileFS (which is built for easy storage, clustering, and failover) and have clients request it directly. Varnish is an excellent HTTP content accelerator built to handle heavy delivery loads of static content (which this is). nginx would allow you to distribute the caching/read load based on arbitrary rule sets to different sets of caching servers, etc., or even completely rewrite the request as needed when you change your architecture on the backend.
If you have a long tail, consider letting first fetches (counted via a Redis or Memcached incrementing counter) pass through uncached, and once you hit N requests, add a header which tells Varnish to start caching.
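A rough sketch of both pieces, with the counter kept in Redis and the app signaling Varnish via Cache-Control (the threshold and header values are illustrative):

    require 'digest/md5'

    # Shard content paths on the first 4 hex chars of md5(pk).
    def content_path(pk)
      "/feeds/content/#{Digest::MD5.hexdigest(pk.to_s)[0, 4]}/#{pk}"
    end

    CACHE_AFTER_N = 10  # start caching after N fetches; tune this

    # REDIS is a redis-rb client (Redis.new).
    def cache_header_for(pk)
      hits = REDIS.incr("hits:#{pk}")
      hits >= CACHE_AFTER_N ? 'public, max-age=300' : 'no-store'
    end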
You can tune your delivery setup pretty well from there.
Replace MogileFS with S3, etc., at your pleasure.
My first recommendation is to use a graph database like Neo4j.
There are plenty of examples out there.
Here is something worth looking into:
http://www.rene-pickhardt.de/graphity-an-efficient-graph-model-for-retrieving-the-top-k-news-feeds-for-users-in-social-networks/
My second recommendation is MongoDB. You may have to go with two collections: one for news feeds (storing the news item, its owner, and its followers) and another one for just users.
With both MongoDB and Neo4j, scaling out is not a problem; nodes can be added quickly.
As usual, if you choose a NoSQL platform, be careful about data consistency (specifically updates). Test it thoroughly before you roll out to production.
At a certain Q&A interest-graph-based social network I used to call home, we implemented this by maintaining the post IDs in Redis with the data itself stored in SimpleDB or Cassandra.
So essentially letting Redis shine as an index.
Put the news IDs in a Redis list and store the news in Couchbase. Redis is too greedy for memory, and Couchbase doesn't have the memory overhead that Redis does. Also, Couchbase is easy to scale.
Also, you can try mysql-handlersocket if there is a library for the language you are using; it gives you a fast non-SQL interface to InnoDB.
OP here again. OK, so to summarize, it looks like most people are suggesting using Redis as an index by storing only the IDs, and then caching the content elsewhere (I wouldn't want to hit our Postgres DB with all those ID queries, so we'd need all the content cached). The caching options suggested are:
MongoDB
Cassandra
SimpleDB
Couchbase
Memcached
Nginx/Varnish
Neo4j
Ideally, I would like a caching option that is:
1. Simple & fast - In this use case, we would be feeding it a list of 30 IDs (from our Redis store of the user's news feed) and it would need to pull the content for those IDs. The key is that it be really fast. It would be great if it could be in memory and not touch disk (see the sketch after this list).
2. Extendable - the caching option would need to store the content of each ID just once, and that's pretty much all it would store (I'm assuming). In 6 months, we're expecting about 300,000 unique posts per month. If it really takes off it could be 1MM unique posts per month (this is just the original post, not replies/comments, which we don't need to cache because we show them on another page in the app). Does anybody have a rough guess at how much space 1MM status-update-like posts would take to store? Maybe 100MB?
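For requirement 1, what I'm imagining is one round trip for the IDs and one batch fetch for the content, something like this sketch (assuming the content lived in a second Redis store under post:<id> keys; names are made up):

    # REDIS_FEEDS and REDIS_POSTS are redis-rb clients for the two stores.
    def feed_page(user_id)
      ids = REDIS_FEEDS.lrange("feed:#{user_id}", 0, 29)  # 30 newest post IDs
      return [] if ids.empty?
      blobs = REDIS_POSTS.mget(*ids.map { |id| "post:#{id}" })  # one batch fetch
      blobs.compact.map { |blob| JSON.parse(blob) }
    end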
Lastly, we're a bootstrapped startup and don't have a lot of time/energy for server maintenance. So, based on what I shared, does anybody have any more specific suggestions for our case?
I'm a little intimidated by Cassandra (http://blog.flowdock.com/2010/07/26/flowdock-migrated-from-cassandra-to-mongodb/) and not sure about the data file limits of MongoDB. I'd prefer something in memory. I'm wondering if Memcached could be set up to serve our purpose. Or if we could actually just create another Redis store for all the content. Any thoughts?
I don't think it is strictly necessary to use multiple datastores. It sounds like you only have 1GB of memory, which is rather small, honestly. I've had good luck with MongoDB personally, and with proper indexing you won't need a secondary datastore (although I've heard of people pushing secondary indexes into ElasticSearch). However, it sounds like hardware costs are a major concern for you, in which case MongoDB probably isn't a good fit. In that case, I'd recommend investigating Cassandra and Riak, which have both been recommended to me because of their simple scaling models.
We use Solr as an accelerated, denormalized data store for the news feed of the world's largest romantic social network. For us, posts also propagate to a user's social network. Don't bother storing just the IDs, as it will take too long to get the data from somewhere else. Solr is very good at scaling out. Good luck with Redis.
1GB is too small to make sense for Cassandra or Riak; it's actually tiny, really. If you can't afford it, cut the amount of data per user from 300 to 100 or something. You could also try to compress the data before you store it. Otherwise, find a cheaper hosting provider, or start a freemium service where users get less for free and subscribe for more data.
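If you go the compression route, Ruby's built-in Zlib makes it a one-liner each way (a sketch):

    require 'zlib'
    require 'json'

    # Compress the serialized post before storing; decompress on read.
    def pack(post_hash)
      Zlib::Deflate.deflate(JSON.generate(post_hash))
    end

    def unpack(blob)
      JSON.parse(Zlib::Inflate.inflate(blob))
    end

Short status-update payloads often shrink by half or more, at the cost of a little CPU per read and write.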
I don't think your problem is as much a technology problem as it is a business model one. As you note, "we're a bootstrapped startup and don't have a lot of time/energy for server maintenance. So, based on what I shared, does anybody have any more specific suggestions for our case?"
Are you giving away a service you could be selling? Are you in a "race to the bottom" with some better-capitalized competitor? There have to be limits on what non-paying users are allowed to run against your servers.
I think in the long run, some well-tuned Redis code backed up by an industrial-strength RDBMS - PostgreSQL is my favorite, but there are others that will do as well - is where you'll end up. Couchbase, as I understand it, is CouchDB with a Memcached front end. In any event, regardless of the technology / architecture, you *must* control the load you allow to enter your system, and the best way to do this is by pricing it by the value it delivers.
I think using multiple datastores is overkill for what you're wanting.
I did a bit of experimenting with Redis and newsfeeds and found that it doesn't scale if you're pushing newsfeed items out to every subscribing feed - writes were more expensive than reads. So keeping a list of "publishers" to read from per user, and fetching their individual posts ordered by timestamp, worked better. That means 300+ reads per request for the newsfeed, but reads scale much better - using read slaves - than scaling your masters for writes. Note I'm not familiar with Redis sharding, just the master->slave setup.
Also, using a relational database like MySQL or Postgres and structuring your data correctly means you wouldn't be doing a query per newsfeed item for each read request you received - you'd do one query:
User table -> UserNewsItem table <- NewsItem
That would result in one select for reads but you would be writing to the UserNewsItem table for every user subscribed to the publisher. Again, you can flip this around and have 300 reads per request and one write per post.
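To make the trade-off concrete, the two access patterns look roughly like this in Rails terms (model and association names are illustrative):

    # Push model: one UserNewsItem row written per subscriber per post,
    # then reads are a single indexed select.
    def read_feed_push(user_id)
      UserNewsItem.where(user_id: user_id)
                  .order(created_at: :desc).limit(30)
                  .includes(:news_item)
    end

    # Pull model: one NewsItem row written per post, and reads merge
    # the recent posts of everyone the user follows.
    def read_feed_pull(user)
      NewsItem.where(author_id: user.publisher_ids)
              .order(created_at: :desc).limit(30)
    end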
Work out your access patterns and where you'd prefer the resource heavy side of a newsfeed (reads or writes) and I think you'll be in a better place to choose the correct datastore.
In my extensive experience with Redis, Mongo, large data, and stretching a small budget, I would recommend the following:
1. Store the IDs as keys in Redis, as most have suggested.
2. Use MongoDB to store the backend data, using safe mode to ensure the writes are acknowledged. This allows you to scale horizontally using MongoDB's easy sharding. Stay away from RDBMSs when using big data unless you need transactions and have an endless pit of money to throw at scaling. A graph database like Neo4j would also work well, but I would recommend Mongo as you can use it for other things should you need to.
HTH
Nodex
What is your read/write ratio? How many writes/second? How much hardware (how many machines) are you willing to buy?
1 GB doesn't seem like a huge amount of space. Separating index from content seems a good start.
What is your feeling on Amazon vs DIY?
One common pattern to improve performance is to separate fast and slow producers (people and topics): use pull for fast producers and push for slow ones. If you have a "hot" topic, article, etc. with a lot of followers, do not push all updates to all followers; let them query directly.
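In code, the split can be as simple as a follower-count threshold at publish time (a sketch; store_post and fan_out are assumed helpers, and the threshold is something you'd tune):

    FANOUT_THRESHOLD = 1_000  # illustrative cut-off between push and pull

    def publish(post, author)
      store_post(post)  # content is stored once either way
      if author.follower_count <= FANOUT_THRESHOLD
        fan_out(post, author.follower_ids)  # slow producer: push to feeds
      end
      # Hot producers are skipped here; readers merge their recent
      # posts in at read time (pull) instead.
    end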
Thanks everyone for your feedback/help. It's immensely helpful. Just a few more notes about our situation. We're a team of four people (2 product/design people and 2 engineers). Our business model in the past has been a free app + paid app model, but the paid app revenue has been going down and we're in search of a new business model. But we have a lot of free users (i.e., 4-5MM MAU) and we wanted to offer this social experience to them for free and build a marketplace or other business model on top of that. So, anyway, we'd like to keep our hosting costs down to a few thousand a month (we're already at $1,100+/month with Heroku, Redis, SendGrid, etc., and we've released the social features to less than 5% of our users) while we're developing our business model. And we'd like our engineers to focus on front-facing product as much as possible. We like Heroku and its add-ons because it's simple and saves us man-hours.
Another note, I might be grossly underestimating our traffic 6 months out. Maybe we'll have 10MM unique posts per month. But who really knows.
Right now, it looks like the plan would be:
1. Store IDs as keys in Redis
2. Cache the content (perhaps in MongoDB)
For #2, I'm wondering if we could just create a separate Redis DB and store the content in that. This way we'd be querying an in-memory Redis DB with (I'm assuming) quicker read times than MongoDB on disk. Or will read times be similar? Would we save space if we went with MongoDB?
Any feedback on this?
I'd consider moving off Heroku. I know that this would require an upfront effort (that could be done externally instead of using your own engineers), but "a few thousand dollars" and 1GB of memory in the same sentence is definitely a red flag.
Heroku is a great platform, but in order to keep using it without funding, your app needs to start making money pretty quickly, and it seems that this is not your case.
A technical solution will hardly fix this; it may hide the problem for some time, but according to the information you provided, the REAL problem is that your PaaS provider is too expensive for the current revenue of your application. Making more money is not that easy, so changing providers would be the easier way to go.
Hope this helps
Hi Dave,
We've been using mongodb in production for over a year now. If memory utilization is a concern, I recommend you look carefully at these two posts:
http://tech.yipit.com/2012/02/09/how-to-prevent-memory-bloat-in-mongo/
https://groups.google.com/forum/m/?fromgroups#!topic/mongodb-user/iGYXgKsLpcg
We're running it on a cluster of nodes with 196GB of memory. I'd be very hesitant to even consider running it on a node with 1GB of memory.
Also, as far as scalability goes, you can't scale a Mongo cluster one server at a time. If you're utilizing replica sets (which you will want to do) you'll need 2 or 3 servers per replica set (preferably 3). This means every time you add a shard, you need 2 or 3 new nodes. This is a non-starter for a lot of potential use cases, and I would recommend you take it into consideration. Finally, if your budget is as tight as you imply, running multiple DBs is only going to make the situation worse. I would recommend concentrating on a single DB that has most of the features you need; then, if you need secondary indexing to speed things up in the future, and it's in the budget, worry about it then.
Are you running your own Redis instance or using the Redis To Go add-on?
@Phil
Writes can be delayed -- there isn't much penalty to trickling out updates to subscriber feeds "slowly" over a few seconds / minutes.
But reads cannot -- you want consistent read behavior and performance.
Going the many-joins-in-a-RDBMS route without caching in front of it is a bad idea.
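A sketch of the delayed-write side, assuming a background queue like Sidekiq and the ID-only feed entries discussed above (REDIS is a redis-rb client; Post is an assumed model):

    # Enqueue the fan-out so the posting request returns immediately;
    # the worker trickles entries out to follower feeds.
    class FanoutWorker
      include Sidekiq::Worker

      def perform(post_id)
        post = Post.find(post_id)
        post.author.follower_ids.each_slice(500) do |batch|
          batch.each { |uid| REDIS.lpush("feed:#{uid}", "id:#{post_id}") }
        end
      end
    end

    FanoutWorker.perform_async(post.id)  # in the request path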
We've encountered these issues over at fashiolista.com
I've open-sourced most of our work in this area:
https://github.com/tschellenbach/feedly
I would recommend using Redis with a fallback to Postgres.
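The fallback is just a read-through: serve from Redis when the key is warm, and on a miss go to Postgres and repopulate. A sketch, assuming a Post model and a 24-hour TTL (both assumptions):

    # REDIS is a redis-rb client (Redis.new).
    def fetch_post(post_id)
      key = "post:#{post_id}"
      if (cached = REDIS.get(key))
        JSON.parse(cached)  # cache hit: no DB touched
      else
        post = Post.find_by(id: post_id) or return nil
        REDIS.setex(key, 24 * 60 * 60, JSON.generate(post.as_json))
        post.as_json
      end
    end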