Quick question about efficiently implementing Facebook 'news feed' like functionality
Monday, November 5, 2007 at 5:09PM
I'm sure most are familiar with Facebook's 'news feed'. If not, the 'news feed' basically lists the recent activity of all of your friends.
I don't see how you can get this information efficiently from a DB:
* I'm assuming all user activity is inserted into an "actions" table.
* First, get a list of all your friends.
* Then query the actions table to return recent activity where the activity belongs to someone on your friends list.
This can't be efficient, especially considering some people have 200+ friends.
So what am I missing? How do you think Facebook is implementing their "news feed"?
I'm not asking for any specific details, just a general point in the right direction, as I can't see how they are implementing the 'news feed' efficiently.
Thanks.
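For concreteness, the naive approach described above might look something like this sketch (the schema and table names are invented for illustration, using an in-memory SQLite database):

```python
import sqlite3

# Hypothetical minimal schema for the naive approach: one "actions"
# table, filtered by an IN (...) list built from the friend list.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE friends (user_id INTEGER, friend_id INTEGER);
    CREATE TABLE actions (actor_id INTEGER, activity TEXT, created_at INTEGER);
""")
conn.executemany("INSERT INTO friends VALUES (?, ?)", [(1, 2), (1, 3)])
conn.executemany("INSERT INTO actions VALUES (?, ?, ?)",
                 [(2, "posted a photo", 100),
                  (3, "changed status", 101),
                  (4, "joined a group", 102)])  # user 4 is not a friend

def naive_feed(conn, user_id, limit=20):
    # Step 1: fetch the friend list; step 2: scan the actions table.
    # With 200+ friends this IN-list scan over a huge shared table is
    # exactly the bottleneck described above.
    friend_ids = [row[0] for row in conn.execute(
        "SELECT friend_id FROM friends WHERE user_id = ?", (user_id,))]
    placeholders = ",".join("?" * len(friend_ids))
    return conn.execute(
        f"SELECT actor_id, activity FROM actions "
        f"WHERE actor_id IN ({placeholders}) "
        f"ORDER BY created_at DESC LIMIT ?",
        (*friend_ids, limit)).fetchall()

print(naive_feed(conn, 1))  # [(3, 'changed status'), (2, 'posted a photo')]
```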
Reader Comments (11)
I think you are right, creating a feed via this type of database query wouldn't scale at all. It reminds me of a publish-subscribe system. Every user subscribes to events from their friends, who are event publishers. Publishers have a list of consumers and each consumer has a single event channel. As a user produces an event it is simply pushed on to each subscribed consumer's event channel. So to produce a news feed for a user all you have to query is the user's event channel. No joins or anything excessive.
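The publish-subscribe idea above can be sketched in a few lines (all names are invented for illustration; real channels would live in a cache, not a Python dict):

```python
from collections import defaultdict, deque

CHANNEL_SIZE = 100  # keep only the most recent events per consumer

subscribers = defaultdict(set)  # publisher -> set of consumer ids
channels = defaultdict(lambda: deque(maxlen=CHANNEL_SIZE))

def subscribe(consumer, publisher):
    subscribers[publisher].add(consumer)

def publish(publisher, event):
    # One push per subscriber: writes are O(friends), but building a
    # feed is a single read of one channel -- no joins.
    for consumer in subscribers[publisher]:
        channels[consumer].appendleft((publisher, event))

def news_feed(consumer):
    return list(channels[consumer])

subscribe("alice", "bob")
subscribe("alice", "carol")
publish("bob", "posted a photo")
publish("carol", "changed status")
print(news_feed("alice"))
# [('carol', 'changed status'), ('bob', 'posted a photo')]
```

The bounded deque reflects the trade-off discussed later in the thread: the channel holds only recent events, so it stays small enough to keep in memory.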
The Googletalk Architecture profile (http://highscalability.com/googletalk-architecture) may also have some useful advice.
Thanks for that Todd, I was thinking about that type of solution but I got hung up on the number of inserts.
OK, assume we're using the publish-subscribe system. Doesn't that mean that if I have 200 friends and I create an event, I will then have to do 200 inserts as my event is pushed to each of my individual friends' event channels?
Even with all of those inserts, would it be more scalable than the setup in my first post, which requires only one insert per event, regardless of the number of friends you have?
I would just keep the event channel in cache so the database isn't hit. In the unlikely event the cache goes down, then oh well. There's nothing life or death in a facebook event channel, so it's not a big deal and it's unlikely. If the channel is important you can keep a replicated cache. For the reason you mentioned though I would keep it away from the database, unless the data was really important. In that case you'll probably need to shard. Or maybe use something like Amazon's queue service? It might also be reasonable to put back pressure on the event flow so you don't get overwhelmed.
Todd, thanks, but I think your last post is a bit over my head.
In your last post, are you still talking about the publish-subscribe model?
If so, I don't understand what you mean by "I would just keep the event channel in cache so the database isn't hit." When we're selecting the data, the hit to the DB is minimal using the publish-subscribe system. We run into problems when we're inserting, correct?
So are you saying to insert directly into the cache? Is that possible?
You are right about the event channel not being critical.
> I would just keep the event channel in cache
That just means keep it in memory somewhere. You don't have to put it in the database at all. By keeping it in memory you remove all the inserts.
ok yeah, I see what you mean.
Of course you run into other problems going the cache route, but for this problem I think it's the best implementation. I'm curious to see which data structure is best to use when caching; I'll have to do some more research.
Thanks for the help. I was considering the publish-subscribe model before, but I never considered keeping the event channel in cache. Thanks!
First, you need this information to be persistent. What if the user accesses it after a week?
Second, this is too much information to keep in memory, even for one day. There are billions of user actions.
Third, it is not efficient to publish the information to all of your friends: doing that, you publish the same information multiple times (the lookup, however, is faster).
I had a similar problem where a join was not an option, so we tried the publisher/subscriber method.
We had too many inserts, and we were not able to process them all at a time.
I created a queue where I did a basic insert: Publisher A -> Action.
And I had a parallel process (a few threads) that read from this queue and notified all the subscribers, so I was able to keep my resources in order. Publishers were not able to collapse my system.
So it was "quick" for a publisher to record its action, I was able to notify my subscribers without hammering my servers, and getting new actions was done without any join. But this came at the expense of a small delay in notifying subscribers of a publisher's actions.
In our case this was viable, as most of the time we were able to keep up with the queue, and a little delay was not a real problem for us.
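A rough sketch of this buffered fan-out, assuming in-memory channels as discussed earlier (the queue, worker pool, and all names here are hypothetical):

```python
import queue
import threading
from collections import defaultdict, deque

# Publishers do one cheap enqueue; a small pool of worker threads
# drains the queue and fans each action out to subscriber channels,
# so a burst of publishers can't collapse the system.
actions = queue.Queue()          # Publisher A -> Action
subscribers = defaultdict(set)   # publisher -> set of consumer ids
channels = defaultdict(deque)    # consumer -> recent events
lock = threading.Lock()

def worker():
    while True:
        publisher, event = actions.get()
        if publisher is None:            # shutdown sentinel
            actions.task_done()
            break
        with lock:                       # channels are shared state
            for consumer in subscribers[publisher]:
                channels[consumer].appendleft((publisher, event))
        actions.task_done()

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()

subscribers["bob"].add("alice")
actions.put(("bob", "posted a photo"))   # cheap for the publisher
actions.join()                           # wait for fan-out (the small delay)
print(list(channels["alice"]))           # [('bob', 'posted a photo')]

for _ in threads:                        # one sentinel per worker
    actions.put((None, None))
for t in threads:
    t.join()
```

The delay Alberto mentions is the gap between `actions.put()` returning and the workers finishing the fan-out.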
^^^^
great information, thanks Alberto
I would do it with 3 main components:
- friend post
- waiting list
- request/retrieve feed
When you post something new, it signals all your friends with a pointer to your post, which is stored in each one's waiting list.
When you load the home page, you make a request to retrieve the feed. It processes priority based on the criteria (the algorithm), giving priority to close friends and things like that... then an array of pointers to posts is sent to your feed, then comes the parser, then the generated front-end design...
Though, that's how I would do it ;)
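Those three components might be sketched like this (all names are invented; the closeness scores stand in for whatever ranking "the algorithm" uses):

```python
# Fan out pointers on write; rank and resolve them on read.
friends = {"me": ["ann", "bob", "eve"]}
closeness = {"ann": 0.9, "bob": 0.5, "eve": 0.2}  # assumed ranking input
waiting_list = {"me": []}                         # pointers, not full posts
posts = {}                                        # post_id -> content

def post(author, post_id, content):
    # Store the post once, then signal each friend with a pointer.
    posts[post_id] = content
    for user, flist in friends.items():
        if author in flist:
            waiting_list[user].append((author, post_id))

def retrieve_feed(user, limit=10):
    # Rank pending pointers by closeness, then resolve them to posts.
    pending = sorted(waiting_list[user],
                     key=lambda p: closeness.get(p[0], 0), reverse=True)
    return [posts[pid] for _, pid in pending[:limit]]

post("bob", 1, "bob went hiking")
post("ann", 2, "ann got a new job")
print(retrieve_feed("me"))  # ['ann got a new job', 'bob went hiking']
```

Storing pointers rather than copies of each post keeps the waiting lists small, at the cost of one extra lookup per item at read time.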
Lots of years later, and this is still useful. Thanks for the suggestions, guys.
I have a special newsfeed that requires only two users to see any update, and I have fewer than 200 users. That means that "publisher/subscriber" will work perfectly even when inserting a record for each subscribed user.
I know my issue is not a scaling one, but the server is already under a lot of pressure and I wanted to make it as efficient as possible.