« Product: ChironFS | Main | Strategy: Diagonal Scaling - Don't Forget to Scale Out AND Up »
Monday
Nov052007

Quick question about efficiently implementing Facebook 'news feed' like functionality

Im sure most are familiar with Facebooks 'news feed'. If not, the 'news feed' basically lists recent activity of all of your friends.

I dont see how you can get this information efficiently from a DB:

* Im assuming all user activity is inserted in a "actions" table.
* first get a list of all your friends
* then query the actions table to return recent activity where the activity belongs to someone on your friends list

This can't be efficient especially considering some people have 200+ friends.

So what am I missing? How do you think Facebook is implementing their "news feed".
Im not asking for any specific details, just a general point in the right direction, as I cant see how they are implementing the 'news feed efficiently.
Thanks.

Reader Comments (11)

I think you are right, creating a feed via this type of database query wouldn't scale at all. It reminds me of a publish-subscribe system. Every user subscribes to events from their friends, who are event publishers. Publishers have a list of consumers and each consumer has a single event channel. As a user produces an event it is simply pushed on to each subscribed consumer's event channel. So to produce a news feed for a user all you have to query is the user's event channel. No joins or anything excessive.

The http://highscalability.com/googletalk-architecture">Googletalk Architecture profile may also have some useful advice.

December 31, 1999 | Unregistered CommenterTodd Hoff

Thanks for that Todd, I was thinking about that type solution but I got hung up on the number of inserts.

OK, assume we're using the publisher-subscribe system. Doesnt that mean if I have 200 friends and I create an event, that I will then have to do 200 inserts as my event is pushed to each of my individual friends events channels?

Even with all of those inserts would it be more scalable than the setup in my first post, as the setup in my first post requires only 1 insert for every event, regardless of the number of friends you have.

December 31, 1999 | Unregistered CommenterAnonymous

I would just keep the event channel in cache so the database isn't hit. In the unlikely event the cache goes down, then oh well. There's nothing life or death in a facebook event channel, so it's not a big deal and it's unlikely. If the channel is important you can keep a replicated cache. For the reason you mentioned though I would keep it away from the database, unless the data was really important. In that case you'll probably need to shard. Or maybe use something like Amazon's queue service? It might also be reasonable to put back pressure on the event flow so you don't get overwhelmed.

December 31, 1999 | Unregistered CommenterTodd Hoff

Todd thanks, but I think your last post is a bit over my head.

In your last post are you still talking about the publisher-subscribe model?
If so I don't understand what you mean by I would just keep the event channel in cache so the database isn't hit. . When we're selecting the data the hit to the DB is minimal using the publisher-subscribe system. We run into problems when we're inserting correct?
So are you saying insert directly into the cache, is that possible?

You are right about the event channel not being critical.

December 31, 1999 | Unregistered CommenterAnonymous

> I would just keep the event channel in cache

That just means keep it in memory somewhere. You don't have to put it in the database at all. By keeping it in memory you remove all the inserts.

December 31, 1999 | Unregistered CommenterTodd Hoff

ok yeah, I see what you mean.
Ofcourse you run into other problems going the cache route but for this problem I think its the best implementation. Im curious to see what data structure is best to use when caching, I'll have to do some more research.
Thanks for the help, I was considering the publisher-subscribe model before but I never considered keeping the event channel in cache. Thanks!

December 31, 1999 | Unregistered CommenterAnonymous

First, You need this information persistent. What if the user access it after a week.
Second, this is too much information to keep in memory even for one day. There are billions of user actions.
Third, it is not efficient to publish the information to all of your friend, doing it you publish the same information multiple time (The lookup, however, is faster).

December 31, 1999 | Unregistered Commentermesir

I had a similar problem, where join was not option, we tried publisher / subscriber method.

I had to many inserts, and we were not table to process all at a time.

I created a queue where I did a basic insert, Publisher A -> Action.

And I had a parallel process ( a few threads ) that read from this queue and notified all the subscribers, so I was able to keep my resources in order. Publishers were not able to collapse my system.

So it was "quick" for a publisher to notify its action, I was able to notify my subscribers without hammering my servers, and getting new actions was done without any join. But this was done at the expense of a bit delay in notifying the publishers actions.

In our case this was viable, as most of the time we were able to keep up to date with the list, and a little delay was not a real problem for us.

December 31, 1999 | Unregistered CommenterAlberto

^^^^
great information, thanks Alberto

December 31, 1999 | Unregistered CommenterAnonymous

i would do it in 3 main components:
- friend post
- waiting list
- request/retrieve feed

when you post something new, it will signal all your friends a pointer of your post, which will be stored in each one's waiting list;
when you load the home page, you made a request of retrieving the feed, it will start processing the priority based on the criteria (the algorithm), give priority to close friends and stuff like that... then an array of pointers to posts will be sent to your feed, then comes the parser, then comes the generate front-end design...

though, that's how i would do it ;)

October 3, 2012 | Unregistered CommenterYousef Halabi

Lots of years later, and this is still useful. Thanks for the suggestions, guys.
I have a special newsfeed that requires two users only to see any update, and I have less than 200 users. That means that "publisher / subscriber" will work perfectly even when inserting a record for each subscribed user.

I know my issue is not a scaling one, but the server already has lots of pressure and I wanted to make it as efficient as possible.

October 5, 2017 | Unregistered CommenterIsmaeel

PostPost a New Comment

Enter your information below to add a new comment.
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>