Tuesday
Apr012008
How to update video views count effectively?

Hi,
I am building a video-sharing site and I'm looking for an efficient way to update video views count.
The easiest way would be to perform an SQL update to increase the "views" counter every time a video is viewed, but naturally I want to avoid DB write access as much as possible.
I am looking for an efficient temporary storage to which I could connect and say "increment views of video X". Every so often I would save the changes to my main database, and remove the counter from this temporary storage.
I am having a hard time finding such temporary storage, however. My first thought was memcache, but it's not ideal as I wouldn't like to lose the data if memcache goes down. Also, memcache's increment command requires that the key is already present - that means that every time a video is viewed, I would have to check if the key already exists in memcache, before I can actually send the increment command.
What do people use to solve this kind of issues?
Kind regards,
Tomasz
I am building a video-sharing site and I'm looking for an efficient way to update video views count.
The easiest way would be to perform an SQL update to increase the "views" counter every time a video is viewed, but naturally I want to avoid DB write access as much as possible.
I am looking for an efficient temporary storage to which I could connect and say "increment views of video X". Every so often I would save the changes to my main database, and remove the counter from this temporary storage.
I am having a hard time finding such temporary storage, however. My first thought was memcache, but it's not ideal as I wouldn't like to lose the data if memcache goes down. Also, memcache's increment command requires that the key is already present - that means that every time a video is viewed, I would have to check if the key already exists in memcache, before I can actually send the increment command.
What do people use to solve this kind of issues?
Kind regards,
Tomasz
Reader Comments (12)
I've got a need for something like this too and also thought of Memcache. Would the checking for the key really have a noticeable impact on performance?
It must be something that sites like Youtube have found a good solution for....
If you remove it from temp storage won't you always have to check? The correct count would need to be loaded from the database. Otherwise the view count would restart from 0, which wouldn't seem correct. One option be never to take out of memory. If that's not possible the check would seem to be required.
What I intend to store in cache is not the video's total view count, but the number of new views since the database was last updated. Periodically, when I want to flush changes to persistent storage, I'd do something like:
increment = cache.get_new_views_for_movie(movie_id)
UPDATE movie m SET m.views = m.views + increment WHERE m.id = movie_id
I was worried that having to check for keys would result in weird concurrency issues - one webserver would see the key's not there, so it would set it to 0, and in the mean time,another webserver would also see the key's not present and set it to 0, which would result in a lost view.
In reality, I wouldn't care even if I lost one view occassionally, but I just realised this can me made flawless by using "add" memcache command, instead of "set", and then issuing an "incr".
Regards,
Tomasz
We had a similar problem with our CMS systems.
We offer hosted CMS that runs from a single codebase but connects to a different database for each domain.
We have around 20k databases split between 4 database servers and they did not like the constant beating.
This beating was not coming from images, content, complex forums, etc.
It was coming from the logging.
The application wrote some information into the database everytime someone visited or loaded a page.
This information was written across 4 or 5 tables (one table for user-agent, country, etc, etc)
So for each page request there were around 5 writes to the db.
Now if only 1 visitor on each of these 20 k sites then we have 100K writes to the database immediatelly.
We had cache everywhere except here.
After some beating we found a good solution.
* Use the logs to retrieve the data periodically and then update the database with it*
I dont mean the webserver logs which is serving the page.
We have around 20 webserver front-ends so looking into the logs would be messy.
for example: go to http://www.ispman.net (its one of my website on this cms)
View the source code of this page.
Look for a line like window.ws_stats_url
So basically when you load the page, we try to include a javascript file from an external server
http://217.196.177.134/log?domain=ispman.net¶m1=blah¶m2=ping¶m3=pong
Try this url. Do you what it does? Nothing.
Its an nginx server which always serves one single file which has this text
// thanks
It also *logs* the request.
Now we have some perl scripts that run every 15 minutes. They
* rotate the log
* parse and store the data in the database
* etc, etc.
This was a specific solution for a specific problem.
For your video sites, i can suggest to make this counter application external.
So your page may include a call like
script src=http://someserver/counter?video_id=5
and this will return
window.movieCount=1234
The application will both increment the count and give you the count.
I suggest to use nginx and its embedded perl interpreter to write this simple application.
If you need code, let me know if should not take more than a few lines to write this.
best-regards
Atif
Thanks for the input, this certainly is an interesting approach.
However, while it certainly takes the load off the application, I am worried it will make user experience worse. I'm currently reading High Performance Websites by Steve Souders and according to this book, one of the top techniques for optimizing end user experience is to minimize the number of HTTP requests. If I decided to use your approach, I would be doing exactly the opposite, and I can achieve what I need with memcache (or memcachedb, which is also blazingly fast, at least compared to RDMBS) approach.
Comments?
Regards,
Tomasz
In both cases you are incurring a remote call so the cost is probably similar. Though I would imagine a UDP based memcached call is somewhat faster than a TCP based HTTP call. Memcached also load balances naturally whereas your service would need explicit load balancing to keep latency in check.
A product like Coherence or GigaSpaces using a Write-Behind caching strategy may work for you.
http://wiki.tangosol.com/display/COH32UG/Read-Through%2C+Write-Through%2C+Refresh-Ahead+and+Write-Behind+Caching
Tomasz,
Minimizing http requests while page is rendering is good.
One of the reasons to consider is that a browser can only x amount of connections to the same server.
My approach works (atleast for me) without reducing the page load time for 2 reasons.
1. The request goes to not the same webserver but to a different server (A new connection is used from a different pool and not from the pool that was used for your website). Similar techniques are used when building comet based applications.
2. The request is done after the page has loaded.
have a look at
http://www.ispman.net/bootstrap.js
This is something called lazy loading.
You can do somthing like this in your main document.
window.onload=loadOnloadFiles;
function loadOnloadFiles(){
windown.counterDiv='someDiv';
loadScript(http://mycountserver/blah/whatever.pl?id=123);
}
The script get the count and updates the div called someDiv with the count.
best regards.
Atif
Tomasz
What solution did you come up with in the end?
I think this thread might help too: http://highscalability.com/strategy-break-memcache-dog-pile
I'd seriously look at Redis.
It's fairly similar to memcache, but is persistent, and is pretty quick. Also it has an incr function which would do exactly what you need.
Also you could run read slaves from on master etc plus using a queue like someone suggested above and you've got a fairly nice decoupled system.
I use this code in php quite a bit
design is the same for many languages
function incrementTTL($key, $ttl, $amount=1){
if ($int = $this->memcache->increment($key, $amount)){
return $int;
} else {
if ($this->memcache->add($key, $amount, false, $ttl)){
return $amount;
} else {
return false;
}
}
}
If you get a false you app should be able to resubmit or else you are assuming the memcached is always online and never offline or restarted that is a bad idea. The add will only succeed for one writer the others will get a fail and retry. if you are serious about catching every view then fire off the fail request to a message broker to be updated later when memcached comes back online.