Monday, March 14, 2011

6 Lessons from Dropbox - One Million Files Saved Every 15 minutes

Dropbox saves one million files every 15 minutes, more than even the number of tweets Twitter users send. That mind-blowing statistic was revealed by Rian Hunter, a Dropbox engineer, in his presentation How Dropbox Did It and How Python Helped at PyCon 2011.
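
To put the headline figure in more familiar units, here is a small back-of-the-envelope conversion. It assumes the rate is a steady average across the day, which the talk does not claim; the variable names are only illustrative.

```python
# Convert "one million files every 15 minutes" into per-second and per-day rates.
# Assumption: the rate is constant over the whole day (an average, not a peak).
FILES_PER_WINDOW = 1_000_000      # figure quoted in the talk
WINDOW_SECONDS = 15 * 60          # one 15-minute window

files_per_second = FILES_PER_WINDOW / WINDOW_SECONDS
windows_per_day = (24 * 60 * 60) // WINDOW_SECONDS
files_per_day = FILES_PER_WINDOW * windows_per_day

print(f"{files_per_second:,.0f} files per second")
print(f"{files_per_day:,} files per day")
```

Under that assumption the figure works out to roughly 1,100 file saves per second, or 96 million per day.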

The first part of the presentation is Dropbox lore: origin stories and other foundational myths. We learn that Dropbox is a startup located in San Francisco with probably one of the most popular file synchronization and sharing tools in the world, shipping Python on the desktop, supporting millions of users, and growing every day.

About halfway through, the talk turns technical. Not a lot of information on how Dropbox handles this massive scale was shared, but there were a number of good lessons to ponder:

  1. Use Python
    • 99.9% of their code is in Python. It is used for the server backend, desktop client, website controller logic, API backend, and analytics.
    • Can't use Python on Android due to memory constraints.
    • Runs on a single code base using Python. Dropbox runs on Windows, Mac, and Linux using tools like PyObjC, wxPython, ctypes, py2exe, py2app, and PyWin32.
    • Pros: 
      • Developers talk to each other and express ideas in Python
      • Easy to learn, easy to read, easy to write, easy for new people to pick up.
    • Cons: 
      • Don't be silly. 
      • OK, it can use too much memory and be too slow. Not a big deal on the server side, just buy bigger machines. On the client side you can't get an old PowerPC user to upgrade.
      • Coding in a mixed environment of Python and C creates problems because it's hard to profile across the language boundary, which is exactly what you want to do when fixing memory and CPU problems.
      • Memory fragmentation issues are a reason why scripting languages may not be a good idea for long-running processes.
  2. Just Work, Baby
    • Shouldn't matter what file system you are on, what OS you are using, what applications you are using. The product should always just work.
    • Python helped them iterate fast through all the different error cases they experienced on the wide variety of platforms they support.
  3. Release Early
    • Code something in a day and release it. Python makes that easy.
  4. Use C for Inner Loops - Optimizing CPU is easy
    • A way to handle the too-slow problem.
    • Optimize inner loops to reduce CPU time.
    • About 44% of the running time of a loop in Python is overhead compared with C (2.88s vs 1.61s).
    • Python VM bytecode dispatches are really slow. 
    • Many tools exist for profiling CPU. 
    • CPU optimizations are usually limited to small code sections.
  5. Poll - Polling 30 Million Clients All Over the World Doesn't Scale
    • Created an HTTP notification structure so that clients don't have to poll the server.
  6. Custom Memory Allocator - Optimizing Memory is Hard
    • This was their biggest problem for a while. The client could use huge amounts of memory and the memory would never be freed. For a large sync it could use up to 1.5GB; now it rarely uses more than 100MB.
    • Hard because: 
      • Few tools exist for profiling memory for Python and C
      • Memory bloat has so many causes: leaks in Python and C code; memory fragmentation; inefficient use of memory.
    • Fixing obvious memory inefficiencies didn't help. They thought there was a memory leak, but there wasn't.
    • Problem turned out to be memory fragmentation. Memory fragmentation is what happens when memory blocks of different sizes are continually allocated and freed. Free memory ends up scattered in small pieces, so large contiguous blocks can no longer be allocated. CPython doesn't have a compacting garbage collector, so this scattered free memory couldn't be reused and the heap continually grew so memory requests could be satisfied.
    • Solution was to create a custom allocator. The file meta-data object grows a lot when doing transfers, so the obvious low hanging fruit was to create a custom allocator in C using mmap.
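
Lesson 4 can be illustrated with a small timing sketch. This is not Dropbox's code: it uses the built-in sum(), whose loop runs inside CPython's C implementation, as a stand-in for a hand-written C extension, and the function names are hypothetical. The measured times will differ from the 2.88s and 1.61s quoted in the talk and depend on the machine.

```python
import timeit


def python_loop_sum(values):
    """Sum with an explicit Python-level loop (bytecode dispatch on every step)."""
    total = 0
    for v in values:
        total += v
    return total


def builtin_sum(values):
    """Same result, but the loop itself runs in C inside the built-in sum()."""
    return sum(values)


data = list(range(100_000))

# Both versions must agree before their speed is compared.
assert python_loop_sum(data) == builtin_sum(data)

loop_time = timeit.timeit(lambda: python_loop_sum(data), number=20)
c_time = timeit.timeit(lambda: builtin_sum(data), number=20)
print(f"python loop: {loop_time:.3f}s, builtin sum: {c_time:.3f}s")
```

The point matches the lesson: only the small inner loop moves to C, while the surrounding logic stays in Python.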
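
Lesson 5 does not say how the notification structure works. One common way to avoid polling is for the client to make a single request that the server holds open until something changes. The sketch below imitates that idea inside one process with a threading.Condition; there is no HTTP here, and ChangeNotifier, publish, and wait_for_change are hypothetical names rather than anything from Dropbox.

```python
import threading


class ChangeNotifier:
    """Hypothetical in-process stand-in for a server-side notification endpoint."""

    def __init__(self):
        self._cond = threading.Condition()
        self._version = 0

    def publish(self):
        """Record a change and wake every waiting client."""
        with self._cond:
            self._version += 1
            self._cond.notify_all()

    def wait_for_change(self, last_seen, timeout=None):
        """Block until the version moves past last_seen or the timeout expires.

        Returns the current version either way; the caller compares it with
        last_seen to tell a real change from a timeout.
        """
        with self._cond:
            self._cond.wait_for(lambda: self._version > last_seen, timeout=timeout)
            return self._version


notifier = ChangeNotifier()
seen = []


def client():
    # One blocking wait replaces a loop of repeated "anything new?" requests.
    seen.append(notifier.wait_for_change(last_seen=0, timeout=5.0))


t = threading.Thread(target=client)
t.start()
notifier.publish()  # the "server" announces a change
t.join()
print(seen)
```

The client does no work while it waits, which is the property that matters when tens of millions of clients are connected.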
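
The idea behind lesson 6 can be sketched in Python, although Dropbox's allocator was written in C and its design was not shown. The assumption here is that metadata records have a fixed size, so they can live in slots of one anonymous mmap region: freed slots are reused exactly, and the whole region goes back to the OS when it is unmapped. RecordPool and its record layout are hypothetical.

```python
import mmap
import struct


class RecordPool:
    """Fixed-size record slots carved out of one anonymous mmap region."""

    # Hypothetical metadata record: (file_id, size) as two unsigned 64-bit ints.
    RECORD = struct.Struct("<QQ")

    def __init__(self, capacity):
        self._buf = mmap.mmap(-1, capacity * self.RECORD.size)
        # Stack of free slot indexes; slot 0 is handed out first.
        self._free = list(range(capacity - 1, -1, -1))

    def alloc(self, file_id, size):
        """Store a record in a free slot and return the slot index."""
        if not self._free:
            raise MemoryError("pool exhausted")
        slot = self._free.pop()
        self.RECORD.pack_into(self._buf, slot * self.RECORD.size, file_id, size)
        return slot

    def read(self, slot):
        """Return the (file_id, size) tuple stored in the slot."""
        return self.RECORD.unpack_from(self._buf, slot * self.RECORD.size)

    def free(self, slot):
        """Return the slot to the pool; the next alloc reuses it."""
        self._free.append(slot)

    def close(self):
        """Unmap the region, handing all of its memory back to the OS at once."""
        self._buf.close()


pool = RecordPool(capacity=4)
a = pool.alloc(file_id=1, size=100)
b = pool.alloc(file_id=2, size=200)
pool.free(a)
c = pool.alloc(file_id=3, size=300)  # reuses the slot freed above
print(a == c, pool.read(b), pool.read(c))
pool.close()
```

Because every slot has the same size, a freed slot always fits the next record, so this region cannot fragment the way a general-purpose heap can.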

Future Directions

  • Dropbox on toasters. File sharing on toasters will be really big.
  • They see folders as a unifying metaphor for storing, organizing, and accessing data in the cloud and on any device, anywhere, anytime. 

Reader Comments (13)

They could use Perl for this project or JavaScript. Also scripting languages. Some people are having fun trying to solve the problems they have created themselves.

March 14, 2011 | Unregistered Commenter Vladimir Rodionov

CPython doesn't have a garbage collector, so all this memory simply wasn't able to be allocated and the heap continually grew so memory requests could be satisfied.

This is clearly false. What CPython doesn't have is a compacting garbage collector.

March 14, 2011 | Unregistered Commenter Tuure Laurinolli

Couldn't they just use Java? I mean, it is mature and has a compacting GC. Why reinvent the wheel?

March 17, 2011 | Unregistered Commenter tehehe

Thank you for bringing attention to such an interesting talk.
@tehehe Can you give an example of a usable Java UI? Also, memory consumption by the VM is not in Java's favour, I believe. I would be very interested to see an example of a robust and usable Java application similar to Dropbox.

Vladimir, it is quite unlikely that Perl or JS will fit even the basic requirements:
Perl's readability suffers, while for Python there are ways of enforcing it. JS has a long way to go before it becomes useful as a Python replacement (libraries etc.).

March 21, 2011 | Unregistered Commenter Alex Mikhalev

--Can you give an example of usable Java UI?

ah..Eclipse

March 21, 2011 | Unregistered Commenter blue

--Can you give an example of usable Java UI?

Cyberduck for Mac OS X, a very well done piece of software...

April 14, 2011 | Unregistered Commenter Demetrio

This is not the source place, but maybe a good one to comment anyhow.

Saying DropBox uses Python is not a selling point for me, as a matter of fact maybe it would deter me from using it a little.

I'm a system level programmer and look at the "performance" of things at a very low level when creating any software of significance.
First of all for something that stays resident most of the time on my machine I would want it to be very efficient.
There is just a lot of horribly bad programming out there with little regard for such things.
Apparently, for a lot of programmers, the overall performance and the effects on users' systems are the farthest consideration, as if their program were the only, the single most important software running on the system, and bugger all anything else.
Everything from poorly written system drivers that cause havoc to overly bloated applications whose bulk makes them very inefficient to use.

Regardless of the language used I see some funny irony in these claims.
First of all, let's look at the main thing DropBox will be doing.
Where its performance will count the most is how fast and efficiently it moves files from client to server, then from server back to a client again, and then how it manages notifications and synchronization.
If it does these things close to optimally then it will be a good product.
These are the main things it has to do well.
I would suggest using the lowest level components here, or at least the lowest level layer that gives the best performance. These might be OS issues at the heart of it. It just takes a study of at least the main platforms to see what the main bottlenecks are and find what the best solution is (be it conventional or not).

Then comes the UI. Other than a good UI that allows the user to configure and do whatever they need to accomplish with DropBox (or anything else), the performance here is not all that important.
Although people often judge the performance and quality of a program by its UI.
Here it could be direct Python, Java, a web browser (and the main script options there), whatever.
I would imagine that for DropBox, the performance and overall fitness is going to depend little on the UI.

Okay, I get it: they are talking about issues of portability, maintenance, etc.
But if you go to some convention and say "Hey, look at what my language can do!", making claims of a particular fitness and so on, it begs for comparison and some search for truth.

Now the funny and ironic parts:
First of all, for a con, what is "Don't be silly."? Hardly a technical description. Does this mean there are obvious cons?
Then there are all the other problems, like "Mixing C and Python", memory problems, and even issues with optimization.
Wouldn't it have been easier to write the whole client side in C, with the exception of the UI?
With bare-to-the-metal access a lot of these issues wouldn't exist.
There would perhaps still be a need to optimize, but it would be so much easier because you have control over everything. Take the memory issue, for instance. In C, start with "malloc" or whatever is best for the platform and make your own memory manager.
For any project that requires a lot of memory management, this is more of a necessity than a whim.
With C there are few high level abstractions in the way.
They actually had to dig inside the language itself to find what these problems were, too.
Is it just me that finds this humorously ironic?

And I find this very funny:
"Not a big deal on the server side, just buy bigger machines."
So we should all use the most abstracted platform, even with sloppy coding, etc., and just throw bigger hardware at it to solve the problem?
Assuming that as a company they have control over the servers used everywhere, wouldn't this mostly be a set platform? Wouldn't they all be using the same server back end?
Then why not just use C (or something else low level) and make the most of the machines?

Use Python, C, Java, whatever for the UI, but when it comes to my machine and users' machines, where something is to be resident (always running), then consider and live in the low level world!

July 7, 2011 | Unregistered Commenter Sirmabus

Thanks for a great article !

I would love to see some opensource repository for that custom allocator code. It might help somebody or even make it into Python(3) :-)

Now I feel that this entire discussion is a hijacking attempt by Java fanboys, so let me throw my two cents in.

I think Eclipse is a clear example of a bad UI written in Java. Honestly, most if not all Java desktop applications still suck.
On a daily basis I have to deal with a lot of them and almost everything is terrible. Even the best of the worst, namely JDownloader, is still far less than perfect.

A lot of this has to do with the applications themselves and the people writing them, but a substantial amount has to do with Java UI frameworks having bad interoperability with various desktop environments. On Ubuntu, the most used desktop Linux in the world, Java UI applications often respond very differently to desktop events, sometimes ignoring them, and have various modal-dialog problems and many more issues.

Other applications have their problems too, but in most Java GUIs these are predictable and consistent problems since the underlying libraries are just not doing the job properly.

Also packaging Java applications and maintaining them is just not easy... the over-engineered Java ecosystem makes trivial problems hard and hard problems just as hard as anything else.

This makes me conclude that writing a GUI application in Java for customers (not for developers, like Eclipse, or power users) is not a logical choice for anybody that wants to do serious work. (I would say almost exactly the same thing for the server side. I have to deal with Java web applications as well, and IMHO these are also horrific. The only good reason I ever hear for people choosing them is that they have easy access to a mediocre labor force.)

That is not to say it cannot be done... heck, it could probably even all be done in some form of GW-BASIC, but it makes little business sense. Using native Windows and Mac apps makes much more sense, although that won't be cross-platform... If you want to be cross-platform, the DropBox solution makes perfect sense! Otherwise something in Qt or wx/C++ would also make sense.

October 15, 2011 | Unregistered Commenter trewq

"Can you give an example of usable Java UI?"

TextEdit.app on Mac OS X was originally written in Java, IIRC. I doubt most users noticed that comment in the About box in early versions of Mac OS X, or noticed when it was rewritten in Objective-C. It didn't use AWT/Swing/SWT, though, of course.

This doesn't mean that Java would have been a better solution than Python. The first 3 of these 6 lessons are advantages of Python (or similar languages).

January 27, 2012 | Unregistered Commenter pat

They could use MS .NET technology for this kind of big project. ASP.NET 4.0 with C# is a very powerful IDE for developing such big applications. It is easy to understand and easy to code with WCF. For the desktop app they could use Flex's AIR tool.

March 1, 2013 | Unregistered Commenter Nilesh

Thanks for the write up.

What is "PyObjs" ?

May 22, 2013 | Unregistered Commenter Jonathan Hartley

--Can you give an example of usable Java UI?

Wuala?

July 31, 2013 | Unregistered Commenter Marco

@They could have used Java... They could have used MS .NET, but why Python?
.... If they did, then they would not be Dropbox. :D

July 31, 2013 | Unregistered Commenter Borogonstogodonstoy
