Ask HighScalability: How Do I Build My MegaUpload + Itunes + YouTube Startup?

This question was sent in by Val, who is asking for a little help in creating the next big thing. Any ideas?
I'm planning to run my own, first startup website and have been surfing the webs for relevant info to plan the technology I will use for it (the frontend and the backend, including the software and the hardware). The website will be something like a combination of:
- MegaUpload (users will upload their files)
- iTunes (users will be paid for their uploads)
- and YouTube (in the future I'm planning to let users watch/listen to the content online, without downloading).
I don't have any investors yet, nor the budget - I'm still preparing the idea and I'm going to create a first implementation (an "alpha version") before I show it to potential investors. Hence the initial technologies have to be extremely cheap *but* also highly scalable in the future so that I don't have to redo anything when the website grows.
Unfortunately I don't have much experience in running big websites but, on the other hand, I hope my website will grow extremely big (of course).
My questions are:
1. What programming languages should I use? I don't know JavaScript/AJAX or PHP yet. I know C# and I found mojoPortal to be an interesting solution - based on ASP.NET but can run on Linux through Mono. I'm also willing to learn AJAX/PHP if it is necessary.
2. Should I use an existing framework/CMS (that I would only modify) to avoid reinventing the wheel of all the algorithms for user registration, logins, file uploaders etc? Won't this framework/CMS limit me in the future with its performance, limited functionality etc?
3. How should I plan the database? I read MySQL is probably the best for this kind of project but how should I save the data to ensure data safety and scalability? Generate a new database for each user and then make a copy (backup) on a different physical machine? Create one big database for all users and files (won't that be too slow and impossible to split between machines)?
4. The server. I think Linux will be best for the beginning, but what for the future? If I base my software on C#/ASP.NET, should I go for Windows Server in the future?
Advice from anyone experienced in this kind of web development is very much appreciated.
Reader Comments (17)
I have been throwing around the idea of storing and streaming files in Cassandra.
Cassandra happens to be very good at distributed storage of key-value data, like files. A file has a name and content, or filename = key and content = value.
As you scale you will need more data centers in order to serve content from a location closer to the user who is receiving it. Cassandra scales across several data centers and beats systems like MySQL in master/slave setups because your users can insert data into the data center closest to them rather than into a master MySQL server that is thousands of miles away.
That does not mean you cannot use both: MySQL for storing information about the file, like keywords, and Cassandra to take care of the distribution of content.
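A minimal sketch of the split described above, assuming nothing about the real Cassandra or MySQL client APIs: two plain Python dicts stand in for the key-value store and the metadata database. The class name, the chunking, and the tiny chunk size are all hypothetical choices made for illustration; the comment itself only says filename = key and content = value.

```python
import hashlib

CHUNK_SIZE = 4  # deliberately tiny so the example is easy to follow


class ChunkedFileStore:
    """Toy file store: one dict plays the key-value store, one the metadata DB."""

    def __init__(self):
        self.blobs = {}     # key -> bytes (the role the comment gives Cassandra)
        self.metadata = {}  # filename -> info (the role the comment gives MySQL)

    def put(self, filename, content):
        """Split content into chunks, store each under its own key."""
        chunk_keys = []
        for offset in range(0, len(content), CHUNK_SIZE):
            key = f"{filename}:{offset // CHUNK_SIZE}"
            self.blobs[key] = content[offset:offset + CHUNK_SIZE]
            chunk_keys.append(key)
        self.metadata[filename] = {
            "size": len(content),
            "sha256": hashlib.sha256(content).hexdigest(),
            "chunks": chunk_keys,
        }

    def stream(self, filename):
        """Yield the file chunk by chunk, in upload order."""
        for key in self.metadata[filename]["chunks"]:
            yield self.blobs[key]


store = ChunkedFileStore()
store.put("song.mp3", b"0123456789")
restored = b"".join(store.stream("song.mp3"))
```

Storing chunks rather than one large value is an assumption on my part; it keeps individual values small and lets a client start streaming before the whole file has been read.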
I think you're trying to do too much out of the gate. Why build something to support the traffic of YouTube if you have zero visitors? The smartest thing you could do would be to pick up a copy of "The Lean Startup" and start thinking about a minimal viable product you could build. For example, is there a web-service out there that accepts video uploads that you could whitelabel? Cheat, hack, and kludge your alpha together. If the investors like it, you'll have plenty of time to deal with scaling issues once you start getting real customers.
Being a technology guy I would tell this person to stop worrying about the technology. Be a problem solver and produce an alpha product that solves your stated problem as smoothly and elegantly as possible. When and if that portion is a hit with the big idea VCs or consumers directly then you hire people who have done this and you hire them quickly.
And as always... Start cloud, start small, don't over engineer until you need to. Trick is knowing when you need to so I think the better question to ask is how to watch an application to monitor usage, growth, and identify bottlenecks.
I generally agree with the other commenters. You can totally prevent your project from ever being completed by always worrying about the problems you're going to have years from now. Facebook did not start with an architecture that was ready to support all of the users of Facebook; you don't have to do that, either. The best thing I can recommend is to set small, achievable goals that will build upon each other. Be aggressive in your timeline if you like, but be prepared to learn and adjust to how well you actually accomplish them. With each iteration, take the time to review what you learned. Do a little experimentation as you go -- write some proof-of-concept code and then benchmark it.
It doesn't matter as much right now whether you're using C# or Java or Python, or MySql or Cassandra, or Windows or Linux; what matters is that you get started and learn as much as you can. All the problems that you're anticipating will still be there as you get closer to them, but you'll know much more about how you can and want to address them than you could ever know right now, when you are still coding in WhiteBoardLanguage.
The greatest engineered invention on the planet is worthless if nobody wants to use it. This is why there is no pedestrian bridge to Hawaii. Focus FIRST on getting people who want to use your product, THEN worry about making it a great feat of engineering.
Don't worry about things too much. Just build it the way you know how. Test constantly. Monitor everything. Identify weak spots, bottlenecks etc. Plain MySQL can go a long way (replication, sharding etc) with some simple tweaks before you need anything else. Analyse your application data, and use this to research alternative technologies, many of which can be found on this blog. Choose solutions that best fit your problem, skill level and business constraints.
In short: you can spend forever thinking about the best architecture, but your idea will never get built.
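As a rough illustration of "monitor everything" at the smallest possible scale, here is a sketch of a timing decorator that records how long each call takes, so the slowest code paths can be found from data rather than guesswork. The handler `list_uploads` and the helper `slowest` are hypothetical names; a real site would more likely send these numbers to a metrics service.

```python
import time
from collections import defaultdict
from functools import wraps

timings = defaultdict(list)  # function name -> list of durations in seconds


def timed(func):
    """Record the wall-clock duration of every call to func."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            timings[func.__name__].append(time.perf_counter() - start)
    return wrapper


@timed
def list_uploads(user_id):
    # Hypothetical request handler; a real one would query the database.
    return [f"file-{user_id}-{n}" for n in range(3)]


def slowest(n=1):
    """Return the n function names with the highest average duration."""
    averages = {name: sum(d) / len(d) for name, d in timings.items()}
    return sorted(averages, key=averages.get, reverse=True)[:n]


list_uploads(7)
list_uploads(8)
```

After some traffic, `slowest(3)` would name the three handlers most worth looking at first.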
And why not consider mobile platforms? Create a product that millions of people will use on their phones.
I had a Facebook-size idea back in July 2009. I researched for days on how to do it. I spent days just brainstorming names! In the end I decided that to run a giant website I need to know the ins and outs of how it works so I'll build it myself. Here are my answers:
1. Erase mojoPortal from your memory right now. If you're building it yourself, you need to know the back end server language (php/asp/ruby/python/etc), the front end markup (html css), the front end functionality (javascript jquery) and the database (sql). I went with PHP and MySQL for 2 reasons: they're the most popular, and the most commonly used by big startups in their early stages. This means I would have the most learning support, and the confidence that they're capable of doing what I want to do. Now I know that all languages are capable of doing the same job, and if I were starting over now I'd go with Python or Java as the server language because I think they MAY dominate the industry in 5 years, but don't quote me on that.
2. To use a CMS it has to be a CMS designed for the purpose you want to use it for, otherwise you'll end up redesigning the whole thing just to make it work, and in the process you'll have to learn the language anyway and it'll be incredibly difficult. Using a framework is perfect if you know the base language already, otherwise you'll have no idea how to customize it. Pro programmers always say you should know how every line of code works so you can troubleshoot and optimize it. The purpose of the framework is to let you use prewritten scripts without needing to code it all from scratch, and there are a hundred frameworks out there doing things differently from one another. It is easier to learn the server script from scratch THEN benefit from frameworks, rather than learn the framework then try and go backwards to learn the script. I went back to learn the fundamentals of PHP after spending years copy-pasting other people's work to make my site work. I'm learning things now that I absolutely regret not studying 3 years ago!!!! God, the amount of time and frustration I could have saved by starting at the beginning of the PHP.net documentation and going through it back then. Once you've got the basics of PHP down pat, definitely look at how frameworks work. Search frameworks on Google and read a page or two on every one you find to find the one that suits your way of thinking. All frameworks can be modified to not include all the stuff they come with so don't worry about limited performance now. Interesting point: The creator of PHP recommends building your own framework instead of using someone else's. http://toys.lerdorf.com/archives/38-The-no-framework-PHP-MVC-framework.html#extended
3. Right now, do whatever will get your prototype running ASAP. Don't waste any more time on it than necessary. Just get a page and brainstorm every table you'll need and build it quickly. You can always add things later. Don't worry about scalability until you're actually getting high traffic and can afford to hire people to make a new database quickly. In fact, you could change the database quickly yourself. It's just a case of creating a new one with a better structure, then copying the data over. Your database, file hierarchy and website structure will change AS you build the app. You'll think of better ways to do it. Even following the MVC structure people still do it differently inside the MVC folders. There are no set rules, just people's individual preferences. I spent days researching and designing my database structure only to revert back to the simplest method possible because by the time it actually requires optimizing, I'd have income to employ professionals to do it for me.
4. Similar answer to which programming language. They all work. Focus on the Minimum Viable Product to make sure it works before you even worry about these things.
___
So if you wanna go with PHP, go straight to tizag or w3school and go through their guides, then go straight to php.net and start at the beginning and read through it. Most of the reader comments won't make sense so skip through them for now and go back once you've finished the site. Then look through open source CMSes and frameworks and learn their file/folder structure and coding styles to see the standard practices, because otherwise, like me, you'll end up with a giant tub of procedural spaghetti code and end up rewriting the whole thing when you get overwhelmed and can't figure out what you've done.
I've started from scratch about 3 times and rewritten many pages about 5 times, making them more efficient and better structured each time and realising that I should have been patient and learnt PHP from the ground up rather than the top down.
Should take you about 50 hours of reading to be at a point where you know your capabilities and can search any technique to build whatever you want.
Another piece of advice I've learnt the hard way. Forget about looks until you have functionality. Looks distract you BIG TIME! You're in and out of photoshop every week trying to make it look just right and spending hours changing css. In the end, once you've finished the functionality, you'll decide to change the look of the site anyway. Guaranteed! So only worry about looks if you're in the initial planning phase and creating a mockup to see it with your own eyes and know if you'll want to use it once it's done. Then archive that image file, forget about it, and put shaded grey boxes everywhere as your layout. Make sure text and buttons are in rough position and focus entirely on functionality. lol, trust me on this one. You'll waste SO MUCH TIME on individual pixels. Don't get me wrong, once the project is launched, looks can be everything for a certain type of website. I ran a funny picture site called icanhasmotivation and every time I redesigned the appearance to make it look nicer and easier to use, traffic doubled! But the initial customers came regardless of what the site looked like and they're the ones you need in order to know if it's going to work.
Hmmm, another piece of advice. Watch The Social Network and watch the scenes where he's actually building facebook at the beginning. That's what it's all about. Plans and blueprints on paper all over the house, focusing and working hard while everyone else is partying, neglecting your social life in order to build something useful.
I'm a noob compared to most of the people on this site, but if you're a bigger noob than me, you can definitely learn from me to start with. Good luck!
Building your website to scale to multiple machines takes time and adds complexity. Both slow you down from finding out if real people care enough to use the site and keep using it, which is what you need to know to continue investing in it.
Don't worry about it. Programming language, server OS, framework/no framework - they don't matter to the end users. Use whatever you work best in.
As for database design, stick to one database. If your site takes off, you'll be able to scale for quite a while just by moving it to a bigger database server. Later, if you need it, you can split things off.
Keep the video files out of the database. Perhaps you could store them on Amazon S3 + somewhere else for backups.
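A small sketch of what "one database, files kept outside it" might look like. To stay runnable without any accounts or extra libraries, it assumes SQLite in place of MySQL and a temporary local directory in place of Amazon S3; the table layout and the function names `save_upload` and `load_upload` are hypothetical.

```python
import sqlite3
import tempfile
import uuid
from pathlib import Path

# Assumption: a local directory plays the role of the object store (e.g. S3).
storage_dir = Path(tempfile.mkdtemp())

# Assumption: in-memory SQLite plays the role of the single site database.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE uploads ("
    " id INTEGER PRIMARY KEY,"
    " owner TEXT NOT NULL,"
    " filename TEXT NOT NULL,"
    " storage_key TEXT NOT NULL,"
    " size_bytes INTEGER NOT NULL)"
)


def save_upload(owner, filename, content):
    """Write the bytes to file storage and only a small row to the database."""
    storage_key = uuid.uuid4().hex  # random key, independent of the user's filename
    (storage_dir / storage_key).write_bytes(content)
    cur = db.execute(
        "INSERT INTO uploads (owner, filename, storage_key, size_bytes)"
        " VALUES (?, ?, ?, ?)",
        (owner, filename, storage_key, len(content)),
    )
    db.commit()
    return cur.lastrowid


def load_upload(upload_id):
    """Look up the storage key in the database, then read the bytes from storage."""
    row = db.execute(
        "SELECT storage_key FROM uploads WHERE id = ?", (upload_id,)
    ).fetchone()
    return (storage_dir / row[0]).read_bytes()


upload_id = save_upload("val", "clip.mp4", b"fake video bytes")
```

Because the database row holds only a key and a size, the database stays small enough to move to a bigger server later, and the file storage can be swapped out without touching the schema.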
This guy has no experience/skills, no funding and no original idea. Surely the only sane response is "don't quit your day job!"
As far as showing off an idea, use the easiest tools possible and view this effort as throwaway. There is no point in trying to avoid reinventing the wheel because you will have to do it multiple times as you scale up. Circumstances will change, different priorities will emerge so there is no way to predict what you will require. The best you can do is keep it simple and avoid tangents.
I would recommend a lightweight yet robust web framework for quick development of the alpha version. Something like Django or RoR, hosted on Heroku, which makes it really easy to deploy and lets you focus on your actual development. Don't worry about scalability until you actually go live and strong. Technicalities should follow ideas, not the other way around.
Such an application/service with your own branding is already available, http://www.rawvoice.com/services/generator/
Started out in 2005 as a community network (blubrry.com) that aggregated podcast feeds into one location (Podcast Directory). The platform has since grown to include publishing tools to distribute media to both the web and to mobile/TV devices. Some examples of the system in use include www.techpodcasts.com and www.promednetwork.com.
Infrastructure behind the network and publishing platform is built to run both on physical dedicated servers and on Cloud based servers to grow based on both publishing and network web traffic needs.
Thx, guys for the advice! I didn't even expect so much information! Glad, I found this blog so quickly.
I've already started reading The Lean Startup and have installed Webbo+EasyPHP to start learning PHP. Will come back and report when anything important happens. Thx again!
Vaadin + PostgreSQL
One more question - is there a blog or a forum about startups where I could ask my noob questions? I've almost finished reading The Lean Startup but there's a few things I didn't understand and need someone's help clarifying it.
I found Google+ Lean Startup Circle but I'm not sure if it's the right place to post my questions.
Quora.com and onstartups.com for startup questions.
Stackoverflow.com for coding questions.
Don't worry too much about scalability.
Just use a layered approach, separate the application from the backend and tackle things one at a time.
The best thing to do in the beginning is to monitor your website traffic and use tools like blitz.io to run some estimates on latency.
Some key points are:
For a load balancer use keepalived to distribute traffic to your webservers/caches. I use it in production and it's amazing.
Use varnish to cache your content.
Separate SQL reads and writes at the application level. Otherwise it will haunt you later, when you need to scale your database reads and writes and put the database behind load balancers.
Use a key-value store such as memcache to cache queries
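The last two points can be sketched together. This is an illustration under loud assumptions, not a real driver setup: `FakeDB` stands in for a database connection, a plain dict stands in for memcache, and `Router` is a hypothetical name. The idea is only that application code calls `read` or `write` explicitly, so replicas and a cache can be added later without touching the callers.

```python
import itertools


class FakeDB:
    """Stand-in for a database connection; it just records what it is asked."""

    def __init__(self, name):
        self.name = name
        self.log = []

    def execute(self, sql, params=()):
        self.log.append((sql, tuple(params)))
        return f"{self.name} ran: {sql}"


class Router:
    """Send writes to the primary, reads to replicas, and cache read results."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # simple round-robin
        self.cache = {}  # assumption: a dict in place of a memcache client

    def read(self, sql, params=()):
        key = (sql, tuple(params))
        if key not in self.cache:  # cache miss: ask the next replica
            self.cache[key] = next(self._replicas).execute(sql, params)
        return self.cache[key]

    def write(self, sql, params=()):
        self.cache.clear()  # crude invalidation: drop everything on any write
        return self.primary.execute(sql, params)


primary = FakeDB("primary")
replicas = [FakeDB("replica-1"), FakeDB("replica-2")]
router = Router(primary, replicas)

router.write("INSERT INTO uploads (filename) VALUES (?)", ("clip.mp4",))
first = router.read("SELECT filename FROM uploads")
second = router.read("SELECT filename FROM uploads")  # served from the cache
```

Clearing the whole cache on every write is the simplest possible policy and would be too blunt for a busy site; it is here only to keep the example correct. Real replicas also lag behind the primary, which this sketch ignores.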
The only way to learn is to start writing code. YouTube never knew that they would have to scale that much anyway, so start working on your idea first and figure out scalability later.
I get these questions every week from some "innovator" with neither the experience nor the money who thinks he will be the next Facebook, the next MegaUpload, the next YouTube, the next Twitter etc.
I just ignore them. Please don't post these on HighScalability anymore, it's insulting.