Entries by HighScalability Team (1576)

Thursday
Oct 08, 2009

Riak - web-shaped data storage system

Update: Short NYC presentation by Bryan Fink demonstrating the Riak web-shaped data storage engine.

Riak is another new and interesting key-value store entrant. Some of the features it offers are:

  • Document-oriented
  • Scalable, decentralized key-value store
  • Standard get, put, and delete operations.
  • Distributed, fault-tolerant storage solution.
  • Configurable levels of consistency, availability, and partition tolerance
  • Support for Erlang, Ruby, PHP, JavaScript, Java, Python, HTTP
  • Open source and NoSQL
  • Pluggable backends
  • Eventing system
  • Monitoring
  • Inter-cluster replication
  • Links between records that can be traversed.
  • Map/Reduce. Functions are executed on the data node. One interesting difference is that a list of keys is required to specify which values are operated on, as opposed to running calculations on all values.
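
The key-list style of Map/Reduce in the last item can be sketched in a few lines. This is a minimal illustration using a plain in-memory dictionary as a stand-in for the store; it is not the Riak client API, and the names `store` and `map_reduce` are made up for the example.

```python
from functools import reduce

# Hypothetical in-memory stand-in for a key-value store; not the Riak client API.
store = {
    "doc1": {"words": 120},
    "doc2": {"words": 80},
    "doc3": {"words": 300},
}


def map_reduce(store, keys, map_fn, reduce_fn, initial):
    """Apply map_fn to the values of the listed keys only, then fold with reduce_fn."""
    mapped = (map_fn(store[key]) for key in keys if key in store)
    return reduce(reduce_fn, mapped, initial)


# Only doc1 and doc3 are named, so doc2 is never read.
total_words = map_reduce(
    store,
    keys=["doc1", "doc3"],
    map_fn=lambda value: value["words"],
    reduce_fn=lambda acc, n: acc + n,
    initial=0,
)
```

Because the caller names the keys up front, the job touches only those values instead of scanning the whole data set.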

Related Articles

  • Hacker News Thread. More juicy details on how Riak compares to Cassandra, MongoDB, CouchDB, etc.

 

Tuesday
Oct 06, 2009

10 Ways to Take your Site from One to One Million Users by Kevin Rose  

At the Future of Web Apps conference Kevin Rose (Digg, Pownce, Wefollow) gave a cool presentation on the top 10 down and dirty ways you can grow your web app. He took the questions he's most often asked and turned them into a very informative talk.

This isn't the typical kind of scalability we cover on this site. There aren't any infrastructure and operations tips. But the reason we care about scalability is to support users and Kevin has a lot of good techniques to help your user base bloom.

Here's a summary of the 10 ways to grow your consumer web application:

1. Ego. Ask: does this feature increase the user's self-worth or stroke their ego? What emotional and visible awards will a user receive for contributing to your site? Are they gaining reputation or badges, or showcasing what they've done in the community? Sites that have done it well:

Twitter.com followers. Followers turn every single celebrity into a spokesperson for your service. Celebrities continually pimp your service in the hopes of getting more followers. It's an amazing self-reinforcing traffic generator. Why do followers work? Twitter communication is one way. It's simple. Followers don't have to be approved and there aren't complicated permission schemes about who can see what. It means something for people to increase their follower count. It becomes a contest to see who can have more. So even spam followers are valuable to users as they help them win the game.

Digg.com leader boards. Leader boards show the score for a user activity. On Digg it was based on the number of articles submitted. They encourage people to compete and do work inside the Digg ecosystem. Everyone wants to see their name in lights.

Digg.com highlight users. Users who submitted stories were rewarded by having their name in a larger font and a friending icon put beside their story submission. Users liked this.

2. Simplicity. Simplicity is the key. A lot of people overbuild features. Don't. Release something and see what users are going to do. Pick 2-3 things on your site and do them extremely well. Always ask if there's anything you can take out of a feature. Make it lighter, cleaner, and easy to understand and use.

3. Build and Release. Stop thinking you understand your users. You think users will love this or that and you'll probably be wrong. So don't spend 6 months building features users may not love or will only use 20% of. Learn from what users actually do on your site. Avoid analysis paralysis, especially as you get larger. Decide, build, release, get feedback, iterate.

4. Hack the Press. There are techniques you can use that will get you more publicity.

Invite-only system. Get press by creating an invite-only system. Have a limited number of invites and seed them with bloggers. Get the buzz going. Give each user a limited number of invites (4 or 5). It gets bloggers talking about your service. The mainstream press calls and you say you are not ready. This amps the hype cycle. Make new features login-only, accessible only if you log in, but make them visible and marked beta on the site. This increases the number of registered users.

Talk to junior bloggers. On Tech Crunch, for example, find the most junior blogger and pitch them. It's more likely you'll get covered.

Attend parties for events you can't afford.  You can go to the after parties for events you can't afford. Figure out who you want to talk to. Follow their twitter accounts and see where they are going. 

Have a demo in-hand. People won't understand your great vision without a demo. Bring an iPhone or laptop to showcase the demo. Keep the demo short, 30-60 seconds. Say: Hey, I just need 30 seconds of your time, it's really cool, and here's why I think you'll like it. Slant it towards what they do or what they cover.

5. Connect with your community.

Start a podcast. A big driver in the early days of Digg. Influencers will listen and they are the heart of your ecosystem. 

Throw a launch party and yearly and quarterly events. Personally invite influencers and their friends. Just have a party at a bar. Throw them around conferences as people are already there. 

Engage and interact with your community.

Don't visually punish users. Often users don't understand bad behaviour yet, as they think they are just playing the game your system sets up. Walk through the positive behaviours you want to reinforce on the site.

6. Advisors. Have a strong group of advisors. Think about which technical, marketing and other problems you'll have and seek out people to help you. Give them stock compensation. A strong advisory team helps with VCs.

7. Leverage your user base to spread the word.

FarmVille. Tells users when other players have helped them and asks the player to repay the favor. This gets players back into the system by using a social obligation hack. They also require having a certain number of friends before you expand your farm. They give away rare prizes.

Wefollow. Tweets hashtags when people follow someone else. This further publicizes the system. They also ask new users, when they first hit the system, if they want to be added to the directory, telling them that X hundred thousand of their closest friends have already added themselves. This is the number one way they get new users.

8. Provide value for third party sites. The Wall Street Journal, for example, puts FriendFeed, Twitter, etc. links on every page because they think it adds value to their site. Is there some way you can provide value like that?

9. Analyze your traffic. Install Google Analytics. See where people are entering from, where they are going, where they are exiting from, and how you can improve those pages.

10. The entire picture. Step back and look at the entire picture. Look at users who are creating quality content. Quality content drives more traffic to your site. Traffic going out of your site encourages other sites to add buttons to your site, which encourages more users and more traffic into your site. It's a circle of life. Look at how your whole ecosystem is doing.


Friday
Oct 02, 2009

HighScalability has Moved to Squarespace.com! 

You may have noticed something is a little different when visiting HighScalability today: We've Moved! HighScalability.com has switched hosting services to Squarespace.com. Housewarming gifts are completely unnecessary. Thanks for the thought though.

It's been a long long long process. Importing a largish Drupal site to WordPress and then into Squarespace is a bit like dental work without the happy juice, but the results are worth it. While the site is missing a few features, I think it looks nicer, feels faster, and I'm betting it will be more scalable and more reliable. All good things.

I'll explain more about the move later in this post, but there's some administrivia that needs to be handled to make the move complete:

  • If you have a user account and have posted on HighScalability before, your account has been recreated, but since I don't know your password I had to make up a new one for you. So please contact me and I'll give you your password so you can log in and change it. Then you can post again. Sorry for the hassle, but for posts to be assigned to authors on import the user accounts had to exist, so I had to create them. Another issue is that login names in Squarespace are less flexible than under Drupal. The only allowable special character is the '-'. So if your login name contains a space, a '_', or a '.', I changed those characters to a '-'.
  • If you have a user account and have never posted on HighScalability before you'll have to register in order to recreate your user account. Sorry, but with so many users I couldn't recreate all the user accounts by hand.
  • If you could switch RSS over to http://feeds.feedburner.com/HighScalability that would help a lot. The old RSS will still work.
  • A lot of links were broken during the move due to the imperfection of the export/import process. Some of the formatting looks a little strange now too. It's going to take me a while to fix all these problems. If there's anything you see that needs fixing please shoot me an email.
  • There's no tag cloud anymore, but there's an All Posts page that lists every post by category, by week, and by month.
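
For anyone trying to work out their new login name, the renaming rule in the first item above can be sketched as follows. This is only an illustration of the rule as described; the function name `squarespace_login` is made up, and the actual conversion was not necessarily done this way.

```python
import re


def squarespace_login(drupal_login: str) -> str:
    """Replace each space, underscore, or period with a hyphen."""
    return re.sub(r"[ _.]", "-", drupal_login)


new_name = squarespace_login("jane_q.public")
```

Note that two different old names, such as "john_doe" and "john.doe", map to the same new name under this rule.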

This isn't pleasant but there was no way I could make the process transparent. I appreciate your help and understanding.

Why was the move made?

I've played with and considered virtually every CMS available. I went with Squarespace based on weighing a few of my own personal goals and pain points:

Eating my own dog food. I've been a big advocate of cloud based memory grids. Since Squarespace uses a memory grid architecture I felt it would be a good experience to make use of their service (if I could make it work).

End-to-end management. I don't want to have to worry about my site. Ever. I want it to be managed end-to-end by the hosting service. In the industry, when a host says it offers a managed service it usually means the hardware/network/software stack is managed; you are still responsible for site uptime. The problem is a Drupal + LAMP + VPS stack isn't a hands-off affair. Things go wrong and you have to be always on call. That's fine if you have a few people working a site, as they can take turns handling the load. But if you are alone or on vacation, it doesn't work. You are always worrying in the back of your mind that something might be going wrong. By leaving the management of the entire stack to the hosting service, this worry largely goes away, assuming the host is good at their job.

Performance, scalability, reliability. I want the system to feel fast, to handle a lot of users, and to be reliable. For my purposes I don't really expect to have more users than I do now so I'm not looking for infinite headroom. But for the traffic I do have there should be no problems.

Price. A managed VPS with any sort of capability is expensive for a site that doesn't generate a lot of revenue yet gets too many users for shared hosting. A price point between shared hosting and a managed VPS would be very attractive. Some of the end-to-end managed services are enterprise plays and are way too expensive for the little guy.

Support. You are always at the mercy of your host, even with a cloud or colo. Good support you can count on makes all the difference when you are trying to get a site up and running and when disaster hits. Some service providers promise to get back to you within 8 hours. This is the Internet, 8 hours might as well be forever. No thanks. 

So far my experience with Squarespace has been very positive across all my criteria.

They manage the site completely so my end-to-end management requirement is satisfied. A site is managed through a truly innovative browser based GUI that makes template customization and other operations quite straightforward. It will also tell you cool things like how many RSS readers you have and which posts are getting the most traffic.

I am impressed with how robust the system feels and how fast it is even doing large operations. I never feel like I'm going to break it or corrupt it and I'm almost never waiting on it to finish operations. Things just work. There's a lot of quality thought and work that's been put into the system and it shows.

Will it scale? Obviously I haven't tested that out yet, but it seems to handle larger sites so I'm fairly confident.

The price is quite reasonable, but I feel it's enough that they can make money without having to cut corners. It's a good value.

Support is excellent. Questions are answered within a short period of time and they are generally helpful. And I've asked some really stupid questions. When I couldn't set the date using a calendar widget they hardly even laughed. What they did do is make a screencast showing me what I needed to do and I was back in business. 

Of course nothing is perfect. Those imperfections show up in the lack of a few features and in some of what needs to happen to make the transition to the new system complete.

It's clear they've put a lot of work into their back-end and front-end. What is missing is the wide range of modules you'll find for products like Drupal, Joomla, and WordPress. Squarespace offers a small set of widgets, which are good, but they aren't as configurable as those for other products. Part of the problem is that Squarespace doesn't offer an API for their system so third parties can't make widgets. So simple widgets like avatars, tag clouds, today's popular posts, the most popular posts of all time, recent forum posts, read counts, and logged in users are not available.

Other problems are in the process of moving an existing site into Squarespace.

Drupal is not one of Squarespace's supported import platforms. Drats! So I had to write scripts to export Drupal to WordPress in such a way that as much of the metadata as possible was available in Squarespace. This was not easy to do. Squarespace does not have their own defined import format, which would have made life a lot easier.

Another problem is that there are no bulk operations to import users or map URLs. Each user has to be created by hand. If you don't have the users already created when posts are imported then the posts won't be assigned to the correct author.

One of the most important things to do when moving a site is to preserve your old URLs. Every service sucks at this. Squarespace does have a way to map URLs, but again there's no way to bulk import the mappings. You have to do them one by one through their GUI. It's an enormous pain. But it was doable, so that's something at least.

These issues weren't serious enough for me not to go with Squarespace, but a site looking to build a real community may have to look a little closer.

So that's the story. There's a lot of work yet to fix broken links and formatting, but I hope that won't take too long.

Please let me know what you think.

thanks

Todd Hoff

Thursday
Apr 03, 2008

Development of highly scalable web site

Not sure if this is the right place to post this but here goes anyway. We are looking to hire an outside firm to help with development of a scalable and potentially high-traffic web site. We are not looking for an individual but rather a firm with enough well rounded expertise to help us with various aspects of this.

Basic requirements:

  • LAMP stack or other open source solution
  • Very proficient in cross-browser web development
  • Flex/AIR development for RIA
  • Java/C/C++ proficiency
  • Expertise with Comet and push server technology
  • Experience with development of high-traffic web sites
  • Use of Amazon Web Services infrastructure a plus

If anyone knows of consulting firms that can take on such a project, I would appreciate your feedback. TIA


Tuesday
Apr 01, 2008

How to update video views count effectively?

Hi, I am building a video-sharing site and I'm looking for an efficient way to update the video view count. The easiest way would be to perform an SQL update to increase the "views" counter every time a video is viewed, but naturally I want to avoid DB write access as much as possible.

I am looking for an efficient temporary storage to which I could connect and say "increment views of video X". Every so often I would save the changes to my main database and remove the counter from this temporary storage. I am having a hard time finding such temporary storage, however. My first thought was memcache, but it's not ideal as I wouldn't like to lose the data if memcache goes down. Also, memcache's increment command requires that the key is already present. That means that every time a video is viewed, I would have to check if the key already exists in memcache before I can actually send the increment command.

What do people use to solve this kind of issue?

Kind regards, Tomasz
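
The "increment now, save every so often" pattern described in the question can be sketched as below. This is a minimal single-process illustration, not a recommendation of a specific product: the `ViewBuffer` class and the `videos` table are hypothetical, and SQLite stands in for the main database. It does not solve the durability concern raised above, since counts that have not been flushed yet are lost if the process dies.

```python
import sqlite3
import threading
from collections import Counter


class ViewBuffer:
    """Accumulates view increments in memory and writes them out in batches."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = Counter()

    def increment(self, video_id, amount=1):
        # No existence check is needed: a Counter treats missing keys as 0.
        with self._lock:
            self._pending[video_id] += amount

    def flush(self, conn):
        # Swap the buffer out under the lock so new views are not blocked
        # while the database write is in progress.
        with self._lock:
            batch, self._pending = self._pending, Counter()
        with conn:  # one transaction for the whole batch
            conn.executemany(
                "UPDATE videos SET views = views + ? WHERE id = ?",
                [(count, video_id) for video_id, count in batch.items()],
            )
        return len(batch)


# Hypothetical schema and data for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE videos (id INTEGER PRIMARY KEY, views INTEGER NOT NULL)")
conn.executemany("INSERT INTO videos VALUES (?, 0)", [(1,), (2,)])

view_buffer = ViewBuffer()
for video_id in (1, 1, 2, 1):
    view_buffer.increment(video_id)

rows_written = view_buffer.flush(conn)  # would normally run on a timer
views = dict(conn.execute("SELECT id, views FROM videos"))
```

Four views become two UPDATE statements here; with real traffic the saving grows with the number of views per video between flushes.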


Friday
Mar 28, 2008

How to Get DNS Names of a Web Server

For some special reason, I'm trying to make a web server able to get all the DNS names mapped to its IP. Let me explain. I'm creating a website that will run in a web farm, and every web server in the farm will have some subdomains mapped to its IP. Whenever my application starts on a web server, I want it to be able to get all the subdomains mapped/assigned to that server, e.g. sub1.mydomain.com, sub2.mydomain.com.

I understand that I have to use reverse DNS lookup (i.e. give the IP, get the domain name), but I want to get all the subdomains, not just the first one that maps to that IP. I've been reading about DNS on the internet but I can't find any information on how to achieve what I want. Normally you use DNS to get the IP of a domain, and I'm not sure that all servers enable reverse lookup.

The problem is that I'm still not sure whether I'll host my own DNS server or use the services of some company (many companies offer DNS hosting services). So, my questions are:

  • If I host my own DNS server, will it be possible to get all the subdomains using reverse lookup? If I enable reverse lookup on my DNS server, can this have any negative side effects, for example on security? Is there any way I can enable only my web servers to do reverse lookup while preventing anybody else on the internet from using it?
  • If I use the DNS hosting services of some company, will I be able to do what I want, i.e. get the subdomains mapped to the IP address of a web server?

Unfortunately I don't have much experience working with web farms, so I would also like to ask whether every web server in the web farm gets its own static IP, or how does it work? With the firewall and so on, I don't know how IP assignment works in a web farm scenario.

Thanks a million in advance and sorry for my really long post.

Wal
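
One possible way to get every subdomain for a server without depending on reverse lookup is to invert the forward records you already control. The sketch below assumes the application can read the zone's name-to-IP (A) records from somewhere, such as a zone file export or a config store; the record data, the `A_RECORDS` name, and the `names_by_ip` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical forward (A) records, e.g. exported from the zone file or a config store.
A_RECORDS = [
    ("sub1.mydomain.com", "10.0.0.1"),
    ("sub2.mydomain.com", "10.0.0.1"),
    ("sub3.mydomain.com", "10.0.0.2"),
]


def names_by_ip(records):
    """Invert name -> IP records into IP -> sorted list of names."""
    index = defaultdict(list)
    for name, ip in records:
        index[ip].append(name)
    return {ip: sorted(names) for ip, names in index.items()}


# At startup a server would look up its own address (hard-coded here).
subdomains = names_by_ip(A_RECORDS).get("10.0.0.1", [])
```

This works the same whether the DNS is self-hosted or run by a hosting company, as long as the record list can be exported or queried.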


Tuesday
Mar 18, 2008

Shared filesystem on EC2

Hi. I'm looking for a way to share files between EC2 nodes. Currently we are using glusterfs to do this. It has been reliable recently, but in the past it has crashed under high load and we've had trouble starting it up again. We've only been able to restart it by removing the files, restarting the cluster, and filling it up again with our files from backup. This takes ages, and will take even longer the more files we get. What worries me is that it seems to make each node a point of failure for the entire system. One node crashes and soon the entire cluster has crashed. The other problem is adding another node. It seems like you have to take down the whole thing, reconfigure to include the new node, and restart. This kind of defeats the horizontal scaling strategy. We are using 2 EC2 instances as web servers, 1 as a DB master, and 1 as a slave. GlusterFS is installed on the web server machines as well as the DB slave machine (we back up files to S3 from this machine). The files are mostly thumbnails, but also some larger images and media files. Does anyone have a good solution for sharing files between EC2 nodes? I like the ThruDB [http://trac.thrudb.org/] concept of using the local filesystem as a cache for S3, but I'm not sure if ThruDB is mature enough yet. Or maybe some kind of distributed filesystem built on top of git would work? Any ideas? Thanks! ~rvr


Saturday
Mar 08, 2008

DNS-Record TTL on worst case scenarios

I haven't found a good solution for this problem yet. Imagine you're responsible for a small CDN (static images) with two different datacenters. Balancing between the two DCs is done with an anycast name service (a nameserver in every DC; the user gets the nearest location). One scenario is that one of the datacenters goes down completely. You can monitor from the nameserver and only route to the DC that is still alive, no problem. But what about the TTL of the DNS records? Tiny TTLs like 2 minutes are often ignored by some ISPs (e.g. AOL), so the client doesn't get the IP of the other datacenter. What could be a solution in this scenario?


Tuesday
Feb 26, 2008

Architecture to Allow High Availability File Upload

Hi, I was wondering if anyone has found any information on how to architect a system to support high availability file uploads. My scenario: I have an Apache server proxying requests to a bunch of Tomcat Java application servers. When I need to upgrade my site, I stop and upgrade each of the Tomcat servers one at a time. This seems to work well as Apache automatically routes subsequent requests for the stopped app server to the remaining app servers that are up. The problem is that if a user is uploading a file when the app server is stopped, the upload fails and the user has to upload the file again. This is problematic as uploading files is an integral feature of the site and it's frustrating for the users to have to restart their uploads every time I upgrade the site (which I want to be able to do frequently). Has anyone seen any information on how this can be done, or does anyone have ideas on how this can be architected? I imagine sites like Flickr must have a solution to this problem, as I have seen presentations where they say that they are able to upgrade their site several times a day without the users noticing. Thanks! Tuyen


Thursday
Feb 21, 2008

Tracking usage of public resources - throttling accesses per hour

Hi, We have an application that allows the user to define a publicly available resource with an ID. The resource can then be accessed via an HTTP call, passing the ID. While we're not a picture site, thinking of a resource like a picture may help in understanding what is going on. We need to be able to stop access to the resource if it is accessed 'x' times in an hour, regardless of who is requesting it. We see two options:

  • Go to the database for each request to see if the number of times the resource was returned in the last hour is within the limit.
  • Keep a counter in each of the application servers and sync the counters every few minutes or every so many requests to determine if we've passed the limit. The sync point would be the database.

Going to the database (and updating it!) each time we get a request isn't very attractive. We also have a load balanced farm of servers, so we know 'x' is going to have to be a soft limit if we count in the app servers. (We know there will be a period of time between syncing the counts in the app servers where we'll overshoot the limit. That is okay since we'll catch the limit violation and stop the requests.) Other thoughts on how to do this? Thanks, Chris
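
The second option, local counters with a periodic sync, might look roughly like the sketch below. It makes several assumptions that are not in the question: limits are counted per fixed clock hour rather than over a sliding hour, the sync is triggered only by request count (a time-based sync is left out for brevity), and `SharedStore` is a hypothetical in-memory stand-in for the database.

```python
import time


class SharedStore:
    """Hypothetical stand-in for the database: hourly totals per resource."""

    def __init__(self):
        self.totals = {}

    def add_and_get(self, key, delta):
        self.totals[key] = self.totals.get(key, 0) + delta
        return self.totals[key]


class SoftLimiter:
    """Per-app-server counter that syncs with the shared store every few requests."""

    def __init__(self, store, limit, sync_every=10):
        self.store = store
        self.limit = limit
        self.sync_every = sync_every
        self.unsynced = {}  # (resource, hour) -> local hits not yet pushed
        self.known = {}     # (resource, hour) -> last global total seen
        # Entries for past hours are never pruned in this sketch.

    def allow(self, resource_id, now=None):
        hour = int((time.time() if now is None else now) // 3600)
        key = (resource_id, hour)
        if self.known.get(key, 0) + self.unsynced.get(key, 0) >= self.limit:
            return False
        self.unsynced[key] = self.unsynced.get(key, 0) + 1
        if self.unsynced[key] >= self.sync_every:
            # Push local hits and learn the global total in one round trip.
            self.known[key] = self.store.add_and_get(key, self.unsynced.pop(key))
        return True


store = SharedStore()
limiter = SoftLimiter(store, limit=5, sync_every=2)
results = [limiter.allow("resource-42", now=0) for _ in range(7)]
```

With several servers sharing one store, each server can admit requests it has not yet reported, so the total can overshoot the limit by roughly the number of servers times `sync_every`, which matches the soft limit described above.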
