High Scalability -

Clustered Storage System,

Cluster File System,

Product,

Storage Virtualization

Lustre cluster file system

Sunday, July 15, 2007 at 5:25AM

Lustre® is a scalable, secure, robust, highly-available cluster file system. It is designed, developed and maintained by Cluster File Systems, Inc. The central goal is the development of a next-generation cluster file system which can serve clusters with tens of thousands of nodes, provide petabytes of storage, and move hundreds of GB/sec with state-of-the-art security and management infrastructure. Lustre runs on many of the largest Linux clusters in the world, and is included by CFS's partners as a core component of their cluster offerings (examples include HP StorageWorks SFS and the Cray XT3 and XD1 supercomputers). Today's users have also demonstrated that Lustre scales down as well as it scales up: it runs in production on clusters as small as 4 nodes and as large as 25,000 nodes. The latest version of Lustre is always available from Cluster File Systems, Inc. Public Open Source releases of Lustre are available under the GNU General Public License. These releases are found here, and are used in production supercomputing environments worldwide.

Coyote Point Load Balancing Systems

Sunday, July 15, 2007 at 2:26AM

Appliances that:
* Ensure non-stop application availability
* Improve network and server maintainability
* Deliver enterprise-grade gigabit content switching
* Offer true application acceleration
* Provide maximum throughput at minimal cost

What the Web’s most popular sites are running on

Load Balancing,

Product

FeedBurner Architecture

Thursday, July 12, 2007 at 10:34AM

FeedBurner is a news feed management provider launched in 2004. FeedBurner provides custom RSS feeds and management tools to bloggers, podcasters, and other web-based content publishers. Services provided to publishers include traffic analysis and an optional advertising system. Site: http://www.feedburner.com

Information Sources

FeedBurner - Scalable Web Applications using MySQL and Java

Platform

Java

MySQL

Hibernate

Spring

Tomcat

Cacti

Load balancing: NetScaler Application Switches

Routers, switches: HP, Cisco

DNS: bind

The Stats

FeedBurner is growing faster than MySpace and Digg with 385% traffic growth. Total feeds: 808,707, Number of publishers: 471,686.

11 million subscribers in 190 countries

Scaling History
- July 2004: 300 Kbps, 5,600 feeds, 3 app servers, 3 web servers, 2 DB servers, round-robin DNS
- April 2005: 5 Mbps, 47,700 feeds, 6 app servers, 6 web servers (same machines)
- September 2005: 20 Mbps, 109,200 feeds
- Currently: 250 Mbps bandwidth usage, 310 million feed views per day, 100 million hits per day

The Architecture

Scalability Problem 1: Plain old reliability
- A single-server failure was seen by 1/3 of all users.
- Added a health check that goes all the way back to the database. The load balancers monitor it and route requests to live machines on failure.
- Use Cacti and Nagios for monitoring. With these tools you can look at uptime and performance to identify performance problems.

Scalability Problem 2: Stats recording/mgmt
- Every hit is recorded, which slows everything down because of table-level locks.
- Used Doug Lea's concurrency library to do updates in multiple threads.
- Only stats for today are calculated in real time. Other stats are calculated lazily.

Scalability Problem 3: Primary DB overload
- The master DB was used for everything.
- Balanced read and read/write load.
- Found where reads could be separated from read/write work.
- Balanced master vs. slave load.

Scalability Problem 4: Total DB overload
- Everything slowed down. The database was being used as a cache, and the tables used MyISAM.
- Added caching layers: RAM on the machines, memcached, and in the database.

Scalability Problem 5: Lazy initialization
- When stats were rolled up on demand, popular feeds slowed down the whole system.
- Turned to batch processing, doing the rollups once a night.

Scalability Problem 6: Stats writes, again
- Wrote to the master too much. There was more data with each feed, and more stats tracking was added for ads, items, and circulation.
- Use merge tables. Truncate the data from 2 days ago.
- Went to horizontal partitioning: ad serving, flare serving, circulation.
- Moved the hottest tables/queries to their own clusters.

Scalability Problem 7: Master DB failure
- With a primary and a slave there's a single point of failure, because it's hard to promote a slave to a master. Went to a multi-master solution.

Scalability Problem 8: Power failure
- Needed a disaster recovery/secondary site.
- Active/active was not possible: too much hardware, they didn't like having half the hardware going to waste, and it needed a really fast connection between data centers.
- Created a custom solution to download feeds to remote servers.

They have two sites in primary and secondary roles (active-passive) as their geographical redundancy plan. They plan on moving to active-active model in the future.
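The read vs. read/write split described under Scalability Problem 3 can be sketched roughly as follows. This is a minimal illustration, not FeedBurner's actual code (which was Java): the `ReadWriteRouter` class, the host names, and the `needs_fresh_data` flag are all hypothetical, and the sketch assumes the application can tell which reads tolerate replication lag.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the master and spread plain reads across slaves."""

    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE")

    def __init__(self, master, slaves):
        self.master = master
        # Fall back to the master if no slaves are configured.
        self._read_pool = itertools.cycle(slaves or [master])

    def pick(self, sql, needs_fresh_data=False):
        """Return the (hypothetical) host name that should run this statement."""
        verb = sql.lstrip().split(None, 1)[0].upper()
        # Writes, and reads that cannot tolerate replication lag, use the master.
        if verb in self.WRITE_VERBS or needs_fresh_data:
            return self.master
        # Everything else is rotated across the slaves.
        return next(self._read_pool)


router = ReadWriteRouter("db-master", ["db-slave-1", "db-slave-2"])
```

Keeping the decision in one place makes it easier to rebalance master vs. slave load later, for example by marking more queries as lag-tolerant.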

Lessons Learned

Know your DB workload, Cacti really helps with this.

‘EXPLAIN’ all of your queries. Helps keep crushing queries out of the system.

Cache everything that you can.

Profile your code, usually only needed on hard-to-find leaks.

The greatest challenge was finding the most efficient ways to locate hotspots and bottlenecks in the application. With a loose methodology for locating problems, the analysis became very easy. Detailed monitoring was crucial in this, keeping track of disk, CPU and memory usage, slow database queries, handler details in MySQL, etc.
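As a rough illustration of the "EXPLAIN all of your queries" lesson, the sketch below compares a query plan before and after adding an index. It uses Python's built-in `sqlite3` module and SQLite's `EXPLAIN QUERY PLAN` purely as a stand-in so that it runs anywhere; MySQL's `EXPLAIN` output looks different. The `feeds` table and its columns are made up for the example.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the plan detail strings SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return [row[-1] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feeds (id INTEGER PRIMARY KEY, publisher_id INTEGER, title TEXT)"
)
sql = "SELECT title FROM feeds WHERE publisher_id = ?"

# Without an index the plan is a full table scan.
plan_before = query_plan(conn, sql, (42,))

# After adding an index the plan becomes an index search.
conn.execute("CREATE INDEX idx_feeds_publisher ON feeds (publisher_id)")
plan_after = query_plan(conn, sql, (42,))
```

Checking plans like this before a query ships is one way to catch the full scans that become "crushing queries" once tables grow.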

Java,

MySQL

Friendster Architecture

Wednesday, July 11, 2007 at 3:18PM

Friendster is one of the largest social network sites on the web. It emphasizes genuine friendships and the discovery of new people through friends. Site: http://www.friendster.com/

Information Sources

Friendster - Scaling for 1 Billion Queries per day

Platform

MySQL

Perl

PHP

Linux

Apache

What's Inside?

Dual x86-64 AMD Opterons with 8 GB of RAM

Faster disk (SAN)

Optimized indexes

Traditional 3-tier architecture with hardware load balancer in front of the databases

Clusters based on types: ad, app, photo, monitoring, DNS, gallery search DB, profile DB, user info DB, IM status cache, message DB, testimonial DB, friend DB, graph servers, gallery search, object cache.

Lessons Learned

No persistent database connections.

Removed all sorts.

Optimized indexes

Don’t go after the biggest problems first

Optimize without downtime

Split load

Moved sorting query types into the application and added LIMITS.

Reduced ranges

Range on primary key

Benchmark -> Make Change -> Benchmark -> Make Change (Cycle of Improvement)

Stabilize: always have a plan to rollback

Work with a team

Assess: Define the issues

A key design goal for the new system was to move away from maintaining session state toward a stateless architecture that would clean up after each request.

Rather than buy big, centralized boxes, [our philosophy] was about buying a lot of thin, cheap boxes. If one fails, you roll over to another box.
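The lesson about moving sorts into the application and adding LIMITs can be sketched as below. This is an assumption-laden illustration, not Friendster's code (their stack was Perl and PHP): it assumes the database returned a bounded, unsorted batch of rows (for example via a primary key range plus `LIMIT`), and the `newest` helper and row fields are hypothetical.

```python
import heapq


def newest(rows, limit):
    """Return the `limit` newest rows without asking the database to sort."""
    # nlargest keeps only `limit` candidates in memory while scanning the rows.
    return heapq.nlargest(limit, rows, key=lambda row: row["created"])


# Stand-in for an unsorted, bounded result set fetched without ORDER BY.
rows = [
    {"id": 1, "created": 50},
    {"id": 2, "created": 10},
    {"id": 3, "created": 40},
    {"id": 4, "created": 30},
    {"id": 5, "created": 20},
]

top_two = newest(rows, 2)
```

The trade is deliberate: application servers are cheap and easy to add, while sort work on a shared database competes with every other query.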

Example,

Linux,

MySQL,

PHP,

Perl

mixi.jp Architecture

Tuesday, July 10, 2007 at 7:55AM

Mixi is a fast-growing social networking site in Japan. They provide services like diary, community, message, review, and photo album. Having a lot in common with LiveJournal, they also developed many of the same approaches. Their write-up on how they scaled their system is easily one of the best out there. Site: http://mixi.jp

Information Sources

mixi.jp - scaling out with open source

Platform

Linux

Apache

MySQL

Perl

Memcached

Squid

Shard

What's Inside?

They grew to approximately 4 million users in two years and add over 15,000 new users/day.

Ranks 35th on Alexa and 3rd in Japan.

More than 100 MySQL servers

Add more than 10 servers/month

Use non-persistent connections.

Diary traffic is 85% read and 15% write.

Message traffic is 75% read and 25% write.

Ran into replication performance problems so they had to split the database.

Considered splitting vertically by table type or splitting horizontally by user.

They ended up partitioning by table type and user, so all the messages for a group of users are assigned to a particular database. A partitioning key is used to decide in which database data should be stored.

For caching they use memcached with 39 machines x 2 GB memory.

Stores more than 8 TB of images with about 23 GB added per day.

MySQL is only used to store metadata about the images, not the images themselves.

Images are either frequently accessed or rarely accessed.

Frequently accessed images are cached using Squid on multiple machines.

Rarely accessed images are served from the file system. There's no profit in caching them.
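The "partition by table type and user" scheme above can be sketched as a two-step lookup. This is only an illustration under stated assumptions: the `PARTITION_MAP` contents, host names, and the modulo rule are hypothetical, and mixi's real system (written in Perl) may map users to databases differently.

```python
# Hypothetical partition map: table type -> databases holding that type.
PARTITION_MAP = {
    "message": ["msg-db-0", "msg-db-1", "msg-db-2"],
    "diary": ["diary-db-0", "diary-db-1"],
}


def database_for(table_type, user_id):
    """Pick a database using the table type first, then the user id."""
    hosts = PARTITION_MAP[table_type]
    # The user id acts as the partitioning key within one table type.
    return hosts[user_id % len(hosts)]


message_db = database_for("message", 7)
diary_db = database_for("diary", 7)
```

With this layout all of one user's messages live in a single database, so the common "show my messages" query never has to cross hosts.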

Lessons Learned

When using dynamic partitioning it's difficult to pick keys and algorithms for where data should be stored.

Once you partition data you can no longer do joins and you have to open a lot of connections to different databases to merge the data back together.

It's hard to add new hosts and rearrange data when you partition. For example, let's say your partitioning algorithm stores all the messages for users 1-N on host 1. Now let's say host 1 becomes overburdened and you want to repartition users across more hosts. This is very difficult to do.

By using distributed memory caching they rarely hit the DB, and their average page load time is about 0.02 seconds. This reduces the problems associated with partitioning.

You will often have to develop strategies based on the type of content. For example, images will be treated differently than short text posts.

Social networking sites are very time oriented, so it might be useful to partition data by time as well as user and type.
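The difficulty of adding hosts after partitioning can be made concrete with a small experiment. The sketch assumes the simplest possible placement rule, `user_id % host_count`, which is an assumption for illustration and not necessarily what mixi used; the function names are hypothetical.

```python
def host_for(user_id, host_count):
    """Naive placement rule: the user id modulo the number of hosts."""
    return user_id % host_count


def users_that_move(user_ids, old_count, new_count):
    """List users whose data lands on a different host after resizing."""
    return [
        user_id
        for user_id in user_ids
        if host_for(user_id, old_count) != host_for(user_id, new_count)
    ]


# Grow from 4 hosts to 5 and count how many of 10,000 users must be migrated.
moved = users_that_move(range(10_000), old_count=4, new_count=5)
fraction_moved = len(moved) / 10_000
```

Under this rule, adding a single host relocates 8,000 of the 10,000 users (80%), which is why repartitioning a live system is painful and why a lookup table or a placement scheme that moves fewer keys is often considered instead.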

Apache,

MySQL,

Perl,

Shard,

Squid

Webcast: Advanced Database High Availability and Scalability Solutions

Tuesday, July 10, 2007 at 5:35AM

If MySQL, PostgreSQL or EnterpriseDB High-Availability and Scalability issues are on your plate, you'll find this webcast very informative. Highly recommended!

Webcast starts on Thursday, July 12, 2007 at 10:00AM PDT (1:00PM EDT, 18:00GMT). Duration: 50 minutes, plus Q&A.

Advanced Database High-Availability and Scalability Solutions

Program Agenda

Disk Based Replication
• Overview, major features
• Benefits, use cases
• Limitations and challenges

Master/Slave Asynchronous Replication
• Overview, major features
• Benefits, use cases
• Limitations and challenges

Synchronous Multi-Master Cluster: Continuent uni/cluster
• Cluster overview, major features
• Cluster benefits, use cases
• Limitations and challenges

Product Positioning: HA Continuum
• Comparisons
• Key differentiators
• How to pick the right solution

Continuent Professional Services
• HA Quick Assessment Service
• HA JumpStart Implementation Services

Q&A

Presented by:
• Robert Hodges, CTO - Continuent
• Robert Noyes, Director of Sales, Americas - Continuent

Click Here to Register!

Continuent, the High Availability and Scalability Experts! If you are concerned about any of the following…
- Application Availability
- Read Scalability
- Write Scalability
- ZERO data loss requirement
- Disaster Recovery
- Geographically Distributed Operations
… you'll want to talk to us!

Future Event,

Webcast

LiveJournal Architecture

Monday, July 9, 2007 at 2:57AM

A fascinating and detailed story of how LiveJournal evolved their system to scale. LiveJournal was an early player in the free blog service race and faced issues from quickly adding a large number of users. Blog posts come fast and furious which causes a lot of writes and writes are particularly hard to scale. Understanding how LiveJournal faced their scaling problems will help any aspiring website builder. Site: http://www.livejournal.com/

Information Sources

LiveJournal - Behind The Scenes Scaling Storytime

Google Video

Tokyo Video

2005 version

Platform

Linux

MySQL

Perl

Memcached

MogileFS

Apache

What's Inside?

Scaling from 1, 2, and 4 hosts to a cluster of servers.

Avoid single points of failure.

Using MySQL replication only takes you so far.

Becoming IO bound kills scaling.

Spread out writes and reads for more parallelism.

You can't keep adding read slaves and scale.

Shard storage approach, using DRBD, for maximal throughput. Allocate shards based on roles.

Caching to improve performance with memcached. Two-level hashing to distributed RAM.

Perlbal for web load balancing.

MogileFS, a distributed file system, for parallelism.

TheSchwartz and Gearman for distributed job queuing to do more work in parallel.

Solving persistent connection problems.
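The "two-level hashing to distributed RAM" idea behind memcached can be sketched as follows: the client hashes the key once to choose a server, and that server then looks the key up in its own hash table. This is a toy model under stated assumptions, not LiveJournal's implementation: the `DistributedCache` class is hypothetical, each "server" is just an in-process dict, and CRC32 modulo the server count is one simple choice of first-level hash.

```python
import zlib


class DistributedCache:
    """Toy model of a cache spread over several servers."""

    def __init__(self, server_names):
        self.server_names = list(server_names)
        # Second level: each server keeps its own hash table (a dict here).
        self.servers = {name: {} for name in self.server_names}

    def server_for(self, key):
        # First level: hash the key on the client to choose a server.
        index = zlib.crc32(key.encode("utf-8")) % len(self.server_names)
        return self.server_names[index]

    def set(self, key, value):
        self.servers[self.server_for(key)][key] = value

    def get(self, key, default=None):
        return self.servers[self.server_for(key)].get(key, default)


cache = DistributedCache(["cache-1", "cache-2", "cache-3"])
cache.set("user:42:profile", {"name": "example"})
profile = cache.get("user:42:profile")
```

Because every client computes the same first-level hash, no coordination is needed to find a key, and adding machines adds usable RAM to the cache as a whole.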

Lessons Learned

Don't be afraid to write your own software to solve your own problems. LiveJournal has provided incredible value to the community through their efforts.

Sites can evolve from small 1, 2 machine setups to larger systems as they learn about their users and what their system really needs to do.

Parallelization is key to scaling. Remove choke points by caching, load balancing, sharding, clustering file systems, and making use of more disk spindles.

Replication has a cost. You can't just keep adding more and more read slaves and expect to scale.

Low-level issues like which OS event notification mechanism to use, file system and disk interactions, threading and event models, and connection types matter at scale.

Large sites eventually turn to a distributed queuing and scheduling mechanism to distribute large work loads across a grid.

Linux,

MySQL,

Shard

Welcome to High Scalability

Sunday, July 8, 2007 at 4:36PM

We started High Scalability to help you build successful scalable websites. This site tries to bring together all the lore, art, science, practice, and experience of building scalable websites into one place so you can learn how to build your system with confidence. Hopefully this site will move you further and faster along the learning curve of success. Please Start Here.

Start Here

Friday, July 6, 2007 at 6:55AM

This page is here to help you get started using High Scalability. Here are a few useful topics to get you going...

Why does the High Scalability site exist?
How does this site work?
Good things to read.
Participate by reading and posting in the forums.
Participate by adding your own links to interesting sites and articles.
Participate by signing up for the RSS feed.
Consider the many benefits of registering as a user.
How do I get notification of content and comment changes?
Contacting High Scalability.
About.

Why does the High Scalability site exist?

To help you build successful scalable websites. This site tries to bring together all the lore, art, science, practice, and experience of building scalable websites into one place so you can learn how to build your website with confidence. When it becomes clear you must grow your website or die, most people have no idea where to start. It's not a skill you learn in school or pick up from a magazine article on a plane flight home. No, building scalable systems is a body of knowledge slowly built up over time from hard-won experience and many failed battles. Hopefully this site will move you further and faster along the learning curve of success. Makers of popular web sites eventually run into this all-important question: How do I scale? Every builder of successful web sites must answer that question and put their answers into practice. You might wonder:

How do I handle being digged or slashdotted?
What can I accomplish on my budget?
How do I add more and more users?
What software should I use? LAMP, WAMP, or .Net?
Should I use managed or unmanaged systems? Dedicated, co-located, VPS hosting or something else?
Which machine and OS should I use?
How do I recover from a disaster?
How do I measure and improve performance?
Where do I get people to help me?
Which data center should I use?
Which ISP should I use?
How can I structure my software to scale?
How do we setup caching?
What should my database schema look like?
Which database should I use?
Which language and framework should I use?
How do I ensure my data is always available and never lost?
How do I monitor all my software and machines?
How do I train my programmers to build this type of software?
How do I failover my web servers, databases, etc?
How do I expand to multiple geographical locations?
How should I handle session data?
How do I handle support and upgrades and feature rollouts?

You probably have thousands of questions like these. Where do you find the answers? The answers are out there. How to build a scalable website is not a secret; the information is just spread out. And it's still more art than science. Every problem is different. Your site may have specific requirements that make it just different enough that you could use some advice. And that's what this site is all about: bringing like-minded people together to help each other learn everything we can about creating the best websites we can.

How does this site work?

You might be a little bit overwhelmed when you hit the front page of this site for the first time. There's a lot going on. But it's really pretty simple once you learn the secret of what's where and why. The front page is divided into 4 major sections: the middle page content section, the top main menu section, the left hand navigation section, and the right hand interesting stuff section.

First a Word About How Tags are Used on This Site

Most content on the site can be tagged. You can invent your own tags or use existing tags from the glossary page. The tag edit field will provide suggestions based on the first letters you type in. A tag categorizes a chunk of content and determines which lists it shows up in. For example, if you want a weblink you submit to show up in the Real Life Architectures page then you would tag the link with the Example tag. Here's a list of tags and where a tag makes the content show up:
* Example - in the Real Life Architectures page. This page presents case studies of how real websites like eBay and Flickr implement their websites so you can learn from them how to implement your own website.
* Book - in the Useful Books page. The book page presents books that will be helpful in building your site.
* Blog - in the Useful Blogs page. The blog page presents blogs with ongoing useful information you can continue to learn from every day.
* Paper - in the Useful Papers page.
* Product - in the Useful Products page. The product page presents products you might find useful in building your own site. There is an amazing variety of website related products out in the world. We'll try to show you real products used by real people to get real results.
* Strategy - in the Strategy page. These are useful techniques you can directly use to help scale your site.

Use these tags when you add content and it will show in the right place for everyone to see.

Middle Page Content Section

In the middle of the front page you see content from a blog or a weblink post that has been promoted to the front page for everyone to see. Not all content is displayed on the front page, just what's worth everyone seeing. When you look at the middle page section you'll primarily see content submitted via the Submit a Link top menu link. This is like Digg or Reddit in that if you are a registered user you can submit scalability related links for other people to read. The idea is for this to be a community site generating high quality scalability related links. Content isn't just available on the front page; it's also available through the same links we talked about in the tagging section. You can see all weblinks by category in the All Weblinks top menu item. You see the most recent content on the front page, but the content is always available via the menu system as well. So don't worry when you see a lot of technical articles on the front page. These are just scalability related posts you can pay attention to or not.

Top Main Menu Section

The main menu links to some of the more important things a frequent user of the site can do on the site.

Left Hand Navigation Section

On the left hand side of the page you'll see other things you can do on the site. There's not enough room in the top menu, so all your other options are placed on the left.

Right Hand Interesting Stuff Section

The right hand side of the page shows you what's happening currently on the site. You'll see: recent comments people have made in the forum, new and active forum topics, the most popular tags users are using, new links that users have posted, and new articles from scalability related RSS feeds.

Good things to read.

If you are interested in this site then you probably want to build your own monster website. There's no better way to learn than learning from the best Real Life Architectures out there. Real Life Architectures is a continuing series of posts on how real successful websites like eBay, Flickr, MySpace, LiveJournal, and Amazon build their websites. Learn from those who have already done it and add your own personal twist to make it your own. But the learning doesn't stop there! For more helpful ideas on building the next big thing, please visit Useful Books, Useful Products, Useful Blogs, Useful Papers, and Useful Strategies. And if you want to ask questions or help other people with their questions, please take a look at the Forums. If you are looking for a web host then web hosting is a good hosting guide to help you determine what you need for hosting.

Participate by reading and posting in the forums.

Forums are where the main action is. You can get to the forums using the Forums link from the navigation panel on the left or the menu in the upper right hand section of the page. We'll keep the forum structure very simple until it's clear creating new groups will do more good than harm. Nothing is worse than 100 different groups with no posts! We'll just have a General Discussion group and an Interesting Resource group to discuss individual scalability related links found on the web.

Participate by adding your own links to interesting sites and articles.

One of the incredible free-to-the-user rewards for registering as a user of High Scalability is that you can post weblink articles to the front page. A weblink article is a short link to an existing page on the web. If you happen to come across anything interesting in your intertube travels you can share it with the community by posting a quick weblink. Think of it as digg without all the silly high school popularity theatrics. The amount of material on High Scalability topics is vast and ever evolving, so if people share what they find that will help everyone keep up on what's new. To post your own weblinks all you have to do is:
* Register as a user.
* Click the Submit a Link menu item in the upper right hand corner of the page.
* Fill out the weblink form and click on submit. And you're done!
* Remember to use the proper tags for each link you create.
* Insert the <!--break--> comment just after the teaser section you want to show up on the front page, otherwise the whole article will show up on the front page. The break comment tag is Drupal's rather odd way of breaking a page up into its teaser and content sections.

Participate by signing up for the RSS feed.

If you would like to participate in this web site by reading RSS postings then just paste the following URL into your favorite RSS reader: http://highscalability.com/rss.xml.

Consider the many benefits of registering as a user.

OK, to be honest, there aren't that many benefits of registering as a user. We hate sites that make you register before you can do anything useful. We've made it so you can do most everything interesting without registering. But if you do register you can:

Post weblinks to the front page.
Upload a nice avatar of someone who looks nothing like your real self.
Not have to answer those taxing math captcha questions.
Help us brag about how many registered users we have.

That's about it. Hopefully we'll have some nice door prizes later.

How do I get notification of content and comment changes?

Register as a user.
Click on My account in the left hand navigation menu.
Click on My notification settings in the page menu.
Select what you want to get notified about and how you want to get notified.
Click on the Save settings button on the bottom of the page.
Notifications are sent out on a regular basis so you should get changes soon.

Contacting High Scalability.

If you would like to contact a real live person you can email us through this contact form.

About

Some people have asked who I am. Good question. I am still working on that :-) My name, however, is Todd Hoff and my personal website is at http://possibility.com/Tmh/. I have a lot of experience in large scale distributed systems and a long standing interest in the subject. I finally decided since I'm reading this stuff all the time I might as well start a site about it! I hope you find this site useful in your day-to-day work in the trenches.
