High Scalability -

1 Comment |

Permalink |

Print Article

Email Article

Apache,

Example,

Memcached,

MogileFS,

Slides: Building Highly Scalable Web Applications

Perl

Monday

Jul302007

Build an Infinitely Scalable Infrastructure for $100 Using Amazon Services

Monday, July 30, 2007 at 8:11AM

Can you really create an infinitely scalable infrastructure for less than $100 using Amazon's storage, grid, and queuing services platform? It appears so, at least for the right application. Amazon beams a spot light on the future battle of the roll-your-own versus the connect-the-dots approach to building next generation websites using core external services. Their argument is strong. Using Amazon's platform you can quickly build an infrastructure that would otherwise take an eternity to make, a pile of money to create, and an unbounded mass of people to implement and maintain. Yet Amazon doesn't provide SLAs, so you can you really trust them with your crown jewels? Facebook recently leap frogged Amazon's vision with an even more comprehensive set of services. The battle for the future is on. Site: http://aws.amazon.com/

Information Sources

Podcast: Technometria: Amazon Web Services

Amazon Services Home.

Platform

Amazon ECS (E-Commerce Service)

Amazon S3 (simple storage service)

Amazon SQS (simple queuing service)

Amazon EC2 (grid service)

Amazon Web Search Service

Amazon Flexible Payments Service (Amazon FPS)

REST and SOAP Service Interfaces

What's Inside?

Why use external services?

Amazon's services replace the boxes, wires, and disk drives part of the application stack.

Amazon has spent ten years and over $1 billion developing a world-class web service that millions of customers use every day. Maybe you can leverage that experience for your site?

Focus on the customer. 70% if Web Development isn't about providing customer value. It's about building and managing data centers. Your efforts would be better spent on your customers and not plumbing.

Quicker to market. Scaling is hard. Let someone else worry about that while you concentrate on adding user value.

Designing for peak load is expensive. So turn fixed costs into variable costs. Say you want to handle high traffic flows from slashdot or digg, or you have high seasonal demand, having the infrastructure in place to handle those loads is a high fixed cost. You could use that money better elsewhere. It make sense to create an infrastructure where you can automatically and temporarily scale resources to handle peak demand.

High reliability and availability. A dedicated service may be more reliable than a service you could create. It say "may" because Amazon doesn't provide an SLA, so you wont get any guarantees. The idea is that Amazon is cheap enough and reliable enough that the few failures will be acceptable. Besides, SLAs usually just refund some money when things go wrong, they don't really guarantee anything.

It's a cheap CDN. Amazon's storage network could serve a relatively inexpensive content delivery network. This option is discussed in Reducing Your Website's Bandwidth Usage. The idea is that just the frequent downloading of a simple favicon.ico file can use a significant portion of your bandwidth. Using S3 for $2/month to offload 90% of your bandwidth to an external host is a good deal. However, without an SLA S3 can't be thought of as a proper CDN.

Amazon ECS (E-Commerce Service)

This service exposes Amazon's product data and e-commerce functionality: Detailed Product Information on all Amazon.com Products, Access to Product Images, All Customer Reviews associated with a Product, etc.

Amazon products are aggressively priced.

I found this service disappointing. If you want to build a store on top of Amazon it seems great, but I didn't see a way to add your own products to the store, so I don't think it's generally useful.

Amazon S3 (simple storage service)

This service stores data in Amazon's storage network.

$.15 per GPB per month storage

$.01 for 1000 to 10000 requests.

$.10 - $.17 per GB data transfer.

The service is: fast, relaible, scalable, redundant, dispersed.

You can have per object URLs. This means you can reference an image or other file directly with a URL, so it's usable in a web page.

Typical use: CDN and backup storage.

Storage is distributed to multiple locations so you get a level of geographical distribution.

Amazon SQS (simple queuing service)

This service provides an internet scale queuing service for storing messages. Distributed actors put work on the queue and take work off the queue.

$.10 per 1000 messages.

$.10 - $.18 per GB data transfer.

This service is: scalable, elastic, reliable, simple, secure.

Typical use: a centralized work queue. You put jobs on the queue and different actors can pop work of the queue and process them when they get CPU time.

Expected message latency, as of 2007, was 2-10 seconds. This is horrible for many applications, not bad for many others.

Part of scalability. Have any number of producers and consumers. You don't worry about it.

Queues are spread across multiple machines and multiple data centers.

Amazon EC2 (grid service)

This service provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers.

Basically you create a Xen image for your Linux distro and upload it into their "elastic compute cloud." Using an API you can then start as many instances as you like.

Typical use: transcoding, audio work, load testing.

Root level access to the server and full control over the machine.

Can scale up and scale down on a minute-by-minute basis.

For real-time processing one criticism has been slow CPUs (1.75 Ghz Xeon). This probably won't be a problem if your application is written to linearly scale.

An EC2 instance is not persistent so you can't store a database there. You have some local storage, but it goes away when the instance goes away.

Takes a few minutes to start and stop images, so it's not really on demand.

You can add anything you want to an image. If you want a database you can add it in.

GigaVox Media Example Web-Scale Architecture

You can start to see how Amazon's services can work together. Let's say you have a large batch of MP2s you would like transcode to MP3s. You would store the original media into S3, queue the work request into SQS, and have instances running in EC2 to take work of the queue and perform the transcoding, storing the results back into S3. And this is exactly what GigaVox does.

GigaVox is a podcasting company. - They take original recordings and transcode them say from MP2 to MP3. Many other transcodings are also performed. - Then these chunks of media are assembled together into a delivery format based on building a show. For example, old podcasts can be reassembled each night with up to date advertisements. - To do this at scale would take a lot of costly resources.

Using Amazon's services GigaVox gets geographically redundancy and failover for relatively inexpensive CPU, bandwidth, and storage charges, and bandwidth costs. You have no boxes or wires. No data center to manage. And you can grow with small fixed costs.

Messages are time stamped on the queue. If the message waited in the queue for too long then they can start more EC2 images. You can balance costs. You could also layer in a customer based priority mechanism.

They have each instance have its own messaging queue for command and control.

For security reasons they upload files through ftp to instances rather than going through S3.

All bandwidth withing the Amazon cloud is free. This is an important business consideration for making the services work together.

Another set of instances and queues handles assembling the delivered media.

Allows GigaVox to deliver value to their customer at a low startup cost.

Lessons Learned

Build or buy is always a difficult decision. If a service doesn't work then you may lose your customers and there's nothing more you can do other send yet another urgent email to nobody in particular. This is a horrible feeling. Yet, if it does work you could be way ahead of the game. How to choose? That would be telling :-)

Build a layer of virtualization so you can switch to another provider when they become available or so you can replace it with your own service. This lessens your dependency on Amazon in the event they get tired of offering services or their performance deteriorates.

As a startup using Amazon services isn't a big risk because you are already in a risky situation. And any risk is moderated by the very low cost of starting up and money is always an issue for startups.

For many use cases buying your own dedicated servers may still be a better approach as you get more control, lower latency, and the same hardware is usable for multiple purposes.

Software as a service is a powerful and practical idea. It changes how you build software. It forces you to layer your software around interfaces. And once your software is composed of interfaces you have loosely coupled components that can be easily replaced. You also have the basis for a platform API should you ever want to provide an API you your customers. The highest level of development would to use the same API you give your customers to build your service.

Loosely coupled, message based architectures combined with service interfaces allow you to think several levels up the abstraction layer. You don't have to wallow in the muck, which frees you how to structure your application using large scale blocks of behavior.

Designing a UI for an asynchronous interactive interface poses some challenges. It may take a while to perform an operation, so how do you interact with the user to handle that?

Instinctively I doubted Amazon could deliver. But if you have the right type of problem, you really can do a lot of work cheaply using Amazon services.

ThemBid Architecture

Thursday, July 26, 2007 at 5:27AM

ThemBid provides a market where people needing work done broadcast their request and accept bids from people competing for the job. Unlike many of the sites profiled at HighScalability, ThemBid is not in the popular press as often as Paris Hilton. It's not a media darling or a giant of the industry. But what I like is they have a strategy, a point-of-view for building websites and were gracious enough to share very detailed instructions on how to go about building a website. They even delve into actual installation details of the various software packages they use. Anyone can benefit by taking a look at their work. Site: http://www.thembid.com/

Information Sources

Build Scalable Web 2.0 Sites with Ubuntu, Symfony, and Lighttpd

Platform

Linux (Ubuntu)

Symfony

Lighttpd

PHP

eAccelerator

Eclipse

Munin

AWStats

What's Inside?

The Stats

Started work in December of 2006 and had a full demo by March 2007.

One developer/sys admin worked with a part-time graphics designer.

Targeted a few thousand users after launch.

The Architecture

Hardware. Dual core server with 2GB RAM

Storage. 2 x 36SCSI 10K RPM on RAID1.

Data Center. They went with with Layeredtech for the managed server because of past positive experiences.

Development Environment. Ubuntu and Eclipse.

OS. They chose the server distribution of Ubuntu because that's what they use on the client side and Ubuntu supports "simpler installation and easier maintenance than typical IT deployments."

Web Server. Lighttpd is used to handle static content and forward the dynamic PHP page requests to FastCGI.

Database. MySQL. When growth is necessary the idea is to move to a master-slave arrangement and them maybe MySQL cluster.

Web Framework. Went with PHP because they knew it and other successful sites like Digg and Yahoo successfully deploy PHP. They chose Symfony as there framework because of its nice documentation and active development community. And Yahoo also uses Symfony. It's a decision that has worked well for them.

PHP Cache. eAccelerator is used to compile and cache PHP scripts.

Object and Content Cache. The plan is to cache a lot of content. For a bid site like theirs this makes sense. Many of the pieces are used over and over again so putting them in memory will speed up the entire system and take pressure off the database and the IO system. Initially the used a SQLite cache on top of of a memory based file system. This choice was because it was supported by Symfony. When a memcached plugin is available they'll try that.

Client Side Cache. Lighttp's mod_expire module is used to prevent Javascript, style sheets, and images that rarely change from being uncessarily redownloaded by the browser.

Monitoring. Munin is used to monitor their resource usage. It's as simple as visiting "yoursite.com/status" to see what's going on.

Log Analysis. AWStats is used to track hits and types of requests. This information can be used to target bottlenecks.

Scalability Plan. - Use Munin to tell when to think about upgrading. When your growth trend will soon cross your resources trend, it's time to do something. - Move MySQL to a separate server. This frees up resources (CPU, disk, memory). What you want to run on this server depend on its capabilities. Maybe run a memcached server on it. - Move to a distributed memory cache using memcached. - Add a MySQL master/slave configuration. - If more webservers are needed us LVS on the front end as a load balancer.

Future Directions. Work on fault tolerance.

Lessons Learned

It's possible to create a nice site fairly quickly with just a few people using commonly available low cost tools. And your system will be solid and powerful. No cut corners.

Use feedback from your system to know what needs optimizing and when it's time to scale.

Good documentation and an active community draw people. These are very attractive qualities for people making decisions about what to use. It's hard to go with a tool chain when it looks like you may get stuck in the future with no way out and no help. If you make tools make them dead easy to understand, learn, use, and deploy.

Stick with the familiar. It may not be optimal, it may not be the best, but it's more important that you get started and make progress. You don't want to delay releasing your site so you can learn a completely different tool chain that may make your life somewhat easier and in some projected future. The future is now.

Use what works for other people. The fact that Yahoo and Digg use PHP is a good recommendation. Certainly PHP is not the only way to build a site, but it does cut your risk level and help you sleep at night. It also means there's an active community that can help you when you have problems.

2 Comments |

Permalink |

PHP,

Symfony,

lighttpd

Monday

Jul232007

GoogleTalk Architecture

Monday, July 23, 2007 at 8:47AM

Google Talk is Google's instant communications service. Interestingly the IM messages aren't the major architectural challenge, handling user presence indications dominate the design. They also have the challenge of handling small low latency messages and integrating with many other systems. How do they do it? Site: http://www.google.com/talk

Information Sources

GoogleTalk Architecture

Platform

Linux

Java

Google Stack

Shard

What's Inside?

The Stats

Support presence and messages for millions of users.

Handles billions of packets per day in under 100ms.

IM is different than many other applications because the requests are small packets.

Routing and application logic are applied per packet for sender and receiver.

Messages must be delivered in-order.

Architecture extends to new clients and Google services.

Lessons Learned

Measure the right thing. - People ask about how many IMs do you deliver or how many active users. Turns out not to be the right engineering question. - Hard part of IM is how to show correct present to all connected users because growth is non-linear: ConnectedUsers * BuddyListSize * OnlineStateChanges - A linear user grown can mean a very non-linear server growth which requires serving many billions of presence packets per day. - Have a large number friends and presence explodes. The number IMs not that big of deal.

Real Life Load Tests - Lab tests are good, but don't tell you enough. - Did a backend launch before the real product launch. - Simulate presence requests and going on-line and off-line for weeks and months, even if real data is not returned. It works out many of the kinks in network, failover, etc.

Dynamic Resharding - Divide user data or load across shards. - Google Talk backend servers handle traffic for a subset of users. - Make it easy to change the number of shards with zero downtime. - Don't shard across data centers. Try and keep users local. - Servers can bring down servers and backups take over. Then you can bring up new servers and data migrated automatically and clients auto detect and go to new servers.

Add Abstractions to Hide System Complexity - Different systems should have little knowledge of each other, especially when separate groups are working together. - Gmail and Orkut don't know about sharding, load-balancing, or fail-over, data center architecture, or number of servers. Can change at anytime without cascading changes throughout the system. - Abstract these complexities into a set of gateways that are discovered at runtime. - RPC infrastructure should handle rerouting.

Understand Semantics of Lower Level Libraries - Everything is abstracted, but you must still have enough knowledge of how they work to architect your system. - Does your RPC create TCP connections to all or some of your servers? Very different implications. - Does the library performance health checking? This is architectural implications as you can have separate system failing independently. - Which kernel operation should you use? IM requires a lot connections but few have any activity. Use epoll vs poll/select.

Protect Again Operation Problems - Smooth out all spoke in server activity graphs. - What happens when servers restart with an empty cache? - What happens if traffic shifts to a new data center? - Limit cascading problems. Back of from busy servers. Don't accept work when sick. - Isolate in emergencies. Don't infect others with your problems. - Have intelligent retry logic policies abstracted away. Don't sit in hard 1msec retry loops, for example.

Any Scalable System is a Distributed System - Add fault tolerance to every component of the system. Everything fails. - Add ability to profile live servers without impacting server. Allows continual improvement. - Collect metrics from server for monitoring. Log everything about your system so you see patterns in cause and effects. - Log end-to-end so you can reconstruct an entire operation from beginning to end across all machines.

Software Development Strategies - Make sure binaries are both backward and forward compatible so you can have old clients work with new code. - Build an experimentation framework to try new features. - Give engineers access to product machines. Gives end-to-end ownership. This is very different than many companies who have completely separate OP teams in their data centers. Often developers can't touch production machines.

3 Comments |

Permalink |

Java,

What the Web’s most popular sites are running on

Shard

Thursday

Jul122007

FeedBurner Architecture

Thursday, July 12, 2007 at 10:34AM

FeedBurner is a news feed management provider launched in 2004. FeedBurner provides custom RSS feeds and management tools to bloggers, podcasters, and other web-based content publishers. Services provided to publishers include traffic analysis and an optional advertising system. Site: http://www.feedburner.com

Information Sources

FeedBurner - Scalable Web Applications using MySQL and Java

Platform

Java

MySQL

Hibernate

Spring

Tomcat

Cacti

Load balancing: NetScaler Application Switches

Routers, switches: HP, Cisco

DNS: bind

The Stats

FeedBurner is growing faster than MySpace and Digg with 385% traffic growth. Total feeds: 808,707, Number of publishers: 471,686.

11 million subscribers in 190 countries

Scaling History - July 2004: 300Kbps, 5,600 feeds, 3 app servers, 3 web servers 2 DB servers, Round Robin DNS - April 2005: 5Mbps, 47,700 feeds, 6 app servers, 6 web servers (same machines) - September 2005: 20Mbps, 109,200 feeds - Currently: 250 Mbps bandwidth usage, 310 million feed views per day, 100 Million hits per day

The Architecture

Scalability Problem 1: Plain old reliability - Single-server failure, seen by 1/3 of all users - Health Check all the way back to the database that is monitored by load balancers to route requests in to live machines on failure. - Use Cacti and Nagios for monitoring. Using these tools you can look at uptime and performance to identify performance problems.

Scalability Problem 2: Stats recording/mgmt - Every hit is recorded which slows everything down because of table level locks. - Used Doug Lea’s concurrency library to do updates in multiple threads. - Only stats for today are calculated in real-time. Other stats are calculate lazily.

Scalability Problem 3: Primary DB overload - Use master DB for everything. - Balance read and read/write load - Found where we could break up read vs. read/write - Balanced master vs. slave load

Scalability Problem 4: Total DB overload - Everything slowed down, was using the database has cache, used MyISAM - Add caching layers. RAM on the machines, memcached, and in the database

Scalability Problem 5: Lazy initialization - When stats get rolled up on demand popular feeds slowed down the whol system - Turned to batch processing, doing the rollups once a night.

Scalability Problem 6: Stats writes, again - Wrote to the master too much. More data with each feed. Added more stats tracking for ads, items, and circulation. - Use merge tables. Truncate the data from 2 days ago. - Went to horizontal partitioning: ad serving, flare serving, circulation. - Move hottest tables/queries to own clusters.

Scalability Problem 7: Master DB Failure - Using a primary and slave there's a single point of failure because it's hard to promote a slave to a master. Went to a multi master solution.

Scalability Problem 8: Power Failure - Needed a disaster recovery/secondary site. - Active/active not possible. Too much hardware, didn't like having half the hardware going to waste, and needed a really fast connection between data centers. - Create custom solution to download feeds to remote servers.

They have two sites in primary and secondary roles (active-passive) as their geographical redundancy plan. They plan on moving to active-active model in the future.

Lessons Learned

Know your DB workload, Cacti really helps with this.

‘EXPLAIN’ all of your queries. Helps keep crushing queries out of the system.

Cache everything that you can.

Profile your code, usually only needed on hard-to-find leaks.

The greatest challenge was finding the most efficient ways to locate hotspots and bottlenecks in the application. With a loose methodology for locating problems, the analysis became very easy. Detailed monitoring was crucial in this, keeping track of disk, CPU and memory usage, slow database queries, handler details in MySQL, etc.

3 Comments |

Permalink |

Java,

MySQL

Wednesday

Jul112007

Friendster Architecture

Wednesday, July 11, 2007 at 3:18PM

Friendster is one of the largest social network sites on the web. it emphasizes genuine friendships and the discovery of new people through friends. Site: http://www.friendster.com/

Information Sources

Friendster - Scaling for 1 Billion Queries per day

Platform

MySQL

Perl

PHP

Linux

Apache

What's Inside?

Dual x86-64 AMD Opterons with 8 GB of RAM

Faster disk (SAN)

Optimized indexes

Traditional 3-tier architecture with hardware load balancer in front of the databases

Clusters based on types: ad, app, photo, monitoring, DNS, gallery search DB, profile DB, user infor DB, IM status cache, message DB, testimonial DB, friend DB, graph servers, gallery search, object cache.

Lessons Learned

No persistent database connections.

Removed all sorts.

Optimized indexes

Don’t go after the biggest problems first

Optimize without downtime

Split load

Moved sorting query types into the application and added LIMITS.

Reduced ranges

Range on primary key

Benchmark -> Make Change -> Benchmark -> Make Change (Cycle of Improvement)

Stabilize: always have a plan to rollback

Work with a team

Assess: Define the issues

A key design goal for the new system was to move away from maintaining session state toward a stateless architecture that would clean up after each request

Rather than buy big, centralized boxes, [our philosophy] was about buying a lot of thin, cheap boxes. If one fails, you roll over to another box.

3 Comments |

Permalink |

Print Article

Email Article

Example,

PHP,

Perl

Tuesday

Jul102007

mixi.jp Architecture

Tuesday, July 10, 2007 at 7:55AM

Mixi is a fast growing social networking site in Japan. They provide services like: diary, community, message, review, and photo album. Having a lot in common with LiveJournal they also developed many of the same approaches. Their write up on how they scaled their system is easily one of the best out there. Site: http://mixi.jp

Information Sources

mixi.jp - scaling out with open source

Platform

Linux

Apache

MySQL

Perl

Memcached

Squid

Shard

What's Inside?

They grew to approximately 4 million users in two years and add over 15,000 new users/day.

Ranks 35th on Alexa and 3rd in Japan.

More than 100 MySQL servers

Add more than 10 servers/month

Use non-persistent connections.

Diary traffic is 85% read and 15% write.

Message traffic is is 75% read and 25% write.

Ran into replication performance problems so they had to split the database.

Considered splitting vertically by user or splitting horizontally by table type.

The ended up partitioning by table type and user. So all the messages for a group of users would be assigned to a particular database. Partitioning key is used to decide in which database data should be stored.

For caching they use memcached with 39 machines x 2 GB memory.

Stores more than 8 TB of images with about 23 GB added per day.

MySQL is only used to store metadata about the images, not the images themselves.

Images are either frequently accessed or rarely accessed.

Frequently accessed images are cached using Squid on multiple machines.

Rarely accessed images are served from the file system. There's no profit in caching them.

Lessons Learned

When using dynamic partitioning it's difficult to pick keys and algorithms for where data should be stored.

Once you partition data you can no longer do joins and you have to open a lot of connections to different databases to merge the data back together.

It's hard to add new hosts and rearrange data when you partition. For example, let's say your partitioning algorithm stores all the messages for users 1-N on host 1. Now let's say host 1 becomes overburdened and you want to repartition users across more hosts. This is very difficult to do.

By using distributed memory caching they rarely hit the DB and there average page load time is about .02 seconds. This reduces the problems associated with partitioning.

You will often have to develop strategies based on the type of content. For example, image will be treated differently than short text posts.

Social networking sites are very time oriented, so it might be useful to partition data by time as well as user and type.

LiveJournal Architecture

Monday, July 9, 2007 at 2:57AM

A fascinating and detailed story of how LiveJournal evolved their system to scale. LiveJournal was an early player in the free blog service race and faced issues from quickly adding a large number of users. Blog posts come fast and furious which causes a lot of writes and writes are particularly hard to scale. Understanding how LiveJournal faced their scaling problems will help any aspiring website builder. Site: http://www.livejournal.com/

Information Sources

LiveJournal - Behind The Scenes Scaling Storytime

Google Video

Tokyo Video

2005 version

Platform

Linux

MySql

Perl

Memcached

MogileFS

Apache

What's Inside?

Scaling from 1, 2, and 4 hosts to cluster of servers.

Avoid single points of failure.

Using MySQL replication only takes you so far.

Becoming IO bound kills scaling.

Spread out writes and reads for more parallelism.

You can't keep adding read slaves and scale.

Shard storage approach, using DRBD, for maximal throughput. Allocate shards based on roles.

Caching to improve performance with memcached. Two-level hashing to distributed RAM.

Perlbal for web load balancing.

MogileFS, a distributed file system, for parallelism.

TheSchwartz and Gearman for distributed job queuing to do more work in parallel.

Solving persistent connection problems.

Lessons Learned

Don't be afraid to write your own software to solve your own problems. LiveJournal as provided incredible value to the community through their efforts.

Sites can evolve from small 1, 2 machine setups to larger systems as they learn about their users and what their system really needs to do.

Parallelization is key to scaling. Remove choke points by caching, load balancing, sharding, clustering file systems, and making use of more disk spindles.

Replication has a cost. You can't just keep adding more and more read slaves and expect to scale.

Low level issues like which OS event notification mechanism to use, file system and disk interactions, threading and even models, and connection types, matter at scale.

Large sites eventually turn to a distributed queuing and scheduling mechanism to distribute large work loads across a grid.

9 Comments |

Permalink |