Entries in MySQL (70)

Thursday
Oct 30, 2008

Olio Web2.0 Toolkit - Evaluate Web Technologies and Tools

How do you evaluate and decide which web technologies (and there are myriad out there) to use for your new web application? Which one potentially gives you the best performance? Which one will likely give you the shortest time-to-market? The Apache incubator project Olio might help. Olio is an open source web 2.0 toolkit to help evaluate the suitability, functionality and performance of web technologies. Olio defines an example web 2.0 application (an events site somewhat like yahoo.com/upcoming) and provides three initial implementations: PHP, Java EE and Ruby on Rails (RoR). The toolkit also defines ways to drive load against the application in order to measure performance. Apache Olio could be used to

  • Understand how to use various web 2.0 technologies such as AJAX, memcached, MogileFS, etc. Use the code in the application to understand the subtle complexities involved and how to get around issues with these technologies.
  • Evaluate the differences between the three implementations (PHP, Ruby and Java) to understand which might work best for your situation.
  • Within each implementation, evaluate different infrastructure technologies by changing the servers used (e.g., Apache vs. lighttpd, MySQL vs. PostgreSQL, Ruby vs. JRuby, etc.).
  • Drive load against the application to evaluate the performance and scalability of the chosen platform.
  • Experiment with different algorithms (e.g. memcache locking, a different DB access API) by replacing portions of code in the application.
Olio started its life as the web2.0kit developed by Sun Microsystems in collaboration with the U.C. Berkeley RAD Lab and was presented at Velocity 2008.


Wednesday
Sep 10, 2008

Shard servers -- go big or small?

Hello everyone, I'm designing a website/widget that my business partner and I expect to serve millions of hits daily. As such we must shard our database (and we're designing with shards in mind right from the beginning). However, the one thing I haven't been able to figure out from Googling is the best hardware to go with for shards. I'm using exclusively InnoDB tables. We'll (eventually) be running 3 groups of database servers:

a) Session servers for PHP sessions. These will have a very high write volume.
b) ID servers. These will match a couple of primary indices (such as user ID) to a given shard. These will have an intense read load, plus a moderate amount of writes.
c) Shard servers. These will hold the bulk of the data. These will have a high read load and a lowish write load.

Group A is done as a database instead of using memcached so users aren't logged out if a memcached server goes down. As the write load is high, a pair of high performance master-master servers seems obvious. What's the ideal hardware setup for machines with this role? Maxed RAM and fast disks seem reasonable. Should I bother with RAID > 0 if I have a live backup on the other master? I hear 4 cores is optimal for InnoDB -- recommendations?

Group B. Again, it looks like maxed RAM is recommended here. What about disks? Should I go for 10K or will regular SATA2 drives be okay? RAID 0, 5, 10? Cores? Should I think about slaves to a master-master setup?

Group C. It seems to me these machines can be of any capacity because the data they hold is easily spread between shards. What is the query-per-second per dollar sweet spot when it comes to cores and number of disks? Should I beef these machines up, or stick with low end hardware? Should I still max the RAM?

I have some other thoughts on system setup, too. As the data stored in the PHP sessions won't change frequently (it'll likely remain static for a user's entire visit -- all variable data can be stored in Group C shard servers), I'm thinking of using a memcached setup in front of the database and only pushing writes through to the database when necessary. Your thoughts?

We're also starting this on a minimal budget (of course), so where in the above is it best spent? Keep in mind that I can recycle machines used in Groups A & B in Group C as time goes on. Anyway, I'd love to hear from the expertise of the forum. I've been reading for a long time, and I'll be writing as our project evolves :) --Mark
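
To make the "ID server" idea above concrete, here is a minimal sketch (not from Mark's post; the table, column, and host names are made up) of how a directory lookup from user ID to shard might look in Python with the PyMySQL driver:

    # Hypothetical "ID server" lookup: a small directory table maps each user
    # to the shard that holds their data. All names here are illustrative.
    import pymysql

    SHARD_DSNS = {
        1: {"host": "shard1.internal", "db": "app_shard_1"},
        2: {"host": "shard2.internal", "db": "app_shard_2"},
    }

    def shard_for_user(id_db, user_id):
        """Ask the ID (directory) database which shard a user lives on."""
        with id_db.cursor() as cur:
            cur.execute("SELECT shard_id FROM user_shard_map WHERE user_id = %s",
                        (user_id,))
            row = cur.fetchone()
        if row is None:
            raise KeyError("unknown user %r" % user_id)
        return SHARD_DSNS[row[0]]

    # Usage: id_db = pymysql.connect(host="id1.internal", user="app",
    #                                password="...", database="id_map")
    #        dsn = shard_for_user(id_db, 42)

Because this mapping is read constantly, it is exactly the kind of lookup the memcached entries below suggest caching.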


Friday
Sep 5, 2008

Product: Tungsten Replicator

With Tungsten Replicator, Continuent is trying to deliver a better master/slave replication system. Their goals: scalability, reliability with seamless failover, and no performance loss. From their website: The Tungsten Replicator implements open source, database-neutral master/slave replication. Master/slave replication is a highly flexible technology that can solve a wide variety of problems including the following:
  • Availability - failing over to a slave database if your master database dies.
  • Performance Scaling - spreading reads across many copies of data.
  • Cross-Site Clustering - maintaining active database replicas across WANs.
  • Change Data Capture - extracting changes to load data warehouses or update other systems.
  • Zero Downtime Upgrade - performing upgrades on a slave server which then becomes the master.
The Tungsten Replicator architecture is flexible and designed to make it easy to add support for new databases. It includes pluggable extractor and applier modules to help transfer data from master to slave. The Replicator also includes a number of specialized features that improve its usefulness for particular problems like availability:
  • Replicated changes have transaction IDs and are stored in a transaction history log that is identical for each server. This feature allows masters and slaves to exchange roles easily.
  • Smooth procedures for planned and unplanned failover.
  • Built-in consistency check tables and events allow users to check consistency between tables without stopping replication or applications.
  • Support for statement as well as row replication.
  • Hooks to allow data transformations when replicating between different database types.
Tungsten Replicator is not a toy. It is designed to allow commercial construction of robust database clusters.
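
The pluggable extractor/applier design is easier to picture with a toy sketch. This is not Tungsten's actual API, just an illustration of events carrying a global transaction ID, flowing through a shared history log, and fanning out to appliers:

    # Illustrative only: a minimal extractor -> history log -> applier pipeline.
    from dataclasses import dataclass, field
    from typing import Callable, Iterable, List

    @dataclass
    class ReplicationEvent:
        seqno: int        # global transaction ID, identical on every server
        statement: str    # the change to apply (statement- or row-based)

    @dataclass
    class TransactionHistoryLog:
        events: List[ReplicationEvent] = field(default_factory=list)

        def append(self, event: ReplicationEvent) -> None:
            self.events.append(event)

        def since(self, seqno: int) -> Iterable[ReplicationEvent]:
            return (e for e in self.events if e.seqno > seqno)

    def replicate(extractor: Iterable[ReplicationEvent],
                  log: TransactionHistoryLog,
                  appliers: List[Callable[[ReplicationEvent], None]]) -> None:
        for event in extractor:        # extractor plugin reads the master
            log.append(event)          # an identical log lets servers swap roles
            for apply_event in appliers:   # applier plugins push changes to slaves
                apply_event(event)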

Related Articles

  • Tungsten ScaleOut Stack - an open source collection of integrated projects for database scale-out making use of commodity hardware.
  • Continuent Intros Tungsten Replicator by Shamila Janakiraman.


    Sunday
    Aug 17, 2008

    Strategy: Drop Memcached, Add More MySQL Servers

    Update 2: Michael Galpin in Cache Money and Cache Discussions likes memcached for its expiry policy, complex graph data, and process data, but says MySQL has many advantages: SQL, Uniform Data Access, Write-through, Read-through, Replication, Management, Cold starts, LRU eviction. Update: Dormando asks Should you use memcached? Should you just shard mysql more? The idea of caching is the most important part, as it transports you beyond a simple CRUD worldview. Plan for caching and sharding by properly abstracting data access methods. Brace for change. Be ready to shard, be ready to cache. React and change to what you push out that actually turns out to be popular, versus over-planning and wasting valuable time. Feedster's François Schiettecatte wonders if Fotolog's 21 memcached servers wouldn't be better used to further shard data by adding more MySQL servers. He mentions Feedster was able to drop memcached once they partitioned their data across more servers. The algorithm: partition until all data resides in memory, and then you may not need an additional memcached layer. Parvesh Garg goes a step further and asks why people think they should be using MySQL at all?

    Related Articles

  • The Death of Read Replication by Brian Aker. Caching layers have replaced read replication. Cache can't fix a broken database layer. Partition the data that feeds the cache tier: "Keep your front end working through the cache. Keep all of your data generation behind it."
  • Read replication with MySQL by François Schiettecatte. Read replication is dead and it should be used only for backup purposes. Take the memory used for caching and give it to your database servers.
  • Replication++, Replication 2.0, Replication.Next by Ronald Bradford. What should read replication be used for?
  • Replication, caching, and partitioning by Greg Linden. Caching is overdone because it adds complexity, adds latency on a cache miss, and uses cluster resources inefficiently. Hitting disk is the problem. Shard more and get your data in memory.


    Monday
    Aug 4, 2008

    A Bunch of Great Strategies for Using Memcached and MySQL Better Together

    The first recommendation for speeding up a website is almost always to add cache and more cache. And after that add a little more cache just in case. Memcached is almost always given as the recommended cache to use. What we don't often hear is how to effectively use a cache in our own products. MySQL hosted two excellent webinars (referenced below) on the subject of how to deploy and use memcached. The star of the show, other than MySQL of course, is Farhan Mashraqi of Fotolog. You may recall we did an earlier article on Fotolog in Secrets to Fotolog's Scaling Success, which was one of my personal favorites. Fotolog, as they themselves point out, is probably the largest site nobody has ever heard of, pulling in more page views than even Flickr. Fotolog has 51 instances of memcached on 21 servers with 175G in use and 254G available. As a large successful photo-blogging site they have very demanding performance and scaling requirements. To meet those requirements they've developed a sophisticated approach to using memcached that others can learn from and emulate. We'll cover some of the most notable strategies from the webinars below the fold.

    What is Memcached?

    The first part of the first webinar gives a good overview of memcached. Just in case the rock you've been hiding under recently disintegrated, you may not have heard about memcached (as if). Memcached is: a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load. Memcached essentially creates an in-memory shard on top of a pool of servers from which an application can easily get and set up to 1MB of unstructured data. Memcached is two hash tables, one from the client to the server and another one inside the server. The magic is that none of the memcached servers need know about each other. To scale up you just add more servers and the key hashing algorithm makes it all work out right. Memcached is not redundant, has no failover, and has no authentication. It's a simple server for storing and getting data; the complex bits must be implemented by applications. The rest of the first webinar is Farhan explaining in wonderful detail how they use memcached at Fotolog. The beginning of the second webinar covers much of the same ground as the first. In the last half of the second webinar Farhan gives a number of excellent code examples showing how memcached is installed, used, and managed. If you've never used memcached before there's a lot of good stuff presented.
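
    As a quick illustration of that get/set interface, here is a small sketch using the third-party pymemcache library (any memcached client would do); the client hashes keys across whatever server list you give it:

        # Sketch only: two cache hosts; scaling up means adding to this list.
        from pymemcache.client.hash import HashClient

        cache = HashClient([("cache1.internal", 11211),
                            ("cache2.internal", 11211)])

        cache.set("user:1234:profile", b"serialized profile blob", expire=300)
        value = cache.get("user:1234:profile")   # None on a cache miss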

    Memcached and MySQL Go Better Together

    There's a little embrace and extend in the webinar as MySQL cluster is presented several times as doing much the same job as memcached, but more reliably. However, the recommended approach for using memcached and MySQL is:
  • Write scale the database by sharding. Partition data across multiple servers so more data can be written in parallel. This avoids a single server becoming the bottleneck.
  • Front MySQL with a memcached farm to scale reads. Applications access memcached first for data and if the data is not in memcached then the application tries the database. This removes a great deal of the load on a database so it can continue to perform its transactional duties for writes. In this architecture the database is still the system of record for the true value of data.
  • Use MySQL replication for reliability and read query scaling. There's an effective limit to the number of slaves that can be supported, so just adding slaves won't work as a scaling strategy for larger sites. Using this approach you get scalable reads and writes along with high availability. Given that MySQL has a cache, why is memcached needed at all?
  • The MySQL cache is associated with just one instance. This limits the cache to the address space of one server. If your system is larger than the memory of one server then using the MySQL cache won't work. And if the same object is read from another instance it's not cached.
  • The query cache invalidates on writes. You build up all that cache and it goes away when someone writes to it. Your cache may not be much of a cache at all depending on usage patterns.
  • The query cache is row based. Memcached can cache any type of data you want and it isn't limited to caching database rows. Memcached can cache complex objects that are directly usable without a join.

    Cache everything that is slow to query, fetch, or calculate.

    This is Fotolog's rule of deciding what to cache. What is considered slow depends on your requirements. But when something becomes slow it's a candidate for caching.

    Fotolog's Caching Typology

    Fotolog has come up with an interesting typology of their different caching strategies:
  • Non-Deterministic Cache - the classic memcached model of reading through the cache and writing to the database.
  • State Cache - maintain current application state in cache.
  • Proactive Cache - push changes from the database directly to the cache.
  • File System Cache - save NFS load by serving files from the cache instead of the file system.
  • Partial Page Cache - cache displayable page elements, not just data.
  • Application Based Replication - use a client side API to hide all the low level details of interacting with the cache.

    Non-Deterministic Cache

    This is the typical way of using memcached. I think the non-determinism comes in because an application can't depend on data being in the cache. The data may have been evicted because a slab is full or because the data simply hasn't been added yet. Other Fotolog strategies are deterministic, which means the application can assume the data is always present.

    Useful For

  • Ideal for complex objects that are read several times. Especially for sharded environments where you need to collect data from multiple shards.
  • Good replacement for MySQL query cache
  • Caching relationships and other lists
  • Slow data that’s used across many pages
  • Don’t cache if it's more taxing to cache than you’ll save
  • Tag clouds and auto-suggest lists. For example, when a photo is uploaded, it appears on the page of every friend. These lists are taxing to calculate so they are cached.

    Usage Steps

  • Check memcached for your data key.
  • If the data doesn't exist in the cache then check the database.
  • If the data exists in the database then populate memcached.
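
    A sketch of those three steps in Python might look like the following (pymemcache again; fetch_user_from_db is a placeholder for whatever your data layer does):

        import json
        from pymemcache.client.hash import HashClient

        cache = HashClient([("cache1.internal", 11211)])

        def fetch_user_from_db(db, user_id):
            # Placeholder query; substitute your real schema here.
            with db.cursor() as cur:
                cur.execute("SELECT id, name FROM users WHERE id = %s", (user_id,))
                row = cur.fetchone()
            return {"id": row[0], "name": row[1]} if row else None

        def get_user(db, user_id, ttl=600):
            key = "user:%d" % user_id
            cached = cache.get(key)                 # 1. check memcached
            if cached is not None:
                return json.loads(cached)
            user = fetch_user_from_db(db, user_id)  # 2. miss: check the database
            if user is not None:
                cache.set(key, json.dumps(user), expire=ttl)  # 3. populate cache
            return user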

    Stats

  • 45 memcached instances dedicated to nondeterministic cache
  • Each instance (on average):
    * ~440 gets per second
    * ~40 sets per second
    * ~11 gets per set
    Fotolog likes to characterize their caching policy in terms of the ratio of gets to sets.

    Potential Problems

  • While memcached is usually a light CPU user, Fotolog ran their cache on their application servers and found some problems:
    * 90% CPU usage
    * Memory garbage collected nearly once a minute
    * Experienced blocking on memcached on app servers
  • Operations are not transactional. One work around is to set expirations so stale data doesn't stay around long.

    State Cache

    Keeps the current state of an application in cache.

    Useful For

  • Expensive operations.
  • Sessions. If a memcached server goes down just make people log in again.
  • Keep track of who's online and their current status, especially for IM applications. Use memcached or the load would cripple the database.

    Usage Steps

    It's a form of non-deterministic caching so the same steps apply.

    Stats

  • 9G dedicated. Depending on the number of users, all users can be kept in cache.

    Deterministic Cache

    Keep all data for particular database tables in the cache. Applications always assume that the data they need is in the cache (deterministic); they never go to the database for it. Applications don't have to check memcached before accessing the data. The data will exist in the cache because the database is essentially loaded into the cache and is always kept in sync.

    Useful For

  • Read scalability. All reads go through the cache and the database is completely offloaded.
  • Ideal for caching things that have no expiration.
  • Heavily accessed data/objects/lists
  • User credentials
  • User profiles
  • User preferences
  • Active media belonging to users, like photo lists.
  • Outsourcing logins to memcached. Don't hit database on login. All logins are processed through the cache.

    Usage Steps

  • Multiple dedicated cache pools are maintained. Instead of one large pool, make separate standalone pools. Multiple pools are necessary for high availability. If a cache pool goes down then applications switch over to the next pool. If multiple pools did not exist and a pool went down, the database would be swamped with the load.
  • All cache pools are maintained by the application. A write to the database means writing to multiple memcached pools. When an update to a user profile happens, for example, the update has to be replicated to multiple caches from that point onwards.
  • When the site starts after a shutdown, the deterministic cache is populated before the site comes up (see the warm-up sketch after this list). The site is rarely rebooted so this is a rare occurrence.
  • Reads could also be load balanced against the cache pools for better performance and higher scalability.
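
    A minimal warm-up sketch for the steps above might look like this (the pool hosts and the user_profiles table are invented for illustration); every pool gets a full copy so losing one pool never sends reads to the database:

        from pymemcache.client.hash import HashClient

        POOLS = [
            HashClient([("det-a1.internal", 11211), ("det-a2.internal", 11211)]),
            HashClient([("det-b1.internal", 11211), ("det-b2.internal", 11211)]),
        ]

        def warm_user_profiles(db):
            """Load the whole table into every pool before the site comes up."""
            with db.cursor() as cur:
                cur.execute("SELECT user_id, profile_blob FROM user_profiles")
                for user_id, blob in cur:
                    key = "profile:%d" % user_id
                    for pool in POOLS:
                        pool.set(key, blob)   # no expiry: the cache is the read path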

    Stats

  • ~ 90,000 gets / second across cache cluster
  • ~ 300 sets / second
  • get/set ratio of ~ 300

    Potential Problems

  • Must have enough memory to hold everything.
  • The database has rows and the cache may have objects. The database caching logic must know enough to create objects from the database schema.
  • Maintaining the multiple caches seems complex. Fotolog uses Java and Hibernate. They wrote their own client to handle rotation through pools.
  • Doesn't maintaining multiple caches add a lot of overhead to the application? In practice there's little replication overhead compared to the benefit.

    Proactive Caching

    Data magically shows up in the cache. As updates happen to the database the cache is populated based on the database change. Since the cache is updated as the database is updated the chances of data being in cache are high. It's non-deterministic caching with a twist.

    Useful For

  • Pre-Populating Cache: keeping memcached updated minimizes calls to the database if an object is not present.
  • “Warm up” cache in cases of cross data-center replication.

    Usage Steps

    There are typically three implementation approaches:
  • Parse the binary log for updates. When an update is found, perform the same operation on the cache (a generic sketch of this approach follows this list).
  • Implement user defined functions. Set up triggers that call a UDF to update the cache. See http://tangent.org/586/Memcached_Functions_for_MySQL.html for more details.
  • Use the Blackhole gambit. Facebook is rumored to use the Blackhole storage engine to populate cache. Data written to a Blackhole table is replicated to cache. Facebook uses it more to invalidate data and for cross country replication. The advantage of this approach is the data is not replicated through MySQL which means there are no binary logs for the data and it's not CPU intensive.
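
    A generic sketch of the change-capture approach above (the change source is deliberately left abstract; a real implementation would tail the binary log or drain a trigger-fed queue):

        import json
        from pymemcache.client.hash import HashClient

        cache = HashClient([("cache1.internal", 11211)])

        def apply_change(change):
            """change is a dict like {'table': 'user_profiles', 'pk': 42, 'row': {...}}."""
            key = "%s:%s" % (change["table"], change["pk"])
            if change.get("row") is None:
                cache.delete(key)                       # row was deleted upstream
            else:
                cache.set(key, json.dumps(change["row"]))

        def run(change_source):
            # change_source is any iterable of row changes, e.g. a binlog tailer.
            for change in change_source:
                apply_change(change)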

    File System Caching

    NFS has significant overhead when used with a large number of servers. Fotolog originally stored XML files on a SAN and exposed them using NFS. They saw contention on these files so they put files in memcached. Big performance improvements were seen and it kept NFS mounts open for other requests. Smaller media can also be stored in the cache.

    Partial Page Caching

    Cache directly displayable page elements. The other caching strategies cache data used to create pages, but some things are still compute intensive and require a lot of work. So instead of just caching objects, prepare and cache entire page elements for reuse. For page elements that are accessed many times per second this can be a big win. For example: calculating top users in a region, popular photo list, and featured photo list. Especially when using sharding it can take some time to calculate these lists, so caching the resulting page elements makes a lot of sense.
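
    A sketch of caching a rendered page element rather than raw rows (render_top_users_html stands in for the expensive, possibly cross-shard rendering step):

        from pymemcache.client.hash import HashClient

        cache = HashClient([("cache1.internal", 11211)])

        def render_top_users_html(region):
            # Placeholder: query the shards, rank users, and render the list.
            return "<ul><li>top users for %s</li></ul>" % region

        def top_users_fragment(region, ttl=60):
            key = "fragment:top_users:%s" % region
            html = cache.get(key)
            if html is None:                         # recompute at most once per TTL
                html = render_top_users_html(region)
                cache.set(key, html, expire=ttl)
            return html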

    Application Based Replication

    Write data to the cache through your own API. The API hides implementation details like:
  • An application writes to one memcached client which writes to multiple memcached instances.
  • Where the cache pools are and how many there are.
  • Writing to multiple pools at the same time.
  • Rotating to another pool on a pool failure until another pool is brought up and updated with data. This approach is very fast if not network bound and very cost effective.
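
    A hypothetical sketch of such a client-side API (this is not Fotolog's actual client): writes fan out to every pool and reads rotate to the next pool when one fails:

        from pymemcache.client.hash import HashClient
        from pymemcache.exceptions import MemcacheError

        class ReplicatedCache:
            def __init__(self, pools):
                self.pools = pools              # a list of HashClient pools

            def set(self, key, value):
                for pool in self.pools:         # write to every pool
                    try:
                        pool.set(key, value)
                    except (MemcacheError, OSError):
                        pass                    # a down pool is repopulated later

            def get(self, key):
                for pool in self.pools:         # rotate on pool failure
                    try:
                        value = pool.get(key)
                        if value is not None:
                            return value
                    except (MemcacheError, OSError):
                        continue
                return None

        cache = ReplicatedCache([HashClient([("pool-a1.internal", 11211)]),
                                 HashClient([("pool-b1.internal", 11211)])])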

    Miscellaneous

    There were a few suggestions on using memcached that didn't fit in any other section, so they're gathered here for posterity:
  • Have a lot of nodes to handle loss. Losing a node with a few nodes will cause a spike on the database as everything reloads. Having more servers means less database load on failure.
  • Use a warm standby that takes over the IP of a memcached server that fails. This means your clients will not have to update their cache lists.
  • Memcached can operate over UDP and TCP. Persistent connections are better because there's less overhead. The cache is designed to handle thousands of connections.
  • Use separate memcached servers to reduce contention with applications.
  • Check that your slab sizes match the size of the data you are allocating or you could be wasting a lot of memory. Here are some additional strategies from the Memcached and MySQL tutorial:
  • Don't think row-level (database) caching, think complex objects.
  • Don't run memcached on your database server, give your database all the memory it can get.
  • Don't obsess about TCP latency - localhost TCP/IP is optimized down to an in-memory copy.
  • Think multi-get - run things in parallel whenever you can.
  • Not all memcached client libraries are made equal, do some research on yours.
  • Instead of invalidating your data, expire it whenever you can - memcached will do all the work
  • Generate smart keys - ex. on update, increment a version number, which will become part of the key
  • For bonus points, store the version number in memcached - call it generation
  • The latter will be added to Memcached soon - as soon as Brian gets around to it
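
    The versioned-key ("generation") idea above can be sketched like this: store a per-user generation number in memcached and mix it into every key, so bumping the generation effectively expires all of that user's cached objects at once (key names here are illustrative):

        from pymemcache.client.hash import HashClient

        cache = HashClient([("cache1.internal", 11211)])

        def _gen_key(user_id):
            return "gen:%d" % user_id

        def current_generation(user_id):
            gen = cache.get(_gen_key(user_id))
            if gen is None:
                cache.set(_gen_key(user_id), b"1")
                return 1
            return int(gen)

        def user_key(user_id, name):
            # e.g. "u:42:g3:profile" -- old generations simply age out of the LRU.
            return "u:%d:g%d:%s" % (user_id, current_generation(user_id), name)

        def invalidate_user(user_id):
            if cache.incr(_gen_key(user_id), 1) is None:   # atomic bump
                cache.set(_gen_key(user_id), b"1")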

    Final Thoughts

    Fotolog has obviously put a great deal of thought and effort into creating sophisticated scaling strategies using memcached and MySQL. What I'm struck with is the enormous amount of effort that goes into syncing rows and objects back and forth between the cache and the database. Shouldn't it be easier? What role is the database playing when the application makes such constant use of the object cache? Wouldn't more memory make the disk based storage unnecessary?

    Related Articles

  • Designing and Implementing Scalable Applications with Memcached and MySQL by Farhan Mashraqi from Fotolog, Monty Taylor from Sun, and Jimmy Guerrero from Sun
  • Memcached for Mysql Advanced Use Cases by Farhan Mashraqi of Fotolog
  • Memcached and MySQL tutorial by Brian Aker, Alan Kasindorf - Overview with examples of how a few companies use memcached. Good presentation notes by Colin Charles.
  • Strategy: Break Up the Memcache Dog Pile
  • Secrets to Fotolog's Scaling Success
  • Memcached for MySQL


    Tuesday
    Jul 22, 2008

    Scaling Bumper Sticker: A 1 Billion Page Per Month Facebook RoR App  

    Several months ago I attended a Joyent presentation where the spokesman hinted that Joyent had the chops to support a one billion page per month Facebook Ruby on Rails application. Even under a few seconds of merciless grilling he would not give up the name of the application. Now we have the big reveal: it was LinkedIn's Bumper Sticker app. For those not currently sticking things on bumps, Bumper Sticker is quite surprisingly a viral media sharing application that allows users to express their individuality by sticking small virtual stickers on Facebook profiles. At the time I was quite curious how Joyent's cloud approach could be leveraged for this kind of app. Now that they've released a few details, we get to find out.

    Site: http://www.Facebook.com/apps/application.php?id=2427603417

    Information Sources

  • Video: Scaling to 1 Billion Page Views Per Month (very flashy)
  • Web Scalability Practices: Bumper Sticker on Rails by Ikai Lan and Jim Meyer from LinkedIn
  • 1 Billion Page Views a Month by David Young from Joyent
  • Ruby on Rails: scaling to 1 billion page views per month by Dennis Howlett from ZDNet
  • Joyent's Grid Accelerators for Web Applications by Jason Hoffman from Joyent
  • On Grids, the Ambitions of Amazon and Joyent by Jason Hoffman from Joyent
  • Scaling Ruby on Rails to 1 Billion Page Views a Month by Joe Pruitt from DevCentral

    The Platform

  • MySQL
  • Nginx
  • Mongrel
  • CDN
  • Ruby on Rails (rapid prototype development approach)
  • Facebook
  • Joyent Accelerator - provides a highly scalable on-demand infrastructure for running web sites, including rich web applications written in Ruby on Rails, PHP, Python and Java. Joyent Accelerators are next-generation virtual computers that can grow and multiply (or shrink and consolidate) depending on the real world demands faced by your Web application. Accelerators are built on OpenSolaris, multi-core (8+), RAM-rich servers (32GB+ each) and vast amounts of NAS storage.
  • Masochism Plugin - provides an easy solution for Ruby on Rails applications to work in a replicated database environment. Connection proxy sends some database queries (those in a transaction, update statements, and ActiveRecord::Base#reload) to a master database, and the rest to the slave database.
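
    The Masochism plugin itself is Ruby, but the routing rule it implements is easy to sketch; the Python sketch below (illustrative only, not the plugin) sends writes and in-transaction work to the master and plain reads to a slave:

        class MasterSlaveRouter:
            """Toy read/write splitter over two DB-API connections (e.g. PyMySQL)."""

            def __init__(self, master, slave):
                self.master = master
                self.slave = slave
                self.in_transaction = False

            def execute(self, sql, params=None):
                is_read = sql.lstrip().lower().startswith("select")
                conn = self.slave if (is_read and not self.in_transaction) else self.master
                with conn.cursor() as cur:
                    cur.execute(sql, params or ())
                    return cur.fetchall() if is_read else None

            def begin(self):
                self.in_transaction = True      # keep transactional reads on the master

            def commit(self):
                self.master.commit()
                self.in_transaction = False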

    The Stats

  • 1 billion page views per month
  • 13.5 million installations
  • 1.5 million daily active users. Recruited 1 million users in first 46 days.
  • 20-27 million canvas page views a day
  • 13 web application servers running Nginx and Mongrel
  • 8 static asset servers serving over 3,500,000 stickers (migrating to a CDN)
  • 4 MySQL servers in a master/slave configuration using Masochism as a proxy to load balance database operations.
  • Cost is about $25K/month.

    The Architecture

  • Bumper Sticker was an experiment to see how fast the Light Engineering Development (LED) team at LinkedIn could build a Ruby on Rails Facebook application.
  • RoR was an easy environment to prototype in, but they needed a production environment in which they could quickly develop, deploy, and scale. Joyent was selected.
  • Some Notes on Joyent:
    * Joyent is a scale on demand cloud. Allows customers to have a dynamic data center instead of being stuck using their own rigid infrastructure.
    * There's an API if you need one. The service is unmanaged, you get root on all your boxes.
    * They consider their infrastructure to be better and more open than Amazon. You get access to a high end load balancer and the capabilities of OpenSolaris (Dtrace, Zones, lower request processing overhead, sub 10 second reboot times).
    * Joyent's primary scalability principle is to organize apps around silos built from their powerful Accelerator blocks: put applications on different servers based on the quality of service you want to give them. For example, put static content on their own servers so the static content is always served fast and reliably. This allows you to prioritize based on what's important to you. You could, for example, prioritize the virality of your application by putting the Invite Friends functionality on their own servers, thus assuring the growth of your application through your viral functionality possibly at the expense of less important functionality.
    * Has three data centers in the US and is opening a fourth; none are in Europe.
    * Considers their secret sauce to be their highly sophisticated administration system which allows a few people to easily manage a large infrastructure.
    * Has a peering relationship with Facebook. That means there are direct high-speed fiber links between Joyent’s data center in Emeryville and Facebook’s data center in San Francisco.

  • 80% of the content for Bumper Sticker is static. The Facebook API can directly render content at a specified memory location. Bumper Sticker was able to use the scripting feature of the F5 BIG-IP load balancer to directly load static content by passing a pointer to the Facebook API.

    The Lessons

  • Rails scales exactly like any other app. Take into account all the components from the moment the request is received at the load balancer all the way down and all the way back again.
  • The development process is: put some measurements in place, find problems, fix problems, more people adopt and scale you out of your solution, and the cycle repeats. Sun's Dtrace feature makes it easy to instrument the stack to identify bottlenecks.
  • Rails scales as long as the development team using it understands that many of the bottlenecks are exactly those faced by developers on any other database-driven web platform.
  • Hit a disk spindle and you are screwed. Avoid going to the database or the file system. The more they avoided disk the fewer timeouts they experienced.
  • Convert anything dynamic into static content. Dynamic content is your enemy. Convert anything dynamic into static content so it can be removed from the disk path.
  • Push content to the edge. Move content as close to the client as possible. Move cache to the CDN. Reduce time going across the network.
  • Faster means more viral. On a viral system the better the performance the more people can play with your application. The more people who play with your system the more likely they are to pull more people in, which means the more the app will spread and go viral. Bumper Sticker has been successful at creating a community of fans who enjoy uploading and sharing their own stickers.

    Some issues:
  • Since most of the content is static and served by the load balancer, the impact of Rails in the system is not clear.
  • The functionality of Bumper Sticker is relatively simple. What would the impact be on scalability if other often requested features like search were added?

    Related Articles

  • Friends for Sale Architecture - A 300 Million Page View/Month Facebook RoR App
    Wednesday
    Jun 11, 2008

    Pyshards aspires to build sharding toolkit for Python

    I've been interested in sharding concepts since first hearing the term "shard" a few years back. My interest had been piqued earlier, the first time I read about Google's original approach to distributed search. It was described as a hashtable-like system in which independent physical machines play the role of the buckets. More recently, I needed the capacity and performance of a sharded system, but did not find helpful libraries or toolkits which would assist with the configuration for my language of preference these days, which is Python. And, since I had a few weeks on my hands, I decided I would begin the work of creating these tools. The result of my initial work is the Pyshards project, a still-incomplete Python and MySQL based horizontal partitioning and sharding toolkit.

    HighScalability.com readers will already know that horizontal partitioning is a data segmenting pattern in which distinct groups of physical row-based datasets are distributed across multiple partitions. When the partitions exist as independent databases and when they exist within a shared-nothing architecture they are known as shards. (Google apparently coined the term shard for such database partitions, and Pyshards has adopted it.) The goal is to provide big opportunities for database scalability while maintaining good performance. Sharded datasets can be queried individually (one shard) or collectively (aggregate of all shards).

    In the spirit of The Zen of Python, Pyshards focuses on one obvious way to accomplish horizontal partitioning, and that is by using a hash/modulo based algorithm. Pyshards provides the ability to reasonably add polynomial capacity (number of original shards squared) without re-balancing (re-sharding). Pyshards is designed with re-sharding in mind (because the time will come when you must re-balance) and provides re-sharding algorithms and tools. Finally, Pyshards aspires to provide a web-based shard monitoring tool so that you can keep an eye on resource capacity.

    So why publish an incomplete open source project? I'd really prefer to work with others who are interested in this topic instead of working in a vacuum. If you are curious, or think you might want to get involved, come visit the project page, join a mailing list, or add a comment on the wiki. http://code.google.com/p/pyshards/wiki/Pyshards Devin
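
    For readers new to the idea, the hash/modulo routing Devin describes boils down to something like this sketch (the shard DSNs are made up and Pyshards' real API will differ):

        import hashlib

        SHARDS = [
            "mysql://shard0.internal/app",
            "mysql://shard1.internal/app",
            "mysql://shard2.internal/app",
            "mysql://shard3.internal/app",
        ]

        def shard_for_key(key, shards=SHARDS):
            """Every caller hashes the same key to the same shard."""
            digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
            return shards[int(digest, 16) % len(shards)]

        print(shard_for_key(12345))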


    Wednesday
    May 28, 2008

    Webinar: Designing and Implementing Scalable Applications with Memcached and MySQL

    The following technical Webinar could be of interest to the community. WHO:

    • Farhan "Frank" Mashraqi, Director of Business Operations and Technical Strategy, Fotolog Inc
    • Monty Taylor, Senior Consultant, Sun Microsystems
    • Jimmy Guerrero, Sr Product Marketing Manager, Sun Microsystems - Database Group
    WHAT:
    • Designing and Implementing Scalable Applications with Memcached and MySQL web presentation.
    WHEN:
    • Thursday, May 29, 2008, 10:00 am PST, 1:00 pm EST, 18:00 GMT
    • The presentation will be approximately 45 minutes long followed by Q&A.
    Check out the details here!

    Click to read more ...

    Monday
    May 5, 2008

    HSCALE - Handling 200 Million Transactions Per Month Using Transparent Partitioning With MySQL Proxy

    Update 2: An HSCALE benchmark finds HSCALE "adds a maximum overhead of about 0.24 ms per query (against a partitioned table)." Future releases promise much improved results. Update: A new presentation at An Introduction to HSCALE.

    After writing Skype Plans for PostgreSQL to Scale to 1 Billion Users, which shows how Skype smartly uses a proxy architecture for scaling, I'm now seeing MySQL Proxy articles all over the place. It's like those "get rich quick" books that say all you have to do is visualize a giraffe with a big yellow dot superimposed over it and by sympathetic magic giraffes will suddenly stampede into your life. Without realizing it I must have visualized transparent proxies smothered in yellow dots.

    One of the brightest images is a wonderful series of articles by Peter Romianowski describing the evolution of their proxy architecture. Their application is an OLTP system executing 200 million transactions per month, with tables of more than 1.5 billion rows and a 600 GB total database size. They ran into a wall buying bigger boxes and wanted to move to a sharded architecture. The question for them was: how do you implement sharding? In the first article four approaches to sharding were identified:

  • Using MySQL Cluster
  • Using MySQL Proxy with transparent query rewriting and load balancing
  • Implement it into a JDBC driver
  • Implement it into the application data access layer.

    The proxy solution was selected because it's transparent to the application layer. Applications need not know about the partitioning scheme to make it work. Not mucking with apps is a big win. The downside is implementation complexity. How do you parse a query and map it correctly to the right server? Will this cause a big performance degradation? How is this new, more complex and dynamic system to be tested? Can they run the same queries as before or will they have to rewrite parts of their application? A lot of questions to be worked out. The second article starts working out those problems using MySQL Proxy. The process was broken into a few steps:
  • Analyze the query to find out which tables are involved and what the partition key would be.
  • Validate the query and reject queries that cannot be analyzed.
  • Determine the partition table / database. This could be done by a simple lookup, a hashing function or anything else.
  • Rewrite the query and replace the table names with the partition table names.
  • Execute the query on the correct database server and return the result back to the client.

    Some of the comments were concerned that a modulus scheme was being used to identify a partition. The recommendation was to use a directory service for mapping to partitions instead. A directory service allows you to logically map partitions behind the scenes and doesn't tie you to a deterministic physical mapping.

    After getting all this working they generously released it to the world as HSCALE - Transparent MySQL Partitioning: HSCALE is a plugin written for MySQL Proxy which allows you to transparently split up tables into multiple tables called partitions. In later versions you will be able to put each partition on a different MySQL server. Application based partitioning means that you split up your data logically and rewrite your application to select the right piece of data (i.e. partition) at any given time. More on application based partitioning. Read here some more about what could be done with HSCALE. HSCALE helps with application based partitioning. Using MySQL Proxy it sits between your application and the database server. Whenever a SQL statement is sent to the server, HSCALE analyzes it to find out whether a partitioned table is used. It then tries to find out which partition the SQL statement should go to. Access release 0.1 at HSCALE 0.1 released - Partitioning Using MySQL Proxy.

    The transparent proxy ability is very powerful, but what we are lacking, and what various companies have created internally, is a partition management layer. How do you move partitions? How do you split partitions when a table outgrows the shard or performance declines? Lots of cool tools still to build.
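
    HSCALE itself is a Lua plugin for MySQL Proxy, but the analyze/determine/rewrite steps listed above can be illustrated with a toy Python rewriter (the partition count and table names are invented):

        import re

        NUM_PARTITIONS = 16

        def rewrite(sql, table, partition_key_value):
            """Replace `table` with the partition chosen by a modulo of the key."""
            partition = "%s_%d" % (table, partition_key_value % NUM_PARTITIONS)
            return re.sub(r"\b%s\b" % re.escape(table), partition, sql)

        print(rewrite("SELECT * FROM orders WHERE customer_id = 4242",
                      "orders", 4242))
        # -> SELECT * FROM orders_2 WHERE customer_id = 4242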

    Related Articles

  • HSCALE - Transparent MySQL Partitioning
  • Pero: HSCALE 0.1 released - Partitioning Using MySQL Proxy
  • Pero: MySQL Partitioning on Application Side
  • Pero: Progress on MySQL Proxy Partitioning
  • HighScalability: Flickr Architecture - more information on partitioning.
  • Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web
  • HighScalability: An Unorthodox Approach to Database Design: The Coming of the Shard.


    Friday
    May 2, 2008

    Friends for Sale Architecture - A 300 Million Page View/Month Facebook RoR App

    Update: Jake in Does Django really scale better than Rails? thinks apps like FFS shouldn't need so much hardware to scale.

    In a short three months Friends for Sale (think Hot-or-Not with a market economy) grew to become a top 10 Facebook application handling 200 gorgeous requests per second and a stunning 300 million page views a month. They did all this using Ruby on Rails, two part time developers, a cluster of a dozen machines, and a fairly standard architecture. How did Friends for Sale scale to sell all those beautiful people? And how much do you think your friends are worth on the open market? 

    Site: http://www.facebook.com/apps/application.php?id=7019261521

    Information Sources

  • Siqi Chen and Alexander Le, co-creators of Friends for Sale, answering my standard questionnaire.
  • Virality on Facebook

    The Platform

  • Ruby on Rails
  • CentOS 5 (64 bit)
  • Capistrano - update and restart application servers.
  • Memcached
  • MySQL
  • Nginx
  • Starling - distributed queue server
  • Softlayer - hosting service
  • Pingdom - for website monitoring
  • LVM - logical volume manager
  • Dr. Nic's Magic Multi-Connections Gem - splits database reads and writes between servers

    The Stats

  • 10th most popular application on Facebook.
  • Nearly 600,000 active users.
  • Half a million unique visitors a day and growing fast.
  • 300 million page views a month.
  • 300% monthly growth rate, but that is plateauing.
  • 2.1 million unique visitors in the past month
  • 200 requests per second.
  • 5TB of bandwidth per month.
  • 2 part time (now full time), and 1 remote DBA contractor.

  • 4 DB servers, 6 application servers, 1 staging server, and 1 front end server.
    - 6 application servers, each with 4 cores and 8 GB of RAM.
    - Each application server runs 16 mongrels for a total of 96 mongrels.
    - A 4 GB memcache instance on each application server.
    - 2 database servers, each with 32 GB of RAM, 4 cores, and 4x 15K SCSI disks in RAID 10, in a master-slave setup.

    Getting to Know You

  • What is your system for?

    Our system is designed for our Facebook application, Friends for Sale.
    It's basically Hot-or-Not with a market economy. At the time of this
    writing it's the 10th most popular application on Facebook.

    Their Facebook description reads: Buy and sell your friends as pets! You can make your pets poke, send gifts, or just show off for you.
    Make money as a shrewd pets investor or as a hot commodity! Friends for Sale is the bees knees!


  • Why did you decide to build this system?

    We designed this as more of an experiment to see if we understood virality concepts and metrics on Facebook. I guess we do. =)

  • What particular design/architecture/implementation challenges does your system have?

    As a Facebook application, every request is dynamic so no page caching is possible. Also, it is a very interactive, write heavy application so scaling the database was a challenge.

  • What did you do to meet these challenges?

    We memcached extensively early on - every page reload results in 0 SQL calls. We mostly use Rails' fragment caching with custom expiration logic.

  • How big is your system?

    We had more than half a million unique visitors yesterday and growing fast. We're on track to do more than 300 million page views this month.

  • What is your in/out bandwidth usage?

    We used around 3 terabytes of bandwidth last month. This month should be at least 5TB or so. This number is just for a few icons and XHTML/CSS.

  • How many documents do you serve? How many images? How much data?

    We don't really have unique documents ... we do have around 10 million user profiles though.

    The only images we store are a few static image icons.

  • How fast are you growing?

    We went from around 3M page views per day a month ago to more than 10M page views a day. A month before that we were doing 1M page views per day. So that's around a 300% monthly growth rate but that is plateauing. On a request per second basis, we get around 200 requests per second.

  • What is your ratio of free to paying users?

    It's all free.

  • What is your user churn?

    It's around 1% per day, with a growth rate of 3% or so per day in terms of installed users.

  • How many accounts have been active in the past month?

    We had roughly 2.1 million unique visitors in the past month according to Google.

  • What is the architecture of your system?

    It's a relatively standard Rails cluster. We have a dedicated front end proxy balancer / static web server running nginx, which proxies directly to 6 application servers, each with 4 cores and 8 GB of RAM. Each application server runs 16 mongrels for a total of 96 mongrels. The front end load balancer proxies directly to the mongrel ports. In addition, we run a 4 GB memcache instance on each application server, along with a local Starling distributed queue server and misc background processes.

    We use god to monitor our processes.

    On the DB layer, we have 2 servers, each with 32 GB of RAM, 4 cores, and 4x 15K SCSI disks in RAID 10, in a master-slave setup. We use Dr. Nic's magic multi-connections gem in production to split reads and writes to each box.

    We are adding more slaves right now so we can distribute the read load better and have better redundancy and backup policies. We also get help from Percona (the mysqlperformanceblog guys) for remote DBA work.

    We're hosted on Softlayer - they're a fantastic host. The only problem was that their hardware load balancing server doesn't really work very well ... we had lots of problems with hanging connections and latency. Switching to a dedicated box running just nginx fixed everything.

  • How is your system architected to scale?

    It really isn't. On the application layer we are shared-nothing so it's pretty trivial. On the database side we're still with a monolithic master and we're trying to push off sharding for as long as we can. We're still vertically scaled on the database side and I think we can get away with it for quite some time.

  • What do you do that is unique and different that people could best learn from?

    The three things that are unique are:

    1. Neither of the two developers involved had previous experience in large scale Rails deployment.
    2. Our growth trajectory is relatively rare in the history of Rails deployments.
    3. We had very little opportunity for static page caching - each request does hit the full Rails stack.

  • What lessons have you learned? Why have you succeeded? What do you wish you would have done differently? What wouldn't you change?

    We learned that a good host, good hardware, and a good DBA are very important. We used to be hosted on Railsmachine, which to be fair is an excellent shared hosting company, and they did go out of their way to support us. In the end though, we were barely responsive for a good month due to hardware problems, and it only took two hours to get up and running on Softlayer without a hitch. Choose a good host if you plan on scaling, because migrating isn't fun.

    The most important thing we learned is that your scalability problem is pretty much always, always, always the database. Check it first, and if you don't find anything, check again. Then check again. Without exception, every performance problem we had can be traced to the database server, the database configuration, the query, or the use and non-use of indices.

    We definitely should have gotten on to a better host earlier in the game so we would have been up.

    We definitely wouldn't change our choice of framework - Rails was invaluable for rapid application development, and I think we've pretty much proven that two guys without a lot of scaling experience can scale a Rails app up. The whole 'but does Rails scale?' discussion sounds like a bunch of masturbation - the point is moot.

  • How is your team setup?

    We have two Rails developers, inclusive of me. We very recently retained the services of a remote DBA for help on the database end.

  • How many people do you have?

    On the technical side, 2 part time (now full time), and 1 remote DBA contractor.

  • Where are they located?

    The full time employees are also located in the SOMA area of San Francisco.

  • Who performs what roles?

    The two developers serve as co-founders. I (Siqi) was responsible for front end design and development early on, but since I had some experience with deployment I also ended up handling network operations and deployment as well. My co-founder Alex is responsible for the bulk of the Rails code - basically all the application logic is from him. Now I find myself doing more deep back end network operations tasks like MySQL optimization and replication - it's hard to find time to get back to the front end, which is what I love. But it's been a real fun learning experience so I've been eating up all I can from this.

  • Do you have a particular management philosophy?

    Yes - basically find the smartest people you can, give them the best deal possible, and get out of their way. The best managers GET OUT OF THE WAY, so I try to run the company as much as I can with that in mind. I think I usually fail at it.

  • If you have a distributed team how do you make that work?

    We'd have to have some really good communication tools in the cloud - somebody would have to be a Basecamp nazi. I think remote work / outsourcing is really difficult - I prefer to stay away from it for core development. For something like MySQL DBA or even sysadmin, it might make more sense.

  • What do you use?

    We use Rails with a bunch of plugins, most notably cache-fu from Chris Wanstrath and magic multi connections from Dr. Nic. I use VIM as the editor with the rails.vim plugin.

  • Which languages do you use to develop your system?

    Ruby / Rails

  • How many servers do you have?

    We now have 12 servers in the cluster.

  • How are they allocated?

    4 DB servers, 6 application servers, 1 staging server, and 1 front end server.

  • How are they provisioned?

    We order them from Softlayer - there's a less than 4 hour turn around for most boxes, which is awesome.

  • What operating systems do you use?

    CentOS 5 (64 bit)

  • Which web server do you use?

    nginx

  • Which database do you use?

    MySQL 5.1

  • Do you use a reverse proxy?

    We just use nginx's built in proxy balancer.

  • How is your system deployed in data centers?

    We use a dedicated hosting service, Softlayer.

  • What is your storage strategy?

    We use NAS for backups but internal SCSI drives for our production boxes.

  • How much capacity do you have?

    Across all of our boxes we probably have around ... 5 TB of storage or
    thereabouts.

  • How do you grow capacity?

    Ad-hoc. We haven't done a proper capacity planning study, to our detriment.

  • Do you use a storage service?

    Nope.

  • Do you use storage virtualization?

    Nope.

  • How do you handle session management?

    Right now we just persist it to the database - it would be fairly easy to use memcache directly for this purpose though.

  • How is your database architected? Master/slave? Shard? Other?

    Master/slave right now. We're moving towards a Master/Multi-slave with a read only load balancing proxy to the slave cluster.

  • How do you handle load balancing?

    We do it in software via nginx.

  • Which web framework/AJAX Library do you use?

    Rails.

  • Which real-time messaging frameworks do you use?

    None.

  • Which distributed job management system do you use?

    Starling

  • How do you handle ad serving?

    We run network ads. We also weight our various ad networks by eCPM on our application layer.

  • Do you have a standard API to your website?

    Nope.

  • How many people are in your team?

    2 developers.

  • What skill sets does your team possess?

    Me: Front end design, development, limited Rails. Obviously, recently proficient in MySQL optimization and large scale Rails deployment.
    Alex: application logic development, front end design, general software engineering.

  • What is your development environment?

    Alex develops on OSX while I develop on Ubuntu. We use SVN for version control. I use VIM for editing and Alex uses TextMate.

  • What is your development process?

    On the logic layer, it's very test driven - we test extensively. On the application layer, it's all about quick iterations and testing.

  • What is your object and content caching strategy?

    We cache both in memcache with no TTL, and we just manually expire.

  • What is your client side caching strategy?

    None.

    How do you manage your system?

  • How do you check global availability and simulate end-user performance?

    We use Pingdom for external website monitoring - they're really good.

  • How do you health check your server and networks?

    Right now we're just relying on our external monitoring and Softlayer's ping monitoring. We're investigating FiveRuns for monitoring as a possible solution to server monitoring.

  • How do you graph network and server statistics and trends?

    We don't.

  • How do you test your system?

    We deploy to staging and run some sanity tests, then we do a deploy to all application servers.

  • How do you analyze performance?

    We trace back every SQL query in development to make sure we're not doing any unnecessary calls or model instantiations. Other than that, we haven't done any real benchmarking.

  • How do you handle security?

    Carefully.

  • How do you decide what features to add/keep?

    User feedback and critical thinking. We are big believers in simplicity so we are pretty careful to consider before we add any major features.

  • How do you implement web analytics?

    We use a home grown metrics tracking system for virality optimization,
    and we also use Google Analytics.

  • Do you do A/B testing?

    Yes, from time to time we will tweak aspects of our design to optimize for virality.

    How is your data center setup?

  • Which firewall product do you use?
  • Which DNS service do you use?
  • Which routers do you use?
  • Which switches do you use?
  • Which email system do you use?
  • How do you handle spam?
  • How do you handle virus checking of email and uploads?

    Don't know to all of the above.

  • How do you backup and restore your system?

    We use LVM to do incrementals on a weekly and daily basis.

  • How are software and hardware upgrades rolled out?

    Right now they are done manually, except for new Rails application deployments. We use capistrano to update and restart our application servers.

  • How do you handle major changes in database schemas on upgrades?

    We usually migrate on a slave first and then just switch masters.

  • What is your fault tolerance and business continuity plan?

    Not very good.

  • Do you have a separate operations team managing your website?

    Oh we wish.

  • Do you use a content delivery network? If so, which one and what for?

    Nope

  • What is your revenue model?

    CPM - more page views more money. We also have incentivized direct offers through our virtual currency.

  • How do you market your product?

    Word of mouth - the social graph. We just leverage viral design tactics to grow.

  • Do you use any particularly cool technologies or algorithms?

    I think Ruby is pretty particularly cool. But no, not really - we're not doing rocket science, we're just trying to get people laid.

  • Do you store images in your database?

    No, that wouldn't be very smart.

  • How much up front design should you do?

    Hm. I'd say none if you haven't scaled up anything before, and a lot if you have. It's hard to know what's actually going to be the problem until you've actually been through it and seen what real load problems look like. Once you've done that, you have enough domain knowledge to do some meaningful up front design on your next go around.

  • Has anything surprised you, either for the good or the bad?

    How unreliable vendor hardware can be, and how different support can be from host to host. The number one most important thing you will need is a scaled up dedicated host who can support your needs. We use Softlayer and we can't recommend them highly enough.

    On the other hand, it's surprising how far just a master-multislave setup can take you on commodity hardware. You can easily do a Billion page views per month on this setup.

  • How does your system evolve to meet new scaling challenges?

    It doesn't really, we just fix bottle necks as they come and we see them coming.

  • Who do you admire?

    Brad Fitzpatrick for inventing memcache, and anyone who has successfully horizontally scaled anything.

  • How are you thinking of changing your architecture in the future?

    We will have to start sharding by users soon as we hit database size and write limits.

    Their Thoughts on Facebook Virality

  • Facebook models the social graph in digital form as accurately and completely as possible.
  • Social graph is more important than features.
  • Facebook enables rapid social distribution of new applications through the social graph.
  • Your application idea should be: social, engaging, and universal.
  • The social aspect makes it viral.
  • Engaging makes it monetizable.
  • Universal gives it potential.
  • Friends for Sale is social because you are buying and selling your social graph.
  • It's engaging because it's a twist on an idea, low pressure, flirty, and a bit cynical.
  • It's universal because everyone is vain, has a price, and wants to flirt with hot people.
  • Every touch point in the application is a potential for recruiting new users.
  • Every user converts 1.4 other users which is the basis for exponential growth.
  • For every new user track the number of invites, notifications, minifeed items, profile clicks, and other channels.
  • For every channel track the percent clicked, converted, uninstalls.

    Lessons Learned

  • Scaling from the start is a requirement on Facebook. They went to 1 million pages/day in 4 weeks.
  • Ruby on Rails can scale.
  • Anything scales on the right architecture. Focus on architecture and operations.
  • You need a good DBA, good host, and good well configured hardware.
  • With caching and the heavy duty servers available today, you can go a long time without adopting more complicated database architectures.
  • The social graph is real. It's truly staggering the number of accessible users on Facebook with the right well implemented viral application.
  • Most performance problems are in the database. Look to the database server, the database configuration, the query, or the use and non-use of indexes.
  • People still use Vi!

    I'd really like to thank Siqi for taking the time to answer all my questions and provide this fascinating look into their system. It's amazing what you've done in so little time. Excellent job and thanks again.