Entries in Strategy (358)

Monday, September 8, 2008

Guerrilla Capacity Planning and the Law of Universal Scalability

In the era of Web 2.0, traditional approaches to capacity planning are often difficult to implement. Guerrilla Capacity Planning facilitates rapid forecasting of capacity requirements based on the opportunistic use of whatever performance data and tools are available. One unique Guerrilla tool is Virtual Load Testing, based on Dr. Gunther's "Universal Law of Computational Scaling", which provides a highly cost-effective method for assessing application scalability. Neil Gunther, M.Sc., Ph.D. is an internationally recognized computer system performance consultant who founded Performance Dynamics Company in 1994.

Some reasons why you should understand this law:

1. A lot of people use the term "scalability" without clearly defining it, let alone defining it quantitatively. Computer system scalability must be quantified. If you can't quantify it, you can't guarantee it. The universal law of computational scaling provides that quantification.

2. One of the greatest impediments to applying queueing theory models (whether analytic or simulation) is the inscrutability of service times within an application. Every queueing facility in a performance model requires a service time as an input parameter. No service time, no queue. Without the appropriate queues in the model, system performance metrics like throughput and response time cannot be predicted. The universal law of computational scaling leapfrogs this entire problem by NOT requiring ANY low-level service time measurements as inputs.

The universal scalability model is a single equation expressed in terms of two parameters, α and β. The relative capacity C(N) is a normalized throughput (so C(1) = 1) given by:

C(N) = N / (1 + α(N − 1) + βN(N − 1))

where N represents either:

1. (Software Scalability) the number of users or load generators on a fixed hardware configuration. In this case, the number of users acts as the independent variable while the CPU configuration remains constant over the range of user-load measurements.

2. (Hardware Scalability) the number of physical processors or nodes in the hardware configuration. In this case, the number of user processes executing per CPU (say 10) is assumed to be the same for every added CPU. Therefore, on a 4-CPU platform you would run 40 virtual users.

Here α (alpha) is the contention parameter and β (beta) is the coherency-delay parameter. A short curve-fitting sketch follows the list below. This model has widespread applicability, including:

  • Accounts for such effects as VM thrashing, and cache-miss latencies.
  • Can also be used to model disk arrays, SANs, and multicore processors.
  • Can also be used to model certain types of network I/O.
  • The user-load form is the most common application of the equation.
  • Can be used in combination with measurement tools like LoadRunner, Benchmark Factory, etc.
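To make the model concrete, here is a minimal curve-fitting sketch in Python using scipy (not part of Gunther's material; the throughput numbers are made up for illustration): measure throughput at several load points, normalize to the single-user value, and fit α and β.

```python
# Hedged sketch: fit the universal scalability model to measured throughput.
# N = users (or CPUs), X = measured throughput; the numbers below are made up.
import numpy as np
from scipy.optimize import curve_fit

N = np.array([1, 2, 4, 8, 16, 32], dtype=float)
X = np.array([100, 190, 350, 590, 860, 1050], dtype=float)

C_measured = X / X[0]  # relative capacity, normalized so C(1) = 1

def usl(n, alpha, beta):
    # C(N) = N / (1 + alpha*(N - 1) + beta*N*(N - 1))
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

(alpha, beta), _ = curve_fit(usl, N, C_measured, p0=(0.01, 0.0001), bounds=(0, 1))
print(f"alpha (contention) = {alpha:.4f}, beta (coherency delay) = {beta:.6f}")

# The model rolls over near N* = sqrt((1 - alpha) / beta): your scalability ceiling.
if beta > 0:
    print("predicted peak concurrency ~", int(np.sqrt((1 - alpha) / beta)))
```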
Gunther's book, Guerrilla Capacity Planning: A Tactical Approach to Planning for Highly Scalable Applications and Services, and its companion, Analyzing Computer Systems Performance: With Perl: PDQ, offer practical advice on capacity planning. Distilled notes from his Guerrilla Capacity Planning (GCaP) classes are available online in The Guerrilla Manual.


Sunday, August 17, 2008

Strategy: Drop Memcached, Add More MySQL Servers

Update 2: Michael Galpin in Cache Money and Cache Discussions likes memcached for its expiry policy, complex graph data, and process data, but says MySQL has many advantages: SQL, Uniform Data Access, Write-through, Read-through, Replication, Management, Cold starts, LRU eviction.

Update: Dormando asks Should you use memcached? Should you just shard mysql more? The idea behind caching is the most important part of caching, as it transports you beyond a simple CRUD worldview. Plan for caching and sharding by properly abstracting data access methods. Brace for change. Be ready to shard, be ready to cache. React and change based on what you push out that actually becomes popular, rather than over-planning and wasting valuable time.

Feedster's François Schiettecatte wonders if Fotolog's 21 memcached servers wouldn't be better used to further shard data by adding more MySQL servers. He mentions Feedster was able to drop memcached once they partitioned their data across more servers. The algorithm: partition until all data resides in memory, and then you may not need an additional memcached layer. Parvesh Garg goes a step further and asks why people think they should be using MySQL at all.
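To make the "abstract your data access" advice concrete, here is a minimal, hypothetical sketch (not Fotolog's or Feedster's actual code): a repository-style accessor that hides both shard routing and an optional cache tier, so you can add or drop memcached without touching callers. The class, the cache client interface, and the fetch_user call are invented for illustration.

```python
# Hypothetical sketch: hide "where does this data live?" behind one accessor so
# a cache tier or more shards can be added later without touching callers.
import hashlib

class UserStore:
    def __init__(self, shards, cache=None):
        self.shards = shards          # list of DB connections, one per shard
        self.cache = cache            # optional memcached-style client, may be None

    def _shard_for(self, user_id):
        # stable hash -> shard index; resharding means changing only this method
        h = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
        return self.shards[h % len(self.shards)]

    def get_user(self, user_id):
        if self.cache is not None:
            cached = self.cache.get(f"user:{user_id}")
            if cached is not None:
                return cached
        row = self._shard_for(user_id).fetch_user(user_id)   # hypothetical DB call
        if self.cache is not None and row is not None:
            self.cache.set(f"user:{user_id}", row, expire=300)  # hypothetical client API
        return row
```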

Related Articles

  • The Death of Read Replication by Brian Aker. Caching layers have replaced read replication. Cache can't fix a broken database layer. Partition the data that feeds the cache tier: "Keep your front end working through the cache. Keep all of your data generation behind it."
  • Read replication with MySQL by François Schiettecatte. Read replication is dead and it should be used only for backup purposes. Take the memory used for caching and give it to your database servers.
  • Replication++, Replication 2.0, Replication.Next by Ronald Bradford. What should read replication be used for?
  • Replication, caching, and partitioning by Greg Linden. Caching is overdone because it adds complexity, adds latency on a cache miss, and uses cluster resources inefficiently. Hitting disk is the real problem. Shard more and get your data in memory.


    Saturday, August 16, 2008

    Strategy: Serve Pre-generated Static Files Instead Of Dynamic Pages

    Pre-generating static files is an oldie but a goodie, and as Thomas Brox Røst says, it's probably an underused strategy today. At one time this was the dominant technique for structuring a web site. Then the age of dynamic web sites arrived and we spent all our time worrying about how to make the database faster and adding more caching to recover the speed we had lost in the transition from static to dynamic.

    Static files have the advantage of being very fast to serve. Read from disk and display. Simple and fast. Especially when caching proxies are used. The issues are how you bulk generate the initial files, how you serve the files, and how you keep the changed files up to date. This is the process Thomas covers in his excellent article Serving static files with Django and AWS - going fast on a budget, where he explains how he converted 600K previously dynamic pages to static pages for his site Eventseer.net, a service for tracking academic events.

    Eventseer.net was experiencing performance problems as search engines crawled their 600K dynamic pages. As a solution you could imagine scaling up, adding more servers, adding sharding, and so on, all somewhat complicated approaches. Their solution was to convert the dynamic pages to static pages in order to keep search engines from killing the site. As an added bonus, non-logged-in users experienced a much faster site and were more likely to sign up for the service.

    The article does a good job explaining what they did, so I won't regurgitate it all here, but I will cover the highlights and comment on some additional potential features and alternate implementations...

    They estimated it would take 7 days on a single server to generate the initial 600K pages. Ouch. So what they did was use EC2 for what it's good for: spin up a lot of boxes to process data. Their data is backed up on S3, so the EC2 instances could read the data from S3, generate the static pages, and write them to their deployment area. It took 5 hours, 25 EC2 instances, and a meager $12.50 to perform the initial bulk conversion. Pretty slick.

    The next trick is figuring out how to regenerate static pages when changes occur. When a new event is added to their system, hundreds of pages could be impacted, which would require the affected static pages to be regenerated. Since it's not important to update pages immediately, they queued updates for processing later. An excellent technique. A local queue of changes is maintained and replicated to an AWS SQS queue. The local queue is used in case SQS is down. Twice a day EC2 instances are started to regenerate pages. Instances read work requests from SQS, access data from S3, regenerate the pages, and shut down when SQS is empty. In addition they use AWS for all their background processing jobs.
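As a rough illustration of the queue-driven regeneration loop described above, here is a sketch of such a worker using today's boto3 SDK (Eventseer's 2008 implementation predates boto3; the queue URL, bucket names, and render_page are placeholders):

```python
# Hedged sketch of a "drain the queue, regenerate pages, shut down" worker.
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/page-regen"  # placeholder

def render_page(source_doc):
    return "<html>...</html>"  # placeholder for the real template rendering

def run_worker():
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                                   WaitTimeSeconds=20)
        messages = resp.get("Messages", [])
        if not messages:
            break  # queue drained: the instance can shut itself down
        for msg in messages:
            page_key = msg["Body"]  # e.g. "events/1234.html"
            src = s3.get_object(Bucket="site-data-backup", Key=page_key)["Body"].read()
            html = render_page(src)
            s3.put_object(Bucket="static-site", Key=page_key,
                          Body=html, ContentType="text/html")
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```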

    Comments

    I like their approach a lot. It's a very pragmatic solution and rock solid in operation. For very little money they offloaded the database by moving work to AWS. If they grow to millions of users (knock on wood) nothing much will have to change in their architecture. The same process will still work and still won't cost very much. Far better than trying to add machines locally to handle the load or moving to a more complicated architecture.

    Using the backups on S3 as a source for the pages rather than hitting the database is inspired. Your data is backed up and the database is protected. Nice. Using batched asynchronous work queues rather than synchronously loading the web servers and the database for each change is a good strategy too.

    As I was reading I originally thought you could optimize the system so that a page only needed to be generated once. Maybe by analyzing the events or some other magic. Then it hit me that this was old-style thinking. Don't be fancy. Just keep regenerating each page as needed. If a page is regenerated a thousand times versus only once, who cares? There's plenty of cheap CPU available.

    The local queue of changes still bothers me a little because it adds a complication to the system. The local queue and the AWS SQS queue must be kept in sync. I understand that missing a change would be a disaster because the dependent pages would never be regenerated and nobody would ever know. The page would only be regenerated the next time an event happened to impact it. If pages are regenerated frequently this isn't a serious problem, but seldom-touched pages may never be regenerated again. Personally I would drop the local queue. SQS goes down infrequently. When it does go down I would record that fact and regenerate all the pages when SQS comes back up. This is a simpler and more robust architecture, assuming SQS is mostly reliable.

    Another feature I have implemented in similar situations is to set up a rolling page regeneration schedule where a subset of pages is periodically regenerated, even if no event was detected that would cause a page to be regenerated. This protects against any dropped events that might cause data to become undetectably stale. Over a few days, weeks, or whatever, every page is regenerated. It's a relatively cheap way to make a robust system resilient to failures. A rough sketch of this follows.
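A hypothetical sketch of that rolling schedule, reusing the same queue as the worker sketch above (the slice size and the page-listing call are assumptions):

```python
# Hedged sketch: each run enqueues 1/30th of all pages, so everything is rebuilt
# at least once a month even if a change event was dropped.
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/page-regen"  # placeholder
SLICES = 30  # a full sweep roughly every 30 runs

def list_all_page_keys():
    return []  # placeholder: pull the full page list from your database or an S3 inventory

def enqueue_rolling_slice(run_number):
    for i, key in enumerate(list_all_page_keys()):
        if i % SLICES == run_number % SLICES:
            sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=key)
```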


    Tuesday, August 12, 2008

    Strategy: Limit The New, Not The Old

    One of the most popular and effective scalability strategies is to impose limits (GAE Quotas, Fotolog, Facebook) as a means of protecting a website against service-destroying traffic spikes. Twitter will reportedly limit the number of followers to 2,000 in order to thwart follow spam. This may also allow Twitter to make some bank by going freemium and charging for adding more followers. Agree or disagree with Twitter's strategies, the more interesting aspect for me is: how do you introduce new policies into an already established ecosystem?

    One approach is the big bang. Introduce all changes at once and let everyone adjust. If users don't like it they can move on. The hope is, however, that most users won't be impacted by the changes and that those who are will understand it's all for the greater good of their beloved service. Casualties are assumed, but the damage will probably be minor. Now in Twitter's case the people with the most followers tend to be opinion leaders who shape much of the blognet echo chamber. Pissing these people off may not be your best bet. What to do?

    Shegeeks.net makes a great proposal: Limit The New, Not The Old. The idea is to impose the limits only on new accounts, not the old. Old users are happy and new users understand what they are getting into. The reason I like this suggestion so much is that it has deep historical roots, all the way back to the fall of the Roman republic and the rise of the empire following the agrarian reform laws passed in 133 BC.

    In ancient Rome, property and power, as they tend to do, became concentrated in the hands of a few wealthy land owners. Let's call them the nobility. The greatness that was Rome was founded on an agrarian society. People made modest livings on small farms. As power concentrated, small farmers were kicked off the land and forced to move to the city. Slaves worked the land while citizens remained unemployed. And cities were no place to make a life. Civil strife broke out. Pliny said that "it was the large estates which destroyed Italy." To redress the imbalance and reestablish traditional virtuous Roman character, "In 133 BC, Tiberius Sempronius Gracchus, the plebeian tribune, passed a series of laws attempting to reform the agrarian land laws; the laws limited the amount of public land one person could control, reclaimed public lands held in excess of this, and attempted to redistribute the land, for a small rent, to farmers now living in the cities."

    At this time Rome was a republic. It had a mixed constitution with two consuls at the top, a senate, and tribunes below. Tribunes represented the people. The agrarian laws passed by Gracchus set off a war between the nobility and the people. The nobles obviously didn't want to lose their land and set in motion a struggle eventually leading to the defeat of the people, the fall of the Republic, and the rise of the Empire.

    Machiavelli, despite his unpopular press, was a big fan of the republican form of government, and pinpointed the retroactive and punitive nature of the agrarian laws as the reason the Republic fell. He didn't say reform wasn't needed, but he thought the mistake was in how the reform was carried out. By making the laws retroactive it challenged the existing order. And by making the laws so punitive it made the nobility willing to fight to keep what they already had. This divided the people and caused a class war. In fact, this is the origin of the ex post facto clause in the US Constitution, which says you can't pass a law with retroactive effect.

    Machiavelli suggested the reforms should have been carried out so they only impacted the future. For example, in newly conquered lands small farmers would be given more land. In this way the nobility wouldn't be directly challenged, reform could happen slowly over time, and the republic would have been preserved. Cool parallel, isn't it? Hey, who says there's nothing to learn from history! Let's just hope history doesn't repeat itself yet again.


    Monday, August 4, 2008

    A Bunch of Great Strategies for Using Memcached and MySQL Better Together

    The number one recommendation for speeding up a website is almost always to add cache and more cache. And after that add a little more cache just in case. Memcached is almost always given as the recommended cache to use. What we don't often hear is how to effectively use a cache in our own products. MySQL hosted two excellent webinars (referenced below) on the subject of how to deploy and use memcached. The star of the show, other than MySQL of course, is Farhan Mashraqi of Fotolog. You may recall we did an earlier article on Fotolog in Secrets to Fotolog's Scaling Success, which was one of my personal favorites. Fotolog, as they themselves point out, is probably the largest site nobody has ever heard of, pulling in more page views than even Flickr. Fotolog has 51 instances of memcached on 21 servers with 175G in use and 254G available. As a large, successful photo-blogging site they have very demanding performance and scaling requirements. To meet those requirements they've developed a sophisticated approach to using memcached that others can learn from and emulate. We'll cover some of the highlights from the webinars down below the fold.

    What is Memcached?

    The first part of the first webinar gives a good overview of memcached. Just in case the rock you've been hiding under recently disintegrated, you may not have heard about memcached (as if). Memcached is: a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.

    Memcached essentially creates an in-memory shard on top of a pool of servers from which an application can easily get and set up to 1MB of unstructured data. Memcached is two hash tables, one from the client to the server and another one inside the server. The magic is that none of the memcached servers need know about each other. To scale up you just add more servers and the key hashing algorithm makes it all work out right. Memcached is not redundant, has no failover, and has no authentication. It's a simple server for storing and getting data; the complex bits must be implemented by applications.

    The rest of the first webinar is Farhan explaining in wonderful detail how they use memcached at Fotolog. The beginning of the second webinar covers much of the same ground as the first. In the last half of the second webinar Farhan gives a number of excellent code examples showing how memcached is installed, used, and managed. If you've never used memcached before there's a lot of good stuff presented.
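For flavor, here is the get/set model in a few lines of Python using the pymemcache client (Fotolog's own code is Java; this is purely illustrative, and the hostnames are made up). The client hashes keys across the server list; the servers never talk to each other.

```python
# Minimal get/set illustration of the memcached model described above.
from pymemcache.client.hash import HashClient

client = HashClient([("cache1.example.com", 11211), ("cache2.example.com", 11211)])

client.set("user:42:profile", b'{"name": "kim"}', expire=600)  # values up to ~1MB
profile = client.get("user:42:profile")                        # returns None on a miss
```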

    Memcached and MySQL Go Better Together

    There's a little embrace and extend in the webinar as MySQL cluster is presented several times as doing much the same job as memcached, but more reliably. However, the recommended approach for using memcached and MySQL is:
  • Write scale the database by sharding. Partition data across multiple servers so more data can be written in parallel. This avoids a single server becoming the bottleneck.
  • Front MySQL with a memcached farm to scale reads. Applications access memcached first for data, and if the data is not in memcached then the application tries the database (a minimal read-through sketch follows this list). This removes a great deal of the load from the database so it can continue to perform its transactional duties for writes. In this architecture the database is still the system of record for the true value of data.
  • Use MySQL replication for reliability and read query scaling. There's an effective limit to the number of slaves that can be supported, so just adding slaves won't work as a scaling strategy for larger sites.

    Using this approach you get scalable reads and writes along with high availability. Given that MySQL has a cache, why is memcached needed at all?

  • The MySQL cache is associated with just one instance. This limits the cache to the memory of one server. If your data is larger than the memory of one server then using the MySQL cache won't work. And if the same object is read from another instance it's not cached.
  • The query cache invalidates on writes. You build up all that cache and it goes away when someone writes to the table. Your cache may not be much of a cache at all depending on usage patterns.
  • The query cache is row based. Memcached can cache any type of data you want and it isn't limited to caching database rows. Memcached can cache complex objects that are directly usable without a join.
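Here is the read-through pattern from the list above as a minimal, hypothetical sketch: check the cache, fall back to the database (still the system of record), then populate the cache. The db.query call and the key scheme are placeholders, not Fotolog's API.

```python
# Hedged read-through sketch: cache first, database on a miss, then repopulate.
import json
from pymemcache.client.base import Client

cache = Client(("127.0.0.1", 11211))

def get_photo_list(user_id, db):
    key = f"photolist:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    rows = db.query("SELECT photo_id, title FROM photos WHERE user_id = %s", user_id)  # placeholder DB API
    cache.set(key, json.dumps(rows), expire=300)  # short TTL bounds staleness
    return rows
```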

    Cache everything that is slow to query, fetch, or calculate.

    This is Fotolog's rule of deciding what to cache. What is considered slow depends on your requirements. But when something becomes slow it's a candidate for caching.

    Fotolog's Caching Typology

    Fotolog has come up with an interesting typology of their different caching strategies:
  • Non-Deterministic Cache - the classic memcached model of reading through the cache and writing to the database.
  • State Cache - maintain current application state in cache.
  • Proactive Cache - push changes from the database directly to the cache.
  • File System Cache - save NFS load by serving files from the cache instead of the file system.
  • Partial Page Cache - cache displayable page elements, not just data.
  • Application Based Replication - use a client side API to hide all the low level details of interacting with the cache.

    Non-Deterministic Cache

    This is the typical way of using memcached. I think the non-determinism comes in because an application can't depend on data being in the cache. The data may have been evicted because a slab is full or because the data simply hasn't been added yet. Other Fotolog strategies are deterministic, which means the application can assume the data is always present.

    Useful For

  • Ideal for complex objects that are read several times. Especially for sharded environments where you need to collect data from multiple shards.
  • Good replacement for the MySQL query cache.
  • Caching relationships and other lists.
  • Slow data that's used across many pages.
  • Don't cache if it's more taxing to cache than what you'll save.
  • Tag clouds and auto-suggest lists. For example, when a photo is uploaded, the photo is shown on the page of every friend. These lists are taxing to calculate so they are cached.

    Usage Steps

  • Check memcached for your data key.
  • If the data doesn't exist in the cache then check database.
  • If the data exists in the database then populate memcached.

    Stats

  • 45 memcached instances dedicated to nondeterministic cache
  • Each instance (on average):
      * ~440 gets per second
      * ~40 sets per second
      * ~11 gets/set
    Fotolog likes to characterize their caching policy in terms of the ratio of gets to sets.

    Potential Problems

  • While memcached is usually a light CPU user, Fotolog ran their cache on their application servers and found some problems:
      * 90% CPU usage
      * Memory garbage collected nearly once a minute
      * Blocking on memcached on the app servers
  • Operations are not transactional. One work around is to set expirations so stale data doesn't stay around long.

    State Cache

    Keeps the current state of an application in cache.

    Useful For

  • Expensive operations.
  • Sessions. If a memcached server goes down, just make people log in again.
  • Keep track of who's online and their current status, especially for IM applications. Use memcached or the load would cripple the database.

    Usage Steps

    It's a form of non-deterministic caching so the same steps apply.

    Stats

  • 9G dedicated. Depending on the number of users, all of the users can be kept in cache.

    Deterministic Cache

    Keep all data for particular database tables in the cache. An application always assumes that the data it needs is in the cache (deterministic); applications never go to the database for data. Applications don't have to check memcached before accessing the data. The data will exist in the cache because the database is essentially loaded into the cache and is always kept in sync.

    Useful For

  • Read scalability. All reads go through the cache and the database is completely offloaded.
  • Ideal for caching things that have no expiration.
  • Heavily accessed data/objects/lists
  • User credentials
  • User profiles
  • User preferences
  • Active media belonging to users, like photo lists.
  • Outsourcing logins to memcached. Don't hit database on login. All logins are processed through the cache.

    Usage Steps

  • Multiple dedicated cache pools are maintained. Instead of one large pool, make separate standalone pools. Multiple pools are necessary for high availability. If a cache pool goes down then applications switch over to the next pool. If multiple pools did not exist and the pool went down, the database would be swamped with the load.
  • All cache pools are maintained by the application. A write to the database means writing to multiple memcached pools. When an update to a user profile happens, for example, the update has to be replicated to multiple caches from that point onward.
  • When the site starts after a shutdown, the deterministic cache is populated before the site comes up. The site is rarely rebooted, so this is a rare occurrence.
  • Reads could also be load balanced against the cache pools for better performance and higher scalability.

    Stats

  • ~ 90,000 gets / second across cache cluster
  • ~ 300 sets / second
  • get/set ratio of ~ 300

    Potential Problems

  • Must have enough memory to hold everything.
  • The database has rows and the cache may have objects. The database caching logic must know enough to create objects from the database schema.
  • Maintaining the multiple caches seems complex. Fotolog uses Java and Hibernate. They wrote their own client to handle rotation through pools.
  • Maintaining multiple caches adds a lot of overhead to the application. In practice there's little replication overhead compared to benefit.

    Proactive Caching

    Data magically shows up in the cache. As updates happen to the database the cache is populated based on the database change. Since the cache is updated as the database is updated the chances of data being in cache are high. It's non-deterministic caching with a twist.

    Useful For

  • Pre-Populating Cache: Keeping memcached updated minimizes calls to database if object not present.
  • “Warm up” cache in cases of cross data-center replication.

    Usage Steps

    There are typically three implementation approaches:
  • Parse binary log for updates. When an update is found perform the same operation on the cache.
  • Implement user defined functions. Set up triggers that call a UDF to update the cache. See http://tangent.org/586/Memcached_Functions_for_MySQL.html for more details.
  • Use the Blackhole gambit. Facebook is rumored to use the Blackhole storage engine to populate cache. Data written to a Blackhole table is replicated to cache. Facebook uses it more to invalidate data and for cross country replication. The advantage of this approach is the data is not replicated through MySQL which means there are no binary logs for the data and it's not CPU intensive.

    File System Caching

    NFS has significant overhead when used with a large number of servers. Fotolog originally stored XML files on a SAN and exposed them using NFS. They saw contention on these files so they put files in memcached. Big performance improvements were seen and it kept NFS mounts open for other requests. Smaller media can also be stored in the cache.

    Partial Page Caching

    Cache directly displayable page elements. The other caching strategies cache data used to create pages, but some things are still compute intensive and require a lot of work. So instead of just caching objects, prepare and cache entire page elements for reuse. For page elements that are accessed many times per second this can be a big win. For example: calculating top users in a region, popular photo list, and featured photo list. Especially when using sharding it can take some time to calculate these lists, so caching the resulting page elements makes a lot of sense.

    Application Based Replication

    Write data to the cache through your own API. The API hides implementation details like:
  • An application writes to one memcached client which writes to multiple memcached instances.
  • Where the cache pools are and how many there are.
  • Writing to multiple pools at the same time.
  • Rotating to another pool on a pool failure until the failed pool is brought back up and updated with data.

    This approach is very fast if not network bound and very cost effective. A toy version of such a client is sketched below.
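A toy Python version of such a client (Fotolog's real one is Java/Hibernate; the pool layout and error handling here are simplified assumptions): writes go to every pool, reads try the pools in order and rotate on failure.

```python
# Hedged sketch of an application-based replication client for multiple pools.
from pymemcache.client.hash import HashClient

class ReplicatedCache:
    def __init__(self, pools):
        # each pool is its own standalone memcached farm
        self.pools = [HashClient(servers) for servers in pools]

    def set(self, key, value, expire=0):
        for pool in self.pools:
            try:
                pool.set(key, value, expire=expire)
            except Exception:
                pass  # a failed pool is repopulated when it is rebuilt

    def get(self, key):
        for pool in self.pools:
            try:
                value = pool.get(key)
                if value is not None:
                    return value
            except Exception:
                continue  # rotate to the next pool on failure
        return None

# cache = ReplicatedCache([[("pool-a-1", 11211)], [("pool-b-1", 11211)]])
```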

    Miscellaneous

    There were a few suggestions on using memcached that didn't fit in any other section, so they're gathered here for posterity:
  • Have a lot of nodes to handle loss. Losing a node with a few nodes will cause a spike on the database as everything reloads. Having more servers means less database load on failure.
  • Use a warm standby that takes over the IP of a memcached server that fails. This means your clients will not have to update their cache lists.
  • Memcached can operate over UDP and TCP. Persistent connections are better because there's less overhead. Memcached is designed to handle thousands of connections.
  • Use separate memcached servers to reduce contention with applications.
  • Check that your slab sizes match the size of the data you are allocating or you could be wasting a lot of memory.

    Here are some additional strategies from the Memcached and MySQL tutorial:
  • Don't think row-level (database) caching, think complex objects.
  • Don't run memcached on your database server, give your database all the memory it can get.
  • Don't obsess about TCP latency - localhost TCP/IP is optimized down to an in-memory copy.
  • Think multi-get - run things in parallel whenever you can.
  • Not all memcached client libraries are made equal, do some research on yours.
  • Instead of invalidating your data, expire it whenever you can - memcached will do all the work.
  • Generate smart keys - e.g., on update, increment a version number, which will become part of the key (a small sketch follows this list).
  • For bonus points, store the version number in memcached - call it generation.
  • The latter will be added to memcached soon - as soon as Brian gets around to it.
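A small sketch of the versioned-key ("generation") tip, assuming pymemcache and a per-user counter; the key names are invented for illustration. Bumping the generation makes all old keys unreachable, so nothing has to be explicitly invalidated; the stale entries simply age out via LRU.

```python
# Hedged sketch: fold a generation number into every key instead of invalidating.
from pymemcache.client.base import Client

cache = Client(("127.0.0.1", 11211))

def _generation(user_id):
    gen = cache.get(f"gen:{user_id}")
    if gen is None:
        cache.set(f"gen:{user_id}", b"1")
        return b"1"
    return gen

def versioned_key(user_id, name):
    return f"{name}:{user_id}:v{_generation(user_id).decode()}"

def invalidate_user(user_id):
    # incr is atomic on the server; returns None if the counter doesn't exist yet
    if cache.incr(f"gen:{user_id}", 1) is None:
        cache.set(f"gen:{user_id}", b"1")
```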

    Final Thoughts

    Fotolog has obviously put a great deal of thought and effort into creating sophisticated scaling strategies using memcached and MySQL. What I'm struck with is the enormous amount of effort that goes into syncing rows and objects back and forth between the cache and the database. Shouldn't it be easier? What role is the database playing when the application makes such constant use of the object cache? Wouldn't more memory make the disk based storage unnecessary?

    Related Articles

  • Designing and Implementing Scalable Applications with Memcached and MySQL by Farhan Mashraqi from Fotolog, Monty Taylor from Sun, and Jimmy Guerrero from Sun
  • Memcached for Mysql Advanced Use Cases by Farhan Mashraqi of Fotolog
  • Memcached and MySQL tutorial by Brian Aker, Alan Kasindorf - Overview with examples of how a few companies use memcached. Good presentation notes by Colin Charles.
  • Strategy: Break Up the Memcache Dog Pile
  • Secrets to Fotolog's Scaling Success
  • Memcached for MySQL


    Sunday, July 20, 2008

    Strategy: Front S3 with a Caching Proxy

    Given S3's recent failure (Cloud Status tells the tale), Kevin Burton makes the excellent suggestion of fronting S3 with a caching proxy server. A caching proxy server can reply to service requests without contacting the specified server, by retrieving content saved from a previous request made by the same client or even other clients. This is called caching. Caching proxies keep local copies of frequently requested resources.

    In normal operation, when an asset (a user's avatar, for example) is requested, the cache is tried first. If the asset is found in the cache then it's returned. If the asset is not in the cache it's retrieved from S3 (or wherever) and cached. So when S3 goes down it's likely you can ride out the downtime by serving assets out of the cache. A minimal application-level sketch of this fallback follows the list below.

    This strategy only works when using S3 as a CDN. If you are using S3 for its "real" purpose, as a storage service, then a caching proxy can't help you... Amazon doesn't use S3 as a CDN either (see Amazon Not Building Out AWS To Compete With CDNs); they use Limelight Networks. Some proxy options are: Squid, Nginx, Varnish.

    Planaroo shares how a small startup responds to an S3 outage (summarized):

  • Up-to-date backups are a good thing. Keep current backups such that you can switch to a new URL for your assets. Easier said than done I think.
  • Switch it, don't fix it. Switch to your backup rather than wait for the system to come up quickly, because it may not.
  • Serve CSS, JavaScript, icons, and Google AJAX libraries from alternate sources. Don't rely on S3 or Google to always be able to serve your crown jewels.
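For completeness, here is the fallback idea sketched at the application level with boto3 (in practice you would front S3 with Squid, Varnish, or Nginx; the bucket name and the in-process cache are placeholders): refresh from S3 when you can, serve the stale local copy when S3 is erroring.

```python
# Hedged sketch: serve stale cached assets while S3 is down.
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

s3 = boto3.client("s3")
local_cache = {}  # key -> bytes; stand-in for a real disk or memcached cache

def get_asset(key, bucket="my-assets"):
    try:
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        local_cache[key] = body          # refresh the cached copy
        return body
    except (ClientError, EndpointConnectionError):
        return local_cache.get(key)      # serve stale during the outage, or None
```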


    Friday, July 18, 2008

    Robert Scoble's Rules for Successfully Scaling Startups

    Robert Scoble, in an often poignant FriendFeed thread commiserating over PodTech's unfortunate end, shared what he learned about creating a successful startup. Here's a summary of Robert's rules and why Machiavelli just may agree with them:

  • Have a story.
  • Have everyone on board with that story.
  • If anyone goes off of that story, make sure they get on board immediately or fire them.
  • Make sure people are judged by the revenues they bring in. Those that bring in revenues should get to run the place. People who don't bring in revenues should get fewer and fewer responsibilities, not more and more.
  • Work ONLY for a leader who will make the tough decisions.
  • Build a place where excellence is expected, allowed, and is enabled.
  • Fire idiots quickly.
  • If your engineering team can't give a media team good measurements, the entire company is in trouble. Only things that are measured ever get improved.
  • When your stars aren't listened to the company is in trouble.
  • Getting rid of the CEO, even if it's all his fault, won't help unless you replace him/her with someone who is visionary and who can fix the other problems.

    An excellent list that meshes with much of my experience, which is why I thought it worth sharing :-) My take-away from Robert's rules can be summarized in one word: focus. Focus is the often overlooked glue binding groups together so they can achieve great things...

    When Robert says have "a story," that to me is because a story provides a sort of "decision box" giving a group its organizing principle for determining everything that comes later. Have a decision? Look at your story for guidance. Have a problem? Look at your story for guidance. Following your story's guidance is another matter completely. Without a management strong enough to act in accordance with the story, centrifugal forces tear an organization apart. It takes a lot of will to keep all the forces contained inside the box. Which is why I think Robert demands a "focused" leadership.

    Machiavelli calls this idea "virtue." Princes must exhibit virtue if they are to keep their land. Machiavelli doesn't mean virtue in the modern sense of be good and eat your peas, but in the ancient sense of manliness (sorry ladies, this was long ago). Virtue shares the same root as virility. So to act virtuously is to be bold, to act, to take risks, to be aggressive, and to make the hard unpopular decisions. For Machiavelli that's the only way to reach your goals in accordance with how the world really works, not how it ought to work. Any Prince who acts otherwise will lose their realm.

    Firing people (yes, I've been fired) not contributing to your story is by Machiavelli's definition a virtuous act. It is a messy, ugly business nobody likes doing. It requires admitting a mistake was made, a lot of paperwork, and looking like the bad guy. And that's why people are often not fired as Robert suggests. The easy way is to just ignore the problem, but that's not being virtuous. Addition by subtraction is such a powerful force precisely because it maintains group focus on excellence and purpose. It's a statement that the story really matters. Keeping people who aren't helping is a vampire on a group's energy. It slowly drains away all vivacity until only a pale corpse remains.

    Robert's rules may seem excessively ruthless and cruel to many. Decidedly unmodern. But in true Machiavellian fashion let's ask what is preferable: a strong, secure, long-lived state ruled by virtue, or a state ruled according to how the world ought to work that is constantly at the mercy of every invader?

    If you are still hungry for more starter advice, Gordon Ramsay has some unintentionally delicious thoughts on developing software as well. Serve yourself at Gordon Ramsay On Software and Gordon Ramsay's Lessons for Software Take Two. Kevin Burton shares seven deadly sins startups should avoid and makes an inspiring case for how his company is stronger and better able to compete by not taking VC funds. Really interesting. Diary of a Failed Startup says to solve a problem, not to build a platform to solve a class of problems. Truer words were never spoken.


    Wednesday, July 16, 2008

    The Mother of All Database Normalization Debates on Coding Horror

    Jeff Atwood started a barn burner of a conversation in Maybe Normalizing Isn't Normal on how to create a fast scalable tagging system. Jeff eventually asks that terrible question: which is better -- a normalized database, or a denormalized database? And all hell breaks loose. I know, it's hard to imagine database debates becoming contentious, but it does happen :-) It's lucky developers don't have temporal power or rivers of blood would flow. Here are a few of the pithier points (summarized):

  • Normalization is not magical fairy dust you sprinkle over your database to cure all ills; it often creates as many problems as it solves. (Jeff)
  • Normalize until it hurts, denormalize until it works. (Jeff)
  • Use materialized views, which are tables created and maintained by your RDBMS. A materialized view will act exactly like a de-normalized table would - except you keep your original normalized structure and any change to the original data will propagate to the view automatically. (Goran)
  • According to Codd and Date table names should be singular, but what did they know. (Pablo, LOL)
  • Denormalization is something that should only be attempted as an optimization when EVERYTHING else has failed. Denormalization brings with it its own set of problems. You have to deal with the increased set of writes to the system (which increases your I/O costs), you have to make changes in multiple places when data changes (which means either taking giant locks - ugh - or accepting that there might be temporary or permanent data integrity issues), and so on. (Dare Obasanjo)
  • What happens, is that people see "Normalisation = Slow", that makes them assume that normalisation isn't needed. "My data retrieval needs to be fast, therefore I am not going to normalise!" (Tubs)
  • You can read fast and store slow or you can store fast and read slow. The biggest performance killer is so called physical read. Finding and accessing data on disk is the slowest operation. Unless child table is clustered indexed and you're using the cluster index in the join you will be making lots of small random access reads on the disk to find and access the child table data. This will be slow. (Goran)
  • The biggest scalability problems I face are with human processes, not computer processes. (John)
  • Don't forget that the fastest database query is the one that doesn't happen, i.e. caching is your friend. (Chris)
  • Normalization is about design, denormalization is about optimization. (Peter Becker)
  • You're just another knucklehead. (BuggyFunBunny)
  • Let's unroll our loops next. RDBMS is about shared *transactional data*. If you really don't care about keeping the data right all the time, then how you store it doesn't matter. (Christog)
  • Jeff, are you awake? (wiggle)
  • Denormalization may be all well and good, when you need the performance and your system is STABLE enough to support it. Doing this in a business environment is a recipe for disaster, ask anyone who has spent weeks digging through thousands of lines of legacy code, making sure to support the new and absolutely required affiliation_4. Then do the whole thing over again 3 months later when some crazy customer has five affiliations. (Sean)
  • Do you sex a cat, or do you gender it? (Simon)
  • This is why this article is wrong, Jeff. This is why you're an idiot, in case the first statement wasn't clear enough. You just gave an excuse to be lazy to someone who doesn't have a clue. (Marcel)
  • This is precisely why you never optimize until *after* you profile (find objectively where the bottlenecks are). (TED)
  • Another great way to speed things up is to do more processing outside of the database. Instead of doing 6 joins in the database, have a good caching plan and do some simple joining in your application. (superjason)
  • Lastly - no one seems to have mentioned that a decently normalized db greatly reduces application refactoring time. As new requirements come along, you don't have to keep pulling stuff apart and putting it back in new configurations of the db. (Ed)
  • Keep a de-normalized replica for expensive operations (e.g. reports), Cache Results for repeat queries (Memcache), Partition the database for scalability (vertical or horizontal) (Gareth)
  • Speaking from long experience, if you don't normalize, you will have duplicates. If you don't have data constraints, you will have invalid data. If you don't have database relational integrity, you will have orphan "child" records, etc. Everybody says "we rely on the application to maintain that", and it never, never does. (A. Lloyd Flanagan)
  • I don't think you can make any blanket statements on normal vs. non-normal form. Like everything else in programming, it all depends on requirements and intended goals. (Wayne)
  • De-normalization is for reporting, not for OLTP. (Eric)
  • Your six-way join is only needed because you used surrogate instead of natural keys. When you use natural keys, you will discover you need much fewer joins because often data you need is already present as foreign keys. (Leandro)
  • What I think is funny is the number of people who think that because they use LINQ or Hibernate they aren't affected by these issues. (Sam)
  • You miss the point of normalization entirely. Normalization is about optimizing large numbers of small CrUD operations, and retrieving small sets of data in order to support those crud operations. Think about people modifying their profiles, recording cash registers operations, recording bank deposits and withdrawals. Denormalization is about optimizing retrieval of large sets of data. Choosing an efficient database design is about understanding which of those operations is more important. (RevMike)
  • Multiple queries will hurt performance much less than the multi-join monstrosity above that will return indistinct and useless data. (Chris)
  • Cache the generated view pages first. Then cache the data. You have to think about your content- very infrequently will anyone be updating it, it's all inserts. So you don't have to worry about normalization too much. (Matt)
  • I wonder if one factor at play here is that it's very easy to write queries for de-normalized data, but it's relatively hard to write a query that scales well to a large data set. (Thomi)
  • Denormalization is the last possible step to take, yet the first to be suggested by fools. (Jeremy)
  • There is a simple alternative to denormalisation here -- to ensure that the values for a particular user_id are physically clustered. (David)
  • The whole issue is pretty simple 99% of the time - normalized databases are write optimized by nature. Since writing is slower than reading and most databases are for CRUD, normalizing makes sense. Unless you are doing *a lot* more reading than writing. Then, if all else fails (indexes, etc.), create de-normalized tables (in addition to the normalized ones). (Rob)
  • Don't fear normalization. Embrace it. (Charles)
  • I read on and discovered all the loons and morons who think they know a lot more than they do about databases. (Paul)
  • Put all the indexable stuff into the users table, including zipcode, email_address, even mobile_phone -- hey, this is the future!; Put the rest of the info into a TEXT variable, like "extra_info", in JSON format. This could be educational history, or anything else that you would never search by; If you have specific applications (think facebook), create separate tables for them and join them onto the user table whenever you need. (Greg)
  • How is the data being used? Rapid inserts like Twitter? New user registration? Heavy reporting? How one stores data vs. how one uses data vs. how one collects data vs. how timely must new data be visible to the world vs. should be put OLTP data into a OLAP cube each night? etc. are all factors that matter. (Steve)
  • It might be possible to overdo it, but trust me, I have had 20 times the problems with denormalized data than with normalized. (PRMAN)
  • Your LOGICAL model should *always* be fully normalized. After all, it is the engine that you derive everything else from. Your PHYSICAL model may be denormalized or use system specific tools (materialized views, cache tables, etc) to improve performance, but such things should be done *after* the application level solutions are exhausted (page caching, data caches, etc.) (Jonn)
  • For very large scale applications I have found that application partitioning can go a long way to giving levels of scalability that monolithic systems fail to provide: each partition is specialized for the function it provides and can be optimized heavily, and when you need the parts to co-operate you bind the partitions together in higher level code. (John)
  • People don't care what you put into a database. They care what you can get out of it. They want freedom and flexibility to grow their business beyond "3 affiliations". (PRMan)
  • I normalise, then have distinct (conceptually transient) denormalised cache tables which are hammered by the front-end. I let my model deal with the nitty-gritty of the fix-ups where appropriate (handy beginUpdate/endUpdate-type methods mean the denormalised views don't get rebuilt more than necessary). (Mo)
  • Stop playing with mySQL. (Jonathan)
  • What a horrible, cowboy attitude to DB design. I hope you don't design any real databases. (Dave)
  • IOW, scalability is not a problem, until it is. Strip away the scatalogical reference, and all you have is a boring truism. (Yawn)
  • Is my Site OLTP? If the answer is yes then Normalize. Is my site OLAP? If the answer is yes then De-Normalize! (WeAreJimbo)
  • This is a dangerous article, or perhaps you just haven't seen the number of horrific "denormalised" databases I have. People use these posts as excuses to build some truly horrific crimes against sanity. (AbGenFac)
  • Be careful not to confuse a denormalised database with a non-normalised database. The former exists because a previously normalised database needed to be 'optimised' in some way. The latter exists because it was 'designed' that way from scratch. The difference is subtle, but important. (Bob)

    OK, more than a few quotes. There's certainly no lack of passion on the issue! One thing I would add is to organize your application around an application-level service layer rather than allowing applications to access the database directly at a low level. Amazon is a good example of this approach. Many of the denormalization comments have to do with the problems of data inconsistency, which is of course true, because that's why normalization exists. Many of these problems can be reduced if there's a single service access point over which to get data. It's when data access is spread throughout an application that we see serious problems.

    Related Articles

  • Denormalization Patterns by Kenneth Downs
  • When Databases Lie: Consistency vs. Availability in Distributed Systems by Dare Obasanjo
  • Stored procedure reporting & scalability by Jason Young
  • When Not to Normalize your SQL Database by Dare Obasanjo


    Tuesday, July 15, 2008

    ZooKeeper - A Reliable, Scalable Distributed Coordination System 

    ZooKeeper is a highly available and reliable coordination system. Distributed applications use ZooKeeper to store and mediate updates to key configuration information. ZooKeeper can be used for leader election, group membership, and configuration maintenance. In addition ZooKeeper can be used for event notification, locking, and as a priority queue mechanism. It's a sort of central nervous system for distributed systems where the role of the brain is played by the coordination service, axons are the network, processes are the monitored and controlled body parts, and events are the hormones and neurotransmitters used for messaging. Every complex distributed application needs a coordination and orchestration system of some sort, so the ZooKeeper folks at Yahoo decided to build a good one and open source it for everyone to use. The target market for ZooKeeper is multi-host, multi-process C and Java based systems that operate in a data center.

    ZooKeeper works by having distributed processes coordinate with each other through a shared hierarchical name space that is modeled after a file system. Data is kept in memory and is backed up to a log for reliability. By using memory ZooKeeper is very fast and can handle the high loads typically seen in chatty coordination protocols across huge numbers of processes. Using a memory based system also means you are limited to the amount of data that can fit in memory, so it's not useful as a general data store. It's meant to store small bits of configuration information rather than large blobs. Replication is used for scalability and reliability, which means it prefers applications that are heavily read based. Typical of hierarchical systems, you can add nodes at any point of a tree, get a list of entries in a tree, get the value associated with an entry, and get notification of when an entry changes or goes away. Using these primitives and a little elbow grease you can construct the higher level services mentioned above.

    Why would you ever need a distributed coordination system? It sounds kind of weird. That's more the question I'll be addressing in this post rather than how it works, because the slides and the video do a decent job explaining at a high level what ZooKeeper can do. The low level details could use another paper however. Reportedly it uses a version of the famous Paxos algorithm to keep replicas consistent in the face of the failures most daunting. What's really missing is a motivation showing how you can use a coordination service in your own system, and that's what I hope to provide...

    Kevin Burton wants to use ZooKeeper to configure external monitoring systems like Munin and Ganglia for his Spinn3r blog indexing web service. He proposes each service register its presence in a cluster with ZooKeeper under the tree "/services/www." A Munin configuration program will add a ZooKeeper watch on that node so it will be notified when the list of services under /services/www changes. When the Munin configuration program is notified of a change it reads the service list and automatically regenerates a munin.conf file for the service. Why not simply use a database? Because of the guarantees ZooKeeper makes about its service:

  • Watches are ordered with respect to other events, other watches, and asynchronous replies. The ZooKeeper client libraries ensure that everything is dispatched in order.
  • A client will see a watch event for a znode it is watching before seeing the new data that corresponds to that znode.
  • The order of watch events from ZooKeeper corresponds to the order of the updates as seen by the ZooKeeper service.

    You can't get these guarantees from an event system plopped on top of a database, and these are the sort of guarantees you need in a complex distributed system where connections drop, nodes fail, retransmits happen, and chaos rules the day. What rules the night is too terrible to contemplate. For example, it's important that a service-up event is seen after the service-down event, or you may unnecessarily drop revenue producing work because of an out-of-order event. Not that I would know anything about this mind you :-)

    A weakness of ZooKeeper is that changes which happen while a watch is being reset can be missed: because watches are one time triggers and there is latency between getting the event and sending a new request to set a new watch, you cannot reliably see every change that happens to a node in ZooKeeper. Be prepared to handle the case where the znode changes multiple times between getting the event and setting the watch again. (You may not care, but at least realize it may happen.) This means that ZooKeeper is a state based system more than an event system. Watches are set as a side-effect of getting data, so you'll always have a valid initial state and on any subsequent change event you'll refresh to get new values. If you want to use events to log when and how something changed, for example, then you can't do that. You would have to include change history in the data itself.

    Let's take a look at another example of where ZooKeeper could be useful. Picture a complex backend system running on, let's say, 100 nodes (maybe a lot less, maybe a lot more) in a data center. For example purposes assume the system is an ad system for serving advertisements to web sites. Ad systems are complex beasts that require a fair bit of coordination. Imagine all the subsystems needing to run on those 100 nodes: database, monitoring, fraud detectors, beacon servers, web server event log processors, failover servers, customer dashboards, targeting engines, campaign planners, campaign scenario testers, upgrades, installs, media managers, and so on. There's a lot going on. Now imagine the power in the data center flips and all the machines power on. How do all the processes across all the hosts know what to do? Now imagine everything is up and a few machines go down. How do all the processes know what to do in this situation? This is where a coordination service comes in.

    A coordination service acts as the backplane over which all these subsystems figure out what they should do relative to all the other subsystems in a product. How, for example, do the ad servers know which database to use? Clearly there are many options for solving this problem. Using standard DNS naming conventions is one. Configuration files are another. Hard coding is still a favorite. Using a bootstrap service locator service is yet another (assuming you can bootstrap the bootstrap service). Ideally any solution must work just as well during unit testing, system testing, and deployment. In this scenario ZooKeeper acts as the service locator. Each process goes to ZooKeeper and finds out which is the primary database. If a new primary is elected, say because a host fails, then ZooKeeper sends an event that allows everyone dependent on the database to react by getting the new primary database.
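A minimal sketch of the registration-and-watch pattern from the /services/www example above, using the kazoo Python client (ZooKeeper's native clients are C and Java; the hostnames and config writer here are placeholders):

```python
# Hedged sketch: ephemeral service registration plus a children watch.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # placeholder ensemble
zk.start()

def write_munin_conf(hosts):
    print("regenerating munin.conf for", hosts)  # placeholder for the real writer

# --- on each web service host: advertise yourself with an ephemeral znode ---
zk.ensure_path("/services/www")
zk.create("/services/www/host-42", b"10.0.0.42:8080", ephemeral=True)
# the znode vanishes automatically if this process or its session dies

# --- in the Munin configuration program: watch the membership list ---
@zk.ChildrenWatch("/services/www")
def on_membership_change(children):
    # called once with the current list, then again whenever it changes
    write_munin_conf(children)
```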
    Having complicated retry logic in application code to fail over to another database server is simply a disaster, as every programmer will mess it up in their own way. Using a coordination service nicely deals with the problem of services locating other services in all scenarios. Of course, using a proxy like MySQL Proxy would remove even more application level complexity in dealing with failover and request routing.

    How did the database servers decide which role they would play in the first place? All the database servers boot up and say "I'm a database server brave and strong, what's my role in life? Am I a primary or a secondary server? Or if I'm a shard, what key range do I serve?" If 10 servers are database servers, negotiating roles can be a very complicated and error prone process. A declarative approach, with configuration files that specify a failover ring, is a good approach, but it's a pain to get working in local development and test environments as the machines are always changing. It's easier to let the database servers come up and organize themselves, on initial role election and in failure scenarios. The advantage of this system is that it can run locally on one machine or on a dozen machines in the data center with very little effort. ZooKeeper supports this type of coordination behavior.

    Now let's say I want to change the configuration of ad targeting state machines currently running in 40 processes on 40 different hosts. How do I do that? The first approach is no approach. Most systems make it so a new code release has to happen, which is very sloooow. Another approach is a configuration file. Configuration is put in a distribution package and pushed to all nodes. Each process then periodically checks to see if the configuration file has changed, and if it has, the new configuration is read. That's the basics. Infinite variations can follow. You can have configuration for different subsystems. There's complexity because you have to know what packages are running on which nodes. You have to deal with rollback in case all packages don't push correctly. You have to change the configuration, make a package, test it, then push it to the data center operations team, which may take a while to perform the upgrade. It's a slow process. And when you get into changes that impact multiple subsystems it gets even more complicated. Another approach I've taken is to embed a web server in each process so you can see the metrics and change the configuration for each process on the fly. While powerful for a single process, it's harder to manipulate sets of processes across a data center using this approach.

    Using ZooKeeper, I can store my state machine definition as a node which is loaded from the static configuration collected from every distribution package in a product. Every process dependent on that node can register as a watcher when it initially reads the state machine. When the state machine is updated, all dependent entities get an event that causes them to reload the state machine into the process. Simple and straightforward. All processes will eventually get the change and any rebooting processes will pick up the new state machine on initialization. A very cool way to reliably and centrally control a large distributed application. One caveat is I don't see much activity on the ZooKeeper forum. And the questions that do get asked largely go unanswered. Not a good sign when considering such a key piece of infrastructure.
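A sketch of that configuration-distribution pattern with kazoo, assuming the state machine definition lives in a single znode (the path and the reload function are invented for illustration):

```python
# Hedged sketch: every process watches one config znode and reloads on change.
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181")  # placeholder
zk.start()

CONFIG_PATH = "/config/ad-targeting/state-machine"  # hypothetical path

def load_state_machine(definition):
    print("reloading state machine:", definition)   # placeholder for the real reload

@zk.DataWatch(CONFIG_PATH)
def on_config_change(data, stat):
    # fires immediately with the current value, then on every update, so a
    # restarted process always picks up the latest definition
    if data is not None:
        load_state_machine(json.loads(data))

# An operator (or build system) pushes a new definition with a single call:
# zk.set(CONFIG_PATH, json.dumps(new_definition).encode())
```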
Another caveat that may not be obvious on first reading is that your application state machine using ZooKeeper will have to be intimately tied to ZooKeeper's state machine. When a ZooKeeper server dies, for example, your application must process that event and reestablish all your watches on a new server. When a watch event comes your application must handle the event and set new watches. The algorithms to carry out higher level operations like locks and queues are driven by multi-step state machines that must be correctly managed by your application. And as ZooKeeper deals with state that is probably stored in your application it's important to worry about thread safety. Callbacks from the ZooKeeper thread could access shared data structures. An Actor model where you dump ZooKeeper events into your own Actor queues could be a very useful application architecture here for synthesizing different state machines in a thread safe manner.

    Some Fast Facts

  • How is data partitioned across multiple machines? Complete replication in memory. (Yes, this is limiting.)
  • How do updates happen (interaction across machines)? All updates flow through the master and are considered complete when a quorum confirms the update.
  • How do reads happen (is getting a stale copy possible)? Reads go to any member of the cluster. Yes, stale copies can be returned. Typically, these are very fresh, however.
  • What is the responsibility of a leader? To assign serial id's to all updates and confirm that a quorum has received the update.
  • There are several limitations that stand out in this architecture:
      - Complete replication limits the total size of data that can be managed using Zookeeper. This is acceptable in some applications, not acceptable in others. In the original domain of Zookeeper (management of configuration and status), it isn't an issue, but Zookeeper is good enough to encourage creative misuse where this can become a problem.
      - Serializing all updates through a single leader can be a performance bottleneck. On the other hand, it is possible to push 50K updates per second through a leader and the associated quorum, so this limit is pretty high.
      - The data storage model is completely non-relational.
    These answers were provided by Ted Dunning on the Cloud Computing group.

    Related Articles

  • An Introduction to ZooKeeper Video (Hadoop and Distributed Computing at Yahoo!) (PDF)
  • ZooKeeper Home, Email List, and Recipes (which has some odd connotations for a Zoo).
  • The Chubby Lock Service for Loosely-Coupled Distributed Systems from Google
  • Paxos Made Live – An Engineering Perspective by Tushar Chandra, Robert Griesemer, and Joshua Redstone from Google.
  • Updates on Open Source Distributed Consensus by Kevin Burton
  • Using ZooKeeper to configure External Monitoring Systems by Kevin Burton
  • Paxos Made Simple by Leslie Lamport
  • Hyperspace - API description of Hyperspace, a Chubby-like service
  • Notes on ZooKeeper at the Hadoop Summit by James Hamilton.


    Thursday, July 10, 2008

    Can cloud computing smite down evil zombie botnet armies?

    In the "more cool stuff I've never heard of before" department is something called Self Cleansing Intrusion Tolerance (SCIT). Botnets are created when vulnerable computers live long enough to become infected with the will to do the evil bidding of their evil masters. Security is almost always about removing vulnerabilities (a process which to outside observers often looks like a dog chasing its tail). SCIT takes a different approach: it works on the availability angle. Something I never thought of before, but which makes a great deal of sense once I thought about it.

    With SCIT you stop and restart VM instances every minute (or whatever, depending on your desired vulnerability window). This short exposure window means worms and viruses do not have long enough to fully infect a machine and carry out a coordinated attack. A machine is up for a while. Does work. And then is torn down again, only to be reborn as a clean VM with no possibility of infection (unless of course the VM mechanisms become infected). It's like curing cancer by constantly moving your consciousness to new blemish-free bodies. Hmmm...

    SCIT is really a genius approach to scalable (I have to work in scalability somewhere) security and fits perfectly with cloud computing and swarm (cloud of clouds) computing. Clouds provide plenty of VMs so there is a constant ready supply of new hosts. From a software design perspective EC2 has been training us to expect failures and build Crash Only Software. We've gone stateless where we can, so load balancing to a new VM is not a problem. Where we can't go stateless we use work queues and clusters, so again, reincarnating onto new VMs is not a problem. So purposefully restarting VMs to starve zombie networks was born for cloud computing. If a wider move could be made to cloud-backed thin clients the internet might be a safer place to live, play, and work. Imagine being free(er) from spam blasts and DDOS attacks. Oh what a wonderful world it would be...
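A rough, hypothetical sketch of that rotation loop on EC2 with boto3 (the AMI id, instance type, and load-balancer swap are placeholders; a real deployment would more likely let an auto scaling group or the SCIT controller do this):

```python
# Hedged sketch: recycle an instance from a known-clean image on a short cycle.
import time
import boto3

ec2 = boto3.client("ec2")

def swap_into_load_balancer(new_id, old_id):
    pass  # placeholder: register new_id with your LB and drain old_id

def rotate_forever(old_instance_id, ami_id="ami-0123456789abcdef0",
                   cycle_seconds=60):
    while True:
        new = ec2.run_instances(ImageId=ami_id, InstanceType="t3.micro",
                                MinCount=1, MaxCount=1)
        new_id = new["Instances"][0]["InstanceId"]
        ec2.get_waiter("instance_running").wait(InstanceIds=[new_id])
        swap_into_load_balancer(new_id, old_instance_id)
        ec2.terminate_instances(InstanceIds=[old_instance_id])
        old_instance_id = new_id          # the fresh box becomes the next victim
        time.sleep(cycle_seconds)         # short uptime = short infection window
```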
