High Scalability -

1 Comment |

Permalink |

ER-Modeling with Google App Engine (updated)

Strategy

Tuesday

May272008

How I Learned to Stop Worrying and Love Using a Lot of Disk Space to Scale

Tuesday, May 27, 2008 at 10:04AM

Update 3: ReadWriteWeb says Google App Engine Announces New Pricing Plans, APIs, Open Access. Pricing is specified but I'm not sure what to make of it yet. An image manipulation library is added (thus the need to pay for more CPU :-) and memcached support has been added. Memcached will help resolve the can't write for every read problem that pops up when keeping counters. Update 2: onGWT.com threw a GAE load party and a lot of people came. The results at Load test : Google App Engine = 1, Community = 0. GAE handled a peak of 35 requests/second and a sustained 10 requests/second. Some think performance was good, others not so good. My GMT watch broke and I was late to arrive. Maybe next time. Also added a few new design rules from the post. Update: Added a few new rules gleaned from the GAE Meetup: Design By Explicit Cost Model and Puts are Precious. How do you structure your database using a distributed hash table like BigTable? The answer isn't what you might expect. If you were thinking of translating relational models directly to BigTable then think again. The best way to implement joins with BigTable is: don't. You--pause for dramatic effect--duplicate data instead of normalize it. *shudder* Flickr anticipated this design in their architecture when they chose to duplicate comments in both the commentor and the commentee user shards rather than create a separate comment relation. I don't know how that decision was made, but it must have gone against every fiber in their relational bones... But Flickr’s reasoning was genius. To scale you need to partition. User data must spread across the shards. So where do comments belong in a scalable architecture? From one world view comments logically belong to a relation binding comments and users together. But if your unit of scalability is the user shard there is no separate relation space. So you go against all your training and decide to duplicate the comments. Nerd heroism at its best. Let inductive rules derived from observation guide you rather than deductions from arbitrarily chosen first principles. Very Enlightenment era thinking. Voltaire would be proud. In a relational world duplication is removed in order to prevent update anomalies. Error prevention is the driving force in relational modeling. Normalization is a kind of ethical system for data. What happens, for example, if a comment changes? Both copies of the comment must be updated. That leads to errors because who can remember where all the data is stored? A severe ethical violation may happen. Go directly to relational jail :-) BigTable data ethics are more Mardi Gras than dinner with the in-laws. Data just wants to have fun. BigTable won’t stop you from hurting yourself. And to get the best results you may have to engage in some conventionally risky behaviors. But if those are the glass bead necklaces you have to give for a peak at scalability, why not take a walk on the wild side? For a more modern post-relational discussion of data ethics I’m using as my primary source a thread of conversations from JA Robson, Ben the Indefatigable, Michael Brunton-Spall, and especially Brett Morgan. According to our new Voltaire, Locke, Bacon, and Newton, here’s what it takes to act ethically in a BigTable world:

Don’t bother with BigTable unless your goal is to create a web site that scales to millions of users. The techniques for building scalable read-mostly web applications are difficult and require a radical mindset change. Standard relational techniques work very well until you scale to huge numbers of users. It is at that point you need to break the rules and do something counter-intuitively different. More of the same will not work. If you don’t plan to get to that point it may not be worth the effort to change. BigTable is targeted at building web applications, It's nature makes it a poor match for OLAP, data warehousing, data mining, and other applications performing complex data manipulations.

Assume slower random data access rather than fast sequential access. Every get of an entity could be from a different disk block on a different machine in a cluster. Calculating, for example, the average over a column in SQL can be efficient because data is stored together on disk. In BigTable data can be anywhere so iterating over every value in a column is expensive. Each read is potentially a random block from anywhere which means the average retrieval time can be relatively high. The implication is to use BigTable you must adopt some unfamiliar and unintuitive strategies in order to deal with such a very different performance profile. Using relational database we are used to writing applications against fast highly performant databases. With BigTable you have to become familiar with the rules for developing against a slower but more scalable database. Neither approach is better for all purposes, but BigTable has the edge for high scalability.

Group data for concurrent reads. Given the high cost of reading data from BigTable your application will not scale if every page requires a large number of reads. The solution: denormalize. Store data in the same entity based on what data needs to be read concurrently. Relational modeling groups data together based on the “minimize problems” rule. BigTable’s new rule is “maximize concurrent reads” which implies denormalization. Store entities so they can be read in one access rather than performing a join requiring multiple reads. Instead of storing attributes in separate entities in order to remove duplication, duplicate the attributes and store them where they need to be used. Following this rule minimizes the number of reads required to return an entity.

Disk and CPU are cheap so stop worrying about them and scale. A criticism of denormalization is storing duplicate data wastes disk space. Google’s architecture trades disk space for better performance. Disk is (relatively) cheap, so don’t fight it. On the CPU front a data center’s worth of CPU is at your service. As long as you structure your application in the way GAE forces you to, your application can scale as large as it needs to simply by running on more machines. All scalability bottlenecks have been removed.

Structure data around how it will be used. Trade SQL sets for application based entities. Queries are slow so the closer data is to the format it is to be used the faster pages will render. It’s like the database model becomes the model previously used at the caching layer. Complete entities tend to be cached, not low level detail rows. That’s what BigTable models should look like because that’s how concurrent reads are maximized. This isn’t the same as an object oriented database because the behavior is provided by applications, behavior is not bound to the entity so multiple applications can read the same entities yet implement very different behaviors.

Compute attributes at write time. Since looping over large columns of data is inefficient with BigTable the idea is to calculate values at write time instead of read time. For example, instead of calculating an average by reading an entire column at read time, track the total number and the total value at write time so the average can be calculated with one read on page display. Programmer effort is made up front at write time to minimize the work needed at read time. Preventing applications from iterating over huge data is key for making applications scale. Given the limitations of GAE transactions and quotas, GAE may not be appropriate for business applications that need exact summary statistics. Warning: if the summary stat is written on every read request then this approach will not scale as writes don't scale.

Create large entities with optional fields. Normalization creates lots of small entities. Instead, create larger entities with optional parts so you can do one read and then determine what’s present at run time. This shifts work from the database to the CPU while minimizing the number joins.

Define schemas in models. Denormalization requires user developed code to properly keep data consistent across multiple entities. The database won’t do it for you anymore. Schemas are really defined in code because it’s only code that can track all the relationships and maintain correctness. All database access must go through the models or otherwise the much feared inconsistency problems will result.

Hide updates using Ajax. Updates are slow so big bang updates of many entities will appear slow to users . Instead, use Ajax to update the database in little increments. As a user enters form data update the database so the update cost is amortized over many calls rather than one big call at the end. The result is a good user experience and a more scalable app.

Puts are Precious. Updating entities in large batches, say even 200 at a time, isn't part of the BigTable model. Entity attributes are automatically and synchronously indexed on writes. Indexing is an expensive operation that accumulates a lot of CPU time so the number updates that can be performed in one query is quite limited. The work around is to perform updates in smaller batches driven by an external CPU. Even when GAE provides the ability run batches within GAE the programming model for writes needs to be accounted for in a design.

Design By Explicit Cost Model. If you are going to be charged for an operation GAE wants you to explicitly ask for it. This is why some automatic navigation between objects isn't provided because that will force an explicit query to be written. Writing an explicit query is a sort of EULA for being charged. Click OK in the form of a query and you've indicated that you are prepared to pay for a database operation.

Place a many-to-many relation in the entity with the fewest number of elements. One way to create a many-to-many relationship is to have a list property that contains keys to the other related entities. A Company entity, for example, could contain a list of keys to Contact entities or a Contact entity could contain a list of keys to Company entities. Since it's likely a Contact is associated with fewer Companies the list should be contained in the Contact. The reasoning is maintaining large lists is relatively inefficient so you want to minimize the number of items in a list as much as possible.

Avoid unbounded queries. Large queries don't scale. Consider showing only the most recent 10 or so values from an attribute.

Avoid contention on datastore entities. If every request to your app reads or writes a particular entity, latency will increase as your traffic goes up because reads and writes on a given entity are sequential. One example construct you should avoid at all costs is the global counter, i.e. an entity that keeps track of a count and is updated or read on every request.

Avoid large entity groups. Any two entities that share a common ancestor belong to the same entity group. All writes to an entity group are sequential, so large entity groups can bog down popular apps quickly if there are a lot of writes to that group. Instead, use small, localized groups in your design.

Shard counters. Increment one of N counters and sum those N counters on the read side. This avoids the dreaded write bottleneck. See Efficient Global Counters by App Engine Fan for more details. An excellent example showing some of these principles in action can be found in this GQL thread. Take this nicely normalized schema:

Customer: 
 - Name 
 - Country 
Product: 
- Code 
- Name 
- Description 
Purchases: 
- Reference to Product Entity 
- Reference to Customer Entity 
- Date of order

Anyone from a relational background would look at this schema and give it a big thumbs up. With a little effort we can imagine the original physical purchase order that has now been normalized into three different tables. To recreate the original purchase order a join on purchases, produce and customer is needed. Read speed is not optimized, safety is optimized. Here’s what the same schema looks like optimized for reading:

Purchase: 
- Customer Name 
- Customer Country 
- Product Code 
- Product Name 
- Purchase Order Number 
- Date Of Order

The three original tables have been folded into one entity. Now a purchase order can be read in one get operation. No join necessary. Notice how the entity looks more like an original purchase order. It is also what would probably be cached and is what our model would probably look like. But what if you want to update a product name or a customer name? Those attributes are duplicated in all entities. Here’s where the protection offered by the relational model comes in. Only one entity needs updating in a normalized model. In BigTable you have to remember everywhere a customer name and product name and change every instance to new values. It’s not a simple, safe, or reliable approach. But it does optimize for read speed and scalability. For an application with a high proportion of updates to reads this approach wouldn’t make sense. But on the web reads usually dominate. How often do you really change a customer name or a product name? Seldom. How often do you read them? All the time. Designing to scale for reads and taking the pain on writes takes some getting used to. It’s a massive change to standard relational tactics. But this is what it takes to scale web applications, even if it feels a little strange at first.

Tips on writing scalable apps

35 Comments |

Permalink |

BigTable,

Database,

GAE,

Shard,

Strategy,

cloud

Tuesday

May272008

Should Twitter be an All-You-Can-Eat Buffet or a Vending Machine?

Tuesday, May 27, 2008 at 7:03AM

Om proposes one solution to the Twitter Problem is to limit followers to three square meals a day. The reasonable idea being that lower limits should mean fewer scaling problems. And as a kicker raising those limits is a good way to raise much needed revenue. Scoble thinks users should consume without limit and will drive to another buffet if all-you-can-eat privileges are revoked. The reasonable idea being that if an internet service can't solve internet scale problems then there's not much use for it. Dave says comp power users a top floor suite and shower them with free passes to the buffet. Let the good times roll! The reasonable idea being that power users help create popular restaurants, er, services in the first place and limiting them starves users and starved users won't come back. So, should web services like Twitter be a buffet, a fixed eight course fine dining experience, a small plate restaurant, a family style joint, or a vending machine? Or something else entirely? In a distant barely remembered past I actually worked at an all-you-can-eat buffet. The food was very good and most customers didn't over over indulge. If they did the place wouldn't stay in business long. But some customers did. They were called stackers. Stackers were so named because a large stack of plates would pile up on their table throughout the meal. Stackers followed a power law distribution. Few customers at any one time were stackers, but their effect could be devastating. How devastating depended on their favorite foods... A stacker who loved potato salad was manageable. We had plenty of potato salad and it was cheap and quick to make. No problem. Stacking itself was not frowned upon and never discouraged. It's an all-you-can-eat buffet after all! But if a stacker's favorite food was roast beef, that was trouble. Not only is roast beef expensive, it comes in a limited supply because it has to be prepared ahead of time. Once you ran out there was no more roast beef for the rest of the night. Good roast beef takes hours to prepare, it must be planned for. Management's job was to carefully balance projected demand against waste. The goal was to prepare enough meat to meet demand, yet not have a lot of left-overs. Stackers blow apart the finely balanced calculation of how much roast beef to make and the carving station is left trying to push the ham while apologizing for an embarrassing lack of roast beef. An ugly ugly scene. As a carver you are armed with a long scary looking knife and you are shielded by Medieval chain-mail looking glove, but hungry customers are mean and fast. You never see it coming. Unfortunately the distribution of stackers on any given night is unpredictable. You can't always cook a maximum amount of meat or you'll go broke. And if you make too little everyone is unhappy. It needs to be just right. As a person with serious stacker tendencies I try to remember the cost of things and keep a reasonable balance. The only way to make Goldilocks happy and have just the right balance is to place limits. Eventually the restaurant had to limit the number of trips to the roast beef station to three a meal. Enough that you get value for your dollar, but not so much that the restaurant goes under. Everyone happy? Of course not. The world doesn't work like that. It's all-you-can-eat some would say so I should be able to eat all I can eat ! But there are always limits. Would it be fair to back a truck up to the restaurant and start loading up because that's part of your meal? No. Is it fair to stuff your backpack with food on the way out? No. So there are always limits. The question is what are fair limits? It has been said FriendFeed has no problems handling 10,000 friends so neither should Twitter. Now, let's imagine if I spun up 1000 EC2 servers whose only task was to add more friends to feed. Would FriendFeed limit me then? Of course. It's basic web site self-defense, a right guaranteed under the constitution and long recognized by the courts in certain situations. But still, what are fair limits? How much roast beef should you be able to eat? Limit setting is a strategy we've talked about many times as a way of protecting sites from complete devastation. My favorite example is Mailinator whose prime directive is surviving attacks and they've deployed many clever practices in their own defense. And most every large web site on earth is busy watching your every move so they can bounce you at the first sign of DDOS Armageddon. Limits aren't inherently bad. But limits don't make you scale, they simply stop you from unscaling. An adequate scalable infrastructure must still be put in place. In the end I agree with Scoble in that the power of the internet is having interesting conversations with interesting people about interesting topics. For interesting conversations to happen you must be able to freely create relationships. If you or they have to pay for relationships they simply won't form. Would Google's Page Rank algorithm work so well if it could only analyze paid relationships? A web formed under a paid relationship model would look totally different and be decidedly less valuable. Similarly, a social network that can't grow naturally through preferential attachment would have much less value. Scaling relationships is a core social network competency. Relationships should be subject to DDOS type limits, but not limits artificially out of proportion with a user's internet audience. I doubt Twitter would disagree, but they are going through a tough time right now. I also agree with Om. The Freemium model is a great idea and linking that to site protecting prophylactics is even better. But limiting a core competency may not be the right target. Fotolog is an example of a service that puts Freemium ideas to good use. They charge extra for adding more photos a day, more comments a day, custom profile abilities, and social status add ons. What is the equivalent in Twitter? I don't know, but I would try to treat relationships more like potato salad than roast beef. And I also agree with Dave. It's hard to get noticed on the web. Those who help you storm the attention barrier shouldn't be punished. They should be rewarded with a tasty appropriately sized meal.

Permalink |

Strategy,

limit,

problem

Saturday

May102008

Hitting 300 SimbleDB Requests Per Second on a Small EC2 Instance

Saturday, May 10, 2008 at 3:25AM

High Performance Multithreaded Access to Amazon SimpleDB is a great follow up to the idea in How SimpleDB Differs from a RDBMS that more programming is the price paid for performance in SimpleDB. It shows how much work and infrastructure is required to batter better performance out of SimpleDB. Remember, in SimpleDB you get keys to records from queries so if you want to get all the fields for records you need to make separate requests. Since SimpleDB isn't exactly a speed daemon the obvious strategy is to parallelize. Even if a job takes a 100 msecs you can get a lot done in a little time if you can execute enough jobs in parallel. Parallelization is the approach taken by Haakon@AWS in his Java code example of how to get the most out of SimpleDB. You can find the code at Indexing and Querying Amazon S3 Metadata with Amazon SimpleDB. We'll also consider how a back-end service architecture built on Erlang may be a better fit with cloud computing. Two general mechanisms of parallelism are available: threads and boxes. To get the most bang out of a single machine you need threads (events, etc). To scale beyond the load handled by a single machine you need multiple boxes. The example code uses the Executor Thread Pool for parallelism within a program. Thread pools are a pretty common idiom by now. Amazon's queue service SQS was used to distribute work amongst boxes. Work was queued to SQS in batches of 1000 work items. The items were pulled by the thread pool and processed. Why 1000? The idea is to balance processing overhead with work overhead. You don't want popping items off SQS to dominate your processing time so you have to do enough work in each pass to make it worth the investment. The architecture uses two thread pools: one to run queries and one to get record values. Applications must carefully tune the number of threads in each pool so the queries to overwhelm the gets. Using a query thread pool with 2 threads and a get thread pool with 32 threads it was possible to perform 300 TPS on a small EC2 instances. Theoretically the advantage of this architecture is that it will scale to any size you need. SQS is your work distribution backbone and you just spin up the number of thread pool instances you need. The disadvantage is that this is a lot of programmer effort. But let's consider that you had to do some serious processing on each record, you would need something like this approach anyway to scale out the processing. But to perform simple aggregation operations it's total overkill which is why more time needs to be spent on the write site of the equation in SimpleDB/BigTable than the read side as we are used to with a RDBMS. What's the best way to go parallel? On the front-end life is simple. Go shared nothing and compose your pages from scalable back-end services. This is how Amazon does it and it's how Google AppEngine does it. GAE completely punts on the back-end service layer architecture. Unfortunately we still need to create a back-end architecture for more complex applications. Thread pools and SQS is one parallelization approach. Instead of thread pools something like Java's fork/join framework could be used. Initially I thought piling on more low level primitive threading facilities into Java was the wrong way to go. Yes, it is a "'multicore-friendly lightweight parallel framework' that supports a style of parallel programming where problems are recursively split into smaller fragments, solved in parallel and recombined," but it's also a style of programming that is very difficult to program correctly. If cloud architectures will rely on these primitives for efficiency then I think we have regressed. Erlang style architectures described by Luke Hoersten in Scalable Web Apps: Erlang + Python is a simpler more reliable to programming model. An event driven actor based approach is much harder to screw up than closely cooperating threads in a shared memory space. Erlang originally ran in embedded systems where the requirement was to reliably squeeze the most work possible out of limited CPU and other compute resources. Oddly enough the embedded node of old closely parallels your basic cloud VM. Start your work horse Erlang (or other similar system) instances and let them efficiently chew up your work loads. Erlang's scheduling model fits perfectly with a service centric job engine cloud instance. It will get more work done then your typical thread based system ever would.

HSCALE - Transparent MySQL Partitioning

Permalink |

AWS,

erlang

Monday

May052008

HSCALE - Handling 200 Million Transactions Per Month Using Transparent Partitioning With MySQL Proxy

Monday, May 5, 2008 at 3:41PM

Update 2: A HSCALE benchmark finds HSCALE "adds a maximum overhead of about 0.24 ms per query (against a partitioned table)." Future releases promise much improved results. Update: A new presentation at An Introduction to HSCALE. After writing Skype Plans for PostgreSQL to Scale to 1 Billion Users, which shows how Skype smartly uses a proxy architecture for scaling, I'm now seeing MySQL Proxy articles all over the place. It's like those "get rich quick" books that say all you have to do is visualize a giraffe with a big yellow dot superimposed over it and by sympathetic magic giraffes will suddenly stampede into your life. Without realizing it I must have visualized transparent proxies smothered in yellow dots. One of the brightest images is a wonderful series of articles by Peter Romianowski describing the evolution of their proxy architecture. Their application is an OLTP system executing 200 million transaction per month, tables with more than 1.5 billion rows, and a 600 GB total database size. They ran into a wall buying bigger boxes and wanted to move to a sharded architecture. The question for them was: how do you implement sharding? In the first article four approaches to sharding were identified:

Using MySQL Cluster

Using MySQL Proxy with transparent query rewriting and load balancing

Implement it into a JDBC driver

Implement it into the application data access layer. The proxy solution was selected because it's transparent to the application layer. Applications need not know about the partitioning scheme to make it work. Not mucking with apps is a big win. The downside is implementation complexity. How do you parse a query and and map it correctly to the right server? Will this cause a big performance degradation? How is this new more complex and dynamic system to be tested? Can we run the same queries they did before or will they have to rewrite parts of their application? A lot of questions to be worked out. The second article starts working out those problems using MySQL Proxy. The process was broken into a few steps:

Analyze the query to find out which tables are involved and what the parition key would be.

Validate the query and reject queries that cannot be analyzed.

Determine the partition table / database. This could be done by a simple lookup, a hashing function or anything else.

Rewrite the query and replace the table names with the partition table names.

Execute the query on the correct database server and return the result back to the client. Some of the comments were concerned that a modulus scheme was being used to identify a partition. The recommendation was to use a directory service for mapping to partitions instead. A directory service allows you to logically map partitions behind the scenes and doesn't tie you to a deterministic physical mapping. After getting all this working they generously released it to the world as HSCALE - Transparent MySQL Partitioning: HSCALE is a plugin written for MySQL Proxy which allows you to transparently split up tables into multiple tables called partitions. In later versions you will be able to put each partition on a different MySQL server. Application based partitioning means that your split up your data logically and rewrite your application to select the right piece of data (i.e. partition) at any given time. More on application based partitioning. Read here some more about what could be done with HSCALE. HSCALE helps in application based partitioning. Using the MySQL Proxy it sits between your application and the database server. Whenever a sql statement is sent to the server HSCALE analyzes it to find out whether a partitioned table is used. It then tries to find out which partition the sql statement should go to. Access release .1 at HSCALE 0.1 released - Partitioning Using MySQL Proxy. The transparent proxy ability is very powerful, but what we are lacking that various companies have created internally is a partition management layer. How do you move partitions? How do you split partitions when a table outgrows the shard or performance declines? Lots of cool tools still to build.

Pero: HSCALE 0.1 released - Partitioning Using MySQL Proxy

Pero: MySQL Partitioning on Application Side

Pero: Progress on MySQL Proxy Partitioning

HighScalability: Flickr Architecture - more information on partitioning.

Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web

HighScalabilty: An Unorthodox Approach to Database Design : The Coming of the Shard.

Permalink |

MySQL,

sharding

Tuesday

Apr292008

Strategy: Sample to Reduce Data Set

Tuesday, April 29, 2008 at 4:01PM

Update: Arjen links to video Supporting Scalable Online Statistical Processing which shows "rather than doing complete aggregates, use statistical sampling to provide a reasonable estimate (unbiased guess) of the result." When you have a lot of data, sampling allows you to draw conclusions from a much smaller amount of data. That's why sampling is a scalability solution. If you don't have to process all your data to get the information you need then you've made the problem smaller and you'll need fewer resources and you'll get more timely results. Sampling is not useful when you need a complete list that matches a specific criteria. If you need to know the exact set of people who bought a car in the last week then sampling won't help. But, if you want to know many people bought a car then you could take a sample and then create estimate of the full data-set. The difference is you won't really know the exact car count. You'll have a confidence interval saying how confident you are in your estimate. We generally like exact numbers. But if running a report takes an entire day because the data set is so large, then taking a sample is an excellent way to scale.

Permalink |

PyCon Presentation: Supervisor as a Platform

Strategy,

sampling

Wednesday

Apr022008

Product: Supervisor - Monitor and Control Your Processes

Wednesday, April 2, 2008 at 3:12AM

It's a sad fact of life, but processes die. I know, it's horrible. You start them, send them out into process space, and hope for the best. Yet sometimes, despite your best coding, they core dump, seg fault, or some other calamity befalls them. Unlike our messy biological world so cruelly ruled by entropy, in the digital world processes can be given another chance. They can be restarted. A greater destiny awaits. And hopefully this time the random lottery of unforeseen killing factors will be avoided and a long productive life will be had by all. This is fun code to write because it's a lot more complicated than you might think. And restarting processes is a highly effective high availability strategy. Most faults are transient, caused by an unexpected series of events. Rather than taking drastic action, like taking a node out of production or failing over, transients can be effectively masked by simply restarting failed processes. Though complexity makes it a fun problem, it's also why you may want to "buy" rather than build. If you are in the market, Supervisor looks worth a visit. Adapted from their website: Supervisor is a Python program that allows you to start, stop, and restart other programs on UNIX systems. It can restart crashed processes.

It is often inconvenient to need to write "rc.d" scripts for every single process instance. rc.d scripts are a great lowest-common-denominator form of process initialization/autostart/management, but they can be painful to write and maintain. Additionally, rc.d scripts cannot automatically restart a crashed process and many programs do not restart themselves properly on a crash. Supervisord starts processes as its subprocesses, and can be configured to automatically restart them on a crash. It can also automatically be configured to start processes on its own invocation.

It's often difficult to get accurate up/down status on processes on UNIX. Pidfiles often lie. Supervisord starts processes as subprocesses, so it always knows the true up/down status of its children and can be queried conveniently for this data.

Users who need to control process state often need only to do that. They don't want or need full-blown shell access to the machine on which the processes are running. Supervisorctl allows a very limited form of access to the machine, essentially allowing users to see process status and control supervisord-controlled subprocesses by emitting "stop", "start", and "restart" commands from a simple shell or web UI.

Users often need to control processes on many machines. Supervisor provides a simple, secure, and uniform mechanism for interactively and automatically controlling processes on groups of machines.

Processes which listen on "low" TCP ports often need to be started and restarted as the root user (a UNIX misfeature). It's usually the case that it's perfectly fine to allow "normal" people to stop or restart such a process, but providing them with shell access is often impractical, and providing them with root access or sudo access is often impossible. It's also (rightly) difficult to explain to them why this problem exists. If supervisord is started as root, it is possible to allow "normal" users to control such processes without needing to explain the intricacies of the problem to them.

Processes often need to be started and stopped in groups, sometimes even in a "priority order". It's often difficult to explain to people how to do this. Supervisor allows you to assign priorities to processes, and allows user to emit commands via the supervisorctl client like "start all", and "restart all", which starts them in the preassigned priority order. Additionally, processes can be grouped into "process groups" and a set of logically related processes can be stopped and started as a unit. Supervisor also has a web interface and an XMP-RPC interface:

A (sparse) web user interface with functionality comparable to supervisorctl may be accessed via a browser if you start supervisord against an internet socket. Visit the server URL (e.g. http://localhost:9001/) to view and control process status through the web interface after activating the configuration file's [inet_http_server] section. XML-RPC Interface

The same HTTP server which serves the web UI serves up an XML-RPC interface that can be used to interrogate and control supervisor and the programs it runs. To use the XML-RPC interface, connect to supervisor's http port with any XML-RPC client library and run commands against it. An example of doing this using Python's xmlrpclib client library is as follows.

Monitor Pylons application with supervisord

Supervisor Manual

5 Comments |

Permalink |

Strategy

Saturday

Mar292008

20 New Rules for Faster Web Pages

Saturday, March 29, 2008 at 2:31AM

Update: Nice explanation in The importance of bandwidth versus latency of how long latencies cause cascading delays in resource loading. Doloto tries to optimize how resources are loaded. Twenty new rules have been added to the original 14 rules for sizzling web performance. Part of scalability is worrying about performance too. The front-end is where 80-90% of end-user response time is spent and following these best practices improved the performance of Yahoo! properties by 25-50%. The rules are divided into server, content, cookie, JavaScript, CSS, images, and mobile categories. The new rules are:

Flush the buffer early [server]

Use GET for AJAX requests [server]

Post-load components [content]

Preload components [content]

Reduce the number of DOM elements [content]

Split components across domains [content]

Minimize the number of iframes [content]

No 404s [content]

Reduce cookie size [cookie]

Use cookie-free domains for components [cookie]

Minimize DOM access [javascript]

Develop smart event handlers [javascript]

Choose <link> over @import [css]

Avoid filters [css]

Optimize images [images]

Optimize CSS sprites [images]

Don't scale images in HTML [images]

Make favicon.ico small and cacheable [images]

Keep components under 25K [mobile]

Pack components into a multipart document [mobile] Thanks to Simon Willison for the link.

2 Comments |

Permalink |

yslow

Wednesday

Mar192008

Serving JavaScript Fast

Wednesday, March 19, 2008 at 5:23AM

Cal Henderson writes at thinkvitamin.com: "With our so-called "Web 2.0' applications and their rich content and interaction, we expect our applications to increasingly make use of CSS and JavaScript. To make sure these applications are nice and snappy to use, we need to optimize the size and nature of content required to render the page, making sure we’re delivering the optimum experience. In practice, this means a combination of making our content as small and fast to download as possible, while avoiding unnecessarily refetching unmodified resources." A lot of good comments too.

Revolver |

1 Comment |

Permalink |

Strategy

Friday

Mar142008

Problem: Mobbing the Least Used Resource Error

Friday, March 14, 2008 at 3:04AM

A thoughtful reader recently suggested creating a series of posts based on real-life problems people have experienced and the solutions they've created to slay the little beasties. It's a great idea. Often we learn best from great trials and tribulations. I'll start off the new "Problem Report" feature with a diabolical little problem I dubbed the "Mobbing the Least Used Resource Error." Please post your own. And if you know someone with an interesting problem report, please tag them too. It could be a lot of fun. Of course, feel free to scrub your posts of all embarrassing details, but be sure to keep the heroic parts in :-)

The Problem

There's an unexpected and frequently fatal type of error that can happen when new resources are added to a horizontally scaled architecture. Because the new resource has the least of something, load or connections or whatever, a load balancer configured with a least metric will instantaneously direct all new traffic to that new resource. And bam! Your system dies. All the traffic that was meant to be spread across your entire cluster is now directed like a laser beam to one small part of it. I love this problem because it's such a Heisenberg. Everyone is screaming for more storage space so you bring up a new filer. All new data streams flow to the new filer and it crumbles and crawls because it can't handle the load for the entire system. It's in the very act of turning up more storage you bring your system down. How "cruel world the universe hates me" is that? Let's say you add database slaves to handle load. Your load balancer redirects traffic to the new slaves, but the slaves are trying to sync, yet they can't sink because they are getting hammered by the new traffic. Down goes Frazier. This is the dark side of partitioning. You partition data to get high performance via parallelization. For example, you hash on the user name to a cluster dedicated to handle those users. Unless your system is very flexible you can't scale anymore by adding resources because you can't repartition the data. All users are handled by their cluster. If you want a different organization you would have to redistribute data across all the clusters. Most systems can't handle that and you end not being able to scale out as easily as you hoped.

The Solution

The solution depends of course on the resource in question. Butting knowing a potential problem is present gives you the heads up you need to avoid destruction.

For filers migrate storage from existing filers to the new filers so storage is evened out. Then new storage will be allocated evenly across all the filers.

For services have a life cycle state machine indicating when a service is up and ready for work. Simply being alive doesn't mean it's ready.

Consistent Hashing to assign resources to a pool of servers in a scalable fashion.

For servers use random or round-robin balancing when the load balancer can receive incorrect feedback from pool servers. The Thundering Herd Problem is supposedly the same problem described here, but it doesn't seem the same to me.