Entries in Product (120)

Tuesday
Jul 15, 2008

ZooKeeper - A Reliable, Scalable Distributed Coordination System 

ZooKeeper is a highly available and reliable coordination system. Distributed applications use ZooKeeper to store and mediate updates to key configuration information. ZooKeeper can be used for leader election, group membership, and configuration maintenance. In addition, ZooKeeper can be used for event notification, locking, and as a priority queue mechanism. It's a sort of central nervous system for distributed systems where the role of the brain is played by the coordination service, axons are the network, processes are the monitored and controlled body parts, and events are the hormones and neurotransmitters used for messaging. Every complex distributed application needs a coordination and orchestration system of some sort, so the ZooKeeper folks at Yahoo decided to build a good one and open source it for everyone to use.

The target market for ZooKeeper is multi-host, multi-process C and Java based systems that operate in a data center. ZooKeeper works by having distributed processes coordinate with each other through a shared hierarchical namespace that is modeled after a file system. Data is kept in memory and is backed up to a log for reliability. Because it is memory based, ZooKeeper is very fast and can handle the high loads typically seen in chatty coordination protocols across huge numbers of processes. Using a memory based system also means you are limited to the amount of data that can fit in memory, so it's not useful as a general data store. It's meant to store small bits of configuration information rather than large blobs. Replication is used for scalability and reliability, which means it prefers applications that are heavily read based.

As is typical of hierarchical systems, you can add nodes at any point of a tree, get a list of entries in a tree, get the value associated with an entry, and get notification of when an entry changes or goes away. Using these primitives and a little elbow grease you can construct the higher level services mentioned above.

Why would you ever need a distributed coordination system? It sounds kind of weird. That's the question I'll be addressing in this post, rather than how it works, because the slides and the video do a decent job of explaining at a high level what ZooKeeper can do. The low level details could use another paper, however. Reportedly it uses a version of the famous Paxos algorithm to keep replicas consistent in the face of the failures most daunting. What's really missing is a motivation showing how you can use a coordination service in your own system, and that's what I hope to provide...

Kevin Burton wants to use ZooKeeper to configure external monitoring systems like Munin and Ganglia for his Spinn3r blog indexing web service. He proposes that each service register its presence in a cluster with ZooKeeper under the tree "/services/www". A Munin configuration program will add a ZooKeeper watch on that node so it will be notified when the list of services under /services/www changes. When the Munin configuration program is notified of a change, it reads the service list and automatically regenerates a munin.conf file for the service. Why not simply use a database? Because of the guarantees ZooKeeper makes about its service:

  • Watches are ordered with respect to other events, other watches, and asynchronous replies. The ZooKeeper client libraries ensure that everything is dispatched in order.
  • A client will see a watch event for a znode it is watching before seeing the new data that corresponds to that znode.
  • The order of watch events from ZooKeeper corresponds to the order of the updates as seen by the ZooKeeper service.

You can't get these guarantees from an event system plopped on top of a database, and these are the sort of guarantees you need in a complex distributed system where connections drop, nodes fail, retransmits happen, and chaos rules the day. What rules the night is too terrible to contemplate. For example, it's important that a service-up event is seen after the service-down event or you may unnecessarily drop revenue producing work because of an out-of-order event issue. Not that I would know anything about this, mind you :-)

A weakness of ZooKeeper is that changes can be missed: because watches are one time triggers and there is latency between getting the event and sending a new request to get a watch, you cannot reliably see every change that happens to a node in ZooKeeper. Be prepared to handle the case where the znode changes multiple times between getting the event and setting the watch again. (You may not care, but at least realize it may happen.) This means that ZooKeeper is a state based system more than an event system. Watches are set as a side-effect of getting data, so you'll always have a valid initial state and on any subsequent change events you'll refresh to get new values. If you want to use events to log when and how something changed, for example, then you can't do that. You would have to include change history in the data itself.

Let's take a look at another example of where ZooKeeper could be useful. Picture a complex backend system running on, let's say, 100 nodes (maybe a lot less, maybe a lot more) in a data center. For example purposes assume the system is an ad system for serving advertisements to web sites. Ad systems are complex beasts that require a fair bit of coordination. Imagine all the subsystems needing to run on those 100 nodes: database, monitoring, fraud detectors, beacon servers, web server event log processors, failover servers, customer dashboards, targeting engines, campaign planners, campaign scenario testers, upgrades, installs, media managers, and so on. There's a lot going on. Now imagine the power in the data center flips and all the machines power on. How do all the processes across all the hosts know what to do? Now imagine everything is up and a few machines go down. How do all the processes know what to do in this situation? This is where a coordination service comes in. A coordination service acts as the backplane over which all these subsystems figure out what they should do relative to all the other subsystems in a product.

How, for example, do the ad servers know which database to use? Clearly there are many options for solving this problem. Using standard DNS naming conventions is one. Configuration files is another. Hard coding is still a favorite. Using a bootstrap service locator service is yet another (assuming you can bootstrap the bootstrap service). Ideally any solution must work just as well during unit testing, system testing, and deployment. In this scenario ZooKeeper acts as the service locator. Each process goes to ZooKeeper and finds out which is the primary database. If a new primary is elected, say because a host fails, then ZooKeeper sends an event that allows everyone dependent on the database to react by getting the new primary database.
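To make the watch patterns above concrete (the /services/www registry Kevin Burton describes, plus the one-time trigger caveat), here is a minimal sketch. It uses the Python kazoo client purely as an illustration; the post itself targets the C and Java bindings, and the ensemble address, host name, and munin.conf handling below are hypothetical:

    from kazoo.client import KazooClient

    zk = KazooClient(hosts='zk1:2181,zk2:2181,zk3:2181')  # hypothetical ensemble
    zk.start()

    # Each web service registers itself as an ephemeral znode; the entry
    # disappears automatically when the service's ZooKeeper session dies.
    zk.ensure_path('/services/www')
    zk.create('/services/www/host42:8080', b'up', ephemeral=True)

    def regenerate_munin_conf(services):
        # Hypothetical stand-in for rewriting munin.conf from the service list.
        print('munin.conf now covers: %s' % ', '.join(sorted(services)))

    def on_services_changed(event):
        # Watches are one-time triggers: re-reading the children both returns the
        # current membership and re-registers the watch. Changes that happened in
        # between are collapsed into the state read here.
        regenerate_munin_conf(
            zk.get_children('/services/www', watch=on_services_changed))

    # The initial read sets the first watch and provides a valid starting state.
    regenerate_munin_conf(
        zk.get_children('/services/www', watch=on_services_changed))

Because the watcher always re-reads the full list rather than trusting individual events, a service that flaps between events still leaves the generated configuration correct; what you lose is the history of intermediate changes.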
Having complicated retry logic in application code to fail over to another database server is simply a disaster, as every programmer will mess it up in their own way. Using a coordination service nicely deals with the problem of services locating other services in all scenarios. Of course, using a proxy like MySQL Proxy would remove even more application level complexity in dealing with failover and request routing.

How did the database servers decide which role they would play in the first place? All the database servers boot up and say "I'm a database server brave and strong, what's my role in life? Am I a primary or a secondary server? Or if I'm a shard, what key range do I serve?" If 10 servers are database servers, negotiating roles can be a very complicated and error prone process. A declarative approach, specifying a failover ring in configuration files, is a good one, but it's a pain to get working in local development and test environments as the machines are always changing. It's easier to let the database servers come up and self-organize, both on initial role election and in failure scenarios. The advantage of this approach is that the system can run locally on one machine or on a dozen machines in the data center with very little effort. ZooKeeper supports this type of coordination behavior.

Now let's say I want to change the configuration of ad targeting state machines currently running in 40 processes on 40 different hosts. How do I do that? The first approach is no approach: most systems make it so a new code release has to happen, which is very sloooow. Another approach is a configuration file. Configuration is put in a distribution package and pushed to all nodes. Each process then periodically checks to see if the configuration file has changed and, if it has, reads the new configuration. That's the basics. Infinite variations can follow. You can have configuration for different subsystems. There's complexity because you have to know what packages are running on which nodes. You have to deal with rollback in case all packages don't push correctly. You have to change the configuration, make a package, test it, then push it to the data center operations team, which may take a while to perform the upgrade. It's a slow process. And when you get into changes that impact multiple subsystems it gets even more complicated. Another approach I've taken is to embed a web server in each process so you can see the metrics and change the configuration for each process on the fly. While powerful for a single process, it's harder to manipulate sets of processes across a data center using this approach.

Using ZooKeeper I can store my state machine definition as a node, which is loaded from the static configuration collected from every distribution package in a product. Every process dependent on that node can register as a watcher when it initially reads the state machine. When the state machine is updated, all dependent entities get an event that causes them to reload the state machine into the process. Simple and straightforward. All processes will eventually get the change, and any rebooting processes will pick up the new state machine on initialization. A very cool way to reliably and centrally control a large distributed application.

One caveat is that I don't see much activity on the ZooKeeper forum, and the questions that do get asked largely go unanswered. Not a good sign when considering such a key piece of infrastructure.
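Here is a minimal sketch of the state-machine-as-a-znode idea described above, again using the Python kazoo client for illustration; the znode path and the parse/install helpers are hypothetical placeholders:

    from kazoo.client import KazooClient

    zk = KazooClient(hosts='zk1:2181')
    zk.start()

    CONFIG_PATH = '/config/ad-targeting/state-machine'   # hypothetical znode

    def parse_state_machine(data):
        return data.decode('utf-8')                      # stand-in for real parsing

    def install_state_machine(definition, version):
        print('installed state machine version %d' % version)

    def load_state_machine(event=None):
        # Reading the znode also (re)sets the watch, so a process always starts
        # from a valid definition and refreshes on every subsequent change.
        data, stat = zk.get(CONFIG_PATH, watch=load_state_machine)
        install_state_machine(parse_state_machine(data), stat.version)

    load_state_machine()   # initial load; a restarted process picks up the latest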
Another caveat that may not be obvious on first reading is that your application state machine using ZooKeeper will have to be intimately tied to ZooKeeper's state machine. When a ZooKeeper server dies, for example, your application must process that event and reestablish all your watches on a new server. When a watch event comes in, your application must handle the event and set new watches. The algorithms to carry out higher level operations like locks and queues are driven by multi-step state machines that must be correctly managed by your application. And as ZooKeeper deals with state that is probably also stored in your application, it's important to worry about thread safety. Callbacks from the ZooKeeper thread could access shared data structures. An Actor model, where you dump ZooKeeper events into your own Actor queues, could be a very useful application architecture here for synthesizing different state machines in a thread safe manner.
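One way to picture that Actor-style approach: the ZooKeeper callbacks do nothing but enqueue events, and a single worker thread owns all application state. A minimal sketch, with the ZooKeeper wiring and the actual event handling left as hypothetical stubs:

    import queue
    import threading

    events = queue.Queue()

    def handle(event):
        # Hypothetical stub: advance your own application state machine(s) here.
        print('handling %r' % (event,))

    def on_zookeeper_event(event):
        # Runs on the ZooKeeper client's callback thread; don't touch shared
        # application state here, just hand the event off to the queue.
        events.put(event)

    def worker():
        # The only thread that mutates application state, so the state machines
        # it drives need no locking.
        while True:
            handle(events.get())

    threading.Thread(target=worker, daemon=True).start()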

    Some Fast Facts

  • How is data partitioned across multiple machines? Complete replication in memory. (Yes, this is limiting.)
  • How do updates happen (interaction across machines)? All updates flow through the master and are considered complete when a quorum confirms the update.
  • How do reads happen (is getting a stale copy possible)? Reads go to any member of the cluster. Yes, stale copies can be returned. Typically, these are very fresh, however.
  • What is the responsibility of a leader? To assign serial ids to all updates and confirm that a quorum has received the update.
  • There are several limitations that stand out in this architecture:
    - Complete replication limits the total size of data that can be managed using ZooKeeper. This is acceptable in some applications, not acceptable in others. In the original domain of ZooKeeper (management of configuration and status) it isn't an issue, but ZooKeeper is good enough to encourage creative misuse where this can become a problem.
    - Serializing all updates through a single leader can be a performance bottleneck. On the other hand, it is possible to push 50K updates per second through a leader and the associated quorum, so this limit is pretty high.
    - The data storage model is completely non-relational.
    These answers were provided by Ted Dunning on the Cloud Computing group.

    Related Articles

  • An Introduction to ZooKeeper Video (Hadoop and Distributed Computing at Yahoo!) (PDF)
  • ZooKeeper Home, Email List, and Recipes (which has some odd connotations for a Zoo).
  • The Chubby Lock Service for Loosely-Coupled Distributed Systems from Google
  • Paxos Made Live – An Engineering Perspective by Tushar Chandra, Robert Griesemer, and Joshua Redstone from Google.
  • Updates on Open Source Distributed Consensus by Kevin Burton
  • Using ZooKeeper to configure External Monitoring Systems by Kevin Burton
  • Paxos Made Simple by Leslie Lamport
  • Hyperspace - API description of Hyperspace, a Chubby-like service
  • Notes on ZooKeeper at the Hadoop Summit by James Hamilton.


Sunday
May 25, 2008

    Product: Condor - Compute Intensive Workload Management

    From their website: Condor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, Condor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to Condor, Condor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion.

    While providing functionality similar to that of a more traditional batch queueing system, Condor's novel architecture allows it to succeed in areas where traditional scheduling systems fail. Condor can be used to manage a cluster of dedicated compute nodes (such as a "Beowulf" cluster). In addition, unique mechanisms enable Condor to effectively harness wasted CPU power from otherwise idle desktop workstations. For instance, Condor can be configured to only use desktop machines where the keyboard and mouse are idle. Should Condor detect that a machine is no longer available (such as a key press being detected), in many circumstances Condor is able to transparently produce a checkpoint and migrate a job to a different machine which would otherwise be idle.

    Condor does not require a shared file system across machines - if no shared file system is available, Condor can transfer the job's data files on behalf of the user, or Condor may be able to transparently redirect all the job's I/O requests back to the submit machine. As a result, Condor can be used to seamlessly combine all of an organization's computational power into one resource.

    Related Articles

  • High Throughput Computing by Miron Livny
  • Condor Presentations
  • (my) Principles of Distributed Computing by Miron Livny


Monday
May 5, 2008

    HSCALE - Handling 200 Million Transactions Per Month Using Transparent Partitioning With MySQL Proxy

    Update 2: A HSCALE benchmark finds HSCALE "adds a maximum overhead of about 0.24 ms per query (against a partitioned table)." Future releases promise much improved results. Update: A new presentation at An Introduction to HSCALE.

    After writing Skype Plans for PostgreSQL to Scale to 1 Billion Users, which shows how Skype smartly uses a proxy architecture for scaling, I'm now seeing MySQL Proxy articles all over the place. It's like those "get rich quick" books that say all you have to do is visualize a giraffe with a big yellow dot superimposed over it and by sympathetic magic giraffes will suddenly stampede into your life. Without realizing it I must have visualized transparent proxies smothered in yellow dots.

    One of the brightest images is a wonderful series of articles by Peter Romianowski describing the evolution of their proxy architecture. Their application is an OLTP system executing 200 million transactions per month, with tables of more than 1.5 billion rows and a 600 GB total database size. They ran into a wall buying bigger boxes and wanted to move to a sharded architecture. The question for them was: how do you implement sharding? In the first article four approaches to sharding were identified:

  • Using MySQL Cluster
  • Using MySQL Proxy with transparent query rewriting and load balancing
  • Implement it into a JDBC driver
  • Implement it into the application data access layer.

    The proxy solution was selected because it's transparent to the application layer. Applications need not know about the partitioning scheme to make it work. Not mucking with apps is a big win. The downside is implementation complexity. How do you parse a query and map it correctly to the right server? Will this cause a big performance degradation? How is this new, more complex, and dynamic system to be tested? Can they run the same queries they did before or will they have to rewrite parts of their application? A lot of questions to be worked out.

    The second article starts working out those problems using MySQL Proxy. The process was broken into a few steps:
  • Analyze the query to find out which tables are involved and what the partition key would be.
  • Validate the query and reject queries that cannot be analyzed.
  • Determine the partition table / database. This could be done by a simple lookup, a hashing function or anything else.
  • Rewrite the query and replace the table names with the partition table names.
  • Execute the query on the correct database server and return the result back to the client.

    Some of the comments were concerned that a modulus scheme was being used to identify a partition. The recommendation was to use a directory service for mapping to partitions instead. A directory service allows you to logically map partitions behind the scenes and doesn't tie you to a deterministic physical mapping.

    After getting all this working they generously released it to the world as HSCALE - Transparent MySQL Partitioning: HSCALE is a plugin written for MySQL Proxy which allows you to transparently split up tables into multiple tables called partitions. In later versions you will be able to put each partition on a different MySQL server. Application based partitioning means that you split up your data logically and rewrite your application to select the right piece of data (i.e. partition) at any given time. More on application based partitioning. Read more here about what can be done with HSCALE. HSCALE helps in application based partitioning. Using the MySQL Proxy it sits between your application and the database server. Whenever a SQL statement is sent to the server HSCALE analyzes it to find out whether a partitioned table is used. It then tries to find out which partition the SQL statement should go to. Access release 0.1 at HSCALE 0.1 released - Partitioning Using MySQL Proxy.

    The transparent proxy ability is very powerful, but what's lacking, and what various companies have created internally, is a partition management layer. How do you move partitions? How do you split partitions when a table outgrows the shard or performance declines? Lots of cool tools still to build.
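    As a toy illustration of the lookup and rewrite steps above (the real HSCALE implementation is a MySQL Proxy plugin, and real SQL parsing is far more involved), here is a sketch in Python; the table name, partition count, directory map, and query are all hypothetical:

        import re

        PARTITIONS = 16

        def partition_by_modulus(table, key):
            # Deterministic scheme: a key is permanently tied to one partition table.
            return '%s_%d' % (table, int(key) % PARTITIONS)

        # The directory scheme recommended in the comments: an explicit, remappable
        # lookup, so partitions can later be moved or split without changing a hash.
        DIRECTORY = {('orders', b): 'orders_p%d' % b for b in range(PARTITIONS)}

        def partition_by_directory(table, key):
            return DIRECTORY[(table, int(key) % PARTITIONS)]

        def rewrite(query, table, key):
            # Replace the logical table name with the chosen partition table name.
            return re.sub(r'\b%s\b' % table, partition_by_directory(table, key), query)

        print(rewrite("SELECT * FROM orders WHERE user_id = 4711", 'orders', 4711))
        # -> SELECT * FROM orders_p7 WHERE user_id = 4711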

    Related Articles

  • HSCALE - Transparent MySQL Partitioning
  • Pero: HSCALE 0.1 released - Partitioning Using MySQL Proxy
  • Pero: MySQL Partitioning on Application Side
  • Pero: Progress on MySQL Proxy Partitioning
  • HighScalability: Flickr Architecture - more information on partitioning.
  • Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web
  • HighScalability: An Unorthodox Approach to Database Design : The Coming of the Shard.


Saturday
May 3, 2008

    Product: nginx

    Update 6: nginx_http_push_module. Turn nginx into a long-polling message queuing HTTP push server.

    Update 5: In Load Balancer Update Barry describes how WordPress.com moved from Pound to Nginx and are now "regularly serving about 8-9k requests/second and about 1.2Gbit/sec through a few Nginx instances and have plenty of room to grow!".
    Update 4: Nginx better than Pound for load balancing. Pound spikes at 80% CPU, Nginx uses 3% and is easier to understand and better documented.
    Update 3: igvita.com combines two cool tools for better performance in Nginx and Memcached, a 400% boost!
    Update 2: Software Project on Installing Nginx Web Server w/ PHP and SSL. Breaking away from mother Apache can be a scary proposition and this kind of getting started article really helps ease the separation.
    Update: Slicehost has some nice tutorials on setting up Nginx.

    From their website:
    Nginx ("engine x") is a high-performance HTTP server and reverse proxy, as well as an IMAP/POP3/SMTP proxy server. Nginx was written by Igor Sysoev for Rambler.ru, Russia's second-most visited website, where it has been running in production for over two and a half years. Igor has released the source code under a BSD-like license. Although still in beta, Nginx is known for its stability, rich feature set, simple configuration, and low resource consumption.



    Bob Ippolito says of Nginx:

    The only solution I know of that's extremely high performance that offers all of the features that you want is Nginx... I currently have Nginx doing reverse proxy of over tens of millions of HTTP requests per day (that's a few hundred per second) on a single server. At peak load it uses about 15MB RAM and 10% CPU on my particular configuration (FreeBSD 6).

    Under the same kind of load, Apache falls over (after using 1000 or so processes and god knows how much RAM), Pound falls over (too many threads, and using 400MB+ of RAM for all the thread stacks), and Lighty leaks more than 20MB per hour (and uses more CPU, but not significantly more).

    See Also

     

  • Nginx vs. Lighttpd for a small VPS
  • nginx: high performance smtp/pop/imap proxy
  • Light Weight Web Server
  • Nginx and Mirror on Demand
  • Running Drupal with Clean URL on Nginx or Lighttpd
  • Goodbye Pound, Hello Nginx
  • Using Nginx, SSI and Memcache to Make Your Web Applications Faster
  • Ruby on Rails hosting with Nginx
  • NGINX Tutorial: Developing Modules
Wednesday
Apr 2, 2008

    Product: Supervisor - Monitor and Control Your Processes

    It's a sad fact of life, but processes die. I know, it's horrible. You start them, send them out into process space, and hope for the best. Yet sometimes, despite your best coding, they core dump, seg fault, or some other calamity befalls them. Unlike our messy biological world so cruelly ruled by entropy, in the digital world processes can be given another chance. They can be restarted. A greater destiny awaits. And hopefully this time the random lottery of unforeseen killing factors will be avoided and a long productive life will be had by all. This is fun code to write because it's a lot more complicated than you might think. And restarting processes is a highly effective high availability strategy. Most faults are transient, caused by an unexpected series of events. Rather than taking drastic action, like taking a node out of production or failing over, transients can be effectively masked by simply restarting failed processes. Though complexity makes it a fun problem, it's also why you may want to "buy" rather than build. If you are in the market, Supervisor looks worth a visit. Adapted from their website: Supervisor is a Python program that allows you to start, stop, and restart other programs on UNIX systems. It can restart crashed processes.

  • It is often inconvenient to need to write "rc.d" scripts for every single process instance. rc.d scripts are a great lowest-common-denominator form of process initialization/autostart/management, but they can be painful to write and maintain. Additionally, rc.d scripts cannot automatically restart a crashed process and many programs do not restart themselves properly on a crash. Supervisord starts processes as its subprocesses, and can be configured to automatically restart them on a crash. It can also be configured to automatically start processes on its own invocation.
  • It's often difficult to get accurate up/down status on processes on UNIX. Pidfiles often lie. Supervisord starts processes as subprocesses, so it always knows the true up/down status of its children and can be queried conveniently for this data.
  • Users who need to control process state often need only to do that. They don't want or need full-blown shell access to the machine on which the processes are running. Supervisorctl allows a very limited form of access to the machine, essentially allowing users to see process status and control supervisord-controlled subprocesses by emitting "stop", "start", and "restart" commands from a simple shell or web UI.
  • Users often need to control processes on many machines. Supervisor provides a simple, secure, and uniform mechanism for interactively and automatically controlling processes on groups of machines.
  • Processes which listen on "low" TCP ports often need to be started and restarted as the root user (a UNIX misfeature). It's usually the case that it's perfectly fine to allow "normal" people to stop or restart such a process, but providing them with shell access is often impractical, and providing them with root access or sudo access is often impossible. It's also (rightly) difficult to explain to them why this problem exists. If supervisord is started as root, it is possible to allow "normal" users to control such processes without needing to explain the intricacies of the problem to them.
  • Processes often need to be started and stopped in groups, sometimes even in a "priority order". It's often difficult to explain to people how to do this. Supervisor allows you to assign priorities to processes, and allows users to emit commands via the supervisorctl client like "start all" and "restart all", which start them in the preassigned priority order. Additionally, processes can be grouped into "process groups" and a set of logically related processes can be stopped and started as a unit. Supervisor also has a web interface and an XML-RPC interface:
  • A (sparse) web user interface with functionality comparable to supervisorctl may be accessed via a browser if you start supervisord against an internet socket. Visit the server URL (e.g. http://localhost:9001/) to view and control process status through the web interface after activating the configuration file's [inet_http_server] section.
  • The same HTTP server which serves the web UI serves up an XML-RPC interface that can be used to interrogate and control supervisor and the programs it runs. To use the XML-RPC interface, connect to supervisor's http port with any XML-RPC client library and run commands against it. An example of doing this using Python's xmlrpclib client library is as follows.
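    A minimal sketch along those lines, assuming supervisord's [inet_http_server] section is listening on localhost:9001 (the process names printed depend on your configuration):

        import xmlrpclib   # xmlrpc.client in Python 3

        # Connect to supervisord's XML-RPC endpoint.
        server = xmlrpclib.Server('http://localhost:9001/RPC2')

        print(server.supervisor.getState())   # e.g. {'statecode': 1, 'statename': 'RUNNING'}
        for proc in server.supervisor.getAllProcessInfo():
            print('%s is %s' % (proc['name'], proc['statename']))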

    Related Articles

  • PyCon Presentation: Supervisor as a Platform
  • Monitor Pylons application with supervisord
  • Supervisor Manual


Sunday
Mar 16, 2008

    Product: GlusterFS

    Adapted from their website: GlusterFS is a clustered file-system capable of scaling to several petabytes. It aggregates various storage bricks over Infiniband RDMA or TCP/IP interconnect into one large parallel network file system. Storage bricks can be made of any commodity hardware, such as x86-64 servers with SATA-II RAID and Infiniband HBA.

    Cluster file systems are still not mature for the enterprise market. They are too complex to deploy and maintain, though they are extremely scalable and cheap and can be built entirely out of commodity OS and hardware. GlusterFS hopes to solve this problem.

    GlusterFS achieved 35 GBps read throughput. The GlusterFS Aggregated I/O Benchmark was performed on a 64-brick clustered storage system over 10 Gbps Infiniband interconnect. A cluster of 220 clients pounded the storage system with multiple dd (disk-dump) instances, each reading / writing a 1 GB file with 1 MB block size. GlusterFS was configured with the unify translator and round-robin scheduler.

    The advantages of GlusterFS are:
    * Designed for O(1) scalability and feature rich.
    * Aggregates on top of existing filesystems. Users can recover files and folders even without GlusterFS.
    * No single point of failure. Completely distributed. No centralized meta-data server like Lustre.
    * Extensible scheduling interface with modules loaded based on the user's storage I/O access pattern.
    * Modular and extensible through a powerful translator mechanism.
    * Supports Infiniband RDMA and TCP/IP.
    * Entirely implemented in user-space. Easy to port, debug and maintain.
    * Scales on demand.

    Related Articles

  • Technical Presentation on GlusterFS
  • Open Fest 5th Annual Conference
  • Zresearch
  • GlusterFS FAQ


Saturday
Mar 8, 2008

    Product: FAI - Fully Automatic Installation

    From their website: FAI is an automated installation tool to install or deploy Debian GNU/Linux and other distributions on a bunch of different hosts or a cluster. It's more flexible than other tools like kickstart for Red Hat, autoyast and alice for SuSE, or Jumpstart for Sun Solaris. FAI can also be used for configuration management of a running system.

    You can take one or more virgin PCs, turn on the power, and after a few minutes Linux is installed, configured, and running on all your machines, without any interaction necessary. FAI is a scalable method for installing and updating all your computers unattended with little effort involved. It's a centralized management system for your Linux deployment.

    FAI's target group is system administrators who have to install Linux onto one or even hundreds of computers. It's not only a tool for doing a cluster installation but a general purpose installation tool. It can be used for installing a Beowulf cluster, a rendering farm, a web server farm, or a Linux laboratory or classroom. Even installing an HPC cluster or a GRID, as well as fabric management, can be realized with FAI. Large-scale Linux networks with different hardware and different installation requirements are easy to establish using FAI and its class concept. Remote OS installations, Linux rollout, mass unattended installation, and automated server provisioning are other topics for FAI. The city of Munich is using the combination of GOsa and FAI for their LiMux project.

    Features:
    * Boot methods: network boot (PXE), CD-ROM, USB stick, floppy disk
    * Installs Debian, Ubuntu, SuSE, CentOS, Mandriva, Solaris, ...
    * Centralized installation and configuration management
    * Installs XEN domains and Vserver

    Related Articles

  • FAI wiki
  • FAI the fully automated installation framework for linux from Debian Administration
  • Fully Automatic Installation (FAI) Video Interview by http://www.perspektive89.com/
  • Rolling Out Unattended Debian Installations by Carla Schroder from LinuxPlanet
  • A talk on FAI and Debian


Saturday
Mar 8, 2008

    Product: DRBD - Distributed Replicated Block Device

    From their website: DRBD is a block device which is designed to build high availability clusters. This is done by mirroring a whole block device via a (dedicated) network. You could see it as network RAID-1. DRBD takes over the data, writes it to the local disk, and sends it to the other host, where it is written to that host's disk. The other components needed are a cluster membership service, which is supposed to be heartbeat, and some kind of application that works on top of a block device. Examples: a filesystem and fsck, a journaling FS, a database with recovery capabilities.

    Each device (DRBD provides more than one of these devices) has a state, which can be 'primary' or 'secondary'. On the node with the primary device the application is supposed to run and to access the device (/dev/drbdX). Every write is sent to the local 'lower level block device' and to the node with the device in 'secondary' state. The secondary device simply writes the data to its lower level block device. Reads are always carried out locally.

    If the primary node fails, heartbeat switches the secondary device into primary state and starts the application there. (If you are using it with a non-journaling FS this involves running fsck.) If the failed node comes up again, it is a new secondary node and has to synchronise its content to the primary. This, of course, will happen without interruption of service in the background. And, of course, we will only resynchronize those parts of the device that actually have been changed. DRBD has always done intelligent resynchronization when possible. Starting with the DRBD-0.7 series, you can define an "active set" of a certain size. This makes it possible to have a total resync time of 1-3 min, regardless of device size (currently up to 4TB), even after a hard crash of an active node.

    Related Articles

  • How to build a redundant, high-availability system with DRBD and Heartbeat by Pedro Pla in Linux Journal
  • Linux-HA Press Room with many excellent high availability articles.
  • Sync Data on All Servers thread.
  • MySQL clustering strategies and comparisons
  • Wikipedia on DRBD
  • Using Xen for High Availability Clusters by Kris Buytaert and Johan Huysmans in ONLamp.com
  • DRBD for MySQL High Availability


Wednesday
Feb 27, 2008

    Product: System Imager - Automate Deployment and Installs

    From their website: SystemImager is software that makes the installation of Linux to masses of similar machines relatively easy. It makes software distribution, configuration, and operating system updates easy, and can also be used for content distribution. SystemImager makes it easy to do automated installs (clones), software distribution, content or data distribution, configuration changes, and operating system updates to your network of Linux machines. You can even update from one Linux release version to another!

    It can also be used to ensure safe production deployments. By saving your current production image before updating to your new production image, you have a highly reliable contingency mechanism. If the new production environment is found to be flawed, simply roll back to the last production image with a simple update command!

    Some typical environments include: Internet server farms, database server farms, high performance clusters, computer labs, and corporate desktop environments.

    Related Articles

  • Cluster Admin's article Installing and updating your nodes is an excellent introduction to SystemImager. He says it's fast, scalable, simple, makes it easy to install on running nodes, allows management of different OS images and remote installation on any given group of nodes.
  • Automate Linux installation and recovery with SystemImager by Paul Virijevich


Thursday
Feb 21, 2008

    Product: Capistrano - Automate Remote Tasks Via SSH

    Update: Deployment with Capistrano by Charles Max Wood. Nice simple step-by-step for using Capistrano for deployment.

    From their website:
    Simply put, Capistrano is a tool for automating tasks on one or more remote servers. It executes commands in parallel on all targeted machines, and provides a mechanism for rolling back changes across multiple machines. It is ideal for anyone doing any kind of system administration, either professionally or incidentally.

    * Great for automating tasks via SSH on remote servers, like software installation, application deployment, configuration management, ad hoc server monitoring, and more.
    * Ideal for system administrators, whether professional or incidental.
    * Easy to customize. Its configuration files use the Ruby programming language syntax, but you don't need to know Ruby to do most things with Capistrano.
    * Easy to extend. Capistrano is written in the Ruby programming language, and may be extended easily by writing additional Ruby modules.


    One of the original use cases for Capistrano was for deploying web applications. (This is still by far its most popular use case.) In order to make deploying these applications reliable, Capistrano needed to ensure that if something went wrong during the deployment, changes made to that point on the other servers could be rolled back, leaving each server in its original state.

    If you ever need similar functionality in your own recipes, you can introduce a transaction:

    task :deploy do
      transaction do
        update_code
        symlink
      end
    end

    task :update_code do
      on_rollback { run "rm -rf #{release_path}" }
      source.checkout(release_path)
    end

    task :symlink do
      on_rollback { run "rm #{current_path}; ln -s #{previous_release} #{current_path}" }
      run "rm #{current_path}; ln -s #{release_path} #{current_path}"
    end

    The first task, “deploy”, wraps a transaction around its invocations of “update_code” and “symlink”. If an error happens within that transaction, the “on_rollback” handlers that have been declared are all executed, in reverse order.

    This does mean that transactions aren’t magical. They don’t really automatically track and revert your changes. You need to do that yourself, registering on_rollback handlers appropriately that take the necessary steps to undo the changes that the task has made. Still, even as lo-fi as Capistrano transactions are, they can be quite powerful when used properly.


    From the Ruby on Rails manual:

    Ultimately, Capistrano is a utility that can execute commands in parallel on multiple servers. It allows you to define tasks, which can include commands that are executed on the servers. You can also define roles for your servers, and then specify that certain tasks apply only to certain roles.

    Capistrano is very configurable. The default configuration includes a set of basic tasks applicable to web deployment. (More on these tasks will be said later.)

    Capistrano can do just about anything you can write shell script for. You just run those snippets of shell script on remote servers, possibly interacting with them based on their output. You can also upload files, and Capistrano includes some basic templating to allow you to dynamically create and deploy things like maintenance screens, configuration files, shell scripts, and more.

    Related Articles

     

  • Friends for Sale uses Capistrano for deployment.
