Entries in Product (120)

Tuesday
Jul 15, 2008

ZooKeeper - A Reliable, Scalable Distributed Coordination System 

ZooKeeper is a highly available and reliable coordination system. Distributed applications use ZooKeeper to store and mediate updates to key configuration information. ZooKeeper can be used for leader election, group membership, and configuration maintenance. In addition, ZooKeeper can be used for event notification, locking, and as a priority queue mechanism. It's a sort of central nervous system for distributed systems where the role of the brain is played by the coordination service, axons are the network, processes are the monitored and controlled body parts, and events are the hormones and neurotransmitters used for messaging. Every complex distributed application needs a coordination and orchestration system of some sort, so the ZooKeeper folks at Yahoo decided to build a good one and open source it for everyone to use.

The target market for ZooKeeper is multi-host, multi-process C and Java based systems that operate in a data center. ZooKeeper works by having distributed processes coordinate with each other through a shared hierarchical namespace that is modeled after a file system. Data is kept in memory and is backed up to a log for reliability. Because it is memory based, ZooKeeper is very fast and can handle the high loads typically seen in chatty coordination protocols across huge numbers of processes. Using a memory based system also means you are limited to the amount of data that can fit in memory, so it's not useful as a general data store. It's meant to store small bits of configuration information rather than large blobs. Replication is used for scalability and reliability, which means it prefers applications that are heavily read based.

As is typical of hierarchical systems, you can add nodes at any point of a tree, get a list of entries in a tree, get the value associated with an entry, and get notification of when an entry changes or goes away. Using these primitives and a little elbow grease you can construct the higher level services mentioned above.

Why would you ever need a distributed coordination system? It sounds kind of weird. That's the question I'll be addressing in this post, rather than how it works, because the slides and the video do a decent job of explaining at a high level what ZooKeeper can do. The low level details could use another paper, however. Reportedly it uses a version of the famous Paxos algorithm to keep replicas consistent in the face of the failures most daunting. What's really missing is a motivation showing how you can use a coordination service in your own system, and that's what I hope to provide...

Kevin Burton wants to use ZooKeeper to configure external monitoring systems like Munin and Ganglia for his Spinn3r blog indexing web service. He proposes that each service register its presence in a cluster with ZooKeeper under the tree "/services/www". A Munin configuration program will add a ZooKeeper watch on that node so it will be notified when the list of services under /services/www changes. When the Munin configuration program is notified of a change, it reads the service list and automatically regenerates a munin.conf file for the service. Why not simply use a database? Because of the guarantees ZooKeeper makes about its service:

  • Watches are ordered with respect to other events, other watches, and asynchronous replies. The ZooKeeper client libraries ensure that everything is dispatched in order.
  • A client will see a watch event for a znode it is watching before seeing the new data that corresponds to that znode.
  • The order of watch events from ZooKeeper corresponds to the order of the updates as seen by the ZooKeeper service.

You can't get these guarantees from an event system plopped on top of a database, and these are the sort of guarantees you need in a complex distributed system where connections drop, nodes fail, retransmits happen, and chaos rules the day. What rules the night is too terrible to contemplate. For example, it's important that a service-up event is seen after the service-down event or you may unnecessarily drop revenue producing work because of an out-of-order event issue. Not that I would know anything about this, mind you :-)

A weakness of ZooKeeper is that changes can be missed: because watches are one time triggers and there is latency between getting the event and sending a new request to get a watch, you cannot reliably see every change that happens to a node in ZooKeeper. Be prepared to handle the case where the znode changes multiple times between getting the event and setting the watch again. (You may not care, but at least realize it may happen.) This means that ZooKeeper is a state based system more than an event system. Watches are set as a side-effect of getting data, so you'll always have a valid initial state and on any subsequent change events you'll refresh to get new values. If you want to use events to log when and how something changed, for example, then you can't do that. You would have to include change history in the data itself.

Let's take a look at another example of where ZooKeeper could be useful. Picture a complex backend system running on, let's say, 100 nodes (maybe a lot less, maybe a lot more) in a data center. For example purposes assume the system is an ad system for serving advertisements to web sites. Ad systems are complex beasts that require a fair bit of coordination. Imagine all the subsystems needing to run on those 100 nodes: database, monitoring, fraud detectors, beacon servers, web server event log processors, failover servers, customer dashboards, targeting engines, campaign planners, campaign scenario testers, upgrades, installs, media managers, and so on. There's a lot going on. Now imagine the power in the data center flips and all the machines power on. How do all the processes across all the hosts know what to do? Now imagine everything is up and a few machines go down. How do all the processes know what to do in this situation? This is where a coordination service comes in. A coordination service acts as the backplane over which all these subsystems figure out what they should do relative to all the other subsystems in a product.

How, for example, do the ad servers know which database to use? Clearly there are many options for solving this problem. Using standard DNS naming conventions is one. Configuration files is another. Hard coding is still a favorite. Using a bootstrap service locator service is yet another (assuming you can bootstrap the bootstrap service). Ideally any solution must work just as well during unit testing, system testing, and deployment. In this scenario ZooKeeper acts as the service locator. Each process goes to ZooKeeper and finds out which is the primary database. If a new primary is elected, say because a host fails, then ZooKeeper sends an event that allows everyone dependent on the database to react by getting the new primary database.
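To make the watch patterns above concrete (the /services/www registry Kevin Burton describes, plus the one-time trigger caveat), here is a minimal sketch. It uses the Python kazoo client purely as an illustration; the post itself targets the C and Java bindings, and the ensemble address, host name, and munin.conf handling below are hypothetical:

    from kazoo.client import KazooClient

    zk = KazooClient(hosts='zk1:2181,zk2:2181,zk3:2181')  # hypothetical ensemble
    zk.start()

    # Each web service registers itself as an ephemeral znode; the entry
    # disappears automatically when the service's ZooKeeper session dies.
    zk.ensure_path('/services/www')
    zk.create('/services/www/host42:8080', b'up', ephemeral=True)

    def regenerate_munin_conf(services):
        # Hypothetical stand-in for rewriting munin.conf from the service list.
        print('munin.conf now covers: %s' % ', '.join(sorted(services)))

    def on_services_changed(event):
        # Watches are one-time triggers: re-reading the children both returns the
        # current membership and re-registers the watch. Changes that happened in
        # between are collapsed into the state read here.
        regenerate_munin_conf(
            zk.get_children('/services/www', watch=on_services_changed))

    # The initial read sets the first watch and provides a valid starting state.
    regenerate_munin_conf(
        zk.get_children('/services/www', watch=on_services_changed))

Because the watcher always re-reads the full list rather than trusting individual events, a service that flaps between events still leaves the generated configuration correct; what you lose is the history of intermediate changes.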
Having complicated retry logic in application code to fail over to another database server is simply a disaster, as every programmer will mess it up in their own way. Using a coordination service nicely deals with the problem of services locating other services in all scenarios. Of course, using a proxy like MySQL Proxy would remove even more application level complexity in dealing with failover and request routing.

How did the database servers decide which role they would play in the first place? All the database servers boot up and say "I'm a database server brave and strong, what's my role in life? Am I a primary or a secondary server? Or if I'm a shard, what key range do I serve?" If 10 servers are database servers, negotiating roles can be a very complicated and error prone process. A declarative approach, specifying a failover ring in configuration files, is a good one, but it's a pain to get working in local development and test environments as the machines are always changing. It's easier to let the database servers come up and self-organize, both on initial role election and in failure scenarios. The advantage of this approach is that the system can run locally on one machine or on a dozen machines in the data center with very little effort. ZooKeeper supports this type of coordination behavior.

Now let's say I want to change the configuration of ad targeting state machines currently running in 40 processes on 40 different hosts. How do I do that? The first approach is no approach: most systems make it so a new code release has to happen, which is very sloooow. Another approach is a configuration file. Configuration is put in a distribution package and pushed to all nodes. Each process then periodically checks to see if the configuration file has changed and, if it has, reads the new configuration. That's the basics. Infinite variations can follow. You can have configuration for different subsystems. There's complexity because you have to know what packages are running on which nodes. You have to deal with rollback in case all packages don't push correctly. You have to change the configuration, make a package, test it, then push it to the data center operations team, which may take a while to perform the upgrade. It's a slow process. And when you get into changes that impact multiple subsystems it gets even more complicated. Another approach I've taken is to embed a web server in each process so you can see the metrics and change the configuration for each process on the fly. While powerful for a single process, it's harder to manipulate sets of processes across a data center using this approach.

Using ZooKeeper I can store my state machine definition as a node, which is loaded from the static configuration collected from every distribution package in a product. Every process dependent on that node can register as a watcher when it initially reads the state machine. When the state machine is updated, all dependent entities get an event that causes them to reload the state machine into the process. Simple and straightforward. All processes will eventually get the change, and any rebooting processes will pick up the new state machine on initialization. A very cool way to reliably and centrally control a large distributed application.

One caveat is that I don't see much activity on the ZooKeeper forum, and the questions that do get asked largely go unanswered. Not a good sign when considering such a key piece of infrastructure.
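Here is a minimal sketch of the state-machine-as-a-znode idea described above, again using the Python kazoo client for illustration; the znode path and the parse/install helpers are hypothetical placeholders:

    from kazoo.client import KazooClient

    zk = KazooClient(hosts='zk1:2181')
    zk.start()

    CONFIG_PATH = '/config/ad-targeting/state-machine'   # hypothetical znode

    def parse_state_machine(data):
        return data.decode('utf-8')                      # stand-in for real parsing

    def install_state_machine(definition, version):
        print('installed state machine version %d' % version)

    def load_state_machine(event=None):
        # Reading the znode also (re)sets the watch, so a process always starts
        # from a valid definition and refreshes on every subsequent change.
        data, stat = zk.get(CONFIG_PATH, watch=load_state_machine)
        install_state_machine(parse_state_machine(data), stat.version)

    load_state_machine()   # initial load; a restarted process picks up the latest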
Another caveat that may not be obvious on first reading is that your application state machine using ZooKeeper will have to be intimately tied to ZooKeeper's state machine. When a ZooKeeper server dies, for example, your application must process that event and reestablish all your watches on a new server. When a watch event comes in, your application must handle the event and set new watches. The algorithms to carry out higher level operations like locks and queues are driven by multi-step state machines that must be correctly managed by your application. And as ZooKeeper deals with state that is probably also stored in your application, it's important to worry about thread safety. Callbacks from the ZooKeeper thread could access shared data structures. An Actor model, where you dump ZooKeeper events into your own Actor queues, could be a very useful application architecture here for synthesizing different state machines in a thread safe manner.
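One way to picture that Actor-style approach: the ZooKeeper callbacks do nothing but enqueue events, and a single worker thread owns all application state. A minimal sketch, with the ZooKeeper wiring and the actual event handling left as hypothetical stubs:

    import queue
    import threading

    events = queue.Queue()

    def handle(event):
        # Hypothetical stub: advance your own application state machine(s) here.
        print('handling %r' % (event,))

    def on_zookeeper_event(event):
        # Runs on the ZooKeeper client's callback thread; don't touch shared
        # application state here, just hand the event off to the queue.
        events.put(event)

    def worker():
        # The only thread that mutates application state, so the state machines
        # it drives need no locking.
        while True:
            handle(events.get())

    threading.Thread(target=worker, daemon=True).start()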

    Some Fast Facts

  • How is data partitioned across multiple machines? Complete replication in memory. (Yes, this is limiting.)
  • How do updates happen (interaction across machines)? All updates flow through the master and are considered complete when a quorum confirms the update.
  • How do reads happen (is getting a stale copy possible)? Reads go to any member of the cluster. Yes, stale copies can be returned. Typically, these are very fresh, however.
  • What is the responsibility of a leader? To assign serial ids to all updates and confirm that a quorum has received the update.
  • There are several limitations that stand out in this architecture:
    - Complete replication limits the total size of data that can be managed using ZooKeeper. This is acceptable in some applications, not acceptable in others. In the original domain of ZooKeeper (management of configuration and status) it isn't an issue, but ZooKeeper is good enough to encourage creative misuse where this can become a problem.
    - Serializing all updates through a single leader can be a performance bottleneck. On the other hand, it is possible to push 50K updates per second through a leader and the associated quorum, so this limit is pretty high.
    - The data storage model is completely non-relational.
    These answers were provided by Ted Dunning on the Cloud Computing group.

    Related Articles

  • An Introduction to ZooKeeper Video (Hadoop and Distributed Computing at Yahoo!) (PDF)
  • ZooKeeper Home, Email List, and Recipes (which has some odd connotations for a Zoo).
  • The Chubby Lock Service for Loosely-Coupled Distributed Systems from Google
  • Paxos Made Live – An Engineering Perspective by Tushar Chandra, Robert Griesemer, and Joshua Redstone from Google.
  • Updates on Open Source Distributed Consensus by Kevin Burton
  • Using ZooKeeper to configure External Monitoring Systems by Kevin Burton
  • Paxos Made Simple by Leslie Lamport
  • Hyperspace - API description of Hyperspace, a Chubby-like service
  • Notes on ZooKeeper at the Hadoop Summit by James Hamilton.


Sunday
May 25, 2008

    Product: Condor - Compute Intensive Workload Management

    From their website: Condor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, Condor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to Condor, Condor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion.

    While providing functionality similar to that of a more traditional batch queueing system, Condor's novel architecture allows it to succeed in areas where traditional scheduling systems fail. Condor can be used to manage a cluster of dedicated compute nodes (such as a "Beowulf" cluster). In addition, unique mechanisms enable Condor to effectively harness wasted CPU power from otherwise idle desktop workstations. For instance, Condor can be configured to only use desktop machines where the keyboard and mouse are idle. Should Condor detect that a machine is no longer available (such as a key press being detected), in many circumstances Condor is able to transparently produce a checkpoint and migrate a job to a different machine which would otherwise be idle.

    Condor does not require a shared file system across machines - if no shared file system is available, Condor can transfer the job's data files on behalf of the user, or Condor may be able to transparently redirect all the job's I/O requests back to the submit machine. As a result, Condor can be used to seamlessly combine all of an organization's computational power into one resource.

    Related Articles

  • High Throughput Computing by Miron Livny
  • Condor Presentations
  • (my) Principles of Distributed Computing by Miron Livny


Monday
May 5, 2008

    HSCALE - Handling 200 Million Transactions Per Month Using Transparent Partitioning With MySQL Proxy

    Update 2: A HSCALE benchmark finds HSCALE "adds a maximum overhead of about 0.24 ms per query (against a partitioned table)." Future releases promise much improved results. Update: A new presentation at An Introduction to HSCALE.

    After writing Skype Plans for PostgreSQL to Scale to 1 Billion Users, which shows how Skype smartly uses a proxy architecture for scaling, I'm now seeing MySQL Proxy articles all over the place. It's like those "get rich quick" books that say all you have to do is visualize a giraffe with a big yellow dot superimposed over it and by sympathetic magic giraffes will suddenly stampede into your life. Without realizing it I must have visualized transparent proxies smothered in yellow dots.

    One of the brightest images is a wonderful series of articles by Peter Romianowski describing the evolution of their proxy architecture. Their application is an OLTP system executing 200 million transactions per month, with tables of more than 1.5 billion rows and a 600 GB total database size. They ran into a wall buying bigger boxes and wanted to move to a sharded architecture. The question for them was: how do you implement sharding? In the first article four approaches to sharding were identified:

  • Using MySQL Cluster
  • Using MySQL Proxy with transparent query rewriting and load balancing
  • Implement it into a JDBC driver
  • Implement it into the application data access layer.

    The proxy solution was selected because it's transparent to the application layer. Applications need not know about the partitioning scheme to make it work. Not mucking with apps is a big win. The downside is implementation complexity. How do you parse a query and map it correctly to the right server? Will this cause a big performance degradation? How is this new, more complex, and dynamic system to be tested? Can they run the same queries they did before or will they have to rewrite parts of their application? A lot of questions to be worked out.

    The second article starts working out those problems using MySQL Proxy. The process was broken into a few steps:
  • Analyze the query to find out which tables are involved and what the partition key would be.
  • Validate the query and reject queries that cannot be analyzed.
  • Determine the partition table / database. This could be done by a simple lookup, a hashing function or anything else.
  • Rewrite the query and replace the table names with the partition table names.
  • Execute the query on the correct database server and return the result back to the client.

    Some of the comments were concerned that a modulus scheme was being used to identify a partition. The recommendation was to use a directory service for mapping to partitions instead. A directory service allows you to logically map partitions behind the scenes and doesn't tie you to a deterministic physical mapping.

    After getting all this working they generously released it to the world as HSCALE - Transparent MySQL Partitioning: HSCALE is a plugin written for MySQL Proxy which allows you to transparently split up tables into multiple tables called partitions. In later versions you will be able to put each partition on a different MySQL server. Application based partitioning means that you split up your data logically and rewrite your application to select the right piece of data (i.e. partition) at any given time. More on application based partitioning. Read more here about what can be done with HSCALE. HSCALE helps in application based partitioning. Using the MySQL Proxy it sits between your application and the database server. Whenever a SQL statement is sent to the server HSCALE analyzes it to find out whether a partitioned table is used. It then tries to find out which partition the SQL statement should go to. Access release 0.1 at HSCALE 0.1 released - Partitioning Using MySQL Proxy.

    The transparent proxy ability is very powerful, but what's lacking, and what various companies have created internally, is a partition management layer. How do you move partitions? How do you split partitions when a table outgrows the shard or performance declines? Lots of cool tools still to build.
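    As a toy illustration of the lookup and rewrite steps above (the real HSCALE implementation is a MySQL Proxy plugin, and real SQL parsing is far more involved), here is a sketch in Python; the table name, partition count, directory map, and query are all hypothetical:

        import re

        PARTITIONS = 16

        def partition_by_modulus(table, key):
            # Deterministic scheme: a key is permanently tied to one partition table.
            return '%s_%d' % (table, int(key) % PARTITIONS)

        # The directory scheme recommended in the comments: an explicit, remappable
        # lookup, so partitions can later be moved or split without changing a hash.
        DIRECTORY = {('orders', b): 'orders_p%d' % b for b in range(PARTITIONS)}

        def partition_by_directory(table, key):
            return DIRECTORY[(table, int(key) % PARTITIONS)]

        def rewrite(query, table, key):
            # Replace the logical table name with the chosen partition table name.
            return re.sub(r'\b%s\b' % table, partition_by_directory(table, key), query)

        print(rewrite("SELECT * FROM orders WHERE user_id = 4711", 'orders', 4711))
        # -> SELECT * FROM orders_p7 WHERE user_id = 4711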

    Related Articles

  • HSCALE - Transparent MySQL Partitioning
  • Pero: HSCALE 0.1 released - Partitioning Using MySQL Proxy
  • Pero: MySQL Partitioning on Application Side
  • Pero: Progress on MySQL Proxy Partitioning
  • HighScalability: Flickr Architecture - more information on partitioning.
  • Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web
  • HighScalability: An Unorthodox Approach to Database Design : The Coming of the Shard.


Saturday
May 3, 2008

    Product: nginx

    Update 6: nginx_http_push_module. Turn nginx into a long-polling message queuing HTTP push server.

    Update 5: In Load Balancer Update Barry describes how WordPress.com moved from Pound to Nginx and are now "regularly serving about 8-9k requests/second and about 1.2Gbit/sec through a few Nginx instances and have plenty of room to grow!".
    Update 4: Nginx better than Pound for load balancing. Pound spikes at 80% CPU, Nginx uses 3% and is easier to understand and better documented.
    Update 3: igvita.com combines two cool tools for better performance in Nginx and Memcached, a 400% boost!
    Update 2: Software Project on Installing Nginx Web Server w/ PHP and SSL. Breaking away from mother Apache can be a scary proposition and this kind of getting started article really helps ease the separation.
    Update: Slicehost has some nice tutorials on setting up Nginx.

    From their website:
    Nginx ("engine x") is a high-performance HTTP server and reverse proxy, as well as an IMAP/POP3/SMTP proxy server. Nginx was written by Igor Sysoev for Rambler.ru, Russia's second-most visited website, where it has been running in production for over two and a half years. Igor has released the source code under a BSD-like license. Although still in beta, Nginx is known for its stability, rich feature set, simple configuration, and low resource consumption.



    Bob Ippolito says of Nginx:

    The only solution I know of that's extremely high performance that offers all of the features that you want is Nginx... I currently have Nginx doing reverse proxy of over tens of millions of HTTP requests per day (that's a few hundred per second) on a single server. At peak load it uses about 15MB RAM and 10% CPU on my particular configuration (FreeBSD 6).

    Under the same kind of load, Apache falls over (after using 1000 or so processes and god knows how much RAM), Pound falls over (too many threads, and using 400MB+ of RAM for all the thread stacks), and Lighty leaks more than 20MB per hour (and uses more CPU, but not significantly more).

    See Also

     

  • Nginx vs. Lighttpd for a small VPS
  • nginx: high performance smtp/pop/imap proxy
  • Light Weight Web Server
  • Nginx and Mirror on Demand
  • Running Drupal with Clean URL on Nginx or Lighttpd
  • Goodbye Pound, Hello Nginx
  • Using Nginx, SSI and Memcache to Make Your Web Applications Faster
  • Ruby on Rails hosting with Nginx
  • NGINX Tutorial: Developing Modules
Wednesday
Apr 2, 2008

    Product: Supervisor - Monitor and Control Your Processes

    It's a sad fact of life, but processes die. I know, it's horrible. You start them, send them out into process space, and hope for the best. Yet sometimes, despite your best coding, they core dump, seg fault, or some other calamity befalls them. Unlike our messy biological world so cruelly ruled by entropy, in the digital world processes can be given another chance. They can be restarted. A greater destiny awaits. And hopefully this time the random lottery of unforeseen killing factors will be avoided and a long productive life will be had by all. This is fun code to write because it's a lot more complicated than you might think. And restarting processes is a highly effective high availability strategy. Most faults are transient, caused by an unexpected series of events. Rather than taking drastic action, like taking a node out of production or failing over, transients can be effectively masked by simply restarting failed processes. Though complexity makes it a fun problem, it's also why you may want to "buy" rather than build. If you are in the market, Supervisor looks worth a visit. Adapted from their website: Supervisor is a Python program that allows you to start, stop, and restart other programs on UNIX systems. It can restart crashed processes.

  • It is often inconvenient to need to write "rc.d" scripts for every single process instance. rc.d scripts are a great lowest-common-denominator form of process initialization/autostart/management, but they can be painful to write and maintain. Additionally, rc.d scripts cannot automatically restart a crashed process and many programs do not restart themselves properly on a crash. Supervisord starts processes as its subprocesses, and can be configured to automatically restart them on a crash. It can also be configured to automatically start processes on its own invocation.
  • It's often difficult to get accurate up/down status on processes on UNIX. Pidfiles often lie. Supervisord starts processes as subprocesses, so it always knows the true up/down status of its children and can be queried conveniently for this data.
  • Users who need to control process state often need only to do that. They don't want or need full-blown shell access to the machine on which the processes are running. Supervisorctl allows a very limited form of access to the machine, essentially allowing users to see process status and control supervisord-controlled subprocesses by emitting "stop", "start", and "restart" commands from a simple shell or web UI.
  • Users often need to control processes on many machines. Supervisor provides a simple, secure, and uniform mechanism for interactively and automatically controlling processes on groups of machines.
  • Processes which listen on "low" TCP ports often need to be started and restarted as the root user (a UNIX misfeature). It's usually the case that it's perfectly fine to allow "normal" people to stop or restart such a process, but providing them with shell access is often impractical, and providing them with root access or sudo access is often impossible. It's also (rightly) difficult to explain to them why this problem exists. If supervisord is started as root, it is possible to allow "normal" users to control such processes without needing to explain the intricacies of the problem to them.
  • Processes often need to be started and stopped in groups, sometimes even in a "priority order". It's often difficult to explain to people how to do this. Supervisor allows you to assign priorities to processes, and allows users to emit commands via the supervisorctl client like "start all" and "restart all", which start them in the preassigned priority order. Additionally, processes can be grouped into "process groups" and a set of logically related processes can be stopped and started as a unit. Supervisor also has a web interface and an XML-RPC interface:
  • A (sparse) web user interface with functionality comparable to supervisorctl may be accessed via a browser if you start supervisord against an internet socket. Visit the server URL (e.g. http://localhost:9001/) to view and control process status through the web interface after activating the configuration file's [inet_http_server] section.
  • The same HTTP server which serves the web UI serves up an XML-RPC interface that can be used to interrogate and control supervisor and the programs it runs. To use the XML-RPC interface, connect to supervisor's http port with any XML-RPC client library and run commands against it. An example of doing this using Python's xmlrpclib client library is as follows.
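    A minimal sketch along those lines, assuming supervisord's [inet_http_server] section is listening on localhost:9001 (the process names printed depend on your configuration):

        import xmlrpclib   # xmlrpc.client in Python 3

        # Connect to supervisord's XML-RPC endpoint.
        server = xmlrpclib.Server('http://localhost:9001/RPC2')

        print(server.supervisor.getState())   # e.g. {'statecode': 1, 'statename': 'RUNNING'}
        for proc in server.supervisor.getAllProcessInfo():
            print('%s is %s' % (proc['name'], proc['statename']))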

    Related Articles

  • PyCon Presentation: Supervisor as a Platform
  • Monitor Pylons application with supervisord
  • Supervisor Manual


Sunday
Mar 16, 2008

    Product: GlusterFS

    Adapted from their website: GlusterFS is a clustered file-system capable of scaling to several petabytes. It aggregates various storage bricks over Infiniband RDMA or TCP/IP interconnect into one large parallel network file system. Storage bricks can be made of any commodity hardware, such as x86-64 servers with SATA-II RAID and Infiniband HBA.

    Cluster file systems are still not mature for the enterprise market. They are too complex to deploy and maintain, though they are extremely scalable and cheap and can be built entirely out of commodity OS and hardware. GlusterFS hopes to solve this problem.

    GlusterFS achieved 35 GBps read throughput. The GlusterFS Aggregated I/O Benchmark was performed on a 64-brick clustered storage system over 10 Gbps Infiniband interconnect. A cluster of 220 clients pounded the storage system with multiple dd (disk-dump) instances, each reading / writing a 1 GB file with 1 MB block size. GlusterFS was configured with the unify translator and round-robin scheduler.

    The advantages of GlusterFS are:
    * Designed for O(1) scalability and feature rich.
    * Aggregates on top of existing filesystems. Users can recover files and folders even without GlusterFS.
    * No single point of failure. Completely distributed. No centralized meta-data server like Lustre.
    * Extensible scheduling interface with modules loaded based on the user's storage I/O access pattern.
    * Modular and extensible through a powerful translator mechanism.
    * Supports Infiniband RDMA and TCP/IP.
    * Entirely implemented in user-space. Easy to port, debug and maintain.
    * Scales on demand.

    Related Articles

  • Technical Presentation on GlusterFS
  • Open Fest 5th Annual Conference
  • Zresearch
  • GlusterFS FAQ


Saturday
Mar 8, 2008

    Product: FAI - Fully Automatic Installation

    From their website: FAI is an automated installation tool to install or deploy Debian GNU/Linux and other distributions on a bunch of different hosts or a cluster. It's more flexible than other tools like kickstart for Red Hat, autoyast and alice for SuSE, or Jumpstart for Sun Solaris. FAI can also be used for configuration management of a running system.

    You can take one or more virgin PCs, turn on the power, and after a few minutes Linux is installed, configured, and running on all your machines, without any interaction necessary. FAI is a scalable method for installing and updating all your computers unattended with little effort involved. It's a centralized management system for your Linux deployment.

    FAI's target group is system administrators who have to install Linux onto one or even hundreds of computers. It's not only a tool for doing a cluster installation but a general purpose installation tool. It can be used for installing a Beowulf cluster, a rendering farm, a web server farm, or a Linux laboratory or classroom. Even installing an HPC cluster or a GRID, as well as fabric management, can be realized with FAI. Large-scale Linux networks with different hardware and different installation requirements are easy to establish using FAI and its class concept. Remote OS installations, Linux rollout, mass unattended installation, and automated server provisioning are other topics for FAI. The city of Munich is using the combination of GOsa and FAI for their LiMux project.

    Features:
    * Boot methods: network boot (PXE), CD-ROM, USB stick, floppy disk
    * Installs Debian, Ubuntu, SuSE, CentOS, Mandriva, Solaris, ...
    * Centralized installation and configuration management
    * Installs XEN domains and Vserver

    Related Articles

  • FAI wiki
  • FAI the fully automated installation framework for linux from Debian Administration
  • Fully Automatic Installation (FAI) Video Interview by http://www.perspektive89.com/
  • Rolling Out Unattended Debian Installations by Carla Schroder from LinuxPlanet
  • A talk on FAI and Debian


Saturday
Mar 8, 2008

    Product: DRBD - Distributed Replicated Block Device

    From their website: DRBD is a block device which is designed to build high availability clusters. This is done by mirroring a whole block device via a (dedicated) network. You could see it as network RAID-1. DRBD takes over the data, writes it to the local disk, and sends it to the other host, where it is written to that host's disk. The other components needed are a cluster membership service, which is supposed to be heartbeat, and some kind of application that works on top of a block device. Examples: a filesystem and fsck, a journaling FS, a database with recovery capabilities.

    Each device (DRBD provides more than one of these devices) has a state, which can be 'primary' or 'secondary'. On the node with the primary device the application is supposed to run and to access the device (/dev/drbdX). Every write is sent to the local 'lower level block device' and to the node with the device in 'secondary' state. The secondary device simply writes the data to its lower level block device. Reads are always carried out locally.

    If the primary node fails, heartbeat switches the secondary device into primary state and starts the application there. (If you are using it with a non-journaling FS this involves running fsck.) If the failed node comes up again, it is a new secondary node and has to synchronise its content to the primary. This, of course, will happen without interruption of service in the background. And, of course, we will only resynchronize those parts of the device that actually have been changed. DRBD has always done intelligent resynchronization when possible. Starting with the DRBD-0.7 series, you can define an "active set" of a certain size. This makes it possible to have a total resync time of 1-3 min, regardless of device size (currently up to 4TB), even after a hard crash of an active node.

    Related Articles

  • How to build a redundant, high-availability system with DRBD and Heartbeat by Pedro Pla in Linux Journal
  • Linux-HA Press Room with many excellent high availability articles.
  • Sync Data on All Servers thread.
  • MySQL clustering strategies and comparisons
  • Wikipedia on DRBD
  • Using Xen for High Availability Clusters by Kris Buytaert and Johan Huysmans in ONLamp.com
  • DRBD for MySQL High Availability


Wednesday
Feb 27, 2008

    Product: System Imager - Automate Deployment and Installs

    From their website: SystemImager is software that makes the installation of Linux to masses of similar machines relatively easy. It makes software distribution, configuration, and operating system updates easy, and can also be used for content distribution. SystemImager makes it easy to do automated installs (clones), software distribution, content or data distribution, configuration changes, and operating system updates to your network of Linux machines. You can even update from one Linux release version to another!

    It can also be used to ensure safe production deployments. By saving your current production image before updating to your new production image, you have a highly reliable contingency mechanism. If the new production environment is found to be flawed, simply roll back to the last production image with a simple update command!

    Some typical environments include: Internet server farms, database server farms, high performance clusters, computer labs, and corporate desktop environments.

    Related Articles

  • Cluster Admin's article Installing and updating your nodes is an excellent introduction to SystemImager. He says it's fast, scalable, simple, makes it easy to install on running nodes, allows management of different OS images and remote installation on any given group of nodes.
  • Automate Linux installation and recovery with SystemImager by Paul Virijevich


Thursday
Feb 21, 2008

    Product: Capistrano - Automate Remote Tasks Via SSH

    Update: Deployment with Capistrano by Charles Max Wood. Nice simple step-by-step for using Capistrano for deployment.

    From their website:
    Simply put, Capistrano is a tool for automating tasks on one or more remote servers. It executes commands in parallel on all targeted machines, and provides a mechanism for rolling back changes across multiple machines. It is ideal for anyone doing any kind of system administration, either professionally or incidentally.

    * Great for automating tasks via SSH on remote servers, like software installation, application deployment, configuration management, ad hoc server monitoring, and more.
    * Ideal for system administrators, whether professional or incidental.
    * Easy to customize. Its configuration files use the Ruby programming language syntax, but you don't need to know Ruby to do most things with Capistrano.
    * Easy to extend. Capistrano is written in the Ruby programming language, and may be extended easily by writing additional Ruby modules.


    One of the original use cases for Capistrano was for deploying web applications. (This is still by far its most popular use case.) In order to make deploying these applications reliable, Capistrano needed to ensure that if something went wrong during the deployment, changes made to that point on the other servers could be rolled back, leaving each server in its original state.

    If you ever need similar functionality in your own recipes, you can introduce a transaction:

    task :deploy do
      transaction do
        update_code
        symlink
      end
    end

    task :update_code do
      on_rollback { run "rm -rf #{release_path}" }
      source.checkout(release_path)
    end

    task :symlink do
      on_rollback { run "rm #{current_path}; ln -s #{previous_release} #{current_path}" }
      run "rm #{current_path}; ln -s #{release_path} #{current_path}"
    end

    The first task, “deploy”, wraps a transaction around its invocations of “update_code” and “symlink”. If an error happens within that transaction, the “on_rollback” handlers that have been declared are all executed, in reverse order.

    This does mean that transactions aren’t magical. They don’t really automatically track and revert your changes. You need to do that yourself, registering on_rollback handlers appropriately that take the necessary steps to undo the changes that the task has made. Still, even as lo-fi as Capistrano transactions are, they can be quite powerful when used properly.


    From the Ruby on Rails manual:

    Ultimately, Capistrano is a utility that can execute commands in parallel on multiple servers. It allows you to define tasks, which can include commands that are executed on the servers. You can also define roles for your servers, and then specify that certain tasks apply only to certain roles.

    Capistrano is very configurable. The default configuration includes a set of basic tasks applicable to web deployment. (More on these tasks will be said later.)

    Capistrano can do just about anything you can write shell script for. You just run those snippets of shell script on remote servers, possibly interacting with them based on their output. You can also upload files, and Capistrano includes some basic templating to allow you to dynamically create and deploy things like maintenance screens, configuration files, shell scripts, and more.

    Related Articles

     

  • Friends for Sale uses Capistrano for deployment.
