Saturday
Mar 08 2008

Product: FAI - Fully Automatic Installation

From their website: FAI is an automated installation tool to install or deploy Debian GNU/Linux and other distributions on a bunch of different hosts or a cluster. It's more flexible than other tools like kickstart for Red Hat, autoyast and alice for SuSE, or Jumpstart for Sun Solaris. FAI can also be used for configuration management of a running system.

You can take one or more virgin PCs, turn on the power, and after a few minutes Linux is installed, configured, and running on all your machines, without any interaction necessary. FAI is a scalable method for installing and updating all your computers unattended with little effort involved. It's a centralized management system for your Linux deployment.

FAI's target group is system administrators who have to install Linux onto one or even hundreds of computers. It's not only a tool for doing a cluster installation but a general purpose installation tool. It can be used for installing a Beowulf cluster, a rendering farm, a web server farm, or a Linux laboratory or classroom. Even installing an HPC cluster or a GRID, and fabric management, can be realized by FAI. Large-scale Linux networks with different hardware and different installation requirements are easy to establish using FAI and its class concept. Remote OS installations, Linux rollout, mass unattended installation, and automated server provisioning are other topics for FAI. The city of Munich is using the combination of GOsa and FAI for their LiMux project.

Features:

  • Boot methods: network boot (PXE), CD-ROM, USB stick, floppy disk
  • Installs Debian, Ubuntu, SuSE, CentOS, Mandriva, Solaris, ...
  • Centralized installation and configuration management
  • Installs XEN domains and Vserver

Related Articles

  • FAI wiki
  • FAI, the fully automated installation framework for Linux, from Debian Administration
  • Fully Automatic Installation (FAI) Video Interview by http://www.perspektive89.com/
  • Rolling Out Unattended Debian Installations by Carla Schroder from LinuxPlanet
  • A talk on FAI and Debian


    Saturday
    Mar 08 2008

    Product: DRBD - Distributed Replicated Block Device

    From their website: DRBD is a block device which is designed to build high availability clusters. This is done by mirroring a whole block device via a (dedicated) network. You could see it as a network RAID-1.

    DRBD takes over the data, writes it to the local disk, and sends it to the other host. On the other host, it writes the data to the disk there. The other components needed are a cluster membership service, which is supposed to be heartbeat, and some kind of application that works on top of a block device. Examples: a filesystem & fsck, a journaling FS, a database with recovery capabilities.

    Each device (DRBD provides more than one of these devices) has a state, which can be 'primary' or 'secondary'. On the node with the primary device the application is supposed to run and to access the device (/dev/drbdX). Every write is sent to the local 'lower level block device' and to the node with the device in 'secondary' state. The secondary device simply writes the data to its lower level block device. Reads are always carried out locally.

    If the primary node fails, heartbeat switches the secondary device into primary state and starts the application there. (If you are using it with a non-journaling FS this involves running fsck.) If the failed node comes up again, it is a new secondary node and has to synchronise its content to the primary. This, of course, will happen without interruption of service in the background. And, of course, it will only resynchronize those parts of the device that have actually been changed. DRBD has always done intelligent resynchronization when possible. Starting with the DRBD-0.7 series, you can define an "active set" of a certain size. This makes it possible to have a total resync time of 1--3 min, regardless of device size (currently up to 4TB), even after a hard crash of an active node.
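    To make the write path concrete, here is a tiny illustrative Python sketch of the primary/secondary mirroring idea described above. The class and method names are invented for the example, and in-memory dictionaries stand in for real block devices; actual DRBD does this inside the kernel at the block layer.

        # Illustrative sketch only: DRBD-style primary/secondary mirroring.
        # Dictionaries stand in for the 'lower level block devices'.

        class SecondaryNode:
            def __init__(self):
                self.disk = {}

            def replicate(self, block, data):
                # The secondary simply writes incoming data to its own disk.
                self.disk[block] = data

        class PrimaryNode:
            def __init__(self, secondary):
                self.disk = {}
                self.secondary = secondary

            def write(self, block, data):
                # Every write goes to the local disk AND to the peer node.
                self.disk[block] = data
                self.secondary.replicate(block, data)

            def read(self, block):
                # Reads are always carried out locally.
                return self.disk[block]

        secondary = SecondaryNode()
        primary = PrimaryNode(secondary)
        primary.write(0, b"hello")
        assert primary.read(0) == secondary.disk[0]   # both copies match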

    Related Articles

  • How to build a redundant, high-availability system with DRBD and Heartbeat by Pedro Pla in Linux Journal
  • Linux-HA Press Room with many excellent high availability articles.
  • Sync Data on All Servers thread.
  • MySQL clustering strategies and comparisons
  • Wikipedia on DRBD
  • Using Xen for High Availability Clusters by Kris Buytaert and Johan Huysmans on ONLamp.com
  • DRBD for MySQL High Availability


    Thursday
    Mar 06 2008

    Announce: First Meeting of Boston Scalability User Group

    The first meeting will take place on Wednesday March 26 at 6 p.m. in the IBM Innovation Center (Waltham, MA). The first speaker will be Patrick Peralta of Oracle! Patrick will be presenting: Orchestrating Messaging, Data Grid and Database for Scalable Performance. Important Note: There will be pizza at this meeting! The site is at: http://www.bostonsug.org/


    Wednesday
    Mar 05 2008

    Oprah is the Real Social Network

    A lot of new internet TV station startups are in the wind these days, and there's a question about how they can scale their broadcasts. Today's state of the art shows you can't yet mimic the reach of broadcast TV with internet tech. But as Oprah proves, you can still capture a lot of eyeballs, if you are Oprah...

    Oprah drew a stunning 500,000 simultaneous viewers for an Eckhart Tolle webcast. Move Networks and Limelight Networks hosted the "broadcast", where traffic peaked at 242Gbps. A variable bitrate scheme was used, so depending on their connection a viewer could have seen as little as 150Kbps or as high as 750Kbps. Dan Rayburn's big takeaway from this webcast is that it is proof the Internet is not built to handle TV-like distribution; for those who think live TV shows will be broadcast on the Internet with millions and millions of people watching, it's just not going to happen. To handle more users, commenters suggested capping the bitrate at 300K, using P2P streaming, or using a CDN more specialized in live streaming.

    I went to Oprah's website and was a bit shocked to find she didn't have a full blown social network available. Can you imagine if she did? Oprah's army would seem to be a highly desirable bunch to monetize.
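    As a back-of-the-envelope check on those numbers (my arithmetic, not from the article), the reported 242Gbps peak spread over 500,000 simultaneous viewers works out to roughly 480Kbps per viewer, comfortably inside the 150Kbps-750Kbps variable bitrate range:

        # Rough sanity check of the reported peak traffic (not from the article).
        viewers = 500_000
        peak_gbps = 242

        avg_kbps = peak_gbps * 1_000_000 / viewers   # Gbps -> Kbps, per viewer
        print(round(avg_kbps))                       # ~484 Kbps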


    Tuesday
    Mar 04 2008

    Manage Downtime Risk by Connecting Multiple Data Centers into a Secure Virtual LAN

    Update: VcubeV - an OpenVPN-based solution designed to build and operate a multisourced infrastructure.

    True high availability requires a presence in multiple data centers. The recent downtime of even a high quality operation like Amazon makes this need all the more clear. Typically only the big boys can afford the complexity of operating in two or more data centers. Cloud computing along with utility billing starts to change that equation, leveling the playing field. Even smaller outfits will be in a position to manage risk by spreading machines amongst EC2, 3tera, Slicehost, Mosso, and other providers. The question then becomes: given we aren't Angels, how do we walk amongst the clouds?

    One fascinating answer is exquisitely explained by Dmitriy Samovskiy in his Linux Journal article titled Building a Multisourced Infrastructure Using OpenVPN. Dmitriy's idea is to create a secure UDP tunnel between different data centers over public internet links so your application sees a flat virtual network even though the machines run in different data centers. Your machines think they are on the same local network when in reality clusters of machines are maintained in multiple locations communicating over the internet. This impossible-sounding task is well described in his article and involves setting up OpenVPN and a lot of tricky bits of configuration. Your reward? Geographical redundancy, encrypted communications, higher fault tolerance, nearest resource routing, better horizontal scalability, and greater vendor independence.

    Dmitriy points out there are some potential issues with this architecture:

  • Broadcasting and multicasting will not work over the tunnel.
  • Latency over the public network is higher than it is on your local Ethernet.
  • Tunnels tend to go up and down more than an Ethernet network.

    Having used a setup like this before, it's quite possible to have very fast backbone links connecting data centers, so the latency, bandwidth, and connection quality issues can be a lot less than you might think, or they could be an absolute killer. The broadcast/multicast problem did come up, but there are always alternative approaches that don't require this ability.

    A Few Questions for Dmitriy

    I asked Dmitriy a few questions and he was kind enough to respond with the following answers:

    1. Why would I want to create a virtual LAN rather than create a service layer and access services over http?

    This depends on what kind of services we are talking about. With hosts in 2 different datacenters which are operated by different hosting companies, and assuming no private connectivity (like a private T1 which you pay for and support), the only way for hosts to talk to each other is via the public Internet. If the data your services will be exchanging do not need to be protected from external eyes and you don't need to restrict direct access to services from the Internet, then a service layer with access over http would definitely be easier.

    However, if you don't want public access to those services, things are different. The first thing we did was have a firewall and restrict who can access which service by IP. For example, we provision machines as needed at Server Beach, one machine at a time (as I said, our operation is currently relatively small), and we handle user auth from LDAP. Whenever we get a new machine, we adjust its firewall and adjust the firewalls on all the other machines it's going to communicate with. In our case, we adjusted the firewall on the LDAP server so a new host could talk to LDAP. With time this peer-to-peer firewall adjusting became too error prone and time consuming as the number of hosts went up. Besides, it breaks change isolation to a certain extent - when bringing up a new host, I have to adjust existing production. In our example, we set up an LDAP replica and now all hosts needed to be reconfigured to fail over to the replica if the primary was not reachable - which meant a lot of firewall changes on multiple hosts. With more services and more hosts, I was dreading we'd end up with a pile of unmanageable firewall rules.

    Another aspect missing was data encryption when data passes over public Internet links. It was no big deal for us at the moment, but sooner or later everybody starts worrying about this, so I took a preemptive shot.

    Vanilla OpenVPN helped us kill these 2 birds with one stone. We got encryption, and once a server has a virtual IP it's easier to manage firewalls - I chose to manage it on the server side (so in our example, on the LDAP server). Our dynamic routing script allowed us to have a pair of active-active OpenVPN servers, lack of which would have been a show stopper for me. There are also 2 key benefits of OpenVPN that I like a lot:

    a. It passes through NAT and firewalls (since it's UDP). I can have a machine behind all sorts of firewalls on a 192.168.1.0/24 network and I can still ssh to it from anywhere in the cloud (using its virtual IP). Works great for VMs with the NAT networking type.

    b. You can assign static virtual IPs to hosts based on ssl key/cert pairs. This comes in very handy when you start thinking about Amazon EC2 and their lack of static IP addrs at the moment.

    2. Can I connect more than two data centers in a pairwise configuration?

    Yes you can, provided all your hosts that need to connect to VcubeV have physical network connectivity to at least one OpenVPN server (either over LAN or WAN). Plus, at least one OpenVPN server needs to be accessible by the other OpenVPN server. Please see my terrible diagram within the article at http://www.linuxjournal.com/article/9915 . If you want more than 2 OpenVPN servers, please see my (4) below.

    3. You mention the downsides are manageable by making certain architectural choices. Could you please describe these?

    Sure, it's pretty much what I said in the conclusion section of the article. Primarily it's "don't multisource if an app delivers better value when singlesourced." The term "better value" will vary from architect to architect. All of these solutions would require further experimentation.

    1. No broadcast or multicast. Solution: look into using OpenVPN on top of `tap' devices instead of `tun'. I personally would not multisource an app that does broadcast or multicast, since it's too low level and imho is likely to have other issues with being deployed in an environment which is drastically different from what its designers had in mind.
    2. Latency. One depends on public Internet links, so latency can't be controlled. Solution: anticipate latency, application retry logic, adjustable timeouts. If latency is a key aspect of the application (trading, for example), don't multisource, or at least think twice.
    3. Link flapping. Solution: retry logic, avoid long-running TCP connections, forcefully break and re-establish TCP connections regularly, application-level heartbeats, use TCP tunnels instead of UDP tunnels, consider data caches (memcached).
    4. No more than 2 OpenVPN servers. It's a design limitation of the current version of cube-routed. Solution: rewrite cube-routed to share route information using a more advanced protocol that allows many-to-many sharing.
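    Answers 2 and 3 lean heavily on application-level retry logic, adjustable timeouts, and short-lived connections to ride out tunnel latency and link flapping. Here is a minimal, hypothetical Python sketch of that pattern; the host, port, and parameter values are made up for illustration, and 10.8.0.12 is assumed to be a virtual IP on the OpenVPN network.

        import socket
        import time

        def call_service(host, port, payload, timeout=2.0, retries=3, backoff=0.5):
            """Send one request over the tunnel, retrying if the link flaps or stalls."""
            for attempt in range(retries):
                try:
                    # Short-lived connection: avoid long-running TCP over the tunnel.
                    with socket.create_connection((host, port), timeout=timeout) as s:
                        s.sendall(payload)
                        return s.recv(4096)
                except (socket.timeout, OSError):
                    if attempt == retries - 1:
                        raise
                    time.sleep(backoff * (attempt + 1))   # back off, then retry

        # Hypothetical usage:
        # reply = call_service("10.8.0.12", 8080, b"PING\n")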

    What Will the Future Look Like?

    It seems clear to me we are going to need a whole new set of tools and infrastructure for managing, deploying, creating, expanding, upgrading, and monitoring applications across multiple clouds. The advantages of multi-cloud deployment are too great to ignore. We need a Data Center API so we can treat all the different clouds as peers and operate on them like one big exposed object instead of individually specialized niches. Will we see real-time markets develop where clouds bid for your network/CPU/storage business and you can dynamically allocate applications to cloud vendors in order to minimize costs?
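    To make the Data Center API idea slightly more concrete, here is a purely speculative Python sketch; every name in it is invented for illustration, and no such standard API exists today.

        from abc import ABC, abstractmethod

        class DataCenter(ABC):
            """Hypothetical provider-neutral interface: treat every cloud as a peer."""

            @abstractmethod
            def provision(self, cpu: int, ram_gb: int) -> str:
                """Start a machine and return its identifier."""

            @abstractmethod
            def destroy(self, machine_id: str) -> None:
                """Tear the machine back down."""

            @abstractmethod
            def price_per_hour(self, cpu: int, ram_gb: int) -> float:
                """Quote a price so a scheduler can shop between clouds."""

        def cheapest(clouds, cpu, ram_gb):
            # The real-time market idea: place work on whichever cloud bids lowest.
            return min(clouds, key=lambda c: c.price_per_hour(cpu, ram_gb))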


    Monday
    Mar 03 2008

    Read This Site and Ace Your Next Interview!

    Paul Tyma published a massive and massively good 96-page insider's manual on How to Pass a Silicon Valley Software Engineering Interview. My eyes immediately latched on to one of his key example scenarios, which involves scaling Facebook:

    Facebook

  • What was Facebook day 1? – A database with a PHP front-end
  • In PHP, Java, C#, whatever – How long would it take you to reproduce Facebook's first incarnation?
  • A single MySQL instance with some simple queries probably used to happily query the whole userbase.

    Facebook

  • What is it today?
  • It's not about "that stuff you learned in school"
  • It's about what a company faces with thousands of (possibly conflicting) queries per second operating on a directed graph with 50 million nodes
  • And of course a few Petabytes of data
  • And 99.99% uptime
  • Design decision? A Facebook user is (or recently was) limited to 5000 friends.

    If you've been reading all the wisdom contributed to and referenced by this website, you might just rock this interview and put a little more money in your pocket. So this site isn't a total waste of time :-) Yet I wonder how we can have 96 pages on interviewing and still not talk about software development at all?
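    As a rough illustration of that day-1 architecture, here is a toy Python sketch using sqlite3 as a stand-in for the original single MySQL instance; the schema and queries are invented for the example.

        import sqlite3

        # Toy stand-in for "a single MySQL instance with some simple queries".
        db = sqlite3.connect(":memory:")
        db.executescript("""
            CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
            CREATE TABLE friends (user_id INTEGER, friend_id INTEGER);
        """)
        db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
        db.execute("INSERT INTO friends VALUES (?, ?)", (1, 2))

        # The kind of simple query that happily serves the whole userbase,
        # at least until it becomes a 50-million-node directed graph.
        rows = db.execute("""
            SELECT u.name FROM friends f JOIN users u ON u.id = f.friend_id
            WHERE f.user_id = ?
        """, (1,)).fetchall()
        print(rows)   # [('bob',)]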


    Monday
    Mar 03 2008

    Two data streams for a happy website

    One of the most important architectural decisions that must be made early on in a scalable web site project is splitting the data flow into two streams: one that is user specific and one that is generic. If this is done properly, the system will be able to grow easily. On the other hand, if the data streams are not separated from the start, then the growth options will be severely limited. Trying to make such a web site scale will be just painting the corpse, and this change will cost a whole lot more when you need to introduce it later (and it is "when" in this case, not "if").
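    Here is a minimal sketch of the idea, with hypothetical handler names and a plain dict standing in for the cache: generic data is fetched once and shared by every visitor, while user-specific data is computed per request and never shared.

        # Illustrative only: split page data into a generic stream (cacheable,
        # shared by all users) and a user-specific stream (never shared).

        GENERIC_CACHE = {}

        def get_generic(key, loader):
            # Generic data: one copy serves every visitor, so cache it aggressively.
            if key not in GENERIC_CACHE:
                GENERIC_CACHE[key] = loader()
            return GENERIC_CACHE[key]

        def render_page(user_id):
            articles = get_generic("front_page", lambda: ["story 1", "story 2"])
            greeting = "Hello, user %d" % user_id   # user-specific, per request
            return {"articles": articles, "greeting": greeting}

        print(render_page(42))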


    Wednesday
    Feb 27 2008

    Product: System Imager - Automate Deployment and Installs

    From their website: SystemImager is software that makes the installation of Linux to masses of similar machines relatively easy. It makes software distribution, configuration, and operating system updates easy, and can also be used for content distribution. SystemImager makes it easy to do automated installs (clones), software distribution, content or data distribution, configuration changes, and operating system updates to your network of Linux machines. You can even update from one Linux release version to another!

    It can also be used to ensure safe production deployments. By saving your current production image before updating to your new production image, you have a highly reliable contingency mechanism. If the new production environment is found to be flawed, simply roll back to the last production image with a simple update command! Some typical environments include: Internet server farms, database server farms, high performance clusters, computer labs, and corporate desktop environments.

    Related Articles

  • Cluster Admin's article Installing and updating your nodes is an excellent introduction to SystemImager. He says it's fast, scalable, simple, makes it easy to install on running nodes, allows management of different OS images and remote installation on any given group of nodes.
  • Automate Linux installation and recovery with SystemImager by Paul Virijevich


    Tuesday
    Feb 26 2008

    Architecture to Allow High Availability File Upload

    Hi, I was wondering if anyone has found any information on how to architect a system to support high availability file uploads. My scenario: I have an Apache server proxying requests to a bunch of Tomcat Java application servers. When I need to upgrade my site, I stop and upgrade each of the Tomcat servers one at a time. This seems to work well, as Apache automatically routes subsequent requests for the stopped app server to the remaining app servers that are up. The problem is that if a user is uploading a file when the app server is stopped, the upload fails and the user has to upload the file again. This is problematic, as uploading files is an integral feature of the site and it's frustrating for the users to have to restart their uploads every time I upgrade the site (which I want to be able to do frequently). Has anyone seen any information on how this can be done, or have ideas on how this can be architected? I imagine sites like Flickr must have a solution to this problem, as I have seen presentations where they say they are able to upgrade their site several times a day without the users noticing. Thanks! Tuyen


    Monday
    Feb 25 2008

    Make Your Site Run 10 Times Faster

    This is what Mike Peters says he can do: make your site run 10 times faster. His test bed is "half a dozen servers parsing 200,000 pages per hour over 40 IP addresses, 24 hours a day." Before optimization CPU spiked to 90% with 50 concurrent connections. After optimization each machine "was effectively handling 500 concurrent connections per second with CPU at 8% and no degradation in performance." Mike identifies six major bottlenecks:

  • Database write access (read is cheaper)
  • Database read access
  • PHP, ASP, JSP and any other server side scripting
  • Client side JavaScript
  • Multiple/Fat Images, scripts or css files from different domains on your page
  • Slow keep-alive client connections, clogging your available sockets

    Mike's solutions:

  • Switch all database writes to offline processing
  • Minimize the number of database read accesses to the bare minimum. No more than two queries per page.
  • Denormalize your database and optimize MySQL tables
  • Implement MemCached and change your database-access layer to fetch information from the in-memory cache first (see the sketch after this list).
  • Store all sessions in memory.
  • If your system has high reads, keep MySQL tables as MyISAM. If your system has high writes, switch MySQL tables to InnoDB.
  • Limit server side processing to the minimum.
  • Precompile all PHP scripts using eAccelerator
  • If you're using WordPress, implement WP-Cache
  • Reduce size of all images by using an image optimizer
  • Merge multiple css/js files into one and minify your .js scripts
  • Avoid hardlinking to images or scripts residing on other domains.
  • Put .css references at the top of your page, .js scripts at the bottom.
  • Install Firefox's Firebug and YSlow. YSlow analyzes your web pages on the fly, giving you a performance grade and recommending the changes you need to make.
  • Optimize httpd.conf to kill connections after 5 seconds of inactivity and turn gzip compression on.
  • Configure Apache to add Expires and ETag headers, allowing client web browsers to cache images, .css and .js files.
  • Consider dumping Apache and replacing it with Lighttpd or Nginx.

    Find more details in Mike's article.
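    To illustrate the MemCached item referenced in the list above, here is a minimal cache-aside sketch in Python. It assumes the python-memcached client library and a memcached daemon on localhost; the key scheme and the db helper are made up for the example.

        import memcache   # python-memcached client; assumes memcached on port 11211

        mc = memcache.Client(["127.0.0.1:11211"])

        def get_user_profile(user_id, db):
            key = "profile:%d" % user_id        # hypothetical key scheme
            profile = mc.get(key)
            if profile is None:                 # cache miss: fall back to the database
                profile = db.fetch_profile(user_id)    # hypothetical DB helper
                mc.set(key, profile, time=300)  # keep it in memory for 5 minutes
            return profile                      # cache hit: no database read at all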
