Entries in Strategy (358)

Tuesday
Mar 04 2008

Manage Downtime Risk by Connecting Multiple Data Centers into a Secure Virtual LAN

Update: VcubeV - an OpenVPN-based solution designed to build and operate a multisourced infrastructure.

True high availability requires a presence in multiple data centers. The recent downtime of even a high quality operation like Amazon makes this need all the more clear. Typically only the big boys can afford the complexity of operating in two or more data centers. Cloud computing along with utility billing starts to change that equation, leveling the playing field. Even smaller outfits will be in a position to manage risk by spreading machines amongst EC2, 3tera, Slicehost, Mosso and other providers. The question then becomes: given we aren't angels, how do we walk amongst the clouds?

One fascinating answer is exquisitely explained by Dmitriy Samovskiy in his Linux Journal article titled Building a Multisourced Infrastructure Using OpenVPN. Dmitriy's idea is to create a secure UDP tunnel over public internet links so your application sees a flat virtual network even though the machines run in different data centers. Your machines think they are on the same local network when in reality clusters of machines are maintained in multiple locations, communicating over the internet. This impossible-sounding task is well described in his article and involves setting up OpenVPN and a lot of tricky bits of configuration. Your reward? Geographical redundancy, encrypted communications, higher fault tolerance, nearest resource routing, better horizontal scalability, and greater vendor independence. Dmitriy points out there are some potential issues with this architecture:

  • Broadcasting and multicasting will not work over the tunnel.
  • Latency over the public network is higher than it is with your local Ethernet.
  • Tunnels tend to go up and down more than an Ethernet network.

    Having used a setup like this before, I can say it's quite possible to have very fast backbone links connecting data centers, so the latency, bandwidth, and connection quality issues can be a lot less than you might think - or they could be an absolute killer. The broadcast/multicast problem did come up, but there are always alternative approaches that don't require this ability.
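    To make the shape of the setup concrete, here is a minimal sketch of an OpenVPN server and client configuration for this kind of UDP tunnel. It is not Dmitriy's actual configuration (his article and his cube-routed scripts cover the full dual-server, dynamically routed setup); the hostnames, subnets, and file names are placeholders.

        # server.conf -- OpenVPN endpoint in data center A (illustrative only)
        port 1194
        proto udp            # UDP passes through NAT and most firewalls
        dev tun              # routed (layer 3) tunnel; no broadcast/multicast
        topology subnet
        server 10.8.0.0 255.255.255.0   # the virtual LAN your apps will see
        ca ca.crt
        cert server.crt
        key server.key
        dh dh.pem
        client-config-dir ccd           # per-client files pin static virtual IPs
        keepalive 10 60                 # detect and recover from flapping links
        persist-key
        persist-tun

        # ccd/host-in-datacenter-b -- static virtual IP tied to this host's cert
        ifconfig-push 10.8.0.21 255.255.255.0

        # client.conf -- a host in data center B (illustrative only)
        client
        proto udp
        dev tun
        remote vpn-a.example.com 1194   # public address of the OpenVPN server
        resolv-retry infinite
        nobind
        ca ca.crt
        cert host-in-datacenter-b.crt
        key host-in-datacenter-b.key
        persist-key
        persist-tun

    The client-config-dir/ifconfig-push pairing is what gives each host a stable virtual IP tied to its certificate, which is the property Dmitriy highlights below for clouds that don't hand out static public IPs.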

    A Few Questions for Dmitriy

    I asked Dmitriy a few questions and he was kind enough to respond with the following answers.

    1. Why would I want to create a virtual LAN rather than create a service layer and access services over http?

    This depends on what kind of services we are talking about. With hosts in 2 different datacenters which are operated by different hosting companies, and assuming no private connectivity (like a private T1 which you pay for and support), the only way for hosts to talk to each other is via the public Internet. If the data your services will be exchanging do not need to be protected from external eyes and you don't need to restrict direct access to the services from the Internet, then a service layer accessed over http would definitely be easier. However, we didn't want public access to those services, so the first thing we did was have a firewall and restrict who can access which service by IP.

    For example, we provision machines as needed at Server Beach, one machine at a time (as I said, our operation is currently relatively small), and we handle user auth from LDAP. Whenever we get a new machine, we adjust its firewall and adjust firewalls on all other machines it's going to communicate with. In our case, we adjusted the firewall on the LDAP server so a new host could talk to LDAP. With time this peer-to-peer firewall adjusting became too error prone and time consuming as the number of hosts went up. Besides, it breaks change isolation to a certain extent - when bringing up a new host, I have to adjust existing production. In our example, we set up an LDAP replica and all hosts then needed to be reconfigured to fail over to the replica if the primary was not reachable - which meant a lot of firewall changes on multiple hosts. With more services and more hosts, I was dreading we'd end up with a pile of unmanageable firewall rules.

    Another missing piece was encryption when data passes over public Internet links. It was no big deal for us at the moment, but sooner or later everybody starts worrying about this, so I took a preemptive shot. Vanilla OpenVPN helped us kill these 2 birds with one stone. We got encryption, and once a server has a virtual IP it's easier to manage firewalls - I chose to manage them on the server side (so in our example, on the LDAP server). Our dynamic routing script allowed us to have a pair of active-active OpenVPN servers, lack of which would have been a show stopper for me.

    There are also 2 key benefits of OpenVPN that I like a lot: (a) it passes through NAT and firewalls (since it's UDP) - I can have a machine behind all sorts of firewalls on a 192.168.1.0/24 network and still ssh to it from anywhere in the cloud (using its virtual IP), which works great for VMs with NAT networking; (b) you can assign static virtual IPs to hosts based on ssl key/cert pairs, which comes in very handy when you start thinking about Amazon EC2 and its current lack of static IP addrs.

    2. Can I connect more than two data centers in a pairwise configuration?

    Yes you can, provided all your hosts that need to connect to VcubeV have physical network connectivity to at least one OpenVPN server (either over LAN or WAN). Plus, at least one OpenVPN server needs to be accessible by the other OpenVPN server. Please see my terrible diagram within the article at http://www.linuxjournal.com/article/9915 . If you want more than 2 OpenVPN servers, please see my (4) below.

    3. You mention the downsides are manageable by making certain architectural choices. Could you please describe these?
    Sure, it's pretty much what I said in the conclusion section of the article. Primarily it's "don't multisource if an app delivers better value when singlesourced." The term "better value" will vary from architect to architect, and all of these solutions would require further experimentation.

    1. No broadcast or multicast. Solution: look into running OpenVPN on top of `tap' devices instead of `tun'. I personally would not multisource an app that does broadcast or multicast, since it's too low level and imho is likely to have other issues with being deployed in an environment which is drastically different from what its designers had in mind.

    2. Latency. You depend on public Internet links, so latency can't be controlled. Solution: anticipate latency, application retry logic, adjustable timeouts. If latency is a key aspect of the application (trading, for example), don't multisource, or at least think twice.

    3. Link flapping. Solution: retry logic, avoid long-running TCP connections, forcefully break and re-establish TCP connections regularly, application level heartbeats, use TCP tunnels instead of UDP tunnels, consider data caches (memcached).

    4. No more than 2 OpenVPN servers. It's a design limitation of the current version of cube-routed. Solution: rewrite cube-routed to share route information using a more advanced protocol that allows many-to-many sharing.
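    To make the retry and timeout advice in (2) and (3) concrete, here is a minimal sketch, not from Dmitriy's article, of a client call wrapped with an adjustable timeout, a few retries, and exponential backoff - the kind of application-level tolerance a high-latency or flapping tunnel calls for. The host, port, and payload are hypothetical.

        import socket
        import time

        def call_service(host, port, payload, timeout=2.0, retries=3, backoff=0.5):
            """Send a request over a possibly flaky tunnel, retrying on failure.

            timeout, retries and backoff are placeholders; tune them to the
            latency you actually observe over the public links.
            """
            for attempt in range(retries):
                try:
                    with socket.create_connection((host, port), timeout=timeout) as sock:
                        sock.settimeout(timeout)
                        sock.sendall(payload)
                        return sock.recv(65536)          # one short-lived exchange,
                                                         # no long-running connection
                except OSError:
                    if attempt == retries - 1:
                        raise
                    time.sleep(backoff * (2 ** attempt)) # back off before retrying

        # Hypothetical usage: talk to a service over its virtual IP.
        # response = call_service("10.8.0.21", 8080, b"ping\n")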

    What Will the Future Look Like?

    It seems clear to me we are going to need a whole new set of tools and infrastructure for managing, deploying, creating, expanding, upgrading, and monitoring applications across multiple clouds. The advantages of multi-cloud deployment are too great to ignore. We need a Data Center API so we can treat all the different clouds as peers and operate on them like one big exposed object instead of individually specialized niches. Will we see real-time markets develop where clouds bid for your network/CPU/storage business and you can dynamically allocate applications to cloud vendors in order to minimize costs?

    Click to read more ...

    Monday
    Mar 03 2008

    Two data streams for a happy website

    One of the most important architectural decisions that must be made early on in a scalable web site project is splitting the data flow into two streams: one that is user specific and one that is generic. If this is done properly, the system will be able to grow easily. On the other hand, if the data streams are not separated from the start, then the growth options will be severely limited. Trying to make such a web site scale will just be painting the corpse, and the change will cost a whole lot more when you need to introduce it later (and it is "when" in this case, not "if").
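    A minimal sketch of what that split can look like in page-assembly code, assuming a hypothetical shared cache and hypothetical loader functions: the generic stream is cached once for everybody, while the user-specific stream stays on its own path and never pollutes the shared cache.

        # Hypothetical page assembly that keeps the two streams separate.
        GENERIC_CACHE = {}   # stand-in for a shared cache such as memcached

        def get_generic_content(page_id, load_from_db):
            """Generic stream: identical for all users, so cache it aggressively."""
            if page_id not in GENERIC_CACHE:
                GENERIC_CACHE[page_id] = load_from_db(page_id)
            return GENERIC_CACHE[page_id]

        def get_user_content(user_id, load_user_data):
            """User-specific stream: fetched per user, never put in the shared cache."""
            return load_user_data(user_id)

        def render_page(page_id, user_id, load_from_db, load_user_data):
            # The page is assembled from the two streams at the edge, so the
            # expensive generic part scales with content, not with users.
            return {
                "body": get_generic_content(page_id, load_from_db),
                "sidebar": get_user_content(user_id, load_user_data),
            }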

    Click to read more ...

    Monday
    Feb 25 2008

    Make Your Site Run 10 Times Faster

    This is what Mike Peters says he can do: make your site run 10 times faster. His test bed is "half a dozen servers parsing 200,000 pages per hour over 40 IP addresses, 24 hours a day." Before optimization, CPU spiked to 90% with 50 concurrent connections. After optimization, each machine "was effectively handling 500 concurrent connections per second with CPU at 8% and no degradation in performance." Mike identifies six major bottlenecks:

  • Database write access (read is cheaper)
  • Database read access
  • PHP, ASP, JSP and any other server side scripting
  • Client side JavaScript
  • Multiple/Fat Images, scripts or css files from different domains on your page
  • Slow keep-alive client connections, clogging your available sockets

    Mike's solutions:
  • Switch all database writes to offline processing
  • Keep database reads to the bare minimum: no more than two queries per page.
  • Denormalize your database and optimize MySQL tables
  • Implement memcached and change your database-access layer to fetch information from the in-memory cache first (see the sketch after this list).
  • Store all sessions in memory.
  • If your system has high reads, keep MySQL tables as MyISAM. If your system has high writes, switch MySQL tables to InnoDB.
  • Limit server side processing to the minimum.
  • Precompile all PHP scripts using eAccelerator
  • If you're using WordPress, implement WP-Cache
  • Reduce size of all images by using an image optimizer
  • Merge multiple css/js files into one and minify your .js scripts
  • Avoid hardlinking to images or scripts residing on other domains.
  • Put .css references at the top of your page, .js scripts at the bottom.
  • Install Firefox's Firebug and YSlow. YSlow analyzes your web pages on the fly, giving you a performance grade and recommending the changes you need to make.
  • Optimize httpd.conf to kill connections after 5 seconds of inactivity and turn gzip compression on.
  • Configure Apache to add Expires and ETag headers, allowing client web browsers to cache images, .css and .js files.
  • Consider dumping Apache and replacing it with Lighttpd or Nginx.

    Find more details in Mike's article.
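    As referenced in the memcached item above, here is a minimal sketch of the "in-memory cache first" read path. The cache class is a stand-in so the example stays self-contained; a real deployment would use a memcached client with the same get/set shape, and the key scheme and TTL are assumptions, not Mike's numbers.

        import time

        class FakeCache:
            """Stand-in for a memcached client with a get/set interface."""
            def __init__(self):
                self._store = {}
            def get(self, key):
                value, expires = self._store.get(key, (None, 0))
                return value if time.time() < expires else None
            def set(self, key, value, ttl):
                self._store[key] = (value, time.time() + ttl)

        cache = FakeCache()

        def fetch_user_profile(user_id, query_db, ttl=600):
            """Cache-aside read: hit the in-memory cache first, fall back to the
            database only on a miss, then populate the cache for the next request."""
            key = "user:%d:profile" % user_id   # key scheme is an assumption
            profile = cache.get(key)
            if profile is None:
                profile = query_db(user_id)     # the expensive read we're avoiding
                cache.set(key, profile, ttl)
            return profile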

    Click to read more ...

    Wednesday
    Jan 16 2008

    Strategy: Asynchronous Queued Virus Scanning

    Atif Ghaffar has a nice strategy for dealing with virus scanning of uploads:

  • Upload item into a safe area. If necessary, the uploader blocks waiting for a result.
  • Queue a work order into a job system so all the work can be distributed throughout your cluster.
  • A service in your cluster performs the virus scan and informs the uploader of the result.
  • Move the vetted item into your system.

    This removes the CPU bottleneck from your web servers and distributes it through your cluster. Keep your web servers providing prompt service to users and let your cluster do the heavy lifting. This minimizes response time and maximizes throughput. A similar system can be used for creating thumbnails, transcoding, copyright checks, updating indexes, event notification or any other kind of intensive work.
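    Here is a minimal sketch of that flow, using Python's in-process queue and a placeholder scan function; in a real deployment the queue would be a job system shared by the cluster and the workers would run on other machines.

        import queue
        import threading

        scan_jobs = queue.Queue()   # stand-in for a distributed job queue

        def scan_for_viruses(path):
            """Placeholder for the real scanner (e.g. calling a scanning service)."""
            return not path.endswith(".evil")

        def handle_upload(path, notify):
            # Web server: stash the file in a safe area, queue a work order, return fast.
            scan_jobs.put((path, notify))

        def scan_worker():
            # Cluster: pull work orders, do the heavy lifting, report the verdict.
            while True:
                path, notify = scan_jobs.get()
                notify(path, clean=scan_for_viruses(path))
                scan_jobs.task_done()

        threading.Thread(target=scan_worker, daemon=True).start()

        # Hypothetical usage from an upload handler:
        # handle_upload("/uploads/pending/photo.jpg",
        #               lambda path, clean: print(path, "clean" if clean else "quarantined"))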

    Click to read more ...

    Thursday
    Jan 10 2008

    Letting Clients Know What's Changed: Push Me or Pull Me?

    I had a false belief / I thought I came here to stay / We're all just visiting / All just breaking like waves / The oceans made me, but who came up with me? / Push me, pull me, push me, or pull me out.

    So true Pearl Jam (Push Me Pull Me lyrics), so true. I too have wondered how web clients should be notified of model changes. Should servers push events to clients or should clients pull events from servers? A topic worthy of its own song if ever there was one.

    To pull events the client simply starts a timer and makes a request to the server. This is polling. You can either pull a complete set of fresh data or get a list of changes. The server "knows" if anything you are interested in has changed and makes those changes available to you. Knowing what has changed can be relatively simple with a publish-subscribe type backend, or you can get very complex with fine grained bit maps of attributes and keeping per client state on what a client still needs to see.

    Polling is heavy, man. Imagine all your clients hitting your servers every 5 seconds even if there are no updates. And if every poll request ends up in a flurry of database requests, your database can be hammered. Of course, caching can smooth out this jagged trip, but if you keep per client state you need more clever per client cache views. The overhead of polling can be mitigated somewhat by piggybacking updates on replies to client requests.

    So if polling has a high overhead then it makes sense to only send data when there's an update the client should see. That is, we push data to the client. The current push model favorite is Comet: a World Wide Web application architecture in which a web server sends data to a client program (normally a web browser) asynchronously without any need for the client to explicitly request it. It allows creation of event-driven web applications, enabling real-time interaction otherwise impossible in a browser.

    Nothing comes for free however, and pushing has a surprising amount of overhead too. A connection has to be kept open between the client and server for the new data to be pushed over. Typically servers don't handle large tables of connections very well, so this approach hasn't worked well; you had to spread the connections over multiple servers. Fortunately operating systems are getting better at handling large numbers of connections. For every connection you also have to store the data to push to the client and you need a thread to send it. It's easy to see how this could go bad with naive architectures.

    Architecturally I've always sided with polling for complete datasets rather than pushing or polling just for changes. This is the simplest and best self-healing architecture. Machines can go up and down at will and your client will always be correct and consistent. There's no chance for the stream of changes to get out of sync. Your client view will always be correct. The server side doesn't have to do anything too special, clients already know how to do it, and you use client resources to do the polling and the update on the client side. All you have to do to scale polling is have enough machines, smart caching to handle the load, enough bandwidth to handle larger datasets, and a problem where low latency isn't required. That's all :-)

    The Comet Daily, not affiliated with Superman I hear, is making a strong case for push in their articles Comet is Always Better Than Polling and 20,000 Reasons Why Comet Scales.
Special application server software is needed because your typical app server can't handle lots of persistent connections. They tend to run out of threads and memory. Greg Wilkins talks about these and other issues in Blocking Servlets, Asynchronous Transport. This is all pretty standard stuff when you build your own messaging system, but I guess it has taken a while to move into the web infrastructure. With Comet they found: The key result is that sub-second latency is achievable even for 20,000 users. There is an expected latency vs. throughput tradeoff: for example, for 5,000 users, 100ms latency is achievable up to 2,000 messages per second, but increases to over 250ms for rates over 3,000 messages per second. Interesting results, especially if your application requires low latency updates. Most people haven't deployed or even considered push based architectures. With Comet it's at least something to think about. I can't resist adding this cute animation of a llama push me pull me.
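    Since I keep arguing for polling complete datasets, here is a minimal sketch of that client loop, assuming a hypothetical JSON endpoint that always returns the full dataset; the URL, interval, and timeout are placeholders.

        import json
        import time
        import urllib.request

        def poll_forever(url, interval_seconds=5.0, on_update=print):
            """Poll for the complete dataset on a timer and replace local state wholesale.

            Because every response is the full dataset, a missed or failed poll never
            leaves the client out of sync -- the next successful poll heals it.
            """
            while True:
                try:
                    with urllib.request.urlopen(url, timeout=10) as resp:
                        dataset = json.load(resp)
                    on_update(dataset)              # replace, don't merge
                except OSError:
                    pass                            # transient failure; next poll recovers
                time.sleep(interval_seconds)

        # Hypothetical usage:
        # poll_forever("https://example.com/api/dashboard.json", interval_seconds=5)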

    Click to read more ...

    Tuesday
    Dec 25 2007

    IBMer Says LAMP Can't Scale

    A very entertaining and somewhat educational article: IBM Poopheads say LAMP Users Need to "grow up". The physical three tier architecture turns out to be the root of all evil and shared nothing architectures bring simplicity and light. In the comments Simon Willison makes an insightful point on why fine grained caching works for personalized pages and proxies don't:

    Great post, but I have to disagree with you on the finely grained caching part. If you look at big LAMP deployments such as Flickr, LiveJournal and Facebook the common technology component that enables them to scale is memcached - a tool for finely grained caching. That's not to say that they aren't doing shared-nothing, it's just that memcached is critical for helping the database layer scale. LiveJournal serves around 50% of its page views "permission controlled" (friends only) so an HTTP proxy on the front end isn't the right solution - but memcached reduces their database hits by 90%.

    Click to read more ...

    Friday
    Dec 21 2007

    Strategy: Limit Result Sets

    Release It! author Michael Nygard tells a tale of two web sites, both brought low by unexpectedly huge unbounded result sets that slowed their sites to the speed of a Christmas checkout line. I've committed this error more than a few times. During testing the result sets are often small, so you don't see problems. Or when a product is new you don't have a lot of data, so everything is fine until some magic line is crossed and you get that dreaded 2AM fix-it call.

    My most embarrassing bug of this type caused a rather spectacular failure at a customer site: the variance in response times went out of spec and that kicked in penalty clauses. What happened was the customer had a larger network than we could even test (customers always get the good stuff). I took a lock and went to get all the data. Because the result set was so much larger in their bigger system, I held the lock for many more milliseconds than I should have. Unknown to me, a chunk of code on the critical path was also in the lock path and all hell broke loose. I had to change the logic to process the result set in fixed size deterministic chunks, releasing locks as I went. I even had to measure CPU usage and back off after a certain amount of CPU was used. But all was well again. I then hunted down every other place I made the same mistake, and there were a few. To solve this problem in general I developed an architecture supporting scheduling work by CPU usage.

    A common theme in many of the profiles on this site is protecting your system from requests that can bring down the system. Mailinator has a lot of resource exhaustion problems and does a good job solving them. eBay has an interesting strategy of doing as little work as possible in the database, which leads them to do joins in application space - exactly the opposite of this strategy's conclusion. But I think that may be going too far. With proper indexes, performing selects in the database to minimize the result sets would seem to be a win, as databases are good at this sort of thing. Yah, relational databases suck at doing top 10 type of logic, so calculate that on the fly and cache it. How can you bound result sets?

  • Michael's strategy: You should make your apps be paranoid about their data. If your app processes one record at a time, then looping through an entire result set might be OK - as long as you're not making a user wait while you do it. But if your app turns rows into objects, then it had better be very selective about its SELECTs. The relationships might not be what you expect. The data producer might have changed in a surprising way, particularly if it's not under your control. Purging routines might not be in place, or might have gotten broken. Definitely don't trust some other application or batch job to load your data in a safe way.
  • A diagnostic tactic is to benchmark all page response times and look for pages that take too long. This should be an automated system that emails you or shows alerts in your dashboard system. And because you logged everything you can figure out why the response was slow and then take measures to fix it.
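    To make the "fixed size deterministic chunks" fix from the story above concrete, here is a minimal sketch using the standard DB-API fetchmany; the table, query, and chunk size are placeholders, and the real fix also released locks and metered CPU between chunks.

        import sqlite3

        def process_in_chunks(db_path, chunk_size=500):
            """Stream a potentially huge result set in bounded chunks instead of
            loading it all at once (or holding a lock for the whole traversal)."""
            conn = sqlite3.connect(db_path)
            try:
                cur = conn.execute("SELECT id, payload FROM events ORDER BY id")
                while True:
                    rows = cur.fetchmany(chunk_size)   # never more than chunk_size in memory
                    if not rows:
                        break
                    for row_id, payload in rows:
                        handle(row_id, payload)        # hypothetical per-row work
                    # A real system would commit, release locks, and maybe yield
                    # the CPU here before grabbing the next chunk.
            finally:
                conn.close()

        def handle(row_id, payload):
            pass  # placeholder for the actual processing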

    Click to read more ...

    Wednesday
    Dec 19 2007

    How can I learn to scale my project?

    This is a question asked on the ycombinator list and there are some good responses. I gave a quick response, but I particularly like neilk's knock-it-out-of-the-park insightful answer:

  • Read Cal Henderson's book. (I'd add in Theo's book and Release It! too)
  • The center of your design should be the data store, not a process. You transition the data store from state to state, securely and reliably, in small increments.
  • Avoid globals and session state. The more "pure" your function is, the easier it will be to cache or partition.
  • Don't make your data store too smart. Calculations and renderings should happen in a separate, asynchronous process.
  • The data store should be able to handle lots of concurrent connections. Minimize locking. (Read about optimistic locking).
  • Protect your algorithm from the implementation of the data store, with a helper class or module or whatever. But don't (DO NOT) try to build a framework for any conceivable query. Just the ones your algorithm needs.

    Viewing an application as a series of state transitions instead of a blizzard of actions and events is a way underappreciated design perspective. This is one of the key design approaches for making robust embedded systems. A great paper talking about this sort of stuff is Mission Planning and Execution Within the Mission Data System - an effort to make engineering flight software more straightforward and less prone to error through the explicit modeling of spacecraft state. Another interesting paper is CLEaR: Closed Loop Execution and Recovery High-Level Onboard Autonomy for Rover Operations. In general I call these Fact Based Architectures. I'm really glad neilk brought it up.
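    Since the list mentions optimistic locking only in passing, here is a minimal sketch of the usual version-column flavor of it, with sqlite3 standing in for the data store; the table and columns are assumptions. The idea fits the state-transition view: a transition is applied only if the state being left is still the state that was read.

        import sqlite3

        def update_balance(conn, account_id, new_balance, expected_version):
            """Optimistic locking: apply the state transition only if no one else
            has transitioned the row since we read it (version check in the WHERE)."""
            cur = conn.execute(
                "UPDATE accounts "
                "SET balance = ?, version = version + 1 "
                "WHERE id = ? AND version = ?",
                (new_balance, account_id, expected_version),
            )
            conn.commit()
            return cur.rowcount == 1   # False means a concurrent update won

        # Typical caller loop: read the row and its version, compute the new state,
        # attempt the update, and on failure re-read and retry instead of holding a lock.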

    Click to read more ...

    Friday
    Nov 30 2007

    Strategy: Efficiently Geo-referencing IPs

    A lot of apps need to map IP addresses to locations. Jeremy Cole in On efficiently geo-referencing IPs with MaxMind GeoIP and MySQL GIS succinctly explains the many uses for such a feature: Geo-referencing IPs is, in a nutshell, converting an IP address, perhaps from an incoming web visitor, a log file, a data file, or some other place, into the name of some entity owning that IP address. There are a lot of reasons you may want to geo-reference IP addresses to country, city, etc., such as in simple ad targeting systems, geographic load balancing, web analytics, and many more applications. This is difficult to do efficiently - at least it gives me a bit of brain freeze. In the same post Jeremy nicely explains where to get the geo-referencing data, how to load it, and the performance of different approaches to IP address searching. It's a great practical introduction to the subject.
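    For a feel of the basic mechanics - this is the plain sorted-range lookup, not Jeremy's MySQL GIS trick - here is a minimal sketch: convert the dotted quad to an integer and binary search a table of ranges. The tiny range table is made up for illustration; real data would come from a GeoIP database.

        import bisect
        import socket
        import struct

        def ip_to_int(ip):
            """Convert a dotted-quad IPv4 address to its integer form."""
            return struct.unpack("!I", socket.inet_aton(ip))[0]

        # (range_start, range_end, location) rows, sorted by range_start.
        RANGES = [
            (ip_to_int("1.0.0.0"), ip_to_int("1.0.0.255"), "AU"),
            (ip_to_int("8.8.8.0"), ip_to_int("8.8.8.255"), "US"),
        ]
        STARTS = [r[0] for r in RANGES]

        def lookup(ip):
            """Find the range containing ip via binary search; O(log n) per lookup."""
            n = ip_to_int(ip)
            i = bisect.bisect_right(STARTS, n) - 1
            if i >= 0 and RANGES[i][0] <= n <= RANGES[i][1]:
                return RANGES[i][2]
            return None

        # lookup("8.8.8.8") -> "US"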

    Click to read more ...

    Tuesday
    Nov 27 2007

    Solving the Client Side API Scalability Problem with a Little Game Theory

    Now that the internet has become defined as a mashup over a collection of service APIs, we have a little problem: for clients, using APIs is a lot like drinking beer through a straw. You never get as much beer as you want and you get a headache afterwards. But what if I've been a good boy and deserve a bigger straw? Maybe we can use game theory to model trust relationships over a lifetime of interactions with many different services, and then give more capabilities/beer to those who have earned them?

    Let's say Twitter limits me to downloading only 20 tweets at a time through their API. But I want more. I may even want to do something so radical as download all my tweets. Of course Twitter can't let everyone do that. They would be swamped serving all this traffic and service would be denied. So Twitter does the rational thing and limits API access as a means of self protection. As do Google, Yahoo, Skynet, and everyone else.

    But when I hit the API limit I think: hey, it's Todd here, we've been friends a long time now and I've never hurt you. Not once. Can't you just trust me a little? I promise not to abuse you. I never have and won't in the future. At least on purpose - accidents do happen. Sometimes there's a signal problem and we'll misunderstand each other, but we can work that out. After all, if soldiers during WWI could learn how to stop the killing through forgiveness, so can we.

    The problem is Twitter doesn't know me, so we haven't built up trust. We could replace trust with money, as in a paid service where I pay for each batch of downloads, but we're better friends than that. Money shouldn't come between us. And if Twitter knew what a good guy I am, I feel sure they would let me download more data. But Twitter doesn't know me and that's the problem. How could they know me? We could set up authority based systems like the ones that let certain people march ahead through airport security lines, but that won't scale and I have a feeling we all know how that strategy will work out in the end.

    Another approach to trust is a game theoretic perspective for assessing a user's trust level. Take the iterated prisoner's dilemma, where variations on the tit for tat strategy are surprisingly simple ways cooperation could evolve in the API world. We start out cooperating and if you screw me I'll screw you right back. In a situation where communication is spotty (like through an API) bad signals can be sent, so if the two sides have trusted each other before they'll wait for another iteration to see if the other side defects again, in which case they retaliate. Perhaps if services modeled API limits like a game and assessed my capabilities by how we've played the game together, then capabilities could be set based on earned and demonstrated trust rather than simplistic rules.

    A service like Mashery could take us even further by moving us out of the direct reciprocity model, where we judge each other on our one on one interactions, and into a more sophisticated indirect reciprocity model, where agents can make decisions to help those who have helped others. Mashery can look at how API users act on the wider playing field of multiple services and multiple agents. If you are well behaved using many different services, shouldn't you earn more trust and thus more capabilities? In the real world if someone vouches for you to a friend then you will likely get more slack because you have some of the trust from your friend backing you.
    This doesn't work in a one on one situation because there's no way to establish your reputation. Mashery, on the other hand, knows you and knows which APIs you are using and how you are using them. Mashery could vouch for you if they detected you were playing fair, so you'd get more capabilities initially and move up the capability scale faster if you continued to behave. You can obviously go on and on imagining how such a system might work.

    Of course, there's a dark side. Situations are possible like on eBay, where people spend eons setting up a great reputation only to later cash in that reputation in some fabulous scam. That's what happens in a society though. We all get more capabilities at the price of some extra risk.
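    No service actually works this way today as far as I know, but as a toy illustration of tit for tat applied to API limits, here is a minimal sketch of a quota that grows with cooperative behavior and snaps back to the base limit when a client defects. All the numbers are arbitrary.

        class TrustedRateLimiter:
            """Toy tit-for-tat quota: start small, reward cooperation, punish defection."""

            def __init__(self, base_limit=20, max_limit=2000):
                self.base_limit = base_limit
                self.max_limit = max_limit
                self.trust = {}   # client_id -> earned trust score

            def limit_for(self, client_id):
                # Quota grows with earned trust, capped at max_limit.
                return min(self.max_limit,
                           self.base_limit * (1 + self.trust.get(client_id, 0)))

            def record_cooperation(self, client_id):
                # A well-behaved request slowly raises the client's trust.
                self.trust[client_id] = self.trust.get(client_id, 0) + 1

            def record_defection(self, client_id):
                # Abuse resets trust: immediate retaliation, forgiveness possible later.
                self.trust[client_id] = 0

        # limiter = TrustedRateLimiter()
        # limiter.record_cooperation("todd"); limiter.limit_for("todd")  # -> 40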

    Click to read more ...