Entries in Strategy (358)

Saturday
Sep082007

MP3.com Web Templating Architecture (March, 2000)

In March 2000, I gave a talk about how we scaled with semi-static files while splitting data from presentation. For dynamic pages we used mod_perl, doing an internal redirect to apply the XML data to the style templates. Since then, Apache 2.0 has added the concept of filters to allow for similar functionality.
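The core idea is simple enough to sketch in a few lines: keep the page data in XML, keep the look-and-feel in a template, and combine the two either once (semi-static files) or per request (dynamic pages). This is only an illustration of that split, not MP3.com's actual code; the file names and template syntax are hypothetical.

```python
# Minimal sketch of the data/presentation split described above: page data in
# XML, look-and-feel in a template, a rendering step that combines them.
import xml.etree.ElementTree as ET
from string import Template

def render(xml_path, template_path):
    data = ET.parse(xml_path).getroot()
    fields = {child.tag: (child.text or "") for child in data}
    with open(template_path) as f:
        template = Template(f.read())
    return template.safe_substitute(fields)

# A semi-static pipeline would write the result out once and let the web
# server serve the flat file; a dynamic pipeline would run this per request.
if __name__ == "__main__":
    print(render("artist.xml", "artist_page.tmpl"))   # hypothetical files
```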

Click to read more ...

Thursday
Sep062007

Scaling IMAP and POP3

Another scalability strategy brought to you by Erik Osterman: Just thought I'd drop a brief suggestion to anyone building a large mail system. Our solution for scaling mail pickup was to develop a sharded architecture whereby accounts are spread across a cluster of servers, each with imap/pop3 capability. Then we use a cluster of reverse proxies (Perdition) speaking to the backend imap/pop3 servers. The benefit of this approach is that you can simply use round-robin or HA load balancing on the Perdition servers that end users connect to, so admins can easily move accounts around on the backend storage servers without affecting end users. Perdition manages routing users to the appropriate backend servers and has MySQL support. What we also liked about this approach was that it had no dependency on a distributed or networked file system, so there is less chance of corruption or data consistency issues. When an individual server reaches capacity, we just offload users to a less used server. If any server goes offline, it only affects the fraction of users assigned to that server.
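At the heart of this setup is a user-to-backend map that the proxy layer consults on every connection (Perdition can do this lookup against MySQL). The sketch below only illustrates the routing idea; the host names are hypothetical and a plain dict stands in for the database.

```python
# Sketch of the account-to-backend routing the proxy layer performs. Perdition
# does this via its MySQL map; here a dict stands in for that lookup and the
# host names are made up.
BACKENDS = {
    "alice@example.com": "imap-backend-03.internal:143",
    "bob@example.com":   "imap-backend-07.internal:143",
}
DEFAULT_BACKEND = "imap-backend-01.internal:143"

def route(username):
    """Return the backend host:port an incoming connection should be proxied to."""
    return BACKENDS.get(username, DEFAULT_BACKEND)

# Moving an account is a single row update in the map; end users still connect
# to any proxy in the round-robin pool, so nothing changes for them.
print(route("alice@example.com"))
```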

Click to read more ...

Thursday
Aug302007

Log Everything All the Time

This JoelOnSoftware thread asks the age-old question of what and how to log. The usual trace/error/warning/info advice is totally useless in a large scale distributed system. Instead, you need to log everything all the time so you can solve problems that have already happened across a potentially huge range of servers. Yes, it can be done.

To see why the typical logging approach is broken, imagine this scenario: Your site has been up and running great for weeks. No problems. A foreshadowing beeper goes off at 2AM. It seems some users can no longer add comments to threads. Then you hear the debugging death knell: it's an intermittent problem and customers are pissed. Fix it. Now. So how are you going to debug this? The monitoring system doesn't show any obvious problems or errors. You quickly post a comment and it works fine. This won't be easy. So you think.

Commenting involves a bunch of servers and networks. There's the load balancer, spam filter, web server, database server, caching server, file server, and a few network switches and routers along the way. Where did the fault happen? What went wrong? All you have at this point are your logs. You can't turn on more logging because the Heisenberg already happened. You can't stop the system because your system must always be up. You can't deploy a new build with more logging because that build has not been tested and you have no idea when the problem will happen again anyway. Attaching a debugger to a process, while heroic sounding, doesn't help at all.

What you need to be able to do is trace through all relevant logs, pull together a timeline of all relevant operations, and see what happened. And this is where trace/info etc. is useless. You don't need function/method traces. You need a log of all the interesting things that happened in the system. Knowing "func1" was called is of no help. You need to know all the parameters that were passed to the function. You need to know the return value from the function. Along with anything else interesting it did. So there are really no logging levels. You need to log everything that will help you diagnose any future problem. What you really need is a time machine, but you don't have one. But if you log enough state you can mimic a time machine. This is what will allow you to follow a request from start to finish and see if what you expect to be happening is actually happening. Did an interface drop a packet? Did a reply timeout? Is a mutex on perma-lock? So many things can go wrong.

Over time systems usually evolve to the point of logging everything. They start with little or no logging. Then problem by problem they add more and more logging. But the problem is the logging isn't systematic or well thought out, which leads to poor coverage and poor performance.

Logs are where you find anomalies. An anomaly is something unexpected, like operations happening that you didn't expect, in a different order than expected, or taking longer than expected. Anomalies have always driven science forward. Finding and fixing them will help make your system better too. They expose flaws you might not otherwise see. They broaden your understanding of how your system really responds to the world. So step back and take a look at what you need to debug problems in the field. Don't be afraid to add what you need to see how your system actually works.
For example, every request needs to have assigned to it a globally unique sequence number that is passed with every operation related to the request so all work for a request can be tied together. This will allow you to trace the comment add from the client all the way through the system (the sketch after the list below shows the idea). Usually when looking at log data you have no idea what work maps to which request. Once you know that, debugging becomes a lot easier. Every hop a request takes should log meta information about how long the request took to process, how big the request was, and what the status of the request was. This will help you pinpoint latency issues and any outliers that happen with big messages. If you do this correctly you can simulate the running of the system completely from this log data.

I am not being completely honest when I say there are no debugging levels. There are two levels: system and developer. System is logging everything you need to log to debug the system. It is never turned off. There is no need. System logging is always on. Developers can add more detailed log levels for their code that can be turned on and off on a module by module basis. For example, if you have a routing algorithm you may only want to see the detailed logging for that on occasion. The trick is there are no generic info type debug levels. You create a named module in your software with a debug level for tracing the routing algorithm. You can turn that on when you want and only that feature is impacted. I usually have a configuration file with initial debug levels. But then I make each process have a command port hosting a simple embedded web server and telnet processor so you can change debug levels and other settings on the fly through the web or telnet interface. This is pretty handy in the field and during development.

I can hear many of you saying this is too inefficient. We could never log all that data! That's crazy! Not true. I've worked on very sensitive high performance real-time embedded systems where every nanosecond was dear and they still had very high levels of logging, even in driver land. It's in how you do it. You would be right if you logged everything within the same thread directly to disk. Then you are toast. It won't ever work. So don't do that. There are lots of tricks you can use to make logging fast enough that you can do it all the time:

  • Make logging efficient from the start so you aren't afraid to use it.
  • Create a dead simple to use log library that makes logging trivial for developers. Document it. Provide example code. Check for it during code reviews.
  • Log to a separate task and let the task push out log data when it can.
  • Use a preallocated buffer pool for log messages so memory allocation is just pop and push.
  • Log integer values for very time sensitive code.
  • For less time sensitive code sprintf'ing into a preallocated buffer is usually quite fast. When it's not you can use reference counted data structures and do the formatting in the logging thread.
  • Triggering a log message should take exactly one table lookup. Then the performance hit is minimal.
  • Don't do any formatting before it is determined the log is needed. This removes constant overhead for each log message.
  • Allow fancy stream based formatting so developers feel free to dump all the data they wish in any format they wish.
  • In an ISR context do not take locks or you'll introduce unbounded variable latency into the system.
  • Directly format data into fixed size buffers in the log message. This way there is no avoidable overhead.
  • Make the log message directly queueable to the log task so queuing doesn't take more memory allocations. Memory allocation is a primary source of arbitrary latency and deadlock because of the locking. Avoid memory allocation in the log path.
  • Make the logging thread a lower priority so it won't starve the main application thread.
  • Store log messages in a circular queue to limit resource usage.
  • Write log messages to disk in big sequential blocks for efficiency.
  • Every object in your system should be dumpable to a log message. This makes logging trivial for developers.
  • Tie your logging system into your monitoring system so all the logging data from every process on every host winds its way to your centralized monitoring system. At the same time you can send all your SLA related metrics and other stats. This can all be collected in the background so it doesn't impact performance.
  • Add meta data throughout the request handling process that makes it easy to diagnose problems and alert on future potential problems.
  • Map software components to subsystems that are individually controllable; cross-application trace levels aren't useful.
  • Add command ports to processes that make it easy to set program behaviors at run-time and view important statistics and logging information.
  • Log information like task switch counts and times, queue depths and high and low watermarks, free memory, drop counts, mutex wait times, CPU usage, disk and network IO, and anything else that may give a full picture of how your software is behaving in the real world.

In large scale distributed systems logging data is all you have to debug most problems. So log everything all the time and you may still get that call at 2AM, but at least you'll have a fighting chance to fix any problems that do come up.
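As a minimal sketch of two of these ideas, assuming Python: a per-request ID stamped on every log line so a request can be stitched together across hops, and a dedicated logging thread draining a bounded queue so the hot path only pays for an enqueue. The names, fields, and batch sizes are hypothetical, not the author's implementation.

```python
# Minimal sketch: per-request IDs plus a separate logging task. The hot path
# only enqueues; a background thread writes big sequential batches.
import queue, threading, time, uuid

LOG_QUEUE = queue.Queue(maxsize=100_000)   # bounded, like a circular queue

def log(request_id, event, **fields):
    """Hot path: build a small record and enqueue it; never block the caller."""
    record = {"ts": time.time(), "req": request_id, "event": event, **fields}
    try:
        LOG_QUEUE.put_nowait(record)
    except queue.Full:
        pass  # drop rather than stall the application; count drops in real code

def log_writer():
    """Background task: drain the queue and write in big sequential batches."""
    with open("app.log", "a") as f:
        while True:
            batch = [LOG_QUEUE.get()]
            while not LOG_QUEUE.empty() and len(batch) < 1000:
                batch.append(LOG_QUEUE.get_nowait())
            f.write("\n".join(repr(r) for r in batch) + "\n")
            f.flush()

threading.Thread(target=log_writer, daemon=True).start()

# Every hop logs against the same request id, so the whole path can be stitched
# together later when debugging that 2AM comment failure.
req = uuid.uuid4().hex
log(req, "comment_add_received", user="u123", size=412)
log(req, "spam_check_done", verdict="ok", elapsed_ms=3)
log(req, "db_insert_done", rows=1, elapsed_ms=11)
```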

Click to read more ...

Wednesday
Aug222007

How many machines do you need to run your site?

Amazingly, TechCrunch runs their website on one web server and one database server, according to the fascinating survey What the Web’s most popular sites are running on by Pingdom, a provider of uptime and response time monitoring. Earlier we learned PlentyOfFish catches and releases many millions of hits a day on just one web server and three database servers. Google runs a Dalek army full of servers. YouSendIt, a company making it easy to send and receive large files, has 24 web servers, 3 database servers, 170 storage servers, and a few miscellaneous servers. Vimeo, a video sharing company, has 100 servers for streaming video, 4 web servers, and 2 database servers. Meebo, an AJAX based instant messaging company, uses 40 servers to handle messaging, over 40 web servers, and 10 servers for forums, jabber, testing, and so on. FeedBurner, a news feed management company, has 70 web servers, 15 database servers, and 10 miscellaneous servers. Now multiply FeedBurner's server count by two because they maintain two geographically separate sites, in an active-passive configuration, for high availability purposes. How many servers will you need and how can you trick yourself into using fewer?

Find Someone Like You and Base Your Resource Estimates Off Them

We see quite a disparity in the number of servers needed for popular web sites. It ranges from just a few servers to many hundreds. Where do you fit? The easiest approach to figuring out how many servers you'll need is to find a company similar to yours and look at how many they need. You won't need that many right away, but as you grow it's something to think about. Can your data center handle your growth? Do they have enough affordable bandwidth and rack space? How will you install and manage all the machines? Who will do the work? And there are a million other similar questions that might be better handled if you had some idea where you are going.
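One way to turn a comparable site's numbers into your own estimate is a back-of-envelope calculation like the one below. Every figure in it is an assumption to be replaced with your own measurements; it is illustrative arithmetic, not a sizing formula from the article.

```python
# Back-of-envelope web server estimate. All numbers are assumptions.
import math

def servers_needed(daily_requests, peak_factor=3.0, req_per_server_per_sec=300,
                   headroom=0.5):
    avg_rps = daily_requests / 86_400.0
    peak_rps = avg_rps * peak_factor
    usable_capacity = req_per_server_per_sec * (1.0 - headroom)
    return math.ceil(peak_rps / usable_capacity)

# e.g. 50M requests/day with 3x peaks and 50% headroom per box
print(servers_needed(50_000_000))   # ~12 web servers under these assumptions
```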

Get Someone Else to Do it

Clearly content sites end up needing a lot of servers. Videos, music, pictures, blogs, and attachments all eat up space, and since that's your business you have no alternative but to find a way to store all that data. This is unstructured data that can be stored outside the database in a SAN or NAS. Or, rather than building your own storage infrastructure, you can follow the golden rule of laziness: get someone else to do it. That's what SmugMug, an image sharing company, did. They use S3 to store many hundreds of terabytes of data. This drops the expense of creating a large highly available storage infrastructure so much that it creates a whole new level of competition for content rich sites. At one time expertise in creating massive storage farms would have been enough to keep competition away, but no more. These sorts of abilities are becoming commoditized, affordable, and open. PlentyOfFish and YouTube make use of CDNs to reduce the amount of infrastructure they need to create for themselves. If you need to stream video why not let a CDN do it instead of building out your own expensive infrastructure? You can take a "let other people do it" approach for services like email, DNS, backup, forums, and blogs too. These are all now outsourcable. Does it make sense to put these services in your data center if you don't need to? If you have compute intensive tasks you can use Amazon services without needing to perform your own build out. And an approach I am really excited to investigate in the future is a new breed of grid based virtual private data centers like 3tera and mediatemple. Their claim to fame is that you can componentize your infrastructure in such a way that you can scale automatically and transparently using their grid as demand fluctuates. I don't have any experience with this approach yet, but it's interesting and probably where the world is heading. If your web site is a relatively simple blog with mostly static content then you can get away with far fewer servers. Even a popular site like Digg has only 30GB of data to store.
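For the S3 route specifically, handing an asset to someone else's storage is only a few lines of code. A sketch, assuming the boto library and AWS credentials already configured in the environment; the bucket and key names are hypothetical.

```python
# Sketch of "get someone else to do it" for storage: push a static asset to S3
# instead of running your own storage farm. Assumes the boto library.
import boto
from boto.s3.key import Key

conn = boto.connect_s3()
bucket = conn.get_bucket("my-site-media")        # hypothetical bucket

key = Key(bucket)
key.key = "photos/2007/08/vacation-1024.jpg"
key.set_contents_from_filename("vacation-1024.jpg", policy="public-read")

# The web tier now just links to the S3 (or CDN-fronted) URL; no NAS, no SAN,
# no storage servers of your own to grow and babysit.
print("http://my-site-media.s3.amazonaws.com/" + key.key)
```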

How do your resources scale with the number of users?

A question you also have to ask is whether your resources scale linearly, exponentially, or not much at all with the number of users. A blog site may not scale much with the number of users. Some sites scale linearly as users are added. And other sites that rely on social interaction, like Google Talk, may scale exponentially as users are added. Getting a feel for the type of site you have can help more realistic numbers pop up on your magic server eight-ball.
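A toy comparison of those growth curves, under assumed (made-up) cost models, shows why the distinction matters:

```python
# Tiny illustration of the growth curves mentioned above. The cost constants
# are invented; only the shape of the curves is the point.
def linear_cost(users, cost_per_user=1.0):
    return users * cost_per_user

def pairwise_cost(users, cost_per_pair=0.001):
    return users * (users - 1) / 2 * cost_per_pair

for n in (1_000, 10_000, 100_000):
    print(n, linear_cost(n), pairwise_cost(n))
# Going 10x on users is 10x the linear cost but roughly 100x the pairwise cost,
# which is why a feel for your curve matters when guessing at server counts.
```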

What's your caching strategy?

A lot of sites use Memcached and Squid for caching. You can fill up a few racks with caching servers. How many servers will you need for caching? Or can you get away with just beefing up the database server cache?
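A typical memcached setup follows the cache-aside pattern: check the cache, fall back to the database on a miss, and write the result back with a short expiry. A minimal sketch, assuming the python-memcached client; the server addresses, keys, and database call are hypothetical.

```python
# Cache-aside sketch for a memcached layer.
import memcache

mc = memcache.Client(["10.0.0.11:11211", "10.0.0.12:11211"])  # hypothetical hosts

def db_load_profile(user_id):
    # Stand-in for the real (slow) database query.
    return {"id": user_id, "name": "user%d" % user_id}

def get_user_profile(user_id):
    key = "profile:%d" % user_id
    profile = mc.get(key)
    if profile is None:
        profile = db_load_profile(user_id)
        mc.set(key, profile, time=300)   # cache for 5 minutes
    return profile
```

Whether a few such cache boxes are enough, or whole racks of them, comes back to the traffic estimate above.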

Do you need servers for application specific tasks?

Servers aren't just for storage, the database, and the web tier. You may have a bit of computation going on. YouTube offloads tag calculations to a server farm. GoogleTalk has to have servers for handling presence calculations. PlentyOfFish has servers to handle geographical searches because they are so resource intensive. GigaVox needs servers to transcode podcasts into different formats and include fresh commercial content. If you are a calendar service you may need servers to calculate more complicated schedule availability schemes and to sync address books. So depending on your site, you may have to budget for many application related servers like these. The Pingdom folks also created a sweet table on what technologies the companies profiled on this site are using. You can find it at What nine of the world’s largest websites are running on. I'm very jealous of their masterful colorful graphics-fu style. Someday I hope to rise to that level of presentation skill.

Click to read more ...

Thursday
Aug162007

Scaling Secret #2: Denormalizing Your Way to Speed and Profit

Alan Watts once observed how, after we accepted Descartes' separation of the mind and body, we've been trying to smash them back together again ever since, when really they were never separate to begin with. The database normalization-denormalization dualism has the same Mobius-shaped reverberations as Descartes' error. We separate data into a million jagged little pieces and then spend all our time stooping over, picking them up, and joining them back together again. Normalization has been standard practice now for decades. But times are changing. Many mega-website architects are concluding Watts was right: the data was never separate to begin with. And even more radical, we may even need to store multiple copies of data.

Information Sources

  • Normalization Is for Sissies by Pat Helland
  • Data normalization, is it really that good? by Arnon Rotem-Gal-Oz
  • When Not to Normalize your SQL Database by Dare Obasanjo
  • MegaData by Joe Gregorio
  • Audio of talk by Adam Bosworth at the MySQL Users Conference 2005

We normalize data to prevent anomalies. Anomalies are bad things like forgetting to update someone's address in all the places it's been stored when they move. This anomaly happens because the address has been duplicated. So to prevent the anomaly we don't duplicate data. We split everything up so it is stored once and exactly once. Bad things are far less likely to happen if we follow this strategy. And that's a good thing. The process of getting rid of all potential bad things is called normalization and we have a bunch of rules to follow to normalize our data. The price of normalization is that when we want a person's address we have to go find the person and their address in separate operations and bring the data together again. This is called a join. The problem is joins are relatively slow, especially over very large data sets, and if they are slow your website is slow. It takes a long time to get all those separate bits of information off disk and put them all together again. Flickr decided to denormalize because it took 13 Selects for each Insert, Delete, or Update (the sketch after the list below shows the trade-off in miniature).

If you say your database is the bottleneck then the finger is pointed back at you and you are asked what you are doing wrong. Have you created proper indexes? Is your schema design good? Is your database efficient? Are you tuning your queries? Have you cached in the database? Have you used views? Have you cached complicated queries in memcached? Can you get more parallel IO out of your database? And all these are valid and good questions. For your typical transactional database these would be your normal paths of attack. But we aren't talking about your normal database. We are talking about web scale services that have to process loads higher than any database can scale to. At some point you need a different approach. What many mega-scale websites with billions of records, petabytes of data, many thousands of simultaneous users, and millions of queries a day are doing is using a sharding scheme, and some are even advocating denormalization as the best strategy for architecting the data tier. We see this with Ebay, who moved all significant functionality out of the database and into applications. Flickr shards and replicates their data to reach high performance levels. For Flickr this moves transaction logic back into their application layer, but the win is higher scalability. Joe Gregorio has identified some common themes across these new mega-data systems:
  • Distributed - The data has to be distributed across multiple machines.
  • Joinless - No joins, and no referential integrity, at least at the data store level.
  • De-Normalized - De-normalization is needed if you are avoiding joins.
  • Transactionless - No transactions

It's the web model pushed to the data tier. Ironically, it may take a web model on the back-end to support a web model on the front-end.
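As promised above, here is the normalized-versus-denormalized trade-off in miniature. The schema is invented for illustration and is not from Flickr, Ebay, or any of the sites discussed; denormalizing buys a join-free hot read at the cost of updating the duplicated field yourself.

```python
# Illustrative SQL showing the trade-off. Normalized: the hot read pays for a
# join. Denormalized: the read is a single-table lookup, and the write path
# must keep the copies consistent.
NORMALIZED_READ = """
SELECT c.body, u.display_name
FROM comments c JOIN users u ON u.id = c.user_id
WHERE c.thread_id = %s
"""

DENORMALIZED_READ = """
SELECT body, author_display_name   -- name duplicated onto each comment row
FROM comments
WHERE thread_id = %s
"""

# The anomaly normalization protects against now has to be handled by hand:
DENORMALIZED_NAME_CHANGE = """
UPDATE comments SET author_display_name = %s WHERE user_id = %s
"""
```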

The Great Data Ownership Wars: The Database vs. The Application

A not so subtle clue as to who won the data wars is to look at the words used. Data that are split up are considered "normal." Those who keep their data whole are considered "de-normal." All right, that's not what those words mean, but it was too good to pass up. :-) Traditionally the database owns the data. Referential integrity, triggers, stored procedures, and everything else that keeps the data safe and whole is in the database. Applications are prevented from screwing up the data. And this makes sense until you scale. Centralizing all behavior in the database won't mega-scale as the web does, which is why Ebay went completely the other way. Ebay maintains data integrity through a service layer that encapsulates all data access. The service layer handles referential integrity, managing replicated copies, doing joins, and so on. It's more error prone than having the database do all this work, but you are able to scale past what even the highest end databases can handle. All this sharding and denormalization and duplicating at one level feels so wrong because it's so different from what we were all taught. And unless you are a really large website you probably don't need to worry about this level of complexity. But it's a really fascinating and unexpected evolution in design. Scaling to handle the world wide web requires techniques and strategies that are often at odds with our years of experience. It will be fun to see where it all leads.
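When joins and referential integrity move out of the database, the application (or a service layer in front of it) ends up doing the stitching itself. A rough sketch of what an application-level join across shards can look like; everything here is invented, with in-memory dicts standing in for the comment store and the sharded user tables.

```python
# Sketch of an application-level join across shards, in the spirit of the
# service-layer approach described above. Purely illustrative.
USER_SHARDS = [
    {2: "bob", 4: "dave"},      # shard 0 (even user ids)
    {1: "alice", 3: "carol"},   # shard 1 (odd user ids)
]
COMMENTS = [
    {"id": 10, "thread_id": 7, "user_id": 1, "body": "first!"},
    {"id": 11, "thread_id": 7, "user_id": 2, "body": "nice post"},
]

def shard_for(user_id):
    """Route a user to a shard; a real system might hash or use a lookup table."""
    return USER_SHARDS[user_id % len(USER_SHARDS)]

def comments_with_authors(thread_id):
    # 1. Fetch the comments for the thread.
    comments = [c for c in COMMENTS if c["thread_id"] == thread_id]
    # 2. "Join" in the application: look up each author on its own shard.
    authors = {uid: shard_for(uid).get(uid, "unknown")
               for uid in {c["user_id"] for c in comments}}
    # 3. Stitch the result together; keeping it consistent is now our job,
    #    not the database's.
    return [dict(c, author=authors[c["user_id"]]) for c in comments]

print(comments_with_authors(7))
```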

Related Articles

  • Flickr both denormalizes and duplicates data. Horror!
  • Ebay is the most radical in moving almost all functionality out of the database and into the application.
  • Plenty of Fish also advocates denormalization as a key strategy.
  • Hadoop - a framework for running applications on large clusters of commodity hardware using a computational paradigm named map/reduce.

Click to read more ...

Tuesday
Aug072007

Can you profit from the coming Content Delivery Network wars?

Playing like the big boys may be getting cheaper. The big boys, like YouTube, farm the serving of their most popular videos out to a third party CDN. A lot of people were surprised YouTube didn't serve all their content themselves, but it makes sense. It allows them to keep up with demand without a large hit for infrastructure build out, much like leasing computers instead of buying them. The problem has been that CDNs are expensive. Om Malik reports in Akamai & the CDN Price Wars that this may be changing. CDN service could be becoming affordable enough that you might consider using one as part of your scaling strategy. Akamai, once the clear leader in the CDN field, is facing strong competition from the likes of Limelight Networks, Level 3, Internap, CDNetworks, Panther Express and EdgeCast Networks. This commoditization may be bad for their stock prices, but it's good for website builders looking for new scaling strategies. EdgeCast, for example, passes on the cost savings when their bandwidth costs drop. Other services lock you into fixed cost contracts. So competition is good. New cheaper, faster, and easier possibilities for scaling your website are coming on line. Maybe CDNs can help you.

Related Articles

  • Akamai & the CDN Price Wars
  • Are CDNs Becoming Commoditized, Again?
  • YouTube Architecture
  • EdgeCast Ready To Take On Akamai, Limelight

Click to read more ...

Saturday
Aug042007

Try Squid as a Reverse Proxy

This scalability strategy is brought to you by Erik Osterman: My recommendation for anyone dealing with explosive growth on a limited budget with lots of cachable content (e.g. content capable of returning valid expiration headers) is to employ a reverse proxy as mentioned in this article. In the last week, we had a site get AP'd, triggering 100K unique visitors to a single IIS server in under 5 hours. It took out the IIS server. Placing a single Squid in front of the server handled the entire onslaught with a max server load of 0.10 on a modest Intel IV 3GHz. It's trivial to implement for anyone interested...
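The prerequisite called out above is content that returns valid expiration headers; without them the proxy can't safely cache anything. A minimal sketch, assuming a Python WSGI app (purely illustrative, not the setup described), of the headers that let a reverse proxy such as Squid absorb that kind of traffic spike:

```python
# Emit Cache-Control and Expires headers so a reverse proxy can serve the page
# from cache. The page body and cache lifetime are hypothetical.
from wsgiref.simple_server import make_server
from wsgiref.handlers import format_date_time
import time

CACHE_SECONDS = 300  # let the proxy serve this page for 5 minutes

def app(environ, start_response):
    body = b"<html><body>Mostly static article page</body></html>"
    headers = [
        ("Content-Type", "text/html"),
        ("Cache-Control", "public, max-age=%d" % CACHE_SECONDS),
        ("Expires", format_date_time(time.time() + CACHE_SECONDS)),
    ]
    start_response("200 OK", headers)
    return [body]

if __name__ == "__main__":
    make_server("127.0.0.1", 8080, app).serve_forever()
```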

Click to read more ...

Wednesday
Jul252007

Paper: Lightweight Web servers

This paper is a great overview of different lightweight web servers. A lot of websites use lightweight web servers to serve images and static content. YouTube is one example: http://highscalability.com/youtube-architecture. So if you need to improve performance consider changing over to a different web server for some types of content. Overview: Recent years have enjoyed a florescence of interesting implementations of Web servers, including lighttpd, litespeed, and mongrel, among others. These Web servers boast different combinations of performance, ease of administration, portability, security, and related values. The following engineering study surveys the field of lightweight Web servers to help you find one likely to meet the technical requirements of your next project. "Lightweight" Web servers like lighttpd, litespeed, and mongrel can offer dramatic benefits for your projects. This article surveys the possibilities and shows how they apply to you. Important dimensions for evaluation of a Web server include:

  • Performance: How fast does it respond to requests?
  • Scalability: Does the server continue to behave reliably when many users simultaneously access it?
  • Security: Does the server do only the operations it should? What support does it offer for authenticating users and encrypting its traffic? Does its use make nearby applications or hosts more vulnerable?
  • Availability: What are the failure modes and incidences of the server?
  • Compliance to standards: Does the server respect the pertinent RFCs?
  • Flexibility: Can the server be tuned to accommodate heavy request loads, or computationally demanding dynamic pages, or expensive authentication, or ...?
  • Platform requirements: On what range of platforms is the server available? Does it have specific hardware needs?
  • Manageability: Is the server easy to set up and maintain? Is it compatible with organizational standards for logging, auditing, costing, and so on?

Click to read more ...
