Entries by Todd Hoff (380)

Thursday, October 1, 2009

Moving Beyond End-to-End Path Information to Optimize CDN Performance

You go through the expense of installing CDN nodes all over the globe to make sure users always have a node close by, and then you notice something curious and infuriating: clients still experience poor latencies. What's up with that? What do you do to find the problem? If you are Google you build a tool (WhyHigh) to figure out what's up. This paper is about that tool and the unexpected problem of high latencies on CDNs. The main problems they found: inefficient routing to nearby nodes and packet queuing. But more useful is the architecture of WhyHigh and how it goes about identifying bottlenecks. And even more useful is the general belief in creating sophisticated tools to understand and improve your service. That's what professionals do. From the abstract:
Replicating content across a geographically distributed set of servers and redirecting clients to the closest server in terms of latency has emerged as a common paradigm for improving client performance. In this paper, we analyze latencies measured from servers in Google’s content distribution network (CDN) to clients all across the Internet to study the effectiveness of latency-based server selection. Our main result is that redirecting every client to the server with least latency does not suffice to optimize client latencies. First, even though most clients are served by a geographically nearby CDN node, a sizeable fraction of clients experience latencies several tens of milliseconds higher than other clients in the same region. Second, we find that queueing delays often override the benefits of a client interacting with a nearby server.
To help the administrators of Google’s CDN cope with these problems, we have built a system called WhyHigh. First, WhyHigh measures client latencies across all nodes in the CDN and correlates measurements to identify the prefixes affected by inflated latencies. Second, since clients in several thousand prefixes have poor latencies, WhyHigh prioritizes problems based on the impact that solving them would have, e.g., by identifying either an AS path common to several inflated prefixes or a CDN node where path inflation is widespread. Finally, WhyHigh diagnoses the causes for inflated latencies using active measurements such as traceroutes and pings, in combination with datasets such as BGP paths and flow records. Typical causes discovered include lack of peering, routing misconfigurations, and side-effects of traffic engineering. We have used WhyHigh to diagnose several instances of inflated latencies, and our efforts over the course of a year have significantly helped improve the performance offered to clients by Google’s CDN.
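
The core of the approach is easy to picture: collect round-trip times per client routing prefix at every CDN node, compare each prefix against other clients in the same region, and rank the inflated prefixes by impact before diagnosing them with traceroutes and BGP data. Below is a minimal sketch of just that correlation-and-ranking step; the data layout, the regional baseline, and the 50 ms threshold are my illustrative assumptions, not Google's implementation.

```python
from collections import defaultdict
from statistics import median

def find_inflated_prefixes(rtt_samples, threshold_ms=50):
    """rtt_samples: iterable of (prefix, region, rtt_ms) tuples gathered at CDN nodes.
    Returns prefixes whose median latency exceeds their region's baseline,
    ranked by how badly they are inflated (a stand-in for WhyHigh's impact ranking)."""
    by_region = defaultdict(list)
    by_prefix = defaultdict(list)
    prefix_region = {}
    for prefix, region, rtt in rtt_samples:
        by_region[region].append(rtt)
        by_prefix[prefix].append(rtt)
        prefix_region[prefix] = region

    inflated = []
    for prefix, rtts in by_prefix.items():
        baseline = median(by_region[prefix_region[prefix]])
        excess = median(rtts) - baseline
        if excess > threshold_ms:
            inflated.append((prefix, excess))
    return sorted(inflated, key=lambda item: item[1], reverse=True)
```

Prefixes surfaced this way would then be handed to active probing (pings, traceroutes) to pin down causes such as missing peering or routing misconfiguration, as the paper describes.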

Related Articles

  • Product: Akamai
    Tuesday, September 22, 2009

    How Ravelry Scales to 10 Million Requests Using Rails

    Tim Bray has a wonderful interview with Casey Forbes, creator of Ravelry, a Ruby on Rails site supporting a 400,000+ strong community of dedicated knitters and crocheters.

    Casey and his small team have done great things with Ravelry. It is a very focused site that provides a lot of value for users. And users absolutely adore the site. That's obvious from their enthusiastic comments and rocket fast adoption of Ravelry.

    Ten years ago a site like Ravelry would have been a multi-million dollar operation. Today Casey is the sole engineer for Ravelry and it takes only a few people to run it. He was able to code it in 4 months working nights and weekends. Take a look below at all the technologies used to make Ravelry and you'll see how it is constructed almost completely from free, off-the-shelf software that Casey has stitched together into a complete system. There's an amazing amount of leverage in today's ecosystem when you combine all the quality tools, languages, storage, bandwidth and hosting options.

    Now Casey and several employees make a living from Ravelry. Isn't that the dream of any small business? How might you go about doing the same thing?

    Site: http://www.ravelry.com

    Statistics

  • 10 million requests a day hit Rails (AJAX + RSS + API)
  • 3.6 million pageviews per day
  • 430,000 registered users. 70,000 active each day. 900 new sign ups per day.
  • 2.3 million knitting/crochet projects, 50,000 new forum posts each day, 19 million forum posts, 13 million private messages, 8 million photos (the majority are hosted by Flickr).
  • Started on a small VPS and demand exploded from the start.
  • Monetization: advertisers + merchandise store + pattern sales

    Platform

  • Ruby on Rails (1.8.6, Ruby GC patches)
  • Percona build of MySQL
  • Gentoo Linux
  • Servers: Silicon Mechanics (owned, not leased)
  • Hosting: Colocation with Hosted Solutions
  • Bandwidth: Cogent (very cheap)
  • Capistrano for deployment.
  • Nginx is much faster and less memory hungry than Apache.
  • Xen for virtualization
  • HAproxy for load balancing.
  • Munin for monitoring.
  • Tokyo Cabinet/Tyrant for large object caching
  • Nagios for alerts
  • HopToad for exception notifications.
  • NewRelic for tuning
  • Syslog-ng for log aggregation
  • S3 for storage
  • Cloudfront as a CDN
  • Sphinx for the search engine
  • Memcached for small object caching

    Architecture

  • 7 Servers (Gentoo Linux). Virtualization (Xen) creates 13 virtual servers.
  •  Front end uses Nginx and HAproxy. The request flow: nginx -> haproxy -> (load balanced) -> apache + mod_passenger. Nginx is first so it can provide functions like serving static files and redirects before passing a request to HAproxy for load balancing. Apache is probably used because it is more configurable than Nginx.
  •  One small backup server.
  • One small utility server for non-critical processes and staging.
  • Two servers with 32 GB of RAM each for the master database, slave database, and Sphinx search engine.
  • Three application servers running six Apache/Passenger and Ruby instances, each capped at a pool size of 20. Six quad-core processors and 40 GB of RAM in total. There's RAM to spare.
  • 5 terabytes of storage on Amazon S3. Cloudfront is used as a CDN.
  • Tokyo Cabinet/Tyrant is used instead of memcached in some places for caching larger objects, specifically markdown text that has been converted to HTML (see the sketch after this list).
  • HAproxy and Capistrano are used for rolling deploys of new versions of the site without affecting performance/traffic.
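
    To make the large-object caching item above concrete, here is a minimal sketch of caching markdown rendered to HTML, keyed by a hash of the source text so an edit naturally produces a new cache key. This is not Ravelry's code: the client library, address, and key scheme are assumptions (Tokyo Tyrant also speaks the memcached protocol, so a memcached client can point at either store).

    ```python
    import hashlib

    import markdown              # assumed renderer; Ravelry's pipeline may differ
    import memcache              # python-memcached client; Tokyo Tyrant exposes a
                                 # memcached-compatible protocol as well

    cache = memcache.Client(['127.0.0.1:11211'])   # address/port are illustrative

    def render_cached(source_text):
        # Key on a hash of the source so editing the text misses the old entry
        # instead of requiring explicit invalidation.
        key = 'md:' + hashlib.sha1(source_text.encode('utf-8')).hexdigest()
        html = cache.get(key)
        if html is None:
            html = markdown.markdown(source_text)
            cache.set(key, html)
        return html
    ```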

    Lessons Learned

  • Let your users create the site for you. Iterate and evolve. Start with something that works, get people in it, and build it together. Have a slow beta. Invite new people on slowly. Talk to the users about what they want every single day. Let your users help build your site. The result will be more reassuring, comforting, intuitive, and effective.
  • Let your users fund you. Ravelry was funded in part by users who donated $71K. That's a gift, not stock. Don't give up equity in your company. It took 6 months of working full time and bandwidth/server costs before they started making a profit and this money helped bridge that gap. The key is having a product users feel passionate about and being the kind of people users feel good about supporting. That requires love and authenticity.
  • Become the farmer's market of your niche. Find an underserved niche. Be anti-mass market. You don't always have to create something for the millions. The millions will likely yawn. Create something and do a good job for a smaller passionate group and that passion will transfer over to you.
  • Success is not about scale, it’s about sustainable execution. This lovely quote is from Jeff Putz.
  • The database is always the problem. Nearly all of the scaling/tuning/performance related work is database related. For example, MySQL schema changes on large tables are painful if you don’t want any downtime. One of the arguments for schemaless databases.
  • Keep it fun. Casey switched to Ruby on Rails because he was looking to make programming fun again. That reenchantment helped make the site possible.
  • Invent new things that delight your users. Go for magic. Users like that. This is one of Costco's principles too. This link, for example, describes some very innovative approaches to forum management.
  • Ruby rocks. It's a fun language and allowed them to develop quickly and release the site twice a day during beta.
  • Capture more profit using low margin services. Ravelry has their own merchandise store, wholesale accounts, printers, and fulfillment company. This allows them to keep all their costs lower so their profits aren't going to third-party services like CafePress.
  • Going from one server to many servers is the hardest transition to make. Everything changes and becomes more difficult. Have this transition in mind when you are planning your architecture.
  • You can do a lot with a little in today's ecosystem. It doesn't take many people or much money anymore to build a complex site like Ravelry. Take a look at all the different programs Ravelry uses to build their site and how few people are needed to run the site.

    Some people complain that there aren't a lot of nitty-gritty details about how Ravelry works. I think it should be illuminating that a site of this size doesn't need a lavish description of arcane scaling strategies. It can now be built from off-the-shelf parts smartly put together. And that's pretty cool.

    Related Articles

  • Ravelry gets funding from its own community.
  • Apache/Passenger vs Nginx/Mongrel by Matt Darby
  • The Ravelry Blog (note the number of comments on posts).
  • Podcast - Episode 4: Y Ravelry (featuring Jess & Casey)
  • Beta testing and beyond
  • Hacker News Thread - I included the reasoning from a user named Brett for why the HTTP request path is "Nginx out front passing requests to HAProxy and THEN to Apache + mod_rails."
    Thursday, September 17, 2009

    Hot Links for 2009-9-17 

  • Save 25% on Hadoop Conference Tickets
    Apache Hadoop is a hot technology getting traction all over the enterprise and in the Web 2.0 world. Now, there's going to be a conference dedicated to learning more about Hadoop. It'll be Friday, October 2 at the Roosevelt Hotel in New York City.

    Hadoop World, as it's being called, will be the first Hadoop event on the east coast. Morning sessions feature talks by Amazon, Cloudera, Facebook, IBM, and Yahoo! Then it breaks out into three tracks: applications, development / administration, and extensions / ecosystems. In addition to the conference itself, there will also be 3 days of training prior to the event for those looking to go deeper. In addition to general sessions speakers, presenters include Hadoop project creator Doug Cutting, as well as experts on large-scale data from Intel, Rackspace, Softplayer, eHarmony, Supermicro, Impetus, Booz Allen Hamilton, Vertica, About.com, and other companies.

    Readers get a 25% discount if you register by Sept. 21: http://hadoop-world-nyc.eventbrite.com/?discount=hadoopworld_promotion_highscalability.

  • Essential storage tradeoff: Simple Reads vs. Simple Writes by Stephan Schmidt. Data in denormalized chunks is easy to read and complex to write.
  • Kickfire's approach to parallelism by Daniel Abadi. Kickfire uses column-oriented storage and execution to address I/O bottlenecks and FPGA-based data-flow architecture to address processing and memory bottlenecks.
  • "Just in Time" Decompression in Analytic Databases by Michael Stonebraker. A DBMS that is optimized for compression through and through--especially with a query executor that features just in time decompression will not just reduce IO and storage overhead, but also offer better query performance with lower CPU resource utilization.
  • Reverse Proxy Performance – Varnish vs. Squid (Part 2) by Bryan Migliorisi. My results show that in raw cache hit performance, Varnish puts Squid to shame.
  • Building Scalable Databases: Denormalization, the NoSQL Movement and Digg by Dare Obasanjo. As a Web developer it's always a good idea to know what the current practices are in the industry even if they seem a bit too crazy to adopt…yet.
  • How To Make Life Suck Less (While Making Scalable Systems) by Bradford Stephens. Scalable doesn’t imply cheap or easy. Just cheaper and easier.
  • Some perspective to this DIY storage server mentioned at Storagemojo by Joerg Moellenkamp. It's about making decisions: application and hardware have to be seen as one, and it depends on whether your application is capable of overcoming the limitations and problems of such ultra-cheap storage.
    Wednesday, September 16, 2009

    Paper: A practical scalable distributed B-tree

    We've seen a lot of NoSQL action lately built around distributed hash tables. Btrees are getting jealous. Btrees, once the king of the database world, want their throne back. Paul Buchheit surfaced a paper: A practical scalable distributed B-tree by Marcos K. Aguilera and Wojciech Golab, that might help spark a revolution.

    From the Abstract:

    We propose a new algorithm for a practical, fault tolerant, and scalable B-tree distributed over a set of servers. Our algorithm supports practical features not present in prior work: transactions that allow atomic execution of multiple operations over multiple B-trees, online migration of B-tree nodes between servers, and dynamic addition and removal of servers. Moreover, our algorithm is conceptually simple: we use transactions to manipulate B-tree nodes so that clients need not use complicated concurrency and locking protocols used in prior work. To execute these transactions quickly, we rely on three techniques: (1) We use optimistic concurrency control, so that B-tree nodes are not locked during transaction execution, only during commit. This well-known technique works well because B-trees have little contention on update. (2) We replicate inner nodes at clients. These replicas are lazy, and hence lightweight, and they are very helpful to reduce client-server communication while traversing the B-tree. (3) We replicate version numbers of inner nodes across servers, so that clients can validate their transactions efficiently, without creating bottlenecks at the root node and other upper levels in the tree.

    Distributed hash tables are scalable because records are easily distributed across a cluster, which gives the golden ability to perform many writes in parallel. The problem is that keyed access is very limited.

    A lot of the time you want to iterate through records or search records in sorted order. Sorted could mean timestamp order or last name order, for example.

    Access to data in sorted order is what btrees are for. But we simply haven't seen distributed btree systems develop. Instead, you would have to use some sort of map-reduce mechanism to efficiently scan all the records or you would have to maintain the information in some other way.

    This paper points the way to do some really cool things at a system level:

  • It's distributed so it can scale dynamically in size and handle writes in parallel.
  • It supports adding and dropping servers dynamically, which is an essential requirement for architectures based on elastic cloud infrastructures.
  • Data can be migrated to other nodes, which is essential for maintenance.
  • Multiple records can be involved in transactions, which is essential for the complex data manipulations that happen in real systems. This is accomplished via a version number mechanism that looks something like MVCC (see the sketch after this list).
  • Optimistic concurrency, that is, the ability to change data without explicit locking, makes the job for programmers a lot easier.
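
    As a toy, single-process illustration of that optimistic, version-validated commit: nodes carry version numbers, a transaction records the versions it observed while traversing, and the commit aborts if anything changed underneath it. The names are mine, and the real system validates across servers using replicated version numbers rather than one local store.

    ```python
    class Node:
        """A B-tree node as a server might store it: payload plus a version
        number that is bumped on every committed change."""
        def __init__(self, keys, children=None, version=0):
            self.keys = keys
            self.children = children or []
            self.version = version

    def commit(store, read_set, write_set):
        """Validate-then-apply. Nodes are never locked while the transaction
        runs; at commit time the versions observed during traversal are
        checked, and the transaction aborts (caller retries) on any conflict."""
        for node_id, seen_version in read_set.items():
            if store[node_id].version != seen_version:
                return False                      # someone changed this node
        for node_id, new_node in write_set.items():
            old = store.get(node_id)
            new_node.version = old.version + 1 if old else 1
            store[node_id] = new_node
        return True
    ```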

    These are the kind of features needed for systems in the field. Hopefully we'll start seeing more systems offering richer access structures while still maintaining scalability.
    Saturday, September 12, 2009

    How Google Taught Me to Cache and Cash-In

    A user named Apathy, commenting on how Reddit scales some of its features, shares some advice he learned while working at Google and other major companies.

    To be fair, I [Apathy] was working at Google at the time, and every job I held between 1995 and 2005 involved at least one of the largest websites on the planet. I didn't come up with any of these ideas, just watched other smart people I worked with who knew what they were doing and found (or wrote) tools that did the same things. But the theme is always the same:

    1. Cache everything you can and store the rest in some sort of database (not necessarily relational and not necessarily centralized).
    2. Cache everything that doesn't change rapidly. Most of the time you don't have to hit the database for anything other than checking whether the users' new message count has transitioned from 0 to (1 or more).
    3. Cache everything--templates, user message status, the front page components--and hit the database once a minute or so to update the front page, forums, etc. This was sufficient to handle a site with a million hits a day on a couple of servers. The site was sold for $100K.
    4. Cache the users' subreddits. Blow out the cache on update.
    5. Cache the top links per subreddit. Blow out cache on update.
    6. Combine the previous two steps to generate a menu from cached blocks.
    7. Cache the last links. Blow out the cache on each outlink click.
    8. Cache the user's friends. Append 3 characters to their name.
    9. Cache the user's karma. Blow out on up/down vote (see the sketch after this list).
    10. Filter via conditional formatting, CSS, and an ajax update.
    11. Decouple selection/ranking algorithm(s) from display.
    12. Use Google or something like Xapian or Lucene for search.
    13. Cache "for as long as memcached will stay up." That depends on how many instances you're running, what else is running, how stable the Python memcached hooks are, etc.
    14. The golden rule of website engineering is that you don't try to enforce partial ordering simultaneously with your updates.
    15. When running a search engine operate the crawler separately from the indexer.
    16. Ranking scores are used as necessary from the index, usually cached for popular queries.
    17. Re-rank popular subreddits or the front page once a minute. Tabulate votes and pump them through the ranker.
    18. Cache the top 100 per subreddit. Then cache numbers 100-200 when someone bothers to visit the 5th page of a subreddit, etc.
    19. For less-popular subreddits, you cache the results until an update comes in.
    20. With enough horsepower and common sense, almost any volume of data can be managed, just not in realtime.
    21. Never ever mix your reads and writes if you can help it.
    22. Merge all the normalized rankings and cache the output every minute or so. This avoids thousands of queries per second just for personalization.
    23. It's a lot cheaper to merge cached lists than build them from scratch. This delays the crushing read/write bottleneck at the database. But you have to write the code.
    24. Layering caches is a classic strategy for milking your servers as much as possible. First look for an exact match. If that's not found, look for the components and build an exact match. If one or more of the components aren't found, regenerate those from the DB (now they're cached!) and proceed. Never hit the database unless you have to.
    25. The majority of traffic on almost all websites comes from the default, un-logged-in front page or from random forum/comment/result pages. Make sure those are cached as much as possible.
    26. You (almost) always have to hit the database on writes. The key is to avoid hitting it for reads until you're forced to do so.
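
    A minimal sketch of the "cache it, blow it out on update" pattern from points 4-9 and 26, using the karma example. The key scheme is an assumption and a dict stands in for the real database here.

    ```python
    import memcache

    mc = memcache.Client(['127.0.0.1:11211'])
    fake_db = {}    # stands in for the relational store in this sketch

    def get_karma(user_id):
        # Read path: serve from cache when possible, fall back to the
        # database and repopulate the cache on a miss.
        key = 'karma:%d' % user_id
        karma = mc.get(key)
        if karma is None:
            karma = fake_db.get(user_id, 0)
            mc.set(key, karma)
        return karma

    def record_vote(user_id, delta):
        # Write path: always hit the database, then blow out the cache
        # entry so the next read rebuilds it.
        fake_db[user_id] = fake_db.get(user_id, 0) + delta
        mc.delete('karma:%d' % user_id)
    ```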
    Wednesday, September 9, 2009

    GridwiseTech revolutionizes data management

    GridwiseTech has developed AdHoc, an advanced framework for sharing geographically distributed data and compute resources. It simplifies the resource management and makes cooperation secure and effective.
    The premise of AdHoc is to enable each member of the associated institutions to control access to his or her resources without an IT administrator’s help, while assuring a high level of security for any exposed data or applications.
    It takes 3 easy steps to establish cooperation within AdHoc: create a virtual organization, add resources and share them. The application can be implemented within any organization to exchange data and resources or between institutions to join forces for more efficient results.
    AdHoc was initially created for a consortium of hospitals and institutions to share medical data sets. As a technical partner in that project, GridwiseTech implemented the Security Framework to provide access to that data and designed a graphical tool to facilitate the administration of the entire system.

    Every participant agreed to grant access to its resources to other partners in the project. Analysis of more patients’ records meant bigger samples and, potentially, better research. As most of these data are subject to a strict privacy policy, they could only be accessible for specific research purposes within defined time periods. In each case, patients’ identity remained anonymous and they provided consent to use their data for experiments. AdHoc enabled easy dynamic access rights management and, at the same time, prevented unauthorized access to sensitive information.
    “Advanced international scientific consortia need to set up ad-hoc collaborations. For this reason, we used the concept of Virtual Organizations, introduced by international Grid projects. However, to create such a VO and grant people access to different resources, a lot of administrative effort is needed, including admins’ time and paperwork. GridwiseTech's AdHoc software is the first application I know of for truly dynamic Virtual Organizations, where users themselves are responsible for their resources and can share them easily in real time without involving an administrator,” said Andrea De Luca, Clinician and Researcher at the Institute of Clinical Infectious Diseases, Catholic University of Rome, Italy.
    In this critical domain, the GridwiseTech software system proved to be versatile. Its combination of security and simplicity makes it a unique tool for rapid collaborations and modern e-Science.
    Read more at www.gridwisetech.com/adhoc

    Acknowledgments
    - AdHoc is based on open-source components such as Shibboleth from Internet2.
    - AdHoc was used within the ViroLab project, an EU-funded research initiative in the scope of the 6th Framework Programme. ViroLab’s main objective is to develop a “Virtual Laboratory” for medical experts, enabling clinical studies, medical knowledge discovery, and decision support for HIV drug resistance.

    Monday, September 7, 2009

    Product: Infinispan - Open Source Data Grid

    Infinispan is a highly scalable, open source licensed data grid platform in the style of GigaSpaces and Oracle Coherence.

    From their website:

    The purpose of Infinispan is to expose a data structure that is highly concurrent, designed ground-up to make the most of modern multi-processor/multi-core architectures while at the same time providing distributed cache capabilities. At its core Infinispan exposes a JSR-107 (JCACHE) compatible Cache interface (which in turn extends java.util.Map). It is also optionally backed by a peer-to-peer network architecture to distribute state efficiently around a data grid.

    Offering high availability via making replicas of state across a network as well as optionally persisting state to configurable cache stores, Infinispan offers enterprise features such as efficient eviction algorithms to control memory usage as well as JTA compatibility.

    In addition to the peer-to-peer architecture of Infinispan, on the roadmap is the ability to run farms of Infinispan instances as servers and connecting to them using a plethora of clients - both written in Java as well as other popular platforms.

    A few observations:

  • Open source is an important consideration, depending on your business model. As you scale out your costs don't go up. The downside is you'll likely put in more programming effort to implement capabilities the commercial products have already solved.
  • It's from the makers of JBoss Cache so it's likely to have a solid implementation, even this early in its development cycle. The API looks very well thought out.
  • Java only. Plan is to add more bindings in the future.
  • Distributed hash table only. Commercial products have very advanced features like distributed query processing which can make all the difference during implementation. We'll see how the product expands from its caching roots into a full fledged data manipulation platform.
  • MVCC and a STM-like approach provide lock- and synchronization-free data structures. This means dust off all those non-blocking algorithms you've never used before. It will be very interesting to see how this approach performs under real-life loads programmed by real-life programmers not used to such techniques.
  • Data is made safe using a configurable degree of redundancy. State is distributed across a cluster. And it's peer-to-peer; there's no central server. (A toy sketch of replica placement follows this list.)
  • API based (put and get operations). XML, bytecode manipulation and JVM hooks aren't used.
  • Future plans call for adding a compute-grid for map-reduce style operations.
  • Distributed transactions across multiple objects are supported. It also offers eviction strategies to ensure individual nodes do not run out of memory and passivation/overflow to disk. Warm-starts using preloads are also supported.
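
    As a toy illustration of the replica-placement idea behind a configurable degree of redundancy, here is a tiny consistent-hash ring where each key is owned by the next num_owners nodes around the ring. The class and parameters are mine for illustration; this is not Infinispan's actual wiring.

    ```python
    import hashlib
    from bisect import bisect_right

    class HashRing:
        def __init__(self, nodes, num_owners=2):
            # One point per node keeps the toy simple; real rings use many
            # virtual points per node for smoother balance.
            self.num_owners = num_owners
            self.ring = sorted((self._hash(n), n) for n in nodes)

        @staticmethod
        def _hash(value):
            return int(hashlib.md5(value.encode('utf-8')).hexdigest(), 16)

        def owners(self, key):
            points = [h for h, _ in self.ring]
            start = bisect_right(points, self._hash(key)) % len(self.ring)
            return [self.ring[(start + i) % len(self.ring)][1]
                    for i in range(min(self.num_owners, len(self.ring)))]

    ring = HashRing(['node-a', 'node-b', 'node-c'])
    print(ring.owners('user:42'))   # two of the three nodes hold replicas of this key
    ```

    Losing a node only re-homes the keys it owned, which is what lets a grid rebalance rather than restart.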

    It's exciting to have an open source grid alternative. It will be interesting to see how Infinispan develops in quality and its feature set. Making a mission critical system of this type is no simple task.

    I don't necessarily see Infinispan as just a competitor for obvious players like GigaSpaces and Coherence; it may play even more strongly in the NoSQL space. For people looking for a reliable, highly performant, scalable, transaction-aware hash storage system, Infinispan may look even more attractive than a lot of the disk-based systems.

    Related Articles

  • Video Interview with Manik Surtani, Founder & Project Lead at JBoss Cache, Infinispan Data Grid
  • Infinispan Interview by Mark Little on InfoQ.
  • Are Cloud Based Memory Architectures the Next Big Thing?
  • Infinispan - data grids meets open source on TheServerSide.com
  • Technical FAQs
  • Anti-RDBMS: A list of distributed key-value stores
  • Infinispan Wiki
  • Distribution instead of Buddy Replication
    Friday, September 4, 2009

    Hot Links for 2009-9-4 

  • A tour through hybrid column/row-oriented DBMS schemes by Daniel Abadi. Approaches: PAX, Fractured Mirrors, and Fine-grained hybrids.
  • The Future of Database Clustering by Robert Hodges. Simple management and monitoring, Fast, flexible replication, Top-to-bottom data protection, Partition management, Cloud and virtualized operation, Transparent application access, Open source.
  • Some perspective to this DIY storage server mentioned at Storagemojo by Joerg Moellenkamp. Quality costs. Period.
  • Turn up the volume: API Scalability with Caching by Scott.
  • Disk I/O Bottlenecks by Ryan Thiessen. My first approach to diagnosing a performance problem is to start by trying to find the system’s bottleneck.
  • Patterns for Cloud Computing by Simon Guest. Using the Cloud for Scale, Using the Cloud for Multi-Tenancy, Using the Cloud for Compute, Using the Cloud for Storage, Using the Cloud for Communications
  • Server Processor Roadmaps Show Change in Direction By Michael J. Miller. What fascinates me is the big change in direction we're seeing on server chips...The focus seemed to be on putting more cores on a chip, something we're still seeing with these new 8-, 12-, and 16-core chips. But now a lot of focus seems to be going into increasing memory bandwidth and new cache architectures, as designers are addressing the memory issues that are often the bottleneck in a multicore system, as well as core-to-core communications.
  • Confronting the Data Center Crisis: A Cost-Benefit Analysis of the IBM Computing on Demand (CoD) Cloud Offering

  • Azul's Experiences With Hardware / Software Co-Design by Dr. Cliff Click. Owning whole stack allows progress, Some really hard HW problems “solved” in SW, GC is “solved” w/HW Read Barrier, Simple HTM can do Lock Elision, Huge count of simple cores really useful in production.
  • Java Memory Problems - Memory problems in Java applications are manifold and easily lead to performance and scalability problems. Especially in Java EE applications with a high number of parallel users, memory management must be a central part of the application architecture.
  • Noob question: how do you [Reddit] join on so much data?
  • Transactional Memory versus Locks - A Comparative Case Study by Victor Pankratius. TM alone is no silver bullet.
  • Looking at Redis by Peter Zaitsev. With Redis I got about 3 times more updates/sec – close to 100,000 updates/sec with about 1.5 cores being used.


    The fantasy sponsor for this post is those little food kiosks outside Home Depot stores. I love their Fire Dogs. Hot and yummy. I bet most home improvement projects in America are inspired by cravings for one of these little beauties.
    Monday, August 31, 2009

    Squarespace Architecture - A Grid Handles Hundreds of Millions of Requests a Month 

    I first heard an enthusiastic endorsement of Squarespace streaming from the ubiquitous Leo Laporte on one of his many Twit Live shows. Squarespace, a fully hosted, completely managed environment for creating and maintaining a website, blog or portfolio, was of interest to me because they promise scalability and this site doesn't have enough of that. But sadly, since they don't offer a link-preserving Drupal import, our relationship was not meant to be.

    When a fine reader of High Scalability, Brian Egge, (and all my readers are thrifty, brave, and strong) asked me how Squarespace scaled I said I didn't know, but I would try and find out. I emailed Squarespace a few questions and founder Anthony Casalena and Director of Technical Operations Rolando Berrios were kind enough to reply in some detail. The questions were both from Brian and myself. Answers can be found below.

    Two things struck me most about Squarespace's approach:

  • They based their system on a memory grid, in this case Oracle Coherence. I'm not aware of too many customer facing systems that have moved to a grid as the backbone of their scalability strategy. It's good to see a successful system visible out in the wild.
  • They use a sort of Private Cloud internally. Everything is highly automated and easy to expand. They scale by adding additional resources like CPUs and disks and the system just adapts without a lot of human fussing involved. Now that's scaling with gas.

    Learn more about how Squarespace scales to tens of thousands of customers and hundreds of thousands of signups, and serves hundreds of millions of hits per month.

    Site: http://www.squarespace.com

    The Stats

  • Tens of thousands of customers.
  • Hundreds of thousands of signups.
  • Serves hundreds of millions of hits per month.

    Platform

  • Java - well supported and an advanced language to work in, and the components out there (Apache Foundation, etc.) are second to none.
  • Tomcat - the stability of the server is extremely impressive.
  • Grid - Oracle Coherence for the re-balancing and caching layers.
  • Storage - Isilon Cluster. This allows them to treat their storage like another "grid" as the storage pool is easily scaled by adding more diskspace.
  • Monetization Strategy - charge money. No free customers. Pricing starts at $8/month.
  • Uptime - 99.98%
  • Hosting - Peer1, they do not yet operate in multiple datacenters.
  • Competitors - TypePad and WordPress
  • Hardware - they don't use "commodity nodes" or low cost hardware units. These end up costing more in the long run as datacenter power is extremely expensive.
  • Cacti - a cacti instance is used to graph statistical data which helps see trends over time, predict when a hardware upgrade is necessary, and troubleshoot any problems that do show up.

    Lessons Learned

  • Cache as much as you can and load balance requests intelligently across a cluster.
  • Use an infrastructure that scales automatically merely by adding more resources (CPU, disk).
  • Build a scalable design up front. Make scaling easy by designing the application and infrastructure with scaling in mind.
  • Build a hands-off capable maintenance system. Automate processes. Make them as simple as possible. Monitor programmatically so people don't have to.
  • Release code early and often. Running on the latest code means problems can be detected quickly when the problems are still small.
  • Keep things simple. Apply simplicity to every part of your infrastructure, including both your software and those of your outside vendors. Examples of this are: Grid for the application infrastructure, Isilon cluster for storage, automation, creating their own tools.
  • Use as few technologies as possible by selecting or building simple, powerful and robust tools.
  • Don't be afraid to implement your own code to ensure simplicity. Build or buy is a huge balancing act.
  • Don't be afraid to spend money on technology that helps you get where you need to go. It can save you months and months of headaches that would have prevented you from working on core functionality.

    Interview Questions and Responses

    1. They say they run on a grid. I'd be interested to know: did they build their own grid?

    Partially. We rely on Oracle's Coherence product for the re-balancing
    and caching layers of our system -- which we consider a real workhorse
    for the "grid" aspects of the system. Each node in our infrastructure
    can handle a hit for any single site on the system. This means that in order to increase capacity, we just increase node count. No site is handled by a single node.

    2. How much traffic they can really handle?

    We've had several customer sites on the front page of Digg on multiple
    occasions, and didn't notice any performance degradation for any of our
    sites. In fact, we didn't even realize the surge happened until we reviewed our traffic reports a few hours later. For 99% of sites out there, Squarespace is going to be sufficient. Even larger sites with millions of inbound hits per day are servable, as the bulk of the traffic serving on those sites is in the media being served.

    3. How do they scale up, and allow for certain sites to become quite busy?

    We've tried to make scaling easy, and the application and infrastructure
    have been designed with scaling in mind. Because of this, we're luckily not
    in a situation where we need to keep getting bigger and beefier hardware to handle more and more traffic -- we try to scale out by supplementing the
    grid. Since we try to cache as much as we can and every server
    participates in handling requests for every site, it's generally just a
    matter of adding another node to the environment.

    We try to apply this simplicity to every part of our infrastructure, both
    with our own software and when deciding on purchases from outside vendors. For instance, we just increased the amount of available storage another few terabytes by adding another node to our Isilon cluster.

    4. Are there any stats you can share about how many customers, how many users, how many requests served, how many servers, how much disk, how fast, how reliable?

    We, unfortunately, can't share these numbers as we're a private company
    -- but we can say we have tens of thousands of customers, hundreds
    of thousands of signups, and serve hundreds of millions of hits per
    month. The server types and disk configurations (RAID, etc) are a bit
    irrelevant, as the clustering we implement provides redundancy -- not
    anything implemented into a particular single machine. Nothing in
    hardware is too particular to our setup. I will say we don't purchase
    "commodity nodes" or other low cost hardware units, as we find these
    end up costing more in the long run as datacenter power is extremely
    expensive.

    5. What technology stack are you using and why did you make the choices you made?

    We currently use Java along with Tomcat as our web server. After
    trying a few other solutions, we really appreciated the ability to use
    as few technologies as possible, and have those always remain things
    that are understandable for us. Java is an incredibly well supported
    and advanced language to work in, and the components out there (Apache
    Foundation, etc.) are second to none. As for Tomcat, the stability of
    the server is extremely impressive. We've implemented our own
    controller mechanisms on top of Tomcat (instead of going with some
    other library) in order to ensure extreme simplicity.

    6. How are you handling...

    Multi-tenancy?

    As mentioned above, every web node handles traffic for all sites, so a
    customer doesn't have to worry about an underpowered server unable to handle their traffic, or a node going down.

    Backups?

    Backups are obviously important to us, and we have several copies of user
    and server data stored in multiple locations. We gather backups with a
    combination of various home-grown scripts customized for our environment.

    Failover? Monitoring?

    Since this company originally was solely maintained by Anthony when he
    first started it, things needed to be as simple and automated as possible.
    This includes failover and monitoring. Our monitoring systems check every
    aspect of our environment we can think of several times a minute, and can
    restart obviously dead services, or alert us if it's something an
    actual person needs to handle.

    Additionally, we've set up a cacti instance to graph as much statistical
    data as we can pull out of our servers, so we can see trends over time.
    This allows us to easily predict when a hardware upgrade is necessary. It also helps us troubleshoot any problems that do show up.

    Operations? Releases? Upgrades? Add new hardware?

    With our customer base constantly growing, it's getting tough to manage our systems and still keep our workload under control. There are some projects on the road map to move to a much more hands-off maintenance of our environment, including automatic code deployments and system software upgrades. Most operations can be done without taking the grid offline.

    Multiple data centers?

    We do not have multiple data centers, but have some plans in the works to
    roll one out within the next year.

    Development?

    This is a really broad question, so it's a bit hard to succinctly
    answer. One thing (amongst many) that has consistently served us very
    well is trying to ensure our development environment is always
    releasable into production. By ensuring we're always out there with
    our latest code, we can usually detect problems very rapidly, and
    as a result, those problems are generally extremely small. Everyone on our development team tends to be responsible for wide, sweeping aspects of the system -- which gives them a lot of flexibility to determine how
    their components should work as a whole. It's incredibly important
    that everything fits seamlessly together in the end, so we spend a lot
    of time iterating on things that other groups might consider finished.

    Support?

    Support is something we take extremely seriously. As we've grown from
    the ground up without an external investor, most of our team members
    are versed in support, and understand how critical this component is.
    Our support staff is completely hired from our community, and is
    incredibly passionate about their jobs. We try and get every single
    customer support inquiry answered within 15 minutes or less, and have all sorts of metrics related to our goals here.

    7. What have you done that's really cool that you think other people could learn from?

    We spend a lot of time internally writing scripts and other
    applications that simply run our business. For instance, our
    persistence layer configuration files are generated by applications
    we've written that read our database model directly from the database.
    We develop a lot of these programs, and a lot of "standard naming"--this, again, means that we can move very rapidly as we have less monotonous tasks and searching to think about.

    While this sort of thing is appropriate for small tasks, for the big
    ones, we also aren't afraid to spend money on well developed
    technology. Some of our choices for load balancing and storage are
    very costly, but end up saving us months and months of time in the
    long haul, as we've avoided having to "put out fires" generated by
    untested home grown solutions. It's a huge balancing act.

    The End

    Often the best way to judge a product is to peruse the developer forums. It's these people who know what's really happening. And when I look I see an almost complete absence of threads about performance, scalability, or reliability problems. Take a look at other CMSs and you'll see a completely different tenor of questions. That says something good about the strength of their scalability strategy.

    I'd really like to thank Squarespace for taking the time and making the effort to share what they've learned with the larger community. It's an effort we all benefit from. If you would also like to share your knowledge and wisdom with the world, please get in touch and let's get started!

    Related Articles

  • Implementation Focus: Squarespace
  • Are Cloud Based Memory Architectures the Next Big Thing?
  • Up and running on Squarespace by Peter Efland
  • Kevin Rose Comes to Squarespace by D. Atkinson
  • Squarespace Vs Wordpress a thread in their developer forum.
    Friday, August 28, 2009

    Strategy: Solve Only 80 Percent of the Problem

    Solve only 80% of a problem. That's usually good enough and you'll not only get done faster, you'll actually have a chance of getting done at all.

    This strategy is given by Amix in HOW TWITTER (AND FACEBOOK) SOLVE PROBLEMS PARTIALLY. The idea is that solving 100% of a complex problem can be so hard and so expensive that you'll end up wasting all your bullets on a problem that could have been satisfactorily solved in a much simpler way.

    The example given is Twitter's real-time search. Real-time search, almost by definition, is focused on recent events. So in the design, should you be able to search historically back to the beginning of time, or should you just be able to search recent time periods? A complete historical search is the 100% solution. The recent-data-only search is the 80% solution. Which should you choose?


    The 100% solution is dramatically more difficult to solve. It requires searching disk in real-time which is a killer. So it makes more sense to work on the 80% problem because it will satisfy most of your users and is much more doable.

    By reducing the amount of data you need to search, it's possible to make some simplifying design choices, like using fixed-size buffers that reside completely in memory. With that architecture your streaming searches can be blisteringly fast while returning the most relevant data. Users are happy and you are happy.
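
    A minimal sketch of that 80% design, assuming nothing about Twitter's actual implementation: keep only the most recent items in a fixed-size, memory-resident buffer and scan it newest-first.

    ```python
    from collections import deque

    class RecentSearch:
        """Search over recent items only: a bounded in-memory window instead
        of a disk-backed index over all history."""
        def __init__(self, capacity=100000):
            self.items = deque(maxlen=capacity)   # old entries fall off automatically

        def add(self, text):
            self.items.append(text)

        def search(self, term, limit=20):
            # Newest-first scan over the bounded window.
            hits = []
            for text in reversed(self.items):
                if term in text:
                    hits.append(text)
                    if len(hits) == limit:
                        break
            return hits
    ```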

    It's not a 100% solution, but it's a good enough solution that works. Sometimes as programmers we are blinded by the glory of the challenge of solving the 100% solution when there's a more reasonable, rational alternative that's almost as good. Something to keep in mind when you are wondering how you'll possibly get it all done. Don't even try.

    Amix has a very good discussion of Twitter and this strategy on his blog.

    Worse is Better

    A Hacker News post discussing this article brought up that this strategy is the same as Richard Gabriel's famous Worse-is-Better paradox which holds: The right thing is frequently a monolithic piece of software, but for no reason other than that the right thing is often designed monolithically. That is, this characteristic is a happenstance. The lesson to be learned from this is that it is often undesirable to go for the right thing first. It is better to get half of the right thing available so that it spreads like a virus. Once people are hooked on it, take the time to improve it to 90% of the right thing.

    Unix, C, C++, Twitter and almost every product that has experienced wide adoption has followed this philosophy.

    Worse-is-Better solutions have the following characteristics:

  • Simplicity - The design must be simple, both in implementation and interface. It is more important for the implementation to be simpler than the interface. Simplicity is the most important consideration in a design.
  • Correctness - The design must be correct in all observable aspects. It is slightly better to be simple than correct.
  • Consistency - The design must not be overly inconsistent. Consistency can be sacrificed for simplicity in some cases, but it is better to drop those parts of the design that deal with less common circumstances than to introduce either implementational complexity or inconsistency.
  • Completeness - The design must cover as many important situations as is practical. All reasonably expected cases should be covered. Completeness can be sacrificed in favor of any other quality. In fact, completeness must be sacrificed whenever implementation simplicity is jeopardized. Consistency can be sacrificed to achieve completeness if simplicity is retained; especially worthless is consistency of interface.

    In my gut I think Worse-is-Better is different than "Solve Only 80 Percent of the Problem", primarily because Worse-is-Better is more about product adoption curves and 80% is more of a design heuristic. After some cogitating this seems a false distinction, so I have to conclude I'm wrong and have added Worse-is-Better to this post.

    Related Articles

  • Worse Is Better Richard P. Gabriel
  • Lisp: Good News, Bad News, How to Win Big
  • Interesting Hacker News Thread
  • In Praise of Evolvable Systems by Clay Shirky
  • Big Ball of Mud by Brian Foote and Joseph Yoder