Squarespace Architecture - A Grid Handles Hundreds of Millions of Requests a Month
I first heard an enthusiastic endorsement of Squarespace streaming from the ubiquitous Leo Laporte on one of his many Twit Live shows. Squarespace, as a fully hosted, completely managed environment for creating and maintaining a website, blog, or portfolio, was of interest to me because they promise scalability, and this site doesn't have enough of that. But sadly, since they don't offer a link-preserving Drupal import, our relationship was not meant to be.
When a fine reader of High Scalability, Brian Egge, (and all my readers are thrifty, brave, and strong) asked me how Squarespace scaled, I said I didn't know, but I would try to find out. I emailed Squarespace a few questions and founder Anthony Casalena and Director of Technical Operations Rolando Berrios were kind enough to reply in some detail. The questions came from both Brian and myself. Answers can be found below.
Two things struck me most about Squarespace's approach:
- Simplicity: every node in the grid can handle a request for any site, so adding capacity is just a matter of adding nodes.
- Pragmatism: they build small tools to run their business, but aren't afraid to spend real money on proven technology like Oracle Coherence and Isilon for the big problems.
Read on to learn how Squarespace scales to tens of thousands of customers, hundreds of thousands of signups, and hundreds of millions of hits per month.
Site: http://www.squarespace.com
The Stats
- Tens of thousands of customers and hundreds of thousands of signups.
- Hundreds of millions of hits served per month.
- A single data center today, with plans to roll out another within the next year.
Platform
- Java, with Tomcat as the web server and home-grown controller mechanisms on top.
- Oracle Coherence for the grid's caching and re-balancing layers.
- An Isilon cluster for storage.
- Cacti for graphing statistical data and spotting trends.
Lessons Learned
- Have every node handle every site; scaling out then means simply adding nodes.
- Cache as much as you can.
- Keep the development environment always releasable into production so problems surface early and stay small.
- Write scripts and applications that run your business; standard naming conventions cut down on monotonous work.
- For the big problems, pay for well-developed technology rather than fighting fires from untested home-grown solutions.
- Hire support staff from your own community and track response-time metrics.
Interview Questions and Responses
1. They say they run on a grid. Did they build their own grid?
Partially. We rely on Oracle's Coherence product for the re-balancing and caching layers of our system -- which we consider a real workhorse for the "grid" aspects of the system. Each node in our infrastructure can handle a hit for any single site on the system. This means that in order to increase capacity, we just increase the node count. No site is handled by a single node.
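They don't share implementation details beyond this, but the shape is easy to sketch. Below is a minimal cache-aside pattern against a Coherence NamedCache in Java -- the cache name, key scheme, and rendering method are hypothetical illustrations, not Squarespace's actual code:

```java
import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;

public class PageCache {
    // "rendered-pages" is a made-up cache name. Because the cache lives in
    // the grid, a page rendered by one node is visible to every other node,
    // which is what lets any node serve a hit for any site.
    private final NamedCache cache = CacheFactory.getCache("rendered-pages");

    public String getPage(String siteId, String path) {
        String key = siteId + ":" + path;
        String html = (String) cache.get(key);       // check the grid first
        if (html == null) {
            html = renderFromDatabase(siteId, path); // miss: do the real work
            cache.put(key, html);                    // share the result grid-wide
        }
        return html;
    }

    private String renderFromDatabase(String siteId, String path) {
        // Placeholder for the actual rendering/persistence logic.
        return "<html>...</html>";
    }
}
```

Under this model, adding a node adds both request-handling and cache capacity, which is why capacity scales with node count.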
2. How much traffic can they really handle?
We've had several customer sites on the front page of Digg on multiple occasions, and didn't notice any performance degradation for any of our sites. In fact, we didn't even realize the surge happened until we reviewed our traffic reports a few hours later. For 99% of sites out there, Squarespace is going to be sufficient. Even larger sites with millions of inbound hits per day are servable, as the bulk of the traffic serving on those sites is in the media being served.
3. How do they scale up, and allow for certain sites to become quite busy?
We've tried to make scaling easy, and the application and infrastructure have been designed with scaling in mind. Because of this, we're luckily not in a situation where we need to keep getting bigger and beefier hardware to handle more and more traffic -- we try to scale out by supplementing the grid. Since we try to cache as much as we can and every server participates in handling requests for every site, it's generally just a matter of adding another node to the environment.
We try to apply this simplicity to every part of our infrastructure, both with our own software and when deciding on purchases from outside vendors. For instance, we just increased the amount of available storage by another few terabytes by adding another node to our Isilon cluster.
4. Are there any stats you can share about how many customers, how many users, how many requests served, how many servers, how much disk, how fast, how reliable?
We, unfortunately, can't share these numbers as we're a private company -- but we can say we have tens of thousands of customers, hundreds of thousands of signups, and serve hundreds of millions of hits per month. The server types and disk configurations (RAID, etc.) are a bit irrelevant, as the clustering we implement provides redundancy -- not anything implemented in any particular single machine. Nothing in our hardware is too particular to our setup. I will say we don't purchase "commodity nodes" or other low-cost hardware units, as we find these end up costing more in the long run because datacenter power is extremely expensive.
5. What technology stack are you using and why did you make the choices you made?
We currently use Java along with Tomcat as our web server. After trying a few other solutions, we really appreciated the ability to use as few technologies as possible, and have those always remain things that are understandable for us. Java is an incredibly well supported and advanced language to work in, and the components out there (Apache Foundation, etc.) are second to none. As for Tomcat, the stability of the server is extremely impressive. We've implemented our own controller mechanisms on top of Tomcat (instead of going with some other library) in order to ensure extreme simplicity.
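The interview doesn't say what those controller mechanisms look like. As a rough sketch only -- the handler interface and routes below are invented for illustration -- a minimal front controller on Tomcat is a single servlet, mapped to /* in web.xml, that dispatches on the request path:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// One entry point for all requests; routing logic stays in plain Java
// instead of pulling in a heavyweight web framework.
public class FrontController extends HttpServlet {

    interface Handler {
        void handle(HttpServletRequest req, HttpServletResponse resp) throws IOException;
    }

    private final Map<String, Handler> routes = new HashMap<>();

    @Override
    public void init() {
        // Hypothetical routes for illustration.
        routes.put("/blog", (req, resp) -> resp.getWriter().print("blog index"));
        routes.put("/gallery", (req, resp) -> resp.getWriter().print("gallery"));
    }

    @Override
    protected void service(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        Handler handler = routes.get(req.getPathInfo());
        if (handler != null) {
            handler.handle(req, resp);
        } else {
            resp.sendError(HttpServletResponse.SC_NOT_FOUND);
        }
    }
}
```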
6. How are you handling...
Multi-tenancy?
As mentioned above, every web node handles traffic for all sites, so a customer doesn't have to worry about an underpowered server being unable to handle their traffic, or about a node going down.
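One thing this implies, though the interview doesn't spell it out, is that each request must be mapped to a tenant before any work happens. A common way to do that -- sketched here with hypothetical names, not necessarily how Squarespace does it -- is to resolve the site from the request's hostname in a servlet filter:

```java
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

// Resolves which customer site a request belongs to, so that any node
// can serve any site. The real lookup would hit the shared cache/database.
public class TenantFilter implements Filter {

    @Override
    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        String host = ((HttpServletRequest) req).getServerName(); // e.g. example.squarespace.com
        req.setAttribute("siteId", lookupSiteId(host));
        chain.doFilter(req, resp);
    }

    private String lookupSiteId(String host) {
        // Hypothetical: map the hostname to an internal site id.
        return host.split("\\.")[0];
    }

    @Override public void init(FilterConfig config) { }
    @Override public void destroy() { }
}
```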
Backups?
Backups are obviously important to us, and we have several copies of user and server data stored in multiple locations. We gather backups with a combination of various home-grown scripts customized for our environment.
Failover? Monitoring?
Since this company was originally maintained solely by Anthony when he first started it, things needed to be as simple and automated as possible. This includes failover and monitoring. Our monitoring systems check every aspect of our environment we can think of several times a minute, and can restart obviously dead services, or alert us if it's something an actual person needs to handle.
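Their scripts are home-grown and unpublished, so the following is only an illustrative sketch of the shape of such a check -- the health endpoint, interval, and restart/alert hooks are all assumptions:

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// The shape of a "check several times a minute" monitor: probe a health
// endpoint, restart an obviously dead service, and escalate anything
// ambiguous to a human.
public class ServiceMonitor {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                HttpURLConnection conn = (HttpURLConnection)
                        new URL("http://localhost:8080/health").openConnection(); // hypothetical endpoint
                conn.setConnectTimeout(2000);
                conn.setReadTimeout(2000);
                if (conn.getResponseCode() != 200) {
                    restartService(); // responding but unhealthy: obviously dead
                }
            } catch (Exception e) {
                alertHuman(e); // can't even connect: an actual person decides
            }
        }, 0, 15, TimeUnit.SECONDS); // several times a minute
    }

    private static void restartService() { /* e.g. shell out to an init script */ }
    private static void alertHuman(Exception cause) { /* e.g. send a page or email */ }
}
```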
Additionally, we've set up a Cacti instance to graph as much statistical data as we can pull out of our servers, so we can see trends over time. This allows us to easily predict when a hardware upgrade is necessary. It also helps us troubleshoot any problems that do show up.
Operations? Releases? Upgrades? Add new hardware?
With our customer base constantly growing, it's getting tough to manage our systems and still keep our workload under control. There are some projects on the road map to move to a much more hands-off maintenance of our environment, including automatic code deployments and system software upgrades. Most operations can be done without taking the grid offline.
Multiple data centers?
We do not have multiple data centers, but have some plans in the works to roll one out within the next year.
Development?
This is a really broad question, so it's a bit hard to answer succinctly. One thing (amongst many) that has consistently served us very well is trying to ensure our development environment is always releasable into production. By ensuring we're always out there with our latest code, we can usually detect problems very rapidly, and as a result, those problems are generally extremely small. Everyone on our development team tends to be responsible for wide, sweeping aspects of the system -- which gives them a lot of flexibility to determine how their components should work as a whole. It's incredibly important that everything fits together seamlessly in the end, so we spend a lot of time iterating on things that other groups might consider finished.
Support?
Support is something we take extremely seriously. As we've grown from the ground up without an external investor, most of our team members are versed in support, and understand how critical this component is. Our support staff is hired entirely from our community, and is incredibly passionate about their jobs. We try to get every single customer support inquiry answered within 15 minutes or less, and have all sorts of metrics related to our goals here.
7. What have you done that's really cool that you think other people could learn from?
We spend a lot of time internally writing scripts and other applications that simply run our business. For instance, our persistence layer configuration files are generated by applications we've written that read our database model directly from the database. We develop a lot of these programs and rely on a lot of "standard naming" -- this, again, means we can move very rapidly because we have fewer monotonous tasks and less searching to think about.
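They don't describe those generator applications, but JDBC's DatabaseMetaData makes the idea easy to sketch. Everything below -- the connection settings and the output format -- is a hypothetical illustration of generating mapping config straight from the live schema:

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;

// Reads the database model directly from the database and emits a
// mapping file, so the persistence config can never drift from the schema.
public class MappingGenerator {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection settings.
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/app", "user", "secret");
        DatabaseMetaData meta = conn.getMetaData();
        ResultSet tables = meta.getTables(null, null, "%", new String[] {"TABLE"});
        while (tables.next()) {
            String table = tables.getString("TABLE_NAME");
            System.out.printf("<class table=\"%s\">%n", table);
            ResultSet cols = meta.getColumns(null, null, table, "%");
            while (cols.next()) {
                // "Standard naming" means the column name maps directly
                // to a property name, with no per-table special cases.
                System.out.printf("  <property column=\"%s\" type=\"%s\"/>%n",
                        cols.getString("COLUMN_NAME"), cols.getString("TYPE_NAME"));
            }
            cols.close();
            System.out.println("</class>");
        }
        tables.close();
        conn.close();
    }
}
```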
While this sort of thing is appropriate for small tasks, for the big
ones, we also aren't afraid to spend money on well developed
technology. Some of our choices for load balancing and storage are
very costly, but end up saving us months and months of time in the
long haul, as we've avoided having to "put out fires" generated by
untested home grown solutions. It's a huge balancing act.
The End
Often the best way to judge a product is to peruse the developer forums. It's these people who know what's really happening. And when I look, I see an almost complete absence of threads about performance, scalability, or reliability problems. Take a look at other CMSes and you'll see a completely different tenor of questions. That says something good about the strength of their scalability strategy.
I'd really like to thank Squarespace for taking the time and making the effort to share what they've learned with the larger community. It's an effort we all benefit from. If you would also like to share your knowledge and wisdom with the world, please get in touch and let's get started!
Reader Comments (10)
Great post! If you want another link on Squarespace, we interviewed Tyler Thompson, the Squarespace Creative Director who talked about all of the design aspects. It was a fun show that your readers would appreciate. http://www.creativexpert.com/2009/07/29/tyler-thompson-32-squarespace/
We are about to replace our blogging system with a solution at Squarespace. When you say there are no multiple data centers, do you mean Peer 1 or Squarespace? I have servers in Texas and on both coasts through Peer 1. I find it unusual, as fast as Squarespace is growing, that they would have Peer 1 keep all their machines in one data center. Especially when the one data center they have that has measurable trouble is in VA. If I am going to commit to the Squarespace model I'd like some assurance they have gear in the other Peer 1 locations. Or at least don't have all their eggs in one basket. Tell me something good.
Sorry for the confusion, my understanding is Squarespace operates out of one datacenter at the moment, but plans on expanding in the future.
Thanks Alan. I missed that one. I'll take a listen...
Any comment as to why they went with a vendor like Oracle for distributed caching as opposed to open source options like memcached?
This was a great article. As a customer of theirs (since mid-June) I have to say that performance has been amazing; and they truly do deliver incredible customer support. I have been impressed by the quick - and more importantly effective - approach the entire team uses. I, too, was turned onto this company via one of Leo Laporte's podcasts, and I have to say - this is absolutely worth the monthly fee (and then some - but please don't change my rate lol)
When compared to a distributed cache system such as memcached, Coherence offers very different functionality. Coherence can act as a distributed cache in either a replicated or a partitioned mode. In partitioned mode, data is split across the storage members of the cluster based on a hash of the data's key. In replicated mode, data is replicated to all storage members. In addition, Coherence can be configured so that the clients of the cache (non-storage members) have a local "near" cache to store oft-used data.
Where Coherence starts to get interesting is in its ability to partition work requests to storage members. This functionality (similar to a map/reduce style system) allows work requests to be done by the nodes that own the data. On top of this, Coherence provides filter, or query, mechanisms that allow developers to retrieve data based on more than just the key for that data. We've started to use it here at Edmunds and so far have found it to be a great alternative to RDBMS systems, and so far we have not been able to find an open source project that comes close to the feature set (and we've been looking). Voldemort is close; however, it does not have the search or automated rebalancing features of Coherence.
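To make that concrete, here is roughly what a Coherence filter query looks like -- a sketch with made-up cache and field names, not Edmunds' actual code:

```java
import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.util.Filter;
import com.tangosol.util.filter.EqualsFilter;

// Query a partitioned cache by value rather than by key: each storage
// member evaluates the filter against the entries it owns, in parallel.
public class FilterQueryExample {
    public static void main(String[] args) {
        NamedCache vehicles = CacheFactory.getCache("vehicles"); // hypothetical cache
        Filter byMake = new EqualsFilter("getMake", "Honda");    // invokes getMake() on each cached value
        System.out.println(vehicles.entrySet(byMake));           // fans out across the grid
    }
}
```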
The difference between Memcached & Oracle Coherence is that you will be paying for the rest of your life for Oracle Coherence. No thanks.
what's up with the timestamp on these messages .... article was written in 8/09 .... but responses say 1990 ???
I am researching how "elastic" a site hosted on SquareSpace can be .... any info on the highest traffic sites ???
thanks Steve
That was a bug in the import script I wrote to take the old site out of Drupal and import it into Squarespace. I didn't notice until it was too late that the timestamps were wrong, and decided a little time travel would be fine.