Entries in Squid (5)

Saturday
Dec 20, 2008

Second Life Architecture - The Grid

Update: Presentation: Second Life’s Architecture. Ian Wilkes, VP of Systems Engineering, describes the architecture behind the popular virtual world Second Life. Ian presents how the architecture looked at its debut and how it has evolved over the years as users and features were added. Second Life is a 3-D virtual world created by its Residents. Virtual worlds are expected to become more and more popular on the internet, so their architecture may be of interest, especially with the appearance of open virtual worlds or metaverses. What happens when video games meet Web 2.0? What happens is the metaverse.

Information Sources

Platform

  • MySQL
  • Apache
  • Squid
  • Python
  • C++
  • Mono
  • Debian

What's Inside?

The Stats

  • ~1M active users
  • ~95M user hours per quarter
  • ~70K peak concurrent users (40% annual growth)
  • ~12Gbit/sec aggregate bandwidth (in 2007)

Staff (in 2006)

  • 70 FTE + 20 part time
"about 22 are programmers working on SL itself. At any one time probably 1/3 of the team is on infrastructure, 1/3 is on new features and 1/3 is on various maintenance tasks (bug fixes, general stability and speed improvements) or improvements to existing features. But it varies a lot."

Software

Client/Viewer
  • Open Source client
  • Renders the virtual world
  • Handles user interaction
  • Handles locations of objects
  • Gets velocities and does simple physics to keep track of what is moving where (see the dead-reckoning sketch after this list)
  • No collision detection
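A minimal, hypothetical sketch of the "simple physics" the viewer does: between server updates it extrapolates an object's position from its last known position and velocity (dead reckoning). The names and numbers are illustrative, not Linden Lab's code.

```python
# Hypothetical sketch of viewer-side dead reckoning: extrapolate positions
# from the last server update instead of waiting for the next one.

def extrapolate(position, velocity, dt):
    """Advance an (x, y, z) position by velocity * dt; no collision detection."""
    return tuple(p + v * dt for p, v in zip(position, velocity))

# The last simulator update said the object was at (10, 0, 5) moving 2 m/s
# along x; 0.25 s later the viewer draws it here:
print(extrapolate((10.0, 0.0, 5.0), (2.0, 0.0, 0.0), 0.25))  # (10.5, 0.0, 5.0)
```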
Simulator (Sim)
Each geographic area (a 256x256 meter region) in Second Life runs on a single instantiation of server software, called a simulator or "sim," and each sim runs on a separate core of a server. The simulator is the primary SL C++ server process and runs on most servers. As the viewer moves through the world it is handed off from one simulator to another.
  • Runs Havok 4 physics engine
  • Runs at 45 frames/sec. If it can't keep up, it attempts time dilation without reducing the frame rate (see the sketch after this list)
  • Handles storing object state, land parcel state, and terrain height-map state
  • Keeps track of where everything is and does collision detection
  • Sends locations of stuff to viewer
  • Transmits image data in a prioritized queue
  • Sends updates to viewers only when needed (only when collision occurs or other changes in direction, velocity etc.)
  • Runs Linden Scripting Language (LSL) scripts
  • Scripting has been recently upgraded to the much faster Mono scripting engine
  • Handles chat and instant messages
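A minimal sketch of one plausible reading of time dilation, assuming the simulator scales down simulated time when a frame overruns its 1/45 s budget rather than dropping frames. The constants and function names are illustrative, not Linden Lab's.

```python
# Hypothetical sketch of time dilation: keep the frame rate at 45 fps but
# slow down simulated time when the physics step can't finish in budget.

TARGET_FPS = 45.0
FRAME_BUDGET = 1.0 / TARGET_FPS          # ~22 ms per frame

def time_dilation(actual_frame_time):
    """1.0 means real time; smaller values mean the sim is running slow."""
    return min(1.0, FRAME_BUDGET / actual_frame_time)

def step_physics(world, actual_frame_time):
    # Advance the world by a dilated timestep rather than skipping frames.
    dt = FRAME_BUDGET * time_dilation(actual_frame_time)
    world.advance(dt)

print(time_dilation(0.022))  # ~1.0: keeping up
print(time_dilation(0.044))  # ~0.5: simulated time runs at half speed
```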
Asset Server
  • One big clustered filesystem ~100TB
  • Stores asset data such as textures.

MySQL database
Second Life started with one database and has subsequently been forced into clustering. They use a ton of MySQL databases running on Debian machines to handle lots of centralized services. Rather than attempt to build the one, impossibly large database – all hail the Central Database – or one, impossibly large central cluster – all hail the Cluster – Linden Lab instead adopted a divide-and-conquer strategy based on data partitioning. The good thing is that UUIDs – 128-bit unique identifiers – are associated with most things in Second Life, so partitioning is generally doable (see the partitioning sketch below).

Backbone
Linden Lab has converted much of their backend architecture away from custom C++/messaging into web services. Certain services have been moved off of MySQL, or cached (Squid) between the queries and MySQL. Presence, in particular agent presence (i.e. are you online and where are you on the grid), is a particularly tricky kind of query to partition, so there is now a Python service running on the SL grid called Backbone. It proved easier to scale, develop and maintain than many of their older technologies, and as a result it plays an increasingly important role in the Second Life platform as Linden Lab migrates their legacy code to web services. Two main components of the backbone are open source (a brief Eventlet sketch follows below):
  • Eventlet is a networking library written in Python. It achieves high scalability by using non-blocking I/O while retaining high programmer usability by using coroutines that make the non-blocking I/O operations appear blocking at the source code level.
  • Mulib is a REST web service framework built on top of Eventlet.
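Because most things in Second Life carry a 128-bit UUID, partitioning can be as simple as deriving a shard from the identifier. A minimal sketch of the idea, with hypothetical shard names (not Linden Lab's actual scheme):

```python
import uuid

DB_SHARDS = ["db00.example.com", "db01.example.com",
             "db02.example.com", "db03.example.com"]  # hypothetical hosts

def shard_for(object_uuid):
    """Map a 128-bit UUID onto one of the database shards."""
    return DB_SHARDS[uuid.UUID(str(object_uuid)).int % len(DB_SHARDS)]

print(shard_for(uuid.uuid4()))  # e.g. "db02.example.com"
```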
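And a minimal sketch of the Eventlet style mentioned above: coroutines that read like blocking code but yield on I/O, so one process can juggle many concurrent requests. The eventlet.sleep call stands in for a real non-blocking network operation.

```python
import eventlet

def fetch_presence(agent_id):
    # Looks like sequential blocking code; eventlet.sleep stands in for a
    # non-blocking network call and yields to the other coroutines.
    eventlet.sleep(0.1)
    return "agent %d is online" % agent_id

pool = eventlet.GreenPool(size=100)   # up to 100 concurrent green threads
for result in pool.imap(fetch_presence, range(10)):
    print(result)
```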

Hardware

  • 2000+ Servers in 2007
  • ~6000 Servers in early 2008
  • Plans to upgrade to ~10000 (?)
  • 4 sims per machine, for both class 4 and class 5
  • Used all-AMD for years, but are moving from the Opteron 270 to the Intel Xeon 5148
  • The upgrade to "class 5" servers doubled the RAM per machine from 2GB to 4GB and moved to a faster SATA disk
  • Classes 1-4 are on 100Mb with 1Gb uplinks to the core. Class 5 is on pure 1Gb

Do you have more details?

Click to read more ...

Monday
Nov 19, 2007

Tailrank Architecture - Learn How to Track Memes Across the Entire Blogosphere

Ever feel like the blogosphere is 500 million channels with nothing on? Tailrank finds the internet's hottest channels by indexing over 24M weblogs and feeds per hour. That's 52TB of raw blog content (no, not sewage) a month and requires continuously processing 160Mbits of IO. How do they do that? This is an email interview with Kevin Burton, founder and CEO of Tailrank.com. Kevin was kind enough to take the time to explain how they scale to index the entire blogosphere.

Sites

  • Tailrank - We track the hottest news in the blogosphere!
  • Spinn3r - A blog spider you can specialize with your own behavior instead of writing your own crawler.
  • Kevin Burton's Blog - his blog is a mix of politics and technical talk. Both are always interesting.

    Platform

  • MySQL
  • Java
  • Linux (Debian)
  • Apache
  • Squid
  • PowerDNS
  • DAS storage.
  • Federated database.
  • ServerBeach hosting.
  • Job scheduling system for work distribution.

    Interview

  • What is your system for? Tailrank was originally a memetracker to track the hottest news being discussed within the blogosphere. We started having a lot of requests to license our crawler and we shipped that in the form of Spinn3r about 8 months ago. Spinn3r is a self-contained crawler for companies that want to index the full blogosphere and consumer generated media. Tailrank is still a very important product alongside Spinn3r and we're working on Tailrank 3.0, which should be available in the future. No ETA at the moment but it's actively being worked on.
  • What particular design/architecture/implementation challenges does your system have? The biggest challenge we have is the sheer amount of data we have to process and keeping that data consistent within a distributed system. For example, we process 52TB of content per month. This has to be indexed in a highly available storage architecture, so the normal distributed database problems arise.
  • What did you do to meet these challenges? We've spent a lot of time building out a distributed system that can scale and handle failure. For example, we've built a tool called Task/Queue that is analogous to Google's MapReduce. It has a centralized queue server which hands out units of work to robots which make requests. It works VERY well for crawlers in that slower machines just fetch work at a slower rate while more modern machines (or better tuned machines) request work at a higher rate. This ends up neatly addressing one of the classic distributed computing fallacies, that the network is homogeneous. Task/Queue is generic enough that we could actually use it to implement MapReduce on top of the system. We'll probably open source it at some point. Right now it has too many tentacles wrapped into other parts of our system. (A simplified sketch of the pull model appears after this interview.)
  • How big is your system? We index 24M weblogs and feeds per hour and process content at about 160-200Mbps. At the raw level we're writing to our disks at about 10-15MBps continuously.
  • How many documents, do you serve? How many images? How much data? Right now the database is about 500G. We're expecting it to grow well beyond this in 2008 as we expand our product offering.
  • What is your rate of growth? It's mostly a function of customer feature requests. If our customers want more data we sell it to them. In 2008 we're planning on expanding our cluster to index larger portions of the web and consumer generated media.
  • What is the architecture of your system? We use Java, MySQL and Linux for our cluster. Java is a great language for writing crawlers. The library support is pretty solid (though it seems like Java 7 is going to be killer when they add closures). We use MySQL with InnoDB. We're mostly happy with it, though it seems I end up spending about 20% of my time fixing MySQL bugs and limitations. Of course nothing is perfect. MySQL, for example, was really designed to be used on single core systems. The MySQL 5.1 release goes a bit further toward fixing multi-core scalability locks. I recently blogged about how the new multi-core machines should really be considered N machines instead of one logical unit: Distributed Computing Fallacy #9.
  • How is your system architected to scale? We use a federated database system so that we can split the write load as we see more IO. That layer will probably be released as Open Source as well. We've already opened up a lot of our infrastructure code:
  • http://code.tailrank.com/lbpool - load balancing JDBC driver for use with DB connection pools.
  • http://code.tailrank.com/feedparser - Java RSS/Atom parser designed to elegantly support all versions of RSS
  • http://code.google.com/p/benchmark4j/ - Java (and UNIX) equivalent of Windows' perfmon
  • http://code.google.com/p/spinn3r-client/ - Client bindings to access the Spinn3r web service
  • http://code.google.com/p/mysqlslavesync/ - Clone a MySQL installation and setup replication.
  • http://code.google.com/p/log5j/ - Logger facade that supports printf style message format for both performance and ease of use.
  • How many servers do you have? About 15 machines so far. We've spent a lot of time tuning our infrastructure so it's pretty efficient. That said, building a scalable crawler is not an easy task so it does take a lot of hardware. We're going to be expanding FAR past this in 2008 and will probably hit about 2-3 racks of machines (~120 boxes).
  • What operating systems do you use? Linux via Debian Etch on 64 bit Opterons. I'm a big Debian fan. I don't know why more hardware vendors don't support Debian. Debian is the big secret in the valley that no one talks about. Most of the big web 2.0 shops like Technorati, Digg, etc use Debian.
  • Which web server do you use? Apache 2.0. Lighttpd is looking interesting as well.
  • Which reverse proxy do you use? About 95% of the pages of Tailrank are served from Squid.
  • How is your system deployed in data centers? We use ServerBeach for hosting. It's a great model for small to medium sized startups. They rack the boxes, maintain inventory, handle the network, etc. We just buy new machines and pay a flat markup. I wish Dell, SUN, and HP would sell directly to clients in this manner. We're in one data center right now and looking to expand into two for redundancy.
  • What is your storage strategy? Directly attached storage. We buy two SATA drives per box and set them up in RAID 0. We use the redundant array of inexpensive databases solution so if an individual machine fails there's another copy of the data on another box. Cheap SATA disks rule for what we do. They're cheap, commodity, and fast.
  • Do you have a standard API to your website? Tailrank has RSS feeds for every page. The Spinn3r service is itself an API and we have extensive documentation on the protocol. It's also free to use for researchers, so if any of your readers are pursuing a Ph.D and generally doing research work and need access to blog data we'd love to help them out. We already have Ph.D students at the University of Washington and University of Maryland (my alma mater) using Spinn3r.
  • Which DNS service do you use? PowerDNS. It's a great product. We only use the recursor daemon but it's FAST. It uses async IO though, so it doesn't really scale across processors on multicore boxes. Apparently there's a hack to get it to run across cores but it isn't very reliable. AAA caching might be broken though. I still need to look into this.
  • Who do you admire? Donald Knuth is the man!
  • How are you thinking of changing your architecture in the future? We're still working on finishing up a fully sharded database. MySQL fault tolerance and autopromotion is also an issue.
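Task/Queue itself is not open source yet, so the following is only a hypothetical sketch of the pull model described in the interview: a central queue hands out units of work and each robot fetches at whatever rate it can sustain, so fast and slow machines balance themselves.

```python
import queue
import threading
import time

# Central queue of work units (here, hypothetical feed URLs to crawl).
work = queue.Queue()
for i in range(100):
    work.put("http://feeds.example.com/%03d.rss" % i)

def robot(name, seconds_per_unit):
    """Each robot pulls work at its own pace; slower boxes simply take less."""
    done = 0
    while True:
        try:
            url = work.get_nowait()        # ask the central queue for a unit of work
        except queue.Empty:
            break
        time.sleep(seconds_per_unit)       # stand-in for actually crawling the feed
        done += 1
    print("%s processed %d units" % (name, done))

threads = [threading.Thread(target=robot, args=("fast-box", 0.001)),
           threading.Thread(target=robot, args=("slow-box", 0.005))]
for t in threads:
    t.start()
for t in threads:
    t.join()
```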

    Click to read more ...

    Wednesday
    Aug 22, 2007

    Wikimedia architecture

    Wikimedia is the platform on which Wikipedia, Wiktionary, and the seven other wiki dwarfs are built. This document is just excellent for the student trying to scale the heights of giant websites. It is full of details and innovative ideas that have been proven on some of the most used websites on the internet. Site: http://wikimedia.org/

    Information Sources

  • Wikimedia architecture
  • http://meta.wikimedia.org/wiki/Wikimedia_servers
  • scale-out vs scale-up in the "From Oracle to MySQL" blog.

    Platform

  • Apache
  • Linux
  • MySQL
  • PHP
  • Squid
  • LVS
  • Lucene for Search
  • Memcached for Distributed Object Cache
  • Lighttpd Image Server

    The Stats

  • 8 million articles spread over hundreds of language projects (English, Dutch, ...)
  • 10th busiest site in the world (source: Alexa)
  • Exponential growth: doubling every 4-6 months in terms of visitors / traffic / servers
  • 30 000 HTTP requests/s during peak-time
  • 3 Gbit/s of data traffic
  • 3 data centers: Tampa, Amsterdam, Seoul
  • 350 servers, ranging between 1x P4 to 2x Xeon Quad-Core, 0.5 - 16 GB of memory
  • managed by ~ 6 people
  • 3 clusters on 3 different continents

    The Architecture

  • Geographic load balancing, based on the source IP of the client's resolver, directs clients to the nearest server cluster. IP addresses are statically mapped to countries, and countries to clusters.
  • HTTP reverse proxy caching implemented using Squid, grouped by text for wiki content and media for images and large static files.
  • 55 Squid servers currently, plus 20 waiting for setup.
  • 1,000 HTTP requests/s per server, up to 2,500 under stress
  • ~ 100 - 250 Mbit/s per server
  • ~ 14 000 - 32 000 open connections per server
  • Up to 40 GB of disk caches per Squid server
  • Up to 4 disks per server (1U rack servers)
  • 8 GB of memory, half of that used by Squid
  • Hit rates: 85% for text, 98% for media, since the introduction of CARP (see the CARP-style hashing sketch after this list).
  • PowerDNS provides geographical distribution.
  • In their primary and regional data centers they build text and media clusters on top of LVS, CARP Squid, and cache Squid. The media storage lives in the primary data center.
  • To make sure the latest revision of every page is served, invalidation requests are sent to all Squid caches.
  • One centrally managed & synchronized software installation for hundreds of wikis.
  • MediaWiki scales well with multiple CPUs, so we buy dual quad-core servers now (8 CPU cores per box)
  • Hardware shared with External Storage and Memcached tasks
  • Memcached is used to cache image metadata, parser data, differences, users and sessions, and revision text. Metadata, such as article revision history, article relations (links, categories etc.), user accounts and settings are stored in the core databases
  • Actual revision text is stored as blobs in External storage
  • Static (uploaded) files, such as images, are stored separately on the image server - metadata (size, type, etc.) is cached in the core database and object caches
  • Separate database per wiki (not separate server!)
  • One master, many replicated slaves
  • Read operations are load balanced over the slaves; write operations go to the master
  • The master is used for some read operations in case the slaves are not yet up to date (lagged); see the read/write splitting sketch after this list
  • External Storage - Article text is stored on separate data storage clusters: simple append-only blob storage.
    • Saves space on expensive and busy core databases for largely unused data
    • Allows use of spare resources on application servers (2x 250-500 GB per server)
    • Currently replicated clusters of 3 MySQL hosts are used; this might change in the future for better manageability
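A simplified sketch of what CARP buys the Squid tier: each URL is deterministically hashed to one cache server, so every object is cached only once across the array and hit rates rise. Real CARP uses a specific combined hash with load factors; this sketch with made-up host names only shows the general idea.

```python
import hashlib

SQUIDS = ["sq1.example.org", "sq2.example.org", "sq3.example.org"]  # hypothetical

def cache_for(url):
    """Deterministically map a URL onto one Squid, CARP-style."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return SQUIDS[int(digest, 16) % len(SQUIDS)]

# The same article always hits the same cache, so it is stored only once.
print(cache_for("http://en.wikipedia.org/wiki/Squid_(software)"))
```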
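A minimal sketch of the read/write split and lag fallback described in the list above. The classes and lag threshold are hypothetical; the real behavior lives in MediaWiki's load balancer, not in code like this.

```python
import random

class Server(object):
    """Stand-in for a database connection that knows its replication lag."""
    def __init__(self, name, lag=0):
        self.name, self._lag = name, lag
    def replication_lag(self):
        return self._lag

class DatabasePool(object):
    def __init__(self, master, slaves, max_lag_seconds=5):
        self.master = master
        self.slaves = slaves
        self.max_lag = max_lag_seconds

    def for_write(self):
        # All writes go to the single master.
        return self.master

    def for_read(self, need_fresh=False):
        # Reads are load balanced over slaves that are not too far behind;
        # fall back to the master if freshness is required or all slaves lag.
        candidates = [s for s in self.slaves if s.replication_lag() <= self.max_lag]
        if need_fresh or not candidates:
            return self.master
        return random.choice(candidates)

pool = DatabasePool(Server("db-master"),
                    [Server("db-s1", lag=1), Server("db-s2", lag=30)])
print(pool.for_read().name)                 # "db-s1": the only slave within the lag budget
print(pool.for_read(need_fresh=True).name)  # "db-master"
```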

    Lessons Learned

  • Focus on architecture, not so much on operations or nontechnical stuff.
  • Sometimes caching costs more than recalculating or looking the data up at the source... profile!
  • Avoid expensive algorithms, database queries, etc.
  • Cache every result that is expensive and has temporal locality of reference (see the cache-aside sketch after this list).
  • Focus on the hot spots in the code (profiling!).
  • Scale by separating:
    • Read and write operations (master/slave)
    • Expensive operations from cheap and more frequent operations (query groups)
    • Big, popular wikis from smaller wikis
  • Improve caching: exploit temporal and spatial locality of reference and reduce the data set size per server.
  • Text is compressed and only the differences between revisions of an article are stored.
  • Simple-seeming library calls, like using stat to check for a file's existence, can take too long under load.
  • Disk I/O is seek-limited: the more disk spindles, the better!
  • Scale-out using commodity hardware doesn't require using cheap hardware. Wikipedia's database servers these days are 16GB dual or quad core boxes with 6 15,000 RPM SCSI drives in a RAID 0 setup. That happens to be the sweet spot for the working set and load balancing setup they have. They would use smaller/cheaper systems if it made sense, but 16GB is right for the working set size and that drives the rest of the spec to match the demands of a system with that much RAM. Similarly the web servers are currently 8 core boxes because that happens to work well for load balancing and gives good PHP throughput with relatively easy load balancing.
  • It is a lot of work to scale out, more if you didn't design it in originally. Wikipedia's MediaWiki was originally written for a single master database server. Then slave support was added. Then partitioning by language/project was added. The designs from that time have stood the test well, though with much more refining to address new bottlenecks.
  • Anyone who wants to design their database architecture so that it'll allow them to inexpensively grow from one box, ranked nothing, to the top ten or hundred sites on the net should start by designing it to handle slightly out-of-date data from replication slaves, know how to load balance all read queries to slaves, and, if at all possible, design it so that chunks of data (batches of users, accounts, whatever) can go on different servers. You can do this from day one using virtualisation, proving the architecture when you're small. It's a LOT easier than doing it while load is doubling every few months!
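A minimal cache-aside sketch of the "cache every expensive result with temporal locality" advice above, using a plain dict where the real setup would use Memcached; the names are illustrative, not MediaWiki code.

```python
cache = {}   # stands in for Memcached

def expensive_render(article_id):
    # Imagine parsing wikitext and building HTML here.
    return "<html>article %d</html>" % article_id

def get_rendered(article_id):
    key = "render:%d" % article_id
    if key in cache:                        # cheap hit for recently requested pages
        return cache[key]
    html = expensive_render(article_id)     # expensive miss, computed once
    cache[key] = html
    return html

print(get_rendered(42))   # computed
print(get_rendered(42))   # served from the cache
```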

    Click to read more ...

    Saturday
    Aug 4, 2007

    Try Squid as a Reverse Proxy

    This scalability strategy is brought to you by Erik Osterman: My recommendation for anyone dealing with explosive growth on a limited budget and with lots of cacheable content (e.g. content capable of returning valid expiration headers) is to employ a reverse proxy as mentioned in this article. In the last week, we had a site get AP'd, triggering 100K unique visitors to a single IIS server in under 5 hours. It took out the IIS server. Placing a single Squid in front of the server handled the entire onslaught with a max server load of 0.10 on a modest 3GHz Intel Pentium 4. It's trivial to implement for anyone interested...
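The key requirement above is that the origin returns valid expiration headers so Squid can answer repeat requests itself. A minimal, hypothetical sketch of an origin handler doing that (any web server or framework would do; Squid would be configured separately as the reverse proxy in front of it):

```python
import time
from wsgiref.simple_server import make_server
from wsgiref.handlers import format_date_time

def app(environ, start_response):
    # Tell the reverse proxy it may serve this response for 10 minutes.
    headers = [
        ("Content-Type", "text/html"),
        ("Cache-Control", "public, max-age=600"),
        ("Expires", format_date_time(time.time() + 600)),
    ]
    start_response("200 OK", headers)
    return [b"<html>cacheable page</html>"]

if __name__ == "__main__":
    make_server("", 8080, app).serve_forever()   # Squid would sit in front of this origin
```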

    Click to read more ...

    Tuesday
    Jul 10, 2007

    mixi.jp  Architecture

    Mixi is a fast growing social networking site in Japan. They provide services like: diary, community, message, review, and photo album. Having a lot in common with LiveJournal they also developed many of the same approaches. Their write up on how they scaled their system is easily one of the best out there. Site: http://mixi.jp

    Information Sources

  • mixi.jp - scaling out with open source

    Platform

  • Linux
  • Apache
  • MySQL
  • Perl
  • Memcached
  • Squid
  • Shard

    What's Inside?

  • They grew to approximately 4 million users in two years and add over 15,000 new users/day.
  • Ranks 35th on Alexa and 3rd in Japan.
  • More than 100 MySQL servers
  • Add more than 10 servers/month
  • Use non-persistent connections.
  • Diary traffic is 85% read and 15% write.
  • Message traffic is 75% read and 25% write.
  • Ran into replication performance problems so they had to split the database.
  • Considered splitting vertically by user or splitting horizontally by table type.
  • They ended up partitioning by table type and user, so all the messages for a group of users are assigned to a particular database. A partitioning key is used to decide in which database data should be stored (see the sketch after this list).
  • For caching they use memcached with 39 machines x 2 GB memory.
  • Stores more than 8 TB of images with about 23 GB added per day.
  • MySQL is only used to store metadata about the images, not the images themselves.
  • Images are either frequently accessed or rarely accessed.
  • Frequently accessed images are cached using Squid on multiple machines.
  • Rarely accessed images are served from the file system. There's no profit in caching them.
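A minimal sketch of the partitioning scheme described above, where the table type plus a key derived from the user decides which database holds the data. The host names and modulo mapping are hypothetical; mixi's actual mapping is not described in detail.

```python
# Hypothetical shard map: each table type has its own set of databases,
# and a user-derived partitioning key picks one of them.
SHARDS = {
    "message": ["msg-db1", "msg-db2", "msg-db3"],
    "diary":   ["diary-db1", "diary-db2"],
}

def db_for(table_type, user_id):
    """All of a user's rows for one table type land on the same database."""
    partition_key = user_id % len(SHARDS[table_type])
    return SHARDS[table_type][partition_key]

print(db_for("message", 1001))  # every message row for user 1001 goes here
print(db_for("diary", 1001))    # diaries may live on a different host
```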

    Lessons Learned

  • When using dynamic partitioning it's difficult to pick keys and algorithms for where data should be stored.
  • Once you partition data you can no longer do joins and you have to open a lot of connections to different databases to merge the data back together.
  • It's hard to add new hosts and rearrange data when you partition. For example, let's say your partitioning algorithm stores all the messages for users 1-N on host 1. Now let's say host 1 becomes overburdened and you want to repartition users across more hosts. This is very difficult to do.
  • By using distributed memory caching they rarely hit the DB, and their average page load time is about .02 seconds. This reduces the problems associated with partitioning.
  • You will often have to develop strategies based on the type of content. For example, images will be treated differently than short text posts.
  • Social networking sites are very time oriented, so it might be useful to partition data by time as well as user and type.

    Click to read more ...