Entries in Linux (20)

Tuesday, Jun 08, 2021

Linux Kernel vs. Memory Fragmentation (Part I)

This post introduces common methods for preventing Linux memory fragmentation, the principles behind memory compaction, how to view the fragmentation index, and more.

Click to read more ...

Friday, Oct 18, 2019

PostgreSQL Connection Pooling: Part 1 – Pros & Cons

A long time ago, in a galaxy far far away, ‘threads’ were a programming novelty rarely used and seldom trusted. In that environment, the first PostgreSQL developers decided that forking a process for each connection to the database was the safest choice. It would be a shame if your database crashed, after all.

Since then, a lot of water has flowed under that bridge, but the PostgreSQL community has stuck by its original decision. It is difficult to fault their argument – as it’s absolutely true that:

Click to read more ...

Thursday, Jan 19, 2012

Is it time to get rid of the Linux OS model in the cloud?

You program in a dynamic language, that runs on a JVM, that runs on an OS designed 40 years ago for a completely different purpose, that runs on virtualized hardware. Does this make sense? We've talked about this idea before in Machine VM + Cloud API - Rewriting The Cloud From Scratch, where the vision is to treat cloud virtual hardware as a compiler target and convert high-level language source code directly into kernels that run on it.

As new technologies evolve the friction created by our old tool chains and architecture models becomes ever more obvious. Take, for example, what a team at UCSD is releasing: a phase-change memory prototype  - a solid state storage device that provides performance thousands of times faster than a conventional hard drive and up to seven times faster than current state-of-the-art solid-state drives (SSDs). However, PCM has access latencies several times slower than DRAM.

This technology has obvious mind-blowing implications, but an interesting, not-so-obvious implication is what it says about our current standard datacenter stack. Gary Athens has written an excellent article, Revamping storage performance, spelling it all out in more detail:

Click to read more ...

Saturday, Apr 04, 2009

Digg Architecture

Update 4: Introducing Digg’s IDDB Infrastructure by Joe Stump. IDDB is a way to partition both indexes (e.g. integer sequences and unique character indexes) and actual tables across multiple storage servers (MySQL and MemcacheDB are currently supported with more to follow). Update 3: Scaling Digg and Other Web Applications. Update 2: How Digg Works and How Digg Really Works (wear ear plugs). Brought to you straight from Digg's blog. A very succinct explanation of the major elements of the Digg architecture while tracing a request through the system. I've updated this profile with the new information. Update: Digg now receives 230 million plus page views per month and 26 million unique visitors - traffic that necessitated major internal upgrades. Traffic generated by Digg's over 22 million famously info-hungry users and 230 million page views can crash an unsuspecting website head-on into its CPU, memory, and bandwidth limits. How does Digg handle billions of requests a month? Site: http://digg.com

Information Sources

  • How Digg Works by Digg
  • How Digg.com uses the LAMP stack to scale upward
  • Digg PHP's Scalability and Performance

    Platform

  • MySQL
  • Linux
  • PHP
  • Lucene
  • Python
  • APC PHP Accelerator
  • MCache
  • Gearman - job scheduling system
  • MogileFS - open source distributed filesystem
  • Apache
  • Memcached

    The Stats

  • Started in late 2004 with a single Linux server running Apache 1.3, PHP 4, and MySQL 4.0 using the default MyISAM storage engine.
  • Over 22 million users.
  • 230 million plus page views per month
  • 26 million unique visitors per month
  • Several billion page views per month
  • None of the scaling challenges faced had anything to do with PHP. The biggest issues faced were database related.
  • Dozens of web servers.
  • Dozens of DB servers.
  • Six specialized graph database servers to run the Recommendation Engine.
  • Six to ten machines that serve files from MogileFS.

    What's Inside

  • Specialized load balancer appliances monitor the application servers, handle failover, constantly adjust the cluster according to health, balance incoming requests, and cache JavaScript, CSS, and images. If you don't have the fancy load balancers, take a look at Linux Virtual Server and Squid as a replacement.
  • Requests are passed to the Application Server cluster. Application servers consist of: Apache+PHP, Memcached, Gearman and other daemons. They are responsible for coordinating access to different services (DB, MogileFS, etc) and creating the response sent to the browser.
  • Uses a MySQL master-slave setup. - Four master databases are partitioned by functionality: promotion, profiles, comments, main. Many slave databases hang off each master. - Writes go to the masters and reads go to the slaves. - Transaction-heavy servers use the InnoDB storage engine. - OLAP-heavy servers use the MyISAM storage engine. - They did not notice a performance degradation moving from MySQL 4.1 to version 5. - The schema is denormalized more than "your average database design." - Sharding is used to break the database into several smaller ones. A minimal sketch of the write-to-master / read-from-slaves split appears after this list.
  • Digg's usage pattern makes it easier for them to scale. Most people just view the front page and leave. Thus 98% of Digg's database accesses are reads. With this balance of operations they don't have to worry about the complex work of architecting for writes, which makes it a lot easier for them to scale.
  • They had problems with their storage system telling them writes were on disk when they really weren't. Controllers do this to improve the appearance of their performance. But what it does is leave a giant data integrity hole in failure scenarios. This is really a pretty common problem and can be hard to fix, depending on your hardware setup.
  • To lighten their database load they used the APC PHP accelerator as well as MCache.
  • Memcached is used for caching and memcached servers seemed to be spread across their database and application servers. A specialized daemon monitors connections and kills connections that have been open too long.
  • You can configure PHP not to parse and compile on each load using a combination of Apache 2’s worker threads, FastCGI, and a PHP accelerator. On a page's first load the PHP code is compiled, so any subsequent page loads are very fast.
  • MogileFS, a distributed file system, serves story icons, user icons, and stores copies of each story’s source. A distributed file system spreads and replicates files across a lot of disks which supports fast and scalable file access.
  • A specialized Recommendation Engine service was built to act as their distributed graph database. Relational databases are not well structured for generating recommendations so a separate service was created. LinkedIn did something similar for their graph.
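
To make the read/write split above concrete, here is a minimal sketch of the pattern. The connection objects and partition names are hypothetical stand-ins, not Digg's actual code; only the shape of the routing comes from the notes above.

```python
import random

# Hypothetical sketch of the write-to-master / read-from-slaves pattern.
# Connection objects and partition names are illustrative, not Digg's code.

class FunctionalPartition:
    def __init__(self, master, slaves):
        self.master = master      # single writable master for this function
        self.slaves = slaves      # read-only replicas hanging off that master

    def write(self, sql, params=()):
        # All writes go to the master; replication fans them out to the slaves.
        return self.master.execute(sql, params)

    def read(self, sql, params=()):
        # With ~98% of accesses being reads, spreading them across slaves is
        # where most of the scaling headroom comes from.
        return random.choice(self.slaves).execute(sql, params)

# Masters are partitioned by functionality, e.g.:
# partitions = {"comments": FunctionalPartition(comments_master, comments_slaves),
#               "profiles": FunctionalPartition(profiles_master, profiles_slaves)}
```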

    Lessons Learned

  • The number of machines isn't as important as what the pieces are and how they fit together.
  • Don't treat the database as a hammer. Recommendations didn't fit well with the relational model so they made a specialized service.
  • Tune MySQL through your database engine selection. Use InnoDB when you need transactions and MyISAM when you don't. For example, a table that is transactional (InnoDB) on the master can be MyISAM on read-only slaves.
  • At some point in their growth curve they were unable to grow by adding RAM, so they had to grow through architecture.
  • People often complain Digg is slow. This is perhaps due to their large JavaScript libraries rather than their backend architecture.
  • One way they scale is by being careful about which applications they deploy on their system. They are careful not to release applications which use too much CPU. Clearly Digg has a pretty standard LAMP architecture, but I thought this was an interesting point. Engineers often have a bunch of cool features they want to release, but those features can kill an infrastructure if that infrastructure doesn't grow along with the features. So push back until your system can handle the new features. This goes to capacity planning, something Flickr emphasizes in their scaling process.
  • You have to wonder if, by limiting new features to match their infrastructure, Digg might lose ground to other faster-moving social bookmarking services. Perhaps if the infrastructure were more easily scaled they could add features faster, which would help them compete better? On the other hand, just adding features because you can doesn't make a lot of sense either.
  • The data layer is where most scaling and performance problems are to be found, and these are language independent. You'll hit them using Java, PHP, Ruby, or insert your favorite language here.

    Related Articles

  • LinkedIn Architecture
  • LiveJournal Architecture
  • Flickr Architecture
  • An Unorthodox Approach to Database Design: The Coming of the Shard
  • eBay Architecture

    Click to read more ...

    Saturday, Nov 22, 2008

    Google Architecture

    Update 2: Sorting 1 PB with MapReduce. PB is not peanut-butter-and-jelly misspelled. It's 1 petabyte or 1000 terabytes or 1,000,000 gigabytes. It took six hours and two minutes to sort 1PB (10 trillion 100-byte records) on 4,000 computers and the results were replicated thrice on 48,000 disks. Update: Greg Linden points to a new Google article MapReduce: simplified data processing on large clusters. Some interesting stats: 100k MapReduce jobs are executed each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory. Google is the King of scalability. Everyone knows Google for their large, sophisticated, and fast searching, but they don't just shine in search. Their platform approach to building scalable applications allows them to roll out internet-scale applications at an alarmingly high, competition-crushing rate. Their goal is always to build a higher performing, higher scaling infrastructure to support their products. How do they do that?

    Information Sources

  • Video: Building Large Systems at Google
  • Google Lab: The Google File System
  • Google Lab: MapReduce: Simplified Data Processing on Large Clusters
  • Google Lab: BigTable.
  • Video: BigTable: A Distributed Structured Storage System.
  • Google Lab: The Chubby Lock Service for Loosely-Coupled Distributed Systems.
  • How Google Works by David Carr in Baseline Magazine.
  • Google Lab: Interpreting the Data: Parallel Analysis with Sawzall.
  • Dare Obasanjo's Notes on the scalability conference.

    Platform

  • Linux
  • A large diversity of languages: Python, Java, C++

    What's Inside?

    The Stats

  • Estimated 450,000 low-cost commodity servers in 2006
  • In 2005 Google indexed 8 billion web pages. By now, who knows?
  • Currently there are over 200 GFS clusters at Google. A cluster can have 1000 or even 5000 machines. Pools of tens of thousands of machines retrieve data from GFS clusters that run as large as 5 petabytes of storage. Aggregate read/write throughput can be as high as 40 gigabytes/second across the cluster.
  • Currently there are 6000 MapReduce applications at Google and hundreds of new applications are being written each month.
  • BigTable scales to store billions of URLs, hundreds of terabytes of satellite imagery, and preferences for hundreds of millions of users.

    The Stack

    Google visualizes their infrastructure as a three layer stack:
  • Products: search, advertising, email, maps, video, chat, blogger
  • Distributed Systems Infrastructure: GFS, MapReduce, and BigTable.
  • Computing Platforms: a bunch of machines in a bunch of different data centers
  • Make it easy for folks in the company to deploy at a low cost.
  • Look at price performance data on a per application basis. Spend more money on hardware to not lose log data, but spend less on other types of data. Having said that, they don't lose data.

    Reliable Storage Mechanism with GFS (Google File System)

  • Reliable scalable storage is a core need of any application. GFS is their core storage platform.
  • Google File System - a large distributed log structured file system into which they throw a lot of data.
  • Why build it instead of using something off the shelf? Because they control everything and it's the platform that distinguishes them from everyone else. They required: - high reliability across data centers - scalability to thousands of network nodes - huge read/write bandwidth requirements - support for large blocks of data which are gigabytes in size. - efficient distribution of operations across nodes to reduce bottlenecks
  • The system has master and chunk servers. - Master servers keep metadata on the various data files. Data are stored in the file system in 64MB chunks. Clients talk to the master servers to perform metadata operations on files and to locate the chunk servers that hold the chunks they need on disk. - Chunk servers store the actual data on disk. Each chunk is replicated across three different chunk servers to create redundancy in case of server crashes. Once directed by a master server, a client application retrieves files directly from chunk servers. A sketch of this read path follows this list.
  • A new application coming on line can use an existing GFS cluster or they can make their own. It would be interesting to understand the provisioning process they use across their data centers.
  • Key is enough infrastructure to make sure people have choices for their application. GFS can be tuned to fit individual application needs.
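
As a rough illustration of the master/chunk-server split described above, here is a hypothetical sketch of the read path. The master.locate() and chunk-server read() calls are made-up stand-ins, not Google's actual API; only the division of labor comes from the notes above.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # data is stored in 64MB chunks

def read_file(master, filename, offset, length):
    """Hypothetical GFS-style read: metadata from the master, bytes straight
    from a chunk server holding one of the three replicas."""
    data = b""
    while length > 0:
        chunk_index = offset // CHUNK_SIZE
        # The master only answers metadata questions: which chunk, which servers.
        chunk_id, replica_servers = master.locate(filename, chunk_index)
        chunk_offset = offset % CHUNK_SIZE
        n = min(length, CHUNK_SIZE - chunk_offset)
        # Actual bytes come directly from a chunk server, not through the master.
        data += replica_servers[0].read(chunk_id, chunk_offset, n)
        offset += n
        length -= n
    return data
```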

    Do Something With the Data Using MapReduce

  • Now that you have a good storage system, how do you do anything with so much data? Let's say you have many TBs of data stored across 1000 machines. Databases don't scale, or don't scale cost effectively, to those levels. That's where MapReduce comes in.
  • MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
  • Why use MapReduce? - Nice way to partition tasks across lots of machines. - Handle machine failure. - Works across different application types, like search and ads. Almost every application has map reduce type operations. You can precompute useful data, find word counts, sort TBs of data, etc. - Computation can automatically move closer to the IO source.
  • The MapReduce system has three different types of servers. - The Master server assigns user tasks to map and reduce servers. It also tracks the state of the tasks. - The Map servers accept user input and perform map operations on it. The results are written to intermediate files. - The Reduce servers accept intermediate files produced by map servers and perform reduce operations on them.
  • For example, you want to count the number of words in all web pages. You would feed all the pages stored on GFS into MapReduce. This would all be happening on 1000s of machines simultaneously and all the coordination, job scheduling, failure handling, and data transport would be done automatically. - The steps look like: GFS -> Map -> Shuffle -> Reduction -> Store Results back into GFS. - In MapReduce a map maps one view of data to another, producing a key value pair, which in our example is word and count. - Shuffling aggregates key types. - The reduction sums up all the key value pairs and produces the final answer. A small word-count sketch in code follows this list.
  • The Google indexing pipeline has about 20 different map reductions. A pipeline looks at the data as a whole bunch of records and aggregates keys. A second map-reduce comes along, takes that result and does something else. And so on.
  • Programs can be very small. As little as 20 to 50 lines of code.
  • One problem is stragglers. A straggler is a computation that is going slower than others which holds up everyone. Stragglers may happen because of slow IO (say a bad controller) or from a temporary CPU spike. The solution is to run multiple of the same computations and when one is done kill all the rest.
  • Data transferred between map and reduce servers is compressed. The idea is that because servers aren't CPU bound it makes sense to spend on data compression and decompression in order to save on bandwidth and I/O.
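
The word-count example maps directly onto a map function and a reduce function. Below is a minimal single-process sketch of that shape; the real system shards the input, runs these functions on thousands of machines, shuffles intermediate pairs by key, and writes results back to GFS.

```python
from collections import defaultdict

# Word count expressed as map and reduce functions. Everything runs in one
# process here just to show the shape of the computation, not the runtime.

def map_fn(page_text):
    for word in page_text.split():
        yield (word, 1)                 # one (key, value) pair per word

def reduce_fn(word, counts):
    return (word, sum(counts))          # merge all values for the same key

def run_local_mapreduce(pages):
    shuffled = defaultdict(list)
    for page in pages:                  # "map" phase
        for word, count in map_fn(page):
            shuffled[word].append(count)        # "shuffle": group by key
    return dict(reduce_fn(w, c) for w, c in shuffled.items())  # "reduce" phase

print(run_local_mapreduce(["the quick brown fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```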

    Storing Structured Data in BigTable

  • BigTable is a large scale, fault tolerant, self managing system that includes terabytes of memory and petabytes of storage. It can handle millions of reads/writes per second.
  • BigTable is a distributed hash mechanism built on top of GFS. It is not a relational database. It doesn't support joins or SQL type queries.
  • It provides a lookup mechanism to access structured data by key. GFS stores opaque data, but many applications need data with structure.
  • Commercial databases simply don't scale to this level and they don't work across 1000s of machines.
  • By controlling their own low level storage system Google gets more control and leverage to improve their system. For example, if they want features that make cross data center operations easier, they can build it in.
  • Machines can be added and deleted while the system is running and the whole system just works.
  • Each data item is stored in a cell which can be accessed using a row key, column key, and timestamp. A toy sketch of this data model follows this list.
  • Each row is stored in one or more tablets. A tablet is a sequence of 64KB blocks in a data format called SSTable.
  • BigTable has three different types of servers: - The Master servers assign tablets to tablet servers. They track where tablets are located and redistribute them as needed. - The Tablet servers process read/write requests for tablets. They split tablets when they exceed size limits (usually 100MB - 200MB). When a tablet server fails, 100 tablet servers each pick up one new tablet and the system recovers. - The Lock servers form a distributed lock service. Operations like opening a tablet for writing, Master arbitration, and access control checking require mutual exclusion.
  • A locality group can be used to physically store related bits of data together for better locality of reference.
  • Tablets are cached in RAM as much as possible.
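
As a toy illustration of the data model only (no tablets, SSTables, or GFS, and nothing to do with BigTable's real API), a cell addressed by row key, column key, and timestamp can be pictured like this:

```python
import time

class ToyBigTable:
    """Toy illustration of the BigTable data model: each cell is addressed by
    (row key, column key, timestamp). Purely an in-memory sketch."""
    def __init__(self):
        self.cells = {}   # (row, column) -> {timestamp: value}

    def put(self, row, column, value, ts=None):
        self.cells.setdefault((row, column), {})[ts or time.time()] = value

    def get(self, row, column, ts=None):
        versions = self.cells.get((row, column), {})
        eligible = [t for t in versions if ts is None or t <= ts]
        return versions[max(eligible)] if eligible else None

t = ToyBigTable()
t.put("com.example/index.html", "contents:", "<html>v1</html>", ts=1)
t.put("com.example/index.html", "contents:", "<html>v2</html>", ts=2)
print(t.get("com.example/index.html", "contents:"))          # newest -> v2
print(t.get("com.example/index.html", "contents:", ts=1))    # as of ts=1 -> v1
```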

    Hardware

  • When you have a lot of machines how do you build them to be cost efficient and use power efficiently?
  • Use ultra cheap commodity hardware and build software on top to handle their death.
  • A 1,000-fold computer power increase can be had for a 33 times lower cost if you use a failure-prone infrastructure rather than an infrastructure built on highly reliable components. You must build reliability on top of unreliability for this strategy to work.
  • Linux, in-house rack design, PC class mother boards, low end storage.
  • Price per watt on a performance basis isn't getting better. They have huge power and cooling issues.
  • They use a mix of colocation and their own data centers.

    Misc

  • Push changes out quickly rather than wait for QA.
  • Libraries are the predominant way of building programs.
  • Some applications are provided as services, like crawling.
  • An infrastructure handles versioning of applications so they can be released without fear of breaking things.

    Future Directions for Google

  • Support geo-distributed clusters.
  • Create a single global namespace for all data. Currently data is segregated by cluster.
  • More and better automated migration of data and computation.
  • Solve consistency issues that happen when you couple wide area replication with network partitioning (e.g. keeping services up even if a cluster goes offline for maintenance or due to some sort of outage).

    Lessons Learned

  • Infrastructure can be a competitive advantage. It certainly is for Google. They can roll out new internet services faster, cheaper, and at a scale few others can compete with. Many companies take a completely different approach. Many companies treat infrastructure as an expense. Each group will use completely different technologies and there will be little planning and commonality of how to build systems. Google thinks of themselves as a systems engineering company, which is a very refreshing way to look at building software.
  • Spanning multiple data centers is still an unsolved problem. Most websites are in one and at most two data centers. How to fully distribute a website across a set of data centers is, shall we say, tricky.
  • Take a look at Hadoop if you don't have the time to rebuild all this infrastructure from scratch yourself. Hadoop is an open source implementation of many of the same ideas presented here.
  • An underappreciated advantage of a platform approach is that junior developers can quickly and confidently create robust applications on top of the platform. If every project needs to create the same distributed infrastructure wheel you'll run into difficulty because the people who know how to do this are relatively rare.
  • Synergy isn't always crap. By making all parts of a system work together an improvement in one helps them all. Improve the file system and everyone benefits immediately and transparently. If every project uses a different file system then there's no continual incremental improvement across the entire stack.
  • Build self-managing systems that work without having to take the system down. This allows you to more easily rebalance resources across servers, add more capacity dynamically, bring machines off line, and gracefully handle upgrades.
  • Create a Darwinian infrastructure. Perform time-consuming operations in parallel and take the winner.
  • Don't ignore the Academy. Academia has a lot of good ideas that don't get translated into production environments. Most of what Google has done has prior art, just not prior large scale deployment.
  • Consider compression. Compression is a good option when you have a lot of CPU to throw around and limited IO.

    Click to read more ...

    Wednesday, Mar 12, 2008

    YouTube Architecture

    Update 3: 7 Years Of YouTube Scalability Lessons In 30 Minutes and YouTube Strategy: Adding Jitter Isn't A Bug

    Update 2: YouTube Reaches One Billion Views Per Day. That’s at least 11,574 views per second, 694,444 views per minute, and 41,666,667 views per hour. 

    Update: YouTube: The Platform. YouTube adds a new rich set of APIs in order to become your video platform leader--all for free. Upload, edit, watch, search, and comment on video from your own site without visiting YouTube. Compose your site internally from APIs because you'll need to expose them later anyway.

    YouTube grew incredibly fast, to over 100 million video views per day, with only a handful of people responsible for scaling the site. How did they manage to deliver all that video to all those users? And how have they evolved since being acquired by Google?

    Information Sources

  • Google Video

    Platform

  • Apache
  • Python
  • Linux (SuSe)
  • MySQL
  • psyco, a dynamic python->C compiler
  • lighttpd for video instead of Apache

    What's Inside?

    The Stats

  • Supports the delivery of over 100 million videos per day.
  • Founded 2/2005
  • 3/2006 30 million video views/day
  • 7/2006 100 million video views/day
  • 2 sysadmins, 2 scalability software architects
  • 2 feature developers, 2 network engineers, 1 DBA

    Recipe for handling rapid growth

    while (true) {
      identify_and_fix_bottlenecks();
      drink();
      sleep();
      notice_new_bottleneck();
    }

    This loop runs many times a day.

    Web Servers

  • NetScaler is used for load balancing and caching static content.
  • Run Apache with mod_fastcgi.
  • Requests are routed for handling by a Python application server.
  • The application server talks to various databases and other information sources to get all the data and formats the HTML page.
  • Can usually scale web tier by adding more machines.
  • The Python web code is usually NOT the bottleneck, it spends most of its time blocked on RPCs.
  • Python allows rapid flexible development and deployment. This is critical given the competition they face.
  • Usually less than 100 ms page service times.
  • Use psyco, a dynamic python->C compiler that uses a JIT compiler approach to optimize inner loops.
  • For CPU-intensive activities like encryption, they use C extensions.
  • Some pre-generated cached HTML for expensive-to-render blocks.
  • Row level caching in the database.
  • Fully formed Python objects are cached.
  • Some data are calculated and sent to each application so the values are cached in local memory. This is an underused strategy. The fastest cache is in your application server and it doesn't take much time to send precalculated data to all your servers. Just have an agent that watches for changes, precalculates, and sends.
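
The last point above, precalculating values and pushing them into each application server's local memory, can be sketched as follows. The agent loop, refresh interval, and data are hypothetical stand-ins, not YouTube's code.

```python
import threading, time

# Hypothetical local-memory cache that an agent keeps warm with precalculated
# values. Lookups never leave the process, which is why this is the fastest tier.

LOCAL_CACHE = {}

def agent_loop(compute_fn, interval_seconds=60):
    """Watch for changes, precalculate, and refresh the in-process cache."""
    while True:
        LOCAL_CACHE.update(compute_fn())   # e.g. top videos, feature flags
        time.sleep(interval_seconds)

def precalculate():
    return {"top_videos": ["a1b2", "c3d4"], "motd": "hello"}  # stand-in data

threading.Thread(target=agent_loop, args=(precalculate,), daemon=True).start()
```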

    Video Serving

  • Costs include bandwidth, hardware, and power consumption.
  • Each video is hosted by a mini-cluster. Each video is served by more than one machine.
  • Using a cluster means: - More disks serving content which means more speed. - Headroom. If a machine goes down others can take over. - There are online backups.
  • Servers use the lighttpd web server for video: - Apache had too much overhead. - Uses epoll to wait on multiple fds. - Switched from single process to multiple process configuration to handle more connections.
  • Most popular content is moved to a CDN (content delivery network): - CDNs replicate content in multiple places. There's a better chance of content being closer to the user, with fewer hops, and content will run over a more friendly network. - CDN machines mostly serve out of memory because the content is so popular there's little thrashing of content into and out of memory.
  • Less popular content (1-20 views per day) uses YouTube servers in various colo sites. - There's a long tail effect. A video may have a few plays, but lots of videos are being played. Random disk blocks are being accessed. - Caching doesn't do a lot of good in this scenario, so spending money on more cache may not make sense. This is a very interesting point. If you have a long tail product, caching won't always be your performance savior. - Tune the RAID controller and pay attention to other lower level issues to help. - Tune memory on each machine so there's not too much and not too little.
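
A hypothetical sketch of the popularity split described above; the threshold and hostnames are made up, but the rule is the one in the notes: hot videos go through the CDN, long-tail videos are served from YouTube's own colo machines.

```python
# Hypothetical routing rule for the popularity split described above.
CDN_THRESHOLD_VIEWS_PER_DAY = 1000   # made-up cutoff for illustration

def video_url(video_id, views_per_day):
    if views_per_day >= CDN_THRESHOLD_VIEWS_PER_DAY:
        # Hot content: replicated by the CDN, served mostly from memory near the user.
        return f"http://cdn.example.com/videos/{video_id}"
    # Long tail (1-20 views/day): any colo will do; bandwidth matters, latency
    # doesn't, and caching buys little because access is effectively random.
    return f"http://colo.example.com/videos/{video_id}"
```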

    Serving Video Key Points

  • Keep it simple and cheap.
  • Keep a simple network path. Not too many devices between content and users. Routers, switches, and other appliances may not be able to keep up with so much load.
  • Use commodity hardware. The more expensive the hardware gets, the more expensive everything else gets too (support contracts). You are also less likely to find help on the net.
  • Use simple common tools. They use most tools built into Linux and layer on top of those.
  • Handle random seeks well (SATA, tweaks).

    Serving Thumbnails

  • Surprisingly difficult to do efficiently.
  • There are about 4 thumbnails for each video so there are a lot more thumbnails than videos.
  • Thumbnails are hosted on just a few machines.
  • Saw problems associated with serving a lot of small objects: - Lots of disk seeks and problems with inode caches and page caches at the OS level. - Ran into per-directory file limits, Ext3 in particular, and moved to a more hierarchical directory structure (a sketch of this kind of layout follows this list). Recent improvements in the 2.6 kernel may improve Ext3's large directory handling up to 100 times, yet storing lots of files in a file system is still not a good idea. - A high number of requests/sec, as web pages can display 60 thumbnails per page. - Under such high loads Apache performed badly. - Used squid (reverse proxy) in front of Apache. This worked for a while, but as load increased performance eventually decreased. Went from 300 requests/second to 20. - Tried using lighttpd, but single threaded it stalled. Ran into problems with multiprocess mode because each process would keep a separate cache. - With so many images, setting up a new machine took over 24 hours. - Rebooting a machine took 6-10 hours for the cache to warm up enough to avoid going to disk.
  • To solve all their problems they started using Google's BigTable, a distributed data store: - Avoids the small file problem because it clumps files together. - Fast, fault tolerant. Assumes it's working on an unreliable network. - Lower latency because it uses a distributed multilevel cache. This cache works across different colocation sites. - For more information on BigTable take a look at Google Architecture, GoogleTalk Architecture, and BigTable.
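
A common way to escape per-directory file limits is to hash the object name into a couple of directory levels. The sketch below shows the general idea only and is not YouTube's actual layout.

```python
import hashlib, os

def thumbnail_path(root, video_id):
    """Spread thumbnails across nested directories so no single directory
    holds millions of files (hypothetical layout, not YouTube's real scheme)."""
    digest = hashlib.md5(video_id.encode()).hexdigest()
    # e.g. <root>/ab/cd/<video_id>.jpg -- 256*256 buckets keeps directories small
    return os.path.join(root, digest[:2], digest[2:4], f"{video_id}.jpg")

print(thumbnail_path("/var/thumbs", "dQw4w9WgXcQ"))
```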

    Databases

  • The Early Years - Use MySQL to store metadata like users, tags, and descriptions. - Served data off a monolithic RAID 10 volume with 10 disks. - Living off credit cards, so they leased hardware. When they needed more hardware to handle load it took a few days to order and get delivered. - They went through a common evolution: single server, then a single master with multiple read slaves, then partitioned the database, and then settled on a sharding approach. - Suffered from replica lag. The master is multi-threaded and runs on a large machine so it can handle a lot of work. Slaves are single threaded, usually run on lesser machines, and replication is asynchronous, so the slaves can lag significantly behind the master. - Updates cause cache misses which go to disk, where slow I/O causes slow replication. - Using a replicated architecture you need to spend a lot of money for incremental bits of write performance. - One of their solutions was to prioritize traffic by splitting the data into two clusters: a video watch pool and a general cluster. The idea is that people want to watch video, so that function should get the most resources. The social networking features of YouTube are less important so they can be routed to a less capable cluster.
  • The later years: - Went to database partitioning. - Split into shards with users assigned to different shards. - Spreads writes and reads. - Much better cache locality which means less IO. - Resulted in a 30% hardware reduction. - Reduced replica lag to 0. - Can now scale database almost arbitrarily.
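
A minimal sketch of the user-to-shard assignment described above. It assumes a fixed number of shards and hash-based placement purely for illustration; a real deployment would more likely use a lookup directory so shards can be rebalanced.

```python
import hashlib

SHARD_DSNS = [
    "mysql://shard0.example.com/youtube",
    "mysql://shard1.example.com/youtube",
    "mysql://shard2.example.com/youtube",
    "mysql://shard3.example.com/youtube",
]

def shard_for_user(user_id):
    """All of a user's rows live on one shard, which spreads reads and writes
    and keeps that user's working set in one cache (better locality, less IO)."""
    h = int(hashlib.sha1(str(user_id).encode()).hexdigest(), 16)
    return SHARD_DSNS[h % len(SHARD_DSNS)]

print(shard_for_user(42))
```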

    Data Center Strategy

  • Used managed hosting providers at first. Living off credit cards so it was the only way.
  • Managed hosting can't scale with you. You can't control hardware or make favorable networking agreements.
  • So they went to a colocation arrangement. Now they can customize everything and negotiate their own contracts.
  • Use 5 or 6 data centers plus the CDN.
  • Videos come out of any data center. Not closest match or anything. If a video is popular enough it will move into the CDN.
  • Video is bandwidth dependent, not really latency dependent. It can come from any colo.
  • For images latency matters, especially when you have 60 images on a page.
  • Images are replicated to different data centers using BigTable. Code looks at different metrics to know who is closest.

    Lessons Learned

  • Stall for time. Creative and risky tricks can help you cope in the short term while you work out longer term solutions.
  • Prioritize. Know what's essential to your service and prioritize your resources and efforts around those priorities.
  • Pick your battles. Don't be afraid to outsource some essential services. YouTube uses a CDN to distribute their most popular content. Creating their own network would have taken too long and cost too much. You may have similar opportunities in your system. Take a look at Software as a Service for more ideas.
  • Keep it simple! Simplicity allows you to rearchitect more quickly so you can respond to problems. It's true that nobody really knows what simplicity is, but if you aren't afraid to make changes then that's a good sign simplicity is happening.
  • Shard. Sharding helps to isolate and constrain storage, CPU, memory, and IO. It's not just about getting more write performance.
  • Constant iteration on bottlenecks: - Software: DB, caching - OS: disk I/O - Hardware: memory, RAID
  • You succeed as a team. Have a good cross discipline team that understands the whole system and what's underneath the system. People who can set up printers, machines, install networks, and so on. With a good team all things are possible.

    Click to read more ...

    Thursday, Feb 07, 2008

    clusteradmin.blogspot.com - blog about building and administering clusters

    A blog about cluster administration. Written by a system administrator working at an HPC (High Performance Computing) data center, mostly dealing with PC clusters (100s of servers), SMP machines and distributed installations. The blog concentrates on software/configuration/installation management systems, load balancers, monitoring and other cluster-related solutions.

    Click to read more ...

    Thursday, Jan 24, 2008

    Mailinator Architecture

    Update: A fun exploration of applied searching in How to search for the word "pen1s" in 185 emails every second. When indexOf doesn't cut it you just trie harder. Has a drunken friend ever inspired you to create a first of its kind internet service that is loved by millions, deemed subversive by thousands, all while handling over 1.2 billion emails a year on one rickety old server? That's how Paul Tyma came to build Mailinator. Mailinator is a free no-setup web service for thwarting evil spammers by creating throw-away registration email addresses. If you don't give web sites your real email address they can't spam you. They spam Mailinator instead :-) I love design with a point-of-view and Mailinator has a big giant hairy one: performance first, second, and last. Why? Because Mailinator is free and that allows Paul to showcase his different perspective on design. While competitors buy big iron to handle load, Paul uses a big idea instead: pick the right problem and create a design to fit the problem. No more. No less. The result is a perfect system architecture sonnet, beauty within the constraints of form. How does Mailinator carry out its work as a spam busting super hero? Site: http://mailinator.com/

    Information Sources

  • The Architecture of Mailinator
  • Mailinator's 2006 Stats

    The Platform

  • Linux
  • Tomcat
  • Java

    The Stats

  • Will process an estimated 1.29 BILLION emails for 2007. 450.74 million in 2006. 280.68 million in 2005.
  • Peak rate of 6.5 million emails/day or 4513/min or 75/sec.
  • Mailinator runs on a very modest machine with an AMD 2Ghz Athlon processor, 1GB of RAM (much less is used), and a low-performance 80G IDE hard drive. And the machine is not very busy at all.
  • Mailinator runs for months unattended and very few emails are lost, even under constant spam attacks and high peak loads.

    The Architecture

  • Having a free system means the system doesn't have to be perfect. So the design goals are: - Design a system that values survival above all else, even users. Survival is key because Mailinator must fight off attacks on a daily basis. - Provide 99.99% uptime and accuracy for users. Higher uptime goals would be impractical and costly. And since the service is free this is just part of rules of the game for users. - Support the following service model: user signs-up for something, goes to Mailinator, clicks on the subscription link, and forgets about it. This means email doesn't have to be stored persistently on disk. Email can reside in RAM because it is temporary (3-4 hours). If you want a real mailbox then use another service.
  • The original flow of email handling was: - Sendmail received email in a single on-disk mailbox. - The Java based Mailinator grabbed emails using IMAP and/or POP (it changed over time) and deleted them. - The system then loaded all emails into memory and let them sit there. - The oldest email was pushed out once the 20,000 in memory limit was reached.
  • The original architecture worked well: - It was stable and stayed up for months at a time. - It used almost all the 1GB of RAM. - Problems started when the incoming email rate started surpassing 800,000 a day. The system broke down because of disk contention between Mailinator and the email subsystem.
  • The New Architecture: - The idea was to remove the path through the disk, which was accomplished with a complete system rewrite. - The web application, the email server, and all email storage run in one JVM. - Sendmail was replaced with a custom built SMTP server. Because of the nature of Mailinator a full SMTP server was not necessary. Mailinator does not need to send email, and its primary duty is to accept or reject email as fast as possible. This is the downside of layering. Layering is very often given as a key strategy in scaling, but it can kill performance because crucial decisions are best handled at the highest levels of the stack. So work flows through the system only to be dumped at the lower layers when many of the RAM and cycle stealing operations have already been accomplished. So the decision to go with a custom SMTP server is an interesting and brave decision. Most people at this point would just add more hardware. And they wouldn't be wrong, but it's interesting to see this path taken as well. Maybe with more DOM and AOP like architectures we can flatten the stack and get better performance when needed. - Now Mailinator receives an email directly, parses it, and stores it into memory. The disk is bypassed completely and remains fairly idle. - Emails are written to disk when the system is coming down so they can be reloaded on startup. - Logging was shut off to remove the risk of subpoenas. When logging was performed, log data was written in batches so several thousand log lines would be written in one disk write. This minimized disk contention at the risk of losing helpful diagnostic information. - The system uses under 300 threads. More aren't needed. - On arrival each email passes through a filter system and is stored in RAM if all filters are passed. - Every inbox is limited to only 10 emails so popular inboxes, like joe@mailinator.com, can't blow up the system. - No incoming email can be over 100k and all attachments are immediately discarded. This saves on RAM.
  • Emails are compressed in RAM: - Since 99% of emails are never looked at, compressed email saves RAM. They are only ever decompressed when someone looks at them. - Mailinator can store about 80,000 emails in RAM, using under 300MB of RAM compared to the 20,000 emails which were stored in 1GB RAM in the original design. - With this pool the average email lifespan is about 3-4 hours. - It's likely 200,000 emails could fit in memory, but there hasn't been a real need. - This is one of the design details I love because it's based on real application usage patterns. RAM is precious and CPU is not, so use compression to save RAM at the expense of CPU, knowing you won't have to take the CPU hit twice, most of the time.
  • Mailinator does not guarantee anonymity and privacy: - There is no privacy. Anyone can read any inbox at any time. - Relaxing these constraints, while shocking, makes the design much simpler. - For the user it is simple because there is no sign up needed. When a web site asks you for an email address you can just enter a mailinator address. You don't need to create a separate account. Typing in the email address effectively creates the mailinator account. Simple. - In practice users still get a high level of privacy.
  • The goal of survivability leads to aggressive SPAM filtering. - Mailinator doesn't have anything against SPAM, but because it gets so much SPAM, it must be filtered out when it threatens the uptime of the system. - Which leads to this rule: If you do anything (spammer or not) that starts affecting the system, your emails will be refused and you may be locked out.
  • To be accepted an email must pass the following filter chain: - Bounce: all bounced emails are dropped. - IP: too much email from a single IP are dropped - Subject: too much email on the same subject is dropped - Potty: subjects containing words that indicate hate or crimes or just downright nastiness are dropped.
  • Surviving Email Floods from a Single IP Address - An AgingHashmap is used to filter out spammers from a particular IP address. When an email arrives from an IP address, the IP is put in the map and a counter is increased for all subsequent emails. - After a certain period of time with no emails the counter is cleared. - When a sender reaches a threshold email count the sender is blocked. This prevents a sender from flooding the system. (A rough sketch of this aging-counter idea follows this list.) - Many systems use this sort of logic to protect all sorts of resources, like comments. You can use memcached for the same purpose in a distributed system.
  • Protecting Against Zombie Attacks: - Spam can be sent from a large coordinated set of different IP addresses, called a zombie network. The same message is sent from thousands of different IP addresses so the techniques for stopping email from a single IP address are not sufficient. - This filtering is a little more complex than IP blocking because you have to parse enough of the email to get the subject line, and matching subject strings is a little more resource intensive. - When something like 20 emails arrive with the same subject within 2 minutes, all emails with that subject are then banned for 1 hour. - Interestingly, subjects are not banned forever because that would mean Mailinator would have to track subjects forever and the system design is inherently transient. This is pretty clever I think. At the cost of a few "bad" emails getting through, the system is much simpler because no persistent list must be managed and that list surely would become a bottleneck. A system with more stringent SPAM filtering goals would have to create a much more complex and less robust architecture. - Nearly 9% of emails are blocked with this filter. - From my reading Mailinator filters only on IP and subject, so it doesn't have to read the body of the email to accept or reject it. This minimizes resource usage when most email will be rejected.
  • To lessen the danger from DOS attacks: - All connections that are silent for a specific period of time are dropped. - Mailinator sends replies to email senders very slowly, like 10 or 20 or 30 seconds, even for a very small amount of data. This slows down spammers who are trying to send out spam as fast as possible and may make them rethink sending email again to that address. The wait period is reduced during busy periods so email isn't dropped.
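
The AgingHashmap idea boils down to per-key counters that forget themselves after a quiet period. Here is a rough sketch of that behavior; the thresholds are made up and this is not Mailinator's implementation.

```python
import time

class AgingCounter:
    """Rough sketch of an AgingHashmap-style filter: per-key counters that
    reset after a quiet period. Thresholds are illustrative only."""
    def __init__(self, quiet_seconds=120, block_threshold=20):
        self.quiet_seconds = quiet_seconds
        self.block_threshold = block_threshold
        self.counters = {}   # key -> (count, last_seen)

    def allow(self, key, now=None):
        now = now or time.time()
        count, last_seen = self.counters.get(key, (0, now))
        if now - last_seen > self.quiet_seconds:
            count = 0                           # sender went quiet: forget history
        count += 1
        self.counters[key] = (count, now)
        return count <= self.block_threshold    # False -> refuse the email

ip_filter = AgingCounter()
subject_filter = AgingCounter()

def accept(sender_ip, subject):
    # The same aging-counter trick covers both single-IP floods and
    # zombie-network floods that share a subject line.
    return ip_filter.allow(sender_ip) and subject_filter.allow(subject)
```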

    Lessons Learned

  • Perfection is a trap. How many systems are made much more complicated by the drive to be 100% everything. If you've been in those meetings you know what they are like. Oh, we can't do it this way or that way because there's .01% chance of something going wrong. Instead ask: how imperfect can you be and be good enough?
  • What you throw out is as important as what you keep in. We have many preconceptions of how to design systems. We may take for granted that you need to scale out, you need to have email accessible days later, and that you must provide private accounts for everyone. But do you really need these things? What can you toss?
  • Know the purpose of your system and design accordingly. Being everything to everyone means you are nothing to nobody. Keeping emails for a short period of time, allowing some SPAM to get through, and accepting less than 100% uptime create a strong vision for the system that help drive the design in all areas. You would only build your own SMTP server if you had a very strong idea of what your system was about and what you needed. I know this would have never occurred to me as an idea. I would have added more hardware.
  • Fail fast for the common case before committing resources. A high percentage of email is rejected so it makes sense to reject it as early as possible in the stack to minimize the resources needed to accomplish the task. Figure out how to short circuit frequently failed items as fast as possible. This is an important and often overlooked scaling strategy.
  • Efficiency often means build it yourself. Off the shelf tools tend to do the whole job. If you only need part of the job done you may be able to write a custom component that runs much faster.
  • Adaptively forget. A little failure is OK. All the blocked IP addresses don't need to be remembered forever. Let the block decisions build up from local data rather than global state. This is powerfully simple and robust architecture.
  • Java doesn't have to be slow. Enough said.
  • Avoid the disk. Many applications need to hit the disk, but the disk is always a bottleneck. Can you design around the disk using other creative strategies?
  • Constrain resource usage. Put in constraints, like inbox size, that will keep your system from spiking uncontrollably. Unconstrained resource usage must be avoided with limited resources.
  • Compress data. Compression can be a major win when trying to conserve RAM. I've seen memory usage drop by more than half when using compression with very little overhead. If you are communicating locally, just have the client encode the data and keep it encoded. Build APIs to access the data without having to decode the full message.
  • Use fixed size resource pools to handle load. Many applications don't control resource usage, like memory, and they crash when too much is used. To create a really robust system fix your resources and drop work when those resources are full. You can age resources, give priority access, give fair access, or use any other logic to arbitrate resource access, but because the resource will be limited, you will stay up under load.
  • If you don't keep data it can't be subpoenaed. Because Mailinator doesn't store email or logs on disk, nothing can be subpoenaed.
  • Use what you know. We've seen this lesson a few times. Paul knew Java better than anything else, so he used it, made it work, and got the job done.
  • Find your own Mailinators. Sure, Mailinator is a small system. In a large system it would just be a small feature, but your system is composed of many Mailinator sized projects. What if you developed some of those like Mailinator?
  • KISS exists, though it's rare. Keeping it simple is always talked about, but rarely are we shown real examples. It's mostly just your way is complex and my way is simple because it's my way. Mailinator is a good example of simple design.
  • Robustness is a function of architecture. To create a design that efficiently uses memory and survives massive spam attacks required an architectural approach that looked at the entire stack.

    Related Articles

  • PlentyOfFish champions straightforward, bare-bones simplicity.
  • Varnish smartly uses OS features to find incredible performance.
  • ThemBid gracefully pieces together open source components.

    Click to read more ...

    Monday, Nov 19, 2007

    Tailrank Architecture - Learn How to Track Memes Across the Entire Blogosphere

    Ever feel like the blogosphere is 500 million channels with nothing on? Tailrank finds the internet's hottest channels by indexing over 24M weblogs and feeds per hour. That's 52TB of raw blog content (no, not sewage) a month and requires continuously processing 160Mbits of IO. How do they do that? This is an email interview with Kevin Burton, founder and CEO of Tailrank.com. Kevin was kind enough to take the time to explain how they scale to index the entire blogosphere.

    Sites

  • Tailrank - We track the hottest news in the blogosphere!
  • Spinn3r - A blog spider you can specialize with your own behavior instead of creating your own.
  • Kevin Burton's Blog - his blog is a mix of politics and technical talk. Both are always interesting.

    Platform

  • MySQL
  • Java
  • Linux (Debian)
  • Apache
  • Squid
  • PowerDNS
  • DAS storage.
  • Federated database.
  • ServerBeach hosting.
  • Job scheduling system for work distribution.

    Interview

  • What is your system for? Tailrank was originally a memetracker to track the hottest news being discussed within the blogosphere. We started having a lot of requests to license our crawler and we shipped that in the form of Spinn3r about 8 months ago. Spinn3r is a self-contained crawler for companies that want to index the full blogosphere and consumer generated media. Tailrank is still a very important product alongside Spinn3r and we're working on Tailrank 3.0 which should be available in the future. No ETA at the moment but it's actively being worked on.
  • What particular design/architecture/implementation challenges does your system have? The biggest challenge we have is the sheer amount of data we have to process and keeping that data consistent within a distributed system. For example, we process 52TB of content per month. This has to be indexed in a highly available storage architecture so the normal distributed database problems arise.
  • What did you do to meet these challenges? We've spent a lot of time building out a distributed system that can scale and handle failure. For example, we've built a tool called Task/Queue that is analogous to Google's MapReduce. It has a centralized queue server which hands out units of work to robots which make requests. It works VERY well for crawlers in that slower machines just fetch work at a slower rate while more modern machines (or better tuned machines) request work at a higher rate. This neatly sidesteps one of the main distributed computing fallacies: that the network is homogeneous. Task/Queue is generic enough that we could actually use it to implement MapReduce on top of the system. We'll probably open source it at some point. Right now it has too many tentacles wrapped into other parts of our system. (A small sketch of this pull model appears after this interview.)
  • How big is your system? We index 24M weblogs and feeds per hour and process content at about 160-200Mbps. At the raw level we're writing to our disks at about 10-15MBps continuously.
  • How many documents, do you serve? How many images? How much data? Right now the database is about 500G. We're expecting it to grow well beyond this in 2008 as we expand our product offering.
  • What is your rate of growth? It's mostly a function of customer feature requests. If our customers want more data we sell it to them. In 2008 we're planning on expanding our cluster to index larger portions of the web and consumer generated media.
  • What is the architecture of your system? We use Java, MySQL and Linux for our cluster. Java is a great language for writing crawlers. The library support is pretty solid (though it seems like Java 7 is going to be killer when they add closures). We use MySQL with InnoDB. We're mostly happy with it though it seems I end up spending about 20% of my time fixing MySQL bugs and limitations. Of course nothing is perfect. MySQL for example was really designed to be used on single core systems. The MySQL 5.1 release goes a bit farther to fix multi-core scalability locks. I recently blogged about how these new multi-core machines should really be considered N machines instead of one logical unit: Distributed Computing Fallacy #9.
  • How is your system architected to scale? We use a federated database system so that we can split the write load as we see more IO. We've written a lot of infrastructure code and this will probably be released as Open Source as well. We've already opened up a lot of our infrastructure code:
  • http://code.tailrank.com/lbpool - load balancing JDBC driver for use with DB connection pools.
  • http://code.tailrank.com/feedparser - Java RSS/Atom parser designed to elegantly support all versions of RSS
  • http://code.google.com/p/benchmark4j/ - Java (and UNIX) equivalent of Windows' perfmon
  • http://code.google.com/p/spinn3r-client/ - Client bindings to access the Spinn3r web service
  • http://code.google.com/p/mysqlslavesync/ - Clone a MySQL installation and setup replication.
  • http://code.google.com/p/log5j/ - Logger facade that supports printf style message format for both performance and ease of use.
  • How many servers do you have? About 15 machines so far. We've spent a lot of time tuning our infrastructure so it's pretty efficient. That said, building a scalable crawler is not an easy task so it does take a lot of hardware. We're going to be expanding FAR past this in 2008 and will probably hit about 2-3 racks of machines (~120 boxes).
  • What operating systems do you use? Linux via Debian Etch on 64 bit Opterons. I'm a big Debian fan. I don't know why more hardware vendors don't support Debian. Debian is the big secret in the valley that no one talks about. Most of the big web 2.0 shops like Technorati, Digg, etc use Debian.
  • Which web server do you use? Apache 2.0. Lighttpd is looking interesting as well.
  • Which reverse proxy do you use? About 95% of the pages of Tailrank are served from Squid.
  • How is your system deployed in data centers? We use ServerBeach for hosting. It's a great model for small to medium sized startups. They rack the boxes, maintain inventory, handle the network, etc. We just buy new machines and pay a flat markup. I wish Dell, Sun, and HP would sell directly to clients in this manner. We use one data center right now and we're looking to expand into two for redundancy.
  • What is your storage strategy? Directly attached storage. We buy two SATA drives per box and set them up in RAID 0. We use the redundant array of inexpensive databases solution so if an individual machine fails there's another copy of the data on another box. Cheap SATA disks rule for what we do. They're cheap, commodity, and fast.
  • Do you have a standard API to your website? Tailrank has RSS feeds for every page. The Spinn3r service is itself an API and we have extensive documentation on the protocol. It's also free to use for researchers, so if any of your readers are pursuing a Ph.D and generally doing research work and need access to blog data we'd love to help them out. We already have Ph.D students at the University of Washington and University of Maryland (my alma mater) using Spinn3r.
  • Which DNS service do you use? PowerDNS. It's a great product. We only use the recursor daemon but it's FAST. It uses async IO though so it doesn't really scale across processors on multicore boxes. Apparently there's a hack to get it to run across cores but it isn't very reliable. AAA caching might be broken though. I still need to look into this.
  • Who do you admire? Donald Knuth is the man!
  • How are you thinking of changing your architecture in the future? We're still working on finishing up a fully sharded database. MySQL fault tolerance and autopromotion is also an issue.
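
The Task/Queue pull model Kevin describes is easy to sketch: a central queue hands out units of work and each robot asks for more only when it finishes, so faster machines naturally do more. Below is a hypothetical single-process illustration, not Tailrank's code.

```python
import queue, threading, time

# Hypothetical sketch of a Task/Queue-style distributor: a central queue hands
# out units of work and each robot pulls the next unit only when it finishes,
# so faster (or better tuned) machines naturally take on more of the work.

work_queue = queue.Queue()
for i in range(100):
    work_queue.put(f"http://example.com/feed/{i}")   # stand-in units of work

def robot(fetch_delay):
    done = 0
    while True:
        try:
            url = work_queue.get_nowait()    # ask the queue server for work
        except queue.Empty:
            print(f"robot(delay={fetch_delay}) finished {done} units")
            return
        time.sleep(fetch_delay)              # stand-in for crawling the feed
        work_queue.task_done()
        done += 1

robots = [threading.Thread(target=robot, args=(0.01,)),
          threading.Thread(target=robot, args=(0.05,))]
for r in robots: r.start()
for r in robots: r.join()
```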

    Click to read more ...

    Tuesday, Nov 13, 2007

    Flickr Architecture

    Update: Flickr hits 2 Billion photos served. That's a lot of hamburgers.

    Flickr is both my favorite bird and the web's leading photo sharing site. Flickr has an amazing challenge, they must handle a vast sea of ever expanding new content, ever increasing legions of users, and a constant stream of new features, all while providing excellent performance. How do they do it?

    Site: http://www.flickr.com

    Information Sources

  • Flickr and PHP (an early document)
  • Capacity Planning for LAMP
  • Federation at Flickr: Doing Billions of Queries a Day by Dathan Pattishall.
  • Building Scalable Web Sites by Cal Henderson from Flickr.
  • Database War Stories #3: Flickr by Tim O'Reilly
  • Cal Henderson's Talks. A lot of useful PowerPoint presentations.

    Platform

    Click to read more ...