Wednesday
Dec 17, 2008

Ringo - Distributed key-value storage for immutable data

Ringo is an experimental, distributed, replicating key-value store based on consistent hashing and immutable data. Unlike many general-purpose databases, Ringo is designed for a specific use case: archiving small (less than 4 KB) to medium-size (less than 100 MB) data items in real time, in a manner that scales to terabytes of data, so that the data can survive K - 1 disk failures (where K is the desired number of replicas) without any downtime. In addition to storing data, Ringo should be able to retrieve individual or small sets of data items with low latencies (under 10 ms) and provide a convenient on-disk format for bulk data access. Ringo is compatible with the map-reduce framework Disco, and it was started at Nokia Research Center Palo Alto.
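Ringo's own internals aren't shown here, but the consistent-hashing idea it builds on is easy to sketch. The snippet below is a minimal, purely illustrative hash ring in Python (not Ringo's actual implementation; all names are hypothetical): nodes and keys are hashed onto the same ring, each key is stored on the first node at or after its position, and the next K - 1 distinct nodes hold the replicas.

```python
import hashlib
from bisect import bisect, insort

def ring_hash(value: str) -> int:
    """Map a string onto a large integer ring via MD5."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring: a key belongs to the first node at or after
    its position on the ring; the following k-1 distinct nodes hold replicas."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self._ring = []                      # sorted list of (position, node)
        for node in nodes:
            insort(self._ring, (ring_hash(node), node))

    def nodes_for(self, key: str):
        if not self._ring:
            return []
        start = bisect(self._ring, (ring_hash(key), ""))
        owners = []
        for i in range(len(self._ring)):
            node = self._ring[(start + i) % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
            if len(owners) == self.replicas:
                break
        return owners

ring = HashRing(["node-a", "node-b", "node-c", "node-d"], replicas=3)
print(ring.nodes_for("photo:1234"))   # e.g. ['node-c', 'node-a', 'node-d']
```

A production system like Ringo also has to handle virtual nodes and nodes joining or leaving, but the routing idea is the same: only keys near the affected ring positions move.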

Click to read more ...

Wednesday
Dec 17, 2008

Scalability Strategies Primer: Database Sharding

This article is a primer, intended to shine some much-needed light on the logical, process-oriented implementations of database scalability strategies in the form of a broad introduction. More specifically, the intent is to elaborate on the majority of these implementations by example.
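The article itself walks through the strategies in depth; as a taste of the simplest one, here is a minimal sketch (my own illustration, not taken from the article) of key-based sharding, where a hash of the sharding key decides which database a row lives in. The shard names are hypothetical.

```python
import hashlib

# Hypothetical shard map: in practice these would be real DSNs or connection pools.
SHARDS = [
    "mysql://db-shard-0/users",
    "mysql://db-shard-1/users",
    "mysql://db-shard-2/users",
    "mysql://db-shard-3/users",
]

def shard_for(user_id: int) -> str:
    """Key-based (hash) sharding: the same user_id always maps to the same shard."""
    digest = hashlib.sha1(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for(42))     # always the same shard for user 42
print(shard_for(1337))   # a different user may land on a different shard
```

Naive modulo hashing like this makes resharding painful when the shard count changes; range-based and directory-based schemes trade that simplicity for easier rebalancing, which is exactly the kind of trade-off a sharding primer has to cover.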

Click to read more ...

Tuesday
Dec 16, 2008

[ANN] New Open Source Cache System

The SHOP.COM Cache System is now available at http://code.google.com/p/sccache/. The SHOP.COM Cache System is an object cache system that:

  • is an in-process cache and external, shared cache
  • is horizontally scalable
  • stores cached objects to disk
  • supports associative keys
  • is non-transactional
  • can have any size key and any size data
  • does auto-GC based on TTL
  • is container and platform neutral

It was built in-house at SHOP.COM (by me) and has powered our website for years. We are open-sourcing it in the hope that it will be useful to others and to get some help in its maintenance. This is our first open source attempt and we'd appreciate any help and comments.
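sccache's actual API isn't reproduced here; the snippet below is only a rough Python illustration of one feature from the list above, TTL-based auto-GC, where entries expire a fixed time after insertion and are collected lazily on read.

```python
import time

class TTLCache:
    """Tiny in-process cache where each entry expires ttl_seconds after being set.
    Illustrative only -- this is not sccache's API or implementation."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}                 # key -> (expires_at, value)

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if time.monotonic() >= expires_at:   # lazily garbage-collect on read
            del self._store[key]
            return default
        return value

cache = TTLCache(ttl_seconds=0.1)
cache.set("session:abc", {"user": 42})
print(cache.get("session:abc"))   # {'user': 42}
time.sleep(0.2)
print(cache.get("session:abc"))   # None -- expired and collected
```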

Click to read more ...

Tuesday
Dec 16, 2008

Facebook is Hiring

I thought with the job situation these days that people might be interested in some open jobs at Facebook. Here's what's available:


Facebook is hiring! We are looking for a Systems Engineer/Architect and a Site Reliability Engineer. I have attached the job descriptions below. If you are interested, please contact Michelle Bostock, mbostock-at-facebook.com. Thanks and Happy Holidays!

Systems Architect - Palo Alto, CA

Description: Facebook is seeking a seasoned Systems Architect to join the Operations team. The position is full-time, is based in our main office in downtown Palo Alto, and reports to the Manager of Systems Operations.

Responsibilities:

  • Analyze application flow and infrastructure design to improve performance and scalability of the site
  • Collaborate on design of services infrastructure, from servers to networking
  • Monitor, analyze, and make recommendations as appropriate to improve site stability and availability
  • Evaluate hardware and software technologies to improve site efficiency and performance
  • Troubleshoot and solve issues with hardware, applications, and network components
  • Lead team efforts from design to implementation, prioritizing tasks and resources while interacting with Engineering and Operations
  • Document current and future configuration processes and policies
  • Participate in 24x7 on-call support

Requirements:

  • B.S. in Computer Science or equivalent experience
  • 4+ years of experience in Operations with large web farms
  • Extensive knowledge of web architecture and technologies, including Linux, Apache, MySQL, PHP, TCP/IP, security, HTTP, LDAP, and MTAs
  • Strong background/interest in application and infrastructure design
  • Scripting and programming skills
  • Excellent verbal and written communication skills

Site Reliability Engineer - Palo Alto, CA

Description: Facebook is seeking talented operations engineers to join the Site Reliability Engineering team. The ideal candidate will have strong communication skills, a passion for tinkering with Linux, and an almost insane fondness for fast-paced, seat-of-your-pants troubleshooting and crisis management. The position is full-time and is based in our main office in downtown Palo Alto. This position reports to the Manager of Site Reliability Engineering.

Responsibilities:

  • Monitor the stability and performance of the website
  • Remotely troubleshoot and diagnose hardware problems
  • Debug issues with Linux software, applications, and network
  • Resolve technical challenges encountered in LAMP technologies
  • Develop and maintain monitoring tools and automation systems
  • Predict and respond to utilization variances across multiple datacenters
  • Identify and triage all outage-related events
  • Facilitate communication, coordinate escalation, and work with subject matter experts to implement critical fixes
  • Automate and streamline processes
  • Track issues and run reports

Requirements:

  • 2-3+ years of Linux support/sysadmin experience in an Internet operations environment
  • BA/BS in Computer Science or a related field, or equivalent experience
  • Working knowledge of Linux, Cisco, TCP/IP, Apache, and MySQL
  • Experience working with network management systems and monitoring tools, such as Nagios, Ganglia, and Cacti
  • Competency in Shell, PHP, Perl, or Python; C is a plus
  • Solid understanding of web services architecture and commonly employed technologies
  • A sense of urgency in responding to and resolving critical issues that relate to the performance of the site and/or core infrastructure
  • Excellent verbal and written communication skills
  • Participation in a shifted coverage schedule, including working nights and on-call rotations

Click to read more ...

Sunday
Dec 14, 2008

Scaling MySQL on a 256-way T5440 server using Solaris ZFS and Java 1.7

How do you scale MySQL on a 32-core system with 256 hardware threads? Diagonal scalability in a box. An impressive benchmark that achieved more than 79,000 SQL queries per second on a single 4 RU server! Is this real? If so, what is the role of good old horizontal scalability? The goals of the benchmark:

  1. Reach a high throughput of SQL queries on a 256-way Sun SPARC Enterprise T5440
  2. Do it 21st-century style, i.e. with MySQL and ZFS, not 20th-century style, i.e. with OraSybInf... and VxFS
  3. Do it with minimal tuning, i.e. as close to out-of-the-box as possible
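For a sense of scale, here is a quick back-of-the-envelope breakdown (my own arithmetic, not from the benchmark report) of what the quoted 79,000 queries per second means per core and per hardware thread on this box.

```python
# Rough per-thread arithmetic for the reported result (figures assumed from the post).
queries_per_second = 79_000
hardware_threads = 256            # T5440: 4 sockets x 8 cores x 8 threads per core
cores = 32

print(queries_per_second / hardware_threads)   # ~308 queries/s per hardware thread
print(queries_per_second / cores)              # ~2469 queries/s per core
print(1_000_000 / (queries_per_second / hardware_threads))
# ~3240 microseconds (~3.2 ms) of thread time available per query
```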

Click to read more ...

Saturday
Dec 13, 2008

Strategy: Facebook Tweaks to Handle 6 Times as Many Memcached Requests

Our latest strategy is taken from a great post by Paul Saab of Facebook, detailing how, with the changes Facebook has made to memcached, they have:

...been able to scale memcached to handle 200,000 UDP requests per second with an average latency of 173 microseconds. The total throughput achieved is 300,000 UDP requests/s, but the latency at that request rate is too high to be useful in our system. This is an amazing increase from 50,000 UDP requests/s using the stock version of Linux and memcached.

To scale, Facebook has hundreds of thousands of TCP connections open to their memcached processes. First, this is still amazing. It's not so long ago that you could never have done this. Optimizing connection use was always a priority because the OS simply couldn't handle large numbers of connections, threads, or CPUs. To get to this point is a big accomplishment. Still, at that scale, problems surface that few have ever had to solve.

Some of the problems Facebook faced and fixed:

  • Per-connection consumption of resources. What works well at a low number of inputs can totally kill a system as inputs grow. Memcached uses a per-connection buffer, which adds up to a lot of memory that could be used to store data. Nothing wrong with this design choice, but Facebook made changes to use a per-thread shared connection buffer and reclaimed gigabytes of RAM on each server.
  • Kernel lock contention. Facebook discovered under load there was lock contention when transmitting through a single UDP socket from multiple threads. Sockets are data structures too and they are subject to the usual lock contention issues. Facebook got around this issue by maintaining separate reply sockets in different threads so they would not contend with the receive sockets. They found another bottleneck in Linux’s “netdevice” layer that sits in-between IP and device drivers. They changed the dequeue algorithm to batch dequeues so more work was done when they had the CPU.
  • Application lock contention. Nothing brings out lock issues like moving to more cores. Facebook found that when they moved to 8-core machines, a global lock protecting stats collection used 20-30% of CPU. In applications that require little processing per request, as memcached does, this is not unexpected, but doing real work with your CPU is a better idea. So they collected stats on a per-thread basis and then calculated a global view on demand (a minimal sketch of that pattern follows this list).
  • Interrupt floods and starvation. With so much traffic directed at a single server, the hardware can flood the CPU(s) with interrupts and keep the CPU from doing "real" work. To get around this problem Facebook implemented some complicated strategies to load-balance IO across all the cores. As I am less clever, I might try more network cards with a TCP offload engine.
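Here is a minimal sketch of the per-thread stats idea in Python (purely illustrative of the pattern, not Facebook's C code): each thread increments its own counters with no shared lock on the hot path, and a global view is summed up only when someone asks for it.

```python
import threading
from collections import Counter

# Each worker thread gets its own Counter, so the hot path takes no shared lock.
_local = threading.local()
_all_counters = []
_registry_lock = threading.Lock()   # taken once per thread, not once per request

def _my_counter() -> Counter:
    if not hasattr(_local, "stats"):
        _local.stats = Counter()
        with _registry_lock:
            _all_counters.append(_local.stats)
    return _local.stats

def record(event: str):
    """Hot path: bump a thread-private counter, no contention."""
    _my_counter()[event] += 1

def global_stats() -> Counter:
    """Cold path: merge all per-thread counters on demand."""
    total = Counter()
    with _registry_lock:
        for c in _all_counters:
            total.update(c)
    return total

def worker(n):
    for _ in range(n):
        record("get_hits")

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(global_stats())   # Counter({'get_hits': 80000})
```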

When you read Paul's article, keep in mind the incredible number of man-hours that went into profiling the system, not just their application, but the entire software and hardware stack. Then add in the research, planning, and trying different solutions to see if anything changed for the better. It's a lot of work. Notice that using a nifty new parallel language or moving to a cloud wouldn't have made a bit of difference. It's complete mastery of their system that made the difference.

A summary of potential strategies:
  • Profile everything. Problems are always specific. The understanding of the problem must be specific. The fix must be specific.
  • Burn profiling into your regression tests. Detect when and where performance tanks as a regular part of your build.
  • Use resources in proportion to what grows slowest. This requires multiplexing, but at least your resource usage is more predictable and bounded.
  • Batch work. When you have the CPU, do all the work you possibly can within your quantum, or the whole system grinds to a halt in processing overhead (see the batching sketch after this list).
  • Do work and maintain resources per task. Otherwise locking for shared resources takes more and more time when there's less and less time to do the work that needs to be done.
  • Change algorithms. Sometimes you simply need to do things differently. Tweaking will only get you so far.
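A minimal illustration of the batching strategy (my own Python sketch, analogous in spirit to the batched netdevice dequeue mentioned above): instead of waking up, locking, and processing one item at a time, the consumer drains everything currently queued in one pass, so one wakeup amortizes over many items.

```python
import queue
import threading

def batch_consumer(q: queue.Queue, handle_batch, stop: threading.Event):
    """Drain the queue in batches: block for the first item, then grab the rest."""
    while not stop.is_set() or not q.empty():
        try:
            first = q.get(timeout=0.1)      # block for the first item only
        except queue.Empty:
            continue
        batch = [first]
        while True:                         # then take whatever else is already queued
            try:
                batch.append(q.get_nowait())
            except queue.Empty:
                break
        handle_batch(batch)

# Usage sketch: handle_batch receives lists of items instead of single items.
q, stop = queue.Queue(), threading.Event()
t = threading.Thread(target=batch_consumer,
                     args=(q, lambda b: print(f"processed {len(b)} items"), stop))
t.start()
for i in range(1000):
    q.put(i)
stop.set()
t.join()
```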

You can find their changes on github, the hub that says "git."
Tuesday
Dec 9, 2008

Rules of Thumb in Data Engineering

This is an interesting and still relevant research paper by Jim Gray and Prashant Shenoy at Microsoft Research that examines the rules of thumb for the design of data storage systems. It looks at storage, processing, and networking costs, ratios, and trends, with a particular focus on performance and price/performance. Jim Gray has an updated presentation on this interesting topic: Long Term Storage Trends and You. Robin Harris has a great post that reflects on the Rules of Thumb whitepaper on his StorageMojo blog: Architecting the Internet Data Center - Parts I-IV.

Click to read more ...

Saturday
Dec 6, 2008

Paper: Real-world Concurrency

An excellent article by Bryan Cantrill and Jeff Bonwick on how to write multi-threaded code. With more processors and no magic bullet solution for how to use them, knowing how to write multiprocessor code that doesn't screw up your system is still a valuable skill. Some topics:

  • Know your cold paths from your hot paths.
  • Intuition is frequently wrong—be data intensive.
  • Know when—and when not—to break up a lock.
  • Be wary of readers/writer locks.
  • Consider per-CPU locking.
  • Know when to broadcast—and when to signal.
  • Learn to debug postmortem.
  • Design your systems to be composable.
  • Don't use a semaphore where a mutex would suffice.
  • Consider memory retiring to implement per-chain hash-table locks.
  • Be aware of false sharing.
  • Consider using nonblocking synchronization routines to monitor contention.
  • When reacquiring locks, consider using generation counts to detect state change.
  • Use wait- and lock-free structures only if you absolutely must.
  • Prepare for the thrill of victory—and the agony of defeat.

While I don't agree that code using locks can be made composable, this article covers a lot of very useful nitty-gritty details that will up your expert rating a couple of points. A small illustration of the broadcast-versus-signal point follows.
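This is my own Python sketch, not from the paper (which discusses C): `notify()` wakes a single waiter when exactly one unit of work is available, while `notify_all()` is for state changes that every waiter must wake up and re-examine, such as shutdown.

```python
import threading
from collections import deque

class WorkQueue:
    def __init__(self):
        self._items = deque()
        self._closed = False
        self._cv = threading.Condition()

    def put(self, item):
        with self._cv:
            self._items.append(item)
            self._cv.notify()        # signal: exactly one waiter can consume this item

    def close(self):
        with self._cv:
            self._closed = True
            self._cv.notify_all()    # broadcast: every waiter must wake and re-check state

    def get(self):
        with self._cv:
            while not self._items and not self._closed:
                self._cv.wait()
            if self._items:
                return self._items.popleft()
            return None              # closed and drained

def worker(q, name):
    while (item := q.get()) is not None:
        print(f"{name} handled {item}")

q = WorkQueue()
threads = [threading.Thread(target=worker, args=(q, f"w{i}")) for i in range(3)]
for t in threads: t.start()
for i in range(6):
    q.put(i)
q.close()
for t in threads: t.join()
```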

Click to read more ...

Friday
Dec 5, 2008

Scalability Perspectives #4: Kevin Kelly – One Machine

Scalability Perspectives is a series of posts that highlights the ideas that will shape the next decade of IT architecture. Each post is dedicated to a thought leader of the information age and his vision of the future. Be warned though – the journey into the minds and perspectives of these people requires an open mind. Warning #2: this post is wild.

Kevin Kelly

Kevin Kelly is Senior Maverick at Wired magazine. He helped launch Wired in 1993, and served as its Executive Editor until January 1999. He co-founded the ongoing Hackers' Conference, and was involved with the launch of the WELL, a pioneering online service started in 1985. He authored the best-selling New Rules for the New Economy and the classic book on decentralized emergent systems, Out of Control.

One Machine

There is only one time in the history of each planet when its inhabitants first wire up its innumerable parts to make one large Machine. Later that Machine may run faster, but there is only one time when it is born. You and I are alive at this moment. Is this global web of computers, servers and trunk lines a mere mechanical circuit, a very large tool, or does it reach a threshold where something, well, different happens?

Kevin Kelly's hypothesis is this: The rapidly increasing sum of all computational devices in the world connected online, including wirelessly, forms a superorganism of computation with its own emergent behaviors.

I define the One Machine as the emerging superorganism of computers. It is a megasupercomputer composed of billions of sub computers. The sub computers can compute individually on their own, and from most perspectives these units are distinct complete pieces of gear. But there is an emerging smartness in their collective that is smarter than any individual computer. We could say learning (or smartness) occurs at the level of the superorganism.

The Next 6500 Days of the Web

Kevin Kelly recently gave a short talk on the upcoming Web 10.0 at the Web 2.0 Summit in San Francisco. It is like an update to his previous TED talk on Predicting the next 5000 days of the web. He makes us realize that the Web is only around 6500 days old and argues that the next 6500 days will be something entirely different.

Dimensions of the One Machine

Kevin Kelly's post on his blog The Technium back from 2007 shows us the dimensions of the One Machine: The next stage in human technological evolution is a single thinking/web/computer that is planetary in dimensions. This planetary computer will be the largest, most complex and most dependable machine we have ever built. It will also be the platform that most business and culture will run on.

Today it contains approximately 1.2 billion personal computers, 2.7 billion cell phones, 1.3 billion land phones, 27 million data servers, and 80 million wireless PDAs. The processor chips of all these parts are feeding the computation of the internet/web/telecommunications system. A very rough estimate of the computing power of this Machine then is that it contains a billion times a billion, or one quintillion (10^18), transistors. There are about 100 billion neurons in the human brain. Today the Machine has 5 orders of magnitude more transistors than you have neurons in your head. And the Machine, unlike your brain, is doubling in power every couple of years at the minimum.

If the Machine has 100 quadrillion transistors, how fast is it running? If we include spam, there are 196 billion emails sent every day. That's 2.2 million per second, or 2 megahertz. Every year 1 trillion text messages are sent. That works out to 31,000 per second, or 31 kilohertz. Each day 14 billion instant messages are sent, at 162 kilohertz. The number of searches runs at 14 kilohertz. Links are clicked at the rate of 520,000 per second, or 0.5 megahertz.

There are 20 billion visible, searchable web pages and another 900 billion dark, unsearchable, or deep web pages. The average number of links found on each searchable web page is 62. Assuming the same count for dynamic pages, that means there are 55 trillion links in the full web. We could think of each link as a synapse -- a potential connection waiting to be made. There are roughly between 100 billion and 100 trillion synapses in the human brain, which puts the Machine in the same neighborhood as our brains.

We could start by saying the Machine currently has 1 HB (Human Brain) equivalent. That measure might hold up for a decade or so, but after it gets to 100 HB, or 10,000 HB, it begins to feel like using inches to measure galactic space. Check out Kevin Kelly's blog for the conclusions and more (wild?) ideas. How do You see the future of the Web?
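Kelly's per-second figures are easy to re-derive; the short script below (my own check, using the totals quoted in the post) reproduces the rough rates above.

```python
SECONDS_PER_DAY = 24 * 60 * 60           # 86,400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

print(196e9 / SECONDS_PER_DAY)      # emails:  ~2.27 million per second (~2 "MHz")
print(1e12  / SECONDS_PER_YEAR)     # texts:   ~31,700 per second (~31 "kHz")
print(14e9  / SECONDS_PER_DAY)      # IMs:     ~162,000 per second (~162 "kHz")
print((20e9 + 900e9) * 62)          # links:   ~5.7e13, i.e. roughly 55-57 trillion
```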

Information Sources

Click to read more ...

Friday
Dec 5, 2008

Sprinkle - Provisioning Tool to Build Remote Servers

At 37 Signals, Joshua Sierles describes how they use Sprinkle to configure their servers within EC2. Sprinkle defines a domain-specific meta-language for describing and processing the installation of software. You can find an interesting discussion of Sprinkle's creation story by the creator himself, Marcus Crafter, in Sprinkle Some Powder!. Marcus divides provisioning tools into two categories:

  • Task Based - the tool issues a list of commands to run on the remote system, either remotely via a network connection or smart client.
  • Policy/state Based - the tool determines what needs to be run on the remote system by examining its current and final state.

Sprinkle combines both models in a chocolate-in-my-peanut-butter approach, using normal Ruby code as the DSL (domain-specific language) to declaratively describe remote system configurations. 37 Signals likes the use of Ruby as the DSL because it makes learning a separate syntax unnecessary. I've successfully done similar things in Perl. You already have a scripting language, why layer another one on top? One reason not to is that you've now tied configuration and execution together so that only one tool can control the process, but the leverage is so high with this approach it's hard to ignore.

There are all the usual bits about defining packages, dependencies, installation logic, pre and post actions, etc. The format is compact and clear because that's how Ruby is, and the operations are task-specific so there's no fluff. Capistrano is used to communicate with remote systems, though that is pluggable. 37 Signals uses the EC2 security group as a way to specify the role an instance should take on when it boots. A configuration script that can handle all roles is shipped with a nearly complete functional base image. Sprinkle then configures the system the rest of the way based on the passed-in role.

Joshua says they like this approach better than Puppet because it doesn't rely on a centralized configuration server or "pushing large sets of commands over SSH manually." There's always more than one way to do "it", and Sprinkle carves out an interesting niche in the provisioning space. The 37 Signals approach doesn't scale to a large organization with many different flavors of servers, but for a specific set of tightly cooperating servers it's a very simple, clean, and robust way of doing business. Related Articles: Product: Puppet the Automated Administration System.
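To make the task-based versus policy/state-based distinction concrete, here is a tiny, hypothetical Python sketch (Sprinkle itself is a Ruby DSL; this is only the idea, and the package names and commands are assumptions): a policy-style step declares the desired state, checks whether it already holds, and only then runs the installation task, which makes re-running it safe.

```python
import shutil
import subprocess

class Package:
    """A hypothetical policy-style provisioning step: declare the desired state,
    check whether it already holds, and only then run the installation command."""

    def __init__(self, name, check_cmd, install_cmd):
        self.name = name
        self.check_cmd = check_cmd        # binary that should exist if state is met
        self.install_cmd = install_cmd    # task to run if it doesn't

    def satisfied(self) -> bool:
        return shutil.which(self.check_cmd) is not None

    def apply(self):
        if self.satisfied():
            print(f"{self.name}: already installed, nothing to do")
            return
        print(f"{self.name}: converging -> {' '.join(self.install_cmd)}")
        subprocess.run(self.install_cmd, check=True)

# Purely illustrative role definition, in the spirit of boot-time role selection.
web_role = [
    Package("nginx", check_cmd="nginx", install_cmd=["apt-get", "install", "-y", "nginx"]),
    Package("git",   check_cmd="git",   install_cmd=["apt-get", "install", "-y", "git"]),
]

for step in web_role:
    step.apply()     # safe to run repeatedly; steps only act when state is unmet
```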

Click to read more ...