Entries by Todd Hoff (380)

Tuesday
Jan 13, 2009

Product: Gearman - Open Source Message Queuing System

Update: New Gearman Server & Library in C, MySQL UDFs. Gearman is an open source message queuing system that makes it easy to do distributed job processing using multiple languages. With Gearman you can farm out work to other machines, dispatch function calls to machines better suited to do the work, do work in parallel, load balance lots of function calls, call functions between languages, and spread CPU usage around your network. Gearman is used by companies like LiveJournal, Yahoo!, and Digg. Digg, for example, runs 300,000 jobs a day through Gearman without any issues. Most large sites use something similar. Why would anyone ever need a message queuing system? Message queuing is a handy way to move work off your web servers (like image manipulation), to generate thousands of documents in the background, to run the multiple requests in parallel needed to build a web page, or to perform tasks that can comfortably run in the background rather than as part of the main loop for servicing a web request. There's a gearmand server and clients written in Perl, Ruby, Python, or C. Use at least two gearmand daemons for higher availability. The functions each worker can perform are registered with gearmand, which distributes requests for those functions to the workers that can handle them. Gearman uses a very robust, if somewhat higher latency, signal-and-pull architecture.

  • According to dormando the flow goes like: * worker connects to all gearmand servers. * worker registers what functions it supports. * worker asks for jobs. * if no jobs, sends the command 'pre_sleep' to all gearmands and sleeps.
  • Client does: * Connects to gearmand. * Submits a job for a particular function.
  • Gearmand does: * Acks the job, finds all *sleeping workers* registered for the function. * Sends them all a 'noop' command to wake them up.
  • Worker does: * Urk, I'm awake now. * Worker asks for jobs. * If jobs, do work. * If no jobs, sends 'pre_sleep' to all gearmands again, etc.

Gearman uses an efficient binary protocol and no XML. There's also a line-based text protocol for admin, so you can use telnet and hook into Nagios plugins. The system makes no guarantees: if there's a failure the client is told about it and the client is responsible for retries. And the queue isn't persistent; if gearmand is restarted the queue is gone.
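A minimal sketch of what that worker/client exchange looks like in code, assuming the python-gearman client library and a gearmand listening on its default port 4730 (the 'reverse' function and the job data are made up for illustration):

```python
import gearman

# Worker process: registers the functions it can perform, then waits for jobs.
gm_worker = gearman.GearmanWorker(['localhost:4730'])

def task_reverse(worker, job):
    # job.data holds the payload submitted by the client.
    return job.data[::-1]

gm_worker.register_task('reverse', task_reverse)
gm_worker.work()  # blocks: pre_sleep until gearmand wakes it with a job
```

```python
import gearman

# Client process: submits a job for a registered function and reads the result.
gm_client = gearman.GearmanClient(['localhost:4730'])
job_request = gm_client.submit_job('reverse', 'Hello Gearman!')
print(job_request.result)  # '!namraeG olleH'
```

The client above waits for the result for simplicity; Gearman also supports background job submission, which is the usual pattern for the "get work off the web servers" use case described above.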

    Related Articles

  • Gearman Wiki
  • Gearman Google Groups
  • Queue everything and delight everyone by Leslie Michael Orchard.
  • USENIX 2007. Starts at slide 83.
  • PEAR and Gearman by Daniel O'Connor.
  • Amazon Architecture


    Thursday
    Jan 08, 2009

    Paper: Sharding with Oracle Database

    The upshot of the paper is Oracle rules and MySQL sucks for sharding. Which is technically probable, if you don't throw in minor points like cost and ease of use. The points where they think Oracle wins: online schema changes, more robust replication, higher availability, better corruption handling, better use of large RAM and multiple cores, better and better tested partitioning features, better monitoring, and better gas mileage.


    Monday
    Jan 05, 2009

    Lessons Learned at 208K: Towards Debugging Millions of Cores

    How do we debug and profile a cloud full of processors and threads? It's a problem more of us will be seeing as we code big scary programs that run on even bigger, scarier clouds. Logging gets you far, but sometimes finding the root cause of a problem requires delving deep into a program's execution. I don't know about you, but setting up 200,000+ gdb instances doesn't sound all that appealing. Tools like STAT (Stack Trace Analysis Tool) are being developed to help with this huge task. STAT "gathers and merges stack traces from a parallel application's processes." So STAT isn't a low-level debugger, but it will help you find the needle in a million haystacks. Abstract:

    Petascale systems will present several new challenges to performance and correctness tools. Such machines may contain millions of cores, requiring that tools use scalable data structures and analysis algorithms to collect and to process application data. In addition, at such scales, each tool itself will become a large parallel application – already, debugging the full BlueGene/L (BG/L) installation at the Lawrence Livermore National Laboratory requires employing 1664 tool daemons. To reach such sizes and beyond, tools must use a scalable communication infrastructure and manage their own tool processes efficiently. Some system resources, such as the file system, may also become tool bottlenecks. In this paper, we present challenges to petascale tool development, using the Stack Trace Analysis Tool (STAT) as a case study. STAT is a lightweight tool that gathers and merges stack traces from a parallel application to identify process equivalence classes. We use results gathered at thousands of tasks on an Infiniband cluster and results up to 208K processes on BG/L to identify current scalability issues as well as challenges that will be faced at the petascale. We then present implemented solutions to these challenges and show the resulting performance improvements. We also discuss future plans to meet the debugging demands of petascale machines.
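    The core idea is easy to picture. Here is a simplified sketch (not STAT's actual implementation, and ignoring the tree-based daemon communication) of merging per-process stack traces into a prefix tree so that processes sharing a call path collapse into equivalence classes:

```python
from collections import defaultdict

def merge_traces(traces):
    """traces: {rank: [frame, frame, ...]} ->
    prefix tree mapping each call path to the set of ranks sharing it."""
    tree = defaultdict(set)
    for rank, frames in traces.items():
        path = ()
        for frame in frames:
            path += (frame,)
            tree[path].add(rank)
    return tree

# Three MPI ranks; two are stuck in the same place, one is the outlier
# worth attaching a real debugger to.
traces = {
    0: ["main", "solve", "MPI_Waitall"],
    1: ["main", "solve", "MPI_Waitall"],
    2: ["main", "write_checkpoint"],
}
for path, ranks in sorted(merge_traces(traces).items()):
    print(" > ".join(path), "-> ranks", sorted(ranks))
```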

    Lessons Learned

    At the end of the paper they identify several insights they had about developing petascale tools:
  • We find that sequential daemon launching becomes a bottleneck at this scale. We improve both scalability and portability by eschewing ad hoc sequential launchers in favor of LaunchMON, a portable daemon spawner that integrates closely with native resource managers.
  • As daemons run, we find that it is critical that they avoid data structures that represent, or even reserve space to represent, a global view. Instead, we adopt a hierarchical representation that dramatically reduces data storage and transfer requirements at the fringes of the analysis tree.
  • We find that seemingly-independent operations across daemons can suffer scalability bottlenecks when accessing a shared resource, such as the file system. Our scalable binary relocation service is able to optimize the file operations and reduce file system accesses to constant time regardless of system size.

    Unsurprisingly, these lessons aren't much different from the ones other builders of scalable programs have had to learn.

    Related Articles

  • Livermore Lab pioneers debugging tool by Joab Jackson in Government Computer News.


    Sunday
    Jan 04, 2009

    Paper: MapReduce: Simplified Data Processing on Large Clusters

    Update: MapReduce and PageRank Notes from Remzi Arpaci-Dusseau's Fall 2008 class. Collects interesting facts about MapReduce and PageRank. For example, the history of the solution to searching for the term "flu" is traced through multiple generations of technology. With Google entering the cloud space with Google AppEngine and a maturing Hadoop product, the MapReduce scaling approach might finally become a standard programmer practice. This is the best paper on the subject and is an excellent primer on a content-addressable memory future. Some interesting stats from the paper: Google executes 100k MapReduce jobs each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory. One common criticism from ex-Googlers is that it takes months to get up to speed and be productive in the Google environment. Hopefully a way will be found to lower the learning curve and make programmers productive faster. From the abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day. Thanks to Kevin Burton for linking to the complete article.
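    The programming model itself is tiny. Here is a toy, single-process word count illustrating the map and reduce functions the abstract describes; the real system's value is in distributing these phases across thousands of machines, handling failures, and doing the shuffle efficiently:

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # map(k1, v1) -> list of (k2, v2): emit (word, 1) for each word.
    for word in text.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    # reduce(k2, list of v2) -> result: sum the partial counts per word.
    return word, sum(counts)

documents = {"d1": "the quick brown fox", "d2": "the lazy dog"}

# The "shuffle" step: group intermediate values by key, as the runtime would.
grouped = defaultdict(list)
for doc_id, text in documents.items():
    for word, count in map_phase(doc_id, text):
        grouped[word].append(count)

print(dict(reduce_phase(w, c) for w, c in grouped.items()))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```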

    Related Articles

  • MapReducing 20 petabytes per day by Greg Linden
  • 2004 Version of the Article by Jeffrey Dean and Sanjay Ghemawat


    Friday
    Jan 02, 2009

    Strategy: Understanding Your Data Leads to the Best Scalability Solutions

    In the article Building Super-Scalable Web Systems with REST, Udi Dahan tells an interesting story of how they made a weather reporting system scale to over 10 million users. Having that many users hit their weather database directly didn't scale. Caching in a straightforward way wouldn't work because weather is obviously local. Caching all local reports would bring the entire database into memory, which would work for some companies, but wasn't cost efficient for them. So in typical REST fashion they turned locations into URIs. For example: http://weather.myclient.com/UK/London. This allows the weather information to be cached by intermediaries instead of hitting their servers. Hopefully each location will hit their servers only a few times and then the caches will be hit until expiry. In order to send users directly to the correct location, an IP location check is performed on login and stored in a cookie. The lookup is done once and from then on a GET is performed directly on the resource. There's no need to hit their servers and do a lookup on the user to get the location. That's all bypassed. I like Udi's summary of the approach, and it's why I think this is a good strategy: This isn't a "cheap trick". While being straight forward for something like weather, understanding the nature of your data and intelligently mapping that to a URI space is critical to building a scalable system, and reaping the benefits of REST.
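    A minimal sketch of the idea (not Udi's actual implementation; Flask, the route layout, and the lookup function are my own stand-ins): give each location its own URI and attach caching headers so intermediaries, not the origin servers, absorb the repeat traffic.

```python
from flask import Flask, jsonify

app = Flask(__name__)

def lookup_weather(country, city):
    # Hypothetical stand-in for the real weather database query.
    return {"location": f"{country}/{city}", "temp_c": 6, "conditions": "rain"}

@app.route("/<country>/<city>")
def weather(country, city):
    resp = jsonify(lookup_weather(country, city))
    # Let proxies and CDNs serve this representation for an hour; the
    # database is only consulted when a cached copy expires.
    resp.headers["Cache-Control"] = "public, max-age=3600"
    return resp

if __name__ == "__main__":
    app.run()
```

    Once the client knows its location (from the one-time IP lookup stored in a cookie), every subsequent request is a plain GET on a cacheable URI like /UK/London.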


    Monday
    Dec 29, 2008

    100% on Amazon Web Services: Soocial.com - a lesson of porting your service to Amazon

    Simone Brunozzi, technology evangelist for Amazon Web Services in Europe, describes how Soocial.com was fully ported to Amazon Web Services.

    ---------------- This period of the year I decided to dedicate some time to better understand how our customers use AWS, therefore I spent some online time with Stefan Fountain and the nice guys at Soocial.com, a "one address book solution to contact management", and I would like to share with you some details of their IT infrastructure, which now runs 100% on Amazon Web Services!

    In the last few months, they've been working hard to cope with tens of thousands of users and to get ready to easily scale to millions. To make this possible, they decided to move ALL their architecture to Amazon Web Services. Despite the fact that they were quite happy with their previous hosting provider, Amazon proved to be the way to go.
    -----------------

    Read the rest of the article here.


    Monday
    Dec 29, 2008

    Platform virtualization - top 25 providers (software, hardware, combined)

    In this article they present the companies that offer the means (mainly the software and hardware) that power most cloud computing hosting providers, namely virtualization solutions.

    Read the entire article about Platform virtualization - top 25 providers (software, hardware, combined) at MyTestBox.com - web software reviews, news, tips & tricks.


    Monday
    Dec 29, 2008

    Paper: Spamalytics: An Empirical Analysis of Spam Marketing Conversion

    Under the philosophy that the best way to analyze spam is to become a spammer, this absolutely fascinating paper recounts how a team of UC Berkeley researchers went undercover to infiltrate a spam network. Part CSI, part Mission Impossible, and part MacGyver, the team hijacked the botnet so that their code was actually part of the dark network itself. Once inside they figured out the architecture and protocols of the botnet and how many sales it was able to tally. Truly elegant work.

    Two different spam campaigns were run on a Storm botnet network of 75,800 zombie computers. Storm is a peer-to-peer botnet that uses spam to creep its tentacles through the world wide computer network. One of the campaigns distributed viruses in order to recruit new bots into the network. This is normally accomplished by enticing people to download email attachments. An astonishing one in ten people downloaded the executable and ran it, which means we won't run out of zombies any time soon. The downloaded components include: backdoor/downloader, SMTP relay, e-mail address stealer, e-mail virus spreader, distributed denial of service (DDoS) attack tool, and an updated copy of the Storm Worm dropper. The second campaign sent pharmaceutical spam ("libido boosting herbal remedy”) over the network.

    Haven't you always wondered who clicks on spam and how much spammers could possibly make? In the study only 28 sales resulted from 350 million spam e-mail messages sent over 26 days, a conversion rate of well under 0.00001% (a typical advertising campaign might have a conversion rate of 2-3%). The average purchase price was about $100, for $2,731.88 in total revenue. The researchers estimate that the total daily revenue attributable to Storm’s pharmacy campaign is about $7,000 and that they pick up between 3,500 and 8,500 new bots per day through their Trojan distribution system. And this is with only 1.5% of the entire network in use.
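    A quick back-of-the-envelope check of those figures (numbers taken from the paper as quoted above):

```python
emails_sent = 350_000_000
sales = 28
total_revenue = 2_731.88

print(f"conversion rate: {sales / emails_sent:.8%}")      # 0.00000800%, well under 0.00001%
print(f"average purchase: ${total_revenue / sales:.2f}")  # $97.57, i.e. about $100
```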

    So the spammers would take in total revenue about $3.5 million a year from one product on one network. Imagine the take with multiple products and multiple networks. That's why we still have spam. And since the conversion rate is already so low, it seems spam will always be with us.

    As fascinating as all the spamonomics are, the explanation of the botnet architecture is just as interesting. Storm uses a three-level self-organizing hierarchy:

  • worker bots - make requests for work and, upon receiving orders, send spam as requested. Workers pull work from higher layers.
  • proxy bots - act as coordinators between workers and master servers.
  • master servers - send commands to the workers and receive their status reports. There are a small number of master servers hosted at “bullet-proof” hosting centers, and they are likely directly managed by the botmaster.

    A host selects its worker or proxy role automatically. If a firewall doesn't prevent inbound communication the infected host becomes a proxy; otherwise the host becomes a worker. Since workers pull work from proxies, there's no need to contact a worker directly. Proxies, on the other hand, are contacted directly by master servers, so that communication must be bidirectional.

    Storm communicates using two separate protocols:
  • An encrypted version of the UDP-based Overnet protocol, used primarily as a directory service to find other nodes. Overnet is a peer-to-peer protocol that uses a distributed hash table mechanism to find peers.
  • A custom TCP-based protocol used by masters to send command and control messages to proxies and workers. Command and control traffic to the worker bots is unencrypted, which makes a man-in-the-middle attack possible and is how the researchers carried out their caper.

    According to Brandon Enright: When a peer wants to find content in the network, it computes (or is given) the hash of that content and then searches adjacent peers. Those peers respond with their adjacent peers that are closer. This is repeated until the searching peer gets close enough to the content that a node there will be able to provide a search result. It's a complicated and interesting process that the Spamalytics paper goes into in a lot more detail, as do some of the references at the end of this post.
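    A toy sketch of that kind of iterative closest-peer lookup (made-up peer IDs and routing tables, and a 4-bit ID space instead of Overnet's real one, just to show the mechanism):

```python
def xor_distance(a: int, b: int) -> int:
    # DHTs like Overnet/Kademlia measure "closeness" with XOR distance.
    return a ^ b

# Hypothetical overlay: each peer ID maps to the peers it already knows about.
ROUTING = {
    0b0001: [0b0100, 0b1000],
    0b0100: [0b0110, 0b0001],
    0b0110: [0b0111, 0b0100],
    0b0111: [0b0110],
    0b1000: [0b0001],
}

def iterative_find(target: int, start: int) -> int:
    known = {start}
    closest = None
    while True:
        best = min(known, key=lambda pid: xor_distance(pid, target))
        if best == closest:                  # no closer peer found: converged
            return best
        closest = best
        known.update(ROUTING.get(best, []))  # "query" the peer for closer peers

print(bin(iterative_find(target=0b0111, start=0b0001)))  # hops 0001 -> 0100 -> 0110 -> 0111
```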

    Storm harnesses a large, unreliable, constantly changing distributed system to do work. It's an architecture worth learning from and we'll explore some of those lessons in a later post.

    Related Articles

  • On the Spam Campaign Trail
  • Scaling Spam Eradication Using Purposeful Games: Die Spammer Die!
  • Can cloud computing smite down evil zombie botnet armies?
  • Inside the Storm: Protocols and Encryption of the Storm Botnet by Joe Stewart, GCIG Director of Malware Research, SecureWorks
  • Exposing Stormworm by Brandon Enright. A lot of excellent low level protocol details.
  • Storm Botnet
  • Global Guerrillas by John Robb - Networked tribes, systems disruption, and the emerging bazaar of violence. Resilient Communities, decentralized platforms, and self-organizing futures.
    Sunday
    Dec 21, 2008

    The I.H.S.D.F. Theorem: A Proposed Theorem for the Trade-offs in Horizontally Scalable Systems

    Successful software design is all about trade-offs. In the typical (if there is such a thing) distributed system, recognizing the importance of trade-offs within the design of your architecture is integral to the success of your system. Despite this reality, I see, time and time again, developers choosing a particular solution based on an ill-placed belief in their solution as a “silver bullet”, or a solution that conquers all, despite the inevitable occurrence of changing requirements. Regardless of the reasons behind this phenomenon, I’d like to outline a few of the methods I use to ensure that I’m making good scalable decisions without losing sight of the trade-offs that accompany them. I’d also like to compile (pun intended) the issues at hand by formulating a simple theorem that we can use to describe this oft-occurring situation.


    Saturday
    Dec 20, 2008

    Second Life Architecture - The Grid

    Update: Presentation: Second Life’s Architecture. Ian Wilkes, VP of Systems Engineering, describes the architecture used by the popular game Second Life. Ian presents how the architecture looked at its debut and how it evolved over the years as users and features were added. Second Life is a 3-D virtual world created by its Residents. Virtual worlds are expected to become more and more popular on the internet, so their architecture might be of interest. Especially important is the appearance of open virtual worlds, or metaverses. What happens when video games meet Web 2.0? What happens is the metaverse.

    Information Sources

    Platform

    • MySQL
    • Apache
    • Squid
    • Python
    • C++
    • Mono
    • Debian

    What's Inside?

    The Stats

    • ~1M active users
    • ~95M user hours per quarter
    • ~70K peak concurrent users (40% annual growth)
    • ~12Gbit/sec aggregate bandwidth (in 2007)

    Staff (in 2006)

    • 70 FTE + 20 part time
    "about 22 are programmers working on SL itself. At any one time probably 1/3 of the team is on infrastructure, 1/3 is on new features and 1/3 is on various maintenance tasks (bug fixes, general stability and speed improvements) or improvements to existing features. But it varies a lot."

    Software

    Client/Viewer
    • Open Source client
    • Render the Virtual World
    • Handles user interaction
    • Handles locations of objects
    • Gets velocities and does simple physics to keep track of what is moving where
    • No collision detection
    Simulator (Sim)
    Each geographic area (a 256x256 meter region) in Second Life runs on a single instantiation of server software, called a simulator or "sim," and each sim runs on a separate core of a server. The Simulator is the primary SL C++ server process that runs on most servers. As the viewer moves through the world it is handed off from one simulator to another.
    • Runs Havok 4 physics engine
    • Runs at 45 frames/sec. If it can't keep up, it will attempt time dilation without reducing frame rate.
    • Handles storing object state, land parcel state, and terrain height-map state
    • Keeps track of where everything is and does collision detection
    • Sends locations of stuff to viewer
    • Transmits image data in a prioritized queue
    • Sends updates to viewers only when needed (only when collision occurs or other changes in direction, velocity etc.)
    • Runs Linden Scripting Language (LSL) scripts
    • Scripting has been recently upgraded to the much faster Mono scripting engine
    • Handles chat and instant messages
    Asset Server
    • One big clustered filesystem, ~100TB
    • Stores asset data such as textures.

    MySQL database
    Second Life started with one database and has subsequently been forced into clustering. They use a ton of MySQL databases running on Debian machines to handle lots of centralized services. Rather than attempt to build the one, impossibly large database – all hail the Central Database – or one, impossibly large central cluster – all hail the Cluster – Linden Lab instead adopted a divide and conquer strategy based around data partitioning. The good thing is that UUIDs – 128-bit unique identifiers – are associated with most things in Second Life, so partitioning is generally doable.

    Backbone
    Linden Lab has converted much of their backend architecture away from custom C++/messaging into web services. Certain services have been moved off of MySQL – or cached (Squid) between the queries and MySQL. Presence, in particular Agent Presence, i.e. are you online and where are you on the grid, is a particularly tricky kind of query to partition, so there is now a Python service running on the SL grid called Backbone. It proved to be easier to scale, develop and maintain than many of their older technologies, and as a result it plays an increasingly important role in the Second Life platform as Linden Lab migrates their legacy code to web services. Two main components of the backbone are open source:
    • Eventlet is a networking library written in Python. It achieves high scalability by using non-blocking I/O while at the same time retaining high programmer usability by using coroutines to make the non-blocking I/O operations appear blocking at the source code level.
    • Mulib is a REST web service framework built on top of Eventlet.
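    A minimal sketch of the coroutine style Eventlet enables (the presence lookup is a made-up stand-in; only the eventlet package itself is assumed): many "blocking-looking" calls multiplexed over non-blocking I/O.

```python
import eventlet

def lookup_presence(agent_id):
    # Stand-in for a network call to a presence shard. eventlet.sleep yields
    # to the hub, so other coroutines run while this one "waits".
    eventlet.sleep(0.1)
    return agent_id, "online"

pool = eventlet.GreenPool(size=1000)          # up to 1000 concurrent green threads
for agent_id, status in pool.imap(lookup_presence, range(10)):
    print(agent_id, status)
```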

    Hardware

    • 2000+ Servers in 2007
    • ~6000 Servers in early 2008
    • Plans to upgrade to ~10000 (?)
    • 4 sims per machine, for both class 4 and class 5
    • Used all-AMD for years, but are moving from the Opteron 270 to the Intel Xeon 5148
    • The upgrade to "class 5" servers doubled the RAM per machine from 2GB to 4GB and moved to a faster SATA disk
    • Class 1 - 4 are on 100Mb with 1Gb uplinks to the core. Class 5 is on pure 1Gb

    Do you have more details?
