Entries by Todd Hoff (380)

Monday
Nov 3, 2008

How Sites are Scaling Up for the Election Night Crush

Election night is a big traffic boost for news and social sites. Yahoo expects up to 400 million page views on Election Day. Data Center Knowledge has an excellent article on how various sites are preparing to handle spikes in election night traffic. Some interesting bits:

  • Prepare ahead. Don't wait to handle spikes; plan and prepare before the blessed event.
  • Use a CDN. Daily Kos puts images on a CDN, but the dynamic nature of their site means they can't use a CDN for their other content.
  • Scale up. Daily Kos: "to handle the traffic better, we moved to a cluster of six quad core Xeons with 8GB RAM for webheads that all boot off a central NFS (Network File System) root, with the capability of adding more webheads as needed." They also "added two 16GB eight-core Xeons and a 6×73GB RAID-10 array for database files running a MySQL master/slave setup."
  • Add Cache. Daily Kos added a 1GB memcached instance to each webhead.
  • Change Caching Strategy. Daily Kos puts fully rendered pages into memcached.
  • Change Serving Strategy. Daily Kos serves cached pages from memcached directly to anonymous users via lighttpd running as the front-end proxy. This moves a lot of work off the backend and spreads it across the new, heftier webheads. Site performance has improved greatly. (A rough sketch of this pattern follows below.)
  • Add Capacity. Limelight expanded its network capacity to over 2 Terabytes per second.

Tonight is a big night for a lot of sites. It's interesting to see how some are responding to the challenge. A lot of what they are doing will work for you too.
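To make the serving strategy concrete, here is a rough sketch of the idea, not Daily Kos's actual code: anonymous requests are answered straight from memcached with fully rendered pages, and only misses (or logged-in users) reach the application. It assumes a local memcached, the python-memcached client, and a hypothetical session_id cookie as the login marker.

```python
# Rough sketch (not Daily Kos's actual code) of serving fully rendered pages
# from memcached to anonymous users, falling back to the normal render path.
# Assumes a local memcached and the python-memcached client library.
import memcache

mc = memcache.Client(["127.0.0.1:11211"])

def handle_request(path, cookies, render_page):
    """render_page is the expensive backend render function."""
    anonymous = "session_id" not in cookies      # hypothetical login cookie name
    if anonymous:
        cached = mc.get("page:" + path)
        if cached is not None:
            return cached                        # cache hit: backend never touched
    html = render_page(path)
    if anonymous:
        # Store the fully rendered page so the next anonymous hit is cache-only.
        mc.set("page:" + path, html, time=60)
    return html
```

In Daily Kos's setup the equivalent check happens in lighttpd at the proxy layer, which keeps even cache hits off the application servers.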


Sunday
Nov 2, 2008

    Strategy: How to Manage Sessions Using Memcached

    Dormando shows an enlightened middle way for storing sessions in both a cache and the database. Sessions are a perfect cache candidate because they are transient, smallish, and accessed on almost every page request, so removing all that load from the database is a good thing. But as Dormando points out, session caches have problems. If you remove expiration times from the cache and then run out of memory, no more logins. If a cache server fails or needs to be upgraded, you just logged out a bunch of potentially angry users. The middle ground Dormando proposes is using both the cache and the database:

  • Reads: read from the cache first, then the database. Typical cache logic.
  • Writes: write to memcached every time, and write to the database every N seconds (assuming the data has changed). There's a small chance of data loss, but you've still greatly reduced the database load while providing reliability. Nice solution. A minimal sketch of this pattern follows.
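    Here is a minimal sketch of that read-through/write-behind pattern, under stated assumptions: the python-memcached client, a hypothetical db object with load_session/save_session helpers, and bookkeeping fields stored inside the session itself.

```python
# Sketch of the cache-plus-database session pattern described above.
# Assumes the python-memcached client and hypothetical db.load_session /
# db.save_session helpers; error handling is omitted for brevity.
import time
import memcache

mc = memcache.Client(["127.0.0.1:11211"])
WRITE_INTERVAL = 60  # flush to the database at most once per minute per session

def read_session(session_id, db):
    key = "sess:" + session_id
    session = mc.get(key)                        # 1. try the cache first
    if session is None:
        session = db.load_session(session_id)    # 2. fall back to the database
        if session is not None:
            mc.set(key, session)
    return session

def write_session(session_id, session, db):
    now = time.time()
    if session.get("_dirty") and now - session.get("_last_db_flush", 0) >= WRITE_INTERVAL:
        db.save_session(session_id, session)      # write-behind: hit the DB every N seconds at most
        session["_last_db_flush"] = now
        session["_dirty"] = False
    mc.set("sess:" + session_id, session)         # write to memcached on every request
```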


Wednesday
Oct 29, 2008

    CTL - Distributed Control Dispatching Framework 

    CTL is a flexible distributed control dispatching framework that enables you to break management processes into reusable control modules and execute them in distributed fashion over the network. From their website:

    What does CTL do? CTL helps you leverage your current scripts and tools to easily automate any kind of distributed systems management or application provisioning task. It's good for simplifying large-scale scripting efforts, or as another tool in your toolbox that helps you speed through your daily mix of ad-hoc administration tasks.

    What are CTL's features? CTL has many features, but the general highlights are:

  • Execute sophisticated procedures in distributed environments - Aren't you tired of writing and then endlessly modifying scripts that loop over nodes and invoke remote actions? CTL dispatches actions to remote controllers with network transparency (over SSH), parallelism, and error handling already built in.
  • Comes with pre-built utilities - CTL comes with pre-built utilities so you don't have to script actions like file distribution or process and port checking.
  • Define your own automation using the tools/languages you already know - New controller modules are defined in XML and your scripting can be done in multiple scripting languages (Perl, Python, etc.), *nix shell, Windows batch, and/or Ant.
  • Cross-platform administration - CTL is Java-based and works on *nix and Windows.
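    CTL's own modules are defined in XML, which isn't reproduced here; the sketch below is just a generic Python illustration of the underlying pattern a dispatcher like this automates: run one command on many hosts over SSH, in parallel, with basic error handling. It assumes ssh is on the PATH and key-based authentication is set up.

```python
# Generic illustration of what a distributed control dispatcher automates
# (this is NOT CTL's XML module format): run one command on many hosts over
# SSH in parallel and collect per-host results.  Assumes key-based ssh auth.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_on_host(host, command, timeout=60):
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, command],
            capture_output=True, text=True, timeout=timeout,
        )
        return host, result.returncode, result.stdout, result.stderr
    except subprocess.TimeoutExpired:
        return host, -1, "", "timed out"

def dispatch(hosts, command, parallelism=10):
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        for host, code, out, err in pool.map(lambda h: run_on_host(h, command), hosts):
            status = "ok" if code == 0 else f"FAILED ({code})"
            print(f"{host}: {status}\n{out or err}")

# Example (hypothetical hostnames): dispatch(["web01", "web02", "db01"], "uptime")
```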

    Related Articles

  • Blog.Control.Tier
  • Introduction to CTL - a very nice overview of features.
  • Interesting Puppet Thread Mentioning CTL


Monday
Oct 27, 2008

    Notify.me Architecture - Synchronicity Kills

    What's cool about starting a new project is you finally have a chance to do it right. You of course eventually mess everything up in your own way, but for that one moment the world has a perfect order, a rightness that feels satisfying and good. Arne Claassen, the CTO of notify.me, a brand new real-time notification delivery service, is in this honeymoon period now. Arne has been gracious enough to share with us his philosophy of how to build a notification service. I think you'll find it fascinating because Arne goes into a lot of useful detail about how his system works.

    His main design philosophy is to minimize the bottlenecks that form around synchronous access, that is, when some resource is requested and the requestor ties up more resources waiting for a response. If the requested resource can't be delivered in a timely manner, more and more requests pile up until the server can't accept any new ones. Nobody gets what they want and you have an outage. Breaking synchronous operations into asynchronous operations, by separating request and response into separate message-passing actions, stops the resource overload. Instead of a system going down from too many parallel requests, it can work its way through a backlog of requests as fast as it can. And in most cases the request/response cycles are so fast that they appear like a linear sequence of events.

    Notify.me is taking the innovative and risky strategy of using ejabberd, an XMPP based system, as their internal messaging and routing layer. Will Erlang and Mnesia (Erlang's database) be able to keep up with traffic and keep latencies low as traffic scales? It will be interesting to find out. If you are interested in notify.me they've kindly offered 500 beta accounts for HS readers: http://notify.me/user/account/create/highscale
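    As a rough illustration of that philosophy (not notify.me's code), the sketch below accepts requests by enqueueing messages and lets a worker drain the backlog, so overload shows up as a growing queue rather than as exhausted request handlers. The fetch_resource and send_response names are hypothetical stand-ins for the real work.

```python
# Minimal sketch of trading synchronous request/response for message passing:
# requests go onto a queue immediately and a worker drains the backlog,
# so overload produces a backlog instead of exhausted connections.
# fetch_resource/send_response are hypothetical stand-ins for the real work.
import queue
import threading

requests = queue.Queue()

def accept_request(request_id, payload):
    # Returns immediately; the caller is never blocked on the slow resource.
    requests.put((request_id, payload))

def worker(fetch_resource, send_response):
    while True:
        request_id, payload = requests.get()      # blocks only the worker
        result = fetch_resource(payload)          # the slow, contended part
        send_response(request_id, result)         # the "response" is its own message
        requests.task_done()

# threading.Thread(target=worker, args=(fetch_resource, send_response), daemon=True).start()
```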

    Who are you?

    My name is Arne Claassen and I'm the CTO of notify.me. I've been working on highly scalable web-based applications and services for the past decade. These sites have employed various combinations of traditional scaling techniques such as server farms, caching, content pregeneration, and highly available databases using replication and clustering. All of these techniques are ways to mitigate contention by many users for scarce resources (generally the database). Knowing the benefits and pitfalls of these techniques, it has become my focus to architect systems that circumvent scarce-resource scenarios.

    What is notify.me, why did you make it, and why is it a good thing?

    notify.me is the brainchild of Jason Wieland, our CEO. It's a near real-time notification service that alerts users to new content published on the web. It was created to address the common user pain of staying up to date on time-critical events that occur on the web. For instance, a user searching for an apartment on Craigslist would want to be alerted once a new one matching their search criteria is posted. notify.me does the grunt work of repeatedly checking Craigslist for new listings and alerts the user once a new one is posted. Notifications can be delivered to instant messenger, desktop application, mobile device, email, and web application. Our goal is to create and publish open APIs allowing people to build new and interesting applications for generating and delivering information.

    How does your service compare to other services people might be familiar with? Like Twitter, Friend Feed, Gnip, or Yahoo Pipes?

    There are quite a few companies in our competitive landscape, some of which are direct competitors, like yotify.com or alerts.com. The main difference is our approach. Yotify and Alerts are focused on being notification portal sites for users to visit. notify.me is a utility, with a focus on offering all the functionality available on the website via XMPP and REST APIs, allowing users to interact with notifications from the application of their choosing. We also allow for escalation of messages to destinations. For example, if a user is not logged into their IM or has a status of away, notifications can be escalated and routed to their mobile device.

    In the messaging arena, we are nearly the opposite of Twitter. Twitter is an inward-facing publishing model based on its own user-generated content. Someone makes a tweet and it gets published to their followers. notify.me is creating an externally facing message delivery system. Users add any website that supports web feed standards, or redirect existing notification emails to us. If anything, we are a messaging pipeline that is complementary to Twitter (more on that below).

    FriendFeed does a great job of combining all your social networks into one centralized area. Their primary focus is to build features and tools to interact with the mashed feed. This feed would be perfect to add as a source to notify.me, allowing a user to receive all social network updates over instant messenger. Being able to know in near real time when you have a new posting on your wall, so that you can immediately respond, is a feature the social addicts want.

    Yahoo Pipes would be considered a possible partner, similar to how they upsell Netvibes and Newsgator. Their focus is to provide an intuitive programming interface for manipulating feeds and creating useful mashups. For example, the Hot Deals Search is a nifty pipe that searches over a collection of sites for the best deal. Users might not want to use Yahoo's own notification options due to the limited choice of destinations. In our beta group we've seen similar activity with users adding eBay links. eBay has a notification pipeline that competes with notify.me; however, users still add eBay search links to our service. It turns out that they would like one central place to manage their various news feeds.

    Gnip is a pure infrastructure play. We have similar technology but we are going after completely different markets.

    An additional core feature of our product that we have not yet exposed is that our pipeline is bi-directional, i.e. any data source can also be a destination and vice versa. The primary benefit of this is the ability to respond to messages, such as acknowledging a support ticket you received. Bi-directional communication will require integration with the notify.me API, through which a source can communicate reply options. We are currently developing a deep integration with the Twitter API to provide two-way capabilities for tweets via IM in the same channel that you already receive your other notifications.

    Can you explain the different parts of notify.me and how they connect together?

    In general terms our system consists of three subsystems, each of which has a number of implementations:

    1. Ingestion consists of RSS and email ingestors, which constantly check the user's email address (username@notify.me) and the user's feeds for new data. New data is turned into notifications which are propagated to routing.
    2. Routing is responsible for getting the user's notifications to the right delivery components. Routing is the point in the system that the user interacts with for management, such as changing sources and destinations, and viewing history. Notification history is a specialized delivery component, allowing all messages to be perused via the website, even after they have gone through the entire pipeline.
    3. Delivery is currently comprised of history (which always gets the messages), XMPP IM, SMS and email, with private RSS, AIM and MSN in development.

    On a more technical level, the topology of this system is comprised of two separate message buses:

    1. Store-and-forward queues (using SimpleMQ)
    2. XMPP (using ejabberd)

    Store-and-forward queues are used by the ingestion side to distribute the work of ingestion, and generally anywhere data is handled before it becomes visible to the routing rules of the user. This allows for scaling flexibility as well as process isolation during a component failure.

    The XMPP bus is called the Avatar bus, named thusly because every data-owning entity is represented by a daemon process that is the sole authority for that entity's data. We have four types of avatars: Monitor, Agent, Source and User.

    1. Monitor avatars are simply the responsible parties for observing instance health and spinning up and shutting down additional computing nodes per demand.
    2. Agent avatars are the delivery gateways that provide presence information of our users into the bus and deliver messages to our users.
    3. Source avatars are ingestors, such as RSS. This avatar pulls new messages from a store-and-forward queue and notifies its subscribers of the new message.
    4. User avatars persist all the configuration and messaging data for a particular user. A user avatar is responsible for receiving new notifications from ingestion avatars, deciding on the routing, and pushing messages to the appropriate delivery agent as that agent declares the ability to execute that delivery.

    What particular challenges did you face and how did you overcome them? What options did you consider and why did you decide to do it another way?

    From the onset, our primary goal was to avoid bottlenecks and hindrances to scaling horizontally. Initially we planned on building the entire system as a stateless flow of messages through queues, with each daemon along the line being responsible for the data flowing through it, merging, multiplexing and routing it to the next point until delivery was achieved. This meant no single part's failure would ever affect the whole, other than queues getting backed up. However, early on we realized that once a message was designated for a user we needed the ability to track where it was and to re-route it dynamically depending on user configuration and presence. This led us to add a bit of inappropriate coupling between daemons via REST APIs. Some of this plumbing still exists, as we're still migrating processing over to the combination queue/async bus architecture.

    As we realized that pure message passing without state was not going to satisfy our dynamic needs, the easy solution was to return to the tried-and-true state keeper, the central relational database. Knowing our scaling goals, this would have introduced a point of failure that sooner or later we would not be able to mitigate. We decided to look at our state in a different way: instead of creating processing units based on function (i.e. ingestion, parsing, transformation, routing, delivery) that queried state based on the data flowing through them, we thought of the units in terms of data ownership, i.e. sources and destinations (users). Once on that track, there was precious little shared state and we were able to change our storage pattern to have each owner of data be responsible for its own data, allowing horizontal scaling of the persistence layer as well as much more efficient caching.

    The remaining need for accessing data across owners is analytics. In many systems analytics is a primary reason for the existence of a central database, since too often facts and dimensions are kept intermingled in the production schema. For our purposes, this data is not a production concern and therefore should never affect live capacity. Usage and state changes are treated as immutable events, which are queued at the point of occurrence into our store-and-forward system. The nature of our store-and-forward queue allows us to automatically gather all these events from all hosts into a central archive which can then be processed into fact and dimension data by ETL processes. This allows for near real-time tracking of usage without affecting user-facing systems.
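    A hedged sketch of that last idea, not notify.me's SimpleMQ code: each host appends usage and state changes as immutable JSON events to a local spool that a store-and-forward shipper can later move to the central archive for ETL. The spool path and event fields are invented for illustration.

```python
# Sketch of treating usage/state changes as immutable events, spooled locally
# for a store-and-forward queue to ship off-host (not notify.me's actual
# SimpleMQ code).  Each line is one self-contained JSON event.
import json
import socket
import time

SPOOL = "/var/spool/events/usage.log"   # hypothetical spool path

def emit_event(event_type, **fields):
    event = {
        "ts": time.time(),
        "host": socket.gethostname(),
        "type": event_type,
        **fields,
    }
    with open(SPOOL, "a") as f:
        f.write(json.dumps(event) + "\n")   # append-only; never read on the hot path

# emit_event("notification_delivered", user="u123", channel="xmpp")
```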

    Could you explain your choice of XMPP a little more? Is it used mainly as a message bus between federated XMPP servers sitting on EC2 nodes? Is the XMPP queue used as the queue for each user's messages from all sources before they are pushed to users?

    We have three different xmpp clusters which take advantage of federation for cross-chatter: user, agent and the avatar bus.

    Users

    This is a regular XMPP IM server on which we create accounts for each of our users, providing them an IM account that they can use from any XMPP-capable client. This account also serves as the user our desktop app signs in as, and it will be the authentication for our API for third-party message ingestion and distribution.

    Agents

    The daemons connecting to this cluster serve as communication bridges between our internal Avatar bus and outside clients. Currently this is primarily for communicating with chat clients, as every user is assigned an agent that they communicate with us through, regardless of whether they use their default account or some third-party account such as jabber.org, googletalk, etc. We are also testing a client API that uses XMPP RPC via these agents for dedicated apps. In the future we will also offer full XMPP and REST APIs for third-party integration that will use the agents to communicate with the Avatar bus. I mentioned earlier that agents are avatars as well; however, they are a little special in that they do not have a user on the Avatar bus but talk to other avatars through cross-server federation. We are also building agents for the Oscar and MSN networks that will sit directly on the Avatar bus, since their native transport is not federated. We also plan to evaluate other networks for possible future support.

    Avatars

    Avatars form our internal message bus that we use to route and process all commands and messages. We primarily use direct messaging and IQ-based RPC stanzas between avatars, although we do take advantage of presence for monitoring.

    So what is an avatar? It's a daemon (where a single physical daemon process can host many avatar daemons) that is the authority for some external entity's data. I.e. every user registered with notify.me has an avatar that monitors agents for status changes, receives messages in care of that user, and is responsible for routing those messages to the appropriate delivery channel. This means that every avatar is the single authority for all data about that user and is responsible for persisting the data. If some other part of the system wants to know something about that user or modify its data, it uses XMPP RPC to talk to the avatar, rather than some central database. Right now, avatars persist to disk and SimpleDB, while keeping a TTL-regulated cache in process. Since only the avatar can write its own data, it does not need to check the DB but can treat its memory and disk cache as authoritative; SimpleDB is used primarily for writes. Reads are needed only in the case of a node failure, to bring up the avatar on another node.

    At the other end of the bus we have our ingestors. Ingestors are made up of a number of daemons, generally running on polling loops against external sources, queueing new data into our store-and-forward queues, where the appropriate ingestor avatar picks up new messages and distributes them to its subscribers. In the ingestor avatar scenario, it is the authority on subscription and routing data.

    Here's a typical use case: A user subscribes to an RSS feed via the web interface. The web interface sends the request to the user's avatar, which persists the subscription for reference and then requests the subscription from the RSS ingestor. As new RSS items arrive, the RSS ingestor multiplexes items to all user avatars that subscribe to that feed. The user avatars in turn determine the appropriate delivery mechanism and schedule delivery. In general that means that the user avatar is subscribed to the user's XMPP presence via the user's agent. Until the user is in the proper state for accepting messages, the user avatar queues the RSS items. Once the user is ready to receive the notification, the presence change is propagated from the agent into the internal bus and the user avatar then sends the RSS items to the agent, which in turn delivers them to the user.

    Right now, all avatars are always online (even if mostly idle), which is fine for our present user base size. Our plan is to mod the offline storage module of ejabberd so that we can blind-fire stanzas and have queued messages signal a monitor to spin up the appropriate avatar for the destination XMPP ID. Once this system is in place we will be able to spin up avatars on demand and shut them down when idle.
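    The sketch below is a hypothetical, much-simplified rendering of the user avatar behaviour in that use case: queue items while the user is away, flush them when the agent reports the user is ready. The real avatars do this over XMPP RPC on ejabberd; the class and callback names here are invented.

```python
# Hypothetical, simplified sketch of a user avatar's routing behaviour:
# hold notifications until the agent reports the user can accept them,
# then flush.  The real system does this over XMPP (ejabberd); names here
# are invented for illustration.
from collections import deque

class UserAvatar:
    def __init__(self, user_id, deliver_via_agent):
        self.user_id = user_id
        self.deliver = deliver_via_agent    # callable that hands a message to the agent
        self.available = False              # presence as last reported by the agent
        self.pending = deque()

    def on_notification(self, item):
        if self.available:
            self.deliver(self.user_id, item)
        else:
            self.pending.append(item)       # user away/offline: queue it

    def on_presence_change(self, available):
        self.available = available
        while self.available and self.pending:
            self.deliver(self.user_id, self.pending.popleft())
```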

    At what traffic load do you expect your current architecture to break and what's your plan?

    Since our system is distributed and asynchronous by design, we should avoid systemwide failures under load. However, while avoiding all the usual bottlenecks, the reality is that our message bus, which makes all this possible, will likely become our limiting factor, either because it cannot handle the number of avatars (nodes on the bus) or because latency on the bus becomes unacceptable. We're only starting to use the avatar system as our backbone, so it's still a bit fragile and we're still doing load testing on ejabberd to determine at what point we run into limiting factors. While we are already clustering ejabberd, the load of Mnesia database replication and cross-node chatter means that either the number of connections or latency will eventually cause the cluster to fail or simply consume too much memory to be manageable.

    Since our messaging is primarily point-to-point, we anticipate that we can split our user base into avatar silos, each hosted on a dedicated avatar subdomain cluster, reducing message and connection load. As long as our silos are appropriately designed to keep cross-subdomain chatter to a minimum, we should be able to have n silos to keep on top of load. Our single greatest challenge in keeping this architecture from failing is eternal vigilance against introducing features that create messaging bottlenecks. Having a significant amount of our message traffic pass through a single processor or family of processors would introduce dependencies we cannot scale ourselves out of with subdomain division.
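    One way to picture the silo split (names and numbers are hypothetical, not notify.me's scheme): hash each user id to one avatar subdomain cluster so its point-to-point traffic stays inside that silo.

```python
# Sketch of splitting the user base into avatar silos: hash each user id to
# one avatar subdomain cluster so its traffic stays inside that silo.
# The subdomain naming is hypothetical.
import hashlib

def avatar_silo(user_id, silo_count):
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % silo_count

def avatar_jid(user_id, silo_count=4):
    n = avatar_silo(user_id, silo_count)
    return f"{user_id}@avatars{n}.notify.me"   # e.g. user123@avatars2.notify.me

# Note: adding a silo changes the mapping for most users, so a production
# version would likely use consistent hashing or a lookup table instead.
```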

    Related Articles

  • notify.me tech blog - INotification
  • Flickr - Do the Essential Work Up-front and Queue the Rest


Sunday
Oct 26, 2008

    Should you use a SAN to scale your architecture? 

    This is a question everyone must struggle with when building out their datacenter. Storage choices are always the ones I have the least confidence in. David Marks, in his blog You Can Change It Later!, asks the question "Should I get a SAN to scale my site architecture?" and answers no. A better solution is to use commodity hardware with directly attached storage on the servers and to partition across servers for scaling and greater availability. David's reasoning is interesting:

  • A SAN creates a SPOF (single point of failure) and leaves you dependent on a vendor to fly in and fix things when there's a problem. This can lead to long downtimes, during which you have no access to your data at all.
  • Using easily available commodity hardware minimizes risks to your company; it's not just about saving money. Zooming over to Fry's to buy emergency equipment provides the kind of agility startups need in order to respond quickly to ever-changing situations.

    It's hard to beat the power and flexibility (backups, easy-to-add storage, mirroring, etc.) of a good SAN, but David makes a good case.


Saturday
Oct 25, 2008

    Product: Puppet the Automated Administration System

    Update: Digg on their choice and use of Puppet. They chose Puppet over cfengine and bcfg2 because they liked Puppet's resource abstraction layer (RAL), the ability to implement configuration management incrementally, support for bundles, and the overall design philosophy.

    Puppet implements a declarative (what, not how) configuration language for automating common administration tasks. It's the system every large site writes for themselves, and it's already made for you! iLike was able to "easily" scale from 0 to hundreds of servers using Puppet. I can't believe I've never seen this before. It looks really cool.

    What is Puppet and how can it help you scale your website operations? From the Puppet website: Puppet has been developed to help the sysadmin community move to building and sharing mature tools that avoid the duplication of everyone solving the same problem. It does so in two ways:

  • It provides a powerful framework to simplify the majority of the technical tasks that sysadmins need to perform.
  • The sysadmin work is written as code in Puppet's custom language, which is shareable just like any other code.

    This means that your work as a sysadmin can get done much faster, because you can have Puppet handle most or all of the details, and you can download code from other sysadmins to help you get done even faster. The majority of Puppet implementations use at least one or two modules developed by someone else, and there are already tens of recipes available in Puppet's CookBook.

    This sounds good. But does it work in the field? HJK Solutions' Adam Jacob says it does: Puppet enables us to get a huge jump-start on building automated, scalable, easy-to-manage infrastructures for our clients. Using Puppet, we:

    1. Automate as much of the routine systems administration tasks as possible.
    2. Get 10-minute unattended build times from bare metal, most of which is data transfer. Puppet takes it the rest of the way, getting the machines ready to have applications deployed on them. It's down to two and a half minutes for Xen.
    3. Bootstrap our clients' production environments while building their development environment. I can't stress how cool this really is. Because we are expressing the infrastructure at a higher level, when it comes time to deploy your production systems, it's really a non-event. We just roll out the Puppet Master and an operating system auto-install environment, and it's finished.
    4. Cross-pollinate between clients with similar architectures. We work with several different shops using Ruby on Rails, all of whom have very similar infrastructure needs. By using Puppet in all of them, when we solve a problem for one client, we've effectively solved it for the others. I love being able to tell a client that we solved a problem for them, and all it's going to cost is the time it takes for us to add the recipe.

    Puppet, today, is a tool that is good enough to handle the vast majority of issues encountered in building scalable infrastructures. Even the places where it falls short are almost always just a matter of it being less elegant than it could be, and the entire community is working on making those parts better.

    Related Articles

  • Operations is a competitive advantage... (Secret Sauce for Startups!) by Jesse Robbins
  • Infrastructure 2.0 by John Willis
  • Puppet, iLike and Infrastructure 2.0 by John Willis
  • Why are people paying 3 to 5 million for configuration management software? by Adam Jacob


Friday
Oct 24, 2008

    11 Secrets of a Cloud Scale Consultant That They Don't Want You to Know

    OK, there is no "they" and "they" wouldn't care if you knew anyway. After all, this isn't a blog about really important stuff like investing, acne cures, or cheap natural cleansing products. But the secrets are real. Super cloud scaling consultant Kent Langley has put together a comprehensive checklist to consider when developing for the cloud:

  • ORM for Data Partitioning and Query Splitting - Split queries between updates and deletes from the start
  • Monitoring process, resources, and uptime - Process Monitoring, Resource Monitoring, UpTime Monitoring
  • Performance Testing and Capacity Planning - Can't make good decisions without doing some degree of Performance Testing and Capacity planning.
  • Static vs. Dynamic Content splitting / CDN - Reverse Proxy, Splitting Static and Dynamic content
  • Bundling and Compressing JS and CSS - Bundle them, compress, version, and then properly cache those bundles (a minimal sketch appears after this list)
  • Logging - Log appropriately and monitor those logs
  • Pragmatic Caching - Most current web applications will have 3 to 5 layers of caching
  • Functional Decomposition - Decompose your entire application into functional silos
  • Deployment - It should be efficient, it should have a roll back capability, and it should be almost entirely automated to development
  • Asynchronous Practices - In most cases work can be queued and done by a separate process
  • Make sure your application processes are as lean as possible - More efficient code means fewer servers

    Please follow the link to Kent's post for a full explanation. To some this may seem obvious, but that doesn't mean it gets done. Good helpful stuff.
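    As an illustration of the bundling item above (hypothetical paths, not Kent's tooling): concatenate the assets, gzip the result, and name the bundle by a content hash so it can be cached aggressively and invalidated simply by changing the filename.

```python
# Illustration of the "bundle, compress, version, then cache" item above:
# concatenate assets, gzip the bundle, and name it by a content hash so it
# can be served with a far-future cache header.  Paths are hypothetical.
import gzip
import hashlib
from pathlib import Path

def build_bundle(sources, out_dir, name="site", ext="js"):
    combined = "\n".join(Path(s).read_text() for s in sources)
    version = hashlib.sha1(combined.encode("utf-8")).hexdigest()[:10]
    out_path = Path(out_dir) / f"{name}.{version}.{ext}.gz"
    with gzip.open(out_path, "wt") as f:
        f.write(combined)
    return out_path          # reference this versioned file from your pages

# build_bundle(["js/app.js", "js/widgets.js"], "static/bundles")
```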

    Related Articles

  • Joyent - Cloud Computing Built on Accelerators by Kent Langley


Friday
Oct 17, 2008

    A High Performance Memory Database for Web Application Caches

    Abstract—This paper presents the architecture and characteristics of a memory database intended to be used as a cache engine for web applications. Primary goals of this database are speed and efficiency while running on SMP systems with several CPU cores (four and more). A secondary goal is the support for simple metadata structures associated with cached data that can aid in efficient use of the cache. Due to these goals, some data structures and algorithms normally associated with this field of computing needed to be adapted to the new environment.


Friday
Oct 17, 2008

    Scaling Spam Eradication Using Purposeful Games: Die Spammer Die!

    Update: As expected, I'm undergoing a massive spam attack for speaking truth to dark powers. This is the time to be strong. Together we can make a change. What change you may ask? I can't say, just change and lots more change. Let's link arms together and bravely stand against the forces of chaos for a better yesterday and a better tomorrow. CAPTCHA doesn't work. Even Google can't make CAPTCHA work (Spammers Choose GMail). And even if CAPTCHA worked it wouldn't really work, because CAPTCHA solving markets (Inside India's CAPTCHA solving economy) have evolved where for a mere $2 you can buy 1,000 human-broken CAPTCHAs. And we know once the free market tackles a problem that's it. Game over :-) Making ever more clever CAPTCHA programs won't outwit and outlast the CAPTCHA solving markets. Until Skynet evolves, the only way to defeat humans is with humans.

    Using Games to Get Humans to Do Work (like CAPTCHA) for Free

    How do we harness the power of humans to do battle with the CAPTCHA solving networks, without, of course, paying them anything? We make it a game! In particular we make a Game With a Purpose (GWAP). Read all about GWAPs in Designing games with a purpose. A GWAP is a game in which people, as a side effect of playing, perform tasks computers are unable to perform.

    Google's Image Labeler

    A good example GWAP is Google's Image Labeler, a game in which people provide meaningful, accurate labels for images on the Web as a side effect of playing the game; for example, an image of a man and a dog is labeled "dog," "man," and "pet." Now this sounds like work. And it is. But because it's made into a game, people will do it for free! In the game two people are matched at random to label the same set of images. Points are awarded when you and your partner match labels. Top scores are kept so you can earn your label street cred. But can't people cheat? GWAP games include cheating detection mechanisms, but we won't go into detail here; see Designing games with a purpose for cheater-foiling strategies.

    ESP Game, Tag a Tune, and Squigl

    More games can be found at the GWAP Home Page. They have the ESP Game, which is like Labeler. Tag a Tune is a game where players hear tunes, describe them, and through the description guess if they are listening to the same tune. In Squigl, partners see an image and a word; using the mouse, each player traces the object described by the word in the image. Winning is when both players trace the same image. So you see the pattern. Players are picked from a pool. They are asked to do some task that's hard for computers to do. The task must be structured so that winning enables the system to learn something valid while providing a feeling of game play for the humans. Points are awarded and scores are kept to keep the poor human slaves playing.

    Creating a Spam Catcher Game

    With the basic ideas in place, let's create a game for identifying and filtering out comment spam. According to Designing games with a purpose this appears to be an output-agreement type game, which has the following structure:
  • Initial setup. Two strangers are randomly chosen by the game itself from among all potential players;
  • Rules. In each round, both are given the same input and must produce outputs based on the input. Game instructions indicate that players should try to produce the same output as their partners. Players cannot see one another's outputs or communicate with one another; and
  • Winning condition. Both players must produce the same output; they do not have to produce it at the same time but must produce it at some point while the input is displayed onscreen.

    Simple enough. But comments exist as a part of blogs, websites, microblogging engines, and other programs. Any game has to interface with live systems. Integrating the game with a comment system might work something like this:

  • User comments are sent from an originating system to a decentralized game comment queue.
  • Comments are pulled from the queue as new games start. Posts are stripped of identifying information and presented to the players.
  • Points are allocated if both players agree that a comment is spam or not spam within a very short period of time (a toy sketch of this agreement check follows below). With comments, latency is the name of the game, so they need to be processed as fast as possible.
  • Comments and the spam judgments are sent back to the originating system for handling.

    It's not too hard to imagine such a system being used for content other than comments and for making judgments like age appropriateness and other subtle criteria that could be communicated using site metadata. One UI idea is to make the game like a first-person shooter. Spam is blasted into 1000 pieces. Oh, that would be rewarding, but you can also imagine all the usual game-type mechanisms to keep people interested. An accuracy feedback loop would be useful to rate players so less accurate players could be dropped from the game.

    Players would be recruited from the general population. Another good source of players is the site owners and the site participants whose sites are the source of comments. This would be a sort of Internet Comment Tax for keeping the Internet safe and sane. I, for example, would sign up to process 500 comments a week in order to have HighScalability.com comments processed by the game. Everyone else taking advantage of the system could pledge a number that made sense for their site. This would provide a ready pool of motivated players and docents to keep the game running efficiently. A nice widget system would make it possible to play the game from any site.
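    Here is a toy sketch of the output-agreement check referenced in the list above; every name in it is invented for illustration. Two paired players judge the same comment, and points are awarded only if their verdicts match within a short window.

```python
# Toy sketch of the output-agreement winning condition for the spam game:
# two paired players judge the same comment and score only if their verdicts
# agree within a short window.  All names here are invented for illustration.
import time

AGREEMENT_WINDOW = 15.0   # seconds within which both verdicts must arrive

class SpamRound:
    def __init__(self, comment_id):
        self.comment_id = comment_id
        self.verdicts = {}            # player_id -> (is_spam, timestamp)

    def submit(self, player_id, is_spam):
        self.verdicts[player_id] = (is_spam, time.time())
        return self._check_agreement()

    def _check_agreement(self):
        if len(self.verdicts) < 2:
            return None               # still waiting on the partner
        (v1, t1), (v2, t2) = self.verdicts.values()
        if v1 == v2 and abs(t1 - t2) <= AGREEMENT_WINDOW:
            return {"comment_id": self.comment_id, "is_spam": v1, "points": 10}
        return {"comment_id": self.comment_id, "is_spam": None, "points": 0}
```

    A real deployment would also need the cheating-detection and player-accuracy tracking mentioned earlier before trusting any single round's verdict.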

    The Final Move

    Spam crushes many sites. Many site owners don't even allow comments anymore because of the time it takes to deal with spam, which is a shame, because without interactivity the internet might as well be a newspaper. We can't let those spammers win! A system like the Spam Catcher Game might be able to provide the human oversight, low latency, and high throughput needed to outcompete the CAPTCHA solving networks. The game is finally afoot!

    Related Articles

  • GWAP Home
  • Designing games with a purpose
  • Inside India’s CAPTCHA solving economy
  • Spammers Choose GMail
  • Google's Image Labeler
  • Google Crashing


Friday
Oct 10, 2008

    Useful Corporate Blogs that Talk About Scalability

    Some intrepid company blogs are posting their technical challenges and how they solve them. I wish more would open up and talk about what they are doing as it helps everyone move forward. Here are a few blogs documenting their encounters with the bleeding edge:

  • Flickr
  • Digg
  • LinkedIn
  • Facebook
  • Amazon Web Services blog
  • Twitter blog
  • Reddit blog
  • Photobucket blog
  • Second Life blog
  • PlentyofFish blog
  • Joyent's Blog

    Any others that should be added?
