High Scalability -

Strategy: Solve Only 80 Percent of the Problem

Friday, August 28, 2009 at 2:04AM

Solve only 80% of a problem. That's usually good enough and you'll not only get done faster, you'll actually have a chance of getting done at all.

This strategy is given by Amix in HOW TWITTER (AND FACEBOOK) SOLVE PROBLEMS PARTIALLY. The idea is that solving 100% of a complex problem can be so hard and so expensive that you'll end up wasting all your bullets on a problem that could have been satisfactorily solved in a much simpler way.

The example given is for Twitter's real-time search. Real-time search almost by definition is focussed on recent events. So in the design should you be able to search historically back from the beginning of time or should you just be able to search for recent time periods? A complete historical search is the 100% solution. The recent data only search is the 80% solution. Which should you choose?

The 100% solution is dramatically more difficult to solve. It requires searching disk in real-time which is a killer. So it makes more sense to work on the 80% problem because it will satisfy most of your users and is much more doable.

By reducing the amount of data you need to search it's possible to make some simplifying design choices, like using fixed sized buffers that reside completely in memory. With that architecture your streaming searches can be blisteringly fast while returning the most relevant data. Users are happy and you are happy.
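As a rough illustration of the fixed-size, in-memory buffer idea, here is a minimal Python sketch. It is not Twitter's implementation; the `RecentSearchIndex` class, its capacity, and the sample messages are all made up for the example.

```python
from collections import deque


class RecentSearchIndex:
    """Keeps only the newest `capacity` messages in memory (hypothetical class)."""

    def __init__(self, capacity):
        # deque(maxlen=...) silently drops the oldest item once the buffer is full.
        self._buffer = deque(maxlen=capacity)

    def add(self, message):
        self._buffer.append(message)

    def search(self, term):
        # Linear scan, newest first; acceptable because the buffer is bounded.
        term = term.lower()
        return [m for m in reversed(self._buffer) if term in m.lower()]


index = RecentSearchIndex(capacity=3)
for text in ["old scaling tip", "paxos talk", "scaling memcached", "scaling search"]:
    index.add(text)

recent_hits = index.search("scaling")
print(recent_hits)
```

The oldest message falls out of the buffer and is simply not searchable anymore. That is the 80% trade: bounded memory and fast scans in exchange for giving up history.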

It's not a 100% solution, but it's a good enough solution that works. Sometimes as programmers we are blinded by the glory of the challenge of solving the 100% solution when there's a more reasonable, rational alternative that's almost as good. Something to keep in mind when you are wondering how you'll possibly get it all done. Don't even try.

Amix has a very good discussion of Twitter and this strategy on his blog.

Worse is Better

A Hacker News post discussing this article brought up that this strategy is the same as Richard Gabriel's famous Worse-is-Better paradox which holds: The right thing is frequently a monolithic piece of software, but for no reason other than that the right thing is often designed monolithically. That is, this characteristic is a happenstance. The lesson to be learned from this is that it is often undesirable to go for the right thing first. It is better to get half of the right thing available so that it spreads like a virus. Once people are hooked on it, take the time to improve it to 90% of the right thing.

Unix, C, C++, Twitter and almost every product that has experienced wide adoption has followed this philosophy.

Worse-is-Better solutions have the following characteristics:

Simplicity - The design must be simple, both in implementation and interface. It is more important for the implementation to be simpler than the interface. Simplicity is the most important consideration in a design.

Correctness - The design must be correct in all observable aspects. It is slightly better to be simple than correct.

Consistency - The design must not be overly inconsistent. Consistency can be sacrificed for simplicity in some cases, but it is better to drop those parts of the design that deal with less common circumstances than to introduce either implementational complexity or inconsistency.

Completeness - The design must cover as many important situations as is practical. All reasonably expected cases should be covered. Completeness can be sacrificed in favor of any other quality. In fact, completeness must be sacrificed whenever implementation simplicity is jeopardized. Consistency can be sacrificed to achieve completeness if simplicity is retained; especially worthless is consistency of interface.

In my gut I think Worse-is-Better is different from "Solve Only 80 Percent of the Problem", primarily because Worse-is-Better is more about product adoption curves and 80% is more a design heuristic. After some cogitating this seems a false distinction, so I have concluded I'm wrong and have added Worse-is-Better to this post.

Worse Is Better Richard P. Gabriel

Lisp: Good News, Bad News, How to Win Big

Interesting Hacker News Thread

In Praise of Evolvable Systems by Clay Shirky

Big Ball of Mud by Brian Foote and Joseph Yoder

Todd Hoff | 8 Comments

Strategy

Hot Links for 2009-8-26

Wednesday, August 26, 2009 at 1:35AM

I'm Going To Scale My Foot Up Your Ass - Shut up about scalability, no one is using your app anyway.

Multi-Tenant Data Architecture - Microsoft's take on different approaches to multitenancy.

Cloud computing rides on spiraling Energy costs - A report by US researchers has shown the increasing cost of power and cooling in the data centre is a driver towards cloud computing.

Interview: Apple’s Gigantic New Data Center Hints at Cloud Computing - Companies building centers this big are getting into cloud computing. Running apps in the cloud requires massive infrastructure: Google-size infrastructure.

What Does Cloud Computing Actually Cost? An Analysis of the Top Vendors - Amazon is currently the lowest cost cloud computing option overall. At least for production applications that need more than 6.5 hours of CPU/day, otherwise GAE is technically cheaper because it's free until this usage level.

no:sql(east) - October 28–30, 2009, Atlanta, GA. Very cute page playing off of SQL syntax.

New Products and Updates

Gear6 Web Cache Virtual Appliance - a feature complete virtual machine (VM) of the Gear6 Web Cache software. It includes all the functionality of the Gear6 Web Cache including simulating Gear6 high density RAM-flash architecture.

Seamlessly Extending the Data Center - Introducing Amazon Virtual Private Cloud (VPC) - We have developed Amazon VPC to allow our customers to seamlessly extend their IT infrastructure into the cloud while maintaining the levels of isolation required for their enterprise management tools to do their work.

NetApp reveals cloud computing plan, new Data OnTap OS - "Our research shows users are very interested in scale-out technology," she said. "What's nice about it is as you add processor and storage resources, you get much higher storage utilization rates and the new scale-out system grows up to 14 petabytes, but it can still be managed in a single array."

The Big Cheese: Powerful Version Of Google Search Appliance Can Grow Exponentially.

Updates to Articles on High Scalability

Streamy Explains CAP and HBase's Approach to CAP - We plan to employ inter-cluster replication, with each cluster located in a single DC. Remote replication will introduce some eventual consistency into the system, but each cluster will continue to be strongly consistent. Updated: How Google Serves Data from Multiple Datacenters.

The fantasy sponsor for this post are those little food kiosks outside Home Depot stores. I love their Fire Dogs. Hot and yummy. I bet most home improvement projects in America are inspired by cravings for one of these little beauties.

Todd Hoff | 1 Comment

hot links

How Google Serves Data from Multiple Datacenters

Monday, August 24, 2009 at 1:54PM

Update: Streamy Explains CAP and HBase's Approach to CAP. We plan to employ inter-cluster replication, with each cluster located in a single DC. Remote replication will introduce some eventual consistency into the system, but each cluster will continue to be strongly consistent.

Ryan Barrett, Google App Engine datastore lead, gave this talk Transactions Across Datacenters (and Other Weekend Projects) at the Google I/O 2009 conference.

While the talk doesn't necessarily break new technical ground, Ryan does an excellent job explaining and evaluating the different options you have when architecting a system to work across multiple datacenters. This is called multihoming, operating from multiple datacenters simultaneously.

As multihoming is one of the most challenging tasks in all computing, Ryan's clear and thoughtful style comfortably leads you through the various options. On the trip you learn:

The different multi-homing options are: Backups, Master-Slave, Multi-Master, 2PC, and Paxos. You'll also learn how they each fare on support for consistency, transactions, latency, throughput, data loss, and failover.

Google App Engine uses master/slave replication between datacenters. They chose this approach in order to provide:
- lowish latency writes
- datacenter failure survival
- strong consistency guarantees.

No solution is all win, so a compromise must be made depending on what you think is important. A major Google App Engine goal was to provide a strong consistency model for programmers. They also wanted to be able to survive datacenter failures. And they wanted write performance that wasn't too far behind a typical relational database. These priorities guided their architectural choices.

In the future they hope to offer optional models so you can select Paxos, 2PC, etc for your particular problem requirements (Yahoo's PNUTS does something like this).

There's still a lot more to learn. Here's my gloss on the talk:

Consistency - What happens when you read after a write?

Read/write data is one of the hardest kinds of data to run across datacenters. Users expect a certain level of reliability and consistency.

Weak - it might be there, might not. Best effort. Like memcached. It's OK to drop data for some applications like VoIP, live video, and multiplayer games. You care more about where things are now, not where they were. For data this is not good.

Eventual - You eventually see the stuff you wrote, just not right away. Email is a good example. You send it but it doesn't arrive right away, but it gets there, eventually. DNS change propagation, SMTP, Amazon S3, SimpleDB, search engine indexing are all of this type. There's a delay after a write when a read won't see what was written, but the writes eventually push through. Still not ideal for data.

Strong - The ideal solution for a structured data system. You get what you put in. Simplest to program against and think about. Any read after a write will return what was written. AppEngine, file systems, Microsoft Azure, and RDBMSes work this way.

Once we move data across datacenters what consistency guarantees do we have? We can give up some guarantees, but we should know what we are getting.

Transactions - Extended form of consistency across multiple operations.

Transaction Properties: Correctness, consistency, enforce invariants, ACID.

Example: bank transaction. Transfer money from A to B. Subtract money from A and add to B. These happen at different times. What happens if another transfer happens for A in-between? What happens if there's a failure? What happens if a program reads from A or B? You want guarantees. On a crash will money added to B still be added to B? Will money taken from A still be taken from A? You don't want to lose or create money.

When you start operating across datacenters it's even harder to enforce transactions because more things can go wrong and operations have high latency.
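To make the bank example concrete, here is a minimal single-database sketch using Python's built-in `sqlite3` module. The table layout, account names, and the `transfer` helper are assumptions for illustration; this shows what a local transaction gives you for free, not how App Engine implements transactions across datacenters.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; roll back on any failure."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()

ok = transfer(conn, "A", "B", 30)        # succeeds
failed = transfer(conn, "A", "B", 500)   # rolled back, nothing changes
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
total = sum(balances.values())
print(ok, failed, balances, total)
```

The invariant is that the total never changes: money is neither lost nor created, even when a transfer fails halfway through.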

Why Operate in Multiple Datacenters?

Sh*t happens - datacenters fail for any number of reasons.

Performance - geolocality allows operations to be moved closer to the user. The speed of light limits how fast data can be transferred and becomes significant when operating across the world. Going through multiple router hops also slows traffic. So closer is better and you can only be closer if your data is near the user, which requires operating in multiple datacenters. CDNs do this for you, especially for more static data. They put data everywhere.

Why Not Operate in Multiple Datacenters?

Operating in a single datacenter is easy: Low cost bandwidth. Low latency. High bandwidth. Easy operations. Easier code.

Operating in multiple datacenters is hard: high cost, high latency, low bandwidth, difficult operations, harder code.

It's especially hard if you have a read/write structured data system where you accept writes from more than one location. You have consistency problems. Maintaining consistency in the face of the distances and failures is non-trivial.

Your Different Architecture Options

Single Datacenter. Don't bother operating in multiple datacenters. This is the easiest option and is what most people do. But datacenters fail, you could lose data, and your site could go down.

Bunkerize. Create a Maginot Line for the Ultimate Datacenter. Make sure your datacenter doesn't ever go down. SimpleDB and Azure use this strategy.

Single Master. Pick a master datacenter that writes go to and other sites replicate from. The replica sites offer read-only services.
- Better, but not great.
- Data are usually replicated asynchronously so there's a window of vulnerability for loss.
- Data in your other datacenters may not be consistent on failure.
- Popular with financial institutions.
- You get geolocation to serve reads. Consistency depends on the technique. Writes are still limited to one datacenter.

Multi-Master. True multihoming. The Holy Grail. All datacenters are serving reads and writes. All data is consistent. Transactions just work. This is really hard.
- So some choose to do it with just two datacenters. NASDAQ has two datacenters close together (low latency) and performs a two-phase commit on every transaction, but they have very strict latency requirements.
- Using more than two datacenters is fundamentally harder. You pay for it with queuing delays, routing delays, speed of light. You have to talk between datacenters. It's just fundamentally slower with a smaller pipe. You may pay for it with capacity and throughput, but you'll definitely pay in latency.

How Do You Actually Do This?

What are the techniques and tradeoffs of different approaches? Here's the evaluation matrix:

	Backups	M/S	MM	2PC	Paxos
Consistency	Weak	Eventual	Eventual	Strong	Strong
Transactions	No	Full	Local	Full	Full
Latency	Low	Low	Low	High	High
Throughput	High	High	High	Low	Medium
Data loss	Lots	Some	Some	None	None
Failover	Down	Read-only	Read/Write	Read/Write	Read/Write

- M/S = master/slave, MM = multi-master, 2PC = two-phase commit
- What kind of consistency, transactions, latency throughput do we get for a particular approach? Will we lose data on failure? How much will we lose? When we failover for maintenance or we want to move things, say decommissioning a datacenter, how well do we do that, how well do the techniques support it?

Backups - Make a copy of your data that's secret and safe. Generally weak consistency. Usually no transactions. Used for the first internal datastore launch. Not good enough for a production system. Lose data since last backup. You are down while restoring a backup to another datacenter.

Master/Slave Replication - Writes to a master are also written to one or more slaves.
- Replication is asynchronous so good for latency and throughput.
- Weak/eventual consistency unless you are very careful.
- You have multiple copies in the datacenters, so you'll lose a little data on failure, but not much. Failover can go read-only until the master has been moved to another datacenter.
- Datastore currently uses this mechanism. Truly multihoming adds latency because you have to add the extra hop between datacenters. App Engine is already slow on writes so this extra hit would be painful. M/S gives you most of the benefits of better forms while still offering lower latency writes.
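A toy simulation of asynchronous master/slave replication shows where the "weak/eventual consistency" and the small window of data loss come from. The `Master` and `Slave` classes are hypothetical and run in a single process; real replication ships a log over the network.

```python
from collections import deque


class Master:
    """Accepts writes and queues them for asynchronous replication."""

    def __init__(self):
        self.data = {}
        self.replication_log = deque()

    def write(self, key, value):
        self.data[key] = value                      # acknowledged immediately
        self.replication_log.append((key, value))   # shipped to the slave later


class Slave:
    """Read-only copy that applies the master's log some time afterwards."""

    def __init__(self):
        self.data = {}

    def apply(self, log):
        while log:
            key, value = log.popleft()
            self.data[key] = value


master, slave = Master(), Slave()
master.write("user:1", "alice")

stale_read = slave.data.get("user:1")    # replication has not run yet
slave.apply(master.replication_log)
fresh_read = slave.data.get("user:1")    # the slave has caught up
print(stale_read, fresh_read)
```

Anything still sitting in `replication_log` when the master's datacenter dies is the data you lose on failure.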

Multi-Master Replication - support writes from multiple datacenters simultaneously.
- You figure out how to merge all the writes later when there's a conflict. It's like asynchronous replication, but you are serving writes from multiple locations.
- Best you can do is Eventual Consistency. Writes don't immediately go everywhere. This is a paradigm shift. Up to now we've assumed a strongly consistent system, and that backups and M/S don't change how it behaves. They are just techniques to help us multihome. Here it literally changes how the system runs because the multiple writes must be merged.
- To do the merging you must find a way to serialize, to impose an ordering on all your writes. There is no global clock. Things happen in parallel. You can't ever know what happens first. So you make it up using timestamps, local timestamps + skew, local version numbers, or a distributed consensus protocol. This is the magic and there are a number of ways to do it.
- There's no way to do a global transaction. With multiple simultaneous writes you can't guarantee transactions. So you have to figure out what to do afterward.
- AppEngine wants strong consistency to make building applications easier, so they didn't consider this option.
- Failover is easy because each datacenter can handle writes.
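One of the simplest ways to "make up" an ordering is last-write-wins on a (timestamp, datacenter id) pair. The sketch below is only one possible merge policy, with invented keys and timestamps; the talk does not say which policy any particular system uses.

```python
def merge_last_write_wins(*replicas):
    """Merge per-datacenter writes; each value is (timestamp, dc_id, payload)."""
    merged = {}
    for replica in replicas:
        for key, version in replica.items():
            # Compare by timestamp first, then dc_id as a deterministic tie-breaker.
            if key not in merged or version[:2] > merged[key][:2]:
                merged[key] = version
    return merged


dc_east = {"profile:7": (1001, "east", "nickname=Bob")}
dc_west = {
    "profile:7": (1003, "west", "nickname=Robert"),
    "profile:9": (1002, "west", "nickname=Ann"),
}

merged = merge_last_write_wins(dc_east, dc_west)
print(merged)
```

Every datacenter ends up with the same answer regardless of merge order, but the losing write is silently discarded, which is exactly why transactions can't be guaranteed here.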

Two Phase Commit (2PC) - protocol for setting up transactions between distributed systems.
- Semi-distributed because there's always a master coordinator for a given 2PC transaction. Because there are so few datacenters you tend to go through the same set of master coordinators.
- It's synchronous. All transactions are serialized through that master which kills your throughput and increases latency.
- They never seriously considered this option because write throughput is very important to them. No single point of failure or serialization point would work for them. Latency is high because of the extra coordination. Writes can be in the 200msec area.
- This option does work though. You write to all datacenters or nothing. You get strong consistency and transactions.
- Need N+1 datacenters. If you take one down then you still have N to handle your load.
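The shape of the protocol fits in a few lines. This is a bare-bones, in-process sketch with hypothetical `Participant` objects standing in for datacenters; it ignores timeouts, coordinator crashes, and durable logging, which are where the real difficulty lies.

```python
class Participant:
    """One datacenter taking part in a commit (hypothetical class)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.staged = None
        self.data = {}

    def prepare(self, key, value):
        # Phase 1: stage the write and vote yes or no.
        if self.can_commit:
            self.staged = (key, value)
        return self.can_commit

    def commit(self):
        # Phase 2, success path: make the staged write visible.
        key, value = self.staged
        self.data[key] = value
        self.staged = None

    def abort(self):
        # Phase 2, failure path: discard anything that was staged.
        self.staged = None


def two_phase_commit(participants, key, value):
    """Coordinator: commit everywhere only if every participant votes yes."""
    votes = [p.prepare(key, value) for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


healthy = [Participant("dc1"), Participant("dc2")]
committed = two_phase_commit(healthy, "x", 1)

mixed = [Participant("dc1"), Participant("dc2", can_commit=False)]
aborted = two_phase_commit(mixed, "x", 1)
print(committed, aborted)
```

Note that every write waits on every participant and funnels through one coordinator, which is where the throughput and latency costs above come from.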

Paxos - A consensus protocol where a group of independent nodes reach a majority consensus on a decision.
- Protocol: there's a propose step and then an agree step. You only need a majority of nodes to agree to say something is persisted for it to be considered persisted.
- Unlike 2PC it is fully distributed. There's no single master coordinator.
- Multiple transactions can be run in parallel. There's less serialization.
- Writes are high latency because of the 2 extra coordination round trips required in the protocol.
- Wanted to do this, but they didn't want to pay the 150msec latency hit to writes, especially when competing against 5msec writes for RDBMSes.
- They tried using physically close datacenters but the built-in multi-datacenter overhead (routers, etc) was too high. Even in the same datacenter it was too slow.
- Paxos is still used a ton within Google. Especially for lock servers. For coordinating anything they do across datacenters. Especially when state is moved between datacenters. If your app is serving data in one datacenter and it should be moved to another that coordination is done through Paxos. It's used also in managing memcache and offline processing.
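For the curious, the propose and agree steps can be sketched as single-decree Paxos in one process. This is a teaching sketch under heavy assumptions (no network, no crashes, no retries, one value decided once), not how Google runs Paxos; the `Acceptor` class and `propose` function are names chosen for the example.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0        # highest proposal number promised so far
        self.accepted = None     # (number, value) last accepted, if any

    def prepare(self, number):
        # Promise to ignore proposals numbered below `number`.
        if number > self.promised:
            self.promised = number
            return True, self.accepted
        return False, None

    def accept(self, number, value):
        if number >= self.promised:
            self.promised = number
            self.accepted = (number, value)
            return True
        return False


def propose(acceptors, number, value):
    """Run one proposal; return the chosen value, or None without a majority."""
    majority = len(acceptors) // 2 + 1

    # Propose step: gather promises from the acceptors.
    replies = [a.prepare(number) for a in acceptors]
    promises = [accepted for ok, accepted in replies if ok]
    if len(promises) < majority:
        return None

    # If any acceptor already accepted a value, adopt the highest-numbered one.
    prior = [acc for acc in promises if acc is not None]
    if prior:
        value = max(prior, key=lambda acc: acc[0])[1]

    # Agree step: the value is chosen once a majority accepts it.
    acks = sum(a.accept(number, value) for a in acceptors)
    return value if acks >= majority else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, number=1, value="master=dc1")
second = propose(acceptors, number=2, value="master=dc2")
stale = propose(acceptors, number=1, value="master=dc3")
print(first, second, stale)
```

The second proposer asks for a different value but ends up confirming the one already chosen, and the stale proposal is rejected outright. The two phases are also the two extra round trips that make writes slow.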

Miscellaneous

Entity Groups are the unit of consistency in AppEngine. Operations are serialized on Entity Groups. The log for each commit to an entity group is replicated. This maintains consistency and provides transactions. Entity Groups are essentially shards. Sharding enables scaling because it allows you to handle a lot of writes. Datastore shards in entity group size chunks. BuddyPoke has 40 million users, each of which has an entity group. That's 40 million different shards.
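A loose way to picture "operations are serialized on Entity Groups" is one lock and one commit log per group. The `EntityGroupStore` class below is purely illustrative and assumes nothing about the Datastore's actual internals.

```python
import threading
from collections import defaultdict


class EntityGroupStore:
    """Serializes commits within an entity group; groups stay independent."""

    def __init__(self):
        self._locks = defaultdict(threading.Lock)   # one lock per entity group
        self._logs = defaultdict(list)              # one commit log per group
        self._guard = threading.Lock()              # protects the two dicts

    def commit(self, group, mutation):
        with self._guard:
            lock = self._locks[group]
            log = self._logs[group]
        with lock:  # only writers to the same group wait on each other
            log.append(mutation)
            return len(log)  # position of this commit in the group's log

    def log(self, group):
        return list(self._logs.get(group, []))


store = EntityGroupStore()
store.commit("user:alice", "set mood=happy")
store.commit("user:bob", "set mood=sleepy")
position = store.commit("user:alice", "add friend=bob")
print(position, store.log("user:alice"))
```

Because writers to different groups never contend, adding more groups adds write capacity, which is the sharding benefit described above.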

Eating your own dog food is a strategy used a lot at Google. Iterate and make people use new features internally. They use a ton of stuff that's very early. You can iterate many, many times, which improves a feature before you are ready to launch.

They see relational databases in the datacenter as their competition as much as Azure and SimpleDB. Inserts into RDBMS are in low milliseconds. Writes into AppEngine are 30-40 msecs. Reads are fast. They like this trade-off because on the web reads vastly outnumber writes.

Discussion

A few things I wondered through the talk. Did they ever consider a distributed MVCC approach? That might be interesting and wasn't addressed as an option. Clearly at Google scale an in-memory data grid isn't yet appropriate.

A preference for the strong consistency model was repeatedly specified as a major design goal because this makes the job of the programmer easier. A counter to this is that the programming model for Google App Engine is already very difficult. The limits and the lack of traditional relational database programming features put a lot of responsibility back on the programmer to write a scalable app. I wonder if giving up strong consistency would have been such a big deal in comparison?

I really appreciated the evaluation matrix and the discussion of why Google App Engine made the choices they did. Since writes are already slow on Google App Engine they didn't have a lot of headroom to absorb more layers of coordination. These are the kinds of things developers talk about in design meetings, but they usually don't make it outside the cubicle or conference room walls. I can just hear Ryan, with voice raised, saying "Why can't we have it all!" But it never seems we can have everything. Many thanks to Ryan for sharing.

Slides for the Talk

ZooKeeper - A Reliable, Scalable Distributed Coordination System

Yahoo!'s PNUTS Database: Too Hot, Too Cold or Just Right?

Paper: Consensus Protocols: Paxos by Henry Robinson

Paper: Consensus Protocols: Two-Phase Commit by Henry Robinson

Paper: Dynamo: Amazon’s Highly Available Key-value Store

Are Cloud Based Memory Architectures the Next Big Thing?

Todd Hoff | 7 Comments

Example,

Strategy,

distributed systems

VMware to bridge a DMZ.

Thursday, August 20, 2009 at 2:43PM

Hey guys,

There is a renewed push at my organization to deploy vmware...everywhere.

I am rather excited as I know we have a lot of waste when it comes to resources.

What has pricked my ears up however, is the notion of using this technology in our very busy public facing DMZ's.

Today we get lots of spikes of traffic and we are coping very well. 40x HP blades, apache/php/perl/tomcat/ all in HA behind HA F5's and HA Checkpoint FW's. (20 servers in 2 datacentres).

The idea is, we virtualise these machines, including the firewalls onto hosts vmware clusters that span the public interface to our internal networks. This is something that has gone against the #1 rule I have ever lived by while working on the inet. No airgaps from the unknown to the known!

I am interested in feedback on this scenario.

From a resource perspective, our resource requirements in the DMZ will be lowered over time due to business change and we still have a lot of head room in our capacity.

Do you think this is change for change's sake? All I can see is more complexity, higher risk and more skill required to manage what today is a very simple and resilient setup with no security flaws.

VMware and some big name companies/gov agencies stand by the notion that the software dividing the host machine is more than capable of keeping the DMZ's in check. It just doesn't sit well with me, knowing we may have a public facing website on the same host machine which is running a critical safety or customer management tool.

Apart from the ease of management to grow/shrink (something we don't need to do in any rush), what are the advantages that justify the increased risk and complexity?

Are any of you in the same position?

Costs wise - our website costs are minuscule compared to the revenue we generate thru them - Would you risk what is a sound and stable environment because it sounds cool to 'virtualise' or is there something I am missing?

Kind regards,
Foodie

ps. I don't post much on here but I love reading your articles. The website I am referring to in my post hits a peak of $250/second and is responsible for 90% of revenue to the business.

foodie | 1 Comment

General Discussion,

dmz,

vmware,

web,

websites

Dependency Injection and AOP frameworks for .NET

Thursday, August 20, 2009 at 12:49AM

We're looking to implement a framework to do Dependency Injection and AOP for a new solution we're working on. It will likely get hit pretty hard, so we'd like to choose a framework that's proven to scale well, and operates well under pressure.

Right now, we're looking closely at Spring.NET, Castle Project's Windsor framework, and Unity. Does anyone have any feedback on implementing any of these in large, high traffic environments?

mebemikeyc | 1 Comment

General Discussion

Hardware Architecture Example (geographical level mapping of servers)

Tuesday, August 18, 2009 at 9:11PM

I have put down my thoughts on the architecture discussed in the blog. Although I have done substantial research to understand how things should work before deciding on this architecture, I will need a lot of input from everyone to come to an architecture decision. The hardware entities considered in the design are:
1. Master Web Server which will map different users to web servers placed in different geographical locations. (will prefer storing a mapping table in RAM)
2. Web Servers
3. Application Servers
4. Master Database Servers (to implement entity wise look up sharding)
5. Slave Database Servers.

Will really appreciate if some good inputs of using Cloud Computing are given and how to go about it against or in addition to the given architecture. Would like to in fact know people's view on when to decide using cloud computing techniques. Looking forward for inputs from the community.

paragarora | 3 Comments

sharding

Real World Web: Performance & Scalability

Tuesday, August 18, 2009 at 1:54AM

We've referenced this 189 slide masterpiece by Ask Bjorn Hansen before, but it was hidden without its own first class link. He describes his presentation as 3 hours of 5 minute lightning talks and that sounds about right.

The presentation covers the overall platform and architecture considerations involved in tuning applications from a holistic perspective. You’ll be shown how to design scalable architectures for dynamic, high-volume web sites. Topics covered include caching, scalable database design, replication architecture, load-balancing, and architectural decisions derived from many years of experience.

His prime directive of scaling: Think Horizontally at every point in your architecture, not just at the web tier.

You may not agree with everything, but there's a lot of useful advice. Here's a summary of some of what is covered:

Benchmarking

Vertical scaling sucks.

Horizontal scaling rocks.

Run many application servers

Don't keep state in the app server

Be stateless

Optimization is necessary, but is different than scalability.

Cache things you hit all the time.

Measure, don't assume, check.

Make pages static.

Caching is a trade-off.

Cache full pages.

Cache partial pages.

Cache complex data.

MySQL query cache is flushed on update.

Cache invalidation is hard.

Replication scales reads, not writes.

Partition to scale writes. 96% of applications can skip this step.

Master-master setup facilitates on-line schema changes.

Create summary tables and summary databases rather than do COUNT and GROUP-BY at runtime.

Make code idempotent. If it fails you should just be able to run it again.

Load data asynchronously. Aggregate updates into batches.

Move processing to application and out of the database as much as possible.

Stored procedures are dangerous.

Add more memory.

Enable query logging and take a look at what your app is doing.

Run different MySQL instances for different work loads.

Config tuning helps, query tuning works.

Reconsider persistent DB connections.

Don't overwork the database. It's hard to scale.

Work in parallel.

Use a job queuing system.

Log http requests.

Use light processes for light tasks.

Build on APIs internally. Clean loosely coupled APIs are easy to scale.

Don't incur technical debt.

Automatically handle failures.

Make services that always work.

Load balancing is the key to horizontal scaling.

Redundancy is not load-balancing. Always have n+1 capacity.

Plan for disasters.

Make backups.

Keep software deployments easy.

Have everything scripted.

Monitor everything. Graph everything.

Run one service per server.

Don't ever swap memory for disk.

Run memcached if you have extra memory.

Use memory to save CPU or IO. Balance memory vs CPU vs IO.

Netboot your application servers.

There are a lot of good slides on what to graph.

Use a CDN.

Use YSlow to find client side problems.

This is just a high level blitz through the presentation. Topics are given a lot more detail in the presentation. Audio of Ask's dulcet tones would be nice, but there's still a lot to learn here.
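To pick one item from the list, "create summary tables rather than do COUNT and GROUP-BY at runtime" can be sketched as follows. The example uses Python's built-in `sqlite3` rather than MySQL so it is self-contained, and the `comments` and `comment_counts` tables are invented for illustration; they are not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT);
    CREATE TABLE comment_counts (post_id INTEGER PRIMARY KEY, total INTEGER);
    """
)


def add_comment(conn, post_id, body):
    """Insert a comment and bump the summary row in the same transaction."""
    with conn:
        conn.execute(
            "INSERT INTO comments (post_id, body) VALUES (?, ?)", (post_id, body)
        )
        updated = conn.execute(
            "UPDATE comment_counts SET total = total + 1 WHERE post_id = ?",
            (post_id,),
        ).rowcount
        if updated == 0:
            conn.execute(
                "INSERT INTO comment_counts (post_id, total) VALUES (?, 1)",
                (post_id,),
            )


for post_id, body in [(1, "first"), (1, "nice"), (2, "hello")]:
    add_comment(conn, post_id, body)

# Read path: a primary-key lookup instead of COUNT(*) ... GROUP BY.
summary = dict(conn.execute("SELECT post_id, total FROM comment_counts"))
recount = dict(
    conn.execute("SELECT post_id, COUNT(*) FROM comments GROUP BY post_id")
)
print(summary, recount)
```

The write path does a little more work so the read path, which runs far more often, stays cheap. In a busier system the counter updates could also be batched asynchronously, in the spirit of the "aggregate updates into batches" item.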

Todd Hoff | 3 Comments

Strategy

TechDev Stages

Sunday, August 16, 2009 at 5:34PM

Tech Dev Stages explains the basic steps involved in product development for a given business problem. A must read for newcomers to architecture development.

paragarora |

ThePort Network Architecture

Sunday, August 16, 2009 at 6:26AM

ThePort Network's Director of Engineering, TJ Muehleman, was kind enough to share some of the architectural details for their white label social media system. It currently runs about 50 social networks varying in size from less than 1000 members to more than 300,000 members, all on a Microsoft stack. In addition to their social networking platform, they offer Javascript APIs and web service APIs (both REST and SOAP) which account for a significant percentage of overall system usage.

ThePort is an excellent example of a real world in-the-trenches product offering real value to customers. One of the most interesting problems they have to solve is multi-tenancy. How do you provide good performance, complete customization, support, develop new features, and provide individual search indexes for each customer? It's not an easy problem to solve.

How did they solve their problems and build a successful system?

Site: http://theport.com

Platform

Microsoft.NET 3.5

C# / VB.NET

SQL Server 2005

Visual Studio 2008 Pro Edition

Prototype

Subversion

TortoiseSVN

Trac (for internal defect tracking. Will possibly move all internal and external issue tracking to it)

Beyond Compare 3

Web Tier
* 6 x Dell blade servers running windows 2008 / IIS 7

Data Tier
* 1 r/w SQL Cluster – dell 6850s (6 single core processors, 32 GB RAM)
* 2 read-only dell 2950 (2 quad core processors, 16 GB RAM)
* 1 distribution server – dell 2950 (2 quad core processors, 16 GB RAM)

We also use SQL Server Service Broker as a queuing system for some of our saves. It's an alternative to MSMQ that uses the DB for persistence in case of failure. We will most likely be moving to MSMQ in the near future to remove us from SQL dependence.

Caching
* 2 Dell blade servers 8 GB RAM each to total 16 GB of available RAM
* Running SharedCache (Basically an open source .NET port of MemCacheD. We initially looked at MemCacheD but our internal benchmarking indicated SharedCache had better performance – at least w/in a Microsoft environment. We may still investigate Microsoft's Velocity cache platform when it goes live)

Search
* 2 Dell 2950s with 725 GB Storage
* Running Lucene + SOLR
* We chose Lucene over Lucene.NET because Lucene.NET's wildcard search was a little buggy in our initial beta testing. SQL Full Text wasn't a viable option because there was no clear and easy way to split indexes between customers. SOLR cores make this part easy. Above and beyond that, Lucene is lightning fast and is available with features we couldn't turn down (proximity search, searching w/in documents, and built-in RESTful APIs to name a few)

How do you handle multi-tenancy?

A multi-tenant platform has two primary hurdles to overcome:

1. Preventing a single, large customer from overwhelming the system.

The primary bottleneck for this is in the data layer. Our current DB architecture has helped mitigate this problem. The read-only servers help offset most of this by absorbing the bulk of the data calls. We did have to beef up the distribution server because latency between the r/w server and the read only servers had crept too high. Getting a new machine (2 quad cores with 16 GB of RAM) helped reduce the latency to less than a second.

However robust the cluster is, we've concluded that we will eventually have to move to a sharded architecture with MySQL. MS SQL licensing fees make both continuing to enhance the cluster and scaling out to multiple machines prohibitive. Additionally, sharding allows us to scale either by customer (because some may be more active than others) or by functional area (photos, comments, etc).
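
The two sharding axes mentioned here, by customer and by functional area, can be sketched as a small routing function. This is only an illustration under assumed names: the shard list, the pinned-customer table, and the functional-area table are hypothetical, and the article does not describe the layout ThePort eventually chose.

```python
import hashlib

# Hypothetical shard names; the real layout is not described in the article.
SHARDS = ["shard0", "shard1", "shard2", "shard3"]

# Hypothetical override: an unusually active customer gets a dedicated shard.
PINNED_CUSTOMERS = {"big-customer": "shard-dedicated"}

# Hypothetical functional areas that live on their own shards.
FUNCTIONAL_SHARDS = {"photos": "shard-photos", "comments": "shard-comments"}


def shard_for_customer(customer_id):
    """Route by customer: pinned customers first, otherwise a stable hash."""
    if customer_id in PINNED_CUSTOMERS:
        return PINNED_CUSTOMERS[customer_id]
    digest = hashlib.md5(customer_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]


def shard_for_area(area, customer_id):
    """Route by functional area when one is configured, else by customer."""
    if area in FUNCTIONAL_SHARDS:
        return FUNCTIONAL_SHARDS[area]
    return shard_for_customer(customer_id)
```

A stable hash keeps a customer on the same shard across requests, while the override table handles the customer that is far more active than the rest.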

2. Allowing clients to have total control over the look, feel, and user experience of their sites.

Allowing CSS control isn't enough; we needed a templating system that allows total control over the site. We looked at using .NET master pages and user controls to accomplish this. But that assumes a level of knowledge in .NET for outside developers. We built a proprietary templating system that unfortunately became too limiting and would one day lead to a drag on performance.

So we settled on using XML / XSLT. All of our business / entity objects are serializable to XML. This made XSLT a natural choice from the templating angle. We've seen a considerable boost in performance from this upgrade and an even greater increase in flexibility in terms of what our designers can do. Once the learning curve is overcome, the web designers love the amount of control they get.

What did you do that was especially cool that people could learn from?

XSLT as Custom Templating System

We built a templating system in XSLT that lets the template author make a web service call to our internal web service layer (or to external web services) straight from the template. This gave the development team a flexible, powerful system in which a web designer can embed real-time calls into a given template. We accomplish this using XSLT Extension Objects. What we've found in our internal testing is that these extension objects scale far better than our previous templating system (a homegrown proprietary system). We've used ANTS profiler to compare the two and the difference is orders of magnitude.

Obviously we have to cache the hell out of this or the performance of the pages the calls are embedded in would suffer. For now, we make the internal web services calls via HTTP, but we will soon be moving this to a TCP call to take advantage of the better connection pooling offered by TCP. We're most likely to use WCF because of its native support for TCP bindings. However, we haven't yet benchmarked that, so it's possible it could change.
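
The caching step can be illustrated with a small time-to-live cache wrapped around the service call. This is a sketch of the general technique, not ThePort's implementation: the `TTLCache` class, the key format, and the 60 second lifetime are all assumptions, and a production system would use a shared cache rather than an in-process dictionary.

```python
import time


class TTLCache:
    """In-process cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested
        self._store = {}    # key -> (expires_at, value)

    def get_or_call(self, key, fetch):
        """Return the cached value, calling fetch() only on a miss or expiry."""
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = fetch()
        self._store[key] = (now + self.ttl, value)
        return value


if __name__ == "__main__":
    cache = TTLCache(ttl_seconds=60)

    def call_service():
        # Stand-in for the web service call embedded in a template.
        return "<comments count='3'/>"

    # The second lookup is served from the cache without calling the service.
    print(cache.get_or_call("comments:site42:page1", call_service))
    print(cache.get_or_call("comments:site42:page1", call_service))
```

With a wrapper like this, a template that is rendered many times per minute triggers at most one service call per key per expiry window.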

Not Using the Database to Build Collections

Another cool thing we've done is to move strongly away from using the database to retrieve collections of 'things'. For instance, if we needed a collection of comments, previously we'd hit the database for the 5, 10, 100, etc comments we wanted, do the sorting / filtering in the DB, return a single dataset, cache that, and then display.

However, this is a database intensive operation, especially if you're going to join against user data (which you inevitably will). What we've started doing recently is caching the recent comment objects and using our cache provider's MultiGet ability to retrieve all of the comments in a single call. We then sort / filter in memory in the application tier, discard whatever comments we don't need, and then display. Doing it this way saves a lot of hits to our database, and we saw a considerable performance gain from it.

Our tests (on a developer laptop) fetched 10,000 objects from cache in about 1 second, then sorted them by date/time in about 0.015 seconds.
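
The pattern described above (one multi-get against the cache, then sorting and filtering in the application tier) can be sketched as follows. The `DictCache` class is a stand-in for a distributed cache client, and its `get_multi` method, the `comment:<id>` key format, and the comment fields are hypothetical names chosen for the example.

```python
from datetime import datetime


class DictCache:
    """Stand-in for a distributed cache client with a multi-get call."""

    def __init__(self, items):
        self._items = dict(items)

    def get_multi(self, keys):
        # One round trip for many keys; missing keys are simply absent.
        return {k: self._items[k] for k in keys if k in self._items}


def recent_comments(cache, comment_ids, limit, approved_only=True):
    """Fetch comments in one multi-get, then filter and sort in memory."""
    keys = ["comment:%d" % cid for cid in comment_ids]
    comments = list(cache.get_multi(keys).values())
    if approved_only:
        comments = [c for c in comments if c["approved"]]
    comments.sort(key=lambda c: c["created"], reverse=True)  # newest first
    return comments[:limit]  # discard whatever is not needed


if __name__ == "__main__":
    cache = DictCache({
        "comment:1": {"id": 1, "created": datetime(2009, 8, 1, 9, 0), "approved": True},
        "comment:2": {"id": 2, "created": datetime(2009, 8, 1, 11, 0), "approved": False},
        "comment:3": {"id": 3, "created": datetime(2009, 8, 1, 10, 0), "approved": True},
    })
    print([c["id"] for c in recent_comments(cache, [1, 2, 3, 4], limit=2)])
```

The database is not touched at all on this path; a cache miss (comment 4 in the usage example) just drops out of the result, so a real system would need a fallback that reloads missing objects.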

What prompted you to move to a SOA architecture?

To better compartmentalize our code.

Given the growth of our templating system mentioned above, we realized it was best to truly separate the tiers into discrete areas. Since our application is easily accessed via a set of REST APIs and our own internal skinning system (and who knows what in the future), dividing the application like this gives us a lot of leeway in being able to swap out components. Additionally, we're doing more and more queuing which lines up nicely with SOA.

Performance

Since modern web apps deal with complex data, breaking the work into more discrete operations handled by offline processes on their own infrastructure makes a lot of sense from a performance point of view.

How do you handle consistency between the database and the search engine?

We have a multi-threaded Windows service that scans our database once every 5 minutes looking for new data. The service then adds the new items to the Lucene index. We keep audit columns on all our database tables, so capturing new data is pretty simple. Once a night, we purge the Lucene index and run a full rescan of the database. We think this system will work for the near to mid term, but long term we'll take advantage of a queuing system to keep the index in sync.
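
A rough sketch of this scan-and-rebuild approach is shown below, using SQLite and a plain dictionary in place of SQL Server and the Lucene index. The table name, the `modified_at` audit column, and both function names are assumptions made for the example.

```python
import sqlite3


def sync_new_rows(conn, index, last_seen):
    """Copy rows changed after last_seen into the index.

    Returns the new high-water mark to pass to the next scan.
    Timestamps are ISO 8601 strings, so string comparison orders them.
    """
    rows = conn.execute(
        "SELECT id, body, modified_at FROM comments "
        "WHERE modified_at > ? ORDER BY modified_at",
        (last_seen,),
    ).fetchall()
    for row_id, body, modified_at in rows:
        index[row_id] = body  # stand-in for adding a document to Lucene
        last_seen = max(last_seen, modified_at)
    return last_seen


def full_rebuild(conn, index):
    """Nightly job: purge the index and rescan the whole table."""
    index.clear()
    return sync_new_rows(conn, index, "")


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT, modified_at TEXT)"
    )
    conn.execute("INSERT INTO comments VALUES (1, 'first', '2009-08-01T10:00:00')")
    index = {}
    mark = sync_new_rows(conn, index, "")
    conn.execute("INSERT INTO comments VALUES (2, 'second', '2009-08-01T10:05:00')")
    mark = sync_new_rows(conn, index, mark)
    print(sorted(index), mark)
```

Note that the incremental scan never sees deletions, and the strict `>` comparison can miss a row written with the same timestamp as the last one scanned; the nightly full rebuild cleans up both kinds of drift.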

How do you handle release, support, bug fixing, development, etc.?

We have a decent sized dev team. 1 platform architect responsible for overall system architecture (selecting which systems to use, tuning them), 1 lead software architect, and 3 senior – mid level developers. Since we're a start-up in a fast evolving market (social media) we find that we're constantly having to adjust to market demands and the latest in social functionality. So we have a 2 month build cycle which is pretty aggressive.

In terms of actual development, we've found the following to be keys to success:
1. Daily stand-ups: it's absolutely necessary for everyone on the team to know what the others are doing. With a code base as large as ours, it's very likely I'm writing a function someone has already written or solving a problem someone has solved previously. Daily stand-ups help with that.
2. Iterate: Build the core functionality, get it into QA and / or beta, beat the bugs out of it, move to the next piece. We've found this to be easier said than done. Market pressures sometimes dictate you roll with something more feature rich than you'd like. Sticking to an iterative cycle creates better code and more market ready products.
3. Beta test: This goes hand-in-hand with #2 above. Get something done and get it in the hands of actual users. This is the best way to find where your app falls down.

With regard to support / bug fixing, we're moving to a forums based support model for many of our customers. We've found the same problems, especially in an app as configurable as ours, occur over and over. Getting those answers into an open, searchable format should hopefully cut down on confusion and get developers talking directly to developers.

Internally we use Trac for bug tracking and devote roughly 20% of our week to maintaining, supporting, and fixing issues. That may seem like a lot, but given how configurable our system is, we're essentially running 50 heavily data driven websites.

WCF sounds like a buggy underperforming mess. How is it working?

So far we have no complaints with WCF. I think baking it directly into .NET 3.5 helped iron a lot of the big kinks out. It does come with its quirks, no doubt. We built our REST libraries on top of it and found that posting XML is not exactly the easiest thing in the world. But that was more than made up for by the ease of deploying all our GET operations with REST. Our next step will be to set up TCP and MSMQ bindings with WCF to handle our internal service requests and queuing, respectively. Since WCF exposes all of these bindings natively, we think we will see a lot of effective code re-use out of this.

I'd like to thank TJ for taking the time and making the effort to write up their architecture for people to learn from. I'm sure it will help others when they are trying to build their own systems.

You too can share the architecture for your amazing system. Come on, you've learned a lot from others, it's time to return the favor and give back. It's not that hard, really. If interested please contact me and we can get started.

Todd Hoff |

Example,

Microsoft

Thursday

Aug132009

Reconnoiter - Large-Scale Trending and Fault-Detection

Thursday, August 13, 2009 at 3:15PM

One of the top recommendations from the collective wisdom contained in Real Life Architectures is to add monitoring to your system. Now! Loud is the lament for not adding monitoring early and often. The reason is easy to understand. Without monitoring you don't know what your system is doing which means you can't fix it and you can't improve it. Feedback loops require data.

Some popular monitoring options are Munin, Nagios, Cacti and Hyperic. A relatively new entrant is a product called Reconnoiter from Theo Schlossnagle, President and CEO of OmniTI, leading consultants on solving problems of scalability, performance, architecture, infrastructure, and data management. Theo's name might sound familiar. He gives lots of talks and is the author of the very influential Scalable Internet Architectures book.

So right away you know Reconnoiter has a good pedigree. As Theo says, their products are born of pain, from the fire of solving real-life problems and that's always a harbinger of good things to come.

The problem Reconnoiter is trying to solve is monitoring thousands of nodes across many datacenters where the nodes can vary widely in power, architecture, and software configuration. With that kind of problem what they really want is the ability to:

Configure everything from one place.

Run cheap checks that are made on the specified time interval, aren't late, and don't cause a heavy load on the machine.

Change the configuration from any datacenter without coordination.

Add checks in the field.

Separate data collection from visualization and fault-detection.

Analyze trends for long-term capacity planning and postmortem analysis.

Detect when faults have happened and when they are about to happen.

Support trending: intelligent data correlation, regression analysis/curve fitting, and looking into the past to see how you got where you are now so you can do better next time.

Create a monitoring system that doesn't require a separate powerful network and its own set of hosts on which to run.
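
As a small illustration of the regression analysis/curve fitting item in the list above, the sketch below fits a straight line to historical measurements and estimates when a capacity threshold will be crossed. It is a generic least-squares example, not Reconnoiter's implementation; the function names and the disk-usage numbers are made up for the illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def days_until(threshold, slope, intercept, today):
    """Days from 'today' until the fitted line reaches threshold.

    Returns None when the trend is flat or falling.
    """
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - today)


if __name__ == "__main__":
    days = [0, 1, 2, 3]            # hypothetical sample days
    disk_gb = [10, 12, 14, 16]     # hypothetical disk usage per day
    slope, intercept = fit_line(days, disk_gb)
    print(slope, intercept, days_until(100, slope, intercept, today=3))
```

A straight line is the simplest possible model; real capacity data often needs seasonal or nonlinear fits, which is exactly where keeping the full history pays off.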

If you've ever used or written a distributed stats collection system the architecture of Reconnoiter will look somewhat familiar.

Some of the more interesting bits of the architecture are:

PostgreSQL stores all the data. The data isn't stuck in funky little files.

Fault-detection is based on Esper, a streaming complex event processing system. It's not clear how well this approach will work but the hooks are there.

A Comet-style web server is used to feed real-time updates. Much better than your traditional polling cycle.

Although the web console is PHP based, PHP is used mainly to execute JSON calls. Rendering happens in the browser in an AJAX client.

Canvas is used for real time graphics. No images are created on the fly.

Data is transferred securely over SSL.

The system is robust against failures.

Data is not thrown away as it is with some systems so you can check against history.

Reconnoiter isn't completely pain free. Lua for an extension language is an interesting choice. The installation and configuration process is very complex. There are a lot of separate steps and bits to configure. Another potential problem is that monitoring produces a lot of real-time data. I have to wonder if PostgreSQL can handle that flow for very large systems. The data is partitioned by month, but a large number of machines and a large number of events can be crushing. And I wasn't sure if graph data could be correlated with released features or other system changes. In the video Theo mentions seeing in the graphs that using deflate improved performance, but I'm not sure how, just by looking at the graph, you would be able to correlate system data with system changes.

It's droolingly clear that where Reconnoiter shines is in creating complex graphs, charts, and other visualizations. The graphs look useful and quick to render. The real-time visualizations are spectacular and are extremely difficult to do in other systems.

OmniTI Reconnoiter: Web Management and Analysis by Eric J. Bruno

Reconnoiter Update by Theo Schlossnagle

Reconnoiter Project Home Page

Video: Reconnoiter: a whirlwind tour

Big Picture of the Overall System

Reconnoiter: Monitoring and Trend Analysis from OSCON

OmniTI Unveils Open Source Monitoring Tool, Reconnoiter by Jayashree Adkoli

The sad state of open source monitoring tools by Grig Gheorghiu

How to Succeed at Capacity Planning Without Really Trying : An Interview with Flickr's John Allspaw on His New Book.

New open source IT management tool: Lighter-weight than Nagios, more granular than Cacti by Matt Stansberry

Todd Hoff |
