Entries in Java (39)

Monday
Nov 12, 2007

a8cjdbc - Database Clustering via JDBC

Practically no software project nowadays could survive without a database (DBMS) backend storing all the business data that is vital to you and/or your customers. As projects grow, the amount of data usually grows exponentially, so you start moving the DBMS to a separate server to gain more speed and capacity. Which is all good and healthy, but it does not buy you any extra safety for this business data. You might be backing up your database once a day so that if the database server crashes you don't lose EVERYTHING, but how much can you really afford to lose? Clearly this depends on what kind of data you are storing. In our case the users of our solutions use our software products to do their everyday (all day) work. They have "everything" they need for their business stored in the database we provide. So is 24 hours of data loss acceptable? No, not really. One hour? Maybe. But what we really want is a second database running with the EXACT same data. We mostly use PostgreSQL, which does not have built-in replication. There are trigger-based solutions for replicating data from one database to another, but we have learned that setting them up on an existing database with plenty of tables is rather complicated, and that the database structure can no longer be changed afterwards with simple create/alter statements. And since we ARE running solutions that constantly change and improve, we need to be able to deploy updates, including database structure changes, quickly and easily. So what we really wanted was a transparent JDBC layer that does the replication for us (a rough sketch of the idea follows the list below). We tested a great solution called "Sequoia", but it is a rather heavy-weight product with a lot of features that did not really help in the performance department and that we didn't need anyway. What we needed was:

  • a JDBC driver so the application does not know anything about the replication
  • of course: transactional safety for write operations
  • load-balanced reads (we are running 2 database servers, so why waste the ability to do parallel reads from 2 servers and almost multiply the performance by 2?)
  • for backups: the ability to detach one server, do the backup on that machine and then reattach the server
  • automatic and transparent failover / failsafe
  • Fast In-VM-Replication - no serialisation
  • Easy integration
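
a8cjdbc's own code isn't shown here, so the class below is only a minimal sketch of the routing idea behind these requirements: reads go to any one replica, writes are applied to every replica inside a transaction. The JDBC URLs are placeholders, and a real driver would also handle failover, detaching a node for backups, and proper distributed commit.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ThreadLocalRandom;

    /** Illustration only: route reads to one replica, apply writes to all replicas. */
    public class ReplicatedExecutor implements AutoCloseable {

        private final List<Connection> replicas = new ArrayList<>();

        public ReplicatedExecutor(List<String> jdbcUrls) throws SQLException {
            for (String url : jdbcUrls) {          // e.g. two PostgreSQL servers (placeholder URLs)
                replicas.add(DriverManager.getConnection(url));
            }
        }

        /** Load-balanced read: any replica will do, since all hold the same data. */
        public ResultSet executeRead(String sql) throws SQLException {
            Connection c = replicas.get(ThreadLocalRandom.current().nextInt(replicas.size()));
            return c.createStatement().executeQuery(sql);   // sketch: resource handling elided
        }

        /** Replicated write: apply the statement everywhere, commit all or roll back all. */
        public void executeWrite(String sql) throws SQLException {
            for (Connection c : replicas) c.setAutoCommit(false);
            try {
                for (Connection c : replicas) {
                    try (Statement st = c.createStatement()) {
                        st.executeUpdate(sql);
                    }
                }
                for (Connection c : replicas) c.commit();
            } catch (SQLException e) {
                for (Connection c : replicas) c.rollback();
                throw e;
            }
        }

        @Override
        public void close() throws SQLException {
            for (Connection c : replicas) c.close();
        }
    }

With the real driver the application would simply point its JDBC URL at the replication layer and stay completely unaware of any of this.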


Sunday
Nov 11, 2007

LinkedIn Architecture

Hi, an interesting post on LinkedIn architecture: http://furiouspurpose.blogspot.com/2007/11/qcon-linkedin-architecture.html


Thursday
Nov 08, 2007

ID generator

Hi, I would like feedback on an ID generator I just made. What positive and negative effects do you see with this? It's programmed in Java, but could just as easily be programmed in any other typical language. It's thread safe and does not use any synchronization. When testing it on my laptop, I was able to generate 10 million IDs in about 15 seconds, so it should be more than fast enough. Take a look at the attachment (had to rename it from IdGen.java to IdGen.txt to attach it): IdGen.java
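
The attached IdGen.java isn't reproduced here, so the snippet below is only an illustration of one common way to get unique IDs without synchronization: each thread claims a distinct high-order prefix once, then increments a purely thread-local counter on the hot path. Whether the poster's generator works this way is unknown.

    import java.util.concurrent.atomic.AtomicLong;

    /** Illustration only (not the attached IdGen.java): lock-free unique IDs. */
    public class IdSketch {

        // Each thread claims a distinct high-order prefix exactly once (a single CAS,
        // no synchronized blocks anywhere).
        private static final AtomicLong NEXT_PREFIX = new AtomicLong();

        // state[0] = prefix in the high bits, state[1] = per-thread counter
        private static final ThreadLocal<long[]> STATE = ThreadLocal.withInitial(
                () -> new long[] { NEXT_PREFIX.getAndIncrement() << 40, 0 });

        /** Hot path touches only thread-local state, so it is thread safe without locks. */
        public static long nextId() {
            long[] s = STATE.get();
            return s[0] | s[1]++;
        }

        public static void main(String[] args) {
            long start = System.nanoTime();
            for (int i = 0; i < 10_000_000; i++) nextId();
            System.out.printf("10M ids in %d ms%n", (System.nanoTime() - start) / 1_000_000);
        }
    }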


Tuesday
Oct 30, 2007

Paper: Dynamo: Amazon’s Highly Available Key-value Store

Update 2: Read/WriteWeb has a good article talking about the scalability issues of relational databases and how Dynamo solves them: Amazon Dynamo: The Next Generation Of Virtual Distributed Storage. But since Dynamo is just another frustrating walled garden protected by barbed wire and guard dogs, its relevance is somewhat overstated.

Update: Greg Linden has a take on the paper where he questions some of Amazon's design choices: emphasizing write availability over fast reads, a lack of indexing support, use of random distribution for load balancing, and punting on some scalability issues.

Werner Vogels, Amazon's avuncular CTO, just announced a new paper on the internal database technology Amazon uses to handle tens of millions of customers. I'll dive into more details later, but I thought you'd want to read it hot off the blog. The bad news is it won't be a service. They are keeping this tech not so secret, but very safe. Happily, it's another real-life example to learn from. As many top websites use a highly tuned key-value database at their core instead of an RDBMS, it's an important technology to understand. From the abstract you can get a feel for what the paper is about:

Reliability at massive scale is one of the biggest challenges we face at Amazon.com, one of the largest e-commerce operations in the world; even the slightest outage has significant financial consequences and impacts customer trust. The Amazon.com platform, which provides services for many web sites worldwide, is implemented on top of an infrastructure of tens of thousands of servers and network components located in many datacenters around the world. At this scale, small and large components fail continuously and the way persistent state is managed in the face of these failures drives the reliability and scalability of the software systems. This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience. To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.
My first impressions after reading the paper:
  • Wow. But crap, I'll never be able to build anything like that. This is really competition through better infrastructure. Take that Google :-)
  • Their purposeful embracing of probability and managed centers of uncertainty must be dizzying for those from an RDBMS background. In an RDBMS it's all right angles. You write something and it's assumed consistent, correct, and durable. Now, how do you do this at scale across multiple data centers under failure conditions? There's the rub. So Amazon says writes must go through and we will deal with the complexities that model generates. They version objects and merge them later (a small sketch of that idea follows this list). Who does that? I love it, because when you delve into these problems you realize you need this type of functionality, but it's too complex, so you back away and continue trying to force a square peg into a round hole. To have no fear to go where your requirements lead you is real engineering.
  • Can you imagine finding a problem in that system? I'd love to be a fly on the wall in those debugging sessions. But infrastructure takes on a consciousness of its own when dealing with complex problems, so you just have to deal with knowing you don't know anymore. A lot of this thinking is driven by the CAP conjecture, which states that it's impossible for a web service to simultaneously guarantee consistency, availability, and partition tolerance. When you get over your initial "that can't be true" reaction and embrace it, you get something like Dynamo. I'd really love to hear what you guys think about Dynamo.
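
The paper's "object versioning with application-assisted conflict resolution" is usually built on vector clocks. The sketch below is an illustration of that mechanism, not Dynamo's code: it shows the comparison that decides whether one version supersedes another or the two must be reconciled by the application.

    import java.util.HashMap;
    import java.util.Map;

    /** Sketch of vector-clock versioning; illustrative, not Dynamo's implementation. */
    public class VectorClock {

        private final Map<String, Long> counters = new HashMap<>();

        /** Record that the given replica coordinated a write of this object version. */
        public void increment(String replicaId) {
            counters.merge(replicaId, 1L, Long::sum);
        }

        /** True if every counter in this clock is <= the matching counter in other. */
        public boolean isDominatedBy(VectorClock other) {
            for (Map.Entry<String, Long> e : counters.entrySet()) {
                if (e.getValue() > other.counters.getOrDefault(e.getKey(), 0L)) return false;
            }
            return true;
        }

        /** Neither version dominates the other: the application has to reconcile them. */
        public boolean conflictsWith(VectorClock other) {
            return !this.isDominatedBy(other) && !other.isDominatedBy(this);
        }
    }

When conflictsWith returns true, the store hands both versions to the application, which merges them, for example by taking the union of two divergent shopping carts.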


    Wednesday
    Oct 03, 2007

    Why most large-scale Web sites are not written in Java

    There is a lot of information in the blogosphere describing the architecture of many popular sites, such as Google, Amazon, eBay, LinkedIn, TypePad, WikiPedia and others. I've summarized this issue in a blog post here. I would really appreciate your opinion on this matter.


    Tuesday
    Oct 02, 2007

    Secrets to Fotolog's Scaling Success

    Fotolog, a social blogging site centered around photos, grew from about 300 thousand users in 2004 to over 11 million users in 2007. Though they initially experienced the inevitable pains of rapid growth, they overcame their problems and now manage over 300 million photos, with 800,000 new photos added each day. Generating all that fabulous content are 20 million unique monthly visitors and a volunteer army of 30,000 new users each day. They did so well that a very impressed suitor bought them out for a cool $90 million. That's scale meets success by anyone's standards. How did they do it? Site: http://www.fotolog.com/

    Information Sources

  • Scaling the World's Largest Photo Blogging Community
  • Congrats to Fotolog on $90mm sale to Hi-Media
  • Fotolog overtaking Flickr?
  • Fotolog Hits 11 Million Members and 300 Million Photos Posted
  • Site of the Week: Fotolog.com by PC Magazine
  • CEO John Borthwick's Blog.
  • DBA Frank Mash's Blog
  • Fotolog, lessons learnt by John Borthwick.

    The Platform

  • Java
  • PHP
  • Sun
  • Solaris 10
  • MySQL
  • Apache
  • Hibernate
  • Memcached
  • 3PAR (a simple, efficient and scalable tiered-storage array for utility computing)
  • IBRIX (a single namespace parallel file system, a scalable volume manager, high availability feature)
  • StrongMail
  • CDN: Akamai/Panther

    The Stats

  • Started in 2002. In 2004 they had around 300k or 400k members, 3 employees, no scalable infrastructure, and no revenue model.
  • Due to the rapid growth the site had frequent technical problems, and in 2005 they had to limit new free members to 1,000 a day.
  • In 2007 they had over 11 million users and were sold for $90 million to Hi-Media.
  • Members are from over 200 countries with a majority in South America. Over 20% of page views are from Europe. They rejected a US centric strategy, developing a global and engaged audience.
  • Generates over 3.5 billion page views and receives over 20 million unique visitors each month and has earned a top 20 Alexa ranking.
  • Manages over 300 Million photos and over 500,000 photos are uploaded each day.
  • Over 30,000 new members are added each day and the site attracts more than 4.6 million daily users. Expanded with no marketing or member incentives.
  • Over 500 user-generated communities.
  • 20% of members visit the site daily and spend an average of 24 minutes.
  • 32 MySQL servers and a 30-server memcached cluster.

    The Architecture

  • Site originally written in PHP. - Their new "Fotolog memberpage" feature is written in Java with significant performance improvement. Page is cleaner with an improved response time. - They are now serving the site on less than half the boxes they were using. - Daily registrations are up over 35% given the improved performance and a requirement to register to post a guest book message. - The new code base allows them to innovate much more on the member experience.
  • They have surpassed Flickr in popularity while being a firmly Web 1.0 application. - There are no tags, no APIs, no JavaScript widgets, no Ajax. - They have a Spanish language option which extends the site to a broad user base. - They use very little text. It's mostly visual, so it is usable by a broad range of users. - Their interface is customizable and many people like to express their individual identities. - Their unique visitors are 1MM less than Yahoo's, yet the total minutes on the site are twice those of Yahoo and pages are 3x.
  • Revenue model: - Gold camera membership for about $5/month means you can upload 6 photos a day instead of 1, have 200 comments per photo instead of 20, a custom title image for your profile, a mini-thumbnail of your most recent photo displayed next to your name in guest books, plus the possibility of having your photo featured on the front page. - Adsense. Revenue lift from Google is trending up approximately 15% given additional contextual data from guest books. - Will move to peer-to-peer advertising among their members. - Members will have the ability to buy and sell real and virtual items using a micro-payment service.
  • They have a one-post-per-day rule where users can only post one photo a day. Rather than inhibit growth, this rule ensures quality and generates exceptional usage by increasing the chance of a photo being seen and by attracting positive comments. Whereas people usually run out of things to say on a blog, people can always find a picture to take, upload, and talk about.
  • Only photos less than 2,000 kb in size can be uploaded. These are automatically resized to a 500x500 format. Pages look cleaner and load faster.
  • Model is browsing over searching. Opportunistic serendipitous treasure hunting is encouraged.
  • Friends are added automatically without needing permission. This generates an audience for your photos.
  • Supports a browse by groups feature, which have categories like "Colors" and "Emotions."
  • The site is intentionally simple. - They have resisted the temptation to add feature after feature. Instead their vision is to offer a handful of features, similar to Craig's list, the focus being on content and the conversations. - Pages need to be social. - Pages need to include not only your images, but also images from across the network, providing a visual navigation that today drives much of the time their members spend on the site, a self formed, organic distribution system, letting members see and be seen. - Complementing this social network of images are comments and guest book entries — making the experience one where media intersects with communications, day in day out, millions of images collide with billions of conversations.
  • Photobucket vs Fotolog - Photobucket stores image-based media, then distributes it to your page on social networking sites such as Myspace, Bebo, Piczo, Friendster, etc. - Fotolog is a destination. - The first generation of social-networking sites stressed self-publishing over connections (from Geocities, to Tripod to Blogger). The next generation focused mostly on connections (sixdegrees, and friendster are the classic examples here — tools to gather friends and connections, as social capital accrues in theory to the people with the most connections). The third and current generation of sites blends media with connections — each with a different emphasis.
  • Backup: Sun 6540 disk array
  • Their 32 SQL servers are divided into four clusters - user, GB (guest book), PH (photos), FF (friends and favorites lists) - Uses non-persistent connections. - Connection pooling on the Java side. - InnoDB - Partitioning is handled by the application layer.
  • Each cluster: - Is fronted by a set of application servers. - Divided into a set of shards. - Each shard has a MySQL write-only master-master configuration feeding a few read-only slaves. - Application servers send their read requests to the slaves and their write requests to the masters. - Data are assigned to shards based on a cluster-specific partitioning key. Naive partitioning algorithms can lead to very uneven shard loads; you want a more balanced load on each shard (see the routing sketch after this list).
  • MySQL is used to store image metadata only. This seems pretty standard. Almost nobody seems to store important blobs in the database because it slows down database operations.
  • Photo storage uses 3PAR and IBRIX. A CDN is used for hot content.
  • The virtual storage system, though expensive, has worked very well.
  • As more selects are used, lock contention for auto-incremented keys grows.
  • Through database optimizations they've been able to grow from 4 million members to 11 million members on the same 32 database servers. This is also due to the efficiency of MySQL running on Solaris 10, and to increasing the memcached cluster, porting to Java, and adding RAM.
  • Happy with memcached. - Created a distributed cluster of 50 memcached servers with a total cache size of approximately 150 gigabytes, supporting around 4 billion page views/month. Peak load times dropped from 10 seconds to 2 seconds. - Quote from CTO:
    "I have a new memcached user to add to your list: we here at Fotolog, the world's largest photo blogging community, now use it and we love it. I just rolled our first code to use it into production today and it has been a lifesaver. I can't wait to start using it in places where we had been relying on Berkeley databases to offload some database work. We are not some wimpy million page a day site, either. Fotolog is a billion+ pages/month site (35 to 40 million views/day is pretty typical for us). We had recently overcome some significant DB-related performance issues which allowed our site traffic to explode, and it started to bog down again under the heavy traffic load (getting back up towards 10 seconds for a page to load sometimes during the peak periods). The servers were churning away each recreating a list every time when it could easily be shared in the same form for at least 5 or 10 minutes. So we introduced memcache, creating a distributed 30-server cluster with 4 gigs available in total and made a very minor code mod to use memcache, and our peak period load times dropped back down to the 2 second or so range. It has allowed for continued growth and incredible efficiency. I can't say when I've ever been so pleased with something that worked so simply."
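
    Fotolog's actual partitioning code isn't public; the sketch below only illustrates the application-layer routing described above, with hypothetical class and field names: a cluster-specific key (here a user id) picks the shard, writes go to that shard's master, and reads are spread across its slaves.

    import java.util.List;
    import java.util.concurrent.ThreadLocalRandom;
    import javax.sql.DataSource;

    /** Illustrative application-layer shard routing (names and layout are hypothetical). */
    public class ShardRouter {

        /** One shard: a master-master pair for writes feeding a few read-only slaves. */
        public static class Shard {
            final DataSource master;          // one of the master-master pair
            final List<DataSource> slaves;    // read-only replicas
            Shard(DataSource master, List<DataSource> slaves) {
                this.master = master;
                this.slaves = slaves;
            }
        }

        private final List<Shard> shards;     // e.g. the PH (photos) cluster's shards

        public ShardRouter(List<Shard> shards) {
            this.shards = shards;
        }

        /** Cluster-specific partitioning key (here: user id) maps a row to a shard. */
        private Shard shardFor(long userId) {
            // Naive modulo over raw ids can produce uneven loads; hashing spreads hot ranges out.
            int index = Math.floorMod(Long.hashCode(userId), shards.size());
            return shards.get(index);
        }

        /** Writes always go to the shard's master. */
        public DataSource writeSource(long userId) {
            return shardFor(userId).master;
        }

        /** Reads are spread across the shard's slaves. */
        public DataSource readSource(long userId) {
            List<DataSource> slaves = shardFor(userId).slaves;
            return slaves.get(ThreadLocalRandom.current().nextInt(slaves.size()));
        }
    }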

    Lessons Learned

  • Popularity is driven by a base of active users, not a rich set of cool features.
  • The web is global and its tail is very long. By courting users outside the US with language and culturally specific design you can compete with the big boys. Some of the hardest competition for Google, Yahoo, etc. comes from local startups with an ear to what the locals want.
  • If you want to get a lot of buzz then do whatever the alpha geeks want you to do. If you want a lot of happy users, do what they want you to do.
  • Constraints in web sites can, like in poetry, make something unexpectedly better. The rule that users are only allowed to post one photo per day creates an environment where people comment more on each other's photos, which creates a more engaged community. Who knew?
  • Protect your website with limits. Limit the size of pictures, comments, etc so your resource usage doesn't grow outrageously.
  • Have a vision. Have a strong sense of what your site is supposed to be and why, then use that vision to decide what you should build and how you should build it. Their vision of a social site built around daily photographs led to a very different site than one whose goal is to store all your photos.
  • Revenue generation features can be added without destroying the integrity of your site. I really like how they give people a reasonable set of features for free and then charge for the resources they need to have more. Those features also serve to extend and reinforce the social vision of their site. It will be interesting to see how their new monetization strategies play out.
  • Don't be afraid to scale up and out. By adding more cache, more RAM, more CPUs, and more efficient CPUs you handle dramatically more load with the same number of machines. And that's a good thing from a datacenter space and power POV.
  • Making MySQL perform: - Find the source of the problem. - Mature systems are mostly disk bound. - The query cache may be hurting you. - Add RAM to help dodge the bullet. - Stripe your disks. - Restructure tables for optimal performance. - Use libumem.so to find memory leaks.
  • Things to remember: - Know the problem - Know your application - Know your storage engine - Know your requirements - Know your budget - Use all this information to decide what parts of your system really require the investment of time, money, and testing to be highly available.

    Related Articles

  • Flickr Architecture
  • An Unorthodox Approach to Database Design : The Coming of the Shard


    Tuesday
    Sep 18, 2007

    Amazon Architecture

    This is a wonderfully informative Amazon update based on Joachim Rohde's discovery of an interview with Amazon's CTO. You'll learn about how Amazon organizes their teams around services, the CAP theorem of building scalable systems, how they deploy software, and a lot more. Many new additions from the ACM Queue article have also been included. Amazon grew from a tiny online bookstore to one of the largest stores on earth. They did it while pioneering new and interesting ways to rate, review, and recommend products. Greg Linden shared his version of Amazon's birth pangs in a series of blog articles. Site: http://amazon.com

    Information Sources

  • Early Amazon by Greg Linden
  • How Linux saved Amazon millions
  • Interview Werner Vogels - Amazon's CTO
  • Asynchronous Architectures - a nice summary of Werner Vogels' talk by Chris Loosley
  • Learning from the Amazon technology platform - A Conversation with Werner Vogels
  • Werner Vogels' Weblog - building scalable and robust distributed systems

    Platform

  • Linux
  • Oracle
  • C++
  • Perl
  • Mason
  • Java
  • Jboss
  • Servlets

    The Stats

  • More than 55 million active customer accounts.
  • More than 1 million active retail partners worldwide.
  • Between 100-150 services are accessed to build a page.

    The Architecture

  • What is it that we really mean by scalability? A service is said to be scalable if when we increase the resources in a system, it results in increased performance in a manner proportional to resources added. Increasing performance in general means serving more units of work, but it can also be to handle larger units of work, such as when datasets grow.
  • The big architectural change that Amazon made was to move from a two-tier monolith to a fully-distributed, decentralized, services platform serving many different applications.
  • Started as one application talking to a back end. Written in C++.
  • It grew. For years the scaling efforts at Amazon focused on making the back-end databases scale to hold more items, more customers, more orders, and to support multiple international sites. In 2001 it became clear that the front-end application couldn't scale anymore. The databases were split into small parts, and around each part they created a services interface that was the only way to access the data.
  • The databases became a shared resource that made it hard to scale-out the overall business. The front-end and back-end processes were restricted in their evolution because they were shared by many different teams and processes.
  • Their architecture is loosely coupled and built around services. A service-oriented architecture gave them the isolation that would allow building many software components rapidly and independently.
  • Grew into hundreds of services and a number of application servers that aggregate the information from the services. The application that renders the Amazon.com Web pages is one such application server. So are the applications that serve the Web-services interface, the customer service application, and the seller interface.
  • Many third party technologies are hard to scale to Amazon size. Especially communication infrastructure technologies. They work well up to a certain scale and then fail. So they are forced to build their own.
  • Not stuck with one particular approach. Some places they use jboss/java, but they use only servlets, not the rest of the J2EE stack.
  • C++ is used to process requests. Perl/Mason is used to build content.
  • Amazon doesn't like middleware because it tends to be a framework and not a tool. If you use a middleware package you get lock-in around the software patterns they have chosen. You'll only be able to use their software. So if you want to use different packages you won't be able to. You're stuck. One event loop for messaging, data persistence, AJAX, etc. Too complex. If middleware were available in smaller components, more as a tool than a framework, they would be more interested.
  • The SOAP web stack seems to want to solve all the same distributed systems problems all over again.
  • Offer both SOAP and REST web services. 30% use SOAP. These tend to be Java and .NET users and use WSDL files to generate remote object interfaces. 70% use REST. These tend to be PHP or PERL users.
  • In either SOAP or REST developers can get an object interface to Amazon. Developers just want to get the job done. They don't care what goes over the wire.
  • Amazon wanted to build an open community around their services. Web services were chosen because they're simple. But that's only on the perimeter. Internally it's a service oriented architecture. You can only access the data via the interface. It's described in WSDL, but they use their own encapsulation and transport mechanisms.
  • Teams are Small and are Organized Around Services - Services are the independent units delivering functionality within Amazon. It's also how Amazon is organized internally in terms of teams. - If you have a new business idea or problem you want to solve you form a team. Limit the team to 8-10 people because communication gets hard. They are called two pizza teams: the number of people you can feed off two pizzas. - Teams are small. They are assigned authority and empowered to solve a problem as a service in any way they see fit. - As an example, they created a team to find phrases within a book that are unique to the text. This team built a separate service interface for that feature and they had authority to do what they needed. - Extensive A/B testing is used to integrate a new service. They see what the impact is and take extensive measurements.
  • Deployment - They create special infrastructure for managing dependencies and doing a deployment. - Goal is to have all the right services deployed on a box. All application code, monitoring, licensing, etc. should be on a box. - Everyone has a home grown system to solve these problems. - Output of the deployment process is a virtual machine. You can use EC2 to run them.
  • Work From the Customer Backwards to Verify a New Service is Worth Doing - Work from the customer backward. Focus on value you want to deliver for the customer. - Force developers to focus on value delivered to the customer instead of building technology first and then figuring how to use it. - Start with a press release of what features the user will see and work backwards to check that you are building something valuable. - End up with a design that is as minimal as possible. Simplicity is the key if you really want to build large distributed systems.
  • State Management is the Core Problem for Large Scale Systems - Internally they can deliver infinite storage. - Not all that many operations are stateful. Checkout steps are stateful. - The most recently clicked web page service has recommendations based on session IDs. - They keep track of everything anyway, so it's not a matter of keeping state. There's little separate state that needs to be kept for a session. The services will already be keeping the information, so you just use the services.
  • Eric Brewer's CAP Theorem or the Three properties of Systems - Three properties of a system: consistency, availability, tolerance to network partitions. - You can have at most two of these three properties for any shared-data system. - Partitionability: divide nodes into small groups that can see other groups, but they can't see everyone. - Consistency: write a value and then you read the value you get the same value back. In a partitioned system there are windows where that's not true. - Availability: may not always be able to write or read. The system will say you can't write because it wants to keep the system consistent. - To scale you have to partition, so you are left with choosing either high consistency or high availability for a particular system. You must find the right overlap of availability and consistency. - Choose a specific approach based on the needs of the service. - For the checkout process you always want to honor requests to add items to a shopping cart because it's revenue producing. In this case you choose high availability. Errors are hidden from the customer and sorted out later. - When a customer submits an order you favor consistency because several services--credit card processing, shipping and handling, reporting--are simultaneously accessing the data.
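
    Amazon's cart service itself isn't public, so the snippet below is only a hedged illustration of the last bullet's "choose availability, reconcile later" idea: every replica accepts add-to-cart writes, and divergent copies of a cart are merged when read (assuming here that keeping the maximum quantity seen per item is an acceptable business rule).

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /**
     * Illustration of "favor availability, reconcile on read" (not Amazon's code):
     * any replica accepts an add-to-cart write; divergent replicas are merged later.
     */
    public class CartReconciliation {

        /** item id -> quantity */
        public static Map<String, Integer> merge(List<Map<String, Integer>> replicaCopies) {
            Map<String, Integer> merged = new HashMap<>();
            for (Map<String, Integer> copy : replicaCopies) {
                // Take the max quantity seen for each item so no accepted add is ever lost;
                // deletions would need tombstones, which this sketch leaves out.
                copy.forEach((item, qty) -> merged.merge(item, qty, Math::max));
            }
            return merged;
        }

        public static void main(String[] args) {
            Map<String, Integer> a = Map.of("book-123", 1, "cable-9", 2);
            Map<String, Integer> b = Map.of("book-123", 2);   // concurrent add on another replica
            System.out.println(merge(List.of(a, b)));         // {book-123=2, cable-9=2}
        }
    }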

    Lessons Learned

  • You must change your mentality to build really scalable systems. Approach chaos in a probabilistic sense that things will work well. In traditional systems we present a perfect world where nothing goes down and then we build complex algorithms (agreement technologies) on this perfect world. Instead, take it for granted stuff fails, that's reality, embrace it. For example, go more with a fast reboot and fast recover approach. With a decent spread of data and services you might get close to 100%. Create self-healing, self-organizing lights out operations.
  • Create a shared nothing infrastructure. Infrastructure can become a shared resource for development and deployment with the same downsides as shared resources in your logic and data tiers. It can cause locking and blocking and dead lock. A service oriented architecture allows the creation of a parallel and isolated development process that scales feature development to match your growth.
  • Open up your system with APIs and you'll create an ecosystem around your application.
  • The only way to manage a large distributed system is to keep things as simple as possible. Keep things simple by making sure there are no hidden requirements and hidden dependencies in the design. Cut technology to the minimum you need to solve the problem you have. It doesn't help the company to create artificial and unneeded layers of complexity.
  • Organizing around services gives agility. You can do things in parallel because the output is a service. This allows fast time to market. Create an infrastructure that allows services to be built very fast.
  • There's bound to be problems with anything that produces hype before real implementation
  • Use SLAs internally to manage services.
  • Anyone can very quickly add web services to their product. Just implement one part of your product as a service and start using it.
  • Build your own infrastructure for performance, reliability, and cost control reasons. By building it yourself you never have to say you went down because it was company X's fault. Your software may not be more reliable than others, but you can fix, debug, and deploy much quicker than when working with a 3rd party.
  • Use measurement and objective debate to separate the good from the bad. I've been to several presentations by ex-Amazoners and this is the aspect of Amazon that strikes me as uniquely different and interesting from other companies. Their deep-seated ethic is to expose real customers to a choice and see which one works best, and to make decisions based on those tests. Avinash Kaushik calls this getting rid of the influence of the HiPPOs, the highest paid people in the room. This is done with techniques like A/B testing and Web Analytics. If you have a question about what you should do, code it up, let people use it, and see which alternative gives you the results you want.
  • Create a frugal culture. Amazon used doors for desks, for example.
  • Know what you need. Amazon has a bad experience with an early recommender system that didn't work out: "This wasn't what Amazon needed. Book recommendations at Amazon needed to work from sparse data, just a few ratings or purchases. It needed to be fast. The system needed to scale to massive numbers of customers and a huge catalog. And it needed to enhance discovery, surfacing books from deep in the catalog that readers wouldn't find on their own."
  • People's side projects, the ones they follow because they are interested, are often the ones where you get the most value and innovation. Never underestimate the power of wandering where you are most interested.
  • Involve everyone in making dog food. Go out into the warehouse and pack books during the Christmas rush. That's teamwork.
  • Create a staging site where you can run thorough tests before releasing into the wild.
  • A robust, clustered, replicated, distributed file system is perfect for read-only data used by the web servers.
  • Have a way to rollback if an update doesn't work. Write the tools if necessary.
  • Switch to a deep services-based architecture (http://webservices.sys-con.com/read/262024.htm).
  • Look for three things in interviews: enthusiasm, creativity, competence. The single biggest predictor of success at Amazon.com was enthusiasm.
  • Hire a Bob. Someone who knows their stuff, has incredible debugging skills and system knowledge, and most importantly, has the stones to tackle the worst high pressure problems imaginable by just leaping in.
  • Innovation can only come from the bottom. Those closest to the problem are in the best position to solve it. Any organization that depends on innovation must embrace chaos. Loyalty and obedience are not your tools.
  • Creativity must flow from everywhere.
  • Everyone must be able to experiment, learn, and iterate. Position, obedience, and tradition should hold no power. For innovation to flourish, measurement must rule.
  • Embrace innovation. In front of the whole company, Jeff Bezos would give an old Nike shoe as "Just do it" award to those who innovated.
  • Don't pay for performance. Give good perks and high pay, but keep it flat. Recognize exceptional work in other ways. Merit pay sounds good but is almost impossible to do fairly in large organizations. Use non-monetary awards, like an old shoe. It's a way of saying thank you, somebody cared.
  • Get big fast. The big guys like Barnes and Noble are on your tail. Amazon wasn't even the first, second, or even third book store on the web, but their vision and drive won out in the end.
  • In the data center, only 30 percent of the staff time spent on infrastructure issues is related to value creation, with the remaining 70 percent devoted to dealing with the "heavy lifting" of hardware procurement, software management, load balancing, maintenance, scalability challenges and so on.
  • Prohibit direct database access by clients. This means you can make your service scale and be more reliable without involving your clients. This is much like Google's ability to independently distribute improvements in their stack to the benefit of all applications. (A minimal interface sketch follows this list.)
  • Create a single unified service-access mechanism. This allows for the easy aggregation of services, decentralized request routing, distributed request tracking, and other advanced infrastructure techniques.
  • Making Amazon.com available through a Web services interface to any developer in the world free of charge has also been a major success because it has driven so much innovation that they couldn't have thought of or built on their own.
  • Developers themselves know best which tools make them most productive and which tools are right for the job.
  • Don't impose too many constraints on engineers. Provide incentives for some things, such as integration with the monitoring system and other infrastructure tools. But for the rest, allow teams to function as independently as possible.
  • Developers are like artists; they produce their best work if they have the freedom to do so, but they need good tools. Have many support tools that are of a self-help nature. Support an environment around the service development that never gets in the way of the development itself.
  • You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service.
  • Developers should spend some time with customer service every two years. There they'll actually listen to customer service calls, answer customer service e-mails, and really understand the impact of the kinds of things they do as technologists.
  • Use a "voice of the customer," which is a realistic story from a customer about some specific part of your site's experience. This helps managers and engineers connect with the fact that we build these technologies for real people. Customer service statistics are an early indicator if you are doing something wrong, or what the real pain points are for your customers.
  • Infrastructure for Amazon, like for Google, is a huge competitive advantage. They can build very complex applications out of primitive services that are by themselves relatively simple. They can scale their operation independently, maintain unparalleled system availability, and introduce new services quickly without the need for massive reconfiguration.
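
    As a hedged illustration of "prohibit direct database access" and "a single unified service-access mechanism" (the names below are made up, not Amazon's internal API): clients depend only on a narrow service interface, so the owning team can reshard or rewrite the storage behind it without touching callers.

    import java.util.List;
    import java.util.Optional;

    /**
     * Hypothetical service interface that hides all storage details. Callers never
     * see the database; the owning team can change sharding, schema, or even the
     * storage engine behind this contract without breaking clients.
     */
    public interface CatalogService {

        /** Look up a single item by its id; empty if it does not exist. */
        Optional<ItemSummary> getItem(String itemId);

        /** Search over the catalog, already paginated by the service. */
        List<ItemSummary> search(String query, int page, int pageSize);

        /** The only data clients ever receive: a value object, never a database row. */
        record ItemSummary(String itemId, String title, long priceInCents) {}
    }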


    Monday
    Jul 23, 2007

    GoogleTalk Architecture

    Google Talk is Google's instant communications service. Interestingly, the IM messages aren't the major architectural challenge; handling user presence indications dominates the design. They also have the challenge of handling small, low-latency messages and integrating with many other systems. How do they do it? Site: http://www.google.com/talk

    Information Sources

  • GoogleTalk Architecture

    Platform

  • Linux
  • Java
  • Google Stack
  • Shard

    What's Inside?

    The Stats

  • Support presence and messages for millions of users.
  • Handles billions of packets per day in under 100ms.
  • IM is different than many other applications because the requests are small packets.
  • Routing and application logic are applied per packet for sender and receiver.
  • Messages must be delivered in-order.
  • Architecture extends to new clients and Google services.

    Lessons Learned

  • Measure the right thing. - People ask how many IMs you deliver or how many active users you have. That turns out not to be the right engineering question. - The hard part of IM is how to show correct presence to all connected users, because the work grows non-linearly: ConnectedUsers * BuddyListSize * OnlineStateChanges (a back-of-the-envelope calculation follows this list). - Linear user growth can mean very non-linear server growth, which requires serving many billions of presence packets per day. - Have a large number of friends and presence explodes. The number of IMs is not that big a deal.
  • Real Life Load Tests - Lab tests are good, but don't tell you enough. - Did a backend launch before the real product launch. - Simulate presence requests and going on-line and off-line for weeks and months, even if real data is not returned. It works out many of the kinks in network, failover, etc.
  • Dynamic Resharding - Divide user data or load across shards. - Google Talk backend servers handle traffic for a subset of users. - Make it easy to change the number of shards with zero downtime. - Don't shard across data centers. Try to keep users local. - Servers can be brought down and backups take over. Then you can bring up new servers, data is migrated automatically, and clients auto-detect and go to the new servers.
  • Add Abstractions to Hide System Complexity - Different systems should have little knowledge of each other, especially when separate groups are working together. - Gmail and Orkut don't know about sharding, load-balancing, or fail-over, data center architecture, or number of servers. Can change at anytime without cascading changes throughout the system. - Abstract these complexities into a set of gateways that are discovered at runtime. - RPC infrastructure should handle rerouting.
  • Understand Semantics of Lower Level Libraries - Everything is abstracted, but you must still have enough knowledge of how they work to architect your system. - Does your RPC create TCP connections to all or some of your servers? Very different implications. - Does the library perform health checking? This has architectural implications, as you can have separate systems failing independently. - Which kernel operation should you use? IM requires a lot of connections but few have any activity. Use epoll vs poll/select.
  • Protect Against Operational Problems - Smooth out all spikes in server activity graphs. - What happens when servers restart with an empty cache? - What happens if traffic shifts to a new data center? - Limit cascading problems. Back off from busy servers. Don't accept work when sick. - Isolate in emergencies. Don't infect others with your problems. - Have intelligent retry logic policies abstracted away. Don't sit in hard 1msec retry loops, for example.
  • Any Scalable System is a Distributed System - Add fault tolerance to every component of the system. Everything fails. - Add ability to profile live servers without impacting server. Allows continual improvement. - Collect metrics from server for monitoring. Log everything about your system so you see patterns in cause and effects. - Log end-to-end so you can reconstruct an entire operation from beginning to end across all machines.
  • Software Development Strategies - Make sure binaries are both backward and forward compatible so you can have old clients work with new code. - Build an experimentation framework to try new features. - Give engineers access to product machines. Gives end-to-end ownership. This is very different than many companies who have completely separate OP teams in their data centers. Often developers can't touch production machines.
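
    To make the non-linear growth in the first lesson concrete, here is a back-of-the-envelope calculation; the input numbers are purely illustrative assumptions, not figures from the talk.

    /** Back-of-the-envelope presence fan-out; all input numbers are illustrative assumptions. */
    public class PresenceFanout {

        static long presencePacketsPerDay(long connectedUsers, long buddyListSize, long stateChangesPerDay) {
            // Every state change must be pushed to every buddy of the user who changed state.
            return connectedUsers * buddyListSize * stateChangesPerDay;
        }

        public static void main(String[] args) {
            // Hypothetical figures: 1M users, 20 buddies each, 6 online/offline transitions per day...
            System.out.println(presencePacketsPerDay(1_000_000, 20, 6));   // 120,000,000
            // ...versus 2M users whose buddy lists have grown to 40: 4x the packets for 2x the users.
            System.out.println(presencePacketsPerDay(2_000_000, 40, 6));   // 480,000,000
        }
    }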


    Thursday
    Jul 12, 2007

    FeedBurner Architecture

    FeedBurner is a news feed management provider launched in 2004. FeedBurner provides custom RSS feeds and management tools to bloggers, podcasters, and other web-based content publishers. Services provided to publishers include traffic analysis and an optional advertising system. Site: http://www.feedburner.com

    Information Sources

  • FeedBurner - Scalable Web Applications using MySQL and Java
  • What the Web’s most popular sites are running on

    Platform

  • Java
  • MySQL
  • Hibernate
  • Spring
  • Tomcat
  • Cacti
  • Load balancing: NetScaler Application Switches
  • Routers, switches: HP, Cisco
  • DNS: bind

    The Stats

  • FeedBurner is growing faster than MySpace and Digg with 385% traffic growth. Total feeds: 808,707, Number of publishers: 471,686.
  • 11 million subscribers in 190 countries
  • Scaling History - July 2004: 300 Kbps, 5,600 feeds, 3 app servers, 3 web servers, 2 DB servers, round-robin DNS - April 2005: 5 Mbps, 47,700 feeds, 6 app servers, 6 web servers (same machines) - September 2005: 20 Mbps, 109,200 feeds - Currently: 250 Mbps bandwidth usage, 310 million feed views per day, 100 million hits per day

    The Architecture

  • Scalability Problem 1: Plain old reliability - Single-server failure, seen by 1/3 of all users. - Health checks go all the way back to the database and are monitored by the load balancers, which route requests to live machines on failure. - Use Cacti and Nagios for monitoring. Using these tools you can look at uptime and performance to identify performance problems.
  • Scalability Problem 2: Stats recording/mgmt - Every hit is recorded, which slows everything down because of table-level locks. - Used Doug Lea's concurrency library to do updates in multiple threads (a sketch of this pattern follows the list). - Only stats for today are calculated in real-time. Other stats are calculated lazily.
  • Scalability Problem 3: Primary DB overload - Use master DB for everything. - Balance read and read/write load - Found where we could break up read vs. read/write - Balanced master vs. slave load
  • Scalability Problem 4: Total DB overload - Everything slowed down; they were using the database as a cache and used MyISAM. - Add caching layers: RAM on the machines, memcached, and in the database.
  • Scalability Problem 5: Lazy initialization - When stats got rolled up on demand, popular feeds slowed down the whole system. - Turned to batch processing, doing the rollups once a night.
  • Scalability Problem 6: Stats writes, again - Wrote to the master too much. More data with each feed. Added more stats tracking for ads, items, and circulation. - Use merge tables. Truncate the data from 2 days ago. - Went to horizontal partitioning: ad serving, flare serving, circulation. - Move hottest tables/queries to own clusters.
  • Scalability Problem 7: Master DB Failure - Using a primary and slave there's a single point of failure because it's hard to promote a slave to a master. Went to a multi master solution.
  • Scalability Problem 8: Power Failure - Needed a disaster recovery/secondary site. - Active/active not possible. Too much hardware, didn't like having half the hardware going to waste, and needed a really fast connection between data centers. - Create custom solution to download feeds to remote servers.
  • They have two sites in primary and secondary roles (active-passive) as their geographical redundancy plan. They plan on moving to active-active model in the future.
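
    The talk doesn't show FeedBurner's code, so this is only a hedged sketch of the general pattern behind Scalability Problem 2: accumulate hit counts in memory with java.util.concurrent (the descendant of Doug Lea's concurrency library) and flush them to the database in periodic batches instead of writing on every hit.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    /** Sketch only: batch per-feed hit counts in memory and flush them periodically. */
    public class HitCounter {

        private final ConcurrentHashMap<String, AtomicLong> counts = new ConcurrentHashMap<>();
        private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

        public HitCounter() {
            // Flush every 60 seconds instead of writing to the DB on every hit.
            flusher.scheduleAtFixedRate(this::flush, 60, 60, TimeUnit.SECONDS);
        }

        /** Called on the request path: no table locks, no database write. */
        public void recordHit(String feedId) {
            counts.computeIfAbsent(feedId, id -> new AtomicLong()).incrementAndGet();
        }

        /** Drain the counters and write one batched update per feed (DB write stubbed out here). */
        void flush() {
            for (Map.Entry<String, AtomicLong> e : counts.entrySet()) {
                long delta = e.getValue().getAndSet(0);
                if (delta > 0) {
                    // e.g. UPDATE feed_stats SET hits = hits + ? WHERE feed_id = ? (omitted)
                    System.out.printf("flush %s += %d%n", e.getKey(), delta);
                }
            }
        }
    }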

    Lessons Learned

  • Know your DB workload; Cacti really helps with this.
  • ‘EXPLAIN’ all of your queries. Helps keep crushing queries out of the system (a small JDBC example follows this list).
  • Cache everything that you can.
  • Profile your code, usually only needed on hard-to-find leaks.
  • The greatest challenge was finding the most efficient ways to locate hotspots and bottlenecks in the application. With a loose methodology for locating problems, the analysis became very easy. Detailed monitoring was crucial in this, keeping track of disk, CPU and memory usage, slow database queries, handler details in MySQL, etc.
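
    As a hedged example of putting "EXPLAIN all of your queries" into practice (the connection details, table, and query below are hypothetical), a small JDBC check can print MySQL's query plan so full scans and missing indexes stand out before a query hits production.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.SQLException;
    import java.sql.Statement;

    /** Sketch: print MySQL's EXPLAIN output for a query (connection details are placeholders). */
    public class ExplainCheck {

        public static void main(String[] args) throws SQLException {
            String query = "SELECT * FROM feed_stats WHERE feed_id = 42";   // hypothetical query
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:mysql://localhost:3306/feedburner", "user", "password");
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("EXPLAIN " + query)) {
                ResultSetMetaData meta = rs.getMetaData();
                while (rs.next()) {
                    // Columns such as 'type', 'key', and 'rows' reveal full scans and missing indexes.
                    for (int i = 1; i <= meta.getColumnCount(); i++) {
                        System.out.printf("%s=%s ", meta.getColumnName(i), rs.getString(i));
                    }
                    System.out.println();
                }
            }
        }
    }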

