
Monday
Nov 4, 2013

ESPN's Architecture at Scale - Operating at 100,000 Duh Nuh Nuhs Per Second

ESPN went on air in 1979. In the 30+ years since, think of the wonders we’ve seen! When I think of ESPN I think of a worldwide brand that is the very definition of prime time. And it shows in their stats. ESPN.com peaks at 100,000 requests per second. Their peak event is, not surprisingly, the World Cup. But would you be surprised to learn ESPN is powered by only a few hundred servers and a couple of dozen engineers? I was.

And would you be surprised to learn ESPN is undergoing a fundamental transition from an Enterprise architecture to one capable of handling web scale loads driven by increasing mobile usage, personalization, and a service orientation? Again, thinking ESPN was just about watching sports on TV, I was surprised. ESPN is becoming much more than that. ESPN is becoming a sports platform. 

How does ESPN handle all of this complexity, responsibility, change, and load? The fascinating story of ESPN’s architecture is told by Manny Pelarinos, Senior Director of Engineering at ESPN, in the InfoQ presentation Architecture at Scale at ESPN. Information from Max Protect: Scalability and Caching at ESPN.com has also been folded in.

Starting in a pre-personal-computer era, ESPN built an innovative cable and satellite TV sports empire. From an initial 30-minute program reviewing the day’s sports, they went on to make deals with the NBA, USFL, NHL, and what would become the biggest fish in all of US sports, the National Football League.

Sport by sport, deals were made to bring in data from every possible source so ESPN could report scores, play film clips, and generally become one-stop shopping for all things sports, first on TV and later on the web.

It’s a complex system to understand. They have a lot going on: television and broadcasting, live scoring, editing and publishing, digital media, web and mobile, personalization, and fantasy games, and they also want to open up API access to 3rd party developers. Unlike most every profile on HighScalability, ESPN has an enterprise heritage. It’s a Java Enterprise stack, so you’ll see Oracle databases, JMS brokers, Java Beans, and Hibernate.

Some of the most important lessons we’ll learn about: 

  • Platform changes everything. ESPN sees themselves as a content provider. These days content is accessed through multiple paths. It can be on TV, or on ESPN.com, or on mobile, but content is also being consumed by more and more internal applications, like Fantasy Games. And they also want to provide an external API so developers can build on ESPN resources. ESPN wants to become a walled garden built on a sports content platform that centralizes access to their prime advantage over everyone else: unprecedented access to sports-related content and data. The walled garden approach that Facebook has made work for social, Apple has made work for apps, and Google has made work for search, is what ESPN wants to do for sports. The problem is that transitioning from an enterprise architecture to a platform based on APIs and services is a tough change to make. They can do it. They are doing it. But it will be hard.

  • Web scale changes everything. Many web properties today use Java as their standard backend development environment, but ESPN.com, which grew up in the Java Enterprise era, went all in for the canonical Enterprise architecture. And it has worked quite well. Until there was a sort of phase transition from enterprise class loads experienced by a relatively predictable ESPN.com to a world dominated by high mobile traffic, mass customization, and platform concerns. Many of the architecture choices we see in native web properties must now be used by ESPN.com.

  • Personalization changes everything. The cache that once saved your database is now much less useful when all content is dynamically constructed for each user and must follow you on every mode of access (.com, mobile, TV). A sketch of why this hurts cache hit rates appears after this list.

  • Mobile changes everything. It puts pressure everywhere on your architecture. When there was just the web, architecture didn’t matter as much because there were fewer users and fewer servers. In the mobile age, with so many more users and servers, these kinds of architecture decisions make a huge difference.

  • Partnerships are power. ESPN can create a walled garden because over the years they have developed partnerships that give them special access to data that nobody else has. It’s good to be firstest with the mostest. That individual leagues like the NFL and MLB are seeking to capture this value with their own networks lessens this advantage somewhat, but the forces are such that everyone needs to get along, which puts ESPN in the middle of a powerful platform play, if they can execute.
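
On the personalization point, here’s a minimal sketch of the caching problem. It is my illustration, not ESPN’s code; the backend call, names, and timings are made up. With a shared key, one cached copy absorbs every request; with per-user keys, every user misses at least once, so the cache shields the database far less:

    import time

    cache = {}  # stand-in for a memcached/Redis tier; illustration only

    def render_scoreboard(user_id=None):
        time.sleep(0.05)  # simulate an expensive page build
        return f"scores for {user_id or 'everyone'}"

    def get_page(user_id, personalized=False):
        # Shared key: one cached copy serves all users (high hit rate).
        # Per-user key: each user misses on first access, so as
        # personalization spreads, the cache protects the backend less.
        key = f"scoreboard:{user_id}" if personalized else "scoreboard"
        if key not in cache:
            cache[key] = render_scoreboard(user_id if personalized else None)
        return cache[key]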

Lights. Camera. Action. Let’s learn how ESPN scales...

Click to read more ...

Monday
Oct 28, 2013

Design Decisions for Scaling Your High Traffic Feeds

Guest post by Thierry Schellenbach, Founder/CTO of Fashiolista.com, follow @tschellenbach on Twitter and Github

Fashiolista started out as a hobby project which we built on the side. We had absolutely no idea it would grow into one of the largest online fashion communities. The entire first version took about two weeks to develop and our feed implementation was dead simple. We’ve come a long way since then and I’d like to share our experience with scaling feed systems.

Feeds are a core component of many large startups such as Pinterest, Instagram, Wanelo and Fashiolista. At Fashiolista the feed system powers the flat feed, aggregated feed and the notification system. This article will explain the troubles we ran into when scaling our feeds and the design decisions involved with building your own solution. Understanding the basics of how these feed systems work is essential as more and more applications rely on them.

Furthermore, we’ve open sourced Feedly, the Python module powering our feeds. Where applicable I’ll reference how to use it to quickly build your own feed solution.
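
Before diving in, here is what the core of a fan-out-on-write feed looks like as a minimal sketch. This is my illustration, not the Feedly API; the Redis key layout and the feed cap are assumptions:

    import redis

    r = redis.StrictRedis(host="localhost", port=6379)
    MAX_FEED_LENGTH = 1000  # assumed cap; real systems tune this

    def add_activity(user_id, activity_id):
        # Fan-out on write: push the new activity onto every follower's feed.
        for follower_id in r.smembers(f"followers:{user_id}"):
            key = f"feed:{follower_id.decode()}"
            pipe = r.pipeline()
            pipe.lpush(key, activity_id)
            pipe.ltrim(key, 0, MAX_FEED_LENGTH - 1)  # keep feeds bounded
            pipe.execute()

    def read_feed(user_id, count=25):
        # Reading is a cheap range query over a precomputed list.
        return r.lrange(f"feed:{user_id}", 0, count - 1)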

Introduction to Feeds

The problem of scaling feed systems has been widely discussed, but let me start by clarifying the basics:

Click to read more ...

Monday
Sep 23, 2013

Salesforce Architecture - How they Handle 1.3 Billion Transactions a Day

This is a guest post written by Claude Johnson, a Lead Site Reliability Engineer at salesforce.com.

The following is an architectural overview of salesforce.com’s core platform and applications. Other systems such as Heroku's Dyno architecture or the subsystems of other products such as work.com and do.com are specifically not covered by this material, although database.com is. The idea is to share with the technology community some insight about how salesforce.com does what it does. Any mistakes or omissions are mine.

This is by no means comprehensive, but if there is interest, the author would be happy to tackle other areas of how salesforce.com works. Salesforce.com is interested in being more open with the technology communities that we have not previously interacted with. Here’s to the start of “Opening the Kimono” about how we work.

Since 1999, salesforce.com has been singularly focused on building technologies for business that are delivered over the Internet, displacing traditional enterprise software. Our customers pay via monthly subscription to access our services anywhere, anytime through a web browser. We hope this exploration of the core salesforce.com architecture will be the first of many contributions to the community.

Definitions

Let’s start with some basic salesforce.com terminology:

Click to read more ...

Monday
Jul 8, 2013

The Architecture Twitter Uses to Deal with 150M Active Users, 300K QPS, a 22 MB/S Firehose, and Send Tweets in Under 5 Seconds

Toy solutions solving Twitter’s “problems” are a favorite scalability trope. Everybody has this idea that Twitter is easy. With a little architectural hand waving we have a scalable Twitter, just that simple. Well, it’s not that simple as Raffi Krikorian, VP of Engineering at Twitter, describes in his superb and very detailed presentation on Timelines at Scale. If you want to know how Twitter works - then start here.

It happened gradually so you may have missed it, but Twitter has grown up. It started as a struggling three-tierish Ruby on Rails website and became a beautifully service-driven core that we actually go to now to see if other services are down. Quite a change.

Twitter now has 150M worldwide active users, handles 300K QPS to generate timelines, and runs a firehose that churns out 22 MB/sec. 400 million tweets a day flow through the system, and it can take up to 5 minutes for a tweet to flow from Lady Gaga’s fingers to her 31 million followers.

A couple of points stood out:

  • Twitter no longer wants to be a web app. Twitter wants to be a set of APIs that power mobile clients worldwide, acting as one of the largest real-time event busses on the planet.
  • Twitter is primarily a consumption mechanism, not a production mechanism. 300K QPS are spent reading timelines and only 6000 requests per second are spent on writes.
  • Outliers, those with huge follower lists, are becoming a common case. Sending a tweet from a user with a lot of followers, that is, with a large fanout, can be slow. Twitter tries to do it in under 5 seconds, but it doesn’t always work, especially when celebrities tweet and tweet each other, which is happening more and more. One of the consequences is that replies can arrive before the original tweet is received. Twitter is changing from doing all the work on writes to doing more work on reads for high value users.
  • Your home timeline sits in a Redis cluster and has a maximum of 800 entries (see the sketch after this list).
  • Users care about tweets, but the text of the tweet is almost irrelevant to most of Twitter's infrastructure.
  • Twitter knows a lot about you from who you follow and what links you click on. Much can be implied by the implicit social contract when bidirectional follows don’t exist.
  • It takes a very sophisticated monitoring and debugging system to trace down performance problems in a complicated stack. And the ghost of legacy decisions past always haunts the system.
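
To ground the write-path/read-path split, here is a minimal sketch of hybrid fan-out. It is my illustration, not Twitter’s code; the 800-entry cap comes from the talk, while the key names and the celebrity threshold are assumptions:

    import redis

    r = redis.StrictRedis(host="localhost", port=6379)
    TIMELINE_MAX = 800      # per the talk: home timelines cap at 800 entries
    FANOUT_LIMIT = 10_000   # assumed threshold for "celebrity" accounts

    def fan_out(tweet_id, follower_ids):
        # Normal users: do the work at write time so reads stay cheap.
        for follower_id in follower_ids:
            key = f"timeline:{follower_id}"
            pipe = r.pipeline()
            pipe.lpush(key, tweet_id)
            pipe.ltrim(key, 0, TIMELINE_MAX - 1)
            pipe.execute()

    def post_tweet(author_id, tweet_id, follower_ids):
        if len(follower_ids) > FANOUT_LIMIT:
            # Celebrity case: skip write-time fan-out; readers merge these
            # tweets in at read time, moving the work to the read path.
            r.lpush(f"celebrity_tweets:{author_id}", tweet_id)
        else:
            fan_out(tweet_id, follower_ids)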

How does Twitter work? Read this gloss of Raffi’s excellent talk and find out...

Click to read more ...

Wednesday
Jul 3, 2013

5 Rockin' Tips for Scaling PHP to 30,000 Concurrent Users Per Server

Jonathan Block, CTO at RockThePost.com, a crowdfunding company, has written a nice set of tips for smaller sites on how to scale a service on EC2 with a small two-person development team.

Their service has a typical small-scale structure:

  • PHP's Zend Framework 2
  • Two m1.medium instances for web servers
  • ELB to split the load
  • master/slave MySQL database (a read/write split is sketched after this list)
  • Siege for load testing
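
The master/slave pairing usually means routing writes to the master and reads to the slave. Their stack is PHP, but here is the pattern as a minimal Python sketch; the hostnames and credentials are placeholders, not their real configuration:

    import pymysql

    # Placeholder endpoints standing in for the real master/slave pair.
    MASTER = dict(host="db-master.example.com", user="app",
                  password="secret", database="app")
    SLAVE = dict(host="db-slave.example.com", user="app",
                 password="secret", database="app")

    def run_query(sql, params=()):
        # Route writes to the master and reads to the slave so read
        # traffic, the bulk of the load, never touches the write path.
        is_write = sql.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE"))
        conn = pymysql.connect(**(MASTER if is_write else SLAVE))
        try:
            with conn.cursor() as cur:
                cur.execute(sql, params)
                if is_write:
                    conn.commit()
                    return cur.lastrowid
                return cur.fetchall()
        finally:
            conn.close()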

The very sensible tips that let them handle 30,000 concurrent users per web server:

Click to read more ...

Wednesday
Jun 26, 2013

Leveraging Cloud Computing at Yelp - 102 Million Monthly Visitors and 39 Million Reviews

This is a guest post by Yelp's Jim Blomo. Jim manages a growing data mining team that uses Hadoop, mrjob, and oddjob to process TBs of data. Before Yelp, he built infrastructure for startups and Amazon. Check out his upcoming talk at OSCON 2013 on Building a Cloud Culture at Yelp.

In Q1 2013, Yelp had 102 million unique visitors (source: Google Analytics) including approximately 10 million unique mobile devices using the Yelp app on a monthly average basis. Yelpers have written more than 39 million rich, local reviews, making Yelp the leading local guide on everything from boutiques and mechanics to restaurants and dentists. With respect to data, one of the most distinctive things about Yelp is the variety: reviews, user profiles, business descriptions, menus, check-ins, food photos... the list goes on. We have many ways to deal with data, but today I’ll focus on how we handle offline data processing and analytics.

In late 2009, Yelp investigated using Amazon’s Elastic MapReduce (EMR) as an alternative to an in-house cluster built from spare computers.  By mid 2010, we had moved production processing completely to EMR and turned off our Hadoop cluster.  Today we run over 500 jobs a day, from integration tests to advertising metrics.  We’ve learned a few lessons along the way that can hopefully benefit you as well.
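
mrjob is Yelp’s own open source Python framework for writing those jobs, and a minimal one is short enough to show here. This is the canonical word count, not one of Yelp’s production jobs:

    from mrjob.job import MRJob

    class MRWordCount(MRJob):
        """Count how often each word appears across the input."""

        def mapper(self, _, line):
            # Emit (word, 1) for every word on the line.
            for word in line.split():
                yield word.lower(), 1

        def reducer(self, word, counts):
            # Sum the counts the mappers emitted for each word.
            yield word, sum(counts)

    if __name__ == "__main__":
        MRWordCount.run()

Run it locally with "python word_count.py input.txt", or add "-r emr" to launch the same job on Elastic MapReduce once AWS credentials are configured for mrjob.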

Job Flow Pooling

Click to read more ...

Tuesday
Jun 18, 2013

Scaling Mailbox - From 0 to One Million Users in 6 Weeks and 100 Million Messages Per Day

You know your product is doing well when most of your early blog posts deal with the status of the waiting list of hundreds of thousands of users eagerly waiting to download your product. That's the enviable position Mailbox, a free mobile email management app, found themselves in early in their release cycle.

Hasn't email been done already? Apparently not. Mailbox scaled to one million users in a paltry six weeks with a team of about 14 people. As of April they were delivering over 100 million messages per day.

How did they do it? Mailbox engineering lead Sean Beausoleil gave an informative interview to readwrite.com about how Mailbox planned to scale...

Click to read more ...

Monday
Jun 3, 2013

GOV.UK - Not Your Father's Stack

I'm not sure what I was expecting the stack GOV.UK used at launch to look like. Maybe some messenger owls and lots of cobwebs? But not so at all. So much not so I thought any organization looking at their own stack for ideas could learn something from the considered choices of others.

The diversity of technologies used was surprising. They use "at least five different programming languages, three separate database types, two versions of an operating system." Some may think of this as a weakness, but they think it a strength:

The reason we operate such a diverse ecosystem is that we are focused on solving real problems. Our first task is to understand the problem or need we are solving and then to choose the best tool for the job. If we restrict ourselves to moulding the need to the tools we already have, then we risk not solving the initial problem in the best way possible for the user. By restricting software diversity or enforcing rigid organisational standards on a project, there is a possibility of descending into a cargo cult, where we simply repeat the same patterns and mistakes in everything we make.

This "use the best tool no matter what" policy is outlined in a blog post Benefits of diversity. The only choice that wouldn't be found in a modern startup is the use of Skyscape as their cloud provider. I'm assuming this has to do with legal issues around data sovereignty as this is government site, but otherwise it's all straight out of standard modern web practice: monitoring, dashboards, continuous release, polyglot persistence, distributed source code control, etc. Good to see a government getting it.

What stack are they using? (it's a direct copy so feel free to read the original)

Click to read more ...

Monday
May 13, 2013

The Secret to 10 Million Concurrent Connections - The Kernel is the Problem, Not the Solution

Now that we have the C10K concurrent connection problem licked, how do we level up and support 10 million concurrent connections? Impossible you say. Nope, systems right now are delivering 10 million concurrent connections using techniques that are as radical as they may be unfamiliar.

To learn how it’s done we turn to Robert Graham, CEO of Errata Security, and his absolutely fantastic talk at Shmoocon 2013 called C10M Defending The Internet At Scale.

Robert has a brilliant way of framing the problem that I’d never heard before. He starts with a little bit of history, relating how Unix wasn’t originally designed to be a general server OS; it was designed as the control system for a telephone network. It was the telephone network that actually transported the data, so there was a clean separation between the control plane and the data plane. The problem is we now use Unix servers as part of the data plane, which we shouldn’t do at all. If we were designing a kernel to handle one application per server we would design it very differently than a multi-user kernel.

Which is why he says the key is to understand:

  • The kernel isn’t the solution. The kernel is the problem.

Which means:

  • Don’t let the kernel do all the heavy lifting. Take packet handling, memory management, and processor scheduling out of the kernel and put them into the application, where they can be done efficiently. Let Linux handle the control plane and let the application handle the data plane.

The result will be a system that can handle 10 million concurrent connections with 200 clock cycles for packet handling and 1400 clock cycles for application logic. As a main memory access costs 300 clock cycles, it’s key to design code that minimizes cache misses.
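
The arithmetic behind those budgets checks out; the clock speed below is my assumption, not a figure from the talk:

    # Back-of-envelope check of the cycle budget; a 2 GHz core is assumed.
    CLOCK_HZ = 2_000_000_000
    PACKET_CYCLES = 200     # packet handling budget from the talk
    APP_CYCLES = 1400       # application logic budget from the talk

    per_packet = PACKET_CYCLES + APP_CYCLES    # 1600 cycles per packet
    per_core = CLOCK_HZ // per_packet          # 1.25M packets/sec per core
    cores = 10_000_000 / per_core              # 8 cores for 10M packets/sec
    print(per_core, cores)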

With a data plane oriented system you can process 10 million packets per second. With a control plane oriented system you only get 1 million packets per second.

If this seems extreme keep in mind the old saying: scalability is specialization. To do something great you can’t outsource performance to the OS. You have to do it yourself.

Now, let’s learn how Robert creates a system capable of handling 10 million concurrent connections...

Click to read more ...

Wednesday
May 1, 2013

Myth: Eric Brewer on Why Banks are BASE Not ACID - Availability Is Revenue 

In NoSQL: Past, Present, Future Eric Brewer has a particularly fine section explaining the often hard-to-understand ideas of BASE (Basically Available, Soft State, Eventually Consistent), ACID (Atomicity, Consistency, Isolation, Durability), and CAP (Consistency, Availability, Partition Tolerance) in terms of a pernicious, long-standing myth about the sanctity of consistency in banking.

Myth: Money is important, so banks must use transactions to keep money safe and consistent, right?

Reality: Banking transactions are inconsistent, particularly for ATMs. ATMs are designed to have a normal case behaviour and a partition mode behaviour. In partition mode Availability is chosen over Consistency.

Why? 1) Availability correlates with revenue and consistency generally does not. 2) Historically there was never an idea of perfect communication so everything was partitioned...
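
As a toy model of that partition-mode behaviour (mine, not Brewer’s; the $200 cap and overdraft fee are assumptions in the spirit of his example):

    PARTITION_LIMIT = 200  # assumed cap on withdrawals while disconnected

    def atm_withdraw(amount, balance, partitioned):
        # Normal mode: enforce consistency by checking the real balance.
        # Partition mode: choose availability, allow small withdrawals,
        # and accept that the balance may temporarily go negative.
        if partitioned:
            return amount <= PARTITION_LIMIT
        return amount <= balance

    def reconcile(balance, partition_withdrawals):
        # When the partition heals, apply the queued withdrawals and
        # compensate (here, an assumed overdraft fee) if consistency broke.
        for amount in partition_withdrawals:
            balance -= amount
        if balance < 0:
            balance -= 25  # compensating action rather than prevention
        return balance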

Click to read more ...