
Tuesday
Feb 21, 2012

Pixable Architecture - Crawling, Analyzing, and Ranking 20 Million Photos a Day

This is a guest post by Alberto Lopez Toledo, PhD, CTO of Pixable, and Julio Viera, VP of Engineering at Pixable.

Pixable aggregates photos from across your different social networks and finds the best ones so you never miss an important moment. That means currently processing the metadata of more than 20 million new photos per day: crawling, analyzing, ranking, and sorting them along with the other 5+ billion that are already stored in our database. Making sense of all that data has its challenges, but two in particular rise above the rest:

  1. How to access millions of photos per day from Facebook, Twitter, Instagram, and other services in the most efficient manner.
  2. How to process, organize, index, and store all the metadata related to those photos.

Sure, Pixable’s infrastructure is changing continuously, but there are some things we have learned over the last year. As a result, we have been able to build a scalable infrastructure that takes advantage of today’s tools, languages, and cloud services, all on Amazon Web Services, where we run more than 80 servers. This document provides a brief introduction to those lessons:


Monday
Feb 20, 2012

Berkeley DB Architecture - NoSQL Before NoSQL was Cool

After the filesystem and simple library packages like dbm, Berkeley DB was the original luxury embedded database, widely used by applications as their core database engine. NoSQL before NoSQL was cool. The hidden secret making complex applications sing. If you want to dispense with all the network overhead of a server-based system, it's still a good choice.

There's a great writeup of the architecture behind Berkeley DB in the book The Architecture of Open Source Applications. If you want to understand more about how a database works, or if you are pondering how to build your own, it's rich in detail, explanations, and lessons. Here's the Berkeley DB chapter from the book. It covers topics like: Architectural Overview; The Access Methods: Btree, Hash, Recno, Queue; The Library Interface Layer; The Buffer Manager: Mpool; Write-ahead Logging; The Lock Manager: Lock; The Log Manager: Log; The Transaction Manager: Txn.
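One of those topics, write-ahead logging, can be illustrated with a toy sketch. This is a simplified model for intuition only, not Berkeley DB's actual implementation: the point is just that a change is appended to a durable log before it is applied, so the store can be rebuilt after a crash.

```python
# Toy write-ahead logging model: log first, apply second, replay to recover.
class ToyStore:
    def __init__(self):
        self.log = []    # stand-in for the on-disk log file
        self.data = {}   # stand-in for the B-tree / hash pages

    def put(self, key, value):
        self.log.append(("put", key, value))  # 1. append to the log first
        self.data[key] = value                # 2. then apply the change

    @classmethod
    def recover(cls, log):
        # Crash recovery: replay the surviving log against an empty store.
        store = cls()
        for op, key, value in log:
            if op == "put":
                store.data[key] = value
        store.log = list(log)
        return store

store = ToyStore()
store.put("user:1", "alice")
store.put("user:2", "bob")

# Simulate a crash that loses in-memory state but keeps the log.
recovered = ToyStore.recover(store.log)
assert recovered.data == store.data
```

Real systems add checkpointing, log sequence numbers, and undo records on top of this basic idea; the chapter covers how Berkeley DB does each.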


Thursday
Feb 16, 2012

A Super Short on the Youporn Stack - 300K QPS and 100 Million Page Views Per Day

Eric Pickup from Youporn.com posted on a newsgroup that Youporn is now 100% Redis-based and will soon reveal more about their architecture at the ConFoo conference. Some stunning, but not surprising, numbers were revealed:

  • 100 million page views per day
  • A cluster of Redis slaves is handling over 300K queries per second.

Some additional nuggets:

  • Additional Redis nodes were added because the network cards couldn't keep up with Redis.
  • Impressed with Redis' performance.
  • All reads come from Redis; we are maintaining MySQL just to allow us to build new sorted sets as our requirements change.
  • Most data is kept in hashes, with sorted sets used to know what data to show.
    • A typical lookup would be a zInterStore on: videos:filters:released, videos:filters:orientation:straight, videos:filters:categories:{category_id}, videos:ordering:rating.
    • Then perform a zRange to get the page we want and get the list of video_ids back.
    • Then start a pipeline and get all the videos from hashes.
    • They do use some key/value lookups and some lists, but the majority of operations use the above pattern.
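The lookup pattern above can be modeled in plain Python to make the flow concrete. The key names follow the post, while the sample data and category id (42) are made up for illustration:

```python
# Pure-Python model of the described pattern: intersect several sorted sets
# (zInterStore), page through the result (zRange), then fetch each video hash.
sorted_sets = {
    "videos:filters:released":             {101: 1, 102: 1, 103: 1},
    "videos:filters:orientation:straight": {101: 1, 103: 1},
    "videos:filters:categories:42":        {101: 1, 102: 1, 103: 1},
    "videos:ordering:rating":              {101: 4.5, 102: 3.9, 103: 4.8},
}
hashes = {
    101: {"title": "Video A"},
    102: {"title": "Video B"},
    103: {"title": "Video C"},
}

def z_inter_store(keys):
    # Like zInterStore: members present in every set, with scores summed.
    common = set.intersection(*(set(sorted_sets[k]) for k in keys))
    return {m: sum(sorted_sets[k][m] for k in keys) for m in common}

def z_range(scored, start, stop):
    # Like zRange: members ordered by score ascending, inclusive bounds.
    ordered = sorted(scored, key=lambda m: scored[m])
    return ordered[start:stop + 1]

result = z_inter_store([
    "videos:filters:released",
    "videos:filters:orientation:straight",
    "videos:filters:categories:42",
    "videos:ordering:rating",
])
page = z_range(result, 0, 9)            # first page of video_ids
videos = [hashes[vid] for vid in page]  # the "pipelined" hash fetches
```

Against a real server these would be redis zInterStore, zRange, and hGetAll calls issued over a pipeline; the model just shows how the intersection, paging, and hash lookups fit together.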

Not much to see yet, but hopefully we'll learn more after their talk. 

Thursday
Feb 16, 2012

A Short on the Pinterest Stack for Handling 3+ Million Users

Pinterest co-founder Paul Sciarra shared a bit about their stack on Quora:

  • Python + heavily-modified Django at the application layer
  • Tornado and (very selectively) node.js as web servers.
  • Memcached and membase/redis for object- and logical-caching, respectively.
  • RabbitMQ as a message queue.
  • Nginx, HAProxy, and Varnish for static delivery and load balancing.
  • Persistent data storage using MySQL.
  • MrJob on EMR for map-reduce.
  • Git.
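The object-caching role Memcached typically plays in a stack like this is cache-aside: check the cache first, fall back to the database on a miss, and populate the cache for subsequent readers. A minimal sketch with hypothetical names and data, not Pinterest's actual code:

```python
# Cache-aside sketch: plain dicts stand in for Memcached and MySQL.
db = {"pin:1": {"title": "Recipe"}}   # stands in for MySQL
cache = {}                             # stands in for Memcached
stats = {"hits": 0, "misses": 0}

def get_pin(pin_id):
    key = f"pin:{pin_id}"
    if key in cache:                   # cache hit: no database round trip
        stats["hits"] += 1
        return cache[key]
    stats["misses"] += 1
    value = db.get(key)                # cache miss: read from the database
    if value is not None:
        cache[key] = value             # populate for subsequent readers
    return value

first = get_pin(1)    # miss: loads from the database and fills the cache
second = get_pin(1)   # hit: served from the cache
```

The "logical caching" the post attributes to membase/redis usually means caching derived results (feeds, counts) rather than raw rows, but it follows the same read-through shape.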

Alex Popescu has created a cool diagram of the setup and provided some thoughtful analysis as well.

Monday
Feb 13, 2012

Tumblr Architecture - 15 Billion Page Views a Month and Harder to Scale than Twitter

With over 15 billion page views a month Tumblr has become an insanely popular blogging platform. Users may like Tumblr for its simplicity, its beauty, its strong focus on user experience, or its friendly and engaged community, but like it they do.

Growing at over 30% a month has not been without challenges, reliability problems among them. It helps to realize that Tumblr operates at surprisingly huge scale: 500 million page views a day, a peak rate of ~40K requests per second, and ~3TB of new data to store a day, all running on 1000+ servers.
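A back-of-envelope pass over those daily figures gives a feel for the sustained rates involved. This is rough arithmetic only, using the numbers quoted above:

```python
# Sustained rates implied by Tumblr's daily figures.
page_views_per_day = 500_000_000
new_data_per_day_tb = 3

seconds_per_day = 86_400
avg_views_per_second = page_views_per_day / seconds_per_day   # ~5,800/s averaged
new_data_mb_per_second = new_data_per_day_tb * 1_000_000 / seconds_per_day  # ~35 MB/s of new data
```

Note the gap between the ~5,800/s average and the ~40K/s peak: capacity has to be planned for the peak, which is several times the average.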

One of the common patterns across successful startups is the perilous chasm crossing from startup to wildly successful startup. Finding people, evolving infrastructures, servicing old infrastructures, while handling huge month over month increases in traffic, all with only four engineers, means you have to make difficult choices about what to work on. This was Tumblr’s situation. Now with twenty engineers there’s enough energy to work on issues and develop some very interesting solutions.

Tumblr started as a fairly typical large LAMP application. The direction they are moving in now is towards a distributed services model built around Scala, HBase, Redis, Kafka, Finagle, and an intriguing cell-based architecture for powering their Dashboard. Effort is now going into fixing short-term problems in their PHP application, pulling things out, and doing it right using services.

The theme at Tumblr is transition at massive scale. Transition from a LAMP stack to a somewhat bleeding edge stack. Transition from a small startup team to a fully armed and ready development team churning out new features and infrastructure. To help us understand how Tumblr is living this theme is startup veteran Blake Matheny, Distributed Systems Engineer at Tumblr. Here’s what Blake has to say about the House of Tumblr:


Monday
Feb 6, 2012

The Design of 99designs - A Clean Tens of Millions Pageviews Architecture

99designs is a crowdsourced design contest marketplace based out of Melbourne, Australia. The idea is that if you have a design you need created, you create a contest and designers compete to give you the best design within your budget.

If you are a medium-sized commerce site, this is a clean example architecture of a site that reliably supports a lot of users and a complex workflow in the cloud. Lars Yencken wrote a nice overview of the architecture behind 99designs in Infrastructure at 99designs. Here's a gloss on their architecture:

Stats


Thursday
Feb 2, 2012

The Data-Scope Project - 6PB storage, 500GBytes/sec sequential IO, 20M IOPS, 130TFlops

Data is everywhere, never at a single location. Not scalable, not maintainable. – Alex Szalay

While Galileo played life-and-death doctrinal games over the mysteries revealed by the telescope, another revolution went unnoticed: the microscope gave up mystery after mystery, and nobody yet understood how subversive what it revealed would be. For the first time these new tools of perceptual augmentation allowed humans to peek behind the veil of appearance. A new eye driving human invention and discovery for hundreds of years.

Data is another material that hides, revealing itself only when we look at different scales and investigate its underlying patterns. If the universe is truly made of information, then we are looking into truly primal stuff. A new eye is needed for Data and an ambitious project called Data-scope aims to be the lens.

A detailed paper on the Data-Scope tells more about what it is:

The Data-Scope is a new scientific instrument, capable of ‘observing’ immense volumes of data from various scientific domains such as astronomy, fluid mechanics, and bioinformatics. The system will have over 6PB of storage, about 500GBytes per sec aggregate sequential IO, about 20M IOPS, and about 130TFlops. The Data-Scope is not a traditional multi-user computing cluster, but a new kind of instrument, that enables people to do science with datasets ranging between 100TB and 1000TB. There is a vacuum today in data-intensive scientific computations, similar to the one that led to the development of the Beowulf cluster: an inexpensive yet efficient template for data-intensive computing in academic environments based on commodity components. The proposed Data-Scope aims to fill this gap.
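One way to appreciate why the design balances storage against sequential IO is a rough calculation of how long a single full pass over the data would take at the quoted rates (decimal units assumed):

```python
# Time for one full sequential scan of Data-Scope at the quoted rates.
storage_pb = 6
seq_io_gb_per_s = 500

storage_gb = storage_pb * 1_000_000          # 6 PB = 6,000,000 GB (decimal)
scan_seconds = storage_gb / seq_io_gb_per_s  # 12,000 s
scan_hours = scan_seconds / 3600             # roughly 3.3 hours per full scan
```

Being able to scan the entire instrument in a few hours, rather than days, is what makes brute-force passes over 100TB–1000TB datasets practical.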

A very accessible interview by Nicole Hemsoth with Dr. Alexander Szalay, Data-Scope team lead, is available at The New Era of Computing: An Interview with "Dr. Data". Roberto Zicari also has a good interview with Dr. Szalay in Objects in Space vs. Friends in Facebook.

The paper is filled with lots of very specific recommendations on their hardware choices and architecture, so please read the paper for the deeper details. Many BigData operations have the same IO/scale/storage/processing issues Data-Scope is solving, so it’s well worth a look. Here are some of the highlights:


Monday
Jan 9, 2012

The Etsy Saga: From Silos to Happy to Billions of Pageviews a Month

Seldom do we get to hear stories of the bumps and bruises earned by a popular website during its formative years. Ross Snyder, a Sr. Software Engineer at Etsy, changes that with an engaging talk he gave at Surge 2011: Scaling Etsy: What Went Wrong, What Went Right.

Ross gives a detailed and honest account of how Etsy went from a raw startup in 2005, to a startup struggling with their success in 2007, to the mean, handmade, super scaling, ops driven machine they’ve become in 2011.

There’s lots to learn from this illuminating story of transformation:

Origin Story


Tuesday
Dec 27, 2011

PlentyOfFish Update - 6 Billion Pageviews and 32 Billion Images a Month

Markus has a short update on their PlentyOfFish Architecture. Impressive November statistics:

  • 6 billion pageviews served
  • 32 billion images served
  • 6 million logins in one day
  • IM servers handle about 30 billion pageviews
  • 11 webservers (5 of which could be dropped)
  • Hired first DBA in July. They currently have a handful of employees.
  • All hosting/CDN costs combined are under $70K/month.
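Rough arithmetic on those November numbers (assuming a 30-day month) makes the efficiency concrete:

```python
# Average rates and cost efficiency implied by PlentyOfFish's monthly stats.
pageviews = 6_000_000_000
images = 32_000_000_000
cost_per_month = 70_000

seconds_in_month = 30 * 86_400                        # 2,592,000 s
avg_pageviews_per_s = pageviews / seconds_in_month    # ~2,300 pageviews/s average
avg_images_per_s = images / seconds_in_month          # ~12,300 images/s average
cost_per_million_pageviews = cost_per_month / (pageviews / 1_000_000)  # ~$11.7
```

Well under $12 per million page views, served by 11 web servers and a handful of employees, is the quantitative version of the lesson below.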

Lesson: small organization, simple architecture, on raw hardware is still plenty profitable for PlentyOfFish.


Tuesday
Dec 6, 2011

Instagram Architecture: 14 Million users, Terabytes of Photos, 100s of Instances, Dozens of Technologies

Instagram is a free photo sharing and social networking service for your iPhone that has been an instant success. Growing to 14 million users in just over a year, they reached 150 million photos in August while amassing several terabytes of photos, and they did this with just 3 Instaneers, all on the Amazon stack.

The Instagram team has written up what can be considered the canonical description of an early stage startup in this era: What Powers Instagram: Hundreds of Instances, Dozens of Technologies.

Instagram uses a pastiche of different technologies and strategies. The team is small yet has experienced rapid growth riding the crest of a rising social and mobile wave. They use a hybrid of SQL and NoSQL and a ton of open source projects; they chose the cloud over colo; Amazon services are highly leveraged rather than building their own; reliability comes through availability zones; async work scheduling links components together; the system is composed as much as possible of services exposing an API and of external services they don't have to build; data is stored in-memory and in the cloud; most code is in a dynamic language; custom bits have been coded to link everything together; and they have gone fast and kept small. A very modern construction.

We'll just tl;dr the article here; it's very well written and to the point. Definitely worth reading. Here are the essentials:
