Entries by HighScalability Team (1576)

Thursday
Apr052012

Big Data Counting: How to count a billion distinct objects using only 1.5KB of Memory

This is a guest post by Matt Abrams (@abramsm), from Clearspring, discussing how they are able to accurately estimate the cardinality of sets with billions of distinct elements using surprisingly small data structures. Their servers receive well over 100 billion events per month.

At Clearspring we like to count things. Counting the number of distinct elements (the cardinality) of a set is challenge when the cardinality of the set is large.

To better understand the challenge of determining the cardinality of large sets let's imagine that you have a 16 character ID and you'd like to count the number of distinct IDs that you've seen in your logs. Here is an example:

4f67bfc603106cb2

These 16 characters represent 128 bits. 65K IDs would require 1 megabyte of space. We receive over 3 billion events per day, and each event has an ID. Those IDs require 384,000,000,000 bits or 45 gigabytes of storage. And that is just the space that the ID field requires! To get the cardinality of IDs in our daily events we could take a simplistic approach. The most straightforward idea is to use an in memory hash set that contains the unique list of IDs seen in the input files. Even if we assume that only 1 in 3 records are unique the hash set would still take 119 gigs of RAM, not including the overhead Java requires to store objects in memory. You would need a machine with several hundred gigs of memory to count distinct elements this way and that is only to count a single day's worth of unique IDs. The problem only gets more difficult if we want to count weeks or months of data. We certainly don't have a single machine with several hundred gigs of free memory sitting around so we needed a better solution...

Click to read more ...

Monday
Apr022012

YouPorn - Targeting 200 Million Views a Day and Beyond

Erick Pickup, lead developer at YouPorn.com, presented their architecture in a talk titled Building a Website To Scale given at the ConFoo conference.  As you might expect, YouPorn is a beast, streaming three full DVDs of video every second, handing 300K queries every second, and generating up to 15GBs of log data per hour.

Unfortunately, all we have are the slides of the talk, so this article isn’t as technical as I might like, there’s no visibility at all on the video handling for example, but we do get some interesting details.

The most interesting takeway is that YouPorn is a pretty conventional LAMP stack, with a NoSQL twist as Redis now replaces MySQL in the live datapath. Reminds me a little of YouTube in its simplicity.

The second most interesting takeaway was the great switchover. Common wisdom says never rewrite, but in 2011 YouPorn rewrote their entire site to use PHP + Redis instead of a complex Perl + MySQL based architecture. And by all accounts the switchover went well. The site is 10% faster and they moved over 6 years of legacy data with no down time.

Read on to learn more about the YouPorn architecture...

Click to read more ...

Friday
Mar302012

Stuff The Internet Says On Scalability For March 30, 2012

Choosy Mothers Choose HighScalability:

  • Quotable quotes:
    • @itarradellas: "Revolutions in science have often been preceded by revolutions in measurement" 
    •  @jasongorman: Use dependency injection, not Spring. Use event-driven, asynchronous I/O, not Node.js. Use MVC, not http://ASP.NET MVC etc etc
    • @bernardgolden: #netflix uses most aggressive #aws reservation system. Gets pricing down to ~ 33% of "list' pricing.
    • @ikarzali: Hey, for all facebook's talk at scalability conferences, I have to say Timeline is super slow(!) Howz that memcache workin out for you now?
  • How OMGPOP scaled to 36 million users in three weeks. Draw Something has been downloaded 35+ million times; 1 billion pictures created at 3,000 pictures per second; Couchbase is used as the database; SoftLayer is their cloud providing tens of nodes for tens of thousands of operations per second; no downtime; $200 million acquisition after 7 weeks.
  • Chris Dixon got it 2/3rds right with Give away the diagnostic, sell the remedy. The most profitable part was missing: create the problem for which you give away a diagnostic that detects the problem for which you sell the remedy.
Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge...

Click to read more ...

Thursday
Mar292012

Strategy: Exploit Processor Affinity for High and Predictable Performance

Martin Thompson wrote a really interesting article on the beneficial performance impact of taking advantage of Processor Affinity:

The interesting thing I've observed is that the unpinned test will follow a step function of unpredictable performance.  Across many runs I've seen different patterns but all similar in this step function nature.  For the pinned tests I get consistent throughput with no step pattern and always the greatest throughput.

The idea is by assigning a thread to a particular CPU that when a thread is rescheduled to run on the same CPU, it can take advantage of the "accumulated  state in the processor, including instructions and data in the cache."  With multi-core chips the norm now, you may want to decide for yourself how to assign work to cores and not let the OS do it for you. The results are surprisingly strong.

 

Tuesday
Mar272012

Sponsored Post: Gigaspaces, Nokia, Oracle, Percona Live, AiCache, ElasticHosts, Logic Monitor, Attribution Modeling, New Relic, AppDynamics, CloudSigma, ManageEngine, Site24x

Who's Hiring?

  • Nokia's Cloud Computing, Operations and Development Group is hiring! Check out: http://devops.nokia.com. CCOD designs, builds, manages and scales Nokia’s cloud computing. 
  • ConnecTV is a start up looking for a DevOps & System Administration Leader to help build a revolutionary new social network to enrich the experience of watching TV. Apply here.

Fun and Informative Events

  • The Percona Live MySQL Conference & Expo features 60+ speakers, 72 breakout sessions, and keynotes from HP, Facebook, Box, Eucalyptus Systems, and more. April 10-12 in Santa Clara
  • Sign up for this free 30-minute webinar exploring how new technology can determine which ads have been seen by users and will discuss the C3 Metrics Labs analysis of over 2 billion impressions. 

Cool Products and Services

  • Take your application to the next level of performance & scalability with the GigaSpaces In-Memory Data Grid (IMDG)
  • Join the MySQL experts from Oracle to learn Oracle's strategy for MySQL as well as the latest product development and features. So learn more and register now!
  • aiCache creates a better user experience by increasing the speed scale and stability of your web-site. Test aicache acceleration for free.  No sign-up required. http://deploy.aicache.com
  • ElasticHosts award winning cloud server hosting launches across North America. Adding data centers in Los Angeles and Toronto. Free trial.
  • LogicMonitor - Hosted monitoring of your entire technology stack. Dashboards, trending graphs, alerting. Try it free and be up and running in just 15 minutes.
  • New Relic - real user monitoring optimize for humans, not bots. Live application stats, SQL/NoSQL performance, web transactions, proactive notifications. Take 2 minutes to sign up for a free trial.
  • AppDynamics is the very first free product designed for troubleshooting Java performance while getting full visibility in production environments. Visit http://www.appdynamics.com/free.
  • CloudSigma. Utility style high performance cloud servers in the US and Europe delivered on all 10GigE networking. Run any OS, take advantage of SSD storage and tailored infrastructure options.
  • ManageEngine Applications Manager : Monitor physical, virtual and Cloud Applications.
  • www.site24x7.com : Monitor End User Experience from a global monitoring network.

For a longer description of each sponsor, please read more below...

Click to read more ...

Monday
Mar262012

7 Years of YouTube Scalability Lessons in 30 Minutes

If you started out building a dating site and instead ended up  building a video sharing site (YouTube) that handles 4 billion views a day, then it’s just possible you learned something along the way. And indeed, Mike Solomon, one of the original engineers at YouTube, did learn a lot and he has given a talk about it at PyCon: Scalability at YouTube.

This isn’t an architecture driven talk where we are led through a description of how a lot of boxes connect to each other. Mike could give that sort of talk. He has worked on building YouTube’s servlet infrastructure, video indexing feature, video transcoding system, their full text search, a CDN, and much more. But instead, he’s taken a step back, took a long look around at what time has wrought, and shared some deep lessons, obviously hard won from experience.

The key takeaway away of the talk for me was doing a lot with really simple tools. While many teams are moving on to more complex ecosystems, YouTube really does keep it simple. They program primarily in Python, use MySQL as their database, they’ve stuck with Apache, and even new features for such a massive site start as a very simple Python program.

That doesn’t mean YouTube doesn’t do cool stuff, they do, but what makes everything work together is more a philosophy or a way of doing things than technological hocus pocus. What made YouTube into one of the world’s largest websites? Read on and see...

Click to read more ...

Friday
Mar232012

Stuff The Internet Says On Scalability For March 23, 2012

Plop, Plop, Fizz, Fizz, Oh, What a HighScalability it is:

  • $1.5 billion: The cost of cutting London-Toyko latency by 60ms; 9 days: It took AOL 9 years to hit 1 million users. Facebook 9 months. Draw Something 9 days;  ~362 sq ft solar array: powers 1 sq ft of data center.
  • Is Amazon is trying to margenalize OpenStack by partnering with Eucalyptus?
  • As the DevOps turns. You won't see this on TMZ. Adrian Cockcroft:  There is no central control, the teams do it for themselves in the cloud; John Allspaw: s/NoOps/OpsDoneMaturelyButStillOps/g; Edward Capriolo: Trust developers not. Good thing is we all agree DevOps is necessary, the differences are in the how and whom.
  • In a word (or two), Wordnik has gone cloud. Gone is their big iron, in are envious EC2 instances. Driving the move was HA in multi-datacenters, elasticity for traffic bursts, and incremental cluster upgrades. There has been almost no reduction in performance.
Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge...

Click to read more ...

Thursday
Mar222012

Paper: Revisiting Network I/O APIs: The netmap Framework

Here's a really good article in the Communications of the ACM on reducing network packet processing overhead by redesigning the network stack: Revisiting Network I/O APIs: The Netmap Framework by Luigi Rizzo. As commodity networking performance increases operating systems need to keep up or all those CPUs will go to waste. How do they make this happen?

 

Abstract:

Today 10-gigabit interfaces are used more and more in datacenters and servers. On these links, packets flow as fast as one every 67.2 nanoseconds, yet modern operating systems can take 10-20 times longer just to move one packet between the wire and the application. We can do much better, not with more powerful hardware but by revising architectural decisions made long ago regarding the design of device drivers and network stacks.

The netmap framework is a promising step in this direction. Thanks to a careful design and the engineering of a new packet I/O API, netmap eliminates much unnecessary overhead and moves traffic up to 40 times faster than existing operating systems. Most importantly, netmap is largely compatible with existing applications, so it can be incrementally deployed.

Click to read more ...

Monday
Mar192012

LinkedIn: Creating a Low Latency Change Data Capture System with Databus

This is a guest post by Siddharth Anand, a senior member of LinkedIn's Distributed Data Systems team.

Over the past 3 years, I've had the good fortune to work with many emerging NoSQL products in the context of supporting the needs of a high-traffic, customer facing web site.

In 2010, I helped Netflix to successfully transition its web scale use-cases from Oracle to SimpleDB, AWS' hosted database service. On completion of that migration, we started a second migration, this time from SimpleDB to Cassandra. The first transition was key to our move from our own data center to AWS' cloud. The second was key to our expansion from one AWS Region to multiple geographically-distributed Regions -- today Netflix serves traffic out of two AWS Regions, one in Virginia, the other in Ireland (F1). Both of these transitions have been successful, but have involved integration pain points such as the creation of database replication technology.

In December 2011, I moved to LinkedIn's Distributed Data Systems (DDS) team. DDS develops data infrastructure, including but not limited to, NoSQL databases and data replication systems. LinkedIn, no stranger to building and open-sourcing innovative projects, is doubling down on NoSQL to accelerate its business -- DDS is developing a new NoSQL database called Espresso (R1), a topic for a future post.

Having observed two high-traffic web companies solve similar problems, I cannot help but notice a set of wheel-reinventions. Some of these problems are difficult and it is truly unfortunate for each company to solve its problems separately. At the same time, each company has had to solve these problems due to an absence of a reliable open-source alternative. This clearly has implications for an industry dominated by fast-moving start-ups that cannot build 50-person infrastructure development teams or dedicate months away from building features.

Change Data Capture Systems

Click to read more ...

Friday
Mar162012

Stuff The Internet Says On Scalability For March 16, 2012

HighScalability is What We Do:

  • 454,400: Number of Amazon servers; 45PB: Facebook Data Warehouse, grows exponentially; 5 Atoms: Ultimate limit of thermodynamics; YouTube: 4 billion views/day, 60 hours of video uploaded every minute, revenue doubled in 2010 
  • Quotable quotes:
    • @adrianco: Walmart labs run large single region Cassandra clusters with Intel SSDs and have been in production for two years. Working well for them.
    • @mybellemac: Scalability is a mother. #pinterest
    • @fakesigi: Thanks for the correction. I saw cloud computing, scalability and my brain turned off.
    • @BVA100: I disagree with "If it ain't broke, don't fix it". We ought to be forward thinkers, concerned with leading indicators and scalability.
  • Dilbert on the meaning of it all.
  • Cassandra and Solid State Drives. DataStax's Rick Branson with a sweet explanation of how Cassandra was built for a world of spinning disks, which means it only writes sequentially, which turns out to be a good way to use SSDs too.
Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge...

Click to read more ...