Entries by HighScalability Team (1576)

Monday
Dec 05, 2011

Stuff The Internet Says On Scalability For December 5, 2011

It's HighScalability Time!

  • Quotable quotes:
    • @jaykreps : Was wondering, How can I turn my boring, cachable, read-only traffic into random writes on mongodb? And lo! link
    • @marshallk : Google runs 100-200 experiments every day on UI, algorithm & product
    • @styggiti : The problem with companies like IBM and Oracle baking NoSQL "scalability" into their products isn't the tech, it's the $$ licensing.
  • Blazing fast node.js: 10 performance tips from LinkedIn Mobile. You may have thought that node.js made everything magically fast, but Shravya Garlapati has some great strategies for going even faster: Avoid synchronous code; Turn off socket pooling; Don't use Node.js for static assets; Render on the client-side; Use gzip; Go parallel; Go session-free; Use binary modules; Use standard V8 JavaScript instead of client-side libraries; Keep your code small and light.
  • Nice thread in NoSQL Databases on HBase and Consistency in CAP. The short summary of the article is that CAP isn't "C, A, or P, choose two," but rather "When P happens, choose A or C."
To read more of what the Internet has to say on scalability, please read more below...

Click to read more ...

Friday
Dec 02, 2011

Stuff The Internet Says On Scalability For December 2, 2011

Sorry, this edition of Stuff the Internet Says on Scalability has been called on account of two straight days of whipping, whirling wind that have left me powerless to complete the post. Service will resume when Mother Nature is nicer. Pardon me while I try and find our roof...

Tuesday
Nov 29, 2011

DataSift Architecture: Realtime Datamining at 120,000 Tweets Per Second

I remember the excitement of when Twitter first opened up their firehose. As an early adopter of the Twitter API I could easily imagine some of the cool things you could do with all that data. I also remember the disappointment of learning that in the land of BigData, data has a price, and that price would be too high for little fish like me. It was like learning for the first time there would be no BigData Santa Claus.

For a while, though, I had the pleasure of pondering just how I would handle all that data. It's a fascinating problem. You have to be able to reliably consume it, normalize it, merge it with other data, apply functions to it, store it, query it, distribute it, and, oh yeah, monetize it. Most of that in realish-time. And if you are trying to create a platform that lets the entire Internet do the same thing to the firehose, the challenge is exponentially harder.

DataSift is in the exciting position of creating just such a firehose-eating, data-chomping machine. You see, DataSift has bought multi-year re-syndication rights from Twitter, which grant them access to the full Twitter firehose with the ability to resell subsets of it to other parties. Those parties could be anyone, but the primary target is of course businesses. Gnip is the only other company to have these rights.

DataSift grew out of the experience that Nick Halstead, Founder and CTO of DataSift, had with TweetMeme, a popular real-time Twitter news aggregator, which at one time handled 1.1 billion page views per day. TweetMeme is famous for inventing the social signaling mechanism, better known as the retweet, with their retweet button, an idea that came out of an even earlier startup called fav.or.it (favorite). Imagine if you will a time before like buttons were plastered all over the virtual place.

So processing the TweetMeme at scale is nothing new for the folks at DataSift. The challenge has been turning that experience into an Internet-scale platform so that everyone else can do the same thing. That has been a multi-year odyssey.

DataSift is positioning itself as a realtime datamining platform. The platform angle here is really the key take-home message. They are pursuing a true platform strategy for processing real-time streams. TweetMeme, while successful, could not be a billion dollar company, but a BigData platform could grow that large, so that’s the direction they are headed. A money quote by Nick highlights the logic in neon: "There's no money in buttons, there's money in data."

Part of the strategy behind a platform play is to become the incumbent player by building a giant technological moat around your core value proposition. When others come a-knockin', they can't cross your moat because of your towering technological barrier to entry. That's what DataSift is trying to do. The drawbridge on the moat is favored access to Twitter's firehose, but the real power is in the Google-quality real-time data processing platform infrastructure that they are trying to create.

DataSift's real innovation is in creating an Internet-scale filtering system that can quickly evaluate very large filters (think Lady Gaga follower size), combined with the virtuous economics of virtualization, where the more customers you have the more money you make because they are sharing resources.

How are they making all this magic happen? Let's see...

Click to read more ...

Tuesday
Nov 29, 2011

Sponsored Post: Cedexis, Callfire, Attribution Modeling, Logic Monitor, New Relic, ScaleOut, Percona Live MySQL, AppDynamics, CloudSigma, ManageEngine, Site24x7

Who's Hiring?

  • Callfire, one of the largest cloud telephony platforms on the web, is hiring a Sr. Software Engineer. You can learn more here.

Fun and Informative Events

  • Sign up for this free 30-minute webinar, which explores how new technology can determine which ads have been seen by users and discusses the C3 Metrics Labs analysis of over 2 billion impressions.
  • Come one come all! Introducing Percona Live: MySQL Conference And Expo 2012. Join us for this three-day intensive MySQL conference April 10th-12th, 2012.

Cool Products and Services

  • Not satisfied with performance in the cloud? Visit Cedexis and Lose the Wait. Looking for a path to the Hybrid Cloud? Cedexis can help you find the right path.
  • LogicMonitor - Hosted monitoring of your entire technology stack. Dashboards, trending graphs, alerting. Be up and running in 15 minutes.
  • New Relic - Real user monitoring: optimize for humans, not bots. Live application stats, SQL/NoSQL performance, web transactions, proactive notifications. Take 2 minutes to sign up for a free trial.
  • ScaleOut StateServer® Delivers Map/Reduce Analysis and Scalable Application Performance. Gain competitive advantage with rapid access to business intelligence. Download a free evaluation trial today.
  • AppDynamics is the very first free product designed for troubleshooting Java performance while getting full visibility in production environments. Visit http://www.appdynamics.com/free.
  • CloudSigma. Instantly scalable European cloud servers.
  • ManageEngine Applications Manager : Monitor physical, virtual and Cloud Applications.
  • www.site24x7.com : Monitor End User Experience from a global monitoring network.

For a longer description of each sponsor, please read more below...

Click to read more ...

Friday
Nov 25, 2011

Stuff The Internet Says On Scalability For November 25, 2011

A HighScalability a day keeps the fail whale away:

To read more of what the Internet has to say on scalability, please read more below...

Click to read more ...

Wednesday
Nov 23, 2011

Paper: Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS

Teams from Princeton and CMU are working together to solve one of the most difficult problems in the repertoire: scalable geo-distributed data stores. Major companies like Google and Facebook have been working on multiple datacenter database functionality for some time, but there's still a general lack of available systems that work for complex data scenarios.

The ideas in this paper--Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS--are different. It's not another eventually consistent system, or a traditional transaction-oriented system, or a replication-based system, or a system that punts on the issue. It's something new: a causally consistent system that achieves ALPS system properties. Move over CAP, NoSQL, etc., we have another acronym. ALPS stands for Available (operations always complete successfully), Low-latency (operations complete quickly, in single-digit milliseconds), Partition-tolerant (operates with a partition), and Scalable (just add more servers to add more capacity). ALPS is the recipe for an always-on data store: operations always complete, they are always successful, and they are always fast.

ALPS sounds great, but we want more: we want consistency guarantees as well. Fast and wrong is no way to go through life. Most current systems achieve low latency by avoiding synchronous operation across the WAN, directing reads and writes to a local datacenter, and then using eventual consistency to maintain order. Causal consistency promises another way.

Intrigued? Let's learn more about causal consistency and how it might help us build bigger and better distributed systems.
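
As a taste of the general idea before clicking through, here is a minimal Python sketch of causal ordering at a remote datacenter: a replicated write is held back until the writes it depends on are visible. This is an illustration of causal consistency in general, not COPS's actual protocol; the Replica class, its method names, and the integer version scheme are all assumptions made up for this example.

```python
class Replica:
    """A remote datacenter that applies replicated writes in causal order (illustrative only)."""

    def __init__(self):
        self.store = {}    # key -> (version, value) for writes that are visible to readers
        self.pending = []  # replicated writes still waiting on their dependencies

    def _satisfied(self, deps):
        # deps maps key -> minimum version that must already be visible here
        return all(self.store.get(k, (0, None))[0] >= v for k, v in deps.items())

    def receive(self, key, version, value, deps):
        """Accept a replicated write; it becomes visible only once its deps are."""
        self.pending.append((key, version, value, deps))
        self._apply_ready()

    def _apply_ready(self):
        # Keep sweeping: applying one write may unblock others that depend on it.
        progressed = True
        while progressed:
            progressed = False
            for write in list(self.pending):
                key, version, value, deps = write
                if self._satisfied(deps):
                    if version > self.store.get(key, (0, None))[0]:
                        self.store[key] = (version, value)
                    self.pending.remove(write)
                    progressed = True

    def get(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None
```

If a comment that depends on a photo arrives before the photo does, get("comment") returns None until the photo write shows up, so a reader never sees the comment without the thing it comments on. Under plain eventual consistency that ordering is not guaranteed.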

Click to read more ...

Friday
Nov 18, 2011

Stuff The Internet Says On Scalability For November 18, 2011

Every kiss begins with HighScalability:

  • Amazon and the secret to life: 42; 10,240 cores
  • Many quatloos worth of quotable quotes:
    • @alesroubicek :  State kills scalability
    • @cincura_net : Wrong. *Shared* state kills scalability.
    • @kpshea : When I think "cloud" computing, I imagine the gaseous Star Trek blob that ate red blood cells (your sensitive data).
    • @kotobuki : I'm interested in scalability of personal fabrication. How to 'scale' in batch production stages will be a key, but still there are barriers.
    • @simonraikallen : The two rules of scalability testing: (1) The bottleneck is always the database (2) You can never predict what the bottleneck will be.
    • @marksbirch : Photo: newyorker: The way we are producing data, we may need a place even bigger than heaven to hold it all…
  • Why Stack Exchange Isn’t in the Cloud. It's about love, the love of computers, and what you love you don't let other people own. Also: How StackOverflow Scales with SQL Server.
  • NoSQL No More: Let’s double down with MoreSQL. Alex Tatiyants with an impassioned plea for programmers to throw down these newfangled databases and return to a comfortable and much-loved past. To bring this world about Alex wants SQL Everywhere, to spread FUD about NoSQL, and to recognize that, with enough effort, SQL can work for every problem. This brave old vision should bring comfort everywhere to people wearing very small shoes.
The Internet has so much more to say on scalability, don't be left out, read more by clicking below...

Click to read more ...

Wednesday
Nov 16, 2011

Google+ Infrastructure Update - the JavaScript Story

In Google+ Is Built Using Tools You Can Use Too: Closure, Java Servlets, JavaScript, BigTable, Colossus, Quick Turnaround we glimpsed inside Google's technology stack for building Google+. Mark Knichel, an engineer on the Google+ infrastructure team, has helped us look a little deeper at how JavaScript is handled in Google+. Here's a quick look:

  1. They love Closure for its library, templates, compiler, and strict type checking. Compilation is now required for good performance. I've wondered if GWT will be killed off as have other Google properties, but I've been told GWT is being used heavily inside Google, so thankfully that probably won't happen.
  2. Closure templates are used in both Java and JavaScript to render pages server-side and in the browser.
  3. Just-in-time JavaScript. Code is split into modules so the minimum amount of JavaScript is loaded asynchronously in the background as necessary.
  4. Page navigation happens without page reloads.
  5. HTML Flush. Asynchronous data from the server is rendered in the browser immediately so the whole page doesn't need to be loaded. 
  6. Iframes are used to load JavaScript in parallel.

Why no Google+ API? There's conjecture that making a fast responsive UI means the API can't come first because it may not take the shape required to make the UI fast. So the approach is: make the UI first, make it fast, and then wrap an API around whatever evolved. A controversial methodology, but given the imperative for making a responsive UI, it makes sense. Is that the right goal? However good this approach is for creating fast UIs, it's death for feature development and fast responsive innovation. With an API you can develop in parallel and release stuff faster and iterate faster. Which is more important: innovative features that differentiate Google+ from competitors or a fast UI?

More juicy details on this Google+ post.

Monday
Nov 14, 2011

Using Gossip Protocols for Failure Detection, Monitoring, Messaging and Other Good Things

When building a system on top of a set of wildly uncooperative and unruly computers you have knowledge problems: knowing when other nodes are dead; knowing when nodes become alive; getting information about other nodes so you can make local decisions, like knowing which node should handle a request based on a scheme for assigning nodes to a certain range of users; learning about new configuration data; agreeing on data values; and so on.

How do you solve these problems? 

A common centralized approach is to have all nodes query a database for this information, which creates obvious availability and performance problems for large distributed clusters. Another approach is to use Paxos, a protocol for solving consensus in a network that maintains strict consistency among small groups of unreliable processes. It's not practical when a larger number of nodes is involved.

So what's the super cool decentralized way to bring order to large clusters?
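
The post title gives the answer away: gossip. As a rough illustration of heartbeat-style gossip failure detection, here is a minimal Python sketch. It is not taken from the full article; the Node class, the fail_after timeout (counted in gossip rounds), and the simulate helper are hypothetical names chosen for this example, and a real system would use wall-clock timeouts and network messages.

```python
import random


class Node:
    """One cluster member holding a heartbeat table for every node it has heard of."""

    def __init__(self, name, fail_after=10):
        self.name = name
        self.fail_after = fail_after  # hypothetical timeout, in gossip rounds
        self.alive = True
        # node name -> (highest heartbeat seen, local round when it last increased)
        self.table = {name: (0, 0)}

    def tick(self, now):
        """Bump our own heartbeat counter once per round."""
        beat, _ = self.table[self.name]
        self.table[self.name] = (beat + 1, now)

    def merge(self, remote_table, now):
        """Keep the larger counter per node; only a newer counter refreshes the timestamp."""
        for name, (beat, _) in remote_table.items():
            known = self.table.get(name)
            if known is None or beat > known[0]:
                self.table[name] = (beat, now)

    def suspected(self, now):
        """Nodes whose counter has not advanced for more than fail_after rounds."""
        return {n for n, (_, seen) in self.table.items()
                if n != self.name and now - seen > self.fail_after}


def run_round(nodes, now, rng):
    """Each live node ticks, then swaps tables with one randomly chosen peer."""
    for node in [n for n in nodes if n.alive]:
        node.tick(now)
        peer = rng.choice([p for p in nodes if p is not node])
        if peer.alive:  # messages to a crashed node are simply lost
            peer.merge(node.table, now)
            node.merge(peer.table, now)


def simulate(seed=1, size=5, rounds_before=10, rounds_after=60):
    """Gossip for a while, crash node n0, keep gossiping; return (nodes, final round)."""
    rng = random.Random(seed)
    nodes = [Node(f"n{i}") for i in range(size)]
    now = 0
    for _ in range(rounds_before):
        now += 1
        run_round(nodes, now, rng)
    nodes[0].alive = False
    for _ in range(rounds_after):
        now += 1
        run_round(nodes, now, rng)
    return nodes, now
```

No node asks a central database who is alive. Each one just notices that n0's counter stopped moving in the tables its peers pass around, and after enough quiet rounds it lists n0 in suspected(now).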

Click to read more ...

Friday
Nov 11, 2011

Stuff The Internet Says On Scalability For November 11, 2011

You got performance in my scalability! You got scalability in my performance! Two great tastes that taste great together:

  • Quotable quotes:
    • @jasoncbooth : Tired of the term #nosql. I would like to coin NRDS (pronounced "nerds"), standing for Non Relational Data Store. 
    • @zenfeed : One lesson I learn about scalability, is that it has a LOT to do with simplicity and consistency.
    • Ray Walters : Quad-core chips in mobile phones is nothing but a marketing snow job
  • Flickr: Real-time Updates on the Cheap for Fun and Profit. How Flickr added a real-time push feed on the cheap. Events happen all over Flickr, uploads and updates (around 100/s depending on the time of day), all of them inserting tasks. Implemented with Cache, Tasks, & Queues: PubSubHubbub; Async task system Gearman; use async EVERYWHERE; use Redis Lists for queues; cron to consume events off the queue.
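
To make the "Redis Lists for queues, cron to consume" pattern from the Flickr item concrete, here is a minimal Python sketch. It uses an in-memory deque as a stand-in for a Redis list so it runs without a server; against real Redis the two operations would be the LPUSH and RPOP commands. The ListQueue class and the record_event and drain functions are hypothetical names for illustration, not Flickr's code.

```python
from collections import deque


class ListQueue:
    """In-memory stand-in for a Redis list used as a FIFO queue."""

    def __init__(self):
        self._items = deque()

    def lpush(self, item):
        # Redis equivalent: LPUSH key item (push onto the head of the list)
        self._items.appendleft(item)

    def rpop(self):
        # Redis equivalent: RPOP key (pop from the tail; nil when the list is empty)
        return self._items.pop() if self._items else None


def record_event(queue, photo_id, kind):
    """Called on the upload/update path: enqueue the event and return right away."""
    queue.lpush({"photo_id": photo_id, "kind": kind})


def drain(queue, publish, batch_size=100):
    """What a cron-driven worker might do: pop up to batch_size events and publish them."""
    sent = 0
    while sent < batch_size:
        event = queue.rpop()
        if event is None:
            break
        publish(event)
        sent += 1
    return sent
```

Pushing on one end and popping from the other keeps events in arrival order, and the request path only pays for a single list push while the slow fan-out work happens later in the worker.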
To read even more Stuff the Internet has on Scalability, please click below...

Click to read more ...