Entries in Strategy (358)

Wednesday
Feb 25, 2015

Deep Learning without Deep Pockets

Now that you've transformed your system through successive evolutions of architecture goodness...you've made it cloud native, you now treat a fistful of datacenters as a single computer, you've microservicized it, you've containerized it, you're continuously releasing and improving it, you've made it reactive, you've socialized it, you've mobilized it, you've Hadoop'ed it, you've made it DevOps friendly, and you have real-time dashboards that would make NORAD jealous...what's next?

Deep learning is what’s next. Making machines that learn. The problem is how?

All the other transformations have been changes good programmers can learn to do. Deep learning is still deep magic. We are waiting for the Hadoop of deep learning to be built.

Until then, if you aren't Google, with Google-sized clusters and cloisters of PhDs, what can you do? Greg Corrado, Senior Research Scientist at Google, gave a great presentation at the RE.WORK Deep Learning Summit 2015 (videos) that has some useful suggestions:

Click to read more ...

Wednesday
Feb 11, 2015

Rescuing an Outsourced Project from Collapse: 8 Problems Found and 8 Lessons Learned

If you are one of those people who think most of the products featured on HighScalability use way too many servers, then you'll love this story: 130 VMs serving fewer than 10,000 daily users were chopped down to just one machine.

Here's the setup. A smallish website was having problems. Users were unhappy. Hanging in the balance was not only the product, but the company. The site was built using Angular, Symfony2, Postgres, Redis, CentOS, eight HP blades with 128 GB of RAM each, two racks, a very large HP 3PAR storage array, a 1 Gbps uplink, and VMware.

More than enough power for the task at hand. Yet the system couldn't handle the load. What would you do?

That's the story Jacques Mattheij tells in his very entertaining and educational Saving a Project and a Company article.

Jacques says much was right about the website, but time pressure and mismanagement created big problems at the system level. "A single clueless person in a position of trust with non-technical management, an outsourced project and a huge budget, what could possibly go wrong?" Sound familiar?

Problem 1: Virtualization Gone Crazy

Click to read more ...

Wednesday
Feb 4, 2015

Matt Cutts: 10 Lessons Learned from the Early Days of Google


I mainly know of Matt Cutts, long-time Google employee (since 2000) and currently head of Google's Webspam team, from his appearances on TwiT with Leo Laporte. On TwiT Matt always comes off as smart, thoughtful, and a really nice guy. This you might expect.

What I didn't expect from this talk he gave, Lessons learned from the early days of Google, is that Matt also turns out to be quite funny and a good storyteller. The stories he tells are about his early days at Google, and he puts a very human face on the company. When you think everything Google does is a calculation made by some behind-the-scenes AI, Matt reminds us that it's humans making these decisions, and they generally just do the best they can.

The primary theme of the talk is innovation and problem solving through creativity. When you are caught between a rock and a hard place you need to get creative. Question your assumptions. Maybe there’s a creative way to solve your problem?

The talk is short and well worth watching. There are lots of those fun little details that only someone with experience and perspective can give. And there’s lots of wisdom here too. Here’s my gloss on Matt’s talk:

1. Sometimes creativity makes a big difference.

Click to read more ...

Monday
Feb 2, 2015

Marco Arment Uses Go Instead of PHP and Saves Money by Cutting the Number of Servers in Half

On the excellent Accidental Tech Podcast there's a running conversation about Marco Arment's (Tumblr, Instapaper) switch from his much-loved PHP to Go to implement feed crawling for Overcast, his popular podcasting app for the iPhone.

In Episode 101 (at about 1:10) Marco said he halved the number of servers used for crawling feeds by switching to Go. The total savings came to a few hundred dollars a month in server costs.

Why? Feed crawling requires lots of parallel network requests, and PHP is bad at that sort of thing, while Go is good at it.
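
For a sense of why Go suits this workload, here is a minimal, hedged sketch of concurrent feed fetching with a bounded worker pool, using only the standard library. The URLs and worker count are made up for illustration; this is not Overcast's actual crawler.

```go
// Sketch: fetch many feeds concurrently with a fixed pool of goroutines.
// All names (feedURLs, maxWorkers) are illustrative assumptions.
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
	"time"
)

func main() {
	feedURLs := []string{
		"https://example.com/a.xml",
		"https://example.com/b.xml",
		"https://example.com/c.xml",
	}

	const maxWorkers = 2
	client := &http.Client{Timeout: 10 * time.Second}

	jobs := make(chan string)
	var wg sync.WaitGroup

	// Start a fixed number of workers; each pulls URLs off the channel,
	// so many feeds are in flight at once without unbounded fan-out.
	for i := 0; i < maxWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for url := range jobs {
				resp, err := client.Get(url)
				if err != nil {
					fmt.Println("fetch failed:", url, err)
					continue
				}
				body, _ := io.ReadAll(resp.Body)
				resp.Body.Close()
				fmt.Printf("fetched %s (%d bytes)\n", url, len(body))
			}
		}()
	}

	for _, u := range feedURLs {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
}
```

In stock PHP the equivalent fan-out usually means juggling curl_multi handles or spawning extra worker processes, which is roughly the gap that was being paid for in servers.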

Amazingly, Marco wrote an article on how much Overcast earned in 2014. It earned $164,000 after Apple's 30%, but before other expenses. At this revenue level the savings, while not huge in absolute terms given the traffic of some other products Marco has worked on, were a good return on programming effort.

How much effort? It took about two months to rewrite and debug the feed crawlers. In addition, lots of supporting infrastructure that tied into the crawling system had to be created: logging, tracking when a feed was last crawled, monitoring delays, detecting queue congestion, and forcing a feed to be crawled immediately.
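
To make that list concrete, here is an illustrative sketch of the kind of per-feed bookkeeping such supporting infrastructure ends up tracking. The type and field names are assumptions, not Overcast's code.

```go
// Package crawler sketches the per-feed state a crawling system keeps around.
package crawler

import "time"

// FeedState is hypothetical: one record per feed, consulted by the scheduler,
// the monitoring dashboards, and the "crawl this now" admin action.
type FeedState struct {
	URL         string
	LastCrawled time.Time // when was this feed last crawled?
	LastError   string    // most recent failure, surfaced to monitoring
	QueueDepth  int       // rough congestion signal for this feed's queue
	ForceCrawl  bool      // set to push the feed to the front of the queue
}

// Overdue reports whether the feed should be crawled now, either because it
// was forced or because it has waited longer than the allowed delay.
func (f FeedState) Overdue(maxDelay time.Duration) bool {
	return f.ForceCrawl || time.Since(f.LastCrawled) > maxDelay
}
```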

So while the development cost was high up front, the savings will grow over time as Overcast grows, because efficient code on fast servers can absorb more load without spinning up more servers.

Lots of good lessons here, especially for the lone developer:

Click to read more ...

Wednesday
Jan 28, 2015

Instagram Strategy to Radically Reduce Traffic: Kill all the spambots!

RIP to my fallen robot followers on Instagram, if there's a heaven for robot instagram users, you guys are in there

— alldaychubbyboy (@Allday)

How do you scale to handle increased user traffic? Have less traffic. No, this is not a koan. The best way to deal with traffic is not to have it. 

In a two-day span Instagram disappeared 18.9 million users, or more than 29 percent of their "followers." Justin Bieber lost 3.5 million followers (15 percent), Kim Kardashian lost 1.3 million followers (5.5 percent), and Rihanna lost 1.2 million followers.

Instagram explains this dramatic reckoning was achieved by "removing deactivated spam accounts and accounts that violated its community guidelines." 

In an age when high user counts and tantalizing engagement metrics are more valuable than bitcoins, this can't have been an easy decision, but it is one that was made after Instagram was bought by Facebook.

Why? Gabe Madway, an Instagram spokesman, explains: "We totally get that it's uncomfortable for people. The overall goal is we want it to be perceived that the people following you are real."

Uncomfortable is an understatement. A BuzzFeed article nicely captured some of the anger; here's just one example (could be NSFW):

Click to read more ...

Wednesday
Jan 21, 2015

Learn from my pain - 5 Lessons from Ello's Adventures in Rapid Scaling 

Within one week Ello went from thousands of sessions a day to a few million sessions a day. Mike Pack wrote a great article sharing what they’ve learned: 5 Early Lessons from Rapid, High Availability Scaling with Rails.

Some of their scaling challenges: quantity of data, team size, DNS, bot prevention, responding to users, inappropriate content, and caching. What did they learn?

  1. Move the graph. User relationships were implemented on a standard Rails stack using Heroku and Postgres. The relationships table became the bottleneck. Solution: denormalize the social graph and move hot data into Redis. Redis is used for speed and Postgres is used for durability. Lesson: know the core pillar that supports your core offering and make it work (a sketch of this pattern follows the list).

  2. Create indexes early, or you're screwed. There's a camp that says only create indexes when they are needed. They are wrong. The lack of btree indexes kills query performance. Forget a unique index and your data becomes corrupted. Once the damage is done it's hard to add unique indexes later. The data has to be cleaned up and indexes take a long time to build when there's a lot of data.

  3. Sharding is cool, but not that cool. Shard all the things only after you've tried vertically scaling as much as possible. Sharding caused a lot of pain. Creating a covering index from the start and adding more RAM so data could be served from memory, not from disk, would have saved a lot of time and stress as the system scaled.

  4. Don't create bottlenecks, or do. Every new user automatically followed a system user that was used for announcements, etc. Scaling problems that would have been months down the road hit quickly as any write to the system user caused a write amplification of millions of records. The lesson here is not what you may think. While scaling to meet the challenge of the system user was a pain, it made them stay ahead of the scaling challenge. Lesson: self-inflict problems early and often.

  5. It always takes 10 times longer. All the solutions mentioned take much longer to implement than you might think. Early estimates of a couple of days soon give way to the reality of much longer time hits. Simply moving large amounts of data can take days. Adding indexes to large amounts of data takes time. And with large amounts of data, problems tend to surface only at the larger data sizes, which means you need to apply a fix and start over.
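
To make lesson 1 concrete, here is a hedged sketch of the "move the graph" pattern, written in Go rather than Ello's Rails stack: Postgres stays the durable system of record for follow relationships, while the hot, denormalized edge lists live in Redis sets for fast reads. The table name, key names, and client libraries are assumptions, not Ello's code.

```go
// Package graph sketches a denormalized social graph: durable rows in
// Postgres, hot follower sets in Redis.
package graph

import (
	"context"
	"database/sql"
	"fmt"

	_ "github.com/lib/pq" // hypothetical choice of Postgres driver
	"github.com/redis/go-redis/v9"
)

// Follow records the relationship durably in Postgres, then updates the
// denormalized Redis sets that serve the hot read path.
func Follow(ctx context.Context, db *sql.DB, rdb *redis.Client, follower, followed int64) error {
	if _, err := db.ExecContext(ctx,
		`INSERT INTO follows (follower_id, followed_id) VALUES ($1, $2)`,
		follower, followed); err != nil {
		return err
	}
	if err := rdb.SAdd(ctx, fmt.Sprintf("followers:%d", followed), follower).Err(); err != nil {
		return err
	}
	return rdb.SAdd(ctx, fmt.Sprintf("following:%d", follower), followed).Err()
}

// Followers serves reads from the hot Redis set; a miss could fall back to
// Postgres and repopulate the set (not shown).
func Followers(ctx context.Context, rdb *redis.Client, userID int64) ([]string, error) {
	return rdb.SMembers(ctx, fmt.Sprintf("followers:%d", userID)).Result()
}
```

Because the durable copy lives in Postgres, Redis never has to be treated as the source of truth; a lost set can always be rebuilt.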

The full article is excellent and filled with much more detail that makes it well worth reading.

Wednesday
Jan 7, 2015

The Ultimate Guide: 5 Methods for Debugging Production Servers at Scale

This is a guest post by Alex Zhitnitsky, an engineer working at Takipi, who is on a mission to help Java and Scala developers solve bugs in production and rid the world of buggy software.

How to approach the production debugging conundrum?

All sorts of wild things happen when your code leaves the safe and warm development environment. Unlike the comfort of the debugger in your favorite IDE, when errors happen on a live server you'd better come prepared. No more breakpoints, step over, or step into, and you can forget about adding that quick line of code to help you understand what just happened. In production, bad things happen first and then you have to figure out what exactly went wrong. To be able to debug in this kind of environment we first need to switch our debugging mindset and plan ahead. If you're not prepared with good practices in advance, roaming aimlessly through the logs won't be very effective.

And that's not all. With high-scalability architectures come high-scalability errors. In many cases we find transactions that originate on one machine or microservice and break something on another. Together with Continuous Delivery practices and constant code changes, errors find their way into production at an increasing rate. The biggest problem we face here is capturing the exact state that led to the error: what were the variable values, which thread were we in, and what was this piece of code even trying to do?
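
One way to get ahead of that, and a preview of the first method below, is to tag every log line with a transaction (or correlation) ID that travels with the request across machines, so a failure on one service can be traced back to the call that originated on another. Here is a minimal sketch in Go; the header name, the uuid dependency, and the log format are assumptions rather than Takipi's implementation.

```go
// Sketch: propagate a per-request transaction ID and stamp it on every log line.
package main

import (
	"log"
	"net/http"

	"github.com/google/uuid"
)

const txHeader = "X-Transaction-ID" // assumed header name

// withTransactionID reuses an ID passed by an upstream service or mints a new
// one at the edge, then logs every request with that ID attached.
func withTransactionID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		txID := r.Header.Get(txHeader)
		if txID == "" {
			txID = uuid.NewString()
		}
		w.Header().Set(txHeader, txID) // echo it so callers can correlate too
		log.Printf("tx=%s method=%s path=%s", txID, r.Method, r.URL.Path)
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
		// Any error logged here carries the same tx= field, so the full
		// cross-service path of one failing request can be reconstructed.
		log.Printf("tx=%s doing work", r.Header.Get(txHeader))
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8080", withTransactionID(mux)))
}
```

Grep all services' logs for a single tx value (or feed them to a log aggregator) and the path of one failing transaction falls out, no matter how many machines it touched.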

Let's take a look at 5 methods that can help us answer just that: distributed logging, advanced jstack techniques, BTrace, and other custom JVM agents.

1. Distributed Logging

Click to read more ...

Wednesday
Dec 31, 2014

Linus: The whole "parallel computing is the future" is a bunch of crock.

Linus Torvalds in his usual politically correct way made a typically understated statement about “pushing the whole parallelism snake-oil” that generated almost no response whatsoever.

Well, not quite. His comment on Avoiding ping pong has generated hundreds of responses, both on the original post and on Reddit.

The contention:

The whole "let's parallelize" thing is a huge waste of everybody's time. There's this huge body of "knowledge" that parallel is somehow more efficient, and that whole huge body is pure and utter garbage. Big caches are efficient. Parallel stupid small cores without caches are horrible unless you have a very specific load that is hugely regular (ie graphics).

Nobody is ever going to go backwards from where we are today. Those complex OoO [Out-of-order execution] cores aren't going away. Scaling isn't going to continue forever, and people want mobility, so the crazies talking about scaling to hundreds of cores are just that - crazy. Why give them an ounce of credibility?

Where the hell do you envision that those magical parallel algorithms would be used?

The only place where parallelism matters is in graphics or on the server side, where we already largely have it. Pushing it anywhere else is just pointless.

So give up on parallelism already. It's not going to happen. End users are fine with roughly on the order of four cores, and you can't fit any more anyway without using too much energy to be practical in that space. And nobody sane would make the cores smaller and weaker in order to fit more of them - the only reason to make them smaller and weaker is because you want to go even further down in power use, so you'd still not have lots of those weak cores.

Give it up. The whole "parallel computing is the future" is a bunch of crock.

An interesting question to ponder on the cusp of a new year. What will programs look like in the future? Very different than they look today? Or pretty much the same?

From the variety of replies to Linus it's obvious we are in no danger of arriving at consensus. There was the usual discussion of the differences between distributed, parallel, concurrent, and multithreaded programming, with each succeeding explanation more confusing than the last. The general gist: how you describe a problem in code is not how it has to run. Which is why I was not surprised to see a mini-language war erupt.

The idea is that parallelization is a problem only because of the old-fashioned languages being used. Use a better language and the parallelism of the design can be separated from the runtime, and it will all just magically work. There are echoes here of how datacenter architectures now use schedulers like Mesos to treat entire datacenters as a programmable fabric.

One of the more interesting issues raised in the comments was confusion over what exactly counts as a server. Can a desktop machine that needs to run fast parallel builds be considered a server? An unsatisfying definition of a not-server may simply be a device that can comfortably run applications that aren't highly parallelized.

I pulled out some of the more representative comments from the threads for your enjoyment. The consensus? There is none, but it's quite an interesting discussion...

Click to read more ...

Wednesday
Dec 10, 2014

Reactive prefetch speeds Google's mobile search by 100-150 milliseconds.

Increasing responsiveness by parallelizing and prefetching content using hints and dependency graphs is an old concept, but seldom do we see such a nice, tight example of the benefit as the one given by the great Ilya Grigorik in this G+ post:

The insight here is that we're initiating the fetch for the HTML and its critical resources in parallel... which requires that the page initiating the navigation knows which critical resources are being used on the target page.

This is a powerful pattern and one that you can use to accelerate your site as well. The key insight is that we are not speculatively prefetching resources and do not incur unnecessary downloads. Instead, we wait for the user to click the link and tell us exactly where they are headed, and once we know that, we tell the browser which other resources it should fetch in parallel - aka, reactive prefetch!

As you can infer, implementing the above strategy requires a lot of smarts both in the browser and within the search engine... First, we need to know the list of critical resources that may delay rendering of the destination page for every page on the web! No small feat, but the Search team has us covered - they're good like that. Next, we need a browser API that allows us to invoke the prefetch logic when the click occurs: the search page listens for the click event, and once invoked, dynamically inserts prefetch hints into the search results page. Finally, this is where Chrome comes in: as the search results page is unloaded, the browser begins fetching the hinted resources in parallel with the request for the destination page. The net result is that the critical resources are fetched much sooner, allowing the browser to render the destination page 100-150 milliseconds earlier.

Wednesday
Dec 3, 2014

All employees should be limited only by their ability rather than an absence of resources.

James Hamilton hid a pearl of wisdom inside Why Renewable Energy (Alone) Won't Fully Solve the Problem that I think is well worth prying out:

I’ve long advocated the use of economic incentives to drive innovative uses of computing resources inside the company while preventing costs from spiraling out of control.  Most IT departments control costs by having computing resources in short supply and only buying more resources slowly and with considerable care. Effectively computing is a scarce resource so it needs to get used carefully. This effectively limits IT cost growth and controls wastage but it also limits overall corporate innovation and the gains driven by the experiments that need these additional resources.

I’m a big believer in making effectively infinite computing resources available internally and billing them back precisely to the team that used them. Of course, each internal group needs to show the customer value of their resource consumption. Asking every group to effectively be a standalone profit center is, in some ways, complex in that the “product” from some groups is hard to quantitatively measure. Giving teams the resources they need to experiment and then allowing successful experiments to progress rapidly into production encourages innovation, makes for a more exciting place to work, and the improvements brought by successful experiments help the company be more competitive and better serve its customers.

I argue that all employees should be limited only by their ability rather than an absence of resources or an inability to argue convincingly for more. This is one of the most important yet least discussed advantages of cloud computing: taking away artificial resource limitations in support of light-weight experimentation and rapid innovation. Making individual engineers and teams responsible for delivering more value for more resources consumed makes it possible to encourage experimentation without fear that costs will rise without sufficient value being produced.
