Entries in Strategy (358)

Monday
Feb 24, 2020

Socratic vs. Euclidean Forms of API Documentation

 

I was emailing a service about their documentation. While their docs were good, for one particularly tricky concept they told me that once you use it for a while, that’s when you’ll understand it.

In other words: you’ll only understand it after you understand it.

I didn’t like that response. I want documentation that takes me from an unproductive newbie to a somewhat functioning journeyperson. Not an expert, but I want to get stuff done as soon as possible. And for that you need to understand the mental model behind the API. Otherwise, how do you know how to make anything happen?

I realize it’s hard to make good documentation. I spent a lot of time writing Explain the Cloud Like I'm 10 just to communicate the mental model behind the cloud. It’s not easy.

Then I read something that showed me there are two different styles of documentation: Euclidean and Socratic:

Euclidean - state your axioms and let users derive the rest. Easiest for the API provider, but hardest on the user. This is the most common form of documentation. You see it all the time. Each entry point in the API is sort of explained, but there’s nothing tying the whole API together. You just pray someone on Stack Overflow has already asked the questions you want to ask and someone made the effort to answer—before the question was voted into oblivion.

Socratic - an open inquiry meant to bring about a deeper understanding in the API user. You get the Euclidean part, but you also get a FAQ, you get recipes for common tasks, you get error conditions and possible responses, and you get working code examples. You get a deep explanation of what the API is trying to accomplish and how you can use it to accomplish your own goals. People tend to think that once they've written an example on GitHub they've fulfilled their Socratic duty. Not so. You need to help people get to the point where they could have written the example code themselves. The person who works at the company and wrote the example already knows all that, but it's that knowledge that must be communicated, not the end product.
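
To make the contrast concrete, here’s a hedged, entirely hypothetical sketch of the difference. A Euclidean reference for an imaginary upload API might say only "POST /uploads creates an upload and returns its ID." A Socratic recipe would instead walk through a common task and explain the mental model behind it. Every endpoint, parameter, and name below is invented for illustration:

```typescript
// Hypothetical "Socratic" recipe: upload a file in resumable chunks.
// The mental model: an upload is a session, not a single call. You create the session,
// send chunks that can each be retried independently, then finalize. Once you know that,
// the three calls below stop looking arbitrary.

async function uploadFile(api: { baseUrl: string; token: string }, name: string, data: Uint8Array) {
  const auth = { Authorization: `Bearer ${api.token}` };

  // 1. Create the upload session. The server reserves an ID and tells us its preferred chunk size.
  const session = await fetch(`${api.baseUrl}/uploads`, {
    method: "POST",
    headers: { ...auth, "Content-Type": "application/json" },
    body: JSON.stringify({ name, size: data.length }),
  }).then(r => r.json() as Promise<{ id: string; chunkSize: number }>);

  // 2. Send chunks. Each PUT is idempotent, so a failed chunk can simply be retried.
  for (let offset = 0; offset < data.length; offset += session.chunkSize) {
    await fetch(`${api.baseUrl}/uploads/${session.id}/chunks/${offset}`, {
      method: "PUT",
      headers: auth,
      body: data.slice(offset, offset + session.chunkSize),
    });
  }

  // 3. Finalize. Until this call the upload is invisible to other users of the API.
  return fetch(`${api.baseUrl}/uploads/${session.id}/finalize`, { method: "POST", headers: auth }).then(r => r.json());
}
```

That’s the difference: the Euclidean version tells you what exists, the Socratic version tells you why it exists and how the pieces fit, so you could have written the recipe yourself.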

API doc tends to be Euclidean, but at its best documentation is Socratic. That’s when you can really be productive—fast.

If you can't explain something well, it's likely you don't understand it either. And if you don't understand it as an API provider, how is anyone else going to understand it? Please make that extra effort. 

What did I read that explained the idea of Euclidean vs Socratic approaches to a topic? Take a look at this interview with David B. Kinney on the Philosophy of Science:


Wednesday
Apr 25, 2018

Google: Addressing Cascading Failures

Like the Spanish Inquisition, nobody expects cascading failures. Here's how Google handles them.

This excerpt is a particularly interesting and comprehensive chapter—Chapter 22 - Addressing Cascading Failures—from Google's awesome book on Site Reliability Engineering. Worth reading if it hasn't been on your radar. And it's free!

"If at first you don't succeed, back off exponentially."

Dan Sandler, Google Software Engineer

"Why do people always forget that you need to add a little jitter?"

Ade Oshineye, Google Developer Advocate

A cascading failure is a failure that grows over time as a result of positive feedback. It can occur when a portion of an overall system fails, increasing the probability that other portions of the system fail. For example, a single replica for a service can fail due to overload, increasing load on remaining replicas and increasing their probability of failing, causing a domino effect that takes down all the replicas for a service.
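
The two quotes above compress a lot of hard-won experience. Retry loops that hammer an already overloaded service are a classic way the positive feedback loop gets started, and exponential backoff with jitter spreads retries out so recovering replicas aren’t immediately re-overloaded. As a minimal illustration of the idea (not code from the book):

```typescript
// Minimal sketch of retry with exponential backoff and full jitter.
async function retryWithBackoff<T>(
  op: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 100,
  maxDelayMs = 10_000
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err;
      // Exponential backoff: double the window each attempt, capped so it can't grow without bound.
      const window = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
      // Full jitter: pick a random point in the window so clients don't retry in lockstep.
      const delayMs = Math.random() * window;
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}
```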

We’ll use the Shakespeare search service discussed in Shakespeare: A Sample Service as an example throughout this chapter. Its production configuration might look something like Figure 22-1.

Figure 22-1. Example production configuration for the Shakespeare search service

Causes of Cascading Failures and Designing to Avoid Them


Monday
Apr 23, 2018

Strategy: Use TensorFlow.js in the Browser to Reduce Server Costs

 

One of the strategies Jacob Richter describes (How we built a big data platform on AWS for 100 users for under $2 a month) in his relentless drive to lower AWS costs is moving ML from the server to the client.

Moving work to the client is a time-honored way of saving on server-side costs. Not long ago it might have seemed like moving ML to the browser would be an insane thing to do. What was crazy yesterday often becomes standard practice today, especially when it works, especially when it saves money:

Our post-processing machine learning models are mostly K-means clustering models to identify outliers. As a further step to reduce our costs, we are now running those models on the client side using TensorFlow.js instead of using an autoscaling EC2 container. While this was only a small change using the same code, it resulted in a proportionally massive cost reduction. 

To learn more, Alexis Perrier has a good article, Tensorflow with Javascript Brings Deep Learning to the Browser:

Tensorflow.js has four layers: the WebGL API for GPU-supported numerical operations, the web browser for user interactions, and two APIs: Core and Layers. The low-level Core API corresponds to the former deeplearn.js library. It provides hardware-accelerated linear algebra operations and an eager API for automatic differentiation. The higher-level Layers API is used to build machine-learning models on top of Core. The Layers API is modeled after Keras and implements similar functionality. It also allows you to import models previously trained in Python with Keras or TensorFlow SavedModels and use them for inference or transfer learning in the browser. 
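
As a rough sketch of what that looks like in the browser (the model URL and input shape below are placeholders, and this assumes a model already converted to the TensorFlow.js format):

```typescript
import * as tf from "@tensorflow/tfjs";

// Sketch: load a previously converted Keras model and run inference entirely in the browser.
async function scoreInBrowser(features: number[]): Promise<number[]> {
  // Fetches model.json plus its weight shards over HTTP; the result is a Layers API model.
  const model = await tf.loadLayersModel("https://example.com/models/outliers/model.json");

  // Wrap the raw features in a rank-2 tensor: one row, N feature columns.
  const input = tf.tensor2d([features]);

  // predict() runs on WebGL when available and falls back to the CPU backend otherwise.
  const output = model.predict(input) as tf.Tensor;
  const scores = Array.from(await output.data());

  // Free the memory held by the tensors.
  input.dispose();
  output.dispose();
  return scores;
}
```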

If you want a well-paved path to success, this is not it. I couldn’t find many success stories of using TensorFlow in the browser yet.

One clear use case came from kbrose:

There are a lot of services that offer free or very cheap hosting of static websites. If you are able to implement your ML model in JS this allows you to deploy your product/app/whatever easily and with low cost. In comparison, requiring a backend server running your model is harder to set up and maintain, in addition to costing more. 


 

Monday
Apr 16, 2018

Google: A Collection of Best Practices for Production Services

This excerpt—Appendix B - A Collection of Best Practices for Production Services—is from Google's awesome book on Site Reliability Engineering. Worth reading if it hasn't been on your radar. And it's free!

 

Fail Sanely

Sanitize and validate configuration inputs, and respond to implausible inputs by both continuing to operate in the previous state and alerting to the receipt of bad input. Bad input often falls into one of these categories:

Incorrect data

Validate both syntax and, if possible, semantics. Watch for empty data and partial or truncated data (e.g., alert if the configuration is N% smaller than the previous version).

Delayed data

This may invalidate current data due to timeouts. Alert well before the data is expected to expire.

Fail in a way that preserves function, possibly at the expense of being overly permissive or overly simplistic. We’ve found that it’s generally safer for systems to continue functioning with their previous configuration and await a human’s approval before using the new, perhaps invalid, data.

Examples

In 2005, Google’s global DNS load- and latency-balancing system received an empty DNS entry file as a result of a file permissions issue. It accepted this empty file and served NXDOMAIN for six minutes for all Google properties. In response, the system now performs a number of sanity checks on new configurations, including confirming the presence of virtual IPs for google.com, and will continue serving the previous DNS entries until it receives a new file that passes its input checks.

In 2009, incorrect (but valid) data led to Google marking the entire Web as containing malware [May09]. A configuration file containing the list of suspect URLs was replaced by a single forward slash character (/), which matched all URLs. Checks for dramatic changes in file size and checks to see whether the configuration is matching sites that are believed unlikely to contain malware would have prevented this from reaching production.
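
A hedged sketch of the kind of checks described above; the thresholds, data shapes, and the google.com check are illustrative, not Google's actual code:

```typescript
// Illustrative config sanity check: accept a new config only if it passes basic plausibility
// tests; otherwise keep serving the previous config and alert a human.
interface DnsConfig {
  raw: string;                    // the file as received
  entries: Map<string, string[]>; // hostname -> virtual IPs
}

function validateNewConfig(previous: DnsConfig, candidate: DnsConfig): { ok: boolean; reason?: string } {
  // Empty or truncated data: an empty file should never replace a working config.
  if (candidate.entries.size === 0) {
    return { ok: false, reason: "candidate config is empty" };
  }
  // Dramatic size change: reject if the new config is more than 20% smaller than the old one.
  if (candidate.raw.length < previous.raw.length * 0.8) {
    return { ok: false, reason: "candidate config shrank by more than 20%" };
  }
  // Semantic check: critical entries must still be present.
  if (!candidate.entries.has("google.com")) {
    return { ok: false, reason: "missing virtual IPs for google.com" };
  }
  return { ok: true };
}

// The caller keeps serving the previous config, and pages a human, whenever validation fails.
```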

Progressive Rollouts


Monday
Dec 4, 2017

The Eternal Cost Savings of Netflix's Internal Spot Market

 

Netflix used their internal spot market to save 92% on video encoding costs. The story of how is told by Dave Hahn in his now annual A Day in the Life of a Netflix Engineer. Netflix first talked about their spot market in a pair of articles published in 2015: Creating Your Own EC2 Spot Market Part 1 and Part 2.

The idea is simple:

  • Netflix runs out of three AWS regions and uses hundreds of thousands of EC2 instances; many are underutilized at various times of the day.

  • Video encoding is 70% of Netflix’s computing needs, running on 300,000 CPUs in over 1000 different autoscaling groups.

  • So why not create a spot market out of their own underutilized reserved instances to process video encoding?

Before proceeding let's define what a spot market is:

Spot Instances enable you to request unused EC2 instances, which can lower your Amazon EC2 costs significantly. The hourly price for a Spot Instance (of each instance type in each Availability Zone) is set by Amazon EC2, and adjusted gradually based on the long-term supply of and demand for Spot Instances. Your Spot Instance runs whenever capacity is available and the maximum price per hour for your request exceeds the Spot price.
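
Netflix’s twist is that the market is internal: the unused supply is their own reserved capacity, and the demand is latency-tolerant encoding work. As a toy illustration of the core matching step (not Netflix's system; the data shapes and policy are invented):

```typescript
// Toy internal spot market: hand idle reserved capacity to queued encoding jobs.
interface IdlePool { region: string; instanceType: string; idleInstances: number }
interface EncodeJob { title: string; instancesNeeded: number }
interface Placement { job: EncodeJob; pool: IdlePool; count: number }

function allocate(pools: IdlePool[], queue: EncodeJob[]): Placement[] {
  const plan: Placement[] = [];
  // Prefer the pools with the most idle capacity: use what would otherwise be wasted.
  const sorted = [...pools].sort((a, b) => b.idleInstances - a.idleInstances);

  for (const job of queue) {
    let remaining = job.instancesNeeded;
    for (const pool of sorted) {
      if (remaining === 0) break;
      const take = Math.min(pool.idleInstances, remaining);
      if (take > 0) {
        plan.push({ job, pool, count: take });
        pool.idleInstances -= take; // capacity is borrowed, not purchased
        remaining -= take;
      }
    }
    // Jobs that can't be fully placed wait for the next scheduling pass; encoding is
    // latency-tolerant, which is exactly what makes the spot model work.
  }
  return plan;
}
```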

At any point in time AWS has a lot of underutilized instances. It turns out so does Netflix. To understand why creating an internal spot market helped Netflix so much, we'll first need to understand how they encode video.

How Netflix Encodes Video


Wednesday
Aug 31, 2016

My Test Tube Filled with DNA is Better than Your Mesos Cluster

 

We’ve seen computation using slime mold, soap film, and water droplets; there’s even a 10,000 Domino Computer. Now DNA can do math in a test tube, using addition, subtraction, multiplication, and division.

It’s not fast. Calculations can take hours. The upside: they are tiny and can work in wet environments. Think of running calculations in your bloodstream or in cells, like a programmable firewall, to monitor and alert on targeted health metrics and then trigger a localized response. Or if you are writing science fiction, perhaps the ocean could become one giant computer?

The applications already sound like science fiction:

Prior devices for control of chemical reaction networks and DNA doctor applications have been limited to finite-state control, and analog DNA circuits will allow much more sophisticated analog signal processing and control. DNA robotics have allowed devices to operate autonomously (e.g., to walk on a nanostructure) but also have been limited to finite-state control.
Analog DNA circuits can allow molecular robots to include real-time analog control circuits to provide much more sophisticated control than offered by purely digital control. Many artificial intelligence systems (e.g., neural networks and probabilistic inference) that dynamically learn from environments require analog computation, and analog DNA circuits can be used for back-propagation computation of neural networks and Bayesian probabilistic inference systems.

How does it work?


Tuesday
Aug 23, 2016

The Always On Architecture - Moving Beyond Legacy Disaster Recovery

Failover does not cut it anymore. You need an ALWAYS ON architecture with multiple data centers.
-- Martin Van Ryswyk, VP of Engineering at DataStax

Failover, switching to a redundant or standby system when a component fails, has a long and checkered history as a way of dealing with failure. The reason is that your failover mechanism becomes a single point of failure that often fails just when it's needed most. Having worked on a few telecom systems that used a failover strategy, I know exactly how stressful failover events can be and how stupid you feel when your failover fails. If you have a double or triple fault in your system, failover is exactly when it will happen. 

For a long time the only real trick we had for achieving fault tolerance was to have a hot, warm, or cold standby (disk, interface, card, server, router, generator, datacenter, etc.) and failover to it when there's a problem. This old style of Disaster Recovery planning is no longer adequate or necessary.

Now, thanks to cloud infrastructures, at least at a software system level, we have an alternative: an always on architecture. Google calls this a natively multihomed architecture. You can distribute data across multiple datacenters in such a way that all your datacenters are always active. Each datacenter can automatically scale capacity up and down depending on what happens to other datacenters. You know, the usual sort of cloud propaganda. Robin Schumacher makes a good case here: Long live Dear CXO – When Will What Happened to Delta Happen to You?

Recent Problems With Disaster !Recovery


Tuesday
Aug 9, 2016

10 Gameday Failure Testing Scenarios from Obama for America

I have dozens if not hundreds of half-finished articles and snippets of ideas in the haunted house that is my Google Docs. Walking the house around midnight, with the lights turned off of course, I stumbled upon one ghost that has been haunting me since 2012. It is time to perform the ritual of exorcism by just publishing something.

You may or may not remember Obama for America, which in 2012 had a staff of 120 people that built and maintained the infrastructure that helped get out the vote for Obama. 

Harper Reed and Dylan Richard headed up the effort. Around that time they were getting a lot of press. One of the things that interested me was how they held Gameday test events, where they would simulate failure modes in their testing environments. Google calls these DiRT (Disaster Recovery Testing) exercises.

So I asked Harper and Dylan what these exercises actually were and they were kind enough to reply. And I apparently forgot all about it. My apologies. Better late than never? Yah, let's go with that.

Here are some of the failure testing scenarios carried out by the Obama for America team:

  1. Flush memcache
  2. Kill memcache (null route on instances)
  3. Kill replicants (we used security groups to deny access)
  4. Kill master
  5. Kill the backing API (we had a heavy SOA)
  6. Put API in read-only (killing master should accomplish this - but this tests client apps explicitly)
  7. Kill SQS (we used it heavily, particularly for decoupled systems and fall backs)
  8. Emulate an EBS failure (kill all DBs [we used RDS], kill all EBS backed instances)
  9. Emulate full east coast failure (we had a 2 stage failover plan to the west coast - fail to a read only mode which we could do easily, and fail over permanently which would only happen in the case of extended east coast AWS unavailability)
  10. Emulate human error (claim to have done something [scale up, restart a DB, flush the cache, bounce the wsgi proc, etc] but don't actually do it) 
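
As a rough sketch of what one of these drills can look like mechanically (this is not the OFA tooling; the addresses and commands are illustrative), the "null route on instances" trick in scenario 2 can be as simple as blackholing the cache endpoints on a client host for the duration of the exercise:

```typescript
// Illustrative gameday helper for scenario 2: simulate a dead memcache tier by blackholing
// its IPs on a Linux client host. Requires root; the addresses below are placeholders.
import { execSync } from "node:child_process";

const MEMCACHE_IPS = ["10.0.1.10", "10.0.1.11"]; // placeholder addresses

function startDrill(): void {
  for (const ip of MEMCACHE_IPS) {
    // Drop all traffic to this address so connections time out like a real outage.
    execSync(`ip route add blackhole ${ip}/32`);
  }
  console.log("memcache blackholed: watch dashboards, verify fallbacks, take notes");
}

function endDrill(): void {
  for (const ip of MEMCACHE_IPS) {
    execSync(`ip route del blackhole ${ip}/32`);
  }
  console.log("routes restored");
}

startDrill();
// ... run the exercise, then:
endDrill();
```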

Now there's one less ghost haunting the halls.


Wednesday
Jul 13, 2016

Why Amazon Retail Went to a Service Oriented Architecture

When Lee Atchison arrived at Amazon, Amazon was in the process of moving from a large monolithic application to a Service Oriented Architecture.

Lee talks about this evolution in an interesting interview on Software Engineering Daily: Scalable Architecture with Lee Atchison, about Lee's new book: Architecting for Scale: High Availability for Your Growing Applications.

This is a topic Adrian Cockcroft has talked a lot about in relation to his work at Netflix, but it's a powerful experience to hear Lee talk about how Amazon made the transition, knowing what Amazon would later become. 

Amazon was running into the problems of success. Not so much scaling to handle requests; they were suffering from the problem of scaling the number of engineers working in the same code base.

At the time their philosophy was based on the Two Pizza team. A small group owns a particular piece of functionality. The problem is it doesn’t work to have hundreds of pizza teams working on the same code base. It became very difficult to innovate and add new features. It even became hard to build the application, pass the test suites, and deploy the software.

The solution: move to a Service Oriented Architecture (not microservices).

Organizing around services allowed individual teams to truly own the code base, the support responsibility, and top to bottom responsibility for the functionality.

The result: a dramatic increase in innovation and the ability to grow. After a while the Amazon retail site grew with a constant stream of new capabilities. Maybe too many :-)

The process requires a culture shift, really more of an ownership shift, from being part of a larger group to being an entity of its own that has responsibilities outside the group as well as inside it.

While the strategy of consciously exploiting team organization as a means of speeding up product development and encouraging innovation is not a new idea now, in the early 2000s it would have been one ballsy move.


Wednesday
Jul 6, 2016

Machine Learning Driven Programming: A New Programming for a New World

 

If Google were created from scratch today, much of it would be learned, not coded. Around 10% of Google's 25,000 developers are proficient in ML; it should be 100% -- Jeff Dean

Like the weather, everybody complains about programming, but nobody does anything about it. That’s changing and like an unexpected storm the change comes from an unexpected direction: Machine Learning / Deep Learning.

I know, you are tired of hearing about Deep Learning. Who isn’t by now? But programming has been stuck in a rut for a very long time and it's time we do something about it.

Lots of silly little programming wars continue to be fought that decide nothing. Functions vs objects; this language vs that language; this public cloud vs that public cloud vs this private cloud vs that ‘fill in the blank’; REST vs unrest; this byte level encoding vs some different one; this framework vs that framework; this methodology vs that methodology; bare metal vs containers vs VMs vs unikernels; monoliths vs microservices vs nanoservices; eventually consistent vs transactional; mutable vs immutable; DevOps vs NoOps vs SysOps; scale-up vs scale-out; centralized vs decentralized; single threaded vs massively parallel; sync vs async. And so on ad infinitum.

It’s all pretty much the same shite, different day. We are just creating different ways of calling functions that we humans still have to write. The real power would be in getting a machine to write the functions. And that’s what Machine Learning can do: write functions for us. Machine Learning just might be a different kind of shite for a different day.

Machine Learning Driven Programming
