Entries in Strategy (358)

Monday
Nov 16, 2015

9ish Low Latency Strategies for SaaS Companies

Achieving very low latencies takes special engineering, but if you are a SaaS company, latencies of a few hundred milliseconds are possible for complex business logic using standard technologies like load balancers, queues, JVMs, and REST APIs.

Itai Frenkel, a software engineer at Forter, which provides a Fraud Prevention Decision as a Service, shows how in an excellent article: 9.5 Low Latency Decision as a Service Design Patterns.

While any article on latency will have some familiar suggestions, Itai goes into some new territory you can really learn from. The full article is rich with detail, so you'll want to read it, but here's a short gloss:

Click to read more ...

Wednesday
Nov 04, 2015

Strategy: Avoid Lots of Little Files

I've been bitten by this one. It happens when you quite naturally use the file system as a quick and dirty database. A directory is a lot like a table and a file name looks a lot like a key. You can store many-to-one relationships via subdirectories. And the path to a file makes a handy quick lookup key. 

The problem is a file system isn't a database. That realization doesn't hit until you reach a threshold where there are actually lots of files. Everything works perfectly until then.

Once the threshold is hit, iterating a directory becomes very slow because most file system directory data structures are not optimized for the case of lots of small files. Even opening a file becomes slow.

According to Steve Gibson on Security Now (@16:10) 1Password ran into this problem. 1Password stored every item in their vault in an individual file. This allowed standard file syncing technology to be used to update only the changed files. Updating a password changes just one file so only that file is synced.

Steve thinks this is a design mistake, but the approach makes perfect sense. It's simple and robust, which is good design given what I assume was the original, reasonable expectation of relatively small vaults.

The problem is the file approach doesn't scale to larger vaults with thousands of files for thousands of web sites. Interestingly, decrypting files was not the bottleneck; the overhead of opening files was. The slowdown came from the elaborate security checks the OS makes to validate whether a process has the rights to open a file.

The new version of 1Password uses a UUID to shard items into one of 16 files based on the first hexadecimal digit of the UUID. Given good random number generation the files should grow more or less equally as items are added. Problem solved. Would this be your first solution when first building a product? Probably not.
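Here is a minimal Python sketch of that kind of sharding, assuming random (version 4) UUIDs whose first hex digit is uniformly distributed. The function names and the vault_X.dat file naming are hypothetical and are not taken from 1Password.

```python
import uuid
from collections import Counter

NUM_SHARDS = 16  # one shard per possible first hex digit (0-f)

def shard_for(item_id: uuid.UUID) -> int:
    """Return a shard index in [0, 15] taken from the first hex digit."""
    return int(item_id.hex[0], 16)

def shard_filename(item_id: uuid.UUID) -> str:
    """Hypothetical naming scheme: one file per shard, e.g. 'vault_a.dat'."""
    return f"vault_{item_id.hex[0]}.dat"

if __name__ == "__main__":
    # With random UUIDs the shards should fill at roughly the same rate.
    counts = Counter(shard_for(uuid.uuid4()) for _ in range(16_000))
    for shard in range(NUM_SHARDS):
        print(f"shard {shard:x}: {counts[shard]} items")
```

Opening one of 16 files replaces opening thousands, at the cost of rewriting a larger file whenever a single item changes.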

Apologies to 1Password if this is not a correct characterization of their situation, but even if wrong, the lesson still remains.

Click to read more ...

Wednesday
Oct 21, 2015

5 Lessons from 5 Years of Building Instagram

Instagram has always been generous in sharing their accumulated wisdom. Just take a look at the Related Articles section of this post to see how generous.

The tradition continues. Mike Krieger, Instagram co-founder, wrote a really good article on lessons learned from milestones achieved during Five Years of Building Instagram. Here's a summary of the lessons, but the article goes into much more of the connective tissue and is well worth reading.

  1. Do the simple thing first. This is the secret of supporting exponential growth. There's no need to future proof everything you do. That leads to paralysis. For each new challenge find the fastest, simplest fix.
  2. Do fewer things better. Focus on a single platform. This allows you to iterate faster because not everything has to be done twice. When you have to expand, create a team explicitly for each platform.
  3. Upfront work can pay huge dividends. Create an automated scriptable infrastructure implementing a repeatable server provisioning process. This makes it easier to bring on new hires and handle disasters. Hire engineers with the right stuff who aren't afraid to work through a disaster.
  4. Don’t reinvent the wheel. Instagram moved to Facebook's infrastructure because it allowed them to stay small and leverage a treasure trove of capabilities.
  5. Nothing lasts forever. Be open to evolving your product. Don't be afraid of creating special teams to tackle features and adapt to a rapidly scaling community.

Related Articles

Wednesday
Oct 14, 2015

Save some bandwidth by turning off TCP Timestamps

This is a guest post by Donatas Abraitis, System Engineer at Vinted, with an unusual approach for saving a little bandwidth.

Looking at https://tools.ietf.org/html/rfc1323 there is a nice title: 'TCP Extensions for High Performance'. Note the date: May 1992. The Timestamps option may appear in any data or ACK segment, adding 12 bytes to the 20-byte TCP header.

Using TCP options, the sender places a timestamp in each data segment, and the receiver reflects these timestamps back in ACK segments. Then a single subtraction gives the sender an accurate RTT measurement for every ACK segment.
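The echo mechanism can be sketched in a few lines of Python. This is a toy model rather than a TCP implementation: the Segment class and helper functions are made up for illustration, and only the TSval and TSecr field names come from the RFC.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    tsval: int  # TSval: the sending side's clock when the segment left
    tsecr: int  # TSecr: the peer timestamp being echoed back (0 if none)

def make_data_segment(sender_now_ms: int) -> Segment:
    """Sender stamps each data segment with its own clock."""
    return Segment(tsval=sender_now_ms, tsecr=0)

def make_ack(data: Segment, receiver_now_ms: int) -> Segment:
    """Receiver reflects the sender's TSval back in the TSecr field."""
    return Segment(tsval=receiver_now_ms, tsecr=data.tsval)

def rtt_from_ack(ack: Segment, sender_now_ms: int) -> int:
    """One subtraction on the sender's own clock yields the RTT."""
    return sender_now_ms - ack.tsecr

if __name__ == "__main__":
    data = make_data_segment(sender_now_ms=1000)
    ack = make_ack(data, receiver_now_ms=987_654)  # clocks need not agree
    print(rtt_from_ack(ack, sender_now_ms=1042), "ms")
```

Because the sender only ever compares its own clock with its own echoed value, the two hosts do not need synchronized clocks.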

To confirm this, let's dig into the kernel source:

./include/net/tcp.h:#define TCPOLEN_TSTAMP_ALIGNED    12
./net/ipv4/tcp_output.c:static void tcp_connect_init(struct sock *sk)
  ...
  tp->tcp_header_len = sizeof(struct tcphdr) +
    (sysctl_tcp_timestamps ? TCPOLEN_TSTAMP_ALIGNED : 0);
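To put those 12 bytes in perspective, here is a rough back-of-the-envelope estimate in Python. The 1500-byte MTU and the 20-byte IPv4 header without options are assumptions for illustration, not figures from the post.

```python
IP_HEADER = 20    # IPv4 header without options (assumption)
TCP_HEADER = 20   # base TCP header, as stated above
TS_OPTION = 12    # TCPOLEN_TSTAMP_ALIGNED from the kernel source
MTU = 1500        # typical Ethernet MTU (assumption)

def max_payload(timestamps: bool) -> int:
    """Payload bytes that fit in one MTU-sized packet."""
    return MTU - IP_HEADER - TCP_HEADER - (TS_OPTION if timestamps else 0)

def pure_ack_size(timestamps: bool) -> int:
    """Size of an ACK that carries no payload."""
    return IP_HEADER + TCP_HEADER + (TS_OPTION if timestamps else 0)

if __name__ == "__main__":
    on, off = max_payload(True), max_payload(False)
    print(f"payload per full packet: {on} vs {off} bytes")
    print(f"payload capacity lost:   {(off - on) / off:.2%}")
    print(f"pure ACK size:           {pure_ack_size(True)} vs {pure_ack_size(False)} bytes")
```

Under these assumptions the cost is under one percent for full-size packets but much more noticeable for traffic dominated by small packets and bare ACKs.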

Some visualizations:

Click to read more ...

Monday
Oct 05, 2015

Your Load Generator is Probably Lying to You - Take the Red Pill and Find Out Why

Pretty much all your load generation and monitoring tools do not work correctly. Those charts you thought were full of relevant information about how your system is performing are really just telling you a lie. Your sensory inputs are being jammed. 

To find out how, listen to the Morpheus of performance monitoring, Gil Tene, CTO and co-founder at Azul Systems, makers of truly high performance JVMs, in a mesmerizing talk on How NOT to Measure Latency.

This talk is about removing the wool from your eyes. It's the red pill option for what you thought you were testing with load generators.

Some highlights:

  • If you want to hide the truth from someone, show them a chart of all normal traffic with just one bad spike surging into 95th percentile territory.

  • The number one indicator you should never get rid of is the maximum value. That’s not noise, it’s the signal, the rest is noise.

  • 99% of users experience ~99.995%’ile response times, so why are you even looking at 95%'ile numbers?

  • Monitoring tools routinely drop important samples in the result set, leading you to draw really bad conclusions about the quality of the performance of your system.
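The arithmetic behind the 99.995%'ile highlight can be reproduced with a short Python sketch. It assumes each user issues about 200 independent requests in a session; that number and the independence assumption are illustrative and are not stated in this post.

```python
def fraction_of_users_within(percentile: float, requests_per_user: int) -> float:
    """Probability that every one of a user's requests is at or below the percentile."""
    return percentile ** requests_per_user

def percentile_seen_by(fraction_of_users: float, requests_per_user: int) -> float:
    """Percentile that this fraction of users stays within across all their requests."""
    return fraction_of_users ** (1.0 / requests_per_user)

if __name__ == "__main__":
    n = 200  # assumed requests per user session
    for p in (0.95, 0.99, 0.999, 0.99995):
        share = fraction_of_users_within(p, n)
        print(f"{p:.5%}'ile covers {share:.4%} of users")
    print(f"99% of users stay within the {percentile_seen_by(0.99, n):.4%}'ile")
```

With many requests per user, almost everyone hits at least one response slower than the 95%'ile, which is why the far tail describes the typical user's experience.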

It doesn't take long into the talk to realize Gil really knows his stuff. It's a deep talk with deep thoughts based on deep experience, filled with surprising insights. So if you take the red pill, you'll learn a lot, but you may not always like what you've learned.

Here's my inadequate gloss on Gil's amazing talk:

How to Lie With Percentiles

Click to read more ...

Wednesday
Sep 30, 2015

Strategy: Taming Linux Scheduler Jitter Using CPU Isolation and Thread Affinity

When nanoseconds matter, you have to pay attention to OS scheduling details. Mark Price, who works in the rarefied high performance environment of high finance, shows how in his excellent article on Reducing system jitter.

For a tuning example he uses the famous Disruptor inter-thread messaging library. The goal is to keep the OS continuously feeding CPUs work from high priority threads. His baseline test shows the fastest message is sent in 76 nanoseconds, 1 in 100 messages took longer than 2 milliseconds, and the longest delay was 11 milliseconds.

The next section of the article shows in loving detail how to make those latencies lower and more consistent, a job many people will need to do in practice. You'll want to read the article for a full explanation, including how to use perf_events and HdrHistogram. It's really great at showing the process, but in short:

  • Turning off power save mode on the CPU brought the max latency from 11 msec down to 8 msec.
  • Guaranteeing threads will always have CPU resources using CPU isolation and thread affinity brought the maximum latency down to 14 microseconds.
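As a small illustration of the affinity half of that second point, here is a Python sketch that pins the calling process to a single CPU on Linux. The helper name is hypothetical and this is not the code from the article. CPU isolation itself is configured outside the program, typically at the kernel level (for example with the isolcpus boot parameter).

```python
import os
from typing import Optional, Set

def pin_to_cpu(cpu: int) -> Optional[Set[int]]:
    """Pin the caller to one CPU and return the resulting affinity set.

    Returns None on platforms without sched_setaffinity (e.g. macOS, Windows).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, {cpu})  # pid 0 means the caller
    return os.sched_getaffinity(0)

if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        allowed = os.sched_getaffinity(0)
        target = min(allowed)  # pick any CPU we are already allowed to use
        print("allowed before:", sorted(allowed))
        print("allowed after: ", sorted(pin_to_cpu(target)))
    else:
        print("CPU affinity is not supported on this platform")
```

Pinning alone only stops the scheduler from migrating the thread. Keeping other work off that CPU is what isolation adds.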

Related Articles

Wednesday
Sep 16, 2015

5 Lessons and 8 Industry Changes Over 5 Years as Etsy CTO

Endings are often a time for reflection and from reflection often comes wisdom. That is the case for Kellan Elliott-McCrea, who recently announced he was leaving his job after five successful years as the CTO of Etsy. Kellan wrote a rather remarkable going away post: Five years, building a culture, and handing it off, brimming with both insight and thoughtful commentary.

This post is just a short gloss of the major points. He goes into more depth on each point, so please read his post.

The Five Lessons:

  1. Nothing we “know” about software development should be assumed to be true.
  2. Technology is the product of the culture that builds it.
  3. Software development should be thought of as a cycle of continual learning and improvement rather than a progression from start to finish, or a search for correctness.
  4. You build a culture of learning by optimizing globally not locally.
  5. If you want to build for the long term, the only guarantee is change.

The Eight Industry Changes:

  1. Five years ago, continuous deployment was still a heretical idea. 
  2. Five years ago, it was crazy to discuss that monitoring, testing, debugging, QA, staged releases, game days, user research, and prototypes are all tools with the same goal, improving confidence, rather than separate disciplines handled by distinct teams.
  3. Five years ago, focusing on detection and response vs prevention in order to achieve better, more reliable, more scalable, and more secure software was unprofessional.
  4. Five years ago, suggesting that better software is written by a diverse team of kind people who care about each other was antithetical to our self-image as an industry.
  5. Five years ago, trusting not only our designers and product managers to code and deploy to production, but trusting everyone in the company to deploy to production, was unheard of.
  6. Five years ago, rooms of people excitedly talking about their own contribution to a serious outage would have been a prelude to mass firings, rather than a path to profound learning.
  7. And five years ago no one was experimenting in public about how to do this stuff, sharing their findings, and open sourcing code to support this way of working.
  8. Five years ago, it would have seemed ludicrous to think a small team supporting a small site selling crafts could aspire to change how software is built and, in the process, cause us to rethink how the economy works.

While many of these ideas were around more than five years ago, the point still stands: the industry has undergone a lot of changes recently, and sometimes it's worth taking a little time to reflect on that.

Thursday
Sep 03, 2015

How Agari Uses Airbnb's Airflow as a Smarter Cron

This is a guest repost by Siddharth Anand, Data Architect at Agari, on Airbnb's open source project Airflow, a workflow scheduler for data pipelines. Some think Airflow has a superior approach.

Workflow schedulers are systems that are responsible for the periodic execution of workflows in a reliable and scalable manner. Workflow schedulers are pervasive - for instance, any company that has a data warehouse, a specialized database typically used for reporting, uses a workflow scheduler to coordinate nightly data loads into the data warehouse. Of more interest to companies like Agari is the use of workflow schedulers to reliably execute complex and business-critical "big" data science workloads! Agari, an email security company that tackles the problem of phishing, is increasingly leveraging data science, machine learning, and big data practices typically seen in data-driven companies like LinkedIn, Google, and Facebook in order to meet the demands of burgeoning data and dynamism around modeling.

In a previous post, I described how we leverage AWS to build a scalable data pipeline at Agari. In this post, I discuss our need for a workflow scheduler in order to improve the reliability of our data pipelines, providing the previous post's pipeline as a working example.
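To make the difference from plain cron concrete, here is a toy Python sketch of the two things a workflow scheduler adds: running tasks in dependency order and retrying failures. It is not Airflow's API. The run_workflow function, the task names, and the retry policy are all hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set

def run_workflow(tasks: Dict[str, Callable[[], None]],
                 depends_on: Dict[str, Set[str]],
                 max_retries: int = 2) -> List[str]:
    """Run tasks in dependency order, retrying failures; return completion order."""
    completed: List[str] = []
    # static_order() yields every task after all of its upstream tasks.
    for name in TopologicalSorter(depends_on).static_order():
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                completed.append(name)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up so downstream tasks never run on bad data
    return completed

if __name__ == "__main__":
    tasks = {
        "extract": lambda: print("extracting"),
        "transform": lambda: print("transforming"),
        "load": lambda: print("loading"),
    }
    deps = {"transform": {"extract"}, "load": {"transform"}}
    print(run_workflow(tasks, deps))
```

A real scheduler adds much more, such as periodic triggering, persistence of task state, and alerting.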

Scheduling Workflows @ Agari - A Smarter Cron

Click to read more ...

Wednesday
Aug 26, 2015

7 Strategies for 10x Transformative Change

Peter Thiel, VC, PayPal co-founder, early Facebook investor, and most importantly, the supposed inspiration for Silicon Valley's intriguing Peter Gregory character, argues in his book Zero to One that a successful business needs to make a product that is 10 times better than its closest competitor.

The title Zero to One refers to the idea of progress as either horizontal/extensive or vertical/intensive. For a more detailed explanation take a look at Peter Thiel's CS183: Startup - Class 1 Notes Essay.

Horizontal/extensive progress refers to copying things that work. Observe, imitate, and repeat. The one word summary for the concept is "globalization." For more on this, PAYPAL MAFIA: Reid Hoffman & Peter Thiel's Master Class in China is an interesting watch.

Vertical/intensive progress means doing something genuinely new, that is, going from zero to one, as opposed to going from one to N, which is merely globalization. This is the creative spark. The hero's journey of overcoming obstacles on the way to becoming the Master of the Universe you were always meant to be.

We see this pattern with Google a lot. Google often hits scaling challenges long before anyone else, and because they have a systematizing culture they produce discrete, replicable technologies that then diffuse out to the rest of the world, often through open source efforts.

Google told us about the Google File System in 2003, MapReduce in 2004, Bigtable in 2006, The Datacenter as a Computer in 2009, Percolator (real-time updates) in 2010, Pregel (graph processing) in 2010, Dremel (interactive analysis) in 2010, Spanner (globally distributed database) in 2012, Omega (cluster scheduling) in 2013, Borg (cluster manager) in 2015, and Jupiter Rising (advanced networking) in 2015.

Some time later we saw the development of open source parallels like HDFS, Hadoop, HBase, Giraph, YARN, Drill, and Mesos.

So, how can you rise up and meet the 10x challenge?

Murat Demirbas, a computer science and engineering professor at SUNY Buffalo, and awesome writer on all things distributed, came up with some good suggestions in How to go for 10X.

Click to read more ...

Monday
Aug 03, 2015

Seven of the Nastiest Anti-patterns in Microservices

Daniel Bryant gave an energetic talk at Devoxx UK 2015 on lessons learned from over five years of experience with microservice based projects. The talk: The Seven Deadly Sins of Microservices: Redux (video, slides).

If you don't want to risk your immortal API then be sure to avoid:

  1. Lust - using the latest and greatest tech with the idea it will solve all your problems. It won't. Do you really need microservices at all? If you do go microservices do you really need new tech in your stack? Choose boring technology. Know why you are choosing something. A monolith can perform better, and because a monolith can be developed faster, it may also be the correct choice for proving your business case.
  2. Gluttony - excessive communication protocols. Projects often have a crazy number of protocols for gluing parts together. Standardize on the glue across an organization. Choose one synchronous and one asynchronous protocol. Don't gold-plate.
  3. Greed - all your service are belong to us. Do not underestimate the impact moving to a microservice approach will have on your organization. Your business organization needs to change to take advantage of microservices. Typically orgs will have silos between Dev, QA, and Ops with even more silos inside each silo like front-end, middleware, and database. Use cross functional teams like Spotify, Amazon, and Gilt. Connect rather than divide your company. 
  4. Sloth - creating a distributed monolith. If you can't deploy your services independently then they aren't microservices. Decouple. Transform data at a less central part of the stack. Some options are schema-first design and consumer-driven contracts.
  5. Wrath - blowing up when bad things happen. Bad things happen all the time so you need to test. Microservices are inherently distributed so you have network problems to deal with that weren't a problem in a monolith. The book Release It! has a lot of good fault tolerance patterns. Operationally you need to implement continuous delivery, agile, and devops. Test for failures using real-life disaster scenario testing, live failure injection testing, and something like Netflix's Simian Army.
  6. Envy - the shared single domain fallacy. A lot of time has been spent building and perfecting the model of a single domain. There's one big database with a unified schema. Microservices decompose a system along different lines and that can cause contention in an organization. Reports can be generated using pull by service or data pumps with events. 
  7. Pride - testing in the world of transience. Does your stuff really work? We all make mistakes. Think testing at the developer level, operational level, and business level. Surprisingly little has been written about testing microservices. Invest in your build pipeline testing. Some tools: Serenity BDD, Wiremock/Saboteur, Jenkins Performance Plugin. Testing in production is an emerging idea with companies that deploy many microservices.

Click to read more ...