Built to Fail: How companies like Google, IDEO, and 37signals build failure-tolerant systems for anything!

Planning for success, not failure

High-achieving people with a long history of success often plan accordingly – that is, they plan for success in whatever they do. And when you take a successful person and put them in a successful big company that’s already making money from its products, there’s even more reason to plan for high-achievement outcomes.

But let’s say that you take these successful people and put them in environments of great uncertainty, like a Silicon Valley startup – what happens? That’s when realities collide! When you apply the big successful company playbook to startups, you can end up with monolithic planning processes, products that can’t find their markets, and lots of money being spent on launches for the wrong products. It’s not that these tactics are stupid, it’s just that they don’t work as well when you’re dealing with ill-defined customer problems with unknown solutions.

At the heart of this conversation is a question: what happens when you take something that’s usually assumed to succeed, and you instead assume that it’s very likely to fail?

In a way, you can think of this as planning to fail, but then building the support structure around the failure in order to create a failure-tolerant system. Let’s dive into this.

Planning for failure, not success
The title of this post refers to the fact that companies like Google, IDEO, and 37signals all have a culture of “Failure is OK” built into them.

At Google:

Google makes money by being always available, ubiquitous, and having a great product
To deliver their service, they have hundreds of thousands of servers (maybe more?)
Any one of these servers has a high likelihood of failing at any time
To create a fault-tolerant system, they have lots of redundancy and lots of sophistication around what happens when an individual box fails
Contrast this to a big-iron approach that builds all the redundancy into specialized hardware that’s designed to never fail
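
To make the redundancy idea concrete, here is a minimal sketch of failover across unreliable machines. It is a toy simulation, not a description of how Google actually routes requests: the names `make_server` and `serve_with_failover`, the 30% failure rate, and the five replicas are all assumptions made up for illustration.

```python
import random


class ServerDown(Exception):
    """Raised by a simulated server that has failed."""


def make_server(name, fail_rate, rng):
    """Return a hypothetical request handler that fails with probability fail_rate."""
    def handle(request):
        if rng.random() < fail_rate:
            raise ServerDown(name)
        return f"{name} served {request}"
    return handle


def serve_with_failover(servers, request):
    """Try each replica in turn; give up only if every replica is down."""
    for handle in servers:
        try:
            return handle(request)
        except ServerDown:
            continue  # this box failed, move on to the next one
    raise ServerDown("all replicas failed")


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    # Assumed numbers: five cheap boxes, each failing 30% of the time.
    servers = [make_server(f"box-{i}", 0.3, rng) for i in range(5)]
    served = 0
    for i in range(1000):
        try:
            serve_with_failover(servers, f"req-{i}")
            served += 1
        except ServerDown:
            pass
    print(f"{served} of 1000 requests served")
```

Each individual box here is terrible by big-iron standards, yet a request is lost only when all five fail at once. The reliability lives in the system around the boxes rather than in any one box.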

At IDEO:

Companies hire IDEO to give them fresh designs based on a customer-focused approach
Part of every project involves lots of brainstorming and coming up with ideas
However, any specific idea is likely bad (for example, 12 out of 4,000 toy ideas were actually successful = 0.3%)
Thus, IDEO combines structured brainstorming, rapid prototyping, and field research to rapidly try out new concepts and get to good products
Contrast this to a process where the “Great Man” designer thinks about a design problem and then comes up with the right solution spontaneously
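
The toy numbers above suggest a quick back-of-the-envelope calculation. The sketch below assumes, as a simplification, that every idea is an independent draw with the same 12-in-4,000 hit rate; real ideas are not independent, and the function names `chance_of_a_hit` and `ideas_needed` are hypothetical helpers made up for this example.

```python
import math


def chance_of_a_hit(hit_rate, ideas):
    """Probability that at least one of `ideas` independent ideas succeeds."""
    if not 0.0 < hit_rate < 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    if ideas < 0:
        raise ValueError("ideas must be non-negative")
    return 1.0 - (1.0 - hit_rate) ** ideas


def ideas_needed(hit_rate, target):
    """Smallest number of ideas whose chance of at least one hit reaches `target`."""
    if not 0.0 < hit_rate < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("hit_rate and target must be between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - hit_rate))


if __name__ == "__main__":
    hit_rate = 12 / 4000  # the 0.3% figure from the toy example
    for ideas in (10, 100, 1000):
        print(f"{ideas} ideas: {chance_of_a_hit(hit_rate, ideas):.1%} chance of a hit")
    print(f"ideas needed for a 90% chance: {ideas_needed(hit_rate, 0.9)}")
```

Under these assumptions, a handful of ideas almost never produces a winner, while hundreds of ideas make one likely. That is the arithmetic behind generating ideas cheaply and in bulk instead of betting on a single inspired design.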

At 37signals, in particular Ruby on Rails:

Rails is a framework built for programmers to build websites
Of course, every web project requires lots of lines of code, any of which can easily break at any moment
If you assume that programmers will often write buggy code that breaks, then you’ll want to make testing and iteration easy – this is at the heart of Agile, TDD, continuous integration, and other related disciplines
Contrast this to a waterfall engineering approach which assumes the correct design and architecture can be thought out by experienced software engineers

Each of these examples is unique in its own way, but similar themes pervade all of these approaches.

Characteristics of failure-tolerant systems
Each one of these systems takes the central part of a process and assumes failure, and then builds up a support system around it.

This happens by building on a few core principles:

Acceptance of failure: You have to accept that shit happens and failure is commonplace – this needs to be internalized so that failure isn’t punished, but rather embraced!
Massive redundancy: Then, it needs to be easy to have lots of redundancy built into the system – for designers, that means lots of designs get generated. For startups, that means lots of ideas are tested, and for Google, that means lots of servers are used.
Cheap, easy, fast: To make that redundancy possible, it needs to be easy, cheap, and fast to have lots of ideas, run lots of servers, or write lots of code. The harder it is, the harder it will be to create redundancy.
Iterative, reality-based testing: Testing these individual components constantly becomes key – you need to force failure on the system to figure out how it reacts at a system-wide level.

Building up processes based on the ideas above makes it easier and easier to deal with failure and come out on the other side!

Conclusion and next ideas
There are lots of interesting directions that this line of thinking can go.

This area of thinking started out with the hiring process, and the idea that maybe interviews don’t work at all – there’s a bunch of academic research that implies that, actually. So how would you build a failure-tolerant system around the hiring process, if you assume that doing well in an interview has no correlation with being a successful employee?

For dating, what happens if you assume that people you like to date may not be the kind of person you’d have a successful marriage with? What if people suck at figuring out what kind of guy or gal is the “type you’d bring home to Mom?” I think anyone could attest to the idea that many people suck at figuring out the right person to date, much less the right kind of person to marry. I personally find it crazy that people make a 50+ year decision to be married based on an 18-month sample :-)

For careers, what if it turns out that people are really bad at figuring out what they’ll actually want to do 40 hours a week, 50 weeks a year, for the rest of their lives? How would you figure out the right career sooner rather than later?

All of these are great thought experiments, I think.

What else am I missing? :-) I’d love to take any suggestions and write up some thought experiments around them.

Published by

Andrew Chen