Failsafe – failure handling with retries, circuit breakers and fallbacks

by jodah on 7/23/2016, 5:23 PM with 41 comments

by dredmorbius on 7/23/2016, 8:41 PM

A note on the name: "fail-safe" in engineering doesn't mean that a system cannot fail, but rather, that when it does, it does so in the safest manner possible.

The term originated with (or is strongly associated with) the Westinghouse railroad brake system. These are the pressurised air brakes on trains, in which air pressure holds the brake shoes open against spring pressure. Should integrity of the brakeline be lost, the brakes will fail in the activated position, slowing and stopping the train (or keeping a stopped train stopped).

https://en.m.wikipedia.org/wiki/Railway_air_brake

Fail-safe designs and practices can lead to some counterintuitive concepts. Aircraft landing on carrier decks, in which they are arrested by cables, apply full engine power and afterburner on landing. The idea is that should the arresting cable or hook fail, the aircraft can safely take off again.

https://en.m.wikipedia.org/wiki/Fail-safe

Upshot: "fail safe" doesn't mean "test all your failure conditions exhaustively". It may well mean aborting on any failure mode (see djb's software for examples). The most important criterion is that, whatever the failure mode, it is as safe as possible, and that it almost always rests on a very simple and robust design, mechanism, logic, or system.

From the description of this project, it strikes me that it may well be failing (unsafely?) to implement these concepts. Charles Perrow, scholar of accidents and risks, notes that it's often safety and monitoring systems themselves which play a key role in accidents and failures.

by nitrogen on 7/23/2016, 8:24 PM

Very cool. Consistent and clear retry, backoff, and failure behaviors are an important part of designing robust systems, so it's disappointing how uncommon they are. If I were starting a new Java project today I would almost certainly want to use this library instead of the various threads and timers I had to hack together years ago.
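
For readers who have not written this by hand, here is a minimal sketch of the kind of retry-with-backoff helper the comment describes. It is plain Java and does not use Failsafe's API; the class name `RetrySketch`, the method `retryWithBackoff`, and all of its parameters are hypothetical names chosen for illustration.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Failsafe or any other library.
final class RetrySketch {

    /**
     * Calls the task up to maxAttempts times. After each failed attempt it
     * sleeps, then doubles the delay, capped at maxDelayMs. The exception
     * from the final attempt is rethrown to the caller.
     */
    static <T> T retryWithBackoff(Callable<T> task, int maxAttempts,
                                  long initialDelayMs, long maxDelayMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (InterruptedException e) {
                throw e; // an interruption is a request to stop, so do not retry it
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw e; // retries exhausted: fail visibly
                }
                Thread.sleep(delayMs);
                delayMs = Math.min(delayMs * 2, maxDelayMs);
            }
        }
    }
}
```

A call such as `RetrySketch.retryWithBackoff(() -> fetch(), 5, 100, 2000)` (where `fetch` stands in for whatever operation may fail) would try at most five times, waiting 100 ms, 200 ms, 400 ms and 800 ms between attempts. A production version would likely also add jitter and restrict which exception types are retried.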

by SwellJoe on 7/24/2016, 12:26 AM

This title would be 100% better with "for Java" on the end.

by ckugblenu on 7/23/2016, 6:11 PM

Quite interesting. It looks useful for a wide range of use cases. Does anyone know of similar projects in other languages, such as Python and JavaScript?

by cpitman on 7/24/2016, 3:12 AM

How is this distinct from Hystrix (https://github.com/Netflix/Hystrix)? Why should I use one over the other?

by ap22213 on 7/23/2016, 10:18 PM

It seems like a well-thought-out, fluent interface to what lots of Java developers (especially Java 8 ones) inevitably have to write themselves.

by mandeepj on 7/24/2016, 7:00 AM

You can find some of these patterns for the .NET/Azure/C# stack here: https://msdn.microsoft.com/en-us/library/dn568099.aspx

by fdsaaf on 7/23/2016, 8:27 PM

Beware of runaway retries: https://blogs.msdn.microsoft.com/oldnewthing/20051107-20/?p=...

Personally, I'd rather systems fail quickly, with retries only at the highest (application) and lowest (TCP) levels.
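
One way to see why retries at every level can run away: if each of several nested layers independently retries the layer beneath it, the worst-case number of calls reaching the bottom grows multiplicatively. The sketch below just does that arithmetic, under the assumption that every layer makes the same number of attempts and every attempt fails; the class and method names are hypothetical.

```java
// Hypothetical illustration of retry amplification across nested layers.
final class RetryAmplification {

    /**
     * Worst-case calls hitting the lowest layer when each of `layers` nested
     * layers makes up to `attemptsPerLayer` attempts and every attempt fails:
     * attemptsPerLayer raised to the power of layers.
     */
    static long worstCaseCalls(int attemptsPerLayer, int layers) {
        if (attemptsPerLayer < 1 || layers < 0) {
            throw new IllegalArgumentException("attemptsPerLayer >= 1 and layers >= 0 required");
        }
        long total = 1;
        for (int i = 0; i < layers; i++) {
            // multiplyExact throws ArithmeticException instead of silently overflowing
            total = Math.multiplyExact(total, (long) attemptsPerLayer);
        }
        return total;
    }
}
```

With three attempts per layer, two layers give `worstCaseCalls(3, 2) == 9` and five layers give `worstCaseCalls(3, 5) == 243`, which is the load a failing dependency would see from a single top-level request. Confining retries to the top and bottom layers keeps that factor small.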