Systems That Never Stop (And Erlang) : Joe Armstrong
10 nines reliability?
SIX LAWS
ONE
ISOLATION
TWO
CONCURRENCY
Concurrency
World is concurrent
Need at least TWO computers to make a non-stop system
TWO computers are concurrent and distributed
“My first message is that concurrency is best regarded as a program
structuring principle”
THREE
MUST
DETECT FAILURES
Failure detection
If you can’t detect a failure you can’t fix it
Must work across machine boundaries
the entire machine might fail
Implies distributed error handling,
no shared state,
asynchronous messaging
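A minimal sketch of failure detection with an Erlang monitor. The module name detect_demo and the exit reason disk_full are made up for illustration. The worker here runs on the local node; monitors can also be placed on processes on other nodes, which is what the "across machine boundaries" point relies on.

```erlang
-module(detect_demo).
-export([run/0]).

%% Spawn a worker that fails, and observe the failure from another process.
run() ->
    %% spawn_monitor/1 spawns the worker and monitors it in one atomic step.
    {Pid, Ref} = spawn_monitor(fun() -> exit(disk_full) end),
    receive
        %% The runtime delivers the failure as an ordinary asynchronous message.
        {'DOWN', Ref, process, Pid, Reason} ->
            {worker_failed, Reason}
    after 1000 ->
        timeout
    end.
```

The observer shares no state with the worker; all it learns about the failure arrives in the 'DOWN' message, including the reason.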
FOUR
FAULT
IDENTIFICATION
Failure Identification
FIVE
LIVE
CODE
UPGRADE
Live code upgrade
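One common way to get live code upgrade in Erlang is a server loop that re-enters itself through a fully qualified call. The sketch below assumes a hypothetical module upgrade_demo with a made-up v1 tag; it is an illustration, not code from the talk.

```erlang
-module(upgrade_demo).
-export([start/0, loop/1, version/1]).

%% Start the server with an initial request count of 0.
start() ->
    spawn(fun() -> loop(0) end).

loop(Count) ->
    receive
        {From, version} ->
            From ! {self(), {v1, Count}},
            loop(Count + 1);
        upgrade ->
            %% A fully qualified call always enters the newest loaded
            %% version of the module, carrying the state across.
            ?MODULE:loop(Count);
        stop ->
            ok
    end.

%% Client helper: ask the server which version answers, and its count.
version(Pid) ->
    Pid ! {self(), version},
    receive
        {Pid, Reply} -> Reply
    after 1000 ->
        timeout
    end.
```

After a new version of the module has been compiled and loaded, sending upgrade makes the running process continue in the new code without stopping and without losing Count.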
SIX
STABLE
STORAGE
Stable storage
George Santayana
GRAY
As with hardware, the key to software fault-tolerance is to
hierarchically decompose large systems into modules, each module being
a unit of service and a unit of failure. A failure of a module does
not propagate beyond the module.
...
- Jim Gray
- Why do computers stop and what can be done about it
- Technical Report, 85.7 - Tandem Computers, 1985
SCHNEIDER
Halt on failure: in the event of an error, a processor
should halt instead of performing a possibly erroneous
operation.
Just a gentle reminder that I took some pains at the last OOPSLA to
try to remind everyone that Smalltalk is not only NOT its syntax or
the class library, it is not even about classes. I'm sorry that I long ago
coined the term "objects" for this topic because it gets many people to
focus on the lesser idea.
http://lists.squeakfoundation.org/pipermail/squeak-dev/1998-October/017019.html
GRAY
Software modularity through processes
and messages. As with hardware, the key
to software fault-tolerance is to
hierarchically decompose large systems
into modules, each module being a unit of
service and a unit of failure. A failure of a
module does not propagate beyond the
module.
Fail Fast
The process approach to fault isolation advocates that the process
software be fail-fast, it should either function correctly or it
should detect the fault, signal failure and stop operating.
Gray
Why ...
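A small sketch of fail-fast style in Erlang, assuming a hypothetical module fail_fast_demo. The function handles only the inputs it understands and has no catch-all clause, so unexpected input stops the process immediately instead of producing a wrong answer.

```erlang
-module(fail_fast_demo).
-export([area/1, run/0]).

%% Only known shapes are handled; there is deliberately no default clause.
area({square, Side}) when is_number(Side) -> Side * Side;
area({circle, R}) when is_number(R) -> math:pi() * R * R.

%% Call area/1 with an unknown shape in a separate process and watch it stop.
run() ->
    {Pid, Ref} = spawn_monitor(fun() -> area({triangle, 3, 4}) end),
    receive
        %% No clause matches, so the process dies with function_clause.
        {'DOWN', Ref, process, Pid, {function_clause, _Stack}} ->
            crashed_early
    after 1000 ->
        timeout
    end.
```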
Fail Early
A fault in a software system can cause one or more
errors. The latency time which is the interval between
the existence of the fault and the occurrence of the
error can be very high, which complicates the
backwards analysis of an error ...
Renzel -
Error Handling for Business Information Systems,
Software Design and Management, GmbH & Co. KG, München, 2003
ARMSTRONG
Processes are the units of error encapsulation. Errors
occurring in a process will not affect other processes in the
system. We call this property strong isolation.
Processes do what they are supposed to do or fail as soon
as possible.
Failure and the reason for failure can be detected by
remote processes.
Processes share no state, but communicate by message
passing.
Armstrong
Making reliable systems in the presence of software errors
PhD Thesis, KTH, 2003
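The properties above can be sketched with two linked processes, one healthy and one faulty. The module name isolation_demo and the exit reason boom are assumptions for illustration. Note that run/0 sets trap_exit on the calling process and leaves the healthy process running.

```erlang
-module(isolation_demo).
-export([run/0]).

%% A process that replies to every message and keeps no shared state.
echo() ->
    receive
        {From, Msg} ->
            From ! {self(), Msg},
            echo()
    end.

run() ->
    %% Receive exit signals from linked processes as messages.
    process_flag(trap_exit, true),
    Healthy = spawn_link(fun echo/0),
    Faulty = spawn_link(fun() -> exit(boom) end),
    %% The reason for the failure is visible to this process.
    Reason = receive
                 {'EXIT', Faulty, R} -> R
             after 1000 ->
                 timeout
             end,
    %% The healthy process is unaffected by its sibling's failure.
    Healthy ! {self(), ping},
    Reply = receive
                {Healthy, Msg} -> Msg
            after 1000 ->
                timeout
            end,
    {Reason, Reply}.
```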
COMMERCIAL
BREAK
Joe’s 2nd theorem
Concurrency Oriented programming
Erlang
Fault tolerance
Multicore
Erlang
Very light-weight processes
Very fast message passing
Total separation between processes
Automatic marshalling/demarshalling
Fast sequential code
Strict functional code
Dynamic typing
Transparent distribution
Compose sequential AND concurrent code
Properties
No sharing
Hot code replacement
Pure message passing
No locks
Lots of computers (= fault tolerant scalable ...)
Functional programming (no side effects)
What is COP?
Machine
Process
Message
➡ Large numbers of processes
➡ Complete isolation between processes
➡ Location transparency
➡ No Sharing of data
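A rough sketch of these points, assuming a hypothetical module cop_demo: one process per work item, no shared data, and results returned only as messages.

```erlang
-module(cop_demo).
-export([run/1]).

%% Spawn N processes; each squares its own number and sends the result back.
run(N) ->
    Parent = self(),
    Pids = [spawn(fun() -> Parent ! {self(), I * I} end)
            || I <- lists:seq(1, N)],
    %% Collect exactly one reply per process, matching on the sender's pid.
    lists:sum([receive {Pid, Square} -> Square end || Pid <- Pids]).
```

The parent addresses each worker only by its pid, so the same code shape applies whether the worker runs locally or on another node.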
“jolly good”
Joe Armstrong
No Mutable State
Mutable state needs locks
No mutable state = no locks = programmer's bliss
Multicore ready
The rise of the cores
2 cores won't hurt you
4 cores will hurt a little
8 cores will hurt a bit
16 will start hurting
32 cores will hurt a lot (2009)
...
1 M cores ouch (2019)
(complete paradigm shift)
mnesia:transaction(
  fun() ->
      %% reads and writes inside the fun commit or abort as one unit
      Val = mnesia:read(Key),
      mnesia:write({Key,Val}),
      ...
  end)
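The slide's snippet is schematic. A runnable variant might look like the sketch below, which assumes a hypothetical module mnesia_demo and a made-up in-memory table counter with attributes name and value. It assumes Mnesia is not already running on the node.

```erlang
-module(mnesia_demo).
-export([run/0]).

run() ->
    ok = mnesia:start(),
    %% With no storage option given, the table is kept in RAM on this node.
    {atomic, ok} = mnesia:create_table(counter, [{attributes, [name, value]}]),
    Incr = fun() ->
        %% mnesia:read/1 takes {Table, Key} and returns a list of records.
        Old = case mnesia:read({counter, hits}) of
                  [{counter, hits, V}] -> V;
                  [] -> 0
              end,
        %% mnesia:write/1 takes a whole record; its first element names the table.
        ok = mnesia:write({counter, hits, Old + 1}),
        Old + 1
    end,
    %% The read and the write happen together or not at all.
    {atomic, 1} = mnesia:transaction(Incr),
    {atomic, 2} = mnesia:transaction(Incr),
    stopped = mnesia:stop(),
    ok.
```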
Projects
CouchDB
Amazon SimpleDB
MochiWeb (Facebook chat)
Scalaris
Nitrogen
ejabberd (XMPP)
RabbitMQ (AMQP)
....
Companies
Ericsson
Amazon
Tail-f
Kreditor
Synapse
...
Books
THE END