Systems That Never Stop (And Erlang) : Joe Armstrong

Making reliable distributed systems in the presence of software errors. Building reliable systems using Erlang and OTP


Systems that never

stop (and Erlang)

Joe Armstrong
How can we get

10 nines reliability?
SIX LAWS
ONE

ISOLATION
ISOLATION

 10 nines = 99.99999999% availability

 P(fail) = 10^-10
 If P(fail | one computer) = 10^-3 then
P(fail | four computers) = 10^-12
 Fixed
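The four-computer figure can be checked with one line of arithmetic. It holds only if the machines fail independently, an assumption the slide leaves implicit:

```latex
P(\text{all four fail}) = P(\text{one fails})^{4} = \left(10^{-3}\right)^{4} = 10^{-12} \le 10^{-10}
```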
TWO

CONCURRENCY
Concurrency

 World is concurrent
 Need at least TWO computers to make a non-stop
system
 TWO computers are concurrent and distributed
“My first message is that
concurrency
is best regarded as a program
structuring principle”

Structured concurrent programming

– Tony Hoare
Redmond, July 2001
THREE

MUST
DETECT FAILURES
Failure detection
 If you can’t detect a failure you can’t fix it
 Must work across machine boundaries
the entire machine might fail
 Implies distributed error handling,
no shared state,
asynchronous messaging
FOUR

FAULT
IDENTIFICATION
Failure Identification

 Fault detection is not enough - you must know why
the failure occurred
 Implies that you have sufficient information for
post hoc debugging
FIVE

LIVE
CODE
UPGRADE
Live code upgrade

 Must upgrade software while it is running

 Want zero down time
SIX

STABLE
STORAGE
Stable storage

 Must store stuff forever

 No backup necessary - storage just works
 Implies multiple copies, distribution, ...
 Must keep crash reports
HISTORY

Those who cannot learn from history are

doomed to repeat it.

George Santayana
GRAY
As with hardware, the key to software fault-tolerance is to
hierarchically decompose large systems into modules, each module being
a unit of service and a unit of failure. A failure of a module does
not propagate beyond the module.

...

The process achieves fault containment by sharing no state with

other processes; its only contact with other processes is via messages
carried by a kernel message system

- Jim Gray
- Why do computers stop and what can be done about it
- Technical Report 85.7 - Tandem Computers, 1985
SCHNEIDER
Halt on failure: in the event of an error a processor
should halt instead of performing a possibly erroneous
operation.

Failure status property: when a processor fails,
other processors in the system must be informed. The
reason for failure must be communicated.

Stable storage property: the storage of a processor
should be partitioned into stable storage (which
survives a processor crash) and volatile storage (which
is lost if a processor crashes).
Schneider
ACM Computing Surveys 22(4):299-319, 1990
GRAY
 Fault containment through fail-fast software modules.
 Process-pairs to tolerate hardware and transient software faults.
 Transaction mechanisms to provide data and message integrity.
 Transaction mechanisms combined with process-pairs to ease
exception handling and tolerate software faults.
 Software modularity through processes and messages.
KAY
Folks --

Just a gentle reminder that I took some pains at the last OOPSLA to
try to remind everyone that Smalltalk is not only NOT its syntax or
the class library, it is not even about classes. I'm sorry that I long ago
coined the term "objects" for this topic because it gets many people to
focus on the lesser idea.

The big idea is "messaging" -- that is what the kernal of Smalltalk/Squeak
is all about (and it's something that was never quite completed
in our Xerox PARC phase)....

http://lists.squeakfoundation.org/pipermail/squeak-dev/1998-October/017019.html
GRAY
Software modularity through processes
and messages. As with hardware, the key
to software fault-tolerance is to
hierarchically decompose large systems
into modules, each module being a unit of
service and a unit of failure. A failure of a
module does not propagate beyond the
module.
Fail Fast
The process approach to fault isolation advocates that the process
software be fail-fast, it should either function correctly or it
should detect the fault, signal failure and stop operating.

Processes are made fail-fast by defensive programming. They check

all their inputs, intermediate results and data structures as a matter
of course. If any error is detected, they signal a failure and stop. In
the terminology of [Cristian], fail-fast software has small fault
detection latency.

Gray
Why ...
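A minimal Erlang sketch of the fail-fast style Gray describes. The module name `fail_fast` and both functions are hypothetical, made up for illustration. The guard checks the inputs; bad inputs match no clause, so the process stops at once, and a monitoring process learns the reason.

```erlang
-module(fail_fast).
-export([withdraw/2, safe_withdraw/2]).

%% Hypothetical example: take Amount out of Balance.
%% The guard is the input check. Any other input matches no clause,
%% so the call fails immediately with a function_clause error.
withdraw(Balance, Amount)
  when is_integer(Balance), is_integer(Amount),
       Amount > 0, Amount =< Balance ->
    Balance - Amount.

%% Run withdraw/2 in its own monitored process and report either the
%% result or the reason that process stopped.
safe_withdraw(Balance, Amount) ->
    {Pid, Ref} = spawn_monitor(
                   fun() -> exit({ok, withdraw(Balance, Amount)}) end),
    receive
        {'DOWN', Ref, process, Pid, {ok, NewBalance}} -> {ok, NewBalance};
        {'DOWN', Ref, process, Pid, Reason}           -> {error, Reason}
    end.
```

The caller never sees a half-computed balance: it gets a value or a failure reason, nothing in between.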
Fail Early
A fault in a software system can cause one or more
errors. The latency time which is the interval between
the existence of the fault and the occurrence of the
error can be very high, which complicates the
backwards analysis of an error ...

For an effective error handling we must detect errors and

failures as early as possible

Renzel -
Error Handling for Business Information Systems,
Software Design and Management, GmbH & Co. KG, München, 2003
ARMSTRONG
 Processes are the units of error encapsulation. Errors
occurring in a process will not affect other processes in the
system. We call this property strong isolation.
 Processes do what they are supposed to do or fail as soon
as possible.
 Failure and the reason for failure can be detected by
remote processes.
 Processes share no state, but communicate by message
passing.

Armstrong
Making reliable systems in the presence of software errors
PhD Thesis, KTH, 2003
COMMERCIAL
BREAK
Joe’s 2nd theorem

 Whatever Joe starts talking about, he will end up
talking about Erlang
Erlang was
designed
to program
fault-tolerant
systems
[Diagram: Erlang shown in relation to concurrent programming, functional programming, concurrency-oriented programming, fault tolerance, and multicore]
Erlang
 Very light-weight processes
 Very fast message passing
 Total separation between processes
 Automatic marshalling/demarshalling
 Fast sequential code
 Strict functional code
 Dynamic typing
 Transparent distribution
 Compose sequential AND concurrent code
Properties
 No sharing
 Hot code replacement
 Pure message passing
 No locks
 Lots of computers (= fault tolerant scalable ...)
 Functional programming (no side effects)
What is COP?
[Diagram labels: Machine, Process, Message]

➡ Large numbers of processes
➡ Complete isolation between processes
➡ Location transparency
➡ No sharing of data
➡ Pure message passing systems

Thread Safety
Erlang programs are
automatically thread
safe if they don't use
an external resource.
Functional
If you call the
same function twice with
the same arguments
it should return the same value

“jolly good”
Joe Armstrong
No Mutable State
 Mutable state needs locks
 No mutable state = no locks = programmer's bliss
Multicore ready
The rise of the cores
 2 cores won't hurt you
 4 cores will hurt a little
 8 cores will hurt a bit
 16 will start hurting
 32 cores will hurt a lot (2009)
 ...
 1 M cores ouch (2019)
 (complete paradigm shift)

 1997 1 Tflop = 850 KW

 2007 1 Tflop = 24 W (factor 35,000)
 2017 1 Tflop = ?
LAWS
ISOLATION
CONCURRENCY
Pid = spawn(.....)
Pid = spawn(Node, ....)

Pid ! Message

receive
    Pattern1 -> Actions1;
    Pattern2 -> Actions2;
    ...
end
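The three primitives above fit together as in the following sketch. The module `echo` and its message shapes are hypothetical, chosen only to make the example runnable.

```erlang
-module(echo).
-export([start/0, call/2]).

%% spawn: create an isolated process running loop/0.
start() -> spawn(fun loop/0).

%% send (!) a request tagged with our own pid, then receive the reply.
call(Pid, Msg) ->
    Pid ! {self(), Msg},
    receive
        {Pid, Reply} -> Reply
    after 1000 ->
        timeout
    end.

%% The server shares no state with its clients; it only answers messages.
loop() ->
    receive
        {From, stop} ->
            From ! {self(), stopped};
        {From, Msg} ->
            From ! {self(), {echo, Msg}},
            loop()
    end.
```

Usage: `Pid = echo:start(), echo:call(Pid, hello).` To start the process on another machine, the spawn would take a node name as its first argument, as in the second line of the slide.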
FAULT
IDENTIFICATION
process_flag(trap_exit, true),
link(Pid),
receive
    {'EXIT', Pid, Why} ->
        ...
end
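A runnable sketch of the same idea. In current Erlang/OTP the exit message has the shape `{'EXIT', Pid, Reason}` and is delivered as a message only to a process that traps exits. The module `watch` and the function `run/1` are hypothetical names.

```erlang
-module(watch).
-export([run/1]).

%% Spawn F in a linked process and return the reason it exited.
run(F) ->
    %% Turn exit signals from linked processes into ordinary messages.
    Old = process_flag(trap_exit, true),
    Pid = spawn_link(F),
    Result = receive
                 {'EXIT', Pid, Why} -> {exited, Why}
             end,
    %% Restore the caller's previous trap_exit setting.
    process_flag(trap_exit, Old),
    Result.
```

`Why` is the fault identification: `normal` for a clean finish, otherwise the term passed to `exit/1` or the error with its stack trace.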
LIVE CODE
UPGRADE
 Can upgrade code while it's running
 Existing processes continue to use original code, new
processes run new code - no mixups of namespaces
 Sophisticated roll-forward, roll-back, roll-back-on-error
functions in OTP libraries
 Properly designed systems can be rolled-forward and
back with no loss of service. Not easy, but possible
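A minimal sketch of how a long-running process can move to new code without stopping. A process keeps running the version of the module it is in until it makes a fully qualified call (`Module:Function(...)`), which enters the newest loaded version. The module `counter` and the `upgrade` message are hypothetical; OTP release handling does considerably more than this.

```erlang
-module(counter).
-export([start/0, bump/1, loop/1]).

start() -> spawn(fun() -> loop(0) end).

bump(Pid) ->
    Pid ! {bump, self()},
    receive
        {count, N} -> N
    after 1000 ->
        timeout
    end.

%% loop/1 is exported so it can be re-entered with a fully qualified call.
loop(N) ->
    receive
        {bump, From} ->
            From ! {count, N + 1},
            loop(N + 1);            % local call: stay in the current version
        upgrade ->
            ?MODULE:loop(N)         % qualified call: enter the newest version,
                                    % carrying the state N across
    end.
```

After a new version of `counter` is compiled and loaded, sending `upgrade` to the process switches it over with its count intact.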
STABLE STORAGE
 Performed in libraries

mnesia:transaction(
  fun() ->
      %% Tab is the table name; records are {Tab, Key, Val}
      [{Tab, Key, Val}] = mnesia:read({Tab, Key}),
      mnesia:write({Tab, Key, Val}),
      ...
  end)
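A self-contained sketch of a Mnesia read and write inside transactions. The table `kv`, its attributes, and the functions `init/0`, `store/2`, and `fetch/1` are hypothetical. To keep the example runnable on a single unnamed node it uses a RAM-only table; storage that survives crashes would, as an assumption about a real deployment, use `disc_copies` replicated across several nodes.

```erlang
-module(kv).
-export([init/0, store/2, fetch/1]).

%% Start Mnesia and create a RAM-only table with records {kv, Key, Value}.
init() ->
    ok = mnesia:start(),
    case mnesia:create_table(kv, [{attributes, [key, value]}]) of
        {atomic, ok}                   -> ok;
        {aborted, {already_exists, _}} -> ok
    end.

%% Write one record inside a transaction.
store(Key, Val) ->
    {atomic, ok} = mnesia:transaction(
                     fun() -> mnesia:write({kv, Key, Val}) end),
    ok.

%% Read inside a transaction; mnesia:read/1 returns a list of records.
fetch(Key) ->
    {atomic, Rows} = mnesia:transaction(
                       fun() -> mnesia:read({kv, Key}) end),
    case Rows of
        [{kv, Key, Val}] -> {ok, Val};
        []               -> not_found
    end.
```

Usage: `kv:init(), kv:store(a, 1), kv:fetch(a).`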
Projects
 CouchDB
 Amazon SimpleDB
 Mochiweb (facebook chat)
 Scalaris
 Nitrogen
 Ejabberd (xmpp)
 RabbitMQ (amqp)
 ....
Companies
 Ericsson
 Amazon
 Tail-f
 Kreditor
 Synapse
 ...
Books
THE END
