Cluster Basics
Jiaul Paik
Lecture 4
Recap: Old Tools for Big Data Processing
• Programming models
• Shared memory (pthreads)
• Message passing (MPI)
[Figure: two panels, "Shared Memory" (processes P1 to P5 sharing one Memory) and "Message Passing" (processes P1 to P5)]
• Design Patterns
• Master-slaves
• Producer-consumer flows
• Shared work queues
[Figure: a master with its slaves; producer and consumer flows; producers and consumers around a shared work queue]
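The master-slaves and shared work queue patterns above can be sketched on a single machine with threads. This is only an illustrative sketch, not code from the lecture: the names `worker` and `run_master`, the squaring task, and the `None` sentinel convention are assumptions chosen for the example.

```python
import queue
import threading


def worker(tasks, results):
    """Worker: pull tasks from the shared queue until a sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:          # sentinel (assumed convention): no more work
            break
        results.put(item * item)  # stand-in for real work


def run_master(numbers, num_workers=3):
    """Master: start workers, hand out tasks, then collect the results."""
    tasks = queue.Queue()
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for n in numbers:
        tasks.put(n)
    for _ in threads:             # one sentinel per worker
        tasks.put(None)
    for t in threads:
        t.join()

    collected = []
    while not results.empty():
        collected.append(results.get())
    return sorted(collected)      # completion order is not deterministic


if __name__ == "__main__":
    print(run_master([1, 2, 3, 4]))
```

Even in this small case the programmer has to handle shutdown (the sentinels) and result ordering by hand, which is the kind of burden the next slide describes.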
Some Difficulties
• Concurrency is difficult to reason about
• At the scale of datacenters and across datacenters
• In the presence of failures
• In terms of multiple interacting services
• The reality:
• Lots of one-off solutions, custom code
• Write your own dedicated library, then program with it
• Burden on the programmer to explicitly manage everything
Source: MIT Open Courseware
PRINCIPLES OF MODERN BIG DATA SYSTEMS
The Datacenter is the Computer!
Source: Google
The datacenter is the computer
• It’s all about the right level of abstraction
• Needs new “instruction set” for datacenter computers
[Figure: "Centralized code" on a processor, with blocks of code and data spread around it]
[Figure: "Work" is partitioned into w1, w2, w3; these yield partial results r1, r2, r3, which are combined into "Result"]
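The partition and combine flow in the figure can be sketched as follows. This is a minimal single-machine illustration, assuming the work is a list of numbers to be summed; `partition`, `process`, `combine`, and `run` are hypothetical names, and a thread pool stands in for the workers w1, w2, w3.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(work, n):
    """Split the work into at most n contiguous chunks of roughly equal size."""
    size = max(1, (len(work) + n - 1) // n)
    return [work[i:i + size] for i in range(0, len(work), size)]


def process(chunk):
    """Each worker turns its chunk w_i into a partial result r_i."""
    return sum(chunk)


def combine(partials):
    """Merge the partial results r_1..r_n into the final result."""
    return sum(partials)


def run(work, n=3):
    chunks = partition(work, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        partials = list(pool.map(process, chunks))
    return combine(partials)


if __name__ == "__main__":
    print(run(list(range(1, 11))))
```

The sketch only works this simply because the chunks are independent and the combine step is associative; it says nothing yet about where the data lives or what happens when a worker fails.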
BUILDING BLOCKS
Building Blocks
• You have 100 billion web pages in millions of files. Your goal is to
compute the count of each word appearing in the collection.
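On a toy scale, the word-count task looks like the sketch below: count words per document, then merge the per-document counts. The two sample documents, the whitespace tokenization, and the helper names `count_words` and `merge_counts` are assumptions for illustration; a real collection of this size would not fit on one machine.

```python
from collections import Counter


def count_words(text):
    """Count words in one document (lowercased, split on whitespace)."""
    return Counter(text.lower().split())


def merge_counts(counters):
    """Merge per-document counts into one count for the whole collection."""
    total = Counter()
    for c in counters:
        total.update(c)
    return total


# Hypothetical two-document "collection".
docs = ["the cat sat", "the dog sat down"]
total = merge_counts(count_words(d) for d in docs)

if __name__ == "__main__":
    for word, count in sorted(total.items()):
        print(word, count)
```

Because per-document counts can be computed independently and merged in any order, the problem splits naturally across many machines.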
Standard Solution
Use multiple interconnected machines
[Figure: Rack 1 and Rack 2 connected through switches; 1 Gbps between any pair of nodes]
• Node failures
• Data loss
• Network bottleneck
• Why?
• The programmer has to manage too many low-level details apart from writing
code for the task
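To get a feel for the network bottleneck, a back-of-the-envelope calculation helps. The 100 billion pages and the 1 Gbps link come from the slides; the average page size of 20 KB is purely an assumption made for this estimate, so the result is an order-of-magnitude figure only.

```python
NUM_PAGES = 100 * 10**9        # 100 billion pages (from the slides)
AVG_PAGE_BYTES = 20 * 10**3    # ASSUMPTION: 20 KB per page, for illustration
LINK_BITS_PER_SEC = 10**9      # 1 Gbps link (from the slides)

total_bytes = NUM_PAGES * AVG_PAGE_BYTES           # size of the collection
seconds = total_bytes * 8 / LINK_BITS_PER_SEC      # time to push it over one link
days = seconds / 86_400

if __name__ == "__main__":
    print(f"collection size: {total_bytes / 10**15:.1f} PB")
    print(f"transfer time over one 1 Gbps link: {days:.0f} days")
```

Under these assumptions the collection is about 2 PB, and moving it over a single 1 Gbps link would take roughly half a year, which suggests why shipping all the data to one place is not a practical option.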
Other Challenges
• Thus, we need:
• Semaphores (lock, unlock)
• Condition variables (wait, notify, broadcast)
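As a small illustration of these primitives, the sketch below builds a bounded buffer on Python's `threading.Condition`, which bundles a lock with wait/notify; `notify_all` plays the role of broadcast. The `BoundedBuffer` class and the `demo` function are hypothetical names for this example, not part of the lecture.

```python
import threading
from collections import deque


class BoundedBuffer:
    """Fixed-capacity FIFO buffer guarded by a condition variable."""

    def __init__(self, capacity):
        self.items = deque()
        self.capacity = capacity
        self.cond = threading.Condition()   # a lock plus wait/notify

    def put(self, item):
        with self.cond:                     # lock
            while len(self.items) >= self.capacity:
                self.cond.wait()            # wait until there is room
            self.items.append(item)
            self.cond.notify_all()          # broadcast: state changed

    def get(self):
        with self.cond:
            while not self.items:
                self.cond.wait()            # wait until there is data
            item = self.items.popleft()
            self.cond.notify_all()
            return item


def demo(n=5):
    """One producer and one consumer exchanging n items through the buffer."""
    buf = BoundedBuffer(capacity=2)
    received = []

    def producer():
        for i in range(n):
            buf.put(i)

    def consumer():
        for _ in range(n):
            received.append(buf.get())

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


if __name__ == "__main__":
    print(demo())
```

The `while` loops around `wait()` matter: a woken thread must recheck the condition, because another thread may have changed the buffer in between. Getting such details right by hand is part of what makes low-level concurrency hard.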