Cluster Basics


Big Data Processing

Jiaul Paik
Lecture 4
Recap: Old Tools for Big Data Processing
• Programming models
• Shared memory (pthreads)
• Message passing (MPI)

[Figure: shared-memory model (processes P1-P5 attached to one memory) vs. message-passing model (processes P1-P5 exchanging messages)]

• Design Patterns
• Master-slaves
• Producer-consumer flows
• Shared work queues
[Figure: master-slaves with a shared work queue; producer-consumer flow]
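A minimal, single-machine sketch of the master-slaves / shared work queue pattern listed above, using Python's multiprocessing module (not pthreads or MPI); the squaring task and the worker count are purely illustrative.

```python
# Sketch: master puts work units on a shared queue, worker processes pull
# them off and push partial results back.
from multiprocessing import Process, Queue

def worker(task_queue, result_queue):
    # Each worker pulls work units until it sees the sentinel None.
    while True:
        item = task_queue.get()
        if item is None:
            break
        result_queue.put(item * item)   # illustrative "task": square a number

if __name__ == "__main__":
    tasks, results = Queue(), Queue()
    workers = [Process(target=worker, args=(tasks, results)) for _ in range(4)]
    for w in workers:
        w.start()

    for i in range(20):                 # master: enqueue work units
        tasks.put(i)
    for _ in workers:                   # one sentinel per worker
        tasks.put(None)

    total = sum(results.get() for _ in range(20))
    for w in workers:
        w.join()
    print(total)
```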
Some Difficulties
• Concurrency is difficult to reason about
• At the scale of datacenters and across datacenters
• In the presence of failures
• In terms of multiple interacting services

• Debugging: Even more difficult

• The reality:
• Lots of one-off solutions, custom code
• Write your own dedicated library, then program with it
• Burden on the programmer to explicitly manage everything
Source: MIT OpenCourseWare
PRINCIPLES OF MODERN
BIG DATA SYSTEMS
The Datacenter is the Computer!

Source: Google
The datacenter is the computer
• It’s all about the right level of abstraction
• Needs new “instruction set” for datacenter computers

• Hide system-level details from the developers


• No more race conditions, etc.
• No need to explicitly worry about reliability, fault tolerance, etc.

• Separating the what from the how


• Developer specifies the computation that needs to be performed
• Execution framework (“runtime”) handles actual execution
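As a hedged illustration of separating the what from the how, the sketch below writes word count in PySpark: the developer only declares the transformations, and the Spark runtime decides scheduling, data placement, and fault recovery. It assumes a Spark installation is available; the HDFS paths are illustrative, not taken from the lecture.

```python
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

counts = (sc.textFile("hdfs:///data/webpages/*")       # what: read the collection (path is illustrative)
            .flatMap(lambda line: line.split())        # what: split lines into words
            .map(lambda word: (word, 1))               # what: emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))          # what: sum the counts per word

counts.saveAsTextFile("hdfs:///output/wordcounts")     # the runtime handles the "how"
```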
Bring computation to the data

[Figure: traditional approach (data shipped to a centralized processor that holds the code) vs. modern approach (code shipped to the machines holding the data, as in Hadoop and Spark)]


Key Idea for Big Data Processing
Divide and Conquer

Source: Google Image search, clip art


Divide and Conquer

[Figure: the "Work" is partitioned into w1, w2, w3; each worker processes one partition to produce a partial result r1, r2, r3; the partial results are combined into the final "Result"]
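A minimal local sketch of the partition / process / combine flow in the figure above, using Python's ProcessPoolExecutor; summing integers stands in for real work on data partitions.

```python
from concurrent.futures import ProcessPoolExecutor

def process_partition(chunk):
    # "worker": compute a partial result r_i for partition w_i
    return sum(chunk)

if __name__ == "__main__":
    work = list(range(1_000_000))
    n = 3
    partitions = [work[i::n] for i in range(n)]         # partition: w1, w2, w3

    with ProcessPoolExecutor(max_workers=n) as pool:
        partial_results = list(pool.map(process_partition, partitions))  # r1, r2, r3

    result = sum(partial_results)                       # combine
    print(result)                                       # same as sum(range(1_000_000))
```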
BUILDING BLOCKS
Building Blocks

[Figure: single server, rack of servers, cluster of racks]
Source: Barroso and Hölzle (2009)
Source: Google
Source: Facebook
STORAGE HIERARCHY
Storage Hierarchy

Source: Barroso and Hölzle (2013)



Anatomy of a Datacenter

Source: Barroso and Hölzle (2013)


Why Cluster Computing for Big Data?
A Simple Problem: Word counting

• You have 100 billion web pages stored in millions of files. Your goal is to compute the count of each word appearing in the collection.
Standard Solution
Use multiple interconnected machines

1. Split the data into small chunks

2. Send different chunks to different machines and process them in parallel

3. Collect the results from the different machines

[Figure: distributed data processing in a cluster of computers]
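A hedged, single-machine simulation of the three steps above using Python's multiprocessing.Pool: split the input into chunks, count words in each chunk in parallel, then collect and merge the partial counts. In a real cluster the chunks would sit on different machines; the tiny document list here is only for illustration.

```python
from collections import Counter
from multiprocessing import Pool

def count_words(lines):
    # Step 2 (per chunk): count the words in one chunk of the data
    c = Counter()
    for line in lines:
        c.update(line.split())
    return c

if __name__ == "__main__":
    documents = ["big data processing", "data in the cluster", "big cluster"]
    chunks = [documents[i::4] for i in range(4)]        # Step 1: split into chunks

    with Pool(processes=4) as pool:
        partial_counts = pool.map(count_words, chunks)  # Step 2: process in parallel

    total = Counter()
    for c in partial_counts:                            # Step 3: collect and merge
        total.update(c)
    print(total.most_common(5))
```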


Compute Cluster Components

[Figure: a server, a rack of servers with its rack switch, and a data center]


How to Organize a Cluster of Computers?
Cluster Architecture: Rack Servers
[Figure: Rack 1 and Rack 2, each with its own switch providing about 1 Gbps between any pair of nodes in the rack; the rack switches connect to a backbone switch of typically 2-10 Gbps]

Each rack typically contains commodity computers (nodes)


Key Challenges in Cluster Computing
Challenge # 1

• Node failures

• Single server lifetime: 1000 days

• 1000 servers in a cluster => 1 failure/day

• 1M servers in a cluster => 1000 failures/day
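A back-of-the-envelope check of these figures: with a mean server lifetime of about 1000 days, the expected number of failures per day scales linearly with cluster size.

```python
# Expected failures per day = number of servers / mean lifetime in days.
MEAN_LIFETIME_DAYS = 1000

for servers in (1_000, 1_000_000):
    failures_per_day = servers / MEAN_LIFETIME_DAYS
    print(f"{servers:>9} servers -> about {failures_per_day:g} failure(s)/day")
# 1,000 servers     -> about 1 failure/day
# 1,000,000 servers -> about 1000 failures/day
```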


Consequences of Node Failure

• Data loss

• Node failure in the middle of a long and expensive computation

• Need to restart the computation from scratch


Challenge # 2

• Network bottleneck

• Computers in a cluster exchange data through network

• Moving 10 TB of data over a 1 Gbps network link takes about a day
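A rough check of that claim (decimal units assumed): 10 TB pushed through a 1 Gbps link takes on the order of a day.

```python
# 10 TB over a 1 Gbps link, ignoring protocol overhead and contention.
data_bits = 10 * 10**12 * 8          # 10 TB (decimal) expressed in bits
link_bps = 1 * 10**9                 # 1 Gbps
hours = data_bits / link_bps / 3600
print(f"about {hours:.0f} hours")    # ~22 hours, i.e. roughly one day
```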


Challenge # 3

• Distributed Programming is hard!!!

• Why?

• The programmer has to manage too many low-level things apart from writing the code for the task
Other Challenges

• How do we assign work units to workers?

• What if we have more work units than workers?

• What if workers need to share partial results?

• How do we aggregate partial results?

• How do we know all the workers have finished?

• What if workers die?


Managing Multiple Workers
• Difficult because
• We don’t know the order in which workers run
• We don’t know when workers interrupt each other
• We don’t know when workers need to communicate partial results
• We don’t know the order in which workers access shared data

• Thus, we need:
• Locks / semaphores (lock, unlock)
• Condition variables (wait, notify, broadcast)
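A minimal sketch of these primitives using Python's threading module rather than pthreads: a condition variable tied to a lock lets a consumer wait until a producer has deposited an item in a shared list. The variable names are illustrative.

```python
import threading

lock = threading.Lock()
cond = threading.Condition(lock)   # condition variable bound to the lock
items = []

def producer():
    with cond:                     # acquires the underlying lock
        items.append("work")
        cond.notify()              # wake one waiting consumer

def consumer():
    with cond:
        while not items:           # guard against spurious wakeups
            cond.wait()            # releases the lock while waiting
        print("got", items.pop())

t1 = threading.Thread(target=consumer)
t2 = threading.Thread(target=producer)
t1.start(); t2.start()
t1.join(); t2.join()
```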
