DATA228 Lecture Notes Week 3

The document provides an overview of Hadoop, its history, architecture, and ecosystem, highlighting its role as a scalable and resilient framework for big data storage and processing. It discusses the evolution of Hadoop from its inception based on Google's papers to its current status as a widely adopted open-source platform in the industry. Additionally, it outlines the various components of Hadoop, including HDFS, YARN, and MapReduce, as well as how to run Hadoop in different setups.

DATA 228

Big Data Technologies and Applications (Fall 2024)

Sangjin Lee
Hadoop: history & architecture

Chapter 1 & parts of 10, “Hadoop: The Definitive Guide”, 4th Edition, Tom White

Hadoop: Big Data refresher

• Store much larger volumes of data

• Compute/analyze much larger volumes of data

• Handle diverse and mostly unstructured data

• … And do it cheaply

• Hadoop is the first complete open-source platform for Big Data

Hadoop: history

• 2003 - 2004: two seminal papers from Google

    • “The Google File System”, Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, 2003

    • “MapReduce: Simplified Data Processing on Large Clusters”, Jeffrey Dean, Sanjay Ghemawat, 2004

• These were based on large-scale systems that were in wide use at Google at the time

Hadoop: history

• 2005 - 2006: Doug Cutting at Yahoo creates a MapReduce implementation and forms an open-source project called Hadoop

• 2008: Hadoop becomes a top-level Apache project

• 2012: Hadoop 2 released

    • Introduced YARN: MapReduce becomes one YARN application type

    • MR v.2 APIs

Hadoop: history

• Since: Hadoop becomes ubiquitous in the industry

• Companies built on Hadoop: Cloudera, Hortonworks, MapR (-> HPE)

• Almost all companies in the industry today use and operate Hadoop

• All cloud providers offer first-class support for Hadoop

• Hadoop has spawned an ecosystem

Hadoop in the cloud

                  AWS                  GCP

Compute           Amazon EMR           Dataproc

Elastic storage   Amazon S3            GCS

Streaming         AWS Flink            Dataflow

Data lake         AWS Lake Formation   BigLake

Other             AWS Redshift         BigQuery, Bigtable

What is Hadoop?

• Hadoop is two distributed systems for storage and compute

    • Highly scalable: w.r.t. horizontal scalability

    • Highly available: w.r.t. resiliency and fault tolerance

• Hadoop is a framework with which to interact with Big Data

    • MapReduce APIs

    • HDFS APIs

    • YARN APIs

• Hadoop is an ecosystem

Hadoop as a distributed system
Two distributed systems for storage and compute

[Diagram: Hadoop as two layered systems]

Compute: MapReduce API -> MapReduce (alongside other YARN apps), running on YARN

Storage: Distributed filesystem API -> HDFS

Hadoop as a distributed system

• HDFS as distributed storage/filesystem

• YARN as distributed compute scheduler

• MapReduce as big data processing framework

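To make the MapReduce role above concrete, here is a minimal sketch of the map, shuffle, and reduce phases as a word count in plain Python. It does not use the Hadoop MapReduce APIs: the function names (`map_words`, `reduce_counts`, `run_mapreduce`) are hypothetical, and everything runs in one process, whereas Hadoop would distribute the map and reduce tasks across the cluster.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_words(line: str) -> Iterator[tuple[str, int]]:
    """Map step: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> tuple[str, int]:
    """Reduce step: sum all counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(lines, mapper, reducer):
    """Single-process stand-in for a MapReduce job (illustration only)."""
    # Shuffle: group every mapped value under its key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Reduce each key group independently, in sorted key order.
    return dict(reducer(key, groups[key]) for key in sorted(groups))


lines = ["big data big clusters", "data on clusters"]
word_counts = run_mapreduce(lines, map_words, reduce_counts)
print(word_counts)  # {'big': 2, 'clusters': 2, 'data': 2, 'on': 1}
```

Because each key group is reduced independently, the reduce work can be split across machines, which is what makes the model scale horizontally.
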
Hadoop as a distributed system

• Hadoop is composable: you can use some (do not have to use all)

• Examples

    • Use only HDFS

    • Use only YARN

    • Use only YARN + MapReduce

• Caveat from the provider perspective: properly spec’ed hardware

Hadoop code organization

[Diagram of Hadoop code modules: Client API, Tools, MapReduce, YARN, HDFS, Hadoop Common]

Hadoop architecture

• Master/central nodes vs. worker nodes

    • HDFS: Namenode and Datanodes

    • YARN: Resource Manager and Node Managers

• High availability

    • Fail over to standby master nodes in case of master failures

    • Coordinated using ZooKeeper

Hadoop architecture

• Self-healing: resilient against individual node failures

    • Data gets re-replicated if a datanode is lost

    • A task gets restarted (on another node) if a node fails

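The re-replication idea can be illustrated with a toy model, sketched below under stated assumptions. The replication factor of 3 matches the HDFS default, but the round-robin placement, the node names, and the functions `place_blocks` and `handle_node_loss` are hypothetical; real HDFS placement is rack-aware and driven by the Namenode.

```python
def place_blocks(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes), so the chosen nodes are distinct.
    """
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = {nodes[(i + r) % len(nodes)] for r in range(replication)}
    return placement


def handle_node_loss(placement, live_nodes, lost_node, replication=3):
    """Drop the lost node and copy under-replicated blocks to other live nodes."""
    for block, holders in placement.items():
        holders.discard(lost_node)
        # Candidates are live nodes that do not already hold this block.
        candidates = sorted(n for n in live_nodes if n not in holders)
        while len(holders) < replication and candidates:
            holders.add(candidates.pop(0))
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_blocks(["blk_a", "blk_b"], nodes)
live = [n for n in nodes if n != "dn2"]  # dn2 is lost
placement = handle_node_loss(placement, live, "dn2")
for block, holders in placement.items():
    print(block, sorted(holders))
```

After the loss of `dn2`, both blocks are back at three replicas on the remaining nodes, which is the self-healing behavior the slide describes.
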
Hadoop as an ecosystem

• Higher-level frameworks that create complex MapReduce workflows: Pig, Oozie, Cascading, Scalding, …

• SQL on Hadoop: Hive, Phoenix, Impala, Presto, …

• Storage systems on Hadoop: HBase

• Serialization/format libraries: Parquet, Avro, ORC, …

• Spark

Running Hadoop

• Single-node setup

    • Single-node standalone (“local”)

    • Single-node pseudo-distributed setup

• Cluster setup

• Cloud setup

    • Roll your own: cluster setup using VMs

    • More cloud-native setup: on-demand YARN/MR + cloud storage

Running Hadoop

                     # of processes   # of machines

local                1                1

pseudo-distributed   several          1

cluster              many             many

Running Hadoop
Demo

Install and run Hadoop in a single-node setup

Running Hadoop
Demo

• Install pre-requisites: JDK, ssh, etc.

• Install Hadoop

• Explore the Hadoop installation

• Try standalone setup

• Start and stop pseudo-distributed setup

Running Hadoop
Pseudo-distributed cluster

• https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html

• Set up ssh for localhost

    • Install ssh (server and client): sshd and ssh

    • Make sshd run in the background

    • Do key generation (keygen) to do passwordless localhost ssh

• “Format” the hdfs filesystem

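The steps above can be sketched as a small dry-run script. The commands are the ones given in the Apache single-node setup guide linked above; the script structure (`SETUP_STEPS`, `run_steps`) is hypothetical, the `bin/` and `sbin/` paths assume the working directory is the Hadoop install directory, and the configuration edits the guide requires before formatting (for example `core-site.xml`) are left out.

```python
import subprocess

# (description, command) pairs; paths are relative to the Hadoop install
# directory, which is an assumption of this sketch.
SETUP_STEPS = [
    ("generate a passphrase-less key", "ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa"),
    ("authorize the key for localhost", "cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys"),
    ("restrict key file permissions", "chmod 0600 ~/.ssh/authorized_keys"),
    ("format the HDFS filesystem", "bin/hdfs namenode -format"),
    ("start the HDFS daemons", "sbin/start-dfs.sh"),
]


def run_steps(steps, dry_run=True):
    """Print each step; execute it only when dry_run is False."""
    handled = []
    for description, command in steps:
        print(f"# {description}\n{command}")
        if not dry_run:
            # check=True stops at the first failing command.
            subprocess.run(command, shell=True, check=True)
        handled.append(command)
    return handled


planned = run_steps(SETUP_STEPS)  # dry run: prints the plan, changes nothing
```

Keeping `dry_run=True` as the default is deliberate: formatting the filesystem is destructive, so the plan should be reviewed before anything is executed.
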
