Lecture: Big Data Architecture and Technology

Introduction to Big Data

MOH EDI WIBOWO


mediw@ugm.ac.id

Department of Computer Science and Electronics


Universitas Gadjah Mada

www.ugm.ac.id Locally Rooted, Globally Respected


What is Big Data?
● From http://www.gartner.com
○ Big data is high-volume, high-velocity and/or high-variety information assets that
demand cost-effective, innovative forms of information processing that enable
enhanced insight, decision making, and process automation
○ Used in data analytics (not in transactional data processing)
○ https://www.datasciencecentral.com/who-came-up-with-the-name-big-data/



Big Data 3V
● Terms used to characterize big data
○ Volume, velocity, variety
○ Others: Volatility, variability, value, veracity, … (4V, 5V, … )



Big Data Volume
● Big data has a massive volume
○ How big is big in big data?
○ Requires clusters of machines (a single machine is not sufficient)
■ Improves not only capacity but also data resilience
■ What about the CAP theorem?



Big Data Velocity
● Data are produced at a very high speed
○ How to handle data that grow continuously?
○ The need for scalability (clusters of machines that can be scaled out in practice)



Big Data Variety
● The data come in different modalities and formats (no single fixed format)
○ Texts, images, audio, video
○ Structured (fixed, pre-specified format), unstructured (no format)
■ Semi-structured (varied formats, e.g. HTML)
○ How to store such data?
○ Heterogeneous methods of data storage (data lake instead of data warehouse)
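
The three categories above can be illustrated with a minimal Python sketch. All sample records, names, and field names here are hypothetical and chosen only for illustration: a CSV table stands in for structured data, JSON documents with differing fields for semi-structured data, and a free-text sentence for unstructured data.

```python
import csv
import io
import json

# Structured: every row follows the same pre-specified columns (hypothetical sample).
structured = "id,name,age\n1,Ani,30\n2,Budi,25\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: each document is self-describing, but fields vary per record.
semi_structured = [
    '{"id": 1, "name": "Ani", "tags": ["a", "b"]}',
    '{"id": 2, "email": "budi@example.com"}',
]
docs = [json.loads(s) for s in semi_structured]
all_keys = sorted({key for doc in docs for key in doc})

# Unstructured: free text with no schema; any structure must be extracted later.
unstructured = "Ani reported that the sensor failed twice last night."
tokens = unstructured.split()

print(rows[0])
print(all_keys)
print(len(tokens))
```

A data lake would typically store all three kinds in their raw form and defer imposing a schema until the data are read, whereas a data warehouse expects the structured form up front.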



Examples of Big Data
● Social media
● Mobile applications with massive user bases
● Finance/banking
● Telecommunication
● Sensors and internet of things
● Web scraping
● Log data
● Population (civil registration) data
● Others



Examples of Big Data Architecture
● https://www.alibabacloud.com/blog/pravega-the-answer-of-storage-layer-in-flink-ecosystem_596430
● https://dataview.in/wp-content/uploads/2017/06/Lambda.jpg
● https://www.thinkdataanalytics.com/big-data-streaming-analytics/



Hadoop
● An open source framework that helps to store and to process large
amounts of data in computer clusters
○ Co-founded by Doug Cutting and Mike Cafarella
○ Mostly written in Java
○ Inspired by Google’s papers
■ e.g. "MapReduce: Simplified Data Processing on Large Clusters"



Hadoop Cluster



Hadoop Modules
● Hadoop common, the base module
● Hadoop distributed file system (HDFS), the distributed storage
● Hadoop YARN, for resource management
● Hadoop MapReduce, for parallel processing
● Hadoop Ozone, for object storage
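
To make the MapReduce module more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases using the classic word-count example. It is plain Python, not the Hadoop API; the function names are hypothetical, and a real Hadoop job would run the map and reduce tasks in parallel across the cluster.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)

def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

print(word_count(["big data", "big cluster"]))
```

Because each call to map_phase depends only on its own line, and each call to reduce_phase only on its own key, both phases can be distributed over many machines without coordination beyond the shuffle.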



HDFS Architecture

*https://www.integrate.io/blog/guide-to-hdfs-for-big-data-processing/



HDFS Block Replication

*https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
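
As a rough illustration of what block replication costs in storage, the sketch below estimates how many blocks and replicas a file occupies. The function name is hypothetical, and the 128 MB block size and replication factor of 3 are assumptions (common defaults in recent Hadoop releases; both are configurable per cluster and per file).

```python
def hdfs_storage_footprint(file_size_bytes,
                           block_size_bytes=128 * 1024 * 1024,  # assumed block size
                           replication=3):                      # assumed replication factor
    """Return (blocks, block replicas, raw bytes stored across the cluster)."""
    if file_size_bytes < 0:
        raise ValueError("file size must be non-negative")
    # Ceiling division: the last block may be only partially filled.
    num_blocks = (file_size_bytes + block_size_bytes - 1) // block_size_bytes
    total_replicas = num_blocks * replication
    # A partially filled last block only occupies its actual length on disk.
    raw_bytes = file_size_bytes * replication
    return num_blocks, total_replicas, raw_bytes

blocks, replicas, raw = hdfs_storage_footprint(1024 ** 3)  # a 1 GiB file
print(blocks, replicas, raw)
```

Under these assumptions, a 1 GiB file is split into 8 blocks, stored as 24 block replicas, and consumes 3 GiB of raw disk space in total.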



Hadoop SSH Accounts

[Figure: SSH key layout for the hadoop-user account]
● master: hadoop-user holds the key pair ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub
● each slave: hadoop-user holds ~/.ssh/authorized_keys



Hadoop Tutorial
● https://medium.com/@vikassharma555/hadoop-installation-on-windows-wsl-2-on-ubuntu-20-04-lts-single-node-d604729ea0ca



Hadoop Web-Based Cluster UI



Command Lines for Hadoop
● hadoop fs, hdfs dfs
○ hadoop fs -ls …
○ hadoop fs -mkdir …
○ hadoop fs -put …
○ hadoop fs -cat …
○ hadoop fs -get …
○ etc
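
The commands above can also be scripted. The following is a minimal sketch that builds the argument lists in Python and runs them only when a hadoop executable is available on the PATH. The helper names, the HDFS directory /user/demo/input, and the file names local.txt and copy.txt are hypothetical placeholders.

```python
import shutil
import subprocess

def hadoop_fs(*args):
    """Build the argument list for one `hadoop fs` invocation."""
    return ["hadoop", "fs", *args]

# Hypothetical paths, used only for illustration.
commands = [
    hadoop_fs("-mkdir", "-p", "/user/demo/input"),                # create a directory in HDFS
    hadoop_fs("-put", "local.txt", "/user/demo/input/"),          # upload a local file
    hadoop_fs("-ls", "/user/demo/input"),                         # list the directory
    hadoop_fs("-cat", "/user/demo/input/local.txt"),              # print the file content
    hadoop_fs("-get", "/user/demo/input/local.txt", "copy.txt"),  # download it again
]

def run_all(cmds):
    """Run each command in order; only print them if Hadoop is not installed."""
    if shutil.which("hadoop") is None:
        for cmd in cmds:
            print("would run:", " ".join(cmd))
        return False
    for cmd in cmds:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return True
```

Calling run_all(commands) on a machine with a configured Hadoop client would execute the five steps in sequence; elsewhere it only prints them.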



HDFS Java API
● To write to and to read from HDFS programmatically
○ Example code: https://simpan.ugm.ac.id/s/UCmyj4ctbVK3zVL



HDFS Java API
● Read API

*https://www.guru99.com/learn-hdfs-a-beginners-guide.html



HDFS Java API
● Write API

*https://www.guru99.com/learn-hdfs-a-beginners-guide.html



Apache Flume
● A distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data

https://flume.apache.org/



Apache Flume
● Can be seen as a message broker (a mediator between data
producers and centralized stores)
● When the rate of incoming data exceeds the rate at which data can
be written to the destination, Flume provides a steady flow of data
between them
● Reliable, fault tolerant, scalable, manageable, and customizable
● Supports multi-hop flows, fan-in and fan-out flows, contextual routing,
etc.
● Supports a large set of source and destination types, including
HDFS
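
The buffering behaviour described above (absorbing bursts when data arrive faster than the destination can accept them) can be sketched with a bounded queue between a source and a sink. This is a simplified single-process simulation, not the Flume API; the function name, parameters, and tick-based model are assumptions made for illustration.

```python
from collections import deque

def simulate_channel(bursts, sink_rate, capacity):
    """Simulate a bounded channel between a bursty source and a slow sink.

    bursts    -- list of event lists, one list per time tick (source side)
    sink_rate -- maximum number of events the sink can write per tick
    capacity  -- maximum number of events the channel can hold
    Returns (delivered events in order, peak backlog, number of rejected events).
    """
    channel = deque()
    delivered = []
    rejected = 0
    peak = 0
    for burst in bursts:
        for event in burst:
            if len(channel) < capacity:
                channel.append(event)
            else:
                # Simplification: a real source would typically retry or back off.
                rejected += 1
        peak = max(peak, len(channel))
        # The sink drains at its own steady pace, regardless of the burst size.
        for _ in range(min(sink_rate, len(channel))):
            delivered.append(channel.popleft())
    while channel:  # the source has stopped; the sink finishes the backlog
        delivered.append(channel.popleft())
    return delivered, peak, rejected

print(simulate_channel([[1, 2, 3, 4, 5], [], [6], []], sink_rate=2, capacity=10))
```

With enough capacity, the burst of five events is delivered in order at two events per tick and nothing is lost; with a capacity that is too small, some events are rejected, which is why channel sizing matters.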



Apache Flume
● Tutorials:
○ Installation
■ https://www.tutorialspoint.com/apache_flume/apache_flume_environment.htm
○ Configuration of Flume agent
■ https://www.tutorialspoint.com/apache_flume/apache_flume_netcat_source.htm
○ HDFS sink
■ https://howtoprogram.xyz/2016/08/01/apache-flume-hdfs-sink-tutorial/



Thank you





Technologies
● Hadoop (distributed file system)
● Spark (for parallel processing)
● Kafka (message broker)



Big Data, Cloud, Blockchain
● Are they related?
○ Do they have similarities, or are they completely different approaches?
