Introduction to Spark
Dr. Rajiv Misra
Associate Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
rajivm@iitp.ac.in
Cloud Computing and Distributed Systems
Vu Pham
Preface
Content of this Lecture:
In this lecture, we will discuss the framework of Spark,
Resilient Distributed Datasets (RDDs), and some of its
applications, such as PageRank and GraphX.
Need of Spark
Apache Spark is a big data analytics framework that
was originally developed at the University of
California, Berkeley's AMPLab, starting in 2009. Since then, it
has gained a lot of traction both in academia and in
industry.
It is another system for big data analytics.
Isn't MapReduce good enough?
• MapReduce simplifies batch processing on large commodity clusters
[Figure: MapReduce data flow from Input through Map and Reduce to Output, with an expensive save to disk for fault tolerance]
MapReduce can be expensive for some applications, e.g.,
• Iterative
• Interactive
Lacks efficient data sharing
Specialized frameworks did evolve for different programming models
• Bulk Synchronous Processing (Pregel)
• Iterative MapReduce (Hadoop) ….
Solution: Resilient Distributed Datasets (RDDs)
Resilient Distributed Datasets (RDDs)
• Immutable, partitioned collection of records
• Built through coarse-grained transformations (map, join, …)
• Can be cached for efficient reuse
[Figure: data is read from HDFS into a chain of RDDs that go through Map and Reduce, with a cached RDD that is read again]
Solution: Resilient Distributed Datasets (RDDs)
Fault Recovery?
Lineage!
• Log the coarse-grained operation applied to a partitioned dataset
• Simply recompute the lost partition if a failure occurs!
• No cost if no failure
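The recovery idea can be sketched in plain Scala with no Spark dependency. `MiniRDD`, `Source`, and `Mapped` are hypothetical names invented for this illustration, and a lost partition is simulated by dropping it from an in-memory cache. This is a toy model under those assumptions, not how Spark itself is implemented.

```scala
// Toy model of lineage-based recovery (NOT Spark's real implementation).
// MiniRDD, Source, and Mapped are hypothetical names for this sketch.
object LineageSketch {
  sealed trait MiniRDD[A] {
    // Cached partitions: partition index -> records. Removing an entry
    // simulates losing that partition when a worker fails.
    private val cache = scala.collection.mutable.Map.empty[Int, Vector[A]]

    def numPartitions: Int

    // Rebuild one partition from the lineage (parent + logged operation).
    protected def recompute(p: Int): Vector[A]

    // Return a partition, recomputing it only if it is not cached.
    def partition(p: Int): Vector[A] = cache.getOrElseUpdate(p, recompute(p))

    def dropPartition(p: Int): Unit = { cache.remove(p); () }
    def isCached(p: Int): Boolean = cache.contains(p)

    // Coarse-grained transformation: one logged function for all records.
    def map[B](f: A => B): MiniRDD[B] = Mapped(this, f)

    // Action: materialise every partition, in order.
    def collect(): Vector[A] =
      (0 until numPartitions).toVector.flatMap(p => partition(p))
  }

  // Base dataset standing in for stable storage such as HDFS.
  final case class Source[A](parts: Vector[Vector[A]]) extends MiniRDD[A] {
    def numPartitions: Int = parts.size
    protected def recompute(p: Int): Vector[A] = parts(p)
  }

  // Derived dataset: remembers its parent and the function applied to it.
  final case class Mapped[A, B](parent: MiniRDD[A], f: A => B) extends MiniRDD[B] {
    def numPartitions: Int = parent.numPartitions
    protected def recompute(p: Int): Vector[B] = parent.partition(p).map(f)
  }
}
```

Only the dropped partition is rebuilt, and only when something asks for it again; while nothing fails, the logged function is never re-run, which is the "no cost if no failure" property.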
[Figure: RDDs track the graph of transformations that built them (their lineage) to rebuild lost data]
What can you do with Spark?
RDD operations
• Transformations, e.g., filter, join, map, group-by, …
• Actions, e.g., count, print, …
Control
• Partitioning: Spark gives you control over how your RDDs are partitioned.
• Persistence: Spark lets you choose whether or not to persist an RDD to disk.
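The split between transformations and actions can be illustrated by analogy with Scala's own lazy collection views: building the pipeline does no work, and only the final call forces evaluation. This is an analogy in plain Scala, not Spark code, and `LazySketch` and `demo` are names made up for the example.

```scala
// Analogy using Scala's lazy collection views (not Spark RDDs).
object LazySketch {
  // Returns (result of the action,
  //          records touched before the action ran,
  //          records touched after the action ran).
  def demo(): (Int, Int, Int) = {
    var touched = 0
    // "Transformations": only a recipe is built here, nothing is computed.
    val multiplesOfFour =
      (1 to 10).view.map { x => touched += 1; x * 2 }.filter(_ % 4 == 0)
    val before = touched
    // "Action": asking for a result forces the pipeline to run.
    val count = multiplesOfFour.size
    (count, before, touched)
  }
}
```

Calling `LazySketch.demo()` shows that the counter is still zero after the transformations are declared and becomes positive only once the action runs.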
Partitioning: PageRank
• Joins take place repeatedly
• Good partitioning reduces shuffles
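A rough way to see why partitioning matters for repeated joins: if the links table and the ranks table are split with the same hash function, partition i of one side only ever needs partition i of the other, so no records have to move between partitions. The sketch below uses plain Scala collections; `hashPartition` and `localJoin` are hypothetical helpers, not Spark APIs.

```scala
// Toy illustration of why co-partitioning helps joins (not Spark code).
// hashPartition and localJoin are hypothetical helpers for this sketch.
object PartitioningSketch {
  // Assign each (key, value) record to a partition by hashing its key.
  def hashPartition[K, V](records: Seq[(K, V)], n: Int): Vector[Seq[(K, V)]] = {
    val byPart = records.groupBy { case (k, _) => Math.floorMod(k.##, n) }
    Vector.tabulate(n)(p => byPart.getOrElse(p, Seq.empty))
  }

  // Join two datasets that share the same partitioning: each pair of
  // matching partitions is joined on its own, with no data exchanged
  // between different partitions.
  def localJoin[K, V, W](left: Vector[Seq[(K, V)]],
                         right: Vector[Seq[(K, W)]]): Vector[Seq[(K, (V, W))]] =
    left.zip(right).map { case (l, r) =>
      val rightByKey = r.groupBy(_._1)
      for {
        (k, v) <- l
        (_, w) <- rightByKey.getOrElse(k, Seq.empty)
      } yield (k, (v, w))
    }
}
```

In PageRank the links table never changes, so partitioning it once and reusing that layout on every iteration avoids re-shuffling it for each join.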
Example: PageRank
Give pages ranks (scores) based on links to them
• Links from many pages → high rank
• Links from a high-rank page → high rank
Algorithm
Step-1 Start each page at a rank of 1
Step-2 On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
Step-3 Set each page's rank to 0.15 + 0.85 × contributions
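The three steps can be traced on a small in-memory graph. The sketch below uses plain Scala collections rather than Spark; `pageRank` and its parameters are illustrative names, and it assumes every page appears as a key in `links`. Pages that receive no contributions are given the base rank of 0.15 here, which is a choice made for this sketch.

```scala
// Plain-Scala sketch of the three steps (no Spark; names are illustrative).
object PageRankSketch {
  // links: page -> pages it links to. Assumes every page appears as a key.
  def pageRank(links: Map[String, Seq[String]], iterations: Int): Map[String, Double] = {
    // Step 1: start each page at a rank of 1.
    var ranks: Map[String, Double] = links.keys.map(p => p -> 1.0).toMap
    for (_ <- 1 to iterations) {
      // Step 2: page p contributes rank_p / |neighbors_p| to each neighbor.
      val contribs: Seq[(String, Double)] = links.toSeq.flatMap {
        case (page, neighbors) =>
          neighbors.map(dest => (dest, ranks(page) / neighbors.size))
      }
      // Sum the contributions received by each page.
      val summed: Map[String, Double] =
        contribs.groupBy(_._1).map { case (p, cs) => (p, cs.map(_._2).sum) }
      // Step 3: new rank = 0.15 + 0.85 * (sum of contributions received).
      ranks = links.keys.map(p => (p, 0.15 + 0.85 * summed.getOrElse(p, 0.0))).toMap
    }
    ranks
  }
}
```

On a three-page cycle (a links to b, b to c, c to a) every page receives exactly one full contribution per iteration, so all ranks stay at 0.15 + 0.85 × 1 = 1.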
Spark Program
val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs
for (i <- 1 to ITERATIONS) {
  // Each page sends rank / (number of neighbors) to every neighbor
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  // Sum the contributions per page, then set the new rank
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)
PageRank Performance
Generality
RDDs allow unification of different programming models
• Stream Processing
• Graph Processing
• Machine Learning
Gather-Apply-Scatter on GraphX
[Figure: a graph represented in a table of triplets]
• Gather at A: Group-By A
• Apply: Map
• Scatter: Triplets Join
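The mapping of Gather-Apply-Scatter onto table operations can be sketched with ordinary Scala collections. This is not the GraphX API: `Triplet`, `triplets`, and `gasStep` are hypothetical names, vertex attributes are plain `Double` values, and the "apply" rule (replace a vertex's value with the sum of its in-neighbors' values) is an arbitrary choice made only to have something to compute.

```scala
// Sketch of one Gather-Apply-Scatter round on tables (not the GraphX API).
// Triplet, triplets, and gasStep are hypothetical names for illustration.
object GasSketch {
  final case class Triplet(src: Long, srcAttr: Double, dst: Long, dstAttr: Double)

  // Scatter as a join: attach the current vertex attributes to every edge.
  def triplets(vertices: Map[Long, Double], edges: Seq[(Long, Long)]): Seq[Triplet] =
    edges.map { case (s, d) => Triplet(s, vertices(s), d, vertices(d)) }

  def gasStep(vertices: Map[Long, Double], edges: Seq[(Long, Long)]): Map[Long, Double] = {
    // Gather as a group-by: collect the triplets arriving at each vertex
    // and sum the values carried by their sources.
    val gathered: Map[Long, Double] =
      triplets(vertices, edges)
        .groupBy(_.dst)
        .map { case (v, ts) => (v, ts.map(_.srcAttr).sum) }
    // Apply as a map over the vertex table: compute each new attribute.
    // Vertices with no incoming edges keep their old value.
    vertices.map { case (v, old) => (v, gathered.getOrElse(v, old)) }
  }
}
```

Running `gasStep` again on its own output rebuilds the triplets from the updated vertex table, which is where the join plays the role of the scatter phase.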
The GraphX API
Property Graphs
Creating a Graph (Scala)
Graph Operations (Scala)
Built-in Algorithms (Scala)
The triplets view
The subgraph transformation
Computation with aggregateMessages
Example: Graph Coarsening
How GraphX Works
Storing Graphs as Tables
Simple Operations
• Reuse vertices or edges across multiple graphs
Implementing triplets
Implementing aggregateMessages
Future of GraphX
1. Language support
   a) Java API
   b) Python API: collaborating with Intel, SPARK-3789
2. More algorithms
   a) LDA (topic modeling)
   b) Correlation clustering
3. Research
   a) Local graphs
   b) Streaming/time-varying graphs
   c) Graph database–like queries
Other Spark Applications
i. Twitter spam classification
ii. EM algorithm for traffic prediction
iii. K-means clustering
iv. Alternating Least Squares matrix factorization
v. In-memory OLAP aggregation on Hive data
vi. SQL on Spark
Reading Material
Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica,
“Spark: Cluster Computing with Working Sets”
Matei Zaharia, Mosharaf Chowdhury, et al.,
“Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing”
https://spark.apache.org/
Conclusion
• Resilient Distributed Datasets (RDDs) provide a simple and efficient programming model
• Generalized to a broad set of applications
• Leverages the coarse-grained nature of parallel algorithms for failure recovery
Introduction to Kafka
Dr. Rajiv Misra
Associate Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
rajivm@iitp.ac.in
Preface
Content of this Lecture:
• Define Kafka
• Describe some use cases for Kafka
• Describe the Kafka data model
• Describe Kafka architecture
• List the types of messaging systems
• Explain the importance of brokers
Batch vs. Streaming
[Figure: two panels comparing Batch and Streaming]
Introduction: Apache Kafka
Kafka is a high-performance, real-time messaging
system. It is an open source tool and is part of the Apache
projects.
The characteristics of Kafka are:
1. It is a distributed and partitioned messaging system.
2. It is highly fault-tolerant.
3. It is highly scalable.
4. It can process and send millions of messages per second to several receivers.
Kafka History
• Apache Kafka was originally developed by LinkedIn and later handed over to the open source community in early 2011.
• It became a top-level Apache project in October 2012.
• A stable Apache Kafka version, 0.8.2.0, was released in February 2015.
• A stable Apache Kafka version, 0.8.2.1, was released in May 2015, which was then the latest version.
Kafka Use Cases
Kafka can be used for various purposes in an organization.
[Figure: Kafka use cases]
Apache Kafka: a Streaming Data Platform
Most of what a business does can be thought of as event
streams. They occur in a
• Retail system: orders, shipments, returns, …
• Financial system: stock ticks, orders, …
• Web site: page views, clicks, searches, …
• IoT: sensor readings, …
and so on.
Enter Kafka
Adopted at 1000s of companies worldwide
Aggregating User Activity Using Kafka: Example
Kafka can be used to aggregate user activity data such as clicks,
navigation, and searches from different websites of an
organization; such user activities can be sent to a real-time
monitoring system and to a Hadoop system for offline processing.
Kafka Data Model
The Kafka data model consists of messages and topics.
• Messages represent information such as lines in a log file, a row of stock market data, or an error message from a system.
• Messages are grouped into categories called topics. Example: LogMessage and StockMessage.
• The processes that publish messages into a topic in Kafka are known as producers.
• The processes that receive the messages from a topic in Kafka are known as consumers.
• The processes or servers within Kafka that process the messages are known as brokers.
• A Kafka cluster consists of a set of brokers that process the messages.
Topics
• A topic is a category of messages in Kafka.
• The producers publish the messages into topics.
• The consumers read the messages from topics.
• A topic is divided into one or more partitions.
• A partition is also known as a commit log.
• Each partition contains an ordered set of messages.
• Each message is identified by its offset in the partition.
• Messages are added at one end of the partition and consumed at the other.
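A partition viewed as a commit log can be modelled as an append-only buffer in which the offset is simply the position of a message. The class below is a toy written for this illustration; `PartitionLog` and its methods are hypothetical names and do not correspond to Kafka's internal classes.

```scala
// Toy model of a topic partition as an append-only log (not Kafka's code).
// PartitionLog is a hypothetical class written for this illustration.
object LogSketch {
  final class PartitionLog {
    private val messages = scala.collection.mutable.ArrayBuffer.empty[String]

    // Append at the end of the log and return the offset that was assigned.
    def append(message: String): Long = {
      messages += message
      messages.size - 1L
    }

    // Read up to `max` messages starting at `offset`, in log order,
    // each paired with its offset.
    def read(offset: Long, max: Int): Vector[(Long, String)] =
      messages.iterator.zipWithIndex
        .slice(offset.toInt, offset.toInt + max)
        .map { case (m, i) => (i.toLong, m) }
        .toVector

    // Offset that the next appended message will receive.
    def endOffset: Long = messages.size.toLong
  }
}
```

Because reading does not remove anything, a reader only has to remember the next offset it wants, and different readers can be at different positions in the same log.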
Partitions
• Topics are divided into partitions, which are the unit of parallelism in Kafka.
• Partitions allow messages in a topic to be distributed to multiple servers.
• A topic can have any number of partitions.
• Each partition should fit in a single Kafka server.
• The number of partitions decides the parallelism of the topic.
Partition Distribution
• Partitions can be distributed across the Kafka cluster.
• Each Kafka server may handle one or more partitions.
• A partition can be replicated across several servers for fault tolerance.
• One server is marked as the leader for the partition and the others are marked as followers.
• The leader controls the reads and writes for the partition, whereas the followers replicate the data.
• If a leader fails, one of the followers automatically becomes the leader.
• ZooKeeper is used for leader election.
Producers
• The producer is the creator of the message in Kafka.
• Producers publish each message to a particular topic.
• Producers also decide which partition to place the message into.
• A topic should already exist before a producer places a message in it.
• Messages are added at one end of the partition.
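One common way for a producer to decide on a partition is to hash the message key, so that messages with the same key always land in the same partition, and to spread keyless messages round-robin. The sketch below follows that idea under stated assumptions: `SimplePartitioner` is a hypothetical class, and it uses `String.hashCode` for simplicity, which is not the hash function used by Kafka's own client.

```scala
// Sketch of how a producer might pick a partition (illustrative only;
// the hash function here is an assumption, not the one Kafka ships).
object PartitionerSketch {
  final class SimplePartitioner(numPartitions: Int) {
    require(numPartitions > 0, "a topic needs at least one partition")
    private var next = 0

    // Keyed messages: same key -> same partition, so per-key order is kept.
    // Keyless messages: spread round-robin across the partitions.
    def partitionFor(key: Option[String]): Int = key match {
      case Some(k) => Math.floorMod(k.hashCode, numPartitions)
      case None =>
        val p = next
        next = (next + 1) % numPartitions
        p
    }
  }
}
```

Keying by something like a user id keeps all of that user's events in one partition, and therefore in order relative to one another.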
Consumers
• The consumer is the receiver of the message in Kafka.
• Each consumer belongs to a consumer group.
• A consumer group may have one or more consumers.
• The consumers specify what topics they want to listen to.
• Each message is delivered to one consumer within each subscribing consumer group.
• Consumer groups control the messaging style: consumers in the same group share the messages (queue), while separate groups each receive every message (publish-subscribe).
Kafka Architecture
Kafka architecture consists of brokers that take messages from the
producers and add them to a partition of a topic. Brokers provide the
messages to the consumers from the partitions.
• A topic is divided into multiple partitions.
• The messages are added to the partitions at one end and consumed in the same order.
• Each partition acts as a message queue.
• Consumers are divided into consumer groups.
• Each message is delivered to one consumer in each consumer group.
• ZooKeeper is used for coordination.
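The delivery rule for consumer groups can be made concrete with a small model: inside one group every partition is owned by exactly one consumer, and each group gets its own independent assignment. This is a plain-Scala toy, not the Kafka client API; `assign` and `receivers` are hypothetical helpers, and round-robin assignment is an assumption chosen for simplicity.

```scala
// Toy model of consumer-group delivery (not the Kafka client API).
object ConsumerGroupSketch {
  // Assign partitions to the consumers of ONE group: every partition is
  // owned by exactly one consumer of that group (round-robin here).
  def assign(partitions: Seq[Int], consumers: Seq[String]): Map[Int, String] = {
    require(consumers.nonEmpty, "a group needs at least one consumer")
    partitions.zipWithIndex.map { case (p, i) => (p, consumers(i % consumers.size)) }.toMap
  }

  // For a message stored in `partition`, report which consumer of each
  // group receives it: exactly one consumer per group.
  def receivers(partition: Int, groups: Map[String, Map[Int, String]]): Map[String, String] =
    groups.map { case (group, assignment) => (group, assignment(partition)) }
}
```

With all consumers in a single group the topic behaves like a queue whose work is shared; with one group per consumer it behaves like publish-subscribe, since every group sees every message.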
Types of Messaging Systems
Kafka architecture supports both publish-subscribe and queue systems.
[Figure: Example: Queue System]
[Figure: Example: Publish-Subscribe System]
Brokers
Brokers are the Kafka processes that process the messages in Kafka.
• Each machine in the cluster can run one broker.
• They coordinate with each other using ZooKeeper.
• One broker acts as the leader for a partition and handles the delivery and persistence, whereas the others act as followers.
Kafka Guarantees
Kafka guarantees the following:
1. Messages sent by a producer to a topic and a partition are appended in the order they are sent.
2. A consumer instance gets the messages in the same order as they are produced.
3. A topic with replication factor N tolerates up to N-1 server failures.
Replication in Kafka
Kafka uses the primary-backup method of replication.
• One machine (one replica) is called the leader and is chosen as the primary; the remaining machines (replicas) are chosen as the followers and act as backups.
• The leader propagates the writes to the followers.
• The leader waits until the writes are completed on all the replicas.
• If a replica is down, it is skipped for the write until it comes back.
• If the leader fails, one of the followers will be chosen as the new leader; this mechanism can tolerate n-1 failures if the replication factor is n.
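The primary-backup scheme can be sketched for a single partition as follows. `ReplicatedPartition` is a hypothetical class; leader choice is reduced to "first live replica in the list", which stands in for the real election and is purely an assumption of this toy, and a failed replica never rejoins here.

```scala
// Toy model of primary-backup replication for one partition (not Kafka code).
object ReplicationSketch {
  final class ReplicatedPartition(replicaIds: Seq[Int]) {
    require(replicaIds.nonEmpty, "replication factor must be at least 1")
    // One log per replica.
    private val logs: Map[Int, scala.collection.mutable.ArrayBuffer[String]] =
      replicaIds.map(id => (id, scala.collection.mutable.ArrayBuffer.empty[String])).toMap
    // Live replicas; the first one is treated as the leader.
    private var alive: Vector[Int] = replicaIds.toVector

    def leader: Option[Int] = alive.headOption

    // The leader appends, then the write is propagated to every live
    // follower. Replicas that are down are skipped.
    def write(message: String): Boolean = leader match {
      case None => false // every replica has failed: the write is rejected
      case Some(_) =>
        alive.foreach(id => logs(id) += message)
        true
    }

    // Mark a replica as failed. If it was the leader, the next live
    // follower becomes the new leader.
    def fail(id: Int): Unit = { alive = alive.filterNot(_ == id) }

    def log(id: Int): Vector[String] = logs(id).toVector
  }
}
```

With three replicas the partition keeps accepting writes after two failures and only stops after the third, matching the n-1 bound for replication factor n.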
Persistence in Kafka
• Kafka uses the Linux file system for persistence of messages.
• Persistence ensures no messages are lost.
• Kafka relies on the file system page cache for fast reads and writes.
• All the data is immediately written to a file in the file system.
• Messages are grouped as message sets for more efficient writes.
• Message sets can be compressed to reduce network bandwidth.
• A standardized binary message format is used among producers, brokers, and consumers to minimize data modification.
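The benefit of compressing a whole message set rather than individual messages can be seen with the JDK's GZIP streams. The sketch below joins messages with newlines before compressing; that layout is an assumption made for the example (it requires messages without embedded newlines) and is not Kafka's binary message format. `MessageSetSketch` is a hypothetical name.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Sketch of batching messages into one compressed "message set".
// The newline-delimited layout is an assumption, not Kafka's wire format.
object MessageSetSketch {
  // Pack a batch of messages into a single GZIP-compressed byte array.
  def compress(messages: Seq[String]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val gzip = new GZIPOutputStream(bytes)
    gzip.write(messages.mkString("\n").getBytes(UTF_8))
    gzip.close() // finishes the GZIP stream and flushes it into `bytes`
    bytes.toByteArray
  }

  // Unpack a compressed message set back into its messages.
  def decompress(set: Array[Byte]): Seq[String] = {
    val gzip = new GZIPInputStream(new ByteArrayInputStream(set))
    val text = try scala.io.Source.fromInputStream(gzip, "UTF-8").mkString
               finally gzip.close()
    if (text.isEmpty) Seq.empty else text.split("\n").toSeq
  }
}
```

Log-like messages tend to repeat the same fields, so compressing them together lets the compressor exploit redundancy across messages, which is why batching and compression work well in combination.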
Apache Kafka: a Streaming Data Platform
Apache Kafka is an open source streaming data platform (a new
category of software!) with 3 major components:
1. Kafka Core: A central hub to transport and store event streams in real time.
2. Kafka Connect: A framework to import event streams from other source data systems into Kafka and export event streams from Kafka to destination data systems.
3. Kafka Streams: A Java library to process event streams live as they occur.
Further Learning
o Kafka Streams code examples
  o Apache Kafka https://github.com/apache/kafka/tree/trunk/streams/examples/src/main/java/org/apache/kafka/streams/examples
  o Confluent https://github.com/confluentinc/examples/tree/master/kafka-streams
o Source Code https://github.com/apache/kafka/tree/trunk/streams
o Kafka Streams Java docs http://docs.confluent.io/current/streams/javadocs/index.html
o First book on Kafka Streams (MEAP)
  o Kafka Streams in Action https://www.manning.com/books/kafka-streams-in-action
o Kafka Streams download
  o Apache Kafka https://kafka.apache.org/downloads
  o Confluent Platform http://www.confluent.io/download
Conclusion
• Kafka is a high-performance, real-time messaging system.
• Kafka can be used as an external commit log for distributed systems.
• The Kafka data model consists of messages and topics.
• Kafka architecture consists of brokers that take messages from the producers and add them to a partition of a topic.
• Kafka architecture supports two types of messaging systems: publish-subscribe and queue.
• Brokers are the Kafka processes that process the messages in Kafka.