Spark Tutorial
Contents
1. Introduction
2. What is Spark?
3. History of Apache Spark
4. Why Spark?
5. Apache Spark Components
5.1. Spark Core
5.2. Spark SQL
5.3. Spark Streaming
5.4. Spark MLlib
5.5. Spark GraphX
5.6. SparkR
6. Resilient Distributed Dataset – RDD
7. Spark Shell
8. Conclusion
https://data-flair.training/hadoop-spark-developer-course/
Apache Spark Tutorial
1. Introduction
What is Spark? Why is there such a buzz around this technology? I hope this Spark
introduction tutorial will help answer some of these questions. Apache Spark is an open-source
cluster computing system that provides high-level APIs in Java, Scala, Python and R. It can access data
from HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source, and it can run in
standalone mode or under the YARN and Mesos cluster managers.
This Spark tutorial covers the Spark ecosystem components, a Spark video tutorial, and the core Spark
abstraction – the RDD – along with transformations and actions on RDDs. The objective of this introductory guide is
to provide a detailed overview of Spark: its history, its architecture, its deployment models and RDDs in Spark.
2. What is Spark?
Apache Spark is a general-purpose, lightning-fast cluster computing system. It provides high-level APIs
in Java, Scala, Python and R. For some workloads, Spark can run up to 100 times faster than Hadoop
MapReduce when processing data in memory, and up to 10 times faster when processing data on disk.
Spark is written in Scala but provides rich APIs in Scala, Java, Python, and R.
It can be integrated with Hadoop and can process existing Hadoop HDFS data. Follow this guide to learn
how Spark is compatible with Hadoop.
It is said that a picture is worth a thousand words. With this in mind, we have also provided a
Spark video tutorial for a better understanding of Apache Spark.
4. Why Spark?
After studying the Apache Spark introduction, let's discuss why Spark came into existence.
In the industry, there is a need for a general-purpose cluster computing tool.
Hence, in the industry there is a big demand for a powerful engine that can process data in real time
(streaming) as well as in batch mode – an engine that can respond in sub-second time and
perform in-memory processing.
Apache Spark is defined as a powerful open-source engine that provides real-time stream
processing, interactive processing, graph processing and in-memory processing as well as batch processing,
with very high speed, ease of use and a standard interface. This is what sets it apart in the Hadoop vs.
Spark and Spark vs. Storm comparisons.
In this "What is Spark" tutorial, we discussed the definition of Spark, the history of Spark and the importance of Spark.
Now let's move on to the Spark components.
5.6. SparkR
SparkR is an R package that provides a lightweight frontend for using Apache Spark from R. It allows data scientists to
analyze large datasets and interactively run jobs on them from the R shell. The main idea
behind SparkR was to explore different techniques for integrating the usability of R with the scalability of
Spark.
6. Resilient Distributed Dataset – RDD
Resilient Distributed Dataset (RDD) is the fundamental unit of data in Apache Spark. It is a
distributed collection of elements across cluster nodes, on which parallel operations can be performed. Spark RDDs
are immutable, but new RDDs can be generated by transforming existing ones. There are three ways to
create an RDD:
Parallelized collections – We can create a parallelized collection by invoking the parallelize method in the
driver program.
External datasets – By calling the textFile method, one can create an RDD. This method takes the URL of a file
and reads it as a collection of lines.
Existing RDDs – By applying a transformation operation on an existing RDD, we can create a new RDD.
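The first two creation methods can be sketched in the Scala spark-shell as follows (a minimal sketch; `sc` is the SparkContext that spark-shell provides, and `data.txt` is a hypothetical file path):

```scala
// Parallelized collection: distribute a local Scala collection across the cluster
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

// External dataset: read a text file as an RDD of lines
// ("data.txt" is a hypothetical path; local, HDFS or S3 URLs all work)
val lines = sc.textFile("data.txt")
```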
Transformation – Creates a new RDD from an existing one. A transformation passes the dataset through a function and
returns a new dataset.
Action – A Spark action returns the final result to the driver program or writes it to an external data store.
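In the spark-shell, the difference looks like this (a sketch; `sc` is the shell's SparkContext). Transformations such as `map` and `filter` are lazy and only describe a new RDD; an action such as `reduce` triggers the actual computation:

```scala
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Transformations: lazily build new RDDs, nothing executes yet
val squares = nums.map(n => n * n)
val evens   = squares.filter(_ % 2 == 0)

// Action: triggers execution and returns a result to the driver
val total = evens.reduce(_ + _)   // 4 + 16 = 20
```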
Refer to this link to learn the RDD transformation and action APIs with examples.
7. Spark Shell
Apache Spark provides an interactive spark-shell, which makes it easy to run Spark applications from the
command line of the system. Using the Spark shell, we can run and test our application code interactively. Spark
can read from many types of data sources, so it can access and process large amounts of data.
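For instance, assuming Spark is installed, the `spark-shell` command starts the shell with a ready-made SparkContext named `sc`; a short interactive session might then look like this (a sketch – `README.md` is a hypothetical input file):

```scala
// Typed at the scala> prompt after launching spark-shell
val lines = sc.textFile("README.md")        // read a text file as an RDD of lines
val words = lines.flatMap(_.split("\\s+"))  // transformation: split lines into words
println(words.count())                      // action: count the words and print the result
```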
8. Conclusion
Spark provides a collection of technologies that increase the value of big data and permit new
use cases. It gives us a unified framework for creating, managing and implementing big
data processing requirements. The Spark video tutorial gives you detailed information about Spark.
In addition to MapReduce-style operations, one can also run SQL queries and process streaming
data through Spark – capabilities that were drawbacks of Hadoop 1. With Spark, developers can use
Spark features on a standalone basis or combine them with MapReduce programming
techniques.
This conclusion is not the end but a foundation for learning more about Apache Spark, and a starting
point for your next steps once you are through with this Apache Spark tutorial.