RDD


1. **Resilient Distributed Datasets (RDDs) in Apache Spark**

**Overview:**
- RDDs are the core data structure of Apache Spark and have been part of it since its initial release.
- They are distributed across cluster nodes and stored in RAM, which makes processing much faster than disk-based frameworks such as Hadoop MapReduce.
.........................
2. **Features of RDDs:**
1. **Immutable:** RDDs are immutable collections; changes result in the creation of new RDDs.
2. **Resilient:** They offer fault tolerance through lineage graphs, allowing recomputation of lost partitions
due to node failures.
3. **Lazy Evaluation:** Transformations are executed only when actions are triggered, enhancing
performance by avoiding unnecessary computations.
4. **Distributed:** Data is spread across multiple nodes, ensuring high availability.
5. **Persistence:** Intermediate results can be stored in memory or on disk for reuse and faster computation (illustrated in the sketch below).
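
A minimal PySpark sketch of immutability, lazy evaluation, and persistence, assuming an already-created SparkContext named `sc`:

numbers = sc.parallelize(range(1, 11))       # original RDD (immutable)
squares = numbers.map(lambda x: x * x)       # transformation returns a new RDD; nothing runs yet
squares.cache()                              # persistence: keep the results in memory once computed
print(squares.count())                       # action triggers the computation -> 10
print(squares.take(3))                       # reuses the cached partitions -> [1, 4, 9]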
.........................
3. **Ways to Create RDDs:**
- Using the `parallelize()` method.
- From existing RDDs.
- From external storage systems (e.g., HDFS, Amazon S3, HBase).
- From existing DataFrames and Datasets.
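
A short sketch of the creation paths listed above; the HDFS path and sample data are placeholders, not part of the original notes:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-creation").getOrCreate()
sc = spark.sparkContext

rdd1 = sc.parallelize([1, 2, 3, 4, 5])                      # from a local collection
rdd2 = rdd1.map(lambda x: x * 10)                           # from an existing RDD
rdd3 = sc.textFile("hdfs:///data/example.txt")              # from external storage (placeholder path)
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
rdd4 = df.rdd                                               # from an existing DataFrame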
..............................
4. **Operations on RDDs in Apache Spark**

**1. Transformations:**
- **Purpose:** Used to manipulate RDD data and return new RDDs.
- **Evaluation:** Lazy; they are not executed until an action is performed.
- **Lineage:** Each transformation is linked to its parent RDD through a lineage graph.
- **Types:**
- **Narrow Transformations:** No data movement between RDD partitions. Examples: `map()`, `flatMap()`, `filter()`, `union()`, `mapPartitions()`.
- **Wide Transformations:** Data movement (shuffling) between RDD partitions. Examples: `groupByKey()`, `reduceByKey()`, `aggregateByKey()`, `join()`, `repartition()`.

**2. Actions:**
- **Purpose:** Return a non-RDD value to the driver program.
- **Examples:** `count()`, `collect()`, `first()`, `take()`.
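
A brief sketch contrasting a narrow transformation, a wide transformation, and the action that finally triggers them, assuming an already-created SparkContext `sc`:

words = sc.parallelize(["spark", "rdd", "spark", "action", "rdd", "spark"])
pairs = words.map(lambda w: (w, 1))               # narrow: no data movement between partitions
counts = pairs.reduceByKey(lambda a, b: a + b)    # wide: shuffles data by key
print(counts.collect())                           # action runs the whole lineage, e.g. [('spark', 3), ('rdd', 2), ('action', 1)]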
.........................
5. **Introduction to PySpark**

PySpark is the Python API for Apache Spark, an open-source distributed computing framework for large-scale batch and real-time data processing. It enables Python users to leverage Spark's capabilities.

**Use Cases:**
- **Batch Processing:** Handling large volumes of data in batch pipelines.
- **Real-time Processing:** Processing data as it arrives.
- **Machine Learning:** Implementing machine learning algorithms on big data.
- **Graph Processing:** Analyzing and processing graph data structures.

**Key Features:**
- **Real-time Computations:** Supports real-time data processing.
- **Caching and Disk Persistence:** Allows for caching intermediate results and storing data on disk.
- **Fast Processing:** Provides high-speed data processing capabilities.
- **Works Well with RDDs:** Efficiently processes data using Resilient Distributed Datasets (RDDs).
..........................
6. **Loading a Dataset into PySpark**

Loading the libraries:

import pyspark
from pyspark.sql import SparkSession
import pandas as pd

Creating a SparkSession:

spark = SparkSession.builder.appName("Spark-Introduction").getOrCreate()
spark

Loading the dataset into PySpark: to load a dataset into the Spark session, we can use the `spark.read.csv()` method and save the result in `df_pyspark`.

df_pyspark = spark.read.csv("Example.csv")
df_pyspark
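
By default, `spark.read.csv()` reads every column as a string and treats the first row as data. If the file has a header row, options along the following lines are commonly used (a sketch; the actual layout of Example.csv is not given in the notes):

df_pyspark = spark.read.csv("Example.csv", header=True, inferSchema=True)
df_pyspark.printSchema()                  # inferred column names and types
df_pyspark.show(5)                        # first five rows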
.............................
7. **SparkSession Overview**

**Introduction:**
- **SparkSession** is the entry point for Spark, introduced in Spark 2.0.
- It allows the creation of Spark RDDs, DataFrames, and Datasets.
- It replaces older contexts like `SQLContext`, `HiveContext`, and others.

**Key Points:**

1. **SparkSession in Spark 2.0:**


- Combines functionality of `SQLContext`, `HiveContext`, and other contexts into one unified class.
- Created using `SparkSession.builder()`.

2. **SparkSession in Spark-Shell:**
- The default `spark` object in the Spark shell is an instance of `SparkSession`.
- Use `spark.version` to check the Spark version.

3. **Creating SparkSession:**
- **From Scala or Python:** Use the `builder()` and `getOrCreate()` methods.
- **Multiple Sessions:** Create additional sessions using `newSession()`.

4. **Setting Configurations:**
- Use the `config()` method to set Spark configurations.

5. **Hive Support:**
- Enable Hive support with `enableHiveSupport()`.

6. **Other Usages:**
- **Set & Get Configs:** Manage configurations with `spark.conf`.
- **Create DataFrame:** Use `createDataFrame()` for building DataFrames.
- **Spark SQL:** Use `spark.sql()` to execute SQL queries on temporary views.
- **Create Hive Table:** Use `saveAsTable()` to create and query Hive tables.
- **Catalogs:** Access metadata with `spark.catalog`.
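
A compact sketch touching several of these usages; the configuration key, view name, and sample rows are illustrative only:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sparksession-demo")
         .config("spark.sql.shuffle.partitions", "8")        # set a configuration at build time
         .getOrCreate())

spark.conf.set("spark.sql.shuffle.partitions", "4")          # set & get configs at runtime
print(spark.conf.get("spark.sql.shuffle.partitions"))

df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "name"])
df.createOrReplaceTempView("items")                          # expose the DataFrame to Spark SQL
spark.sql("SELECT id, name FROM items WHERE id = 2").show()

print(spark.catalog.listTables())                            # catalog metadata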

**Commonly Used Methods:**


- `version()`: Returns the Spark version.
- `catalog()`: Access catalog metadata.
- `conf()`: Get the runtime configuration.
- `builder()`: Create a new `SparkSession`.
- `newSession()`: Create an additional `SparkSession`.
- `createDataFrame()`: Create a DataFrame from a collection or RDD.
- `createDataset()`: Create a Dataset from a collection, DataFrame, or RDD.
- `emptyDataFrame()`: Create an empty DataFrame.
- `emptyDataset()`: Create an empty Dataset.
- `sparkContext()`: Get the `SparkContext`.
- `sql(String sql)`: Execute SQL queries and return a DataFrame.
- `sqlContext()`: Return the SQLContext.
- `stop()`: Stop the current `SparkContext`.
- `table()`: Return a DataFrame of a table or view.
- `udf()`: Create a Spark UDF (User-Defined Function).
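
As one example, UDFs can be registered through `spark.udf` in PySpark and then used from SQL (a minimal sketch reusing the `spark` session created earlier; the function and view names are illustrative):

from pyspark.sql.types import IntegerType

spark.udf.register("str_len", lambda s: len(s), IntegerType())   # register a Python function as a UDF
df = spark.createDataFrame([("spark",), ("pyspark",)], ["word"])
df.createOrReplaceTempView("words")
spark.sql("SELECT word, str_len(word) AS length FROM words").show()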
................................
8. **Introduction to Shared Variables**

**Shared variables** are used to efficiently manage and share data across multiple nodes in a distributed
computing environment. Instead of creating separate copies of variables for each node, shared variables
allow nodes to access and update shared data consistently.

### Types of Shared Variables

1. **Broadcast Variables**:
- **Purpose**: Efficiently share read-only variables across all nodes.
- **Usage**: Useful for distributing large, read-only data (like lookup tables) to all nodes in a cluster.
- **Creation**: Use `SparkContext.broadcast(value)` to create a broadcast variable.

2. **Accumulators**:
- **Purpose**: Accumulate values across multiple tasks.
- **Usage**: Useful for counters or sums where tasks can update the accumulator.
- **Creation**: Use `SparkContext.accumulator(initialValue)` to create an accumulator.
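
A minimal sketch of both kinds of shared variable, assuming an already-created SparkContext `sc`; the lookup data is illustrative:

country_codes = sc.broadcast({"IN": "India", "US": "United States"})   # read-only lookup shipped to every node
bad_records = sc.accumulator(0)                                        # counter that tasks can add to

def resolve(code):
    if code not in country_codes.value:
        bad_records.add(1)
        return "Unknown"
    return country_codes.value[code]

rdd = sc.parallelize(["IN", "US", "XX", "IN"])
print(rdd.map(resolve).collect())     # ['India', 'United States', 'Unknown', 'India']
print(bad_records.value)              # 1, read on the driver after the action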

Implementation: see the TechM resource.


.....................................
9. **Actions in Spark RDDs**

**Purpose:**
- Actions perform operations on RDDs and return non-RDD values to the driver program.

**Common Actions:**
- **`collect()`:** Returns all elements of the RDD as a list.
- **`count()`:** Returns the total number of elements in the RDD.
- **`reduce()`:** Computes a summarized result by applying a function across the RDD's elements.
- **`first()`:** Retrieves the first item from the RDD.
- **`take(n)`:** Retrieves the first `n` items from the RDD.
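
A short sketch exercising these actions, assuming an already-created SparkContext `sc`:

rdd = sc.parallelize([5, 3, 8, 1, 9])
print(rdd.collect())                    # [5, 3, 8, 1, 9]
print(rdd.count())                      # 5
print(rdd.reduce(lambda a, b: a + b))   # 26
print(rdd.first())                      # 5
print(rdd.take(3))                      # [5, 3, 8]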

Implementation: see the TechM resource.

........................
10. **Transformations in Spark RDD**

**Transformations Overview**:
- Transformations manipulate RDD data and return a new RDD.
- They are evaluated lazily, meaning they are not executed until an action is performed.
**Key Transformations**:
1. **distinct**: Retrieves unique elements from an RDD.
2. **filter**: Selects elements based on a condition.
3. **sortBy**: Sorts the data in ascending or descending order based on a condition.
4. **map**: Applies an operation to each element in an RDD, returning a new RDD with the same length
as the original.
5. **flatMap**: Similar to `map`, but flattens the result, which may change the length of the RDD.
6. **union**: Combines data from two RDDs into a new RDD.
7. **intersection**: Retrieves common data from two RDDs into a new RDD.
8. **repartition**: Increases the number of partitions in an RDD.
9. **coalesce**: Decreases the number of partitions in an RDD.
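
A sketch of these transformations on small in-memory RDDs, assuming an already-created SparkContext `sc`:

nums = sc.parallelize([3, 1, 2, 3, 4, 4], 4)
print(nums.distinct().collect())                               # unique elements
print(nums.filter(lambda x: x % 2 == 0).collect())             # keep only even numbers
print(nums.sortBy(lambda x: x, ascending=False).collect())     # descending order
print(nums.map(lambda x: x * 2).collect())                     # same length as the input
print(sc.parallelize(["a b", "c d"]).flatMap(lambda s: s.split()).collect())   # flattened output
other = sc.parallelize([4, 5, 6])
print(nums.union(other).collect())                             # combine two RDDs
print(nums.intersection(other).collect())                      # common elements only
print(nums.repartition(8).getNumPartitions())                  # increase partitions -> 8
print(nums.coalesce(2).getNumPartitions())                     # decrease partitions -> 2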

**Grouping Operations**:
- **groupByKey()**: Groups the values for each key into an iterable, which shuffles data over the network.
- **reduceByKey()**: Similar to `groupByKey()`, but it performs a map-side combine to optimize processing.
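
The difference is easiest to see on a small key-value RDD, assuming an already-created SparkContext `sc`:

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("a", 1)])
grouped = pairs.groupByKey().mapValues(list)      # ships every value over the network, then groups
print(grouped.collect())                          # e.g. [('a', [1, 1, 1]), ('b', [1])]
reduced = pairs.reduceByKey(lambda a, b: a + b)   # combines within each partition first, then shuffles
print(reduced.collect())                          # e.g. [('a', 3), ('b', 1)]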

Implementation: see the TechM resource.


.................................................
