Apache Spark:
language used: Scala
to start Spark:
spark-shell for Scala
pyspark for Python
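The spark-xml import used below has to be on the classpath before the shell starts; one way to launch it (the package coordinate/version shown here is an assumption, match it to your Spark and Scala build):
spark-shell --packages com.databricks:spark-xml_2.11:0.4.1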
include modules:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import org.apache.spark.sql.SQLContext
import sqlContext.implicits._
import com.databricks.spark.xml._
import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}
df.describe().show()   // describe() itself returns a DataFrame of summary statistics
df.select("_id","author","description").show()
HDFS base path (Cloudera QuickStart VM): hdfs://quickstart.cloudera:8020/user/cloudera/
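The df used above can be built by reading an XML file through spark-xml; a minimal sketch, assuming a books.xml file with <book> records under the HDFS path above (file name and rowTag are assumptions):
// read XML into a DataFrame; every <book> element becomes one row
val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .load("hdfs://quickstart.cloudera:8020/user/cloudera/books.xml")
df.printSchema()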
MAP and flatMap example:
mountain@mountain:~/sbook$ cat words.txt
line1 word1
line2 word2 word1
line3 word3 word4
line4 word1
scala> val lines = sc.textFile("words.txt");
...
scala> lines.map(_.split(" ")).take(3)
res4: Array[Array[String]] = Array(Array(line1, word1), Array(line2, word2, word1), Array(line3,
word3, word4))
flatMap() flattens the multiple lists into one single list:
scala> lines.flatMap(_.split(" ")).take(3)
res5: Array[String] = Array(line1, word1, line2)
REDUCE:
reduce() aggregates all elements of an RDD pairwise with the given function:
val rdd1 = sc.parallelize(List(20,32,45,62,8,5))
val sum = rdd1.reduce(_+_)
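reduce folds the six numbers pairwise with +, so the shell prints the sum for this list:
sum: Int = 172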
reduceByKey (word count):
val words = Array("one","two","two","four","five","six","six","eight","nine","ten")
val data_RDD = sc.parallelize(words)
val mapped_RDD = data_RDD.map(w => (w,1))
mapped_RDD.take(10)
val reduced_RDD = mapped_RDD.reduceByKey(_+_)
reduced_RDD.take(10)
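There are eight distinct keys in the words above, so take(10) returns them all; the counts (partition order may vary) look like:
res: Array[(String, Int)] = Array((two,2), (one,1), (six,2), (four,1), (five,1), (eight,1), (nine,1), (ten,1))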
FILTER:
val data_RDD = sc.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/temperature_2014.csv")
data_RDD.take(100)
val FL_mapped_RDD = data_RDD.flatMap(lines => lines.split(","))
FL_mapped_RDD.take(20)
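The heading says FILTER but only map/flatMap appear above; a minimal filter sketch on the flattened fields, dropping empty strings (the variable name non_empty_RDD is mine):
// keep only the non-empty fields left after the comma split
val non_empty_RDD = FL_mapped_RDD.filter(field => field.nonEmpty)
non_empty_RDD.take(20)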
RDD to DATAFRAME:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import org.apache.spark.sql._
// this is used to implicitly convert an RDD to a DataFrame
import sqlContext.implicits._
// load the data into a new RDD
val ebayText = sc.textFile("/home/jovyan/work/datasets/spark-ebook/ebay.csv")
// Return the first element in this RDD
ebayText.first()
//define the schema using a case class
//class name starts with capital letter
case class Auction(auctionid: String, bid: Float, bidtime: Float, bidder: String,
  bidderrate: Integer, openbid: Float, price: Float, item: String, daystolive: Integer)
// create an RDD of Auction objects
val ebay = ebayText.map(_.split(",")).map(p => Auction(p(0), p(1).toFloat, p(2).toFloat,
  p(3), p(4).toInt, p(5).toFloat, p(6).toFloat, p(7), p(8).toInt))
// Return the first element in this RDD
ebay.first()
// change ebay RDD of Auction objects to a DataFrame
val auction = ebay.toDF()
// How many bids per item?
auction.groupBy("auctionid", "item").count.show
auction.select("auctionid").distinct.count()   // how many distinct auctions?
// Get the auctions with closing price > 100
val highprice = auction.filter("price > 100")
highprice.show()
// register the DataFrame as a temp table
auction.registerTempTable("RDD_table")
// How many bids per auction?
val results = sqlContext.sql("SELECT auctionid, item, count(bid) FROM RDD_table GROUP BY auctionid, item")
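The query returns a DataFrame; to display it (show() prints the first 20 rows by default):
results.show()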
references:
https://mapr.com/ebooks/spark/05-processing-tabular-data-with-spark-sql.html
https://www.supergloo.com/fieldnotes/spark-sql-csv-examples-python/
http://sparktutorials.net/Opening+CSV+Files+in+Apache+Spark+-+The+Spark+Data+Sources+API+and+Spark-CSV