MapR Certified Spark Developer Study Guide (MCSD)
CONTENTS

About MapR Study Guides
MapR Certified Spark Developer (MCSD)
What's on the Exam?

Exam objectives covered include:

- Describe how MapReduce jobs are executed and monitored in both MapReduce v.1 and in YARN
- Describe the differences between running Spark in the interactive shell vs. building a standalone application
- Define ways to create Pair RDDs from existing RDDs and from automatically loaded data
- Describe the differences between groupByKey and reduceByKey and how each is used
- Create DataFrames from existing RDDs, data sources, and using reflection
- Demonstrate how to use DataFrame operations, including domain methods and SQL
- Describe components of the Spark execution model, such as stages, tasks, and jobs
- Debug and tune Spark application logic, as well as Spark configurations, including optimizations, memory cache, and data locality
- Demonstrate how to create DStreams using standard RDD operations and stateful operations
- Use accumulators as atomic aggregates and broadcast variables as a way to share data
Sample Questions
The following questions represent the kinds of questions you will see on the exam. The
answers to these sample questions can be found in the answer key following the sample
questions.
1. Which of the following Scala statements would be most appropriate to load the data
(sfpd.txt) into an RDD? Assume that SparkContext is available as the variable sc
and SQLContext as the variable sqlContext.
a) val sfpd = sqlContext.loadText("/path to file/sfpd.txt")
b) val sfpd = sc.loadFile("/path to file/sfpd.txt")
c) val sfpd = sc.textFile("/path to file/sfpd.txt")
d) val sfpd = sc.loadText("/path to file/sfpd.txt")
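For hands-on review, here is a minimal sketch of loading a text file into an RDD from the Spark shell. It assumes a Spark 1.x shell where sc is predefined; the path is illustrative.

// The Spark 1.x shell predefines the SparkContext as sc.
// textFile loads each line of the file as one String element of the RDD.
val sfpd = sc.textFile("/path/to/sfpd.txt")

// Inspect a few records to confirm the load.
sfpd.take(5).foreach(println)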
2. Given the following lines of code in Scala, identify at which step the input RDD is
actually computed.
val inputRDD = sc.textFile("/user/user01/data/salesdat.csv")
val elecRD = inputRDD.map(line => line.split(","))
val elecRDD = inputRDD.filter(line => line.contains("ELECTRONICS"))
val electCount = elecRDD.count()
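The concept being tested is lazy evaluation: textFile, map, and filter only record lineage, and nothing is read or computed until an action runs. A minimal sketch of the same pattern, with an illustrative path:

// Transformations are lazy: these lines only build the lineage graph.
val inputRDD = sc.textFile("/user/user01/data/salesdat.csv")
val elecRDD = inputRDD.filter(line => line.contains("ELECTRONICS"))

// count() is an action: only here does Spark actually read the file
// and evaluate the filter.
val electCount = elecRDD.count()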
5. An existing RDD, unhcrRDD, contains refugee data from the UNHCR. It contains the
following fields: (Country of residence, Country of origin, Year, Number of refugees).
Sample data is shown below. Assume that the number of refugees is of type Int and all
other values are of type String.
Array(Array("Afghanistan", "Pakistan", "2013", 34), Array("Albania",
"Algeria", "2013", 0), Array("Albania", "China", "2013", 12))
Which of the following would you use in Scala to get the count of all refugees by
country of residence?
a) val country = unhcrRDD.map(x=>(x(0), x(3))).reduceByKey((a,b)=>a+b)
b) val country = unhcrRDD.map(x=>(x(0),1)).reduceByKey((a,b)=>a+b)
c) val country = unhcrRDD.map(x=>x.parallelize())
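A sketch of the map-then-reduceByKey pattern this question targets, using a small in-memory stand-in for unhcrRDD (data illustrative; counts are stored as strings and parsed, to keep the arrays homogeneous):

// Stand-in for unhcrRDD: (residence, origin, year, refugees) per record.
val unhcrRDD = sc.parallelize(Seq(
  Array("Afghanistan", "Pakistan", "2013", "34"),
  Array("Albania", "Algeria", "2013", "0"),
  Array("Albania", "China", "2013", "12")))

// Key by country of residence, value = refugee count, then sum per key.
val byResidence = unhcrRDD.map(x => (x(0), x(3).toInt))
                          .reduceByKey((a, b) => a + b)

byResidence.collect().foreach(println)  // e.g. (Albania,12), (Afghanistan,34)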
6. Given the pair RDD country that contains tuples of the form (Country, count),
which of the following is used to get the country with the lowest number of
refugees in Scala?
a) val low = country.sortByKey().first
b) val low = country.sortByKey(false).first
c) val low = country.map(x=>(x._2,x._1)).sortByKey().first
d) val low = country.map(x=>(x._2,x._1)).sortByKey(false).first
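sortByKey sorts a pair RDD by its key, so ordering by count first requires swapping each pair. A sketch, assuming the country RDD of (Country, count) pairs from the previous question:

// Swap (country, count) to (count, country) so the count becomes the key.
val lowest = country.map(x => (x._2, x._1))
                    .sortByKey()  // ascending; pass false for descending
                    .first()      // the (lowestCount, countryName) pair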
7. There are two datasets on online auctions. D1 has the number of bids for each
auction item. D2 contains the seller rating for each auction item. Not every seller has a
seller rating. What would you use to get all auction items with the bid count and the
seller rating (if the data exists) in Scala?
a) D2.join(D1)
b) D1.join(D2)
c) D1.leftOuterJoin(D2)
d) D2.leftOuterJoin(D1)
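A sketch of the join semantics being tested, with tiny illustrative pair RDDs standing in for D1 (bid counts) and D2 (seller ratings):

// D1: (item, bidCount) for every auction item.
val d1 = sc.parallelize(Seq(("item1", 12), ("item2", 7), ("item3", 3)))
// D2: (item, sellerRating); not every item's seller is rated.
val d2 = sc.parallelize(Seq(("item1", 4.5), ("item3", 3.9)))

// leftOuterJoin keeps every key from the left RDD; the right side is an Option.
val joined = d1.leftOuterJoin(d2)
joined.collect().foreach(println)
// e.g. (item2,(7,None)), (item1,(12,Some(4.5))), (item3,(3,Some(3.9)))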
8. A DataFrame can be created from an existing RDD. In which case would you create
the DataFrame from the existing RDD by inferring the schema using case classes?
a) If your dataset has more than 22 fields
b) If all your users are going to need the dataset parsed in the same way
c) If you have two sets of users who will need the text dataset parsed differently
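A sketch of inferring a DataFrame schema by reflection with a case class, in the Spark 1.x style this guide uses (field names and path illustrative):

// The case class defines the schema once, for every user of the dataset.
case class Incident(category: String, district: String, count: Int)

import sqlContext.implicits._  // enables rdd.toDF()

val incidents = sc.textFile("/path/to/sfpd.txt")
  .map(_.split(","))
  .map(p => Incident(p(0), p(1), p(2).trim.toInt))
  .toDF()

incidents.registerTempTable("incidents")  // queryable from SQL (Spark 1.x API)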
9. In Scala, which of the following would be used to specify a User Defined Function
(UDF) that can be used in a SQL statement on Apache Spark DataFrames?
a) registerUDF(func name, func def)
b) sqlContext.udf(function definition)
c) udf((arguments)=>{function definition})
d) sqlContext.udf.register(func name, func def)
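A sketch of registering a UDF for use in SQL statements, Spark 1.x style (the function name and query are illustrative, and assume the incidents table from the previous sketch):

// Register a Scala function under a name that SQL statements can call.
sqlContext.udf.register("toUpper", (s: String) => s.toUpperCase)

val upper = sqlContext.sql("SELECT toUpper(category) AS category FROM incidents")
upper.show()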
Answer Key
1. Option C
2. Option B
3. Option A
4. Option C
5. Option A
6. Option C
7. Option C
8. Option B
9. Option D
Preparing for the Certification
The certification exam fee includes one exam attempt only, taken on the
student's own time (not in class).
Datasets
These are some datasets that we recommend for experimenting with Spark.
6) Once confirmed, your reservation will appear in the My Exams tab of Innovative Exams.
We recommend that you sign in 30 minutes in advance of your testing time so that you
can communicate with your proctor and get completely set up well before your
test time.
You will be required to share your desktop and your webcam prior to the exam start.
YOUR EXAM SESSION WILL BE RECORDED. If the proctor senses any misconduct, your
exam will be paused and you will be notified of the misconduct.
If the misconduct is not corrected, the proctor will shut down your exam, resulting in a
Fail.
Examples of misconduct and/or misuse of the exam include, but are not limited to, the
following:
Your certificate is available as a PDF. You can download and print your certificate from
your profile in learn.mapr.com.
Your credential contains a unique Certificate Number and a URL. You can share your
credential with anyone who needs to verify your certification.
If you happen to fail the exam, you will automatically qualify for a discounted exam
retake. Retakes are $100 USD and can be purchased in learn.mapr.com only after you've
failed that exam.
Exam Retakes
If you fail an exam, you may purchase a retake immediately and take the exam again
after 14 days. Once you have passed the exam, you may not take that version (e.g., v.4.0)
of the exam again, but you may take any newer version of the exam (e.g., v.4.1).