Final


Which command is used to populate a Big SQL table?
Load
You need to monitor and manage data security
across a Hadoop platform. Which tool would you
use?
a. HDFS
b. Hive
c. SSL
d. Apache Ranger

Which type of function promotes code re-use and reduces query complexity?
a. Scalar
b. OLAP
c. User defined
d. Built in

You need to create a table that is not managed by the Big SQL database manager. Which keyword would you use to create the table?
a. boolean
b. string
c. external
d. smallint

Which feature allows the bigsql user to securely access data in Hadoop on behalf of another user?
a. schema
b. impersonation
c. rights
d. privilege

Which statement describes the purpose of Ambari?

C. It is used for provisioning, managing, and monitoring Hadoop clusters.

What ZK CLI command is used to list all the ZNodes at the top level of the ZooKeeper hierarchy, in the ZooKeeper command-line interface?
ls /

What must be done before using Sqoop to import from a relational database?
Copy the appropriate JDBC driver JAR to $SQOOP_HOME/lib.

What is the default number of rows Sqoop will export per transaction?
1000

When sharing a notebook, what will always point to the most recent version of the notebook?
A. Watson Studio homepage
B. The permalink (URL)
C. The Spark service
D. PixieDust visualization

Which of the "Five V's" of Big Data describes the real purpose of deriving business insight from Big Data?
Value

Which Spark RDD operation returns values after performing the evaluations?
a. actions

Which statement describes "Big Data" as it is used in the modern business world?
A. The summarization of large indexed data stores to provide information about potential problems or opportunities.
B. Indexed databases containing very large volumes of historical data used for compliance reporting purposes.
C. Non-conventional methods used by businesses and organizations to capture, manage, process, and make sense of a large volume of data.
D. Structured data stores containing very large data sets such as video and audio streams.

Which two descriptions are advantages of Hadoop?

A. intensive calculations on small amounts of data

B. processing random access transactions

C. processing a large number of small files

D. able to use inexpensive commodity hardware

E. processing large volumes of data with high throughput

Which statement is true about the Hadoop Distributed File System (HDFS)?
A. HDFS is a software framework to support computing on large clusters of computers.
B. HDFS is the framework for job scheduling and cluster resource management.
C. HDFS provides a web-based tool for managing Hadoop clusters.
D. HDFS links the disks on multiple nodes into one large file system.

Which two of the following are row-based data encoding formats?
a. avro
b. csv

What is the default number of rows Sqoop will export per transaction?
A. 100,000
B. 1,000
C. 100

Which element of Hadoop is responsible for spreading data across the cluster?
a. MapReduce

Under the MapReduce v1 programming model, what happens in the "Map" step?
Input is processed as individual splits.


Under the MapReduce v1 architecture, which
element of MapReduce controls job execution on
multiple slaves?
JobTracker

Which type of function promotes code re-use and reduces query complexity?
A. Scalar
B. User-Defined
C. OLAP
D. Built-in

***
The Spark configuration must be set up first through IBM Cloud
***
You can import preinstalled libraries if you are using which languages?
Python and R
Who can control Watson Studio project assets?
editors

Who can access your data or notebooks in your Watson Studio project?
Collaborators

***
Which visualization library is developed by IBM as
an add-on to Python notebooks?
PixieDust
***

You need to define a server to act as the medium between an application and a data source in a Big SQL federation. Which command would you use?
A. SET AUTHORIZATION

B. CREATE WRAPPER

C. CREATE NICKNAME

D. CREATE SERVER

Which file format has the highest performance?

A. ORC

B. Sequence

C. Delimited

D. Parquet

Apache Ranger is a centralized security framework to enable, monitor, and manage comprehensive data security across the Hadoop platform:
• Manages fine-grained access control over Hadoop data access components like Apache Hive and Apache HBase
• The Ranger console can manage policies for access to files, folders, databases, tables, or columns with ease
• Policies can be set for individual users or groups
Which is the primary advantage of using column-based data formats (such as ORC and Parquet) over record-based formats?
A. facilitates SQL-based queries
B. faster query execution
C. supports in-memory processing
D. better compression using GZip
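A rough sketch of the idea behind this question: a row (record) layout must touch every record to read one field, while a columnar layout stores each field contiguously, so a query can scan only the columns it needs (and runs of similar values compress well). The table data here is invented for illustration:

```python
# Row-based layout: one tuple per record.
rows = [("alice", 30), ("bob", 25), ("carol", 30)]

# Column-based layout: one array per field.
columns = {"name": ["alice", "bob", "carol"], "age": [30, 25, 30]}

# Reading just the ages from the columnar layout touches a single
# contiguous array, instead of walking every record.
ages_row_layout = [age for _, age in rows]
ages_col_layout = columns["age"]
print(ages_col_layout)  # [30, 25, 30]
```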

In Big SQL, what is used for table definitions, location, and storage format of input files?
Hive MetaStore

A. Ambari

B. Scheduler

C. Hadoop Cluster

D. The Hive Metastore


Which of the following data encoding formats is a compact, binary format that supports interoperability with multiple programming languages and versioning?
AVRO

What Python statement is used to add a library to the current code cell?

A. using

B. pull

C. import

D. load
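As a minimal illustration (using the standard-library `math` module rather than any particular data-science package), `import` is the statement that makes a library available in a code cell:

```python
# Bring a library into the current code cell with `import`.
import math

# Once imported, the module's functions are available.
print(math.sqrt(16))  # 4.0
```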
Which command would you run to make a remote table
accessible using an alias?

A. CREATE NICKNAME

B. CREATE SERVER

C. SET AUTHORIZATION

D. CREATE WRAPPER

You need to define a server to act as the medium between an application and a data source in a Big SQL federation. Which command would you use?
CREATE SERVER

Which two options can be used to start and stop Big SQL?
command line
Ambari web interface
Which data type can cause significant performance
degradation and should be avoided?
STRING
Which Big SQL authentication mode is designed to provide strong authentication for client/server applications by using secret-key cryptography?
Kerberos
Which command is used to populate a Big SQL table?
LOAD

Which tool should you use to enable Kerberos security?
Ambari

Which data type is BOOLEAN defined as in a Big SQL database?
SMALLINT

You need to monitor and manage data security across a Hadoop platform. Which tool would you use?
Apache Ranger

Which file format has the highest performance?
Parquet

Which feature allows application developers to easily use the Ambari interface to integrate Hadoop provisioning, management, and monitoring capabilities into their own applications?
REST APIs

Which two Spark libraries provide a native shell?
Scala
Python

Which component of the Spark Unified Stack provides processing of data arriving at the system in real-time?
Spark Streaming
Which two are the driving principles of MapReduce?
Spread data across a large cluster of computers.
Run your programs on the nodes that have the data.

Which two of the following are column-based data encoding formats?
ORC
Parquet

Which component of the Spark Unified Stack supports learning algorithms such as logistic regression, naive Bayes classification, and SVM?
MLlib

Which two are features of the Hadoop Distributed File System (HDFS)?
Files are split into blocks.
Data is accessed through Apache Ambari.

Which statement describes "Big Data" as it is used in the modern business world?
Non-conventional methods used by businesses and organizations to capture, manage, process, and make sense of a large volume of data.

What are three examples of Big Data?
photos posted on Instagram
messages tweeted on Twitter
banking records

Under the MapReduce v1 architecture, which element of the system manages the map and reduce functions?
TaskTracker
What ZK CLI command is used to list all the ZNodes at
the top level of the ZooKeeper hierarchy, in the
ZooKeeper command-line interface?
ls /

Which Spark RDD operation creates a directed acyclic graph through lazy evaluations?
Transformations

Which two descriptions are advantages of Hadoop?
able to use inexpensive commodity hardware
processing large volumes of data with high throughput

Which component of the HDFS architecture manages storage attached to the nodes?
DataNode

What must be done before using Sqoop to import from a relational database?
Copy any appropriate JDBC driver JAR to $SQOOP_HOME/lib.

Under the HDFS storage model, what is the default method of replication?
3 replicas, 2 on the same rack, 1 on a different rack

Under the MapReduce v1 programming model, what happens in the "Map" step?
Input is processed as individual splits.

Which two of the following are row-based data encoding formats?
CSV
AVRO

What is the default number of rows Sqoop will export per transaction?
10,000

Under the MapReduce v1 architecture, which function is performed by the JobTracker?
Accepts MapReduce jobs submitted by clients.

What Python package has support for linear algebra, optimization, mathematical integration, and statistics?
SciPy

What must surround LaTeX code so that it appears on its own line in a Jupyter notebook?
$$
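For example, typing the following in a Jupyter Markdown cell renders the formula centered on its own line (display math); the formula itself is just an arbitrary example:

```latex
$$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i $$
```

A single `$...$` pair would instead render the expression inline with the surrounding text.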

Which visualization library is developed by IBM as an add-on to Python notebooks?
A. Scala
B. PixieDust
C. Spark
D. Watson Studio
Which feature allows application developers to easily use the Ambari interface to integrate Hadoop provisioning, management, and monitoring capabilities into their own applications?
REST APIs

Which Hortonworks Data Platform (HDP) component provides a common web user interface for applications running on a Hadoop cluster?

A. Ambari

B. HDFS

C. YARN

D. MapReduce

Which two are the driving principles of MapReduce?
Spread data across a large cluster of computers.
Run your programs on the nodes that have the data.

What are three examples of "Data Exhaust"?
A. browser cache
B. video streams
C. banner ads
D. log files
E. cookies
F. javascript

Which component of the Spark Unified Stack provides processing of data arriving at the system in real-time?
A. Spark SQL
B. Spark Live
C. Spark Streaming
D. MLlib
The Big SQL head node has a set of processes
running. What is the name of the service ID running
these processes?

A. user1

B. bigsql

C. hdfs

D. Db2

What is the default web location for a local Jupyter instance?
localhost:8888

What is the native programming language for Spark?
Scala

What Python package has support for linear algebra, optimization, mathematical integration, and statistics?
SciPy

Which two are examples of personally identifiable information (PII)?
A. Email address
B. Medical record number
C. IP address
D. Time of interaction

Which statement is true about Spark's Resilient Distributed Dataset (RDD)?
It is a distributed collection of elements that are parallelized across the cluster.

Which component of the HDFS architecture manages storage attached to the nodes?

A. NameNode

B. MasterNode

C. DataNode
D. StorageNode

Which two Spark libraries provide a native shell?

A. C++

B. Scala

C. Python

D. C#

E. Java
Which two of the following are column-based data encoding formats?
A. ORC
B. JSON
C. Parquet
D. Flat
E. Avro

Which Spark RDD operation creates a directed acyclic graph through lazy evaluations?

A. GraphX

B. Distribution

C. Actions

D. Transformations

What is the name of the Scala programming feature that provides functions with no names?
A. Syntactical functions

B. Lambda functions

C. Persistent functions

D. Distributed functions

Which file format contains human-readable data where the column values are separated by a comma?
A. Parquet
B. ORC
C. Sequence
D. Delimited

What is the name of the Scala programming feature that provides functions with no names?
A. Syntactical functions
B. Lambda functions

C. Distributed functions

D. Persistent functions

What OS command starts the ZooKeeper command-line interface?
zkCli.sh

Which type of foundation does Big SQL build on?
A. RStudio
B. Jupyter
C. Apache Hive
D. MapReduce

What must surround LaTeX code so that it appears on its own line in a Jupyter notebook?
$$

Under the MapReduce v1 architecture, which function is performed by the JobTracker?
Accepts MapReduce jobs submitted by clients.

Which two of the following data sources are currently supported by Big SQL?
Oracle and Teradata

What is the Hortonworks DataFlow package used for?
A. Analyzing at-rest data in batches.
B. Backup and recovery of all HDP data.
C. Data stream management and processing.
D. Searching HDP data for PII information.

Which two of the following can Sqoop import from a relational database?
(Sqoop can import all rows of a table, can limit the rows and columns, and can use your own query to access relational data.)
A. Database native indexes.
B. All rows of a table.
C. Stored procedure code.
D. Specific rows and columns using a query.

Which three main areas make up Data Science according to Drew Conway? (Select 3)

A. Machine learning

B. Hacking skills

C. Substantive expertise

D. Math and statistics knowledge

E. Traditional research

Which two options can be used to start and stop Big SQL?
Ambari and the command line
What is Hortonworks DataPlane Services (DPS) used for?
Manage, secure, and govern data stored across all storage environments.

The Big SQL head node has a set of processes running. What is the name of the service ID running these processes?
bigsql

For what are interactive notebooks used by data scientists?
Quick data exploration tasks that can be reproduced.

Which two areas of expertise are attributed to a data scientist?
Machine learning
Data Modeling

What is one disadvantage to using CSV formatted data in a Hadoop data store?
It is difficult to represent complex data structures such as maps.

Which statement describes the action performed by HDFS when data is written to the Hadoop cluster?
A. The data is spread out and replicated across the cluster.
B. The MasterNodes write the data to disk.
C. The data is replicated to at least 5 different computers.
D. The FsImage is updated with the new data map.

Under the MapReduce v1 architecture, which function is performed by the TaskTracker?
Manages storage and transmission of intermediate output.

How does MapReduce use ZooKeeper?

A. Coordination between servers.

B. Aid in the high availability of Resource Manager.

C. Server lease management of nodes.

D. Master server election and discovery.


Which statement describes "Big Data" as it is used in the modern business world?
B. Non-conventional methods used by businesses and organizations to capture, manage, process, and make sense of a large volume of data.

What is the default data format Sqoop parses to export data to a database?

A. JSON

B. CSV

C. XML

D. SQL

What is the term for the process of converting data from one "raw" format to another format, making it more appropriate and valuable for a variety of downstream purposes (such as analytics) and allowing for efficient consumption of the data?
A. MapReduce
B. Data mining
C. Data munging
D. YARN

Which feature allows the bigsql user to securely access data in Hadoop on behalf of another user?
impersonation

What are three examples of Big Data?

A. messages tweeted on Twitter

B. bank records

C. photos posted on Instagram

D. web server logs


E. inventory database records

F. cash register receipts


Which component of the HDFS architecture manages the file system namespace and metadata?

A. NameNode

B. SlaveNode

C. WorkerNode

D. DataNode

What is the primary purpose of Apache NiFi?
A. Identifying non-compliant data access.
B. Finding data across the cluster.
C. Connect remote data sources via WiFi.
D. Collect and send data into a stream.

When creating a Watson Studio project, what do you need to specify?
Spark service

Which statement describes a sequence file?
B. The data is not human readable.

What must surround LaTeX code so that it appears on its own line in a Jupyter notebook?
$$

Which component of the Spark Unified Stack supports learning algorithms such as logistic regression, naive Bayes classification, and SVM?
MLlib
Under the MapReduce v1 architecture, which element of the system manages the map
and reduce functions?
A. TaskTracker
B. JobTracker
C. StorageNode
D. SlaveNode
E. MasterNode
Which component of the Apache Ambari
architecture stores the cluster configurations?

D. Postgres RDBMS

In Big SQL, what is used for table definitions, location, and storage format of input files?
The Hive Metastore


Which environmental variable needs to be set to properly start ZooKeeper?
A. ZOOKEEPER_HOME
B. ZOOKEEPER_DATA
C. ZOOKEEPER_APP
D. ZOOKEEPER

In a Hadoop cluster, which two are the result of adding more nodes to the cluster?
It adds capacity to the file system.
Increases available processing power.

What is an authentication mechanism in Hortonworks Data Platform?

Select an answer

A. Hardware token

B. Preshared keys

C. IP address
D. Kerberos

Where must a Spark configuration be set up first?
IBM Cloud

What can be used to surround a multi-line string in a Python code cell by appearing before and after the multi-line string?
""" (triple double quotes)
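A minimal sketch of the idea: triple double quotes open and close a multi-line string in a Python code cell (the query text here is made up for illustration):

```python
# Triple double quotes surround a multi-line string.
query = """SELECT name
FROM employees
WHERE dept = 'sales'"""

# The string keeps its embedded newlines.
print(query.count("\n"))  # 2
```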
Under the HDFS storage model, what is the default
method of replication?

Select an answer

A. 3 replicas, 2 on the same rack, 1 on a different rack

B. 4 replicas, 2 on the same rack, 2 on a separate rack

C. 3 replicas, each on a different rack

D. 2 replicas, each on a different rack

E. 4 replicas, each on a different rack

What is an authentication mechanism in Hortonworks Data Platform?
B. Kerberos
What is meant by data at rest?
A. A file that has been processed by Hadoop.
B. A file that has not been encrypted.
C. Data in a file that has expired.
D. A data file that is not changing.

Who can access your data or notebooks in your Watson Studio project?
Collaborators
Tenants
Anyone
Teams

Which type of function promotes code re-use and reduces query complexity?

A. OLAP

B. Scalar

C. User-Defined

D. Built-in

Which component of the HDFS architecture manages the file system namespace and metadata?

A. NameNode

B. SlaveNode

C. WorkerNode

D. DataNode
Which statement is true about the Hadoop Distributed
File System (HDFS)?
HDFS links the disks on multiple nodes into one large file system.
What is one disadvantage to using CSV formatted data in a Hadoop data store?
A. Data must be extracted, cleansed, and loaded into the data warehouse.
B. It is difficult to represent complex data structures such as maps.
C. Fields must be positioned at a fixed offset from the beginning of the record.
D. Columns of data must be separated by a delimiter.
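A small sketch of why representing maps in CSV is awkward: a nested structure round-trips through JSON (or Avro) intact, while CSV can only store flat text, so the map comes back as a string rather than a usable data structure. The record fields here are invented for illustration:

```python
import csv
import io
import json

record = {"id": 1, "tags": {"color": "red", "size": "L"}}  # nested map

# JSON round-trips the nested structure as-is.
assert json.loads(json.dumps(record)) == record

# CSV flattens everything to text: the map arrives back as a string.
buf = io.StringIO()
csv.writer(buf).writerow([record["id"], record["tags"]])
row = next(csv.reader(io.StringIO(buf.getvalue())))
print(row[1])  # "{'color': 'red', 'size': 'L'}" -- a string, not a dict
```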

Which two are use cases for deploying ZooKeeper?
A. Managing the hardware of cluster nodes.
B. Storing local temporary data files.
C. Simple data registry between nodes.
D. Configuration bootstrapping for new nodes.


Which statement describes the action performed by HDFS when data is written to the Hadoop cluster?
A. The data is spread out and replicated across the cluster.
B. The MasterNodes write the data to disk.
C. The data is replicated to at least 5 different computers.
D. The FsImage is updated with the new data map.

The Distributed File System (DFS) is at the heart of MapReduce. It is responsible for spreading data across the cluster by making the entire cluster look like one giant file system. When a file is written to the cluster, blocks of the file are spread out and replicated across the whole cluster (every block of the file is replicated to three different machines).

What OS command starts the ZooKeeper command-line interface?
zkCli.sh
Which IBM Big Data solution provides real-time analytic processing on data in motion?
InfoSphere Streams

Which type of partitioning is supported by Hive?
value partitioning

Which type of application can be published when using the BigInsights Application Publish wizard in Eclipse?
BigSheets

Which command is used for starting all the BigInsights components?
start-all.sh

Which two commands are used to copy a data file from the local file system to HDFS? (Choose two.)
hadoop fs -copyFromLocal test_file test_file
hadoop fs -put test_file test_file

Under IBM's Text Analytics framework, what programming language is used to write rules that can extract information from unstructured text sources?
AQL

Which statement is true about the following command?
hadoop dfsadmin -report
It reports basic file system information and statistics, including the status of the DataNodes.

Which Hive tool is used to view and manipulate table metadata?
Hive shell

Which Hadoop query language can manipulate semi-structured data and support the JSON format?
Jaql

A user enters the following command:
hadoop fs -ls /mydir/test_file
and receives the following output:
rw-r--r-- 3 biadmin supergroup 714002200 2014-11-21 14:21 /mydir/test_file
What does the 3 indicate?
expected replication factor for this file

When creating a master workbook in BigSheets, what is responsible for formatting the data output in the workbook?
Reader

Which part of the BigInsights web console provides information that would help a user troubleshoot and diagnose a failed application job?
Application Status tab
What is the primary benefit of using Hive with
Hadoop?
Hadoop data can be accessed through SQL statements.
After the following sequence of commands is executed:

create 'table1', 'columnfamily1', 'columnfamily2', 'columnfamily3'
put 'table1', 'row1', 'columnfamily1:c11', 'r1v11'
put 'table1', 'row1', 'columnfamily1:c12', 'r1v12'
put 'table1', 'row1', 'columnfamily2:c21', 'r1v21'
put 'table1', 'row1', 'columnfamily3:c31', 'r1v31'
put 'table1', 'row2', 'columnfamily1:d11', 'r2v11'
put 'table1', 'row2', 'columnfamily1:d12', 'r2v12'
put 'table1', 'row2', 'columnfamily2:d21', 'r2v21'

What value will the count 'table1' command return?
2 (count returns the number of rows; only row1 and row2 exist)

Which Big Data technology delivers real-time analytic processing on data in motion?
Stream computing

Which of the four key Big Data Use Cases is used to lower risk and detect fraud?
Security/Intelligence Extension

Which statement is true about where the output of a Map task is stored?
It is stored on the local disk.

Under IBM's Text Analytics framework, what programming language is used to write rules that can extract information from unstructured text sources?
AQL
Which type of client interface does Hive support?
JDBC

Which of the four characteristics of Big Data deals with trusting data sources?
Veracity

Which file allows you to update Hive configuration?
hive-site.xml

IBM Text Analytics is embedded into the key components of which IBM solution?
InfoSphere BigInsights

What is the primary reason that InfoSphere Streams is able to process data at the rate of terabytes per second?
Data is processed in memory.

Which file system spans all nodes in a Hadoop cluster?
HDFS

Which well-known Big Data source has become the most popular source for business analytics?
social networks
In 2003, IBM's System S was the first prototype of a
new IBM Stream Computing solution that performed
which type of processing?
Real Time Analytic Processing

Which file allows you to update Hive configuration?
hive-site.xml

The Enterprise Edition of BigInsights offers which feature not available in the Standard Edition?
Adaptive MapReduce

What do the query languages Hive and Big SQL have in common?
Both require a schema.

What is a primary reason that business users want to use BigSheets?
It includes a very easy-to-use command-line interface.

Which two Hadoop query languages do not require data to have a schema? (Choose two.)
Pig
Jaql

Which BigSheets component applies a schema to the underlying data at runtime?
Reader
Which statement is true about storage of the output of a Reduce task?
It is stored in HDFS, with one copy on the local machine.

What happens if a child task hangs in a MapReduce job?
JobTracker reschedules the task on another machine.

What is IBM's SQL interface to InfoSphere BigInsights?
Big SQL

What is the default data block size in BigInsights?
128 MB

What is the process in MapReduce that moves all data from one key to the same worker node?
Shuffle

In the master/slave architecture, what is considered the slave?
DataNode

Under the MapReduce architecture, how does a JobTracker detect the failure of a TaskTracker?
receives no heartbeat

Which capability of Jaql gives it a significant advantage over other query languages?
It can handle deeply nested, semi-structured data.
What is the default number of replicas in HDFS
replication?
3

Which AQL statement can be used to create components of a Text Analytics extractor?
create view <view_name> as <select or extract statement>;
Which BigInsights tool is used to access the
BigInsights Applications Catalog?
Web Console

In Hadoop's rack-aware replica placement, what is the correct default block node placement?
1 block in 1 rack, 2 blocks in a second rack

Which file formats can Jaql read?
JSON, Avro, Delimited

Which statement is true about the data model used by HBase?
The table schema only defines column families.

What is the open-source implementation of BigTable, Google's extremely scalable storage system?
HBase

Where does HDFS store the file system namespace and properties?
FsImage
What does GPFS offer that HDFS does not?
MapReduce split on one local disk

High availability was added to which part of the HDFS file system in HDFS 2.0 to prevent loss of metadata?
NameNode

Which administrative console feature of BigInsights is a visualization and analysis tool designed to work with structured and unstructured data?
BigSheets

Which Hadoop query language was developed by Yahoo to handle almost any type of data?
Pig
Which two file actions can HDFS complete? (Choose
two.)
get
list

Which BigSheets component presents a spreadsheet-like representation of data?
Workbook

Which component of IBM Watson forms the foundation of the framework and allows Watson to extract and index data from any source?
search engine

Which development environment can be used to develop programs for Text Analytics?
Eclipse

Your company wants to utilize text analytics to gain an understanding of general public opinion concerning company products. Which type of input source would you analyze for that information?
Twitter feeds

Which BigInsights feature helps to extract information from text data?
Text Analytics Engine

Which class of software can store and manage only structured data?
Data Warehouse

What is the process in MapReduce that moves all data from one key to the same worker node?
Shuffle
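The Map → Shuffle → Reduce flow behind these questions can be sketched in plain Python as a toy word count (real MapReduce distributes each step across the cluster; the input strings here are made up):

```python
from collections import defaultdict

splits = ["big data big", "data big"]  # input is processed as individual splits

# Map: each split emits (key, value) pairs.
mapped = [(word, 1) for split in splits for word in split.split()]

# Shuffle: all values for one key are moved to the same place.
shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)

# Reduce: aggregate the values per key.
counts = {key: sum(values) for key, values in shuffled.items()}
print(counts)  # {'big': 3, 'data': 2}
```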
----------------------------------------------------------------------------------------------------------------

Which two pre-requisites must be fulfilled when running a Java MapReduce program on the cluster, using Eclipse? (Choose two.)

A. A connection to the BigInsights Server must be defined.
B. A repository connection under the Data Source Explorer panel must be defined.
C. Hadoop services must be running.
D. BigInsights services must be running.
E. Zookeeper must be running.

Which two commands are used to retrieve data from an HBase table? (Choose two.)

read
scan
list
get
describe

You have been asked to create an HBase table and populate it with all the sales transactions generated in the company in the last quarter. Currently, these transactions reside in a 300 MB tab-delimited file in HDFS. What is the most efficient way for you to accomplish this task?

pre-create the column families when creating the table and bulk load the data
pre-create regions by specifying splits in the create table command and use the insert command to load data
pre-create regions by specifying splits in the create table command and bulk load the data
pre-create the column families when creating the table and use the put command to load the data

What are two main components of a Java MapReduce job? (Choose two.)

A. Mapper class, which should extend the org.apache.hadoop.mapreduce.Mapper class
B. Configuration class, which should extend the org.apache.hadoop.mapreduce.JobConfiguration class
C. Job class, which should extend the org.apache.hadoop.mapreduce.Application class
D. Reducer class, which should extend the org.apache.hadoop.mapreduce.Reducer class
E. Scheduler class, which should extend the org.apache.hadoop.mapreduce.Scheduler class

Which element(s) must be specified when creating an HBase table?

only the table name
only the table name and column family(s)
table name, column names and column types
table name, column family(s) and column names

Hadoop is the primary software tool used for which class of computing?

A. Online Transaction Processing
B. Decision Support Systems
C. Big Data Analytics
D. Online Analytical Processing

Which BigSheets component presents a spreadsheet-like representation of data?

Workbook
Table
Reader
View

Which two technologies form the foundation of Hadoop? (Choose two.)

A. HBase
B. MapReduce
C. HDFS
D. GPFS

Which tool is included as part of the IBM BigInsights Eclipse development environment?

code generator
java script compiler
test results analyzer
code converter

Which class of software can store and manage only structured data?

A. MapReduce
B. HDFS
C. Data Warehouse
D. Hadoop
Why does Big SQL perform better than Hive?

It uses sub-queries.
It has better storage handlers.
It has a better optimizer.
It supports Hcatalog.

Which BigInsights feature helps to extract information from text data?

A. Data Warehouse
B. Hive
C. Text Analytics Engine
D. Zookeeper

BigSQL has which advantage over Hive?

It supports standard SQL statements.
It uses better storage handlers.
It provides better SerDe drivers.
It uses the superior HCatalog table manager.

What is the name of the interface that allows Hive to read in data from a table, and write it back out to HDFS in any custom format?

JDBC
BigSQL
Thrift
SerDe

What is the name of the interface that allows Hive to read in data from a table, and write it back out to HDFS in any custom format?

JDBC
BigSQL
Thrift
SerDe

What is one of the primary reasons that Hive is often used with Hadoop?

A. It provides a graphical client to access Hadoop data.
B. MapReduce is difficult to use.
C. It provides a hierarchical schema for Hadoop data sets.
D. Hive creates indexes for tables in Hadoop data sets.

Which built-in Hive storage file format improves performance by providing semi-columnar data storage with good compression?

A. TEXTFILE
B. SEQUENCEFILE
C. DBOUTPUTFILE
D. RCFILE

When the following Hive command is executed:

LOAD DATA INPATH '/tmp/department.del' OVERWRITE INTO TABLE department;

What happens?

A. The department.del file is moved from the HDFS /tmp directory to the location corresponding to the Hive table.
B. The department.del file is copied from the local file system /tmp directory to the location corresponding to the Hive table in HDFS.
C. The department.del file is moved from the local file system /tmp directory to HDFS.
D. The department.del file is copied from the HDFS /tmp directory to the location corresponding to the Hive table.


Which Advanced Analytics toolkit in InfoSphere


Streams is used for developing and building
predictive models?

Select an answer

A CEP

B Geospatial

C SPSS

D Time Series

Which method is used by Jaql to operate on large


arrays?

Select an answer
A Buckets

B SQL

C Partitions

D Parallelization

Which of the four characteristics of Big Data


indicates that many data formats can be stored and
analyzed in Hadoop?

Select an answer

Velocity

Variety

Volume

Volatility

Given the following array:


data = [ { from: 101, to: 102, msg:
"Hello" }, { from: 103, to: 104, msg:
"World!" }, { from: 105, to:106, msg:
"Hello World" } ];
And the following example of expected output:
[
{
"message": "Hello"
}
]

What is the correct sequence of JAQL commands to


select only the message text from sender 101?

Select an answer

A data -> filter $.from == 101 -> expand {message: $.msg};

B data -> filter $.from == 101

C data -> filter $.from == 101 -> transform {message: $.msg};

D data -> transform {message: $.msg} -> filter $.from == 101;
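For readers more familiar with Python than Jaql, the filter/transform pipeline in option C can be sketched as a list comprehension. This is a loose analogy only, not Jaql syntax:

```python
# Rough Python analogy of the Jaql pipeline:
#   data -> filter $.from == 101 -> transform {message: $.msg};
data = [
    {"from": 101, "to": 102, "msg": "Hello"},
    {"from": 103, "to": 104, "msg": "World!"},
    {"from": 105, "to": 106, "msg": "Hello World"},
]

# "filter" keeps records where from == 101; "transform" reshapes each record
result = [{"message": rec["msg"]} for rec in data if rec["from"] == 101]
print(result)  # [{'message': 'Hello'}]
```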

Which JSON-based query language was developed


by IBM and donated to the open-source Hadoop
community?

Select an answer

A Jaql
B Big SQL

C Avro

D Pig

BigInsights offers which added value to Hadoop?

Select an answer

increased agility and flexibility

enhanced web-based UI and tools

higher scalability

higher levels of fault tolerance

Which feature of Jaql provides native functions and


modules that allow you to build re-usable
packages?

Select an answer

A JSON data structures

B support for XML data sources

C extensibility

D MapReduce-based query language


For scalability purposes, data in an HDFS cluster is
broken down into what default block size?

Select an answer

A 16 MB

B 32 MB

C 64 MB

D 128 MB
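The practical effect of the block size can be checked with a little arithmetic. A sketch only: the 64 MB figure is the classic HDFS default, and newer Hadoop releases default to 128 MB:

```python
import math

def num_blocks(file_size_bytes, block_size_bytes=64 * 1024 * 1024):
    """Number of HDFS blocks needed to store a file of the given size."""
    return max(1, math.ceil(file_size_bytes / block_size_bytes))

# A 200 MB file occupies 4 blocks at 64 MB, but only 2 blocks at 128 MB
print(num_blocks(200 * 1024 * 1024))                     # 4
print(num_blocks(200 * 1024 * 1024, 128 * 1024 * 1024))  # 2
```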

Which two Hadoop features make it very cost-


effective for Big Data analytics? (Choose two.)
(Please select ALL that apply)

Select an answer

processes highly structured data

processes transactional data

processes large data sets

processes several small files

runs on commodity hardware

In HDFS 2.0, how much RAM is required in a


Namenode for one million blocks of data?
Select an answer

A 1 GB

B 2 GB

C 5 GB

D 10 GB

Which two file actions can HDFS complete? (Choose


two.)
(Please select ALL that apply)

Select an answer

A Index

B Delete

C Create

D Update

E Execute

Which statement describes Hadoop?

Select an answer

an open source software framework used to manage large volumes of


unstructured, semi-structured, and structured data

a strategic platform for managing and analyzing structured data


an open source RDBMS platform that can index massive volumes of
unstructured and semi-structured data

a browser-based analytic tool designed to gather and organize data

Which statement describes the data model used by


HBase?

Select an answer

A a multidimensional sorted map

B a hierarchical tree structure

C a record/set-based network model

D an object-relational data structure

What is the primary, outstanding feature offered by


HBase?

Select an answer

A It is optimized for high-speed, sequential batch operations.

B It can scale up to 512 terabytes per Hadoop cluster.

C It handles sharding automatically.

D It is designed to run on high-end servers like IBM i or IBM z.

What is the company/organization that developed


HBase?

Select an answer

A Microsoft

B IBM

C Oracle

D Apache

Which statement is true about the following


command?
hadoop dfsadmin -report:

Select an answer

A It displays the list of users having administration privileges.

B It displays basic file system information and statistics.

C It is not a valid Hadoop command.

D It displays the list of all the files and directories in HDFS.

A user enters following command:


hadoop fs -ls /mydir/test_file

and receives the following output:

rw-r--r-- 3 biadmin supergroup


714002200 2014-11-21 14:21
/mydir/test_file

What does the 3 indicate?

Select an answer

A the version of the file stored with this name

B expected replication factor for this file

C the number of times blocks in file were replicated

D the number of blocks in which this file was stored
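The fields of that `-ls` output are positional, so the replication factor can be pulled out mechanically. A quick Python sketch using the sample line above:

```python
# Sample line from `hadoop fs -ls` (fields are whitespace-separated)
line = ("rw-r--r--   3 biadmin supergroup  714002200 "
        "2014-11-21 14:21 /mydir/test_file")

fields = line.split()
replication = int(fields[1])   # second column is the replication factor
size_bytes = int(fields[4])    # fifth column is the file size in bytes

print(replication)  # 3
print(size_bytes)   # 714002200
```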
In the context of a Text Analytics project, which set
of AQL commands, will identify and extract any
matching person names, when run across an input
data source?

Select an answer

A create view Names as
extract 'John', 'Mary', 'Eric', 'Eva'
from Document R;
export view Names;

B create dictionary NamesDict as
('John', 'Mary', 'Eric', 'Eva');
create view Names as
extract dictionary 'NamesDict'
on R.text as match
from Document R;
output view Names;

C create dictionary NamesDict as
('John', 'Mary', 'Eric', 'Eva');
export dictionary NamesDict;

D create dictionary NamesDict as
extract 'Names'
on R.text as match
from Document R;
Which development environment can be used to
develop programs for Text Analytics?

Select an answer

A CodeEnvy

B Eclipse

C Java NetBeans

D VisualAge

In the master/slave architecture, what is considered


the slave?

Select an answer

EditLog

FsImage

NameSpace

DataNode

Your company wants to utilize text analytics to gain


an understanding of general public opinion
concerning company products. Which type of input
source would you analyze for that information?

Select an answer

A Twitter feeds

B email

C server logs

D voicemail

What is the correct sequence of the three main


steps in Text Analytics?

Select an answer

A index text; categorize subjects; parse the data

B categorize subjects; index columns; derive patterns

C structure text; derive patterns; interpret output

D sort tables; index text; derive patterns


What does NameNode use as a transaction log to
persistently record changes to file system
metadata?

Select an answer

DataNode

CheckPoint

FSImage

EditLog

After the following sequence of commands are


executed:
create 'table1', 'columnfamily1',
'columnfamily2', 'columnfamily3'
put 'table1', 'row1',
'columnfamily1:c11', 'r1v11'
put 'table1', 'row1',
'columnfamily1:c12', 'r1v12'
put 'table1', 'row1',
'columnfamily2:c21', 'r1v21'
put 'table1', 'row1',
'columnfamily3:c31', 'r1v31'
put 'table1', 'row2',
'columnfamily1:d11', 'r2v11'
put 'table1', 'row2',
'columnfamily1:d12', 'r2v12'
put 'table1', 'row2',
'columnfamily2:d21', 'r2v21'

What value will the count 'table_1' command


return?

Select an answer

A 2

B 3

C 4

D 7
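The reasoning behind the answer can be checked by modelling the table as a dictionary keyed by row: `count` tallies distinct row keys, not cells. A plain Python sketch, not the HBase shell:

```python
from collections import defaultdict

# Model the HBase table as {row_key: {"family:qualifier": value}}
table1 = defaultdict(dict)

def put(table, row, column, value):
    table[row][column] = value

put(table1, "row1", "columnfamily1:c11", "r1v11")
put(table1, "row1", "columnfamily1:c12", "r1v12")
put(table1, "row1", "columnfamily2:c21", "r1v21")
put(table1, "row1", "columnfamily3:c31", "r1v31")
put(table1, "row2", "columnfamily1:d11", "r2v11")
put(table1, "row2", "columnfamily1:d12", "r2v12")
put(table1, "row2", "columnfamily2:d21", "r2v21")

# Seven puts, but only two distinct row keys
print(len(table1))  # 2
```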

Under the MapReduce architecture, when a line of


data is split between two blocks, which class will
read over the split to the end of the line?

Select an answer

InputSplitter

LineRecordReader

InputFormat

RecordReader
Which class in MapReduce takes each record and
transforms it into a <key, value> pair?

Select an answer

A RecordReader

B InputSplitter

C InputFormat

D LineRecordReader

Which MapReduce task is responsible for reading a


portion of input data and producing <key, value>
pairs?

Select an answer

A Combiner
B Map

C Reduce

D Shuffle
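The Map/Shuffle/Reduce division of labour can be sketched outside Hadoop with a tiny word count. Plain Python, illustrative only:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a <key, value> pair for every word in the input split
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle: sort/group pairs by key, so each key lands in one place;
    # Reduce: sum the values per key
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (key, sum(v for _, v in group))

counts = dict(reduce_phase(map_phase(["to be or", "not to be"])))
print(counts)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```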

Which part of the MapReduce engine controls job


execution on multiple slaves?

Select an answer

A Job Client

B MapTask

C JobTracker

D TaskTracker

Which component of IBM Watson Explorer forms the


foundation of the framework and allows Watson to
extract and index data from any source?

Select an answer

A search engine

B application framework
C meta data

D connector framework

Which Hadoop query language can manipulate


semi-structured data and support the JSON format?

Select an answer

Ruby

Jaql

Python

AQL

What is the name of the Hadoop-based query


language developed by Facebook that facilitates
SQL-like queries?

Select an answer

Pig

BigSql

Jaql

Hive

When the following HIVE command is executed:


LOAD DATA INPATH '/tmp/department.del'
OVERWRITE
INTO TABLE department;
What happens?

Select an answer

The department.del file is copied from the local file system /tmp
directory to the location corresponding to the Hive table in HDFS.

The department.del file is moved from the local file system /tmp
directory in the local file system to HDFS.

The department.del file is moved from the HDFS /tmp directory to the
location corresponding to the Hive table.

The department.del file is copied from the HDFS /tmp directory to the
location corresponding to the Hive table.

Question: 20
Which built-in Hive storage file format improves
performance by providing semi-columnar data
storage with good compression?

Select an answer

SEQUENCEFILE

TEXTFILE

RCFILE

DBOUTPUTFILE

Guys, don't repeat the questions



In which two use cases does IBM Watson Explorer


differentiate itself from competing products?
(Choose two.)
(Please select ALL that apply)

Select an answer

Security/Intelligence extension
360-degree view of the customer

Operations analysis

Data Warehouse augmentation

Big Data exploration



What is the name of the interface that allows Hive to


read in data from a table, and write it back out to
HDFS in any custom format?

Select an answer

SerDe

Thrift

JDBC

BigSQL


Are there any more questions, or is that it?


Whoever finishes should come join us in room E25

Hang on, give it a bit more time, it's only been 30 minutes
When using BigInsights Eclipse to develop a new
application, what must be done prior to testing the
application in the cluster?

Select an answer

configure runtime properties

compile the project

authenticate with Administrator permissions

install test environment

Which tool is included as part of the IBM BigInsights


Eclipse development environment?

Select an answer

code converter

code generator

test results analyzer

java script compiler

Which element of the Big Data Platform can cost-


effectively store and manage many petabytes of
structured and unstructured information?

Select an answer
Stream Computing

Contextual Discovery

Hadoop System

Data Warehouse

Which well known Big Data source has become the


most popular source for business analytics?

Select an answer

RFID

GPS

social networks

cell phones

Which IBM Business Analytics solution facilitates


collaboration and unifies disparate data across
multiple systems into a single access point?

Select an answer

InfoSphere Streams

HBase

IBM Watson Explorer

InfoSphere Text Analytics


When the following HIVE command is executed:
LOAD DATA INPATH '/tmp/department.del'
OVERWRITE
INTO TABLE department;

What happens?

Select an answer

The department.del file is moved from the local file system /tmp
directory in the local file system to HDFS.

The department.del file is copied from the local file system /tmp
directory to the location corresponding to the Hive table in HDFS.

The department.del file is moved from the HDFS /tmp directory to the
location corresponding to the Hive table.

The department.del file is copied from the HDFS /tmp directory to the
location corresponding to the Hive table.

What happens when a user runs a workbook in


BigSheets?

Select an answer

Data is filtered and transformed as desired by the user.


Jaql scripts are compiled and run in the background.

Results on the real data are computed and the output is explored.

Sample data is processed in a simulated environment.

Hadoop is designed for which type of work?

Select an answer

low latency

batch-oriented parallel processing (this is the correct one)

processing many small files

random access

What is the name of the interface that allows Hive to


read in data from a table, and write it back out to
HDFS in any custom format?
Select an answer

JDBC

SerDe

Thrift

BigSQL
What is the primary benefit of using Hive with
Hadoop?

Select an answer

Queries perform much faster than with MapReduce.

Hadoop data can be accessed through SQL statements.

It supports materialized views.

It provides support for transactions.


Which data type is BOOLEAN defined as in a Big
SQL database?

SMALLINT

Which Big SQL authentication mode is designed to


provide strong authentication for client/server
applications by using secret-key cryptography?
(I think it's Kerberos) yes, that's it

You need to define a server to act as the medium


between an application and a data source in a Big SQL
federation. Which command would you use?
CREATE SERVER

Which tool should you use to enable Kerberos


security?
Ambari

Which of the following data encoding formats is a


compact, binary format that supports interoperability
with multiple programming languages and versioning?
Avro

When sharing a notebook, what will always point to


the most recent version of the notebook?

A. The permalink
Which file format has the highest performance?

Parquet

Which three main areas make up Data Science


according to Drew Conway?
-Math and statistics knowledge
- Hacking skills
-Substantive expertise

For what are interactive notebooks used by data


scientists?
Quick data exploration tasks

Which two options can be used to start and stop Big


SQL?

Ambari web interface


Command line

What can be used to surround a multi-line string in a


Python code cell by appearing before and after the
multi-line string?

B. """ (triple quotation marks)
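A minimal illustration of the answer (a Python triple-quoted multi-line string):

```python
# Triple quotes ("""...""" or '''...''') delimit a string that
# may span several lines without escape characters
message = """first line
second line"""

print(message.count("\n") + 1)  # 2 lines
```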

Which command is used to populate a Big SQL


table?

load

Which areas of expertise are attributed to a data


scientist
A. Data modeling
D. Machine learning

Which statement describes a sequence file?


The data is not human readable.
Which command would you run to make a remote
table accessible using an alias?

CREATE NICKNAME

Which data type can cause significant performance


degradation and should be avoided?

STRING

What is the native programming language for


Spark?

A. Scala


Which type of function promotes code re-use and


reduces query complexity?

User-Defined
Which two of the following data sources are
currently supported by Big SQL?

Teradata
Oracle

Where must a Spark configuration be set up first?

---IBM CLOUD

You need to create a table that is not managed by


the Big SQL database manager. Which keyword
would you use to create the table?

EXTERNAL
Which visualization library is developed by IBM as
an add-on to Python notebooks?

---PixieDust

In Big SQL, what is used for table definitions, location,


and storage format of input files?
-Hive MetaStore

Who can access your data or notebooks in your


Watson Studio project?
Collaborators
You need to monitor and manage data security
across a Hadoop platform. Which tool would you
use?

Apache Ranger

The Big SQL head node has a set of processes running. What is the name of
the service ID running these processes?

bigsql

Under the MapReduce v1 architecture, which element


of the system manages the map and reduce functions?
TaskTracker

What is meant by data at rest?


A data file that is not changing.

What is the name of the Scala programming feature


that provides functions with no names?
Lambda functions
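The question is about Scala, but the same idea (a function with no name) exists in Python's `lambda`, shown here purely as an illustration of the concept:

```python
# Anonymous (lambda) functions are defined inline and never given a name.
# The equivalent Scala would be written: nums.map(x => x * x)
nums = [1, 2, 3, 4]
squares = list(map(lambda x: x * x, nums))
print(squares)  # [1, 4, 9, 16]
```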

Who can control Watson Studio project assets?


Editors
Question 5
When creating a Watson Studio project, what do you
need to specify?
Select an answer

A. Data service

B. Data assets

C. Collaborators

D. Spark service

Under the MapReduce v1 architecture, which function


is performed by the JobTracker?
Accepts MapReduce jobs submitted by clients.

What must surround LaTeX code so that it appears


on its own line in a Juptyer notebook?
$$
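For instance, a Jupyter Markdown cell containing the following renders the formula centered on its own line (the formula itself is just an example):

```latex
Inline math like $e^{i\pi} + 1 = 0$ stays within the text line, while

$$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i $$

is displayed on its own line.
```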

Question 20
Which file format contains human-readable data
where the column values are separated by a
comma?

A. Delimited
Which component of the Apache Ambari
architecture stores the cluster configurations?

C. Postgres RDBMS

Which type of foundation does Big SQL build on?


Apache HIVE

Which Spark RDD operation returns values after


performing the evaluations?
actions

You can import preinstalled libraries if you are using


which languages?

B. R

C. Python

Which statement is true about Spark's Resilient Distributed Dataset (RDD)?

It is a distributed collection of elements that are parallelized across the cluster.


In a Hadoop cluster, which two are the result of
adding more nodes to the cluster?

C. It increases available processing power.

E. It adds capacity to the file system.

What are three examples of "Data Exhaust"?

(Select THREE of the six answers)

A. log files

D. cookies

F. browser cache

Under the MapReduce v1 architecture, which


function is performed by the JobTracker?

▪ Accepts MapReduce jobs submitted by clients

What OS command starts the ZooKeeper command-line


interface?
zkCli.sh

Which two are examples of personally identifiable


information (PII)?

Select the FOUR answers that apply

A. Medical record number

B. IP address

C. Email address

D. Time of interaction

Which component of the Spark Unified Stack supports


learning algorithms such as, logistic regression, naive
Bayes classification, and SVM?
B. MLlib

Which statement describes "Big Data" as it is used


in the modern business world?

Select an answer

A. Structured data stores containing very large data sets such as video and
audio streams.

B. The summarization of large indexed data stores to provide information
about potential problems or opportunities.

C. Indexed databases containing very large volumes of historical data used for
compliance reporting purposes.

D. Non-conventional methods used by businesses and organizations to
capture, manage, process, and make sense of a large volume of data.

Which component of the HDFS architecture
manages the file system namespace and
metadata?

namenode

Which Hortonworks Data Platform (HDP) component


provides a common web user interface for
applications running on a Hadoop cluster?

B. Ambari

Which feature allows application developers to easily


use the Ambari interface to integrate Hadoop
provisioning, management, and monitoring capabilities
into their own applications?

B. REST APIs
What are three examples of "Data Exhaust"?

A. banner ads
B. javascript

C. browser cache

D. video streams

E. log files

F. cookies

Which is the primary advantage of using column-based


data formats over record-based formats?

A. faster query execution

What is Hortonworks DataPlane Services (DPS)


used for?

Select an answer

D. Manage, secure, and govern data stored across all storage environments.
Which two of the following are column-based data
encoding formats?

C. ORC

D. Parquet

What Python package has support for linear algebra,


optimization, mathematical integration, and statistics?

SciPy

When sharing a notebook, what will always point to the


most recent version of the notebook?
The permalink

Under the MapReduce v1 architecture, which


element of MapReduce controls job execution on
multiple slaves?

JobTracker

Which environmental variable needs to be set to


properly start ZooKeeper?

ZOOKEEPER_HOME
Which statement describes the action performed by
HDFS when data is written to the Hadoop cluster?
D.

The data is spread out and replicated across the cluster.

Which Spark RDD operation creates a directed


acyclic graph through lazy evaluations?

Select an answer

A. Distribution

B. Actions

C. GraphX (probably?)

D. Transformations
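The lazy-evaluation behaviour behind this answer (Transformations) can be mimicked with a toy class that only records operations until an action is called. This is a sketch of the idea, not Spark's API:

```python
# Minimal sketch (not Spark): transformations build a plan lazily;
# nothing executes until an action such as collect() is called.
class FakeRDD:
    def __init__(self, data, plan=None):
        self.data = data
        self.plan = plan or []          # recorded lineage (a simple DAG)

    def map(self, fn):                  # transformation: just record it
        return FakeRDD(self.data, self.plan + [("map", fn)])

    def filter(self, fn):               # transformation: just record it
        return FakeRDD(self.data, self.plan + [("filter", fn)])

    def collect(self):                  # action: now evaluate the plan
        out = self.data
        for op, fn in self.plan:
            out = [fn(x) for x in out] if op == "map" else [x for x in out if fn(x)]
        return out

rdd = FakeRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(rdd.collect())  # [20, 30, 40]
```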

Which feature allows the bigsql user to securely access


data in Hadoop on behalf of another user?
Impersonation

Which component of the Spark Unified Stack provides


processing of data arriving at the system in real-time?
C. Spark Streaming
How does MapReduce use ZooKeeper?
Aid in the high availability of Resource Manager.

Which statement describes the purpose of Ambari?

Select an answer

A. It forms the architectural center of Hadoop.

B. It is used for provisioning, managing, and monitoring Hadoop clusters.

C. It provides a distributed key/value store for scalable data storage and


retrieval.

D. It is a java-based workflow scheduler system to manage Hadoop jobs.

Under the MapReduce v1 architecture, which


function is performed by the TaskTracker?

Select an answer

A. Manages storage and transmission of intermediate output.

Which of the "Five V's" of Big Data describes the real


purpose of deriving business insight from Big Data
A. Value

What is an authentication mechanism in Hortonworks


Data Platform?

Kerberos
What is meant by data at rest?

A data file that is not changing.


What is the term for the process of converting data
from one "raw" format to another format making it
more appropriate and valuable for a variety of
downstream purposes such as analytics and that
allows for efficient consumption of the data?
A. Data munging

What ZK CLI command is used to list all the ZNodes at


the top level of the ZooKeeper hierarchy, in the
ZooKeeper command-line interface?
ls /

Under the MapReduce v1 programming model, what


happens in the "Map" step?

A. Input is processed as individual splits.

B. Data is aggregated by worker nodes.

C. Output is stored and replicated in HDFS.

D. Multiple map tasks are aggregated.


What is the Hortonworks DataFlow package used
for?

Data stream management and processing.

Who can access your data or notebooks in your


Watson Studio project?

Select an answer

A. Collaborators -----

B. Anyone

C. Teams

D. Tenants

Which two Spark libraries provide a native shell?
(Choose two.)

A. Python----

B. Java

C. C#

D. Scala----

E. C++

What is Hortonworks DataPlane Services
(DPS) used for?

Select an answer

A. Perform backup and recovery of data in the Hadoop ecosystem.

B. Keep data up to date by periodically refreshing stale data.

C. Transform data from CSV format into native HDFS data.

D. Manage, secure, and govern data stored across all storage
environments.-----

What is the default number of rows Sqoop will


export per transaction?
A. 100

Which two of the following are row-based data


encoding formats?
Avro
Csv
Under the HDFS storage model, what is the default
method of replication?
A. 3 replicas, 2 on the same rack, 1 on a different rack

Which component of the HDFS architecture


manages storage attached to the nodes?

datanode
Demo question: (example of how to format a question; write "correct" next to the answer)

Question: 59
Which IBM Big Data solution provides
c
Your answer

A InfoSphere Streams

B InfoSphere BigInsights

C PureData for Analytics

D InfoSphere Information Server

Which type of partitioning is supported by Hive?

Select an answer

A value partitioning

B range partitioning

C time partitioning

D interval partitioning

COME ON, LET'S START

Which type of application can be published when


using the BigInsights Application Publish wizard in
Eclipse?

Select an answer

A BigSheets

B Big SQL

C Hive

D Java
Question: 3
Which command is used for starting all the
BigInsights components?

Select an answer

A start.sh

B start all biginsights

C start-all.sh

D start.sh biginsights

Which two commands are used to copy a data file


from the local file system to HDFS? (Choose two.)
(Please select ALL that apply)

Select an answer

A hadoop fs -get test_file test_file


B hadoop fs -copyFromLocal test_file test_file

C hadoop fs -sync test_file test_file

D hadoop fs -put test_file test_file

E hadoop fs -cp test_file test_file

Under IBM's Text Analytics framework, what


programming language is used to write rules that can
extract information from unstructured text sources?
AQL

Which statement is true about the following


command?
hadoop dfsadmin -report:
Select an answer

A It displays the list of the all files and directories in HDFS.

B It is not a valid Hadoop command.

C It displays the list of users having administration privileges.

D It displays basic file system information and statistics.

Which Hive tool is used to view and manipulate


table metadata?

Select an answer

A Spark

B JobConf Config

C Hive shell

D Cassandra
Which Hadoop query language can manipulate
semi-structured data and support the JSON format?
Jaql

A user enters following command:

hadoop fs -ls /mydir/test_file

and receives the following output:

rw-r--r-- 3 biadmin supergroup


714002200 2014-11-21 14:21
/mydir/test_file

What does the 3 indicate?


Select an answer

A the number of times blocks in file were replicated

B the version of the file stored with this name

C the number of blocks in which this file was stored

D expected replication factor for this file

When creating a master workbook in BigSheets,


what is responsible for formatting the data output in
the workbook?

Select an answer

A Reader

B Avro
C Formula

D Diagram

Which part of the BigInsights web console, provides


information that would help a user troubleshoot and
diagnose a failed application job?

A Application Status tab

What is the primary benefit of using Hive with


Hadoop?

Select an answer

A It provides support for transactions.

B Hadoop data can be accessed through SQL statements.


C It supports materialized views.

D Queries perform much faster than with MapReduce.

Question: 9
After the following sequence of commands are executed:

create 'table1', 'columnfamily1', 'columnfamily2', 'columnfamily3'


put 'table1', 'row1', 'columnfamily1:c11', 'r1v11'
put 'table1', 'row1', 'columnfamily1:c12', 'r1v12'
put 'table1', 'row1', 'columnfamily2:c21', 'r1v21'

put 'table1', 'row1', 'columnfamily3:c31', 'r1v31'


put 'table1', 'row2', 'columnfamily1:d11', 'r2v11'
put 'table1', 'row2', 'columnfamily1:d12', 'r2v12'
put 'table1', 'row2', 'columnfamily2:d21', 'r2v21'

What value will the count 'table_1' command


return?

C 4

Which Big Data technology delivers real-time


analytic processing on data in motion?

Select an answer
A Platform computing

B Stream computing

C Data Warehouse

D Hadoop

Which of the four key Big Data Use


Cases is used to lower risk and detect
fraud?

Select an answer

A Security/Intelligence Extension (correct)

B Data Warehouse Augmentation

C Big Data Exploration

D Operations Analysis

Which statement is true about where the output of


MAP task is stored?

Select an answer

A It is stored in HDFS, but only one copy on the local machine.

B It is stored on the local disk.

C It is stored in memory.

D It is stored in HDFS using the number of copies specified by replication
factor.

Under IBM's Text Analytics framework, what


programming language is used to write rules that
can extract information from unstructured text
sources?
Select an answer

A InfoSphere Streams

B Avro

C AQL

D Hive

Which type of client interface does Hive support?

Select an answer

A Jet

B SOAP

C RPC

D JDBC
Which of the four characteristics of Big Data deals
with trusting data sources?

Select an answer

A Variety

B Veracity

C Velocity

D Volume

Which file allows you to update Hive configuration?

Select an answer

A hive-conf.xml

B hive.conf

C hive-site.xml
D hive-env.config

IBM Text Analytics is embedded into the key


components of which IBM solution?

Select an answer

A InfoSphere BigInsights

B Rational

C Eclipse

D Hadoop

What is the primary reason that InfoSphere Streams


is able to process data at the rate of terabytes per
second?

Select an answer
A It utilizes the computing power of IBM i.

B Infiniband is used for high-speed storage access.

C Data is processed in memory.

D Data is retrieved from PCI-e SSD storage.

Which two file actions can HDFS complete? (Choose


two.)
(Please select ALL that apply)
create and delete

A describe

B read

C get

D scan

E list
Which file system spans all nodes in a Hadoop
cluster?

Select an answer

A HDFS

B XFS

C NTFS

D EXT4
Which well known Big Data source has become the
most popular source for business analytics?

Select an answer

A GPS

B RFID

C social networks

D cell phones
In 2003, IBM's System S was the first prototype of a
new IBM Stream Computing solution that performed
which type of processing?

Select an answer

A Real Time Analytic Processing (mktouba f graph)

B Complex Event Processing

C On Line Analytic Processing

D On Line Transaction Processing

Which file allows you to update Hive configuration?


Select an answer

A hive-site.xml

B hive-env.config

C hive-conf.xml

D hive.conf
The Enterprise Edition of BigInsights offers which
feature not available in the Standard Edition?

Select an answer

A Eclipse Tooling

B Adaptive MapReduce

C Dashboards

D Big SQL
What do the query languages Hive and Big SQL
have in common?

Select an answer

A Both use Ansi-SQL.

B Both are data flow languages.

C Both can act on complex data structures.

D Both require schema.

What is a primary reason that business users want


to use BigSheets?

Select an answer
A It executes MapReduce functions more efficiently than the common
query languages.

B It offers better performance than the common query languages.

C It provides a non-technical method to analyze Big Data.

D It includes a very easy-to-use command-line interface.

Which two Hadoop query languages do not require data


to have a schema? (Choose two.)
Pig
Jaql ??? Big SQL (probably)
Which BigSheets component applies a schema to
the underlying data at runtime?

Select an answer
A Filter

B Reader

C Crawler

D Data Qualifier

Which statement is true about storage of the output
of REDUCE task?

Select an answer

A It is stored on the local disk.

B It is stored in memory.

C It is stored in HDFS, but only one copy on the local machine.

D It is stored in HDFS using the number of copies specified by replication factor.
What happens if a child task hangs in a MapReduce
job?

Select an answer

A JobTracker reschedules the task on another machine.

B Child JVM reschedules the task on another machine.


C TaskTracker restarts the job.

D JobTracker fails the job.

What is IBM's SQL interface to InfoSphere
BigInsights?

Select an answer

A Hive

B HBase

C Pig

D Big SQL
What is the default data block size in BigInsights?

Select an answer

A 16 MB

B 32 MB

C 64 MB

D 128 MB
What is the process in MapReduce that moves all
data from one key to the same worker node?

Select an answer

A Shuffle

B Split

C Reduce

D Map
In the master/slave architecture, what is considered
the slave?

DATANODE

Under the MapReduce architecture, how does a
JobTracker detect the failure of a TaskTracker?

Select an answer

A receives no heartbeat

B detects that NameNode is offline

C receives report from MapReduce

D receives report from Child JVM

Submit Test

Question: 43
Which capability of Jaql gives it a significant
advantage over other query languages?

Select an answer

A It can load data from HDFS.

B It supports the HiveQL query language.

C It can handle deeply nested, semi-structured data.

D It provides a built-in command-line shell.

Which AQL statement can be used to create
components of a Text Analytics extractor?

A create view <view_name> as <select or extract statement>;

What is the default number of replicas in
HDFS replication?

Select an answer

A 2

B 3

C 4

D 5

Which BigInsights tool is used to access the
BigInsights Applications Catalog?

A Eclipse Console

B Web Console

C Eclipse Plug-in

D Application Wizard

In Hadoop's rack-aware replica placement, what is
the correct default block node placement?

Select an answer

A 1 block in 1 rack, 2 blocks in a second rack

Which of the four key Big Data Use Cases is used to
lower risk and detect fraud?

Select an answer

A Security/Intelligence Extension
Which file formats can Jaql read?

Select an answer

A JSON, Avro, Delimited

B HTML, Avro

C JSON, MS Word document

D Binary, Delimited, Portable Document Format (PDF)

Which statement is true about the data model used
by HBase?

Select an answer

A The table schema only defines column families.

B Schema defines a fixed number of columns per row.


C A cell is specified by array, instance and version.

D Data is stored in hierarchical arrays.

What is the open-source implementation of
BigTable, Google's extremely scalable storage
system?

Select an answer

A Cloud SQL

B HBase

C MySQL

D PigeonRank
Where does HDFS store the file system namespace
and properties?

Select an answer

A DataNode

B hdfs.conf

C Hive

D FsImage

What does GPFS offer that HDFS does not?

Select an answer

A MapReduce split on one local disk

B separate clusters for analytics

C POSIX compliance
D single point of failure

High availability was added to which part of the
HDFS file system in HDFS 2.0 to prevent loss of
metadata?

Select an answer

A NameNode

B CheckPoint

C Blockpool

D DataNode

In Hadoop's rack-aware replica placement, what is
the correct default block node placement?

Select an answer
A 1 block in 1 rack, 2 blocks in a second rack

B 2 blocks in 1 rack, 2 blocks in a second rack

C 3 blocks in same rack

D 4 blocks in 4 separate racks

Which administrative console feature of BigInsights
is a visualization and analysis tool designed to work
with structured and unstructured data?

Select an answer

A BigSheets
B MapReduce

C Text Analytics

D BigR

Which Hadoop query language was
developed by Yahoo to handle almost
any type of data?

Select an answer
A BigSql

B Pig

C Jaql

D Hive

Which capability of Jaql gives it a significant
advantage over other query languages?

Select an answer

A It supports the HiveQL query language.


B It can load data from HDFS.

C It can handle deeply nested, semi-structured data.

D It provides a built-in command-line shell.

Note: the following questions are likely to reappear on the rattrapage (resit) exam.

Rattrapage (resit exam)
Which database is a columnar storage database?

Hbase

Which database provides a SQL for Hadoop interface?

Hive

Which Apache project provides coordination of resources?

Zookeeper
What is ZooKeeper's role in the Hadoop infrastructure?
Manage the coordination between HBase servers
Hadoop and MapReduce uses ZooKeeper to aid in high availability of Resource Manager
Flume uses ZooKeeper for configuration purposes in recent releases
Through what HDP component are Kerberos, Knox, and Ranger managed?
Ambari
Which security component is used to provide peripheral security?
Apache Knox
What are the components of Hortonworks Data Flow(HDF)?
Flow management
Stream processing
Enterprise services
What main features does IBM Streams provide as a Streaming Data Platform?
(Please select the THREE that apply)
Analysis and visualization
Rich data connections
Development support

What are the 4Vs of Big Data?(Please select the FOUR that apply)

Veracity
Velocity
Variety
Volume
What are the three types of Big Data?(Please select the THREE that apply)

Semi-structured
Structured
Unstructured
Select all the components of HDP which provides data access capabilities
Pig
MapReduce
Hive
Select the components that provide the capability to move data from a relational database into
Hadoop.
Sqoop
Kafka
Flume
Managing Hadoop clusters can be accomplished using which component?
Ambari
Which Hadoop functionalities does Ambari provide?
Manage
Provision
Integrate
Monitor
Which page from the Ambari UI allows you to check the versions of the software installed on your
cluster?
The Admin > Manage Ambari page
What is the default number of replicas in a Hadoop system?
3
The Job Tracker in MR1 is replaced by which component(s) in YARN?
ResourceManager
ApplicationMaster
What are the benefits of using Spark?(Please select the THREE that apply)
Generality
Ease of use
Speed
Which database is a columnar storage database?
Hbase
Which database provides a SQL for Hadoop interface?
Hive

What are the languages supported by Spark?(Please select the THREE that apply)

Python
Java
Scala
Which Apache project provides coordination of resources?
Zookeeper

What would you need to do in a Spark application that you would not need to do in a Spark shell to
start using Spark?
Import the necessary libraries to load the SparkContext
What are the most important computer languages for Data Analytics?
(Please select the THREE that apply)
scala
R
Python
What are the two ways you can work with Big SQL?
(Please select the TWO that apply)
JSqsh
Web tooling from DSM
What is one of the reasons to use Big SQL?
Want to access your Hadoop data without using MapReduce
Which file storage format has the highest performance?
Parquet
What are the two ways to classify functions?
Built-in functions
User-defined functions
Which data type is BOOLEAN defined as in a Big SQL database?
SMALLINT
Which Big SQL authentication mode is designed to provide strong authentication for client/server
applications by using secret-key cryptography?
Kerberos

You need to define a server to act as the medium between an application and a data source in a Big
SQL federation. Which command would you use?
CREATE SERVER

Which tool should you use to enable Kerberos security?
Ambari
Which of the following is a data encoding format is a compact, binary format that supports
interoperability with multiple programming languages and versioning?
Avro

When sharing a notebook, what will always point to the most recent version of the notebook?
A. The permalink

Which file format has the highest performance?
Parquet
Which three main areas make up Data Science according to Drew Conway?
Math and statistics knowledge
Hacking skills
Substantive expertise
For what are interactive notebooks used by data scientists?
Quick data exploration tasks
Which two options can be used to start and stop Big SQL?
Ambari web interface
Command line
What can be used to surround a multi-line string in a Python code cell by appearing before and after
the multi-line string?
Triple quotes (""")
Which command is used to populate a Big SQL table?
load
Which areas of expertise are attributed to a data scientist
A. Data modeling
D. Machine learning
Which statement describes a sequence file?
The data is not human readable.
Which command would you run to make a remote table accessible using an alias?
CREATE NICKNAME
Which data type can cause significant performance degradation and should be avoided?
STRING
What is the native programming language for Spark?
A. Scala
Which of the following is a data encoding format is a compact, binary format that supports
interoperability with multiple programming languages and versioning?
Avro
Which type of function promotes code re-use and reduces query complexity?
User-Defined
Which two of the following data sources are currently supported by Big SQL?
Teradata
Oracle
Where must a Spark configuration be set up first?
IBM CLOUD
You need to create a table that is not managed by the Big SQL database manager. Which keyword
would you use to create the table?
EXTERNAL
Which visualization library is developed by IBM as an add-on to Python notebooks?
PixieDust
In Big SQL, what is used for table definitions, location, and storage format of input files?
Hive MetaStore
Who can access your data or notebooks in your Watson Studio project?
Collaborators
You need to monitor and manage data security across a Hadoop platform. Which tool would you use?
Apache Ranger
The Big SQL head node has a set of processes running. What is the name of the service ID running
these processes?
bigsql
Under the MapReduce v1 architecture, which element of the system manages the map and reduce
functions?
TaskTracker
What is meant by data at rest?
A data file that is not changing.
What is the name of the Scala programming feature that provides functions with no names?
Lambda functions
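The same idea exists in Python; a minimal illustration of an anonymous (lambda) function, here passed inline to map() as is typical:

```python
# A lambda is a function with no name of its own; it is usually
# passed directly to functions such as map, filter, or sorted.
squares = list(map(lambda x: x * x, [1, 2, 3]))
print(squares)  # [1, 4, 9]
```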
Who can control Watson Studio project assets?
Editors
When creating a Watson Studio project, what do you need to specify?
Select an answer
A. Data service
B. Data assets
C. Collaborators
Under the MapReduce v1 architecture, which function is performed by the JobTracker?
Accepts MapReduce jobs submitted by clients.

What must surround LaTeX code so that it appears on its own line in a Jupyter notebook?
$$
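For example, in a Jupyter Markdown cell a display equation is fenced by double dollar signs:

```latex
$$
E = mc^2
$$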
Which file format contains human-readable data where the column values are separated by a
comma?
A. Delimited
Which component of the Apache Ambari architecture stores the cluster configurations?
C. Postgres RDBMS
Which type of foundation does Big SQL build on?
Apache HIVE
Which Spark RDD operation returns values after performing the evaluations?
actions
You can import preinstalled libraries if you are using which languages?
B. R
C. Python
Which statement is true about Spark's Resilient Distributed Dataset (RDD)?
It is a distributed collection of elements that are parallelized across the cluster.
In a Hadoop cluster, which two are the result of adding more nodes to the cluster?
C. It increases available processing power.
E. It adds capacity to the file system.
Under the MapReduce v1 architecture, which function is performed by the JobTracker?
▪ Accepts MapReduce jobs submitted by clients
What OS command starts the ZooKeeper command-line interface?
zkCli.sh
Which two are examples of personally identifiable information (PII)?
A. Medical record number
C. Email address
Which component of the Spark Unified Stack supports learning algorithms such as, logistic
regression, naive Bayes classification, and SVM?
B. MLlib
Which statement describes "Big Data" as it is used in the modern business world?
Select an answer
C. Non-conventional methods used by businesses and organizations to capture, manage, process,
and make sense of a large volume of data.
Which component of the HDFS architecture manages the file system namespace and metadata?
namenode
Which Hortonworks Data Platform (HDP) component provides a common web user interface for
applications running on a Hadoop cluster?
B. Ambari
Which feature allows application developers to easily use the Ambari interface to integrate Hadoop
provisioning, management, and monitoring capabilities into their own applications?
B. REST APIs
What are three examples of "Data Exhaust"?
A. log files
D. cookies
F. browser cache
Which is the primary advantage of using column-based data formats over record-based formats?
A. faster query execution
What is Hortonworks DataPlane Services (DPS) used for?
Select an answer
D. Manage, secure, and govern data stored across all storage environments​.
Which two of the following are column-based data encoding formats?
C. ORC
D. Parquet
What Python package has support for linear algebra, optimization, mathematical integration, and
statistics?
SciPy
When sharing a notebook, what will always point to the most recent version of the notebook?
The permalink

Under the MapReduce v1 architecture, which element of MapReduce controls job execution on
multiple slaves?
JobTracker
Which environmental variable needs to be set to properly start ZooKeeper?
ZOOKEEPER_HOME
Which statement describes the action performed by HDFS when data is written to the Hadoop
cluster?
D. The data is spread out and replicated across the cluster.
Which Spark RDD operation creates a directed acyclic graph through lazy evaluations?
Transformations
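A plain-Python analogy of lazy transformations versus eager actions (this is not actual Spark code, just a sketch of the evaluation model):

```python
# Lazy "transformation": building the generator does no work yet,
# much as Spark transformations only record a step in the DAG.
doubled = (x * 2 for x in range(1, 6))

# Eager "action": consuming the generator forces evaluation,
# like calling collect() or count() on an RDD.
result = list(doubled)
print(result)  # [2, 4, 6, 8, 10]
```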
Which feature allows the bigsql user to securely access data in Hadoop on behalf of another user?
Impersonation
Which component of the Spark Unified Stack provides processing of data arriving at the system in
real-time?
C. Spark Streaming
How does MapReduce use ZooKeeper?
Aid in the high availability of Resource Manager.
Which statement describes the purpose of Ambari?
B. It is used for provisioning, managing, and monitoring Hadoop clusters.

Under the MapReduce v1 architecture, which function is performed by the TaskTracker?
Select an answer
A. Manages storage and transmission of intermediate output.
Which of the "Five V's" of Big Data describes the real purpose of deriving business insight from Big
Data
A. Value
What is an authentication mechanism in Hortonworks Data Platform?
Kerberos

What is meant by data at rest?
A data file that is not changing.
What is the term for the process of converting data from one "raw" format to another format making it
more appropriate and valuable for a variety of downstream purposes such as analytics and that
allows for efficient consumption of the data?
A. Data munging
What ZK CLI command is used to list all the ZNodes at the top level of the ZooKeeper hierarchy, in
the ZooKeeper command-line interface?
ls /
Under the MapReduce v1 programming model, what happens in the "Map" step?
Input is processed as individual splits.
What is the Hortonworks DataFlow package used for?
Data stream management and processing.
Who can access your data or notebooks in your Watson Studio project?
Select an answer
A. Collaborators

Which two Spark libraries provide a native shell?
A. Python
D. Scala

What is Hortonworks DataPlane Services (DPS) used for?
D. Manage, secure, and govern data stored across all storage environments.
What is the default number of rows Sqoop will export per transaction?
A. 100
Which two of the following are row-based data encoding formats?
Avro
Csv
Under the HDFS storage model, what is the default method of replication?
A. 3 replicas, 2 on the same rack, 1 on a different rack

Which IBM Big Data solution provides real-time analytic processing on data in motion?
InfoSphere Streams
Which type of partitioning is supported by Hive?
value partitioning
Which type of application can be published when using the BigInsights Application Publish wizard in
Eclipse?
BigSheets
Which command is used for starting all the BigInsights components?
start-all.sh
Which two commands are used to copy a data file from the local file system to HDFS? (Choose two.)
hadoop fs -copyFromLocal test_file test_file
hadoop fs -put test_file test_file
Under IBM's Text Analytics framework, what programming language is used to write rules that can
extract information from unstructured text sources?
AQL
Which Hive tool is used to view and manipulate
table metadata?
Hive shell
Which Hadoop query language can manipulate semi-structured data and support the JSON format?
Jaql
A user enters following command:
hadoop fs -ls /mydir/test_file
and receives the following output:
rw-r--r-- 3 biadmin supergroup 714002200 2014-
11-21 14:21 /mydir/test_file
What does the 3 indicate?
expected replication factor for this file
When creating a master workbook in BigSheets, what is responsible for formatting the data output in
the workbook?
Reader
Which part of the BigInsights web console, provides information that would help a user troubleshoot
and diagnose a failed application job?
Application Status tab
What is the primary benefit of using Hive with
Hadoop?
Hadoop data can be accessed through SQL statements.
After the following sequence of commands are executed:
create 'table1', 'columnfamily1', 'columnfamily2', 'columnfamily3'
put 'table1', 'row1', 'columnfamily1:c11', 'r1v11'
put 'table1', 'row1', 'columnfamily1:c12', 'r1v12'
put 'table1', 'row1', 'columnfamily2:c21', 'r1v21'
put 'table1', 'row1', 'columnfamily3:c31', 'r1v31'
put 'table1', 'row2', 'columnfamily1:d11', 'r2v11'
put 'table1', 'row2', 'columnfamily1:d12', 'r2v12'
put 'table1', 'row2', 'columnfamily2:d21', 'r2v21'
What value will the count 'table1' command return?
2
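The put commands above can be mirrored with a plain-Python map-of-maps (a sketch of HBase's data model, not HBase API code) to see why count returns the number of rows rather than the number of cells:

```python
# HBase stores a sorted map of row key -> "family:qualifier" -> value;
# a nested dict is a rough stand-in for that structure.
table1 = {}
puts = [
    ("row1", "columnfamily1:c11", "r1v11"),
    ("row1", "columnfamily1:c12", "r1v12"),
    ("row1", "columnfamily2:c21", "r1v21"),
    ("row1", "columnfamily3:c31", "r1v31"),
    ("row2", "columnfamily1:d11", "r2v11"),
    ("row2", "columnfamily1:d12", "r2v12"),
    ("row2", "columnfamily2:d21", "r2v21"),
]
for row, column, value in puts:
    table1.setdefault(row, {})[column] = value

# 'count' counts rows, not cells, so the result is 2.
print(len(table1))  # 2
```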
Which Big Data technology delivers real-time analytic processing on data in motion?
Stream computing
Which of the four key Big Data Use Cases is used to lower risk and detect fraud?
Security/Intelligence Extension
Which statement is true about where the output of MAP task is stored?
It is stored on the local disk.
Under IBM's Text Analytics framework, what programming language is used to write rules that can
extract information from unstructured text sources?
AQL
Which type of client interface does Hive support?
JDBC
Which of the four characteristics of Big Data deals with trusting data sources?
Veracity
Which file allows you to update Hive configuration?
hive-site.xml
IBM Text Analytics is embedded into the key components of which IBM solution?
InfoSphere BigInsights
What is the primary reason that InfoSphere Streams is able to process data at the rate of terabytes
per second?
Data is processed in memory.
Which file system spans all nodes in a Hadoop Cluster?
HDFS
Which well known Big Data source has become the most popular source for business analytics?
social networks

In 2003, IBM's System S was the first prototype of a new IBM Stream Computing solution that
performed which type of processing?
Real Time Analytic Processing
The Enterprise Edition of BigInsights offers which feature not available in the Standard Edition?
Adaptive MapReduce
What do the query languages Hive and Big SQL have in common?
Both require schema.
What is a primary reason that business users want to use BigSheets?
It provides a non-technical method to analyze Big Data.
Which two Hadoop query languages do not require data to have a schema? (Choose two.)
pig
jaql
Which BigSheets component applies a schema to the underlying data at runtime?
Reader
Which statement is true about storage of the output of REDUCE task?
It is stored in HDFS using the number of copies specified by replication factor.
What happens if a child task hangs in a MapReduce job?
JobTracker reschedules the task on another machine.
What is IBM's SQL interface to InfoSphere BigInsights?
Big SQL
What is the default data block size in BigInsights?
128 MB
What is the process in MapReduce that moves all data from one key to the same worker node?
Shuffle
In the master/slave architecture, what is considered the slave?
DataNode
Under the MapReduce architecture, how does a JobTracker detect the failure of a TaskTracker?
receives no heartbeat
Which capability of Jaql gives it a significant advantage over other query languages?
It can handle deeply nested, semi-structured data.
What is the default number of replicas in HDFS replication?
3
Which AQL statement can be used to create components of a Text Analytics extractor?
create view <view_name> as <select or extract statement>;
Which BigInsights tool is used to access the BigInsights Applications Catalog?
Web Console
In Hadoop's rack-aware replica placement, what is the correct default block node placement?
1 block in 1 rack, 2 blocks in a second rack
Which file formats can Jaql read?
JSON, Avro, Delimited
Which statement is true about the data model used by HBase?
The table schema only defines column families.
What is the open-source implementation of BigTable, Google's extremely scalable storage system?
HBase
Where does HDFS store the file system namespace and properties?
FsImage
What does GPFS offer that HDFS does not?
POSIX compliance
High availability was added to which part of the HDFS file system in HDFS 2.0 to prevent loss of
metadata?
NameNode
Which administrative console feature of BigInsights is a visualization and analysis tool designed to
work with structured and unstructured data?
BigSheets
Which Hadoop query language was developed by Yahoo to handle almost any type of data?
PIG
Which two file actions can HDFS complete? (Choose two.)
get
list
Which BigSheets component presents a spreadsheetlike representation of data
Workbook
which component of IBM Watson forms the foundation of the framework and allows Waston to extract
and index data from any source
search engine
Which development environment can be used to develop programs for Text Analytics?
Eclipse
​Your company wants to utilize text analytics to gain an understanding of general public opinion
concerning company products. Which type of input source would you analyze for that information?
Twitter feeds
Which BigInsights feature helps to extract information from text data?
Text Analytics Engine
Which class of software can store and manage only structured data?
Data Warehouse
What is the process in MapReduce that moves all data from one key to the same worker node?
Shuffle
Which two pre-requisites must be fulfilled when running a Java MapReduce program on the Cluster,
using Eclipse? (Choose two.) (Please select ALL that apply)
C Hadoop services must be running.
D BigInsights services must be running.
Which two commands are used to retrieve data from an HBase table? (Choose two.) (Please select
ALL that apply)
scan
get
You have been asked to create an HBase table and populate it with all the sales transactions,
generated in the company in the last quarter. Currently, these transactions reside in a 300 MB tab
delimited file in HDFS. What is the most efficient way for you to accomplish this task?
pre-create regions by specifying splits in create table command and bulk loading the data
What are two main components of a Java MapReduce job? (Choose two.) (Please select ALL that
apply)
A Mapper class which should extend org.apache.hadoop.mapreduce.Mapper class
D Reducer class which should extend org.apache.hadoop.mapreduce.Reducer class
Which element(s) must be specified when creating an HBase table?
only the table name and column family(s)
Hadoop is the primary software tool used for which class of computing?
C Big Data Analytics
Which BigSheets component presents a spreadsheet-like representation of data?
Workbook
Which two technologies form the foundation of Hadoop? (Choose two.) (Please select ALL that apply)
B MapReduce
C HDFS
Which tool is included as part of the IBM BigInsights Eclipse development environment?
code generator
Which class of software can store and manage only structured data?
C Data Warehouse
Why does Big SQL perform better than Hive?
It uses sub-queries.
Which BigInsights feature helps to extract information from text data?
C Text Analytics Engine

BigSQL has which advantage over Hive?
It supports standard SQL statements.
What is the name of the interface that allows Hive to read in data from a table, and write it back out to
HDFS in any custom format?
SerDe
What is one of the primary reasons that Hive is often used with Hadoop?
B MapReduce is difficult to use.
Which built-in Hive storage file format improves performance by providing semi-columnar data
storage with good compression?
D RCFILE
When the following HIVE command is executed: LOAD DATA INPATH '/tmp/department.del'
OVERWRITE INTO TABLE department;
What happens?
D The department.del file is copied from the HDFS /tmp directory to the location corresponding to the
Hive table.
Which Advanced Analytics toolkit in InfoSphere Streams is used for developing and building
predictive models?
C SPSS
Which method is used by Jaql to operate on large arrays?
D Parallelization
Which of the four characteristics of Big Data indicates that many data formats can be stored and
analyzed in Hadoop?
Variety
Given the following array: data = [ { from: 101, to: 102, msg: "Hello" }, { from: 103, to: 104, msg:
"World!" }, { from: 105, to: 106, msg: "Hello World" } ];
And the following example of expected output: [ { "message": "Hello" } ]
What is the correct sequence of JAQL commands to select only the message text from sender 101?
D data -> filter $.from == 101 -> transform { message: $.msg };
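An equivalent of this filter-and-transform pipeline written in plain Python, shown only to clarify what the Jaql query computes:

```python
# The same records as Python dicts; filter keeps sender 101,
# and the comprehension "transforms" each match to {"message": ...}.
data = [
    {"from": 101, "to": 102, "msg": "Hello"},
    {"from": 103, "to": 104, "msg": "World!"},
    {"from": 105, "to": 106, "msg": "Hello World"},
]
result = [{"message": d["msg"]} for d in data if d["from"] == 101]
print(result)  # [{'message': 'Hello'}]
```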
Which JSON-based query language was developed by IBM and donated to the open-source Hadoop
community?
A Jaql
BigInsights offers which added value to Hadoop?
enhanced web-based UI and tools
Which feature of Jaql provides native functions and modules that allow you to build re-usable
packages?
C extensibility
For scalability purposes, data in an HDFS cluster is broken down into what default block size?
C 64 MB
Which two Hadoop features make it very cost effective for Big Data analytics? (Choose two.) (Please
select ALL that apply)
processes large data sets
runs on commodity hardware
In HDFS 2.0, how much RAM is required in a Namenode for one million blocks of data?
A 1 GB
Which statement describes Hadoop?
an open source software framework used to manage large volumes of unstructured, semi-structured,
and structured data
Which statement describes the data model used by HBase?
A a multidimensional sorted map
What is the primary, outstanding feature offered by HBase?
C It handles sharding automatically
What is the company/organization that developed HBase?
Apache
Which statement is true about the following command? hadoop dfsadmin -report:
B It displays basic file system information and statistics
A user enters following command: hadoop fs -ls /mydir/test_file
and receives the following output:
rw-r--r-- 3 biadmin supergroup 714002200 2014-11-21 14:21 /mydir/test_file
What does the 3 indicate?
C the number of times blocks in file were replicated
In the context of a Text Analytics project, which set of AQL commands will identify and extract any
matching person names, when run across an input data source?
create dictionary NamesDict as ('John', 'Mary', 'Eric', 'Eva');
create view Names as extract dictionary 'NamesDict' on R.text as match from Document R;
output view Names;
In the master/slave architecture, what is considered the slave?
DataNode
Your company wants to utilize text analytics to gain an understanding of general public opinion
concerning company products. Which type of input source would you analyze for that information?
Select an answer
A Twitter feeds
What is the correct sequence of the three main steps in Text Analytics?
Select an answer
A index text; categorize subjects; parse the data
What does NameNode use as a transaction log to persistently record changes to file system
metadata?
EditLog
Under the MapReduce architecture, when a line of data is split between two blocks, which class will
read over the split to the end of the line?
LineRecordReader
Which class in MapReduce takes each record and transforms it into a <key, value> pair?
B InputSplitter
Which MapReduce task is responsible for reading a portion of input data and producing <key, value>
pairs?
Map
Which part of the MapReduce engine controls job execution on multiple slaves?
C JobTracker
Which component of IBM Watson Explorer forms the foundation of the framework and allows Watson
to extract and index data from any source?
Select an answer
A search engine
Which Hadoop query language can manipulate semi-structured data and support the JSON format?
Jaql
What is the name of the Hadoop-based query language developed by Facebook that facilitates
SQL-like queries?
Hive
When the following HIVE command is executed: LOAD DATA INPATH '/tmp/department.del'
OVERWRITE INTO TABLE department;
What happens?
D The department.del file is copied from the HDFS /tmp directory to the location corresponding to the
Hive table.
Which two file actions can HDFS complete? (Choose two.) (Please select ALL that apply)
Index
Delete
When using BigInsights Eclipse to develop a new application, what must be done prior to testing the
application in the cluster?
configure runtime properties
Which element of the Big Data Platform can cost effectively store and manage many petabytes of
structured and unstructured information?
Hadoop System
Which IBM Business Analytics solution facilitates collaboration and unifies disparate data across
multiple systems into a single access point?
IBM Watson Explorer
What happens when a user runs a workbook in BigSheets?
Results on the real data are computed and the output is explored.
Hadoop is designed for which type of work?
batch-oriented parallel processing
What is the primary benefit of using Hive with Hadoop?
Hadoop data can be accessed through SQL statements.

Which type of cell can be used to document and comment on a process in a Jupyter
notebook?
B.Markdown

What is the architecture of Watson Studio centered on?
B. Projects

You need to add a collaborator to your project. What do you need?
D. The email address of the collaborator

Where does the unstructured data of a project reside in Watson Studio?
A. Object Storage
Which Watson Studio offering used to be available through something known as IBM Bluemix?
B. Watson Studio Cloud

Which two are attributes of streaming data?
B. Sent in high volume.
C. Requires extremely rapid processing.
Which statement is true about MapReduce v1 APIs?
C.MapReduce v1 APIs are implemented by applications which are largely independent of the
execution environment.
Which is the java class prefix for the MapReduce v1 APIs?
D. org.apache.hadoop.mapred
Which statement is true about Hortonworks Data Platform (HDP)?
c.It is a Hadoop distribution based on a centralized architecture with YARN at its core.

Which Hadoop ecosystem tool can import data into a Hadoop cluster from a DB2, MySQL, or
other databases?
D.sqoop
Under the MapReduce v1 programming model, which shows the proper order of the full set
of MapReduce phases?
Map -> Combine -> Shuffle -> Reduce
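The phase order can be illustrated with a toy word count in plain Python (an analogy of the programming model, not Hadoop API code):

```python
from collections import defaultdict

# Two small "input splits" to count words in.
docs = ["big data", "big sql"]

# Map: emit (key, value) pairs from each input split.
pairs = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group all values for the same key together
# (in Hadoop this is what routes one key to one reducer).
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# Reduce: aggregate the values for each key.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'big': 2, 'data': 1, 'sql': 1}
```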
​How can a Sqoop invocation be constrained to only run one mapper?
A.Use the -m 1 parameter.

Which three programming languages are directly supported by Apache Spark?
Scala
Java
Python
What two security functions does Apache Knox provide?
b.Proxying services.
C.API and perimeter security.
Which two factors in a Hadoop cluster increase performance most significantly?
A.high-speed networking between nodes
F.parallel reading of large data files

Which Apache Hadoop application provides an SQL-like interface to allow abstraction of data
on semi-structured data in a Hadoop datastore?
D.Hive

Which component of the Apache Ambari architecture integrates with an organization's LDAP
or Active Directory service?
D.Authorization Provider
Under the YARN/MRv2 framework, the Scheduler and ApplicationsManager are components
of which daemon?
B.ResourceManager

Which computing technology provides Hadoop's high performance?


A.Parallel Processing
What is the name of the Hadoop-related Apache project that utilizes an in-memory
architecture to run applications faster than MapReduce?
A.Spark

Which component of the Apache Ambari architecture provides statistical data to the
dashboard about the performance of a Hadoop cluster?
B.Ambari Metrics System

What are two primary limitations of MapReduce v1?


C.Resource utilization
E.Scalability

Which component of a Hadoop system is the primary cause of poor performance?


B.disk latency
Which component of the Hortonworks Data Platform (HDP) is the architectural center of
Hadoop and provides resource management and a central platform for Hadoop applications?
B.YARN

Which three are a part of the Five Pillars of Security?


B.Data Protection
C.Administration
D.Audit

Which statement about Apache Spark is true?


D.It is much faster than MapReduce for complex applications on disk

Apache Spark can run on which two of the following cluster managers?
B.Apache Mesos
C.Hadoop YARN

What are two ways the command-line parameters for a Sqoop invocation can be simplified?
C.Place the commands in a file.
D.Include the --options-file command line argument.

What is an example of a Key-value type of NoSQL datastore?


D.REDIS

What are two security features Apache Ranger provides?


A.Auditing
C.Authorization

What is the preferred replacement for Flume?


C.Hortonworks Data Flow

Which Apache Hadoop component can potentially replace an RDBMS as a large Hadoop
datastore and is particularly good for "sparse data"?
C.HBASE

What is an example of a NoSQL datastore of the "Document Store" type?


D.MONGODB
If a Hadoop node goes down, which Ambari component will notify the Administrator?
A.Ambari Alert Framework

Which two are valid watches for ZNodes in ZooKeeper?


C.NodeChildrenChanged
D.NodeDeleted

Which hardware feature on a Hadoop datanode is recommended for cost-efficient performance?
B.JBOD

Which feature makes Apache Spark much easier to use than MapReduce?
B.Libraries that support SQL queries.

Which statement describes an example of an application using streaming data?


D.An application evaluating sensor data in real-time.

Which Spark Core function provides the main element of Spark API?
A.RDD
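The key property of an RDD is that transformations (map, filter) are lazy and only an action (collect, count) triggers evaluation. A toy class can illustrate the shape of that API; real code would use pyspark, and everything below is an illustrative sketch, not Spark itself.

```python
class ToyRDD:
    """Minimal imitation of the RDD pattern: record transformations,
    evaluate only when an action is called."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []          # recorded transformations, not yet run

    def map(self, fn):                 # transformation: returns a new "RDD"
        return ToyRDD(self._data, self._ops + [("map", fn)])

    def filter(self, fn):              # transformation
        return ToyRDD(self._data, self._ops + [("filter", fn)])

    def collect(self):                 # action: evaluation happens here
        items = iter(self._data)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def count(self):                   # action
        return len(self.collect())

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())   # [0, 4, 16, 36, 64]
print(rdd.count())     # 5
```

This lazy build-then-evaluate structure is what lets Spark plan a whole chain of operations before touching the data, and it is why "actions" is the answer to the question above about which operation returns values after performing the evaluations.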

Which description characterizes a function provided by Apache Ambari?


C.A wizard for installing Hadoop services on host servers.

Which tool would you use to create a connection to your Big SQL database?
A.DSM

Which Big SQL feature allows users to join a Hadoop data set to data in external databases?
C.Fluid query

Which statement best describes a Big SQL database table?


B.A directory with zero or more data files

Which command creates a user-defined schema function?


D.CREATE FUNCTION

When connecting to an external database in a federation, you need to use the correct
database driver and protocol. What is this federation component called in Big SQL?
D.Wrapper

What are Big SQL database tables organized into?


A.Schemas

You need to determine the permission setting for a new schema directory. Which tool would
you use?
A.umask

What is an advantage of the ORC file format?


b.Efficient compression

What is the default directory in HDFS where tables are stored?


c./apps/hive/warehouse/

Which definition best describes RCAC?


A.It limits the rows or columns returned based on certain criteria

Using the Java SQL Shell, which command will connect to a database called mybigdata?
D./jsqsh mybigdata

You need to enable impersonation. Which two properties in the bigsql-conf.xml file need to
be marked true?
C.bigsql.alltables.io.doAs
E.bigsql.impersonation.create.table.grant.public

You are creating a new table and need to format it with parquet. Which partial SQL statement
would create the table in parquet format?
A.STORED AS parquetfile
You have a distributed file system (DFS) and need to set permissions on the /hive/warehouse
directory to allow access to ONLY the bigsql user. Which command would you run?
D.hdfs dfs -chmod 700 /hive/warehouse
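A note on the mode in that answer: 700 is an octal triple (owner/group/other), and standard POSIX semantics say it grants read, write, and execute to the owner only. This can be checked with Python's stat module; nothing here is Hadoop-specific.

```python
import stat

# Octal mode 700 = rwx for the owner, no access for group or others —
# which is why `chmod 700` restricts the directory to its owner
# (the bigsql user, in the question above).
mode = 0o700
assert mode == stat.S_IRWXU                       # all three owner bits set
assert mode & (stat.S_IRWXG | stat.S_IRWXO) == 0  # no group/other bits

def rwx(bits):
    # Render one 3-bit permission group the way `ls -l` would.
    return "".join(ch if bits & flag else "-"
                   for ch, flag in zip("rwx", (4, 2, 1)))

# Owner / group / other triples, highest bits first:
perm = "".join(rwx((mode >> shift) & 7) for shift in (6, 3, 0))
print(perm)  # rwx------
```

The same decoding explains the 777 answer elsewhere in this set: all three groups get rwx, so every user can create a schema directory.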

Which two commands would you use to give or remove certain privileges to/from a user?
A.GRANT
E.REVOKE

Which is an advantage that Zeppelin holds over Jupyter?


A.Notebooks can be used by multiple people at the same time

What is a markdown cell used for in a data science notebook?


B.Documenting the computational process.

What does the user interface for Jupyter look like to a user?
C.App in web browser

What command is used to list the "magic" commands in Jupyter?


D.%lsmagic

Why might a data scientist need a particular kind of GPU (graphics processing unit)?
A.To perform certain data transformation quickly.

What are two common issues in distributed systems?


A.Partial failure of the nodes during execution
B.Finding a particular node within the cluster

Under the MapReduce v1 programming model, which optional phase is executed simultaneously
with the Shuffle phase?
C.Combiner
Hadoop 2 consists of which three open-source sub-projects maintained by the Apache
Software Foundation?
B.YARN
C.HDFS
D.MapReduce
Under the YARN/MRv2 framework, which daemon is tasked with negotiating with the
NodeManager(s) to execute and monitor tasks?
C.ApplicationMaster

Which data encoding format supports exact storage of all data in binary representations such
as VARBINARY columns?
C.SequenceFiles

Under the YARN/MRv2 framework, which daemon arbitrates the execution of tasks among
all the applications in the system?
C.ResourceManager

Which Apache Hadoop application provides a high-level programming language for data
transformation on unstructured data?
B.Pig

Under the MapReduce v1 programming model, what happens in a "Reduce" step?


D.Data is aggregated by worker nodes.

What are three IBM value-add components to the Hortonworks Data Platform (HDP)?
A.Big Match
C.Big Replicate
D.Big SQL

Which statement is true about the Combiner phase of the MapReduce architecture?
B.It reduces the amount of data that is sent to the Reducer task nodes.

Hadoop uses which two Google technologies as its foundation?


C.MapReduce
E.Google File System

Which component of the Spark Unified Stack allows developers to intermix structured
database queries with Spark's programming language?
D.Spark SQL
Which NoSQL datastore type began as an implementation of Google's BigTable that can store any
type of data and scale to many petabytes?
A.HBase
Which statement accurately describes how ZooKeeper works?
B.All servers keep a copy of the shared data in memory.

Apache Spark provides a single, unifying platform for which three of the following types of
operations?
B.batch processing
C.machine learning
E.graph operations

What is the final agent in a Flume chain named?


D.Collector

What does the split-by parameter tell Sqoop?


A.The column to use as the primary key.
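To illustrate what the split column enables: Sqoop queries the minimum and maximum of that column and divides the range into one slice per mapper, so each mapper imports a disjoint chunk. The sketch below is a hypothetical illustration of that range partitioning, not Sqoop's actual internals; the function name and bounds are made up.

```python
def split_ranges(lo, hi, num_mappers):
    """Return half-open (low, high) bounds, one contiguous slice per mapper,
    roughly as a split-by column divides work in a parallel import."""
    step = (hi - lo + num_mappers) // num_mappers   # ceiling division
    bounds = []
    start = lo
    while start <= hi:
        end = min(start + step, hi + 1)
        bounds.append((start, end))
        start = end
    return bounds

# e.g. if SELECT MIN(id), MAX(id) returned 1 and 100, with 4 mappers:
print(split_ranges(1, 100, 4))
# [(1, 26), (26, 51), (51, 76), (76, 101)]

# With -m 1 (one mapper) there is nothing to split — a single range:
print(split_ranges(1, 100, 1))
# [(1, 101)]
```

This also shows why the -m 1 question earlier has the answer it does: with a single mapper there is no parallel split, so no split column is needed at all.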

Under the YARN/MRv2 framework, the JobTracker functions are split into which two
daemons?
A.ApplicationMaster
E.ResourceManager
Which Apache Hadoop application provides an SQL-like interface to allow abstraction of data
on semi-structured data in a Hadoop datastore?
A.Hive

What is the first step in a data science pipeline?


A.Acquisition

What is a "magic" command used for in Jupyter?


A.Extending the core language with shortcuts

Which directory permissions need to be set to allow all users to create their own schema?
D.777

How many Big SQL management nodes do you need at minimum?


D.1

Before you create a Jupyter notebook in Watson Studio, which two items are necessary?
C.Project
D.Spark Instance
You need to add a collaborator to your project. What do you need?
D.The email address of the collaborator
Which primary computing bottleneck of modern computers is addressed by Hadoop?
disk latency
Which capability does IBM BigInsights add to enrich Hadoop?
Adaptive MapReduce
What is one of the two technologies that Hadoop uses as its foundation?
MapReduce
Which Hadoop-related project provides common utilities and libraries that support other
Hadoop sub projects?
Hadoop Common
What is one of the four characteristics of Big Data?
Volume
Which description identifies the real value of Big Data and Analytics?
gaining new insight through the capabilities of the world's interconnected intelligence
Which Big Data function improves the decision-making capabilities of organizations by
enabling the organizations to interpret and evaluate structured and unstructured data in
search of valuable business information?
analytics
Which type of Big Data analysis involves the processing of extremely large volumes of
constantly moving data that is impractical to store?
Stream Computing
Which statement is true about Hadoop Distributed File System (HDFS)?
Data is accessed through MapReduce
What is one function of the JobTracker in MapReduce?
keeps the work physically close to the data
What is a characteristic of IBM GPFS that distinguishes it from other distributed file systems?
posix compliance
In which step of a MapReduce job is the output stored on the local disk?
Map
To run a MapReduce job on the BigInsights cluster, which statement about the input file(s)
must be true?
The file(s) must be stored in HDFS or GPFS
Which command helps you create a directory called mydata on HDFS?
hadoop fs -mkdir mydata
Following the most common HDFS replica placement policy, when the replication factor is
three, how many replicas will be located on the local rack?
one
When running a MapReduce job from Eclipse, which BigInsights execution models are
available? (Select two.)
Cluster
Local
How are Pig and Jaql query languages similar?
Both are data flow languages.
Which command displays the sizes of files and directories contained in the given directory,
or the length of a file, in case it is just a file?
hadoop fs -du
Which statement is true regarding the number of mappers and reducers configured in a
cluster?
The number of mappers and reducers can be configured by modifying the mapred-site.xml
file.
What key feature does HDFS 2.0 provide that HDFS does not?
high availability of the NameNode
If you need to change the replication factor or increase the default storage block size, which
file do you need to modify?
hdfs-site.xml
What is one of the two driving principles of MapReduce?
spread data across a cluster of computers
Which statement represents a difference between Pig and Hive?
Pig uses Load, Transform, and Store.
In the MapReduce processing model, what is the main function performed by the
JobTracker?
coordinates the job execution
Under the HDFS architecture, what is one purpose of the NameNode?
to regulate client access to files
In addition to the high-level language Pig Latin, what is a primary component of the Apache
Pig platform?
runtime environment
What are two of the core operators that can be used in a Jaql query? (Select two.)
JOIN
TOP
Under the MapReduce programming model, which task is performed by the Reduce step?
Data is aggregated by worker nodes.
Which command should be used to list the contents of the root directory in HDFS?
hadoop fs -ls /
Which element of the MapReduce architecture runs map and reduce jobs?
TaskTracker
Which type of language is Pig?
data flow
Which Hive command is used to query a table?
SELECT
Which technology does Big SQL utilize for access to shared catalogs?
Hive metastore
In Hive, what is the difference between an external table and a Hive managed table?
An external table refers an existing location outside the warehouse directory.
Which is a use-case for Text Analytics?
sentiment analytics from social media blogs
Which utility provides a command-line interface for Hive?
Hive shell
What is an accurate description of HBase?
It is an open source implementation of Google's BigTable.
What drives the demand for Text Analytics?
Most of the world's data is in unstructured or semi-structured text.
Which statement about NoSQL is true?
It is a database technology that does not use the traditional relational model.
Which statement will make an AQL view have content displayed?
output view <view_name>
Which command can be used in Hive to list the tables available in a database/schema?
show tables
Which tool is used for developing a BigInsights Text Analytics extractor?
Eclipse with BigInsights tools for Eclipse plugin
You work for a hosting company that has data centers spread across North America. You
are trying to resolve a critical performance problem in which a large number of web servers
are performing far below expectations. You know that the information written to log files can
help determine the cause of the problem, but there is too much data to manage easily. Which
type of Big Data analysis is appropriate for this use case?
Text Analytics
What makes SQL access to Hadoop data difficult?
Data is in many formats.
Why develop SQL-based query languages that can access Hadoop data sets?
because the MapReduce Java API is sometimes difficult to use
What is the "scan" command used for in HBase?
to view data in an Hbase table
Which tool is used to access BigSheets?
Web Browser
In HBase, what is the "count" command used for?
to count the number of rows in a table
Which Hadoop-related technology provides a user-friendly interface, which enables business
users to easily analyze Big Data?
BigSheets
What is the most efficient way to load 700MB of data when you create a new HBase table?
Pre-create regions by specifying splits in create table command and bulk loading the data.
Which key benefit does NoSQL provide?
It can cost-effectively manage data sets too large for traditional RDBMS.
Which Hadoop-related technology supports analysis of large datasets stored in HDFS using
an SQL-like query language?
Hive
If you need to JOIN data from two workbooks, which operation should be performed
beforehand?
"Load" to create a new sheet with the other workbook data in the current workbook
The following sequence of commands is executed:
create 'table_1','column_family1','column_family2'
put 'table_1','row1','column_family1:c11','r1v11'
put 'table_1','row2','column_family1:c12','r1v12'
put 'table_1','row2','column_family2:c21','r1v21'
put 'table_1','row3','column_family1:d11','r1v11'
put 'table_1','row2','column_family1:d12','r1v12'
put 'table_1','row2','column_family2:d21','r1v21'
In HBase, which value will the "count 'table_1'" command return?
3
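The answer is 3 because an HBase put upserts a cell at (row, column) and count tallies distinct row keys, not cells. Modeling the table as a dict keyed by row makes that concrete; this is a sketch of the semantics, not the HBase client API.

```python
from collections import defaultdict

# Each row key maps to its cells; repeated puts to the same row just
# add or overwrite cells, they do not create new rows.
table = defaultdict(dict)

def put(row, column, value):
    table[row][column] = value

put("row1", "column_family1:c11", "r1v11")
put("row2", "column_family1:c12", "r1v12")
put("row2", "column_family2:c21", "r1v21")
put("row3", "column_family1:d11", "r1v11")
put("row2", "column_family1:d12", "r1v12")
put("row2", "column_family2:d21", "r1v21")

# count 'table_1' counts row keys: row1, row2, row3
print(len(table))  # 3
```

Six puts, but only three distinct row keys — row2 simply accumulated four cells.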
Which IBM Big Data solution provides low-latency analytics for processing data-in-motion?
InfoSphere Streams
What is one of the main components of Watson Explorer (InfoSphere Data Explorer)?
crawler
How can the applications published to BigInsights Web Console be made available for users
to execute?
They need to be deployed with proper privileges.
Which IBM tool enables BigInsights users to develop, test and publish BigInsights
applications?
Eclipse
IBM InfoSphere Streams is designed to accomplish which Big Data function?
analyze and react to data in motion before it is stored
Which component of Apache Hadoop is used for scheduling and running workflow jobs?
Oozie
Under the YARN/MRv2 framework, the Scheduler and Applications Manager are
components of which daemon?
ResourceManager
What two security functions does Apache Knox provide?

Proxying services.
API and perimeter security

What are two services provided by ZooKeeper?


Providing distributed synchronization.
Maintaining configuration information.
What is true about HBase?

An industry leading implementation of Google's Big Table design

An open source Apache Top Level Project


What is NOT true about ZooKeeper?
It is a tool for analyzing streaming data.
What Hive command is used for loading data from a local file to HDFS?
LOAD DATA LOCAL INPATH …
True or False: HBase region servers are responsible for serving and managing regions.
True
Which of the following statements is true about CREATE statement in HBase?
It only requires the name of table and one or more column families
Which command must be executed before deleting an HBase table or changing its settings?
disable
True or False: A Jaql query can be thought of as a stream of operations?
True
Steps in deploying your application on the cluster.
Access applications tab -> Manage published applications -> Locate new application ->
Deploy application -> Execute application
An element in a Workflow XML that invokes a program or script?
Action
Which operation allows reading a file using Jaql?
read(())
What is NOT true regarding Big SQL?
Incurs measurable overhead for the sake of resiliency
Which of the following tools can be used to access a Big SQL server?
All of the above
Which of the following data types are not supported by Big SQL?
byte
You can insert records into BIGSQL table mapped to an HBASE table
True
You can insert records into BIGSQL table mapped to a HIVE table
False
Which statement can be used to load data in a Big SQL table?
load hadoop
Which of the following is true about CREATE HADOOP TABLE statement?
It creates a Big SQL table and the data will be stored in HDFS/GPFS
Which of the following statements represents a difference between a Big SQL table and a
DB2 table?
When creating a Big SQL table, information on how the data is stored on disk must be
provided
What is JAQL?
Jaql is a query language initially designed for Javascript Object Notation (JSON) data format
An operation that can't be performed by JAQL on I/O Adapters?
localWrite()
The function that explicitly sends each element of an array to a mapper.
arrayRead()
What component of Apache Hadoop can be used for scheduling and running workflow jobs?
Oozie
Select a BigInsights execution mode for an application or script developed in Eclipse
Cluster
What step needs to be performed for an application published to BigInsights Web Console to
become available for users to run it?
It needs to be deployed with proper privileges
Which part of the BigInsights web console, provides information, that would help a user
troubleshoot and diagnose, a failed application job?
Application Status tab
What is BigSheets?
All of the above
Which is NOT true regarding the BigSheets workbook?
None of the above
Which 'Add Sheets' option helps calculate values by grouping the data in the workbook,
applying functions to each group, and carrying over data?
Pivot
Which command is used in HBase for inserting data in a table?
put
An element in a Workflow XML that invokes a program or script?
Action
Select a BigInsights execution mode for an application or script developed in Eclipse
Cluster
True or False: You can apply only a single operation to a sheet
True
Which statement describes a BigSheets Master Workbook feature?
It points to a file or table stored in Hadoop
An SQL style programming language for text mining or text extraction?
AQL
A list of entries containing domain specific terms.
Dictionary
Which is not a built-in scalar type of an AQL Data Model?
Char
True or False: Text analytics Java API is part of the Big Insights Text Analytics Components.
True
A data model scalar type that represents a bag of values.
List
In the context of AQL, which of the following is not a clause?
view
True or False: Text Analytics is a powerful information extraction system providing
multilingual support.
True
Line of code used to specify the Module Name at the beginning of an AQL file.
module <module_name>;
Which tool can be used for developing a Text Analytics Extractor?
Eclipse with BigInsights tools for Eclipse
What statement can be used to display the content of an AQL view?
output view <view_name>;
What big data characteristic is not covered by InfoSphere Streams?
Virtual
Which of the following Jaql operators helps projecting or retrieving a subset of columns or
fields from a data set?
transform
Which Jaql expression can be used to flatten nested arrays?
expand
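Jaql's expand flattens one level of nesting in an array of arrays, while transform projects fields, much like a per-element map. The Python below is an analogy for what expand does, not Jaql syntax.

```python
from itertools import chain

# expand takes [[1,2],[3],[4,5,6]] and yields one flat array —
# the same effect as chaining the inner arrays together.
nested = [[1, 2], [3], [4, 5, 6]]
flattened = list(chain.from_iterable(nested))
print(flattened)  # [1, 2, 3, 4, 5, 6]
```

Only one level is removed: a doubly nested array would need expand applied again.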
Which does not belong in the BigInsights Application types?
BigData
What step needs to be performed for an application published to BigInsights Web Console to
become available for users to run it?
It needs to be deployed with proper privileges
What is true regarding Streams Processing Language (SPL)?
All of the above
Which of the following statements is true regarding the Instance Graph available in
InfoSphere Streams:
The Instance Graph provides a graphical view of the application that’s currently running in an
instance; it is a very helpful tool for debugging Streams applications at runtime
In the context of InfoSphere Streams what is Streams Console?
Web based tool used to monitor and manage InfoSphere Streams instances and applications
What is true regarding Data Explorer?
Explorer's search combines content and data from many different systems throughout the
enterprise and presents it to users in a single view.
In the context of an enterprise search engine, what is called the process of retrieving
documents?
Crawling
What is a Search Collection?
It represents one or more information repositories and the online index created from them
when the Data Explorer search engine crawls those sources
What should be done when configuring a Data Explorer project so that the search results are
grouped by values of specific metadata, tags, or other parameters?
Add Binning to the Search Collection
Which tool can be used to create and configure a Data Explorer search project?
Data Explorer Engine administration tool
How do you manage BigInsights?
None of the above
What is NOT true about Hadoop Distributed File System (HDFS)?
Designed for random access not streaming reads
True or False: In HDFS, the data blocks are replicated to multiple nodes
True
What is NOT true about the NameNode startup?
NameNode stores data blocks
Hadoop command for copying files from local file system to HDFS:
hadoop fs -put
What is not a MapReduce Task?
Reader
True or False: Map Tasks need Key and Value pairs as input.
True
Which Java class is responsible for taking an HDFS file and transforming it into splits?
InputSplitter
Regarding task failure: if a child task fails, where does the child JVM report before it exits?
TaskTracker
What are the main components of a Java MapReduce program (Select 3)?
Map class
Reduce class
Driver or main class
What are two execution modes of a Java MapReduce program developed in Eclipse?
Cluster
Local

What are the breakpoints which can be set for a Java MapReduce program in Eclipse?
stop points to suspend the program execution for debugging purposes
What is NOT a similarity of Pig, Hive and Jaql?
Designed for random reads/writes or low latency queries
What is NOT true about Hive?
Designed for low latency queries, like RDBMS such as DB2 and Netezza
What is true regarding Hive External Tables?
data is stored outside the Hive warehouse directory
What Hive command list all the base tables and views in a database?
show tables
Which is not an ACID property?
Concurrency
You need to add a collaborator to your project. What do you need?
The email address of the collaborator
Where does the unstructured data of a project reside in Watson Studio?
Object Storage
Which Watson Studio offering used to be available through something known as IBM
Bluemix?
Watson Studio Cloud
What is the architecture of Watson Studio centered on?
Projects
Which type of cell can be used to document and comment on a process in a Jupyter
notebook?
Markdown
Which BigInsights tool is used to access the BigInsights Applications Catalog?
Web Console
Which Big Data function improves the decision-making capabilities of organizations by
enabling the organizations to interpret and evaluate structured and unstructured data in
search of valuable business information?
analytics
Which Hadoop-related technology provides a user-friendly interface, which enables business
users to easily analyze Big Data?
BigSheets
Which IBM Big Data solution provides low-latency analytics for processing data-in-motion?
InfoSphere Streams
- What is the correct sequence of the three main steps in Text Analytics?

Select an answer
A. index text: categorize subjects; parse the data
B. structure text; derive patterns; interpret output

C. sort tables; index text; derive patterns

D. categorize subjects; index columns; derive patterns

- What does NameNode use as a transaction log to persistently record changes to file system
metadata?
Select an answer

A. EditLog

B. CheckPoint

C. FSImage

D. DataNode
- Which statement is true about where the output of a Map task is stored?

A. It is stored in HDFS using the number of copies specified by replication factor.

B. It is stored in HDFS, but only one copy on the local machine.

C. It is stored in memory.

D. It is stored on the local disk.

- Which BigInsights tool is used to access the BigInsights Applications Catalog?

Select an answer

A. Eclipse Plug-in

B. Eclipse Console

C. Web Console

D. Application Wizard
- In Hadoop's rack-aware replica placement, what is the correct default block node placement?

Select an answer

A. 1 block in 1 rack, 2 blocks in a second rack

B. 2 blocks in 1 rack, 2 blocks in a second rack

C. 3 blocks in same rack

D. 4 blocks in 4 separate racks

- IBM Text Analytics is embedded into the key components of which IBM solution?

Select an answer

A. Rational

B. Eclipse

C. Hadoop

D. InfoSphere BigInsights

- Which of the four characteristics of Big Data indicates that many data formats can be stored and
analyzed in Hadoop?

A. Velocity
B. Volume
C. Volatility
D. Variety

- Which Advanced Analytics toolkit in InfoSphere Streams is used for developing and building
predictive models?

A. Time Series
B. CEP
C. Geospatial
D. SPSS
What is one of the four characteristics of Big Data?
Your answer

A volatility

B volume

C verifiability

D value

Question: 2
Which description identifies the real value of Big
Data and Analytics?
Your answer

A providing solutions to help customers manage and grow large database systems

B gaining new insight through the capabilities of the world's interconnected intelligence

C enabling customers to efficiently index and access large volumes of data

D using modern technology to efficiently store the massive amounts of data generated

Question: 3
What is one of the two technologies that Hadoop
uses as its foundation?
Your answer

A HBase

B Jaql

C Apache

D MapReduce

Question: 4
Which capability does IBM BigInsights add to enrich
Hadoop?
Your answer

A Parallel computing on commodity servers

B Adaptive MapReduce

C Jaql

D Fault tolerance through HDFS replication

Question: 5
Which type of Big Data analysis involves the
processing of extremely large volumes of constantly
moving data that is impractical to store?
Your answer

A MapReduce

B Stream Computing

C Text Analysis

D Federated Discovery and Navigation

Question: 6
Which Hadoop-related project provides common
utilities and libraries that support other Hadoop sub
projects?
Your answer

A BigTable

B Hadoop HBase

C MapReduce

D Hadoop Common

Question: 7
Which primary computing bottleneck of modern
computers is addressed by Hadoop?
Your answer

A limited disk capacity

B 64-bit architecture

C MIPS

D disk latency

Question: 8
Which Big Data function improves the decision-
making capabilities of organizations by enabling the
organizations to interpret and evaluate structured
and unstructured data in search of valuable
business information?
Your answer

A data warehousing

B distributed file system

C stream computing

D analytics

Question: 9
Under the HDFS architecture, what is one purpose of
the NameNode?
Your answer

A to periodically report status to DataNode

B to coordinate MapReduce jobs

C to regulate client access to files

D to manage storage attached to nodes

Question: 10
In addition to the high-level language Pig Latin, what
is a primary component of the Apache Pig platform?
Your answer

A an RDBMS such as DB2 or MySQL

B platform-specific SQL libraries

C built-in UDFs and indexing

D runtime environment

Question: 11
Which statement represents a difference between
Pig and Hive?
Your answer

A Pig has a shell interface for executing commands.

B Pig uses Load, Transform, and Store.

C Pig is used for creating MapReduce programs.

D Pig is not designed for random reads/writes or low-latency queries.

Question: 12
What are two of the core operators that can be used
in a Jaql query? (Select two.)
Your answer

A LOAD

B TOP

C JOIN

D SELECT
Question: 13
In the MapReduce processing model, what is the
main function performed by the JobTracker?
Your answer

A executes the map and reduce functions

B coordinates the job execution

C copies Job Resources to the shared file system

D assigns tasks to each cluster node

Question: 14
Under the MapReduce programming model, which
task is performed by the Reduce step?
Your answer

A Worker nodes process individual data segments in parallel.

B Data is aggregated by worker nodes.

C Worker nodes store results in the local file system.

D Input data is split into smaller pieces.

Question: 15
Following the most common HDFS replica
placement policy, when the replication factor is
three, how many replicas will be located on the local
rack?
Your answer

A two

B three

C none

D one

Question: 16
Which command helps you create a directory called
mydata on HDFS?
Your answer

A hadoop fs -mkdir mydata

B hadoop fs -dir mydata

C mkdir mydata

D hdfs -dir mydata

Question: 17
What is one function of the JobTracker in
MapReduce?
Your answer

A keeps the work physically close to the data

B runs map and reduce tasks



C reports status of DataNodes

D manages storage

Question: 18
Which element of the MapReduce architecture runs
map and reduce jobs?
Your answer

A TaskTracker

B JobScheduler

C Reducer

D JobTracker

Question: 19
What is a characteristic of IBM GPFS that
distinguishes it from other distributed file systems?
Your answer

A no single point of failure

B blocks that are stored on different nodes



C posix compliance

D operating system independence

Question: 20
Which type of language is Pig?
Your answer

A object oriented

B SQL-like

C data flow

D compiled language

Question: 21
Which statement is true regarding the number of
mappers and reducers configured in a cluster?
Your answer

A The number of mappers must be equal to the number of nodes in a cluster.

B The number of mappers and reducers can be configured by modifying the mapred-site.xml file.

C The number of reducers is always equal to the number of mappers.

D The number of mappers and reducers is decided by the NameNode.

Question: 22
Which statement is true about Hadoop Distributed
File System (HDFS)?
Your answer

A Data can be processed over long distances without a decrease in performance.

B Data is accessed through MapReduce.

C Data can be created, updated and deleted.

D Data is designed for random access read/write.

Question: 23
What is one of the two driving principles of
MapReduce?
Your answer

A increase storage capacity through advanced compression algorithms

B provide a platform for highly efficient transaction processing


Your answer

C spread data across a cluster of computers

D provide structure to unstructured or semi-structured data

Question: 24
How are Pig and Jaql query languages similar?
Your answer

A Both are data flow languages.

B Both use Jaql query language.

C Both require schema.

D Both are developed primarily by IBM.

Question: 25
When running a MapReduce job from Eclipse, which
BigInsights execution models are available? (Select
two.)
Your answer

A Remote

B Cluster

C Local

D Distributed

E Debugging

Question: 26
If you need to change the replication factor or
increase the default storage block size, which file do
you need to modify?
Your answer

A hadoop.conf

B hdfs-site.xml

C hadoop-configuration.xml

D hdfs.conf

Question: 27
Which command displays the sizes of files and
directories contained in the given directory, or the
length of a file, in case it is just a file?
Your answer

A hdfs -du
Your answer

B hadoop fs -du

C hdfs fs size

D hadoop size

Question: 28
In which step of a MapReduce job is the output
stored on the local disk?
Your answer

A Combine

B Reduce

C Shuffle

D Map

Question: 29
Which command should be used to list the contents
of the root directory in HDFS?
Your answer

A hadoop fs -ls /
Your answer

B hadoop fs list

C hdfs root

D hdfs list /
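For reference, the two HDFS commands asked about above (Questions 27 and 29) can be run together in one short session; the path is illustrative and a running Hadoop cluster is assumed:

```shell
# List the contents of the HDFS root directory (Question 29)
hadoop fs -ls /

# Report the size of each file and directory under a path, or the length
# of a single file (Question 27); /user/biadmin is a made-up example path
hadoop fs -du /user/biadmin
```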

Question: 30
What key feature does HDFS 2.0 provide that HDFS
does not?
Your answer

A data access performed by an RDBMS

B random access to data in the cluster

C high availability of the NameNode

D a high throughput, shared file system

Question: 31
To run a MapReduce job on the BigInsights cluster,
which statement about the input file(s) must be true?
Your answer

A The file(s) must be stored on the local file system where the map reduce job was developed.
Your answer

B The file(s) must be stored on the JobTracker.

C No matter where the input files are, they will be automatically copied to where the job runs.

D The file(s) must be stored in HDFS or GPFS.

Question: 32
Which is a use-case for Text Analytics?
Your answer

A product cost analysis from accounting systems

B managing customer information in a CRM database

C sentiment analytics from social media blogs

D health insurance cost/benefit analysis from payroll data

Question: 33
Which command can be used in Hive to list the
tables available in a database/schema?
Your answer

A show tables

B list tables
Your answer

C describe tables

D show all

Question: 34
Which Hive command is used to query a table?
Your answer

A GET

B TRANSFORM

C EXPAND

D SELECT

Question: 35
Why develop SQL-based query languages that can
access Hadoop data sets?
Your answer

A because the data stored in Hadoop is always structured

B because SQL enhances query performance

C because data stored in a Hadoop cluster lends itself to structured SQL queries
Your answer

D because the MapReduce Java API is sometimes difficult to use

Question: 36
What drives the demand for Text Analytics?
Your answer

A Data warehouses contain potentially valuable information.

B Text Analytics is the most common way to derive value from Big Data.

C Most of the world's data is in unstructured or semi-structured text.

D MapReduce is unable to process unstructured text.

Question: 37
Which tool is used to access BigSheets?
Your answer

A Web Browser

B BigSheets client

C Microsoft Excel

D Eclipse

Question: 38
What is an accurate description of HBase?
Your answer

A It is a data flow language for structured data based on Ansi-SQL.

B It is an open source implementation of Google's BigTable.

C It is a distributed file system that replicates data across a cluster.

D It is a database schema for unstructured Big Data.

Question: 39
In HBase, what is the "count" command used for?
Your answer

A to count the number of columns of a table

B to count the number of regions of a table

C to count the number of column families of a table

D to count the number of rows in a table

Question: 40
Which key benefit does NoSQL provide?
Your answer

A It allows customers to leverage high-end server platforms to manage Big Data.


Your answer

B It can cost-effectively manage data sets too large for traditional RDBMS.

C It allows Hadoop to apply the schema-on-ingest model to unstructured Big Data.

D It allows an RDBMS to maintain referential integrity on a Hadoop data set.

Question: 41
Which utility provides a command-line interface for
Hive?
Your answer

A Hive SQL client

B Thrift client

C Hive Eclipse plugin

D Hive shell

Question: 42
Which Hadoop-related technology supports analysis
of large datasets stored in HDFS using an SQL-like
query language?
Your answer

A Jaql
Your answer

B HBase

C Pig

D Hive

Question: 43
Which statement will make an AQL view have
content displayed?
Your answer

A return view <view_name>

B export view <view_name>

C output view <view_name>

D display view <view_name>

Question: 44
What makes SQL access to Hadoop data difficult?
Your answer

A Data is in many formats.

B Hadoop requires pre-defined schema.


Your answer

C Hadoop data is highly structured.

D Data is located on a distributed file system.

Question: 45
The following sequence of commands is executed:
create 'table_1','column_family1','column_family2'
put 'table_1','row1','column_family1:c11','r1v11'
put 'table_1','row2','column_family1:c12','r1v12'
put 'table_1','row2','column_family2:c21','r1v21'
put 'table_1','row3','column_family1:d11','r1v11'
put 'table_1','row2','column_family1:d12','r1v12'
put 'table_1','row2','column_family2:d21','r1v21'
In HBase, which value will the "count 'table_1'"
command return?
Your answer

A 6

B 4

C 2

D 3
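Why the answer is 3: in HBase, "count" returns the number of rows (distinct row keys), not the number of put operations or cells. The six puts above touch only row1, row2, and row3. The distinct-key arithmetic can be checked locally with ordinary shell tools (no HBase required):

```shell
# Row keys written by the six put commands; counting distinct keys
# mirrors what "count 'table_1'" reports.
printf 'row1\nrow2\nrow2\nrow3\nrow2\nrow2\n' | sort -u | wc -l | tr -d ' '
# prints 3
```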

Question: 46
In Hive, what is the difference between an external
table and a Hive managed table?
Your answer

A An external table refers to the data stored on the local file system.

B An external table refers to the data from a remote database.

C An external table refers to a table that cannot be dropped.

D An external table refers to an existing location outside the warehouse directory.
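A minimal HiveQL sketch of the distinction (the table names and the HDFS path are hypothetical):

```sql
-- Managed table: data lives under the Hive warehouse directory and is
-- deleted when the table is dropped.
CREATE TABLE sales_managed (id INT, amount DOUBLE);

-- External table: points at an existing location outside the warehouse
-- directory; DROP TABLE removes only the metadata, not the files.
CREATE EXTERNAL TABLE sales_ext (id INT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/sales';
```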

Question: 47
If you need to JOIN data from two workbooks, which
operation should be performed beforehand?
Your answer

A "Group" to bring together the two workbooks

B "Copy" to create a new sheet with the other workbook data in the current workbook

C "Add" to add the other workbook data to the current workbook

D "Load" to create a new sheet with the other workbook data in the current workbook

Question: 48
Which statement about NoSQL is true?
Your answer

A It is based on the highly scalable Google Compute Engine.


Your answer

B It is an IBM project designed to enable DB2 to manage Big Data.

C It is a database technology that does not use the traditional relational model.

D It provides all the capabilities of an RDBMS plus the ability to manage Big Data.

Question: 49
Which technology does Big SQL utilize for access to
shared catalogs?
Your answer

A RDBMS

B Hive metastore

C HCatalog

D MapReduce

Question: 50
Which Hadoop-related technology provides a user-
friendly interface, which enables business users to
easily analyze Big Data?
Your answer

A Avro
Your answer

B HBase

C BigSQL

D BigSheets

Question: 51
You work for a hosting company that has data
centers spread across North America. You are trying
to resolve a critical performance problem in which a
large number of web servers are performing far
below expectations. You know that the information
written to log files can help determine the cause of
the problem, but there is too much data to manage
easily. Which type of Big Data analysis is
appropriate for this use case?
Your answer

A Data Warehousing

B Stream Computing

C Temporal Analysis

D Text Analytics

Question: 52
What is the "scan" command used for in HBase?
Your answer

A to list all tables in Hbase

B to view data in an Hbase table

C to get detailed information about the table

D to report any inconsistencies in the database

Question: 53
Which tool is used for developing a BigInsights Text
Analytics extractor?
Your answer

A BigInsights Console with AQL plugin

B AQL command line

C Eclipse with BigInsights tools for Eclipse plugin

D AQLBuilder

Question: 54
What is the most efficient way to load 700MB of data
when you create a new HBase table?
Your answer
Your answer

A Pre-create regions by specifying splits in create table command and use the insert command to load the data.

B Pre-create the column families when creating the table and bulk loading the data.

C Pre-create regions by specifying splits in create table command and bulk loading the data.

D Pre-create the column families when creating the table and use the put command to load the data.

Question: 55
Which IBM tool enables BigInsights users to
develop, test and publish BigInsights applications?
Your answer

A Eclipse

B Avro

C BigInsights Applications Catalog

D HBase

Question: 56
IBM InfoSphere Streams is designed to accomplish
which Big Data function?
Your answer
Your answer

A execute ad-hoc queries against a Hadoop-based data warehouse

B analyze and react to data in motion before it is stored

C find and analyze historical stream data stored on disk

D analyze and summarize product sentiments posted to social media

Question: 57
How can the applications published to BigInsights
Web Console be made available for users to
execute?
Your answer

A They need to be linked with the master application.

B They need to be marked as "Shared."

C They need to be deployed with proper privileges.

D They need to be copied under the user home directory.

Question: 58
Which component of Apache Hadoop is used for
scheduling and running workflow jobs?
Your answer
Your answer

A Task Launcher

B Jaql

C Eclipse

D Oozie

Question: 59
Which IBM Big Data solution provides low-latency
analytics for processing data-in-motion?
Your answer

A InfoSphere Streams

B InfoSphere BigInsights

C PureData for Analytics

D InfoSphere Information Server

Question: 60
What is one of the main components of Watson
Explorer (InfoSphere Data Explorer)?
Your answer
Your answer

A replicater

B crawler

C compressor

D validater
You have completed the test for Unit 1. Big Data Overview (ltu47599).
You scored 100%. Your score has been recorded.

Q1. Which of the following is not a characteristic of Big Data?


Your Answer: Virtual
Correct Answer: Virtual

Q2. Which of the following statements is not true about the 5 key Big Data Use Cases?
Your Answer: Enhanced 360° View of the customer extends existing customer views by incorporating internal data sources only.
Correct Answer: Enhanced 360° View of the customer extends existing customer views by incorporating internal data sources only.

Q3. Which of the following Big Data Platform components can cost effectively analyze petabytes of structured and unstructured data at rest?
Your Answer: Hadoop System
Correct Answer: Hadoop System
You have completed the test for Unit 2. Hadoop big data analysis tool (ltu47600).
You scored 66%. Your score has been recorded.

Q1. What is Hadoop?


Your Answer: An open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware.
Correct Answer: An open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware.

X Q2. What are the main components of Hadoop?


Your Answer: Map Reduce, HDFS, Hive
Correct Answer: Map Reduce, Hadoop Common, HDFS

Q3. How do you manage BigInsights?


Your Answer: None of the above
Correct Answer: None of the above
You have completed the test for Unit 3. HDFS (ltu47601).
You scored 100%. Your score has been recorded.

Q1. What is NOT true about Hadoop Distributed File System (HDFS)?
Your Answer: Designed for random access not streaming reads
Correct Answer: Designed for random access not streaming reads

✓ Q2. True or False: In HDFS, the data blocks are replicated to multiple nodes
Your Answer: True
Correct Answer: True

Q3. What is NOT true about the NameNode startup?


Your Answer: NameNode stores data blocks
Correct Answer: NameNode stores data blocks
You have completed the test for Exercise 1. HDFS Lab (ltu47552).
You scored 66%. Your score has been recorded.

Q1. HDFS command for reporting basic filesystem information and statistics
Your Answer: hadoop dfsadmin -report
Correct Answer: hadoop dfsadmin -report

✓ Q2. Select two methods for browsing files stored to HDFS in BigInsights
Your Answer: Command-Line approach using the format: hadoop fs -ls, BigInsights Web Console's Files Tab
Correct Answer: Command-Line approach using the format: hadoop fs -ls, BigInsights Web Console's Files Tab

X Q3. Hadoop command for copying files from local file system to HDFS:
Your Answer: hadoop fs -cp
Correct Answer: hadoop fs -put

You have completed the test for Unit 4. MapReduce (ltu47602).
You scored 100%. Your score has been recorded.

✓ Q1. What is not a MapReduce task?


Your Answer: Reader
Correct Answer: Reader

✓ Q2. True or False: Map Tasks need Key and Value pairs as input.
Your Answer: True
Correct Answer: True

✓ Q3. Which Java class is responsible for taking an HDFS file and transforming it into splits?
Your Answer: InputSplitter
Correct Answer: InputSplitter

Q4. Regarding task failure: if a child task fails, where does the child JVM report before it exits?
Your Answer: TaskTracker
Correct Answer: TaskTracker
You have completed the test for Exercise 2. MapReduce Lab (ltu47557).
You scored 66%. Your score has been recorded.

Q1. What are the main components of a Java MapReduce program? (Select 3)
Your Answer: Map class, Reduce class, Driver or main class
Correct Answer: Map class, Reduce class, Driver or main class

Q2. What are two execution modes of a Java MapReduce program developed in Eclipse
Your Answer: Cluster, Local
Correct Answer: Cluster, Local

Q3. What are the breakpoints which can be set for a Java MapReduce program in Eclipse
Your Answer: stop points for executing only a specified section of the program
Correct Answer: stop points to suspend the program execution for debugging purposes
You have completed the test for Unit 5. Hadoop Query Languages (ltu47603).
You scored 100%. Your score has been recorded.

Q1. What is NOT a similarity of Pig, Hive and Jaql?


Your Answer: Designed for random reads/writes or low latency queries
Correct Answer: Designed for random reads/writes or low latency queries

✓ Q2. What is NOT true about Hive?


Your Answer: Designed for low latency queries, like RDBMS such as DB2 and Netezza
Correct Answer: Designed for low latency queries, like RDBMS such as DB2 and Netezza

✓ Q3. True or False: A Jaql query can be thought of as a stream of operations.
Your Answer: True
Correct Answer: True

Q1. True or False: Data Warehouse augmentation is a very common use case for Hadoop.
Your Answer: True
Correct Answer: True

X Q2. The clause that indicates the storage file/record format on HDFS.
Your Answer: stored by
Correct Answer: stored as

X Q3. True or False: Hive comes with an HBase storage handler.


Your Answer: False
Correct Answer: True

Q4. A variable name that determines whether the map/reduce jobs should be submitted through a separate JVM in the non-local mode.
Your Answer: hive.exec.submitviachild
Correct Answer: hive.exec.submitviachild
You have completed the test for Exercise 3. Hive Lab (ltu47559).
You scored 100%. Your score has been recorded.

✓ Q1. What is true regarding Hive External Tables?


Your Answer: data is stored outside the Hive warehouse directory
Correct Answer: data is stored outside the Hive warehouse directory

✓ Q2. What Hive command lists all the base tables and views in a database?
Your Answer: show tables
Correct Answer: show tables

✓ Q3. What Hive command is used for loading data from a local file to HDFS?
Your Answer: LOAD DATA LOCAL INPATH ...
Correct Answer: LOAD DATA LOCAL INPATH ...
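The two lab commands fit together as below; the table name and file path are made up for illustration:

```sql
-- List base tables and views in the current database
SHOW TABLES;

-- Load a local file into a Hive table (the data is copied into HDFS)
LOAD DATA LOCAL INPATH '/tmp/sales.csv' INTO TABLE sales;
```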
You have completed the test for Unit 7. HBase (ltu47604).
You scored 50%. Your score has been recorded.

Q1. What is true about HBase?


Your Answer: All of the above
Correct Answer: All of the above

✓ Q2. Which is not an ACID property?


Your Answer: Concurrency
Correct Answer: Concurrency

X Q3. True or False: HBase region servers are responsible for serving and managing regions.
Your Answer: False
Correct Answer: True

X Q4. What is NOT true about ZooKeeper?


Your Answer: None of the above
Correct Answer: It is a tool for analyzing streaming data
You have completed the test for Exercise 4. HBase Lab (ltu47558).
You scored 100%. Your score has been recorded.

Q1. Which of the following statements is true about the CREATE statement in HBase
Your Answer: It only requires the name of table and one or more column families
Correct Answer: It only requires the name of table and one or more column families

✓ Q2. Which command must be executed before deleting an HBase table or changing its settings?
Your Answer: disable
Correct Answer: disable

✓ Q3. Which command is used in HBase for inserting data in a table?
Your Answer: put
Correct Answer: put
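The lab's command sequence, end to end, in the HBase shell (table and column names are illustrative; a running HBase instance is assumed):

```shell
hbase shell <<'EOF'
create 'demo_table', 'cf1'
put 'demo_table', 'row1', 'cf1:col1', 'value1'
disable 'demo_table'
drop 'demo_table'
EOF
```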

You have completed the test for Unit 8: Big SQL (ltu47606).
You scored 80%. Your score has been recorded.

Q1. What is NOT true regarding Big SQL?


Your Answer: Incurs measurable overhead for the sake of resiliency
Correct Answer: Incurs measurable overhead for the sake of resiliency

Q2. Which of the following tools can be used to access a Big SQL server?
Your Answer: All of the above
Correct Answer: All of the above

✓ Q3. Which of the following data types are not supported by Big SQL?
Your Answer: byte
Correct Answer: byte

Q4. You can insert records into a Big SQL table mapped to an HBase table
Your Answer: True
Correct Answer: True

Q5. You can insert records into a Big SQL table mapped to a Hive table
Your Answer: True
Correct Answer: False

You have completed the test for Exercise 5. Big SQL Lab (ltu47560).
You scored 66%. Your score has been recorded.

X Q1. Which statement can be used to load data into a Big SQL table?
Your Answer: load data
Correct Answer: load hadoop

Q2. Which of the following is true about the CREATE HADOOP TABLE statement
Your Answer: It creates a Big SQL table and the data will be stored in HDFS/GPFS
Correct Answer: It creates a Big SQL table and the data will be stored in HDFS/GPFS

Q3. Which of the following statements represents a difference between a Big SQL table and a DB2 table?
Your Answer: When creating a Big SQL table, information on how the data is stored on disk must be provided
Correct Answer: When creating a Big SQL table, information on how the data is stored on disk must be provided
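Putting the Big SQL answers together, a sketch of the statements involved (table name and path are hypothetical); note that, unlike DB2, the CREATE statement must describe the on-disk layout:

```sql
-- CREATE HADOOP TABLE: data will be stored in HDFS/GPFS, so the on-disk
-- format must be declared
CREATE HADOOP TABLE sales (id INT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- LOAD HADOOP is the statement used to populate a Big SQL table
LOAD HADOOP USING FILE URL '/tmp/sales.csv' INTO TABLE sales;
```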

You have completed the test for Unit 9. JAQL (ltu47607).
You scored 100%. Your score has been recorded.

✓ Q1. What is JAQL?


Your Answer: Jaql is a query language initially designed for JavaScript Object Notation (JSON) data format
Correct Answer: Jaql is a query language initially designed for JavaScript Object Notation (JSON) data format

✓ Q2. An operation that can't be performed by JAQL on I/O Adapters?


Your Answer: localWrite()
Correct Answer: localWrite()

Q3. The function that explicitly sends each element of an array to a mapper.
Your Answer: arrayRead()
Correct Answer: arrayRead()
You have completed the test for Exercise 6. JAQL Lab (ltu47561).
You scored 100%. Your score has been recorded.

✓ Q1. Which operation allows reading a file using Jaql?


Your Answer: read()
Correct Answer: read()

✓ Q2. Which of the following Jaql operators helps in projecting or retrieving a subset of columns or fields from a data set?
Your Answer: transform
Correct Answer: transform

✓ Q3. Which Jaql expression can be used to flatten nested arrays?


Your Answer: expand
Correct Answer: expand
You scored 50%. Your score has been recorded.

Q1. Which does not belong in the BigInsights application types?


Your Answer: BigData
Correct Answer: BigData

✓ Q2. What component of Apache Hadoop can be used for scheduling and running workflow jobs?
Your Answer: Oozie
Correct Answer: Oozie

X Q3. An element in a Workflow XML that invokes a program or script?


Your Answer: KR
Correct Answer: Action

X Q4. Steps in deploying your application on the cluster.


Your Answer: Access applications tab -> Manage published applications -> Locate new application -> Execute application -> Deploy application
Correct Answer: Access applications tab -> Manage published applications -> Locate new application -> Deploy application -> Execute application

You have completed the test for Exercise 7: Application Development Lab
(ltu47562). You scored 66%. Your score has been recorded.

Q1. Select a BigInsights execution mode for an application or script developed in Eclipse
Your Answer: Cluster
Correct Answer: Cluster

X Q2. What step needs to be performed for an application published to BigInsights Web Console to become available for users to run it?
Your Answer: It needs to be copied under user home directory
Correct Answer: It needs to be deployed with proper privileges

Q3. Which part of the BigInsights web console provides information that would help a user troubleshoot and diagnose a failed application job?
Your Answer: Application Status tab
Correct Answer: Application Status tab
You scored 100%. Your score has been recorded.

Q1. What is BigSheets?
Your Answer: All of the above
Correct Answer: All of the above

Q2. Which is NOT true regarding the BigSheets workbook?


Your Answer: None of the above
Correct Answer: None of the above

Q3. What is the 'Add Sheets' option that helps calculate values by grouping the data in the workbook, applying functions to each group and carrying over data?
Your Answer: Pivot
Correct Answer: Pivot

✓ Q4. Provides 10+ built-in functions to extract names, addresses, organizations, email and phone numbers
Your Answer: Text Analytics Integration
Correct Answer: Text Analytics Integration
You have completed the test for Exercise 8. Browser-based data analytics tool lab (ltu47563).
You scored 33%. Your score has been recorded.

✓ Q1. Which statement is NOT true about BigSheets Readers?


Your Answer: They are needed to visualise encrypted data
Correct Answer: They are needed to visualise encrypted data

X Q2. True or False: You can apply only a single operation to a sheet
Your Answer: False
Correct Answer: True

X Q3. Which statement describes a BigSheets Master Workbook feature?


Your Answer: Can be accessed only by its owner and it cannot be shared with other users
Correct Answer: It points to a file or table stored in Hadoop
You have completed the test for Unit 12. Text Analytics (ltu47610).
You scored 75%. Your score has been recorded.

✓ Q1. An SQL-style programming language for text mining or text extraction?
Your Answer: AQL
Correct Answer: AQL

X Q2. A list of entries containing domain specific terms.


Your Answer: None of the above
Correct Answer: Dictionary

✓ Q3. Which is not a built-in scalar type of an AQL Data Model?
Your Answer: Char
Correct Answer: Char

✓ Q4. True or False: Text Analytics Java API is part of the BigInsights Text Analytics components.
Your Answer: True
Correct Answer: True
You have completed the test for Unit 13. AQL Syntax (ltu47611).
You scored 75%. Your score has been recorded.

Q1. A data model scalar type that represents a bag of values.


Your Answer: List
Correct Answer: List

X Q2. In the context of AQL, which of the following is not a clause?


Your Answer: consolidate
Correct Answer: view

✓ Q3. True or False: Text Analytics is a powerful information extraction system providing multilingual support.
Your Answer: True
Correct Answer: True

✓ Q4. Line of code used to specify the module name at the beginning of an AQL file.
Your Answer: module
Correct Answer: module;
You have completed the test for Exercise 9. AQL Email Analysis Lab (ltu47564).
You scored 66%. Your score has been recorded.

Q1. Which tool can be used for developing a Text Analytics Extractor?
Your Answer: Eclipse with BigInsights tools for Eclipse
Correct Answer: Eclipse with BigInsights tools for Eclipse

Q2. Which of the following is an AQL statement that can be used to create a component of a Text Analytics extractor
Your Answer: create view <viewname> as <select or extract statement>;
Correct Answer: create view <viewname> as <select or extract statement>;

X Q3. What statement can be used to display the content of an AQL view?
Your Answer: show content view view_name;
Correct Answer: output view <view_name>;
You have completed the test for Unit 14. Streams (ltu47612).
You scored 100%. Your score has been recorded.

Q1. What big data characteristic is not covered by InfoSphere Streams?


Your Answer: Virtual
Correct Answer: Virtual

Q2. What is true regarding Streams Processing Language (SPL)?


Your Answer: All of the above
Correct Answer: All of the above

Q3. An Eclipse-based tool that enables developers to create, edit, visualize, test, debug, and run SPL and SPL mixed-mode applications
Your Answer: InfoSphere Streams Studio
Correct Answer: InfoSphere Streams Studio

Q4. SWS provides a web-based graphical user interface in Streams Console. What does SWS stand for?


Your Answer: Streams Web Service
Correct Answer: Streams Web Service

You have completed the test for Exercise 10: Streams Lab (ltu47565).
You scored 0%. Your score has been recorded.

X Q1. Which tool is used to develop Streams jobs and applications?


Your Answer: streamtool command-line tool
Correct Answer: Streams Studio

X Q2. Which of the following statements is true regarding the Instance Graph available in InfoSphere Streams:
Your Answer: The Instance Graph provides a graphical representation of the network topology of the cluster where Streams is installed and running.
Correct Answer: The Instance Graph provides a graphical view of the application that's currently running in an instance; it is a very helpful tool for debugging Streams applications at runtime

X Q3. In the context of InfoSphere Streams what is Streams Console?


Your Answer: Tool that integrates with InfoSphere Streams to audit the streaming data and applications
Correct Answer: Web based tool used to monitor and manage InfoSphere Streams instances and applications
Q1. What is true regarding Data Explorer?
Your Answer: Data Explorer's search combines content and data from many different systems throughout the enterprise and presents it to users in a single view.
Correct Answer: Data Explorer's search combines content and data from many different systems throughout the enterprise and presents it to users in a single view.

X Q2. In the context of an enterprise search engine, what is the process of retrieving documents called?
Your Answer: Clustering
Correct Answer: Crawling

X Q3. A component of Data Explorer Engine that processes raw data discovered by the crawler and produces 1+ pieces of index-able data.
Your Answer: Retrieving
Correct Answer: Converting

X Q4. True or False: Data Explorer's collaborative search enables saving and sharing of results and queries among users.
Your Answer: False
Correct Answer: True
You have completed the test for Exercise 11: Data Explorer Lab (ltu47566).
You scored 33%. Your score has been recorded.

X Q1. What is a Search Collection?


Your Answer: It represents a mechanism that can be defined within Data Explorer to take an input document in one format and convert it to another so that it can be indexed, reprocessed, or displayed
Correct Answer: It represents one or more information repositories and the online index created from them when the Data Explorer search engine crawls those sources

Q2. What should be done when configuring a Data Explorer project so that the search results are grouped by values of specific metadata, tags, or other parameters?
Your Answer: Add Binning to the Search Collection
Correct Answer: Add Binning to the Search Collection

X Q3. Which tool can be used to create and configure a Data Explorer search project?
Your Answer: Data Explorer Studio
Correct Answer: Data Explorer Engine administration tool
Question 29

Apache Spark can run on which two of the following cluster managers?

Hadoop YARN, Apache Mesos

Question 30

Which component of the Spark Unified Stack allows developers to intermix structured database
queries with Spark's programming language?

Spark SQL

Question 31

Under the MapReduce v1 programming model, which shows the proper order of the full set of
MapReduce phases?

Map -> Combine -> Shuffle -> Reduce
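The Map -> Combine -> Shuffle -> Reduce order can be mimicked with the classic word-count analogy using local shell tools: `tr` plays the Map phase (emit one key per word), `sort` plays the Shuffle phase (group identical keys), and `uniq -c` plays the Reduce phase (aggregate each group); a Combine step would simply pre-aggregate per mapper before the sort. This is only an analogy on one machine, not Hadoop itself:

```shell
# Map: one word (key) per line; Shuffle: sort groups identical keys;
# Reduce: uniq -c aggregates each group into a count
printf 'the cat sat on the mat\n' | tr -s ' ' '\n' | sort | uniq -c | sort -rn
```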

Question 32

Which description characterizes a function provided by Apache Ambari?

A wizard for installing Hadoop services on host servers.

Question 33

Which two are attributes of streaming data?

Requires extremely rapid processing.

Sent in high volume.

Question 34

Which component of the Apache Ambari architecture provides statistical data to the dashboard
about the performance of a Hadoop cluster?

Ambari Metrics System

Question 35

What is the preferred replacement for Flume?


Hortonworks Data Flow

Question 36

What is an example of a Key-value type of NoSQL datastore?

REDIS

Question 37

Which statement is true about the Combiner phase of the MapReduce architecture?

It reduces the amount of data that is sent to the Reducer task nodes.
Question 38

Which feature makes Apache Spark much easier to use than MapReduce?

Libraries that support SQL queries.

Question 39

Under the YARN/MRv2 framework, which daemon arbitrates the execution of tasks among all the
applications in the system?

ResourceManager

Question 40

Which Apache Hadoop application provides a high-level programming language for data
transformation on unstructured data?

Pig

Question 41

Under the YARN/MRv2 framework, which daemon is tasked with negotiating with the
NodeManager(s) to execute and monitor tasks?

ApplicationMaster

Question 42

What is an example of a NoSQL datastore of the "Document Store" type?

MongoDB

Question 43

What are three IBM value-add components to the Hortonworks Data Platform (HDP)?

Big SQL

Big Replicate

Big Match

Question 44

Which statement is true about Hortonworks Data Platform (HDP)?

It is a Hadoop distribution based on a centralized architecture with YARN at its core.

Question 45

What is the name of the Hadoop-related Apache project that utilizes an in-memory architecture to
run applications faster than MapReduce?

Spark
Question 46

Apache Spark provides a single, unifying platform for which three of the following types of
operations?

batch processing

machine learning

graph operations

Question 47

Which three programming languages are directly supported by Apache Spark?

Python

Java

Scala

Question 48

Which Apache Hadoop component can potentially replace an RDBMS as a large Hadoop datastore
and is particularly good for "sparse data"?

HBase

Question 49

Which statement is true about MapReduce v1 APIs?


MapReduce v1 APIs are implemented by applications which are largely independent of the
execution environment.

Question 50

Which statement about Apache Spark is true?

It is much faster than MapReduce for complex applications on disk.

Question 51

Under the MapReduce v1 programming model, which optional phase is executed simultaneously
with the Shuffle phase?

Combiner

Question 52

What are two ways the command-line parameters for a Sqoop invocation can be simplified?

Place the commands in a file.

Include the --options-file command line argument.
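A sketch of the options-file pattern (the connection string, file name, and table are hypothetical, and a Sqoop installation is assumed):

```shell
# Put the repetitive arguments in a file, one per line
cat > import-opts.txt <<'EOF'
import
--connect
jdbc:db2://dbserver:50000/SALESDB
--username
dbuser
EOF

# Reference the file with --options-file; extra arguments can still follow
sqoop --options-file import-opts.txt --table ORDERS
```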

Question 53

If a Hadoop node goes down, which Ambari component will notify the Administrator?

Ambari Alert Framework


Question 54

What two security functions does Apache Knox provide?

API and perimeter security.

Proxying services.

Question 55

What does the split-by parameter tell Sqoop?

The column to use as the primary key.
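For example (all names illustrative), --split-by names the column Sqoop partitions the import on, so each mapper pulls a distinct range of rows:

```shell
sqoop import \
  --connect jdbc:mysql://dbserver/sales \
  --table ORDERS \
  --split-by ORDER_ID \
  --num-mappers 4
```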

Question 56

Which Apache Hadoop application provides an SQL-like interface to allow abstraction of data on
semi-structured data in a Hadoop datastore?

Hive

Question 57

Which data encoding format supports exact storage of all data in binary representations such as
VARBINARY columns?

SequenceFiles

Question 58

Hadoop uses which two Google technologies as its foundation?

MapReduce

Google File System (GFS)

Question 59

Which is the java class prefix for the MapReduce v1 APIs?

org.apache.hadoop.mapred

Question 60

Which Hadoop ecosystem tool can import data into a Hadoop cluster from a DB2, MySQL, or other
databases?

Sqoop
Introduction to Big Data
*A tsunami of Big Data (huge volumes, data of different types and formats, impacting the business at new and ever-increasing speeds)
Big Data refers to non-conventional strategies and innovative technologies used
by businesses and organizations to capture, manage, process, and make sense of
a large volume of data
The classic dimensions of Big Data (the 4 Vs, plus Value)
Volume : the main characteristic of big data is its huge volume collected
through various sources

Variety : various formats ( unstructured and structured ) and sources

Velocity : the speed or frequency at which data is collected in various forms and from
different sources for processing

Veracity : filtering clean, relevant data out of big data in order to make accurate
decisions

Value : the value of the information procured is the whole purpose of big data: smart
decision making

Or even more…
• Volume - how much data is there?
• Velocity - how quickly is the data being created, moved, or accessed?
• Variety - how many different types of sources are there?
• Veracity - can we trust the data?
• Validity - is the data accurate and correct?
• Viability - is the data relevant to the use case at hand?
• Volatility - how often does the data change?
• Vulnerability - can we keep the data secure?
• Visualization - how can the data be presented to the user?
• Value - can this data produce a meaningful return on investment?

Types of Big Data


• Structured
Data that can be stored and processed in a
fixed format, aka schema
• Semi-structured
Data that does not have the formal structure of a data model, i.e. a table definition in a
relational DBMS, but nevertheless has some organizational properties, like tags and other
markers that separate semantic elements and make it easier to analyze, e.g. XML or JSON
• Unstructured
Data that has an unknown form, cannot be stored in an RDBMS, and cannot be analyzed
unless it is transformed into a structured format
Text files and multimedia content like images, audio, and video are examples of
unstructured data. Unstructured data is growing quicker than the other types; experts say that 80
percent of the data in an organization is unstructured
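The tag-based organization of semi-structured data can be seen in a small plain-Python sketch (the records and field names are hypothetical): two JSON records from the same feed need not share a schema, yet each can still be queried by its markers.

```python
import json

# Two "records" from the same feed: no fixed schema, but tags (keys)
# mark the semantic elements, which is what makes the data semi-structured.
records = [
    '{"id": 1, "name": "Ana", "tags": ["vip"]}',
    '{"id": 2, "name": "Bo", "email": "bo@example.com"}',
]

parsed = [json.loads(r) for r in records]

# Each record can be queried by key even though the set of keys varies.
common = set(parsed[0]) & set(parsed[1])
print(sorted(common))          # ['id', 'name']
print(parsed[1].get("email"))  # bo@example.com
```

A relational table would force both records into one column list; here the second record simply carries an extra field.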

Common Use Cases applied to Big Data


Extract / Transform / Load (ETL) , Text mining, Index building,Graph creation and
analysis,Pattern recognition,Collaborative filtering,Predictive models,Sentiment analysis,Risk
assessment…

Data at Rest: data has already arrived and is stored

Data in Motion: Streaming Data

The six design principles in Industry 4.0


• Interoperability
The ability of cyber-physical systems (i.e. workpiece carriers, assembly stations, and
products), humans and Smart Factories to connect/communicate with each other via the
Internet of Things (IoT) and the Internet of Services (IoS)
• Virtualization
A virtual copy of the Smart Factory which is created by linking sensor data (from
monitoring physical processes) with virtual plant models and simulation models
• Decentralization
The ability of cyber-physical systems within Smart Factories to make decisions on their
own
• Real-time Capability
The capability to collect and analyze data and provide the derived insights immediately
• Service Orientation
Offering of services (of cyber-physical systems, humans or Smart Factories) via the
Internet of Services
• Modularity
Flexible adaptation of Smart Factories to changing requirements by replacing or
expanding individual modules

Technology & tomorrow: Future directions (IEEE)


• Big Data
• Brain : dedicated to advancing technologies that improve the understanding of brain
function, developing new approaches to interfacing the brain with machines to augment
human-machine interaction and mitigate the effects of neurological disease and injury
• Cybersecurity Initiative : building long-standing and world-leading technical activities in
cybersecurity and privacy to actively engage, inform, and support members, organizations,
and communities involved in cybersecurity research, development, operations, policy, and
education.
• Digital Senses : dedicated to advancing technologies that capture and reproduce, or
synthesize the stimuli of various senses (sight, hearing, touch, smell, taste, etc.)
• Green ICT : (energy consumption, atmospheric emissions, e-waste, life cycle management)
• Internet of Things (IoT)
• Rebooting Computing
• Smart Cities
• Smart Materials
• Software Defined Networks
(SDN)
• Cloud Computing
• Life Sciences
• Smart Grid
• Transportation Electrification
Hadoop
Hadoop is a framework written in Java. It consists of 3 sub-projects:
MapReduce
Hadoop Distributed File System (aka HDFS)
Hadoop Common
• Has a large ecosystem with both open-source and proprietary Hadoop-related projects
HBase / ZooKeeper / Avro / etc.
Hadoop is optimized to handle massive amounts of data, which could be structured, unstructured, or
semi-structured, using commodity hardware, that is, relatively inexpensive computers.

Unit 2 : Hortonworks Data Platform (HDP)


HDP is a platform for data-at-rest: a secure, enterprise-ready open source Apache Hadoop
distribution based on a centralized architecture (YARN)

Data workflow
Sqoop is a tool for moving data between structured (relational) databases and the Hadoop
system. This works both ways: you can take data in your RDBMS and move it to HDFS, and
move it from HDFS back to an RDBMS
FLUME : Essentially, when you have large amounts of data, such as log files, that needs to be
moved from one location to another, Flume is the tool
Kafka : is a messaging system used for real-time data pipelines
Data Access
Hive is a data warehouse system built on top of Hadoop. Hive supports easy data
summarization, ad-hoc queries, and analysis of large data sets in Hadoop (Includes HCatalog)
Apache Pig is a platform for analyzing large data sets, Pig has its own language, called Pig
Latin, with a purpose of simplifying MapReduce programming. PigLatin is a simple scripting
language, once compiled, will become MapReduce jobs to run against Hadoop data.
HBase is a columnar datastore, which means that the data is organized in columns as opposed
to the traditional rows that a traditional RDBMS is based upon. HBase is modeled after Google's
BigTable and provides BigTable-like capabilities on top of Hadoop and HDFS. HBase is a
NoSQL datastore.
Accumulo is similar to HBase. You can think of Accumulo as a "highly secure HBase"
Phoenix enables online transaction processing and operational analytics in Hadoop for low
latency applications. Essentially, it is SQL for a NoSQL database
Storm is designed for fast, real-time computation. It is used to process large volumes of
high-velocity data, and is useful when milliseconds of latency matter and Spark isn't fast enough
Solr is a search platform built on the Apache Lucene Java search library. It is designed for
full-text indexing and searching
Spark : is a fast and general engine for large-scale data processing.
Druid is a datastore designed for business intelligence (OLAP) queries. Druid provides real-
time data ingestion, query, and fast aggregations. It integrates with Apache Hive to build
OLAP cubes and run sub-second queries.
Data Lifecycle and Governance
Falcon : is used for managing the data life cycle in Hadoop clusters
Atlas : It provides features for data classification, centralized auditing, centralized lineage,
and security and policy engine. It integrates with the whole enterprise data ecosystem.
Security
Ranger is used to control data security across the entire Hadoop platform. The Ranger console
can manage policies for access to files, folders, databases, tables and columns. The policies
can be set for individual users or groups.
Knox is a gateway for the Hadoop ecosystem. It provides perimeter level security for
Hadoop. You can think of Knox like the castle walls, where within walls is your Hadoop
cluster.
Operations
Ambari : For provisioning, managing, and monitoring Apache Hadoop clusters. Provides
intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs
Cloudbreak is a tool for managing clusters in the cloud. Cloudbreak is a Hortonworks project,
and is currently not a part of Apache. It automates the launch of clusters into various cloud
infrastructure platforms.
ZooKeeper provides a centralized service for maintaining configuration information, naming,
providing distributed synchronization and providing group services across your Hadoop
cluster
Oozie is a workflow scheduler system to manage Hadoop jobs. Oozie is integrated with the
rest of the Hadoop stack. Oozie workflow jobs are Directed Acyclical Graphs
(DAGs) of actions. At the heart of this is YARN.
Tools
Zeppelin is a web-based notebook designed for data scientists to easily and quickly
explore datasets through collaboration. Zeppelin allows for interaction and visualization of
large datasets.
Ambari views provide a built-in set of views for Hive, Pig, Tez, Capacity Scheduler, Files,
and HDFS, which allows developers to monitor and manage the cluster

IBM value-add components
• Big SQL : SQL processing engine for the Hadoop cluster

• Big Replicate : replicates data automatically with guaranteed consistency across Hadoop
clusters on any distribution, cloud object storage, and local and NFS-mounted file systems

• BigQuality : platform for data integration, data quality, and governance that is unified by a
common metadata layer and scalable architecture.

• BigIntegrate : is a big data integration solution that provides superior connectivity, fast
transformation and reliable, easy-to-use data delivery features that execute on the data nodes
of a Hadoop cluster.

• Big Match : probabilistic matching engine for analyzing customer data natively on Hadoop

Unit 4
What hardware is not used for Hadoop?
RAID, Linux Logical Volume Manager (LVM), Solid-state disk (SSD)
Parallel data processing is the answer
▪ GRID computing: spreads the processing load
▪ distributed workload: hard to manage applications, overhead on the developer
▪ parallel databases: Db2 DPF, Teradata, Netezza, etc. (distribute the data)
What is Hadoop?
Hadoop is an open source project of the Apache Foundation. ,It is a framework written in Java
originally , Hadoop uses Google's MapReduce and Google File System (GFS) technologies as
its foundation
• Consists of 4 sub projects:
▪ MapReduce
▪ Hadoop Distributed File System (HDFS)
▪ YARN
▪ Hadoop Common
• Supported by many Apache/Hadoop-related projects:
▪ HBase, ZooKeeper, Avro, etc.
Hadoop is not used for OLTP nor OLAP, but is used for big data, and it complements
these two to manage data. Hadoop is not a replacement for a RDBMS.
• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high
throughput access to application data.( The Hadoop Distributed File System (HDFS) is where
Hadoop stores its data. This file system spans all the nodes in a cluster. Effectively, HDFS
links together the data that resides on many local nodes, making the data part of one big file
system. You can use other file systems with Hadoop, but HDFS is quite common)
• Hadoop YARN: A framework for job scheduling and cluster resource
management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large
data sets.
• Ambari™: A web-based tool for provisioning, managing, and monitoring Hadoop
clusters. It also provides a dashboard for viewing cluster health and ability to
view MapReduce, Pig and Hive applications visually.
• Avro™: A data serialization system.
• Cassandra™: A scalable multi-master database with no single points of failure.
• Chukwa™: A data collection system for managing large distributed systems.
• HBase™: A scalable, distributed database that supports structured data
storage for large tables.
• Hive™: A data warehouse infrastructure that provides data summarization and
ad hoc querying.
• Mahout™: A Scalable machine learning and data mining library.
• Pig™: A high-level data-flow language and execution framework for parallel
computation.
• Spark™: A fast and general compute engine for Hadoop data. Spark provides a
simple and expressive programming model that supports a wide range of
applications, including ETL, machine learning, stream processing, and graph
computation.
• Tez™: A generalized data-flow programming framework, built on Hadoop
YARN, which provides a powerful and flexible engine to execute an arbitrary
DAG of tasks to process data for both batch and interactive use-cases.
• ZooKeeper™: A high-performance coordination service for distributed
applications.
Advantages and disadvantages of Hadoop
Hadoop is good for:
▪ processing massive amounts of data through parallelism
▪ handling a variety of data (structured, unstructured, semi-structured)
▪ using inexpensive commodity hardware
Hadoop is not good for:
▪ processing transactions (random access)
▪ when work cannot be parallelized
▪ low latency data access
▪ processing lots of small files
▪ intensive calculations with small amounts of data

HDFS and MapReduce


The driving principle of MapReduce is a simple one: spread the data out across a huge cluster
of machines and then, rather than bringing the data to your programs as you do in traditional
programming, write your program in a specific way that allows the program to be moved to
the data. Thus, the entire cluster is made available for both reading the data and
processing the data.
The Distributed File System (DFS) is at the heart of MapReduce. It is responsible for
spreading data across the cluster, by making the entire cluster look like one giant file system.
When a file is written to the cluster, blocks of the file are spread out and replicated across the
whole cluster (in the diagram, notice that every block of the file is replicated to three different
machines).
Adding more nodes to the cluster instantly adds capacity to the file system and automatically
increases the available processing power and parallelism.

HDFS architecture
Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed
throughout the cluster. In this way, the map and reduce functions can be executed on smaller
subsets of your larger data sets, and this provides the scalability that is needed for big data
processing.
Hadoop Distributed File System (HDFS) principles
• Distributed, scalable, fault tolerant, high throughput
• Data access through MapReduce
• Files split into blocks (aka splits)
• 3 replicas for each piece of data by default
• Can create, delete, and copy, but cannot update
• Designed for streaming reads, not random access
• Data locality is an important concept: processing data on or near the
physical storage to decrease transmission of data
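The block and replica arithmetic behind these principles can be made concrete with a small sketch. The 128 MB block size is a common Hadoop default and is an assumption here; only the 3-replica figure comes from the notes above, and both are configurable per cluster.

```python
import math

def hdfs_storage(file_size_mb, block_size_mb=128, replication=3):
    """Number of HDFS blocks for a file and the raw storage consumed.

    block_size_mb=128 is an assumed default; replication=3 matches the
    "3 replicas for each piece of data by default" principle above.
    """
    # Each block holds at most block_size_mb; the last block may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Raw usage counts the actual bytes times the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, raw_mb

# A 1 GB file: 8 blocks of 128 MB, consuming 3 GB of raw cluster storage.
print(hdfs_storage(1024))  # (8, 3072)
```

This also illustrates why HDFS dislikes lots of small files: a 1 MB file still occupies one block entry in the NameNode's metadata, so millions of small files inflate metadata without filling blocks.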
Unit 5: MapReduce and YARN
The driving principle of MapReduce is a simple one: spread your data out across a huge
cluster of machines and then, rather than bringing the data to your programs as you do in
traditional programming, you write your program in a specific way that allows the program to
be moved to the data.
A Distributed File System (DFS) is at the heart of MapReduce. It is responsible for spreading
data across the cluster, by making the entire cluster look like one giant file system

MapReduce v1 explained
MapReduce v1 engine
• Master/Slave architecture
▪ Single master (JobTracker) controls job execution on multiple slaves
(TaskTrackers).
• JobTracker
▪ Accepts MapReduce jobs submitted by clients
▪ Pushes map and reduce tasks out to TaskTracker nodes
▪ Keeps the work as physically close to data as possible
▪ Monitors tasks and TaskTracker status
• TaskTracker
▪ Runs map and reduce tasks
▪ Reports status to JobTracker
▪ Manages storage and transmission of intermediate output
The MapReduce programming model
"Map" step
▪ Input is split into pieces (HDFS blocks or "splits")
▪ Worker nodes process the individual pieces in parallel
(under global control of a Job Tracker)
▪ Each worker node stores its result in its local file system where a reducer
is able to access it
"Reduce" step
▪ Data is aggregated ("reduced" from the map steps) by worker nodes
(under control of the Job Tracker)
▪ Multiple reduce tasks parallelize the aggregation
▪ Output is stored in HDFS (and thus replicated)
MapReduce 1 overview
Map phase : a mapper is typically a relatively small program with a relatively simple task: it
is responsible for reading a portion of the input data, interpreting, filtering, or transforming the
data as necessary, and finally producing a stream of <key, value> pairs
Shuffle phase : the output of each mapper is locally grouped together by key. One node is
chosen to process the data for each unique key. All of this movement (shuffle) of data is
transparently orchestrated by MapReduce
Reduce phase : small programs (typically) that aggregate all of the values for the key that
they are responsible for. Each reducer writes output to its own file
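The three phases above can be sketched as plain Python functions for a word count. This is a single-process simulation of the model, not Hadoop itself: each function stands in for the corresponding phase.

```python
from collections import defaultdict

def map_phase(split):
    """Mapper: read a portion of the input, emit <key, value> pairs."""
    for line in split:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all values for the same key together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reducer: aggregate all values for each key it is responsible for."""
    return {key: sum(values) for key, values in grouped.items()}

split = ["the cat sat", "the dog sat"]
counts = reduce_phase(shuffle_phase(map_phase(split)))
print(counts)  # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```

In a real cluster, many mappers run this map function on different splits in parallel, the framework performs the shuffle over the network, and each reducer handles a disjoint subset of keys.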
Classes
• There are three main Java classes provided in Hadoop to read data
in MapReduce:
▪ InputSplitter divides a file into splits
-Splits are normally the block size, but this depends on the number of requested Map tasks,
whether the compression codec allows splitting, etc.
▪ RecordReader takes a split and reads the file into records
-For example, one record per line (LineRecordReader)
-But note that a record can be split across splits
▪ InputFormat takes each record and transforms it into a <key, value> pair that is then passed
to the Map task
The primary way that Hadoop achieves fault tolerance is through restarting tasks
The most serious limitations of classical MapReduce are:
▪ Scalability
▪ Resource utilization
▪ Support of workloads different from MapReduce.

YARN overhauls MRv1


The fundamental idea of YARN/MRv2 is to split up the two major functionalities of the
JobTracker, resource management and job scheduling/monitoring, into separate
daemons.
The idea is to have a global ResourceManager (RM) and per-application
ApplicationMaster (AM).
The ResourceManager has two main components: the Scheduler (a pure scheduler, in the
sense that it performs no monitoring or tracking of application status) and the
ApplicationsManager (responsible for accepting job submissions, negotiating the first
container for executing the application-specific ApplicationMaster, and providing the service
for restarting the ApplicationMaster container on failure).

YARN features
• Scalability
• Multi-tenancy
• Compatibility
• Serviceability
• Higher cluster utilization
• Reliability/Availability
Unit 6: Apache Spark
Apache Spark was designed as a computing platform to be fast, general-purpose, and easy to
use. It extends the MapReduce model and takes it to a whole other level.

Spark Core contains basic Spark functionalities required for running jobs and needed
by other components. The most important of these is the RDD concept, or resilient
distributed dataset, the main element of the Spark API
Spark SQL is designed to work with Spark via SQL and HiveQL (a Hive
variant of SQL). Spark SQL allows developers to intermix SQL with Spark's
supported programming languages: Python, Scala, Java, and R.
• Spark Streaming provides processing of live streams of data. The Spark
Streaming API closely matches that of Spark Core's API, making it easy for
developers to move between applications that process data stored in memory
vs. arriving in real time. It also provides the same degree of fault tolerance,
throughput, and scalability that Spark Core provides.
• MLlib is the machine learning library that provides multiple types of machine
learning algorithms. These algorithms are designed to scale out across the
cluster as well. Supported algorithms include logistic regression, naive Bayes classification,
SVM, decision trees, random forests, linear regression, k-means
clustering, and others.
• GraphX is a graph processing library with APIs to manipulate graphs and
performing graph-parallel computations. Graphs are data structures comprised
of vertices and edges connecting them. GraphX provides functions for building
graphs and implementations of the most important algorithms of the graph
theory, like page rank, connected components, shortest paths, and others.
Two types of RDD operations (an RDD is essentially just a distributed collection of elements
that is parallelized across the cluster):
▪ Transformations
-Create a directed acyclic graph (DAG)
-Lazily evaluated
-Return a new RDD rather than a value to the driver
▪ Actions
-Trigger execution of the transformations
-Return a value
• RDD provides fault tolerance
• Has in-memory caching (with overflow to disk)
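The transformation/action distinction can be sketched without Spark at all: Python generators mimic the lazy evaluation, and nothing runs until an "action" consumes the pipeline. (In real Spark the equivalents would be map/filter transformations followed by a collect action.)

```python
# Record which source elements were actually computed.
evaluated = []

def numbers():
    for n in range(5):
        evaluated.append(n)   # side effect: note that this element ran
        yield n

# "Transformations": build the pipeline (the DAG); nothing is computed yet.
doubled = (n * 2 for n in numbers())
big = (n for n in doubled if n >= 4)

assert evaluated == []        # lazy: no element has been evaluated so far

# "Action": pulls data through the whole pipeline and returns a value.
result = list(big)
print(result)                 # [4, 6, 8]
assert evaluated == [0, 1, 2, 3, 4]
```

This is also why fault tolerance works the way it does in Spark: because transformations only describe lineage, a lost partition can be recomputed by replaying the DAG from the source data.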
Unit 7
Common data representation formats used for big data include:
▪ Row- or record-based encodings:
-Flat files / text files
-CSV and delimited files
-Avro / SequenceFile
-JSON
-Other formats: XML, YAML
▪ Column-based storage formats:
-RC / ORC file was developed to support Hive
-Parquet (developed by Cloudera and Twitter)
▪ NoSQL datastores
• Compression of data
Why NoSQL?

Briefly:
• Increased complexity of SQL (massive data sets exhaust the capacity and scale of existing
RDBMSs)
• Sharding introduces complexity (distributing the RDBMS is operationally challenging and
often technically impossible)
• Single point of failure
• Failover servers are more complex
• Backups are more complex
• Operational complexity is added
Why HBase?
• Highly Scalable
▪ Automatic partitioning (sharding)
▪ Scale linearly and automatically with new nodes
• Low Latency
▪ Support random read/write, small range scan
• Highly Available
• Strong Consistency
• Very good for "sparse data" (no fixed columns)
HBase and ACID Properties
• Atomicity
▪ All reading and writing of data in one region is done by the assigned Region Server
▪ All clients have to talk to the assigned Region Server to get to the data
▪ Provides row level atomicity
• Consistency and Isolation
▪ All rows returned via any access API will consist of a complete row that existed at
some point in the table's history
▪ A scan is not a consistent view of a table. Scans do not exhibit snapshot isolation. Any
row returned by the scan will be a consistent view (for example, that version of the
complete row existed at some point in time)
• Durability
▪ All visible data is also durable data. That is to say, a read will never return
data that has not been made durable on disk
HBase data model
• Data is stored in HBase table(s)
• Tables are made of rows and columns
• All columns in HBase belong to a particular column family
• Table schema only defines column families
▪ Can have large, variable number of columns per row
▪ (row key, column key, timestamp) → value
▪ A {row, column, version} tuple exactly specifies a cell
• Each cell value has a version
▪ Timestamp
• Row stored in order by row keys
▪ Row keys are byte arrays; lexicographically sorted
• Technically HBase is a multidimensional sorted map
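That "multidimensional sorted map" can be modeled directly with nested Python dicts: {row key → {column family:qualifier → {timestamp → value}}}. The table name, columns, and values below are hypothetical, and this toy ignores region assignment; it only illustrates the addressing scheme.

```python
# {row key -> {"family:qualifier" -> {timestamp -> value}}}
table = {
    b"row-001": {
        "info:name": {1700000002: b"Ana", 1700000001: b"An"},
        "info:city": {1700000001: b"Oslo"},
    },
    b"row-002": {
        "info:name": {1700000003: b"Bo"},
    },
}

def get_cell(row, column):
    """A {row, column, version} tuple specifies a cell; with no version
    given, return the newest one (highest timestamp)."""
    versions = table[row][column]
    return versions[max(versions)]

# Rows are kept sorted lexicographically by their byte-array keys,
# which is the order a scan would visit them in.
scan_order = sorted(table)
print(scan_order)                         # [b'row-001', b'row-002']
print(get_cell(b"row-001", "info:name"))  # b'Ana'
```

Note how the schema only fixes the column family ("info"); each row is free to carry a different set of qualifiers, which is what makes HBase a good fit for sparse data.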
Pig
• Pig runs in two modes:
▪ Local mode: on a single machine, with no requirement for HDFS
▪ MapReduce/Hadoop mode: execution on an HDFS cluster, with the Pig script converted
to a MapReduce job
• When Pig runs in an interactive shell, the prompt is grunt>
• Pig scripts have, by convention, a suffix of .pig
• Pig scripts are written in the language Pig Latin
Pig vs. SQL
• In contrast to SQL, Pig:
▪ uses lazy evaluation
▪ uses ETL techniques
▪ is able to store data at any point during a pipeline
▪ declares execution plans
▪ supports pipeline split
• Pig Latin is a procedural language with a pipeline paradigm
• SQL is a declarative language

What is Hive?
• tools to enable easy data extract/transform/load (ETL)
• a mechanism to impose structure on a variety of data formats
• access to files stored either directly in Apache HDFS or in other data storage systems such
as Apache HBase
• query execution via MapReduce
Components of Hive include HCatalog and WebHCat:
• HCatalog is a component of Hive. It is a table and storage management layer for Hadoop
that enables users with different data processing tools – including Pig and MapReduce - to
more easily read and write data on the grid.
• WebHCat provides a service that you can use to run Hadoop MapReduce (or YARN), Pig,
Hive jobs or perform Hive metadata operations using an http (REST style) interface.

ZooKeeper
ZooKeeper is a centralized service for maintaining configuration information, naming,
providing distributed synchronization, and providing group services.
ZooKeeper is a distributed, open-source coordination service for distributed applications. It
exposes a simple set of primitives that distributed applications can build upon to implement
higher level services for synchronization, configuration maintenance, and groups and
naming. It is designed to be easy to program to, and uses a data model styled after the
familiar directory tree structure of file systems. It runs in Java and has bindings for both Java
and C.
• ZooKeeper provides support for writing distributed applications in the Hadoop ecosystem
• ZooKeeper addresses the issue of partial failure
▪ Partial failure is intrinsic to distributed systems
▪ ZK provides a set of tools to build distributed applications that can safely handle partial
failures
• ZooKeeper has the following characteristics:
▪ simple
▪ expressive
▪ highly available
▪ facilitates loosely coupled interactions
▪ is a library
• Apache ZooKeeper is an open source server that enables highly reliable distributed
coordination
Distributed systems :

Multiple software components on multiple computers, but run as a single system


• Computers can be physically close (local network), or geographically
distant (WAN)
• The goal of distributed computing is to make such a network work as a single computer
• Distributed systems offer many benefits over centralized systems
▪ Scalability: System can easily be expanded by adding more machines as needed
▪ Redundancy: Several machines can provide the same services, so if one is unavailable:
work does not stop, smaller machines can be used, redundancy not prohibitively expensive

ZooKeeper service: Replicated mode


The processing sequence is:
• ZooKeeper Service is replicated over a set of machines
• All machines store a copy of the data (in memory)
• A leader is elected on service startup
• Clients connect to a single ZooKeeper server and maintain a TCP connection.
• Clients can read from any ZooKeeper server; writes go through the leader and require
a majority consensus.

ZooKeeper service: Standalone mode


• Single ZooKeeper server
• Good for testing/learning
• Lacks benefits of Replicated Mode; no guarantee of high-availability or resilience
ZooKeeper provides five consistency guarantees:
1. Sequential Consistency: updates from a client to the ZooKeeper service are applied in the
order they are sent.
2. Atomicity: updates in ZooKeeper either succeed or fail. Partial updates are not allowed.
3. Single System Image: a client will see the same view of the ZooKeeper service regardless
of the server in the ensemble that it is connected to.
4. Reliability: if an update succeeds in ZooKeeper then it will persist and not be rolled back.
The update will only be overwritten when another client performs a new update.
5. Timeliness: a client's view of the system is guaranteed to be up-to-date within a certain
time bound, generally within tens of seconds. If a client does not see system changes within
that time bound, then the client assumes a service outage and will connect to a different
server in the ensemble.
What ZooKeeper does not guarantee
• Simultaneously consistent cross-client views
• Different clients will not always have identical views of ZooKeeper data at every instance in
time.
▪ ZooKeeper provides the sync() method
-Forces a ZooKeeper ensemble server to catch up with leader
ZooKeeper structure: Data model
• Distributed processes coordinate through shared hierarchical namespaces
▪ These are organized very similarly to standard UNIX and Linux file systems
• A namespace consists of data registers
▪ Called ZNodes
▪ Similar to files and directories
▪ ZNode holds data, children, or both
• ZNode types
▪ Persistent: lasts until deleted
▪ Ephemeral: lasts for the duration of the session, cannot have children
▪ Sequence: provides unique numbering
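A toy in-memory model of this namespace shows how slash-separated paths, children, and sequence numbering fit together. This is not the ZooKeeper client API (that would be a library such as kazoo); the class, paths, and /locks example are illustrative only, though the 10-digit sequence suffix mirrors ZooKeeper's actual naming.

```python
class ZNodeTree:
    """Minimal sketch of a ZooKeeper-style hierarchical namespace."""

    def __init__(self):
        self.nodes = {"/": b""}   # path -> data held by that znode
        self.counter = 0          # drives sequence-znode numbering

    def create(self, path, data=b"", sequence=False):
        if sequence:
            # Sequence znodes get a monotonically increasing 10-digit suffix.
            path = "%s%010d" % (path, self.counter)
            self.counter += 1
        self.nodes[path] = data
        return path

    def get_children(self, path):
        prefix = path.rstrip("/") + "/"
        return sorted(p[len(prefix):] for p in self.nodes
                      if p.startswith(prefix) and "/" not in p[len(prefix):])

zk = ZNodeTree()
zk.create("/locks")
first = zk.create("/locks/lock-", sequence=True)
second = zk.create("/locks/lock-", sequence=True)
print(first, second)  # /locks/lock-0000000000 /locks/lock-0000000001
print(zk.get_children("/locks"))
```

The sequence-znode pattern is the basis of the classic distributed-lock recipe: each client creates a sequence child under /locks, and the client holding the lowest number owns the lock.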
ZooKeeper's role in the Hadoop infrastructure
• HBase
▪ Uses ZooKeeper for master election, server lease management, bootstrapping, and
coordination between servers
• Hadoop and MapReduce
▪ Uses ZooKeeper to aid in high availability of Resource Manager
• Flume
▪ Uses ZooKeeper for configuration purposes in recent releases

Slider
• Apache Slider is a YARN application to deploy existing distributed applications on YARN,
monitor them and make them larger or smaller as desired, even while the application is
running.
• Some of the features are:
▪ Allows users to create on-demand applications in a YARN cluster
▪ Allows different users/applications to run different versions of the
application.
▪ Allows users to configure different application instances differently
▪ Stop/restart application instances as needed
▪ Expand/shrink application instances as needed
• The Slider tool is a Java command line application.

Role of Slider in the Hadoop ecosystem (1 of 2)


• Applications can be stopped then started
▪ The distribution of the deployed application across the YARN cluster is persisted
▪ This enables best-effort placement close to the previous locations
▪ Applications which remember the previous placement of data (such as HBase) can exhibit
fast start-up times from this feature.
• YARN itself monitors the health of "YARN containers" hosting parts of the deployed
application
▪ YARN notifies the Slider manager application of container failure
▪ Slider then asks YARN for a new container, into which Slider deploys a replacement for the
failed component, keeping the size of managed applications consistent with the specified
configuration
• The tool persists the information as JSON documents in HDFS.
• Once the cluster has been started:
▪ The cluster can be made to grow or shrink using Slider commands
▪ The cluster can also be stopped and later restarted
• Slider implements all its functionality through YARN APIs and the existing application shell
scripts
• The goal of the application was to have minimal code changes and impact on existing
applications

Knox
• The Apache Knox Gateway is an extensible reverse proxy framework for securely exposing
REST APIs and HTTP based services at a perimeter
• Different types of REST access supported: HTTP(S) client, cURL, Knox Shell (DSL), SSL, …
Big SQL is SQL on Hadoop
• Big SQL builds on Apache Hive foundation
▪ Integrates with the Hive metastore
▪ Instead of MapReduce, uses powerful native C/C++ MPP engine
• View on your data residing in the Hadoop FileSystem
• No proprietary storage format
• Modern SQL:2011 capabilities
• Same SQL can be used on your warehouse data with little or no modifications

What does Big SQL provide?


• Comprehensive, standard SQL
• Optimization and performance
• Support for variety of storage formats
• Integration with RDBMSs

Big SQL provides powerful optimization and performance


• IBM MPP engine (native C++) replaces Java MapReduce layer
• Continuous running daemons (no start up latency)
• Message passing allows data to flow between nodes without persisting intermediate results
• In-memory operations with ability to spill to disk (useful for aggregations, sorts that exceed
available RAM)
• Cost-based query optimization with 140+ rewrite rules
Big SQL supports a variety of storage formats
• Text (delimited), Sequence, RCFile, ORC, Avro, Parquet
• Data persisted in:
▪ DFS
▪ Hive
▪ HBase
▪ WebHDFS URI* (Tech preview)
• No IBM proprietary format required
Big SQL architecture
• Head (coordinator / management) node
▪ Listens to the JDBC/ODBC connections
▪ Compiles, optimizes, and coordinates execution of the query
• Big SQL worker processes reside on compute nodes (some or all)
• Worker nodes stream data between each other as needed
• Workers can spill large data sets to local disk if needed

Many Db2 technologies you already know exist in Big SQL, including
• "Native Tables" with full transactional support on the Head Node
• Row oriented, traditional Db2 tables
• BLU Columnar, In-memory tables (on Head Node Only)
• Materialized Query Tables
• GET SNAPSHOT / snapshot table functions
• RUNSTATS command (Db2) → ANALYZE command (Big SQL)
• Row and Column Security
• Federation / Fluid Query
• Views
• SQL PL Stored Procedures & UDFs
• Workload Manager
• System Temporary Table Spaces to support sort overflows
• User Temporary Table Spaces for Declared Global Temporary Tables

Accessing Big SQL


• Java SQL Shell (JSqsh)
• Web tooling using Data Server Manager (DSM)
• Tools that support the IBM JDBC/ODBC driver

JSqsh (1 of 3)
• Big SQL comes with a CLI, the Java SQL Shell (JSqsh, pronounced "jay-skwish")
▪ Open source command client
▪ Query history and query recall
▪ Multiple result set display styles
▪ Multiple active sessions
• Started under /usr/ibmpacks/common-utils/current/jsqsh/bin

Run the JSqsh connection wizard to supply connection information:
• Connect to the bigsql database:
▪ ./jsqsh bigsql

Creating Big SQL schemas


[myhost][bigsql] 1> use "newschema";
[myhost][bigsql] 1> create hadoop table t1 (c1 int);
[myhost][bigsql] 1> insert into t1 values (10);
[myhost][bigsql] 1> select * from t1;

Creating a Big SQL table


create hadoop table users
(
id int not null primary key,
office_id int null,
fname varchar(30) not null,
lname varchar(30) not null)
row format delimited
fields terminated by '|'
stored as textfile;

• Metadata collected (Big SQL & Hive)
▪ Exposed through the SYSCAT.* and SYSHADOOP.* views
• HADOOP keyword
▪ Must be specified unless you enable the SYSHADOOP.COMPATIBILITY_MODE
• EXTERNAL keyword
▪ Indicates that the table is not managed by the database manager
▪ When the table is dropped, the definition is removed, the data remains unaffected.
• LOCATION keyword
▪ Specifies the DFS directory to store the data files
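
Putting the EXTERNAL and LOCATION keywords together, a minimal hypothetical sketch (the table, columns, and DFS path are illustrative, not from the course material):

```sql
-- Hypothetical external table over files already in DFS; dropping it removes
-- only the definition, the files under /user/bigsql/weblogs stay untouched.
create external hadoop table weblogs
(
  ip_addr varchar(15),
  url     varchar(200),
  hits    int
)
row format delimited
fields terminated by ','
stored as textfile
location '/user/bigsql/weblogs';
```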

CREATE VIEW
create view my_users as
select fname, lname from bigsql.users where id > 100;

Loading data into Big SQL tables


• Populating tables via LOAD
▪ Best runtime performance
• Populating tables via INSERT
▪ INSERT INTO … SELECT FROM
-Parallel read and write operations
▪ INSERT INTO … VALUES(…)
-NOT parallelized. 1 file per insert. Not recommended, except for quick tests.
• Populate tables using CREATE TABLE … AS SELECT …
▪ Create a Big SQL table based on contents of other table(s)
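
A sketch of the CREATE TABLE … AS SELECT approach, reusing the bigsql.users table from earlier (the new table name and Parquet format are illustrative assumptions):

```sql
-- Hypothetical CTAS: create and populate a Parquet table from bigsql.users
create hadoop table active_users
stored as parquetfile
as select id, fname, lname
   from bigsql.users
   where office_id is not null;
```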

Populating Big SQL tables via LOAD


• Load data from a local or remote file system
• Load data from an RDBMS (Db2, Netezza, Teradata, Oracle, MS-SQL, Informix) via a JDBC connection

Load from a file URL:
load hadoop using file url
'ftp://myID:myPassword@myServer.ibm.com:22/installdir/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE gosalesdw.GO_REGION_DIM overwrite;

Load from an RDBMS via a JDBC connection:
load hadoop
using jdbc connection url 'jdbc:db2://some.host.com:portNum/sampledb'
with parameters (user='myID', password='myPassword')
from table MEDIA columns (ID, NAME)
where 'CONTACTDATE < ''2012-02-01'''
into table media_db2table_jan overwrite
with load properties ('num.map.tasks' = 10);

Data types
• Big SQL uses HCatalog (Hive Metastore) as its underlying data
representation and access method
• Each Big SQL (SQL) data type maps to a corresponding Hive type in the metastore
DATE type
DATE can be stored two ways:
• DATE STORED AS TIMESTAMP
• DATE STORED AS DATE
FILE FORMAT

Here are the Big SQL file formats that will be covered in detail in upcoming
slides:
• Delimited
• Sequence
• Binary
• Parquet
• ORC
• RC
• Avro
SEQUENCE
Unit 4 :

Big SQL inherits the authentication modes of the Hortonworks Data Platform to
authenticate users. Big SQL can use LDAP, flat files, or Kerberos.

Setting up Kerberos for the Big SQL service


Ranger security for Big SQL
• Framework to enable, monitor, and manage comprehensive data security across the Hadoop
platform
• Support is available for Big SQL
▪ Can be enabled to control access to Hadoop and HBase tables
• Enable / disable the Big SQL Ranger plugin
• All access to Big SQL tables is automatically audited by Ranger
• Two limitations:
▪ Ranger security cannot be used in combination with impersonation
▪ Ranger access control is not available at the view level
Enable SSL encryption
• Big SQL supports SSL encryption
• Configure SSL support in the Db2 instance (the Big SQL engine)
• Modify the configuration file to add the SSL port
• Restart Big SQL
Big SQL is classified as a long running service in Hadoop. This means that the
service is running whether or not users are connected to Big SQL
Authorization of Big SQL objects
• Level 1 – Controlling access with authorization in the distributed file system
• Level 2 – Authorization with the GRANT command
• Level 3 – Authorization at the row and column level
• Level 4 – Controlling access by using VIEWS or STORED PROCEDURES
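
Levels 2–4 can be sketched with Db2-style statements as Big SQL builds on the Db2 engine; the user name, predicate, and permission name below are hypothetical:

```sql
-- Level 2: object privileges via GRANT (user 'fred' is hypothetical)
grant select on table bigsql.users to user fred;

-- Level 3: a Db2-style row permission (illustrative predicate)
create permission office_100_only on bigsql.users
  for rows where office_id = 100
  enforced for all access
  enable;
alter table bigsql.users activate row access control;

-- Level 4: expose only selected columns by granting access to a view
-- (my_users is the view created earlier) instead of the base table
grant select on my_users to user fred;
```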
Big SQL federation overview
• Users can process SQL queries and statements from multiple data sources
▪ As if they were ordinary tables or views
▪ Integrate existing data sources with Big SQL: Db2 LUW, Oracle, Teradata, and Netezza (IBM PureData for Analytics)
• Federation is one of the key features and differentiators of Big SQL
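
A hedged sketch of Db2-style federation DDL as it might be used from Big SQL; the wrapper, server name, node, credentials, and remote table are all hypothetical, and exact wrapper/option names depend on the data source:

```sql
-- Register a remote Oracle source and expose one of its tables locally
create wrapper net8;                          -- Oracle wrapper (illustrative)
create server ora_sales type oracle version 12 wrapper net8
  options (node 'ora_node');                  -- node/host options vary by source
create user mapping for user server ora_sales
  options (remote_authid 'scott', remote_password 'tiger');
create nickname sales_orders for ora_sales.scott.orders;

-- The nickname can now be joined with Hadoop tables as if it were local:
select o.order_id, u.fname
from sales_orders o
join bigsql.users u on o.user_id = u.id;
```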
Federated System
• A special type of distributed database management system (DBMS) that consists of:
▪ An instance that operates as the federated server
▪ A database (default 'BIGSQL') that acts as the federated database
▪ One or more data sources
▪ Clients (users and applications) that access the database and data sources
• Characteristics
▪ Transparent: appears to be one source
▪ Extensible: brings together many data sources
▪ Autonomous: no interruption to data sources, applications, or systems
▪ High function: full query support against all data (e.g., scalar functions, stored procedures)
▪ High performance: optimization of distributed queries