You need to monitor and manage data security
across a Hadoop platform. Which tool would you
use?
A. HDFS
B. Hive
C. SSL
D. Apache Ranger
B. Indexed databases containing very large volumes of historical data used for
compliance reporting purposes.
C. Non-conventional methods used by businesses and organizations to
capture, manage, process, and make sense of a large volume of data.
D. Structured data stores containing very large data sets such as video and
audio streams.
Select an answer
D. HDFS links the disks on multiple nodes into one large file system.
Select an answer
A. 100,000
B. 1,000
C. 100
A. Scalar
B. User-Defined
C. OLAP
D. Built-in
***
The Spark configuration must be set up first through IBM Cloud
***
You can import preinstalled libraries if you are
using which languages?
Python and R
Who can control Watson Studio project assets?
Editors
Collaborators
***
Which visualization library is developed by IBM as
an add-on to Python notebooks?
PixieDust
***
B. CREATE WRAPPER
C. CREATE NICKNAME
D. CREATE SERVER
A. ORC
B. Sequence
C. Delimited
D. Parquet
Centralized security framework to enable, monitor and manage
comprehensive data security across the Hadoop platform
• Manage fine-grained access control over Hadoop data access
components like Apache Hive and Apache HBase
• Using Ranger console can manage policies for access to files, folders,
databases, tables, or column with ease
• Policies can be set for individual users or groups
Which is the primary advantage of using column-based
data formats over record-based formats?
A. Ambari
B. Scheduler
C. Hadoop Cluster
AVRO
A. using
B. pull
C. import
D. load
Which command would you run to make a remote table
accessible using an alias?
A. CREATE NICKNAME
B. CREATE SERVER
C. SET AUTHORIZATION
D. CREATE WRAPPER
Select an answer
A. Scala
B. PixieDust
C. Spark
D. Watson Studio
Which feature allows application developers to easily
use the Ambari interface to integrate Hadoop
provisioning, management, and monitoring capabilities
into their own applications?
A. Ambari
B. HDFS
C. YARN
D. MapReduce
B. video streams
C. banner ads
D. log files
E. cookies
F. javascript
A. Spark SQL
B. Spark Live
D. MLlib
The Big SQL head node has a set of processes
running. What is the name of the service ID running
these processes?
A. user1
B. bigsql
C. hdfs
D. Db2
C. IP address
D. Time of interaction
Question 3
Which component of the HDFS architecture
manages storage attached to the nodes?
Select an answer
A. NameNode
B. MasterNode
C. DataNode
D. StorageNode
A. C++
B. Scala
C. Python
D. C#
E. Java
Which two of the following are column-based data
encoding formats?
A. ORC
B. JSON
C. Parquet
D. Flat
E. Avro
A. GraphX
B. Distribution
C. Actions
D. Transformations
Select an answer
A. Syntactical functions
B. Lambda functions
C. Persistent functions
D. Distributed functions
A. Parquet
B. ORC
C. Sequence
Select an answer
A. Syntactical functions
B. Lambda functions
C. Distributed functions
D. Persistent functions
A. RStudio
B. Jupyter
C. Apache HIVE
D. MapReduce
• Can be all rows of a table
• Can limit the rows and columns
• Can specify your own query to access relational data
A. Database native indexes.
(select 3 )
A. Machine learning
B. Hacking skills
C. Substantive expertise
E. Traditional research
Select an answer
A. Manage, secure, and govern data stored across all storage environments.
A. JSON
B. CSV
C. XML
D. SQL
Select an answer
A. MapReduce
B. Data mining
C. Data munging
D. YARN
B. bank records
Select an answer
A. NameNode
B. SlaveNode
C. WorkerNode
D. DataNode
Spark service
D. Postgres RDBMS
Select an answer
A. ZOOKEEPER_HOME
B. ZOOKEEPER_DATA
C. ZOOKEEPER_APP
D. ZOOKEEPER
Select an answer
A. Hardware token
B. Preshared keys
C. IP address
D. Kerberos
Under the HDFS storage model, what is the default
method of replication?
Select an answer
B. Kerberos
What is meant by data at rest?
Collaborators
Tenants
Anyone
Teams
Select an answer
A. OLAP
B. Scalar
C. User-Defined
D. Built-in
Select an answer
A. NameNode
B. SlaveNode
C. WorkerNode
D. DataNode
Which statement is true about the Hadoop Distributed
File System (HDFS)?
HDFS links the disks on multiple nodes into one large file system.
What is one disadvantage to using CSV formatted
data in a Hadoop data store?
Select an answer
A. Data must be extracted, cleansed, and loaded into the data warehouse.
C. Fields must be positioned at a fixed offset from the beginning of the record.
The Distributed File System (DFS) is at the heart of MapReduce. It is responsible for
spreading data across the cluster, by making the entire cluster look like one giant file
system. When a file is written to the cluster, blocks of the file are spread out and replicated
across the whole cluster (in the diagram, notice that every block of the file is replicated to
three different machines).
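As a quick illustration from the command line (a sketch; the path and replication factor are hypothetical), you can change a file's replication factor and inspect where its blocks landed:
hadoop fs -setrep 3 /data/myfile.txt
hdfs fsck /data/myfile.txt -files -blocks -locations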
Select an answer
read
scan
list
get
describe
Select an answer
pre-create the column families when creating the table and bulk
loading the data
pre-create the column families when creating the table and use the put
command to load the data
Select an answer
Workbook
Table
Reader
View
Select an answer
A HBase
B MapReduce
C HDFS
D GPFS
Select an answer
code generator
code converter
Select an answer
A MapReduce
B HDFS
C Data Warehouse
D Hadoop
Why does Big SQL perform better than Hive?
Select an answer
It uses sub-queries.
It supports Hcatalog.
Select an answer
A Data Warehouse
B Hive
D Zookeeper
Select an answer
JDBC
BigSQL
Thrift
Select an answer
JDBC
BigSQL
Thrift
SerDe
Select an answer
A TEXTFILE
B SEQUENCEFILE
C DBOUTPUTFILE
D RCFILE
What happens?
Select an answer
A CEP
B Geospatial
C SPSS
D Time Series
Select an answer
A Buckets
B SQL
C Partitions
D Parallelization
Select an answer
Velocity
Variety
Volume
Volatility
Select an answer
A Jaql
B Big SQL
C Avro
D Pig
Select an answer
higher scalability
Select an answer
C extensibility
Select an answer
A 16 MB
B 32 MB
C 64 MB
D 128 MB
Select an answer
A 1 GB
B 2 GB
C 5 GB
D 10 GB
Select an answer
A Index
B Delete
C Create
D Update
E Execute
Select an answer
A Microsoft
B IBM
C Oracle
D Apache
Select an answer
A CodeEnvy
B Eclipse
C Java NetBeans
D VisualAge
Select an answer
EditLog
FsImage
NameSpace
DataNode
Select an answer
A Twitter feeds
B email
C server logs
D voicemail
Select an answer
DataNode
CheckPoint
FSImage
EditLog
Select an answer
A 2
B 3
C 4
D 7
Select an answer
InputSplitter
LineRecordReader
InputFormat
RecordReader
Which class in MapReduce takes each record and
transforms it into a <key, value> pair?
Select an answer
A RecordReader
B InputSplitter
C InputFormat
D LineRecordReader
Select an answer
A Combiner
B Map
C Reduce
D Shuffle
Select an answer
A Job Client
B MapTask
C JobTracker
D TaskTracker
Select an answer
A search engine
B application framework
C meta data
D connector framework
Select an answer
Ruby
Jaql
Python
AQL
Select an answer
Pig
BigSql
Jaql
Hive
When the following HIVE command is executed:
LOAD DATA INPATH '/tmp/department.del'
OVERWRITE
INTO TABLE department;
What happens?
Select an answer
Select an answer
The department.del file is copied from the local file system /tmp
directory to the location corresponding to the Hive table in HDFS.
The department.del file is moved from the local file system /tmp
directory in the local file system to HDFS.
The department.del file is moved from the HDFS /tmp directory to the
location corresponding to the Hive table.
The department.del file is copied from the HDFS /tmp directory to the
location corresponding to the Hive table.
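As a hedged illustration of the difference (paths and table reused from the question; the local path is hypothetical): with the LOCAL keyword, LOAD DATA copies the file from the local file system, while without it the file is moved within HDFS:
LOAD DATA LOCAL INPATH '/home/user/department.del' OVERWRITE INTO TABLE department; -- copies from the local file system
LOAD DATA INPATH '/tmp/department.del' OVERWRITE INTO TABLE department; -- moves the file within HDFS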
Question: 20
Which built-in Hive storage file format improves
performance by providing semi-columnar data
storage with good compression?
Select an answer
SEQUENCEFILE
TEXTFILE
RCFILE
DBOUTPUTFILE
Select an answer
Security/Intelligence extension
360-degree view of the customer
Operations analysis
Select an answer
A 2
B 3
C 4
D 7
Select an answer
SerDe
Thrift
JDBC
BigSQL
Select an answer
A Index
Delete
C Create
D Update
E Execute
Select an answer
code converter
code generator
Select an answer
Stream Computing
Contextual Discovery
Hadoop System
Data Warehouse
Select an answer
RFID
GPS
social networks
cell phones
Select an answer
InfoSphere Streams
HBase
What happens?
Select an answer
The department.del file is moved from the local file system /tmp
directory in the local file system to HDFS.
The department.del file is copied from the local file system /tmp
directory to the location corresponding to the Hive table in HDFS.
The department.del file is moved from the HDFS /tmp directory to the
location corresponding to the Hive table.
The department.del file is copied from the HDFS /tmp directory to the
location corresponding to the Hive table.
Select an answer
Results on the real data are computed and the output is explored.
Select an answer
low latency
random access
JDBC
SerDe
Thrift
BigSQL
What is the primary benefit of using Hive with
Hadoop?
Select an answer
SMALLINT
A. The permalink
Which file format has the highest performance?
Parquet
load
CREATE NICKNAME
STRING
A. Scala
User-Defined
Which two of the following data sources are
currently supported by Big SQL?
Teradata
Oracle
---IBM CLOUD
EXTERNAL
Which visualization library is developed by IBM as
an add-on to Python notebooks?
---PixieDust
Apache Ranger
The Big SQL head node has a set of processes running. What is the name of
the service ID running these processes?
bigsql
A. Data service
B. Data assets
C. Collaborators
D. Spark service
Question 20
Which file format contains human-readable data
where the column values are separated by a
comma?
A. Delimited
Which component of the Apache Ambari
architecture stores the cluster configurations?
C. Postgres RDBMS
B. R
C. Python
(Select the SIX answers that apply)
A. log files
D. cookies
F. browser cache
B. IP address
C. Email address
D. Time of interaction
Select an answer
A. Structured data stores containing very large data sets such as video and
audio streams.
namenode
C. Indexed databases containing very large volumes of historical data used for
compliance reporting purposes.
B. Ambari
B. REST APIs
What are three examples of "Data Exhaust"?
A. banner ads
B. javascript
C. browser cache
D. video streams
E. log files
F. cookies
Select an answer
D. Manage, secure, and govern data stored across all storage environments.
Which two of the following are column-based data
encoding formats?
C. ORC
D. Parquet
SciPy
JobTracker
ZOOKEEPER_HOME
Which statement describes the action performed by
HDFS when data is written to the Hadoop cluster?
D. The data is spread out and replicated across the cluster.
Select an answer
A. Distribution
B. Actions
C. GraphX (probably this one)
D. Transformations
Select an answer
Kerberos
What is meant by data at rest?
Select an answer
A. Collaborators -----
B. Anyone
C. Teams
D. Tenants
A. Python----
B. Java
C. C#
D. Scala----
What is Hortonworks DataPlane Services
(DPS) used for?
Select an answer
namenode
Demo question (an example of how to add a question: write "correct" next to the answer):
Question: 59
Which IBM Big Data solution provides low-latency analytics for processing data-in-motion?
Your answer
A InfoSphere Streams
B InfoSphere BigInsights
Which type of
partitioning is
supported by Hive?
Select an answer
A value partitioning
B range partitioning
C time partitioning
D interval partitioning
Let's begin!
Select an answer
A BigSheets
B Big SQL
C Hive
D Java
Question: 3
Which command is used for starting all the
BigInsights components?
Select an answer
A start.sh
C start-all.sh
D start.sh biginsights
Select an answer
A Spark
B JobConf Config
C Hive shell
D Cassandra
Which Hadoop query language can manipulate
semi-structured data and support the JSON format?
Jaql
Select an answer
A Reader
B Avro
C Formula
D Diagram
Select an answer
Question: 9
After the following sequence of commands are executed:
C 4
Select an answer
A Platform computing
B Stream computing
C Data Warehouse
Hadoop
Select an answer
C It is stored in memory.
A InfoSphere Streams
B Avro
C AQL
D Hive
Select an answer
A Jet
B SOAP
C RPC
D JDBC
Which of the four characteristics of Big Data deals
with trusting data sources?
Select an answer
A Variety
B Veracity
C Velocity
D Volume
Select an answer
A hive-conf.xml
B hive.conf
C hive-site.xml
D hive-env.config
Select an answer
A InfoSphere BigInsights
B Rational
C Eclipse
D Hadoop
Select an answer
A It utilizes the computing power of IBM i.
A describe
B read
C get
D scan
E list
Which file system spans all nodes in a Hadoop
cluster?
Select an answer
A HDFS
B XFS
C NTFS
D EXT4
Which well known Big Data source has become the
most popular source for business analytics?
Select an answer
A GPS
B RFID
C social networks
D cell phones
In 2003, IBM's System S was the first prototype of a
new IBM Stream Computing solution that performed
which type of processing?
Select an answer
A hive-site.xml
B hive-env.config
C hive-conf.xml
D hive.conf
The Enterprise Edition of BigInsights offers which
feature not available in the Standard Edition?
Select an answer
A Eclipse Tooling
B Adaptive MapReduce
C Dashboards
D Big SQL
What do the query languages Hive and Big SQL
have in common?
Select an answer
Select an answer
A It executes MapReduce functions more efficiently than the common query languages.
Select an answer
A Filter
B Reader
C Crawler
D Data Qualifier
Select an answer
B It is stored in memory.
Question: 32
What happens if a child task hangs in a MapReduce
job?
Select an answer
A Hive
B HBase
C Pig
D Big SQL
What is the default data block size in BigInsights?
Select an answer
A 16 MB
B 32 MB
C 64 MB
D 128 MB
What is the process in MapReduce that moves all
data from one key to the same worker node?
Select an answer
A Shuffle
B Split
C Reduce
D Map
In the master/slave architecture, what is considered
the slave?
DATANODE
Select an answer
A receives no heartbeat
Submit Test
Question: 43
Which capability of Jaql gives it a significant
advantage over other query languages?
Select an answer
The answer?
What is the default number of replicas in HDFS replication?
Select an answer
Select an answer
A 2
B 3
C 4
D 5
B Web Console
D Application Wizard
Select an answer
Select an answer
A Security/Intelligence Extension
Which file formats can Jaql read? (please answer)
Select an answer
B HTML, Avro
Select an answer
Select an answer
A Cloud SQL
B HBase
C MySQL
D PigeonRank
Where does HDFS store the file system namespace
and properties?
Select an answer
A DataNode
B hdfs.conf
C Hive
D FsImage
Select an answer
C POSIX compliance
D single point of failure
Select an answer
A NameNode
B CheckPoint
C Blockpool
D DataNode
Select an answer
A 1 block in 1 rack, 2 blocks in a second rack
Select an answer
A BigSheets
B MapReduce
C Text Analytics
D BigR
Select an answer
A BigSql
B Pig
C Jaql
D Hive
Select an answer
Resit exam
Which database is a columnar storage database?
Hbase
Hive
Zookeeper
What is ZooKeeper's role in the Hadoop infrastructure?
Manage the coordination between HBase servers
Hadoop and MapReduce uses ZooKeeper to aid in high availability of Resource Manager
Flume uses ZooKeeper for configuration purposes in recent releases
Through what HDP component are Kerberos, Knox, and Ranger managed?
Ambari
Which security component is used to provide peripheral security?
Apache Knox
What are the components of Hortonworks Data Flow(HDF)?
Flow management
Stream processing
Enterprise services
What main features does IBM Streams provide as a Streaming Data Platform?
(Please select the THREE that apply)
Analysis and visualization
Rich data connections
Development support
What are the 4Vs of Big Data?(Please select the FOUR that apply)
Veracity
Velocity
Variety
Volume
What are the three types of Big Data?(Please select the THREE that apply)
Semi-structured
Structured
Unstructured
Select all the components of HDP which provides data access capabilities
Pig
MapReduce
Hive
Select the components that provides the capability to move data from relational database into
Hadoop.
sqoop
kafka
flume
Managing Hadoop clusters can be accomplished using which component?
Ambari
Which Hadoop functionalities does Ambari provide?
Manage
Provision
Integrate
Monitor
Which page from the Ambari UI allows you to check the versions of the software installed on your
cluster?
The Admin > Manage Ambari page
What is the default number of replicas in a Hadoop system?
3
The Job Tracker in MR1 is replaced by which component(s) in YARN?
ResourceManager
ApplicationMaster
What are the benefits of using Spark?(Please select the THREE that apply)
Generality
Ease of use
Speed
Which database is a columnar storage database?
Hbase
Which database provides a SQL for Hadoop interface?
Hive
What are the languages supported by Spark?(Please select the THREE that apply)
Python
Java
Scala
Which Apache project provides coordination of resources?
Zookeeper
What would you need to do in a Spark application that you would not need to do in a Spark shell to
start using Spark?
Import the necessary libraries to load the SparkContext
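A minimal sketch of that setup in a standalone PySpark application (the app name is hypothetical; the Spark shell provides sc automatically):
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("demoApp")  # in an application, the context must be created explicitly
sc = SparkContext(conf=conf)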
What are the most important computer languages for Data Analytics?
(Please select the THREE that apply)
scala
R
Python
What are the two ways you can work with Big SQL.
(Please select the TWO that apply)
JSqsh
Web tooling from DSM
What is one of the reasons to use Big SQL?
Want to access your Hadoop data without using MapReduce
Which file storage format has the highest performance?
Parquet
What are the two ways to classify functions?
Built-in functions
User-defined functions
Which data type is BOOLEAN defined as in a Big SQL database?
SMALLINT
Which Big SQL authentication mode is designed to provide strong authentication for client/server
applications by using secret-key cryptography?
(I think it's Kerberos) Yes, that's it.
You need to define a server to act as the medium between an application and a data source in a Big
SQL federation. Which command would you use?
CREATE SERVER
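A hedged sketch of the full federation DDL sequence (wrapper, server, and table names are hypothetical, and the exact OPTIONS vary by data source):
CREATE WRAPPER drda;
CREATE SERVER remote_db TYPE db2/udb VERSION '11' WRAPPER drda OPTIONS (DBNAME 'SALES');
CREATE NICKNAME sales_nick FOR remote_db.SALESSCHEMA.TRANSACTIONS;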
When sharing a notebook, what will always point to the most recent version of the notebook?
A. The permalink
B. Data assets
C. Collaborators
Under the MapReduce v1 architecture, which function is performed by the JobTracker?
Accepts MapReduce jobs submitted by clients.
What must surround LaTeX code so that it appears on its own line in a Jupyter notebook?
$$
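For example, in a Markdown cell (the formula itself is arbitrary):
$$ \int_0^1 x^2 \, dx = \frac{1}{3} $$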
Which file format contains human-readable data where the column values are separated by a
comma?
A. Delimited
Which component of the Apache Ambari architecture stores the cluster configurations?
C. Postgres RDBMS
Which type of foundation does Big SQL build on?
Apache HIVE
Which Spark RDD operation returns values after performing the evaluations?
actions
You can import preinstalled libraries if you are using which languages?
B. R
C. Python
Which statement is true about Spark's Resilient Distributed Dataset (RDD)?
It is a distributed collection of elements that are parallelized across the cluster.
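A minimal PySpark sketch of such a collection (values hypothetical; assumes a SparkContext sc, as in the Spark shell):
rdd = sc.parallelize([1, 2, 3, 4])   # distribute a local collection across the cluster
doubled = rdd.map(lambda x: x * 2)   # transformation: lazily extends the DAG
print(doubled.collect())             # action: triggers evaluation, prints [2, 4, 6, 8]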
In a Hadoop cluster, which two are the result of adding more nodes to the cluster?
C. It increases available processing power.
E. It adds capacity to the file system.
Under the MapReduce v1 architecture, which function is performed by the JobTracker?
▪ Accepts MapReduce jobs submitted by clients
What OS command starts the ZooKeeper command-line interface?
zkCli.sh
Which of the following are examples of personally identifiable information (PII)?
Select the FOUR answers that apply
A. Medical record number
B. IP address
C. Email address
D. Time of interaction
Which component of the Spark Unified Stack supports learning algorithms such as, logistic
regression, naive Bayes classification, and SVM?
B. MLlib
Which statement describes "Big Data" as it is used in the modern business world?
Select an answer
A. Structured data stores containing very large data sets such as video and audio streams.
B. The summarization of large indexed data stores to provide information about potential problems or
opportunities.
Which component of the HDFS architecture manages the file system namespace and metadata?
NameNode
Which Hortonworks Data Platform (HDP) component provides a common web user interface for
applications running on a Hadoop cluster?
B. Ambari
Which feature allows application developers to easily use the Ambari interface to integrate Hadoop
provisioning, management, and monitoring capabilities into their own applications?
B. REST APIs
What are three examples of "Data Exhaust"?
A. log files
D. cookies
F. browser cache
Which is the primary advantage of using column-based data formats over record-based formats?
A. faster query execution
What is Hortonworks DataPlane Services (DPS) used for?
Select an answer
D. Manage, secure, and govern data stored across all storage environments.
Which two of the following are column-based data encoding formats?
C. ORC
D. Parquet
What Python package has support for linear algebra, optimization, mathematical integration, and
statistics?
SciPy
When sharing a notebook, what will always point to the most recent version of the notebook?
The permalink
Under the MapReduce v1 architecture, which element of MapReduce controls job execution on
multiple slaves?
JobTracker
Which environmental variable needs to be set to properly start ZooKeeper?
ZOOKEEPER_HOME
Which statement describes the action performed by HDFS when data is written to the Hadoop
cluster?
D. The data is spread out and replicated across the cluster.
Which Spark RDD operation creates a directed acyclic graph through lazy evaluations?
Transformations
Which feature allows the bigsql user to securely access data in Hadoop on behalf of another user?
Impersonation
Which component of the Spark Unified Stack provides processing of data arriving at the system in
real-time?
C. Spark Streaming
How does MapReduce use ZooKeeper?
Aid in the high availability of Resource Manager.
Which statement describes the purpose of Ambari?
B. It is used for provisioning, managing, and monitoring Hadoop clusters.
In 2003, IBM's System S was the first prototype of a new IBM Stream Computing solution that
performed which type of processing?
Real Time Analytic Processing
The Enterprise Edition of BigInsights offers which feature not available in the Standard Edition?
Adaptive MapReduce
What do the query languages Hive and Big SQL have in common?
Both require schema.
What is a primary reason that business users want to use BigSheets?
It includes a very easy-to-use command-line interface.
Which two Hadoop query languages do not require data to have a schema? (Choose two.)
pig
jaql
Which BigSheets component applies a schema to the underlying data at runtime?
Reader
Which statement is true about storage of the output of REDUCE task?
It is stored in HDFS, but only one copy on the local machine.
What happens if a child task hangs in a MapReduce job?
JobTracker reschedules the task on another machine.
What is IBM's SQL interface to InfoSphere BigInsights?
Big SQL
What is the default data block size in BigInsights?
128 MB
What is the process in MapReduce that moves all data from one key to the same worker node?
Shuffle
In the master/slave architecture, what is considered the slave?
DATANODE
Under the MapReduce architecture, how does a JobTracker detect the failure of a TaskTracker?
receives no heartbeat
Which capability of Jaql gives it a significant advantage over other query languages?
It can handle deeply nested, semi-structured data.
What is the default number of replicas in HDFS replication?
3
Which AQL statement can be used to create components of a Text Analytics extractor?
create view <view_name> as <select or extract statement>;
Which BigInsights tool is used to access the BigInsights Applications Catalog?
Web Console
In Hadoop's rack-aware replica placement, what is the correct default block node placement?
1 block in 1 rack, 2 blocks in a second rack
Which file formats can Jaql read?
JSON, Avro, Delimited
Which statement is true about the data model used by HBase?
The table schema only defines column families.
What is the open-source implementation of BigTable, Google's extremely scalable storage system?
HBase
Where does HDFS store the file system namespace and properties?
FsImage
What does GPFS offer that HDFS does not?
split on one local disk
High availability was added to which part of the HDFS file system in HDFS 2.0 to prevent loss of
metadata?
NameNode
Which administrative console feature of BigInsights is a visualization and analysis tool designed to
work with structured and unstructured data?
BigSheets
Which Hadoop query language was developed by Yahoo to handle almost any type of data?
PIG
Which two file actions can HDFS complete? (Choose two.)
get
list
Which BigSheets component presents a spreadsheetlike representation of data
Workbook
Which component of IBM Watson forms the foundation of the framework and allows Watson to extract
and index data from any source?
search engine
Which development environment can be used to develop programs for Text Analytics?
Eclipse
Your company wants to utilize text analytics to gain an understanding of general public opinion
concerning company products. Which type of input source would you analyze for that information?
Twitter feeds
Which BigInsights feature helps to extract information from text data?
Text Analytics Engine
Which class of software can store and manage only structured data?
Data Warehouse
What is the process in MapReduce that moves all data from one key to the same worker node?
Shuffle
Which two pre-requisites must be fulfilled when running a Java MapReduce program on the Cluster,
using Eclipse? (Choose two.) (Please select ALL that apply)
C Hadoop services must be running.
D BigInsights services must be running.
Which two commands are used to retrieve data from an HBase table? (Choose two.) (Please select
ALL that apply)
scan
get
You have been asked to create an HBase table and populate it with all the sales transactions,
generated in the company in the last quarter. Currently, these transactions reside in a 300 MB tab
delimited file in HDFS. What is the most efficient way for you to accomplish this task?
pre-create regions by specifying splits in create table command and bulk loading the data
What are two main components of a Java MapReduce job? (Choose two.) (Please select ALL that
apply)
A Mapper class which should extend org.apache.hadoop.mapreduce.Mapper class
D Reducer class which should extend org.apache.hadoop.mapreduce.Reducer class
Which element(s) must be specified when creating an HBase table?
only the table name and column family(s)
Hadoop is the primary software tool used for which class of computing?
C Big Data Analytics
Which BigSheets component presents a spreadsheet-like representation of data?
Workbook
Which two technologies form the foundation of Hadoop? (Choose two.) (Please select ALL that apply)
B MapReduce
C HDFS
Which tool is included as part of the IBM BigInsights Eclipse development environment?
code generator
Which class of software can store and manage only structured data?
C Data Warehouse
Why does Big SQL perform better than Hive?
It uses sub-queries.
Which BigInsights feature helps to extract information from text data?
C Text Analytics Engine
Which type of cell can be used to document and comment on a process in a Jupyter
notebook?
B.Markdown
Which Hadoop ecosystem tool can import data into a Hadoop cluster from a DB2, MySQL, or
other databases?
D.sqoop
Under the MapReduce v1 programming model, which shows the proper order of the full set
of MapReduce phases?
Map -> Combine -> Shuffle -> Reduce
How can a Sqoop invocation be constrained to only run one mapper?
A.Use the -m 1 parameter.
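For instance (connection string and table are hypothetical):
sqoop import --connect jdbc:mysql://dbhost/sales --table customers -m 1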
Which Apache Hadoop application provides an SQL-like interface to allow abstraction of data
on semi-structured data in a Hadoop datastore?
D.Hive
Which component of the Apache Ambari architecture integrates with an organization's LDAP
or Active Directory service?
D.Authorization Provider
Under the YARN/MRv2 framework, the Scheduler and ApplicationsManager are components
of which daemon?
B.ResourceManager
Which component of the Apache Ambari architecture provides statistical data to the
dashboard about the performance of a Hadoop cluster?
B.Ambari Metrics System
Apache Spark can run on which two of the following cluster managers?
B.Apache Mesos
C.Hadoop YARN
What are two ways the command-line parameters for a Sqoop invocation can be simplified?
C.Place the commands in a file.
D.Include the --options-file command line argument.
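A hedged sketch (file name and contents are hypothetical): the recurring arguments live in a file, one option per line, and are pulled in with --options-file:
sqoop --options-file /home/user/import-opts.txt --table customers
where import-opts.txt contains lines such as: import, --connect, jdbc:mysql://dbhost/sales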
Which Apache Hadoop component can potentially replace an RDBMS as a large Hadoop
datastore and is particularly good for "sparse data"?
C.HBASE
Which feature makes Apache Spark much easier to use than MapReduce?
B.Libraries that support SQL queries.
Which Spark Core function provides the main element of Spark API?
A.RDD
Which tool would you use to create a connection to your Big SQL database?
A.DSM
Which Big SQL feature allows users to join a Hadoop data set to data in external databases?
C.Fluid query
When connecting to an external database in a federation, you need to use the correct
database driver and protocol. What is this federation component called in Big SQL?
D.Wrapper
You need to determine the permission setting for a new schema directory. Which tool would
you use?
A.umask
Using the Java SQL Shell, which command will connect to a database called mybigdata?
D./jsqsh mybigdata
You need to enable impersonation. Which two properties in the bigsql-conf.xml file need to
be marked true?
C.bigsql.alltables.io.doAs
E.bigsql.impersonation.create.table.grant.public
You are creating a new table and need to format it with parquet. Which partial SQL statement
would create the table in parquet format?
A.STORED AS parquetfile
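For example (the table definition is hypothetical):
CREATE HADOOP TABLE sales (id INT, amount DOUBLE) STORED AS parquetfile;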
You have a distributed file system (DFS) and need to set permissions on the the
/hive/warehouse directory to allow access to ONLY the bigsql user. Which command would
you run?
D.hdfs dfs -chmod 700 /hive/warehouse
Which two commands would you use to give or remove certain privileges to/from a user?
A.GRANT
E.REVOKE
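For instance (table and user are hypothetical):
GRANT SELECT ON TABLE sales TO USER alice;
REVOKE SELECT ON TABLE sales FROM USER alice;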
What does the user interface for Jupyter look like to a user?
C.App in web browser
Why might a data scientist need a particular kind of GPU (graphics processing unit)?
A.To perform certain data transformation quickly.
Which data encoding format supports exact storage of all data in binary representations such
as VARBINARY columns?
C.SequenceFiles
Under the YARN/MRv2 framework, which daemon arbitrates the execution of tasks among
all the applications in the system?
C.ResourceManager
Which Apache Hadoop application provides a high-level programming language for data
transformation on unstructured data?
B.Pig
What are three IBM value-add components to the Hortonworks Data Platform (HDP)?
A.Big Match
C.Big Replicate
D.Big SQL
Which statement is true about the Combiner phase of the MapReduce architecture?
B.It reduces the amount of data that is sent to the Reducer task nodes.
Which component of the Spark Unified Stack allows developers to intermix structured
database queries with Spark's programming language?
D.Spark SQL
Which NoSQL datastore type began as an implementation of Google's BigTable that can store any
type of data and scale to many petabytes?
A.HBase
Which statement accurately describes how ZooKeeper works?
B.All servers keep a copy of the shared data in memory.
Apache Spark provides a single, unifying platform for which three of the following types of
operations?
B.batch processing
C.machine learning
E.graph operations
Under the YARN/MRv2 framework, the JobTracker functions are split into which two
daemons?
A.ApplicationMaster
E.ResourceManager
Which Apache Hadoop application provides an SQL-like interface to allow abstraction of data
on semi-structured data in a Hadoop datastore?
A.Hive
Which directory permissions need to be set to allow all users to create their own schema?
D.777
Before you create a Jupyter notebook in Watson Studio, which two items are necessary?
C.Project
D.Spark Instance
You need to add a collaborator to your project. What do you need?
D.The email address of the collaborator
You need to add a collaborator to your project. What do you need?
The email address of the collaborator
Which primary computing bottleneck of modern computers is addressed by Hadoop?
disk latency
Which capability does IBM BigInsights add to enrich Hadoop?
Adaptive MapReduce
What is one of the two technologies that Hadoop uses as its foundation?
MapReduce
Which Hadoop-related project provides common utilities and libraries that support other
Hadoop sub projects?
Hadoop Common
What is one of the four characteristics of Big Data?
Volume
Which description identifies the real value of Big Data and Analytics?
gaining new insight through the capabilities of the world's interconnected intelligence
Which Big Data function improves the decision-making capabilities of organizations by
enabling the organizations to interpret and evaluate structured and unstructured data in
search of valuable business information?
analytics
Which type of Big Data analysis involves the processing of extremely large volumes of
constantly moving data that is impractical to store?
Stream Computing
Which statement is true about Hadoop Distributed File System (HDFS)?
Data is accessed through MapReduce
What is one function of the JobTracker in MapReduce?
keeps the work physically close to the data
What is a characteristic of IBM GPFS that distinguishes it from other distributed file systems?
posix compliance
In which step of a MapReduce job is the output stored on the local disk?
Map
To run a MapReduce job on the BigInsights cluster, which statement about the input file(s)
must be true?
The file(s) must be stored in HDFS or GPFS
Which command helps you create a directory called mydata on HDFS?
hadoop fs -mkdir mydata
Following the most common HDFS replica placement policy, when the replication factor is
three, how many replicas will be located on the local rack?
one
When running a MapReduce job from Eclipse, which BigInsights execution models are
available? (Select two.)
Cluster
Local
How are Pig and Jaql query languages similar?
Both are data flow languages.
Which command displays the sizes of files and directories contained in the given directory,
or the length of a file, in case it is just a file?
hadoop fs -du
Which statement is true regarding the number of mappers and reducers configured in a
cluster?
The number of mappers and reducers can be configured by modifying the mapred-site.xml
file.
What key feature does HDFS 2.0 provide that HDFS does not?
high availability of the NameNode
if you need to change the replication factor or increase the default storage block size, which
file do you need to modify?
hdfs-site.xml
What is one of the two driving principles of MapReduce?
spread data across a cluster of computers
Which statement represents a difference between Pig and Hive?
Pig uses Load, Transform, and Store.
In the MapReduce processing model, what is the main function performed by the
JobTracker?
coordinates the job execution
Under the HDFS architecture, what is one purpose of the NameNode?
to regulate client access to files
In addition to the high-level language Pig Latin, what is a primary component of the Apache
Pig platform?
runtime environment
What are two of the core operators that can be used in a Jaql query? (Select two.)
JOIN
TOP
Under the MapReduce programming model, which task is performed by the Reduce step?
Data is aggregated by worker nodes.
Which command should be used to list the contents of the root directory in HDFS?
hadoop fs -ls /
Which element of the MapReduce architecture runs map and reduce jobs?
TaskTracker
Which type of language is Pig?
data flow
Which Hive command is used to query a table?
SELECT
Which technology does Big SQL utilize for access to shared catalogs?
Hive metastore
In Hive, what is the difference between an external table and a Hive managed table?
An external table refers an existing location outside the warehouse directory.
Which is a use-case for Text Analytics?
sentiment analytics from social media blogs
Which utility provides a command-line interface for Hive?
Hive shell
What is an accurate description of HBase?
It is an open source implementation of Google's BigTable.
What drives the demand for Text Analytics?
Most of the world's data is in unstructured or semi-structured text.
Which statement about NoSQL is true?
It is a database technology that does not use the traditional relational model.
Which statement will make an AQL view have content displayed?
output view <view_name>
Which command can be used in Hive to list the tables available in a database/schema?
show tables
Which tool is used for developing a BigInsights Text Analytics extractor?
Eclipse with BigInsights tools for Eclipse plugin
You work for a hosting company that has data centers spread across North America. You
are trying to resolve a critical performance problem in which a large number of web servers
are performing far below expectations. You know that the information written to log files can
help determine the cause of the problem, but there is too much data to manage easily. Which
type of Big Data analysis is appropriate for this use case?
Text Analytics
What makes SQL access to Hadoop data difficult?
Data is in many formats.
Why develop SQL-based query languages that can access Hadoop data sets?
because the MapReduce Java API is sometimes difficult to use
What is the "scan" command used for in HBase?
to view data in an Hbase table
Which tool is used to access BigSheets?
Web Browser
In HBase, what is the "count" command used for?
to count the number of rows in a table
Which Hadoop-related technology provides a user-friendly interface, which enables business
users to easily analyze Big Data?
BigSheets
What is the most efficient way to load 700MB of data when you create a new HBase table?
Pre-create regions by specifying splits in create table command and bulk loading the data.
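A hedged sketch of that approach (table name, column family, split keys, and paths are all hypothetical): pre-split the table at creation time, then bulk load instead of issuing individual puts:
create 'sales', 'cf', SPLITS => ['1000', '2000', '3000']
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,cf:amount -Dimporttsv.bulk.output=/tmp/hfiles sales /data/sales.tsv
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hfiles sales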
Which key benefit does NoSQL provide?
It can cost-effectively manage data sets too large for traditional RDBMS.
Which Hadoop-related technology supports analysis of large datasets stored in HDFS using
an SQL-like query language?
Hive
If you need to JOIN data from two workbooks, which operation should be performed
beforehand?
"Load" to create a new sheet with the other workbook data in the current workbook
The following sequence of commands is executed:
create 'table_1','column_family1','column_family2'
put 'table_1','row1','column_family1:c11','r1v11'
put 'table_1','row2','column_family1:c12','r1v12'
put 'table_1','row2','column_family2:c21','r1v21'
put 'table_1','row3','column_family1:d11','r1v11'
put 'table_1','row2','column_family1:d12','r1v12'
put 'table_1','row2','column_family2:d21','r1v21'
In HBase, which value will the "count 'table_1'" command return?
3 (count returns the number of rows, i.e. the distinct row keys row1, row2, and row3, not the number of cells)
Which IBM Big Data solution provides low-latency analytics for processing data-in-motion?
InfoSphere Streams
What is one of the main components of Watson Explorer (InfoSphere Data Explorer)?
crawler
How can the applications published to BigInsights Web Console be made available for users
to execute?
They need to be deployed with proper privileges.
Which IBM tool enables BigInsights users to develop, test and publish BigInsights
applications?
Eclipse
IBM InfoSphere Streams is designed to accomplish which Big Data function?
analyze and react to data in motion before it is stored
Which component of Apache Hadoop is used for scheduling and running workflow jobs?
Oozie
Under the YARN/MRv2 framework, the Scheduler and Applications Manager are
components of which daemon?
ResourceManager
What two security functions does Apache Knox provide?
Proxying services.
API and perimeter security
What are the breakpoints which can be set for a Java MapReduce program in Eclipse?
stop points to suspend the program execution for debugging purposes
What is NOT a similarity of Pig, Hive and Jaql?
Designed for random reads/writes or low latency queries
What is NOT true about Hive?
Designed for low latency queries, like RDBMS such as DB2 and Netezza
What is true regarding Hive External Tables?
data is stored outside the Hive warehouse directory
What Hive command list all the base tables and views in a database?
show tables
Which is not an ACID property?
Concurrency
You need to add a collaborator to your project. What do you need?
The email address of the collaborator
Where does the unstructured data of a project reside in Watson Studio?
Object Storage
Which Watson Studio offering used to be available through something known as IBM
Bluemix?
Watson Studio Cloud
What is the architecture of Watson Studio centered on?
Projects
Which type of cell can be used to document and comment on a process in a Jupyter
notebook?
Markdown
Which BigInsights tool is used to access the BigInsights Applications Catalog?
Web Console
Which Big Data function improves the decision-making capabilities of organizations by
enabling the organizations to interpret and evaluate structured and unstructured data in
search of valuable business information?
analytics
Which Hadoop-related technology provides a user-friendly interface, which enables business
users to easily analyze Big Data?
BigSheets
Which IBM Big Data solution provides low-latency analytics for processing data-in-motion?
InfoSphere Streams
- What is the correct sequence of the three main steps in Text Analytics?
Select an answer
A. index text: categorize subjects; parse the data
B. structure text; derive patterns; interpret output
A. EditLog
B. CheckPoint
C. FSImage
D. DataNode
- Which statement is true about where the output of MAP task is stored?
Select an answer
A. Eclipse Plug-in
B. Eclipse Console
C. Web Console
D. Application Wizard
- In Hadoop's rack-aware replica placement, what is the correct default block node placement?
- IBM Text Analytics is embedded into the key components of which IBM solution?
Select an answer
A. Rational
B. Eclipse
C. Hadoop
D. InfoSphere BigInsights
- Which of the four characteristics of Big Data indicates that many data formats can be stored and
analyzed in Hadoop?
A. Velocity
B. Volume
C. Volatility
D. Variety
- Which Advanced Analytics toolkit in InfoSphere Streams is used for developing and building
predictive models?
A. Time Series
B. CEP
C. Geospatial
D. SPSS
What is one of the four characteristics of Big Data?
Your answer
A volatility
B volume
C verifiability
D value
Question: 2
Which description identifies the real value of Big
Data and Analytics?
Your answer
A providing solutions to help customers manage and grow large database systems
B gaining new insight through the capabilities of the world's interconnected intelligence
D using modern technology to efficiently store the massive amounts of data generated
Question: 3
What is one of the two technologies that Hadoop
uses as its foundation?
Your answer
A HBase
Your answer
B Jaql
C Apache
D MapReduce
Question: 4
Which capability does IBM BigInsights add to enrich
Hadoop?
Your answer
B Adaptive MapReduce
C Jaql
Question: 5
Which type of Big Data analysis involves the
processing of extremely large volumes of constantly
moving data that is impractical to store?
Your answer
A MapReduce
B Stream Computing
C Text Analysis
Your answer
Question: 6
Which Hadoop-related project provides common
utilities and libraries that support other Hadoop sub
projects?
Your answer
A BigTable
B Hadoop HBase
C MapReduce
D Hadoop Common
Question: 7
Which primary computing bottleneck of modern
computers is addressed by Hadoop?
Your answer
B 64-bit architecture
C MIPS
D disk latency
Question: 8
Which Big Data function improves the decision-
making capabilities of organizations by enabling the
organizations to interpret and evaluate structured
and unstructured data in search of valuable
business information?
Your answer
A data warehousing
C stream computing
D analytics
Question: 9
Under the HDFS architecture, what is one purpose of
the NameNode?
Your answer
Question: 10
In addition to the high-level language Pig Latin, what
is a primary component of the Apache Pig platform?
Your answer
D runtime environment
Question: 11
Which statement represents a difference between
Pig and Hive?
Your answer
Question: 12
What are two of the core operators that can be used
in a Jaql query? (Select two.)
Your answer
A LOAD
B TOP
C JOIN
D SELECT
Question: 13
In the MapReduce processing model, what is the
main function performed by the JobTracker?
Your answer
Question: 14
Under the MapReduce programming model, which
task is performed by the Reduce step?
Your answer
Question: 15
Following the most common HDFS replica
placement policy, when the replication factor is
three, how many replicas will be located on the local
rack?
Your answer
A two
Your answer
B three
C none
D One
Question: 16
Which command helps you create a directory called
mydata on HDFS?
Your answer
C mkdir mydata
Question: 17
What is one function of the JobTracker in
MapReduce?
Your answer
D manages storage
Question: 18
Which element of the MapReduce architecture runs
map and reduce jobs?
Your answer
A TaskTracker
B JobScheduler
C Reducer
D JobTracker
Question: 19
What is a characteristic of IBM GPFS that
distinguishes it from other distributed file systems?
Your answer
C posix compliance
Question: 20
Which type of language is Pig?
Your answer
A object oriented
B SQL-like
C data flow
D compiled language
Question: 21
Which statement is true regarding the number of
mappers and reducers configured in a cluster?
Your answer
B The number of mappers and reducers can be configured by modifying the mapred-site.xml file.
Your answer
Question: 22
Which statement is true about Hadoop Distributed
File System (HDFS)?
Your answer
Question: 23
What is one of the two driving principles of
MapReduce?
Your answer
Question: 24
How are Pig and Jaql query languages similar?
Your answer
Question: 25
When running a MapReduce job from Eclipse, which
BigInsights execution models are available? (Select
two.)
Your answer
A Remote
B Cluster
Your answer
C Local
D Distributed
E Debugging
Question: 26
If you need to change the replication factor or
increase the default storage block size, which file do
you need to modify?
Your answer
A hadoop.conf
B hdfs-site.xml
C hadoop-configuration.xml
D hdfs.conf
Question: 27
Which command displays the sizes of files and
directories contained in the given directory, or the
length of a file, in case it is just a file?
Your answer
A hdfs -du
Your answer
B hadoop fs -du
C hdfs fs size
D hadoop size
Question: 28
In which step of a MapReduce job is the output
stored on the local disk?
Your answer
A Combine
B Reduce
C Shuffle
D Map
Question: 29
Which command should be used to list the contents
of the root directory in HDFS?
Your answer
A hadoop fs -ls /
Your answer
B hadoop fs list
C hdfs root
D hdfs list /
Question: 30
What key feature does HDFS 2.0 provide that HDFS
does not?
Your answer
Question: 31
To run a MapReduce job on the BigInsights cluster,
which statement about the input file(s) must be true?
Your answer
A The file(s) must be stored on the local file system where the map reduce job was developed.
Your answer
C No matter where the input files are before, they will be automatically copied to where the job runs
Question: 32
Which is a use-case for Text Analytics?
Your answer
Question: 33
Which command can be used in Hive to list the
tables available in a database/schema?
Your answer
A show tables
B list tables
Your answer
C describe tables
D show all
Question: 34
Which Hive command is used to query a table?
Your answer
A GET
B TRANSFORM
C EXPAND
D SELECT
Question: 35
Why develop SQL-based query languages that can
access Hadoop data sets?
Your answer
C because data stored in a Hadoop cluster lends itself to structured SQL queries
Your answer
Question: 36
What drives the demand for Text Analytics?
Your answer
B Text Analytics is the most common way to derive value from Big Data.
Question: 37
Which tool is used to access BigSheets?
Your answer
A Web Browser
B BigSheets client
C Microsoft Excel
D Eclipse
Question: 38
What is an accurate description of HBase?
Your answer
Question: 39
In HBase, what is the "count" command used for?
Your answer
Question: 40
Which key benefit does NoSQL provide?
Your answer
B It can cost-effectively manage data sets too large for traditional RDBMS.
Question: 41
Which utility provides a command-line interface for
Hive?
Your answer
B Thrift client
D Hive shell
Question: 42
Which Hadoop-related technology supports analysis
of large datasets stored in HDFS using an SQL-like
query language?
Your answer
A Jaql
Your answer
B HBase
C Pig
D Hive
Question: 43
Which statement will make an AQL view have
content displayed?
Your answer
Question: 44
What makes SQL access to Hadoop data difficult?
Your answer
Question: 45
The following sequence of commands is executed:
create 'table_1','column_family1','column_family2'
put 'table_1','row1','column_family1:c11','r1v11'
put 'table_1','row2','column_family1:c12','r1v12'
put 'table_1','row2','column_family2:c21','r1v21'
put 'table_1','row3','column_family1:d11','r1v11'
put 'table_1','row2','column_family1:d12','r1v12'
put 'table_1','row2','column_family2:d21','r1v21'
In HBase, which value will the "count 'table_1'"
command return?
Your answer
A 6
B 4
C 2
D 3
Question: 46
In Hive, what is the difference between an external
table and a Hive managed table?
Your answer
A An external table refers to the data stored on the local file system.
Question: 47
If you need to JOIN data from two workbooks, which
operation should be performed beforehand?
Your answer
B "Copy" to create a new sheet with the other workbook data in the current workbook
D "Load" to create a new sheet with the other workbook data in the current workbook
Question: 48
Which statement about NoSQL is true?
Your answer
C It is a database technology that does not use the traditional relational model.
D It provides all the capabilities of an RDBMS plus the ability to manage Big Data.
Question: 49
Which technology does Big SQL utilize for access to
shared catalogs?
Your answer
A RDBMS
B Hive metastore
C HCatalog
D MapReduce
Question: 50
Which Hadoop-related technology provides a user-
friendly interface, which enables business users to
easily analyze Big Data?
Your answer
A Avro
Your answer
B HBase
C BigSQL
D BigSheets
Question: 51
You work for a hosting company that has data
centers spread across North America. You are trying
to resolve a critical performance problem in which a
large number of web servers are performing far
below expectations. You know that the information
written to log files can help determine the cause of
the problem, but there is too much data to manage
easily. Which type of Big Data analysis is
appropriate for this use case?
Your answer
A Data Warehousing
B Stream Computing
C Temporal Analysis
D Text Analytics
Question: 52
What is the "scan" command used for in HBase?
Your answer
Question: 53
Which tool is used for developing a BigInsights Text
Analytics extractor?
Your answer
D AQLBuilder
Question: 54
What is the most efficient way to load 700MB of data
when you create a new HBase table?
Your answer
A Pre-create regions by specifying splits in create table command and use the insert command to load the data.
B Pre-create the column families when creating the table and bulk loading the data.
C Pre-create regions by specifying splits in create table command and bulk loading the data.
D Pre-create the column families when creating the table and use the put command to load the data.
Question: 55
Which IBM tool enables BigInsights users to
develop, test and publish BigInsights applications?
Your answer
A Eclipse
B Avro
D HBase
Question: 56
IBM InfoSphere Streams is designed to accomplish
which Big Data function?
Your answer
Question: 57
How can the applications published to BigInsights
Web Console be made available for users to
execute?
Your answer
Question: 58
Which component of Apache Hadoop is used for
scheduling and running workflow jobs?
Your answer
A Task Launcher
B Jaql
C Eclipse
D Oozie
Question: 59
Which IBM Big Data solution provides low-latency
analytics for processing data-in-motion?
Your answer
A InfoSphere Streams
B InfoSphere BigInsights
Question: 60
What is one of the main components of Watson
Explorer (InfoSphere Data Explorer)?
Your answer
A replicater
B crawler
C compressor
D validater
You have completed the test for Unit 1. Big Data Overview (ltu47599).
You scored 100%. Your score has been recorded.
Q2. Which of the following statements is not true about the 5 key Big Data Use Cases?
Your Answer: Enhanced 360-degree View of the customer extends existing customer views by incorporating internal data sources only.
Correct Answer: Enhanced 360-degree View of the customer extends existing customer views by incorporating internal data sources only.
Q3. Which of the following Big Data Platform components can cost effectively analyze petabytes of structured and unstructured data at rest?
Your Answer: Hadoop System
Correct Answer: Hadoop System
You have completed the test for Unit 2. Hadoop big data analysis tool (Itu47600).
You scored 66%. Your score has been recorded.
Q1. What is NOT true about Hadoop Distributed File System (HDFS)?
Your Answer: Designed for random access not streaming reads
Correct Answer: Designed for random access not streaming reads
Q2. True or False: In HDFS, the data blocks are replicated to multiple nodes
Your Answer: True
Correct Answer: True
Q1. HDFS command for reporting basic filesystem information and statistics:
Your Answer: hadoop dfsadmin -report
Correct Answer: hadoop dfsadmin -report
Q2. Select two methods for browsing files stored to HDFS in BigInsights
Your Answer: Command-line approach using the format: hadoop fs, BigInsights Web Console's Files tab
Correct Answer: Command-line approach using the format: hadoop fs, BigInsights Web Console's Files tab
Q3. Hadoop command for copying files from the local file system to HDFS:
Your Answer: hadoop fs -cp
Correct Answer: hadoop fs -put
You have completed the test for Unit 4. MapReduce (ltu47602).
You scored 100%. Your score has been recorded.
Q2. True or False: Map Tasks need Key and Value pairs as input.
Your Answer: True
Correct Answer: True
Q3. Which Java class is responsible for taking an HDFS file and transforming it into splits?
Your Answer: InputSplitter
Correct Answer: InputSplitter
Q4. Regarding task failure: if a child task fails, where does the child JVM report before it exits?
Your Answer: TaskTracker
Correct Answer: TaskTracker
You have completed the test for Exercise 2. MapReduce Lab (ltu47557).
You scored 66%. Your score has been recorded.
Q1. What are the main components of a Java MapReduce program? (Select 3)
Your Answer: Map class, Reduce class, Driver or main class
Correct Answer: Map class, Reduce class, Driver or main class
Q2. What are two execution modes of a Java MapReduce program developed in Eclipse
Your Answer: Cluster, Local
Correct Answer: Cluster, Local
Q3. What are the breakpoints which can be set for a Java MapReduce program in Eclipse?
Your Answer: stop points for executing only a specified section of the program
Correct Answer: stop points to suspend the program execution for debugging purposes
You have completed the test for Unit 5. Hadoop Query Languages (ltu47603).
You scored 100%. Your score has been recorded.
Q3. True or False: A Jaql query can be thought of as a stream of operations.
Your Answer: True
Correct Answer: True
Q1. True or False: Data Warehouse augmentation is a very common use case for Hadoop.
Your Answer: True
Correct Answer: True
Q2. The clause that indicates the storage file/record format on HDFS:
Your Answer: stored by
Correct Answer: stored as
Q4. A variable name that determines whether the map/reduce jobs should be submitted through a separate JVM in the non-local mode:
Your Answer: hive.exec.submitviachild
Correct Answer: hive.exec.submitviachild
You have completed the test for Exercise 3. Hive Lab (ltu47559).
You scored 100%. Your score has been recorded.
Q2. What Hive command lists all the base tables and views in a database?
Your Answer: show tables
Correct Answer: show tables
Q3. What Hive command is used for loading data from a local file to HDFS?
Your Answer: LOAD DATA LOCAL INPATH ...
Correct Answer: LOAD DATA LOCAL INPATH ...
You have completed the test for Unit 7. HBase (ltu47604).
You scored 50%. Your score has been recorded.
Q3. True or False: HBase region servers are responsible for serving and managing regions.
Your Answer: False
Correct Answer: True
Q1. Which of the following statements is true about the CREATE statement in HBase?
Your Answer: It only requires the name of table and one or more column families
Correct Answer: It only requires the name of table and one or more column families
✓ Q2. Which command must be executed before deleting an HBase table or changing its settings?
Your Answer: disable
Correct Answer: disable
✓ Q3. Which command is used in HBase for inserting data in a table?
Your Answer: put
Correct Answer: put
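A hedged HBase shell sketch combining the answers above (the table and column family names are hypothetical):
create 'mytable', 'cf'                    # CREATE needs only a table name and one or more column families
put 'mytable', 'row1', 'cf:col1', 'abc'   # insert data
disable 'mytable'                         # required before dropping or changing settings
drop 'mytable'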
You have completed the test for Unit 8: Big SQL (ltu47606).
You scored 80%. Your score has been recorded.
Q2. Which of the following tools can be used to access a Big SQL server?
Your Answer: All of the above
Correct Answer: All of the above
✓ Q3. Which of the following data types are not supported by Big SQL?
Your Answer: byte
Correct Answer: byte
✓ Q4. True or False: You can insert records into a Big SQL table mapped to an HBase table.
Your Answer: True
Correct Answer: True
✗ Q5. True or False: You can insert records into a Big SQL table mapped to a Hive table.
Your Answer: True
Correct Answer: False
You have completed the test for Exercise 5. Big SQL Lab (ltu47560).
You scored 66%. Your score has been recorded.
✗ Q1. Which statement can be used to load data into a Big SQL table?
Your Answer: load data
Correct Answer: load hadoop
Q2. Which of the following is true about the CREATE HADOOP TABLE statement?
Your Answer: It creates a Big SQL table and the data will be stored in HDFS/GPFS
Correct Answer: It creates a Big SQL table and the data will be stored in HDFS/GPFS
Q3. Which of the following statements represents a difference between a Big SQL table and a Db2 table?
Your Answer: When creating a Big SQL table, information on how the data is stored on disk must be provided
Correct Answer: When creating a Big SQL table, information on how the data is stored on disk must be provided
You have completed the test for Unit 9. JAQL (ltu47607).
You scored 100%. Your score has been recorded.
Q3. The function that explicitly sends each element of an array to a mapper.
Your Answer: arrayRead()
Correct Answer: arrayRead()
You have completed the test for Exercise 6. JAQL Lab (ltu47561).
You scored 100%. Your score has been recorded.
✓ Q2. Which of the following Jaql operators helps with projecting or retrieving a subset of columns or fields from a data set?
Your Answer: transform
Correct Answer: transform
"of 02. What component of Apache Hadoop can he used for scheduling and running workflow jobs?
Your Answer: Oozie
Correct Answer: Oozie
You have completed the test for Exercise 7: Application Development Lab (ltu47562).
You scored 66%. Your score has been recorded.
Q1. Select a BigInsights execution mode for an application or script developed in Eclipse.
Your Answer: Cluster
Correct Answer: Cluster
X Q2. What step needs to be performed for an application published to BigInsights Web Console to become available for users to run it?
Your Answer: It needs to be copied under user home directory
Correct Answer: It needs to be deployed with proper privileges
Q3. Which part of the BigInsights Web Console provides information that would help a user troubleshoot and diagnose a failed application job?
Your Answer: Application Status tab
Correct Answer: Application Status tab
You scored 100%. Your score has been recorded.
Q1. What is BigSheets?
Your Answer: All of the above
Correct Answer: All of the above
Q3. What is the 'Add Sheets' option that helps calculate values by grouping the data in the workbook, applying functions to each group, and carrying over data?
Your Answer: Pivot
Correct Answer: Pivot
✓ Q4. Provides 10+ built-in functions to extract names, addresses, organizations, email and phone numbers.
Your Answer: Text Analytics Integration
Correct Answer: Text Analytics Integration
You have completed the test for Exercise 8. browser-based data analytics tool lab (ltu47563).
You scored 33%. Your score has been recorded.
✗ Q2. True or False: You can apply only a single operation to a sheet.
Your Answer: False
Correct Answer: True
...." Qi. An SQL style programming Language for text mining or text extraction?
Your Answer: AQL
Correct Answer: AQL
✓ Q3. Which is not a built-in scalar type of the AQL Data Model?
Your Answer: Char
Correct Answer: Char
✓ Q4. True or False: The Text Analytics Java API is part of the BigInsights Text Analytics components.
Your Answer: True
Correct Answer: True
You have completed the test for Unit 13. AQL Syntax (ltu47611).
You scored 75%. Your score has been recorded.
✓ Q3. True or False: Text Analytics is a powerful information extraction system providing multilingual support.
Your Answer: True
Correct Answer: True
✓ Q4. Line of code used to specify the module name at the beginning of an AQL file.
Your Answer: module
Correct Answer: module;
You have completed the test for Exercise 9. Email Analysis Lab (ltu47564).
You scored 66%. Your score has been recorded.
Q1. Which tool can be used for developing a Text Analytics extractor?
Your Answer: Eclipse with BigInsights Tools for Eclipse
Correct Answer: Eclipse with BigInsights Tools for Eclipse
Q2. Which of the following is an AQL statement that can be used to create a component of a Text Analytics extractor?
Your Answer: create view <viewname> as <select or extract statement>;
Correct Answer: create view <viewname> as <select or extract statement>;
✗ Q3. What statement can be used to display the content of an AQL view?
Your Answer: show content view viewname;
Correct Answer: output view <view_name>;
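For orientation, a hedged AQL fragment using the statements referenced above (the module name, view name, and regex are hypothetical):
module main;

create view PhoneNumber as
extract regex /\d{3}-\d{4}/ on D.text as number
from Document D;

output view PhoneNumber;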
You have completed the test for Unit 14. Streams (ltu47612).
You scored 100%. Your score has been recorded.
Q3. An Eclipse-based tool that enables developers to create, edit, visualize, test, debug, and run SPL and SPL mixed-mode applications.
Your Answer: InfoSphere Streams Studio
Correct Answer: InfoSphere Streams Studio
You have completed the test for Exercise 10: Streams Lab (ltu47565).
You scored 0%. Your score has been recorded.
✗ Q2. Which of the following statements is true regarding the Instance Graph available in InfoSphere Streams?
Your Answer: The Instance Graph provides a graphical representation of the network topology of the Cluster where Streams is installed and running.
Correct Answer: The Instance Graph provides a graphical view of the application that's currently running in an instance; it is a very helpful tool for debugging Streams applications at runtime.
✗ Q2. In the context of an enterprise search engine, what is the process of retrieving documents called?
Your Answer: Clustering
Correct Answer: Crawling
✗ Q3. A component of the Data Explorer Engine that processes raw data discovered by the crawler and produces one or more pieces of indexable data.
Your Answer: Retrieving
Correct Answer: Converting
✗ Q4. True or False: Data Explorer's collaborative search enables saving and sharing of results and queries among users.
Your Answer: False
Correct Answer: True
You have completed the test for Exercise 11: Data Explorer Lab (ltu47566).
You scored 33%. Your score has been recorded.
Q2. What should be done when configuring a Data Explorer project so that the search results are grouped by values of specific metadata, tags, or other parameters?
Your Answer: Add Binning to the Search Collection
Correct Answer: Add Binning to the Search Collection
✗ Q3. Which tool can be used to create and configure a Data Explorer search project?
Your Answer: Data Explorer Studio
Correct Answer: Data Explorer Engine administration tool
Question 29
Apache Spark can run on which two of the following cluster managers?
Question 30
Which component of the Spark Unified Stack allows developers to intermix structured database
queries with Spark's programming language?
Spark SQL
Question 31
Under the MapReduce v1 programming model, which shows the proper order of the full set of
MapReduce phases?
Question 32
Question 33
Question 34
Which component of the Apache Ambari architecture provides statistical data to the dashboard
about the performance of a Hadoop cluster?
Question 35
Question 36
REDIS
Question 37
Which statement is true about the Combiner phase of the MapReduce architecture?
It reduces the amount of data that is sent to the Reducer task nodes.
Question 38
Which feature makes Apache Spark much easier to use than MapReduce?
Question 39
Under the YARN/MRv2 framework, which daemon arbitrates the execution of tasks among all the
applications in the system?
ResourceManager
Question 40
Which Apache Hadoop application provides a high-level programming language for data
transformation on unstructured data?
Pig
Question 41
Under the YARN/MRv2 framework, which daemon is tasked with negotiating with the
NodeManager(s) to execute and monitor tasks?
ApplicationMaster
Question 42
MongoDB
Question 43
What are three IBM value-add components to the Hortonworks Data Platform (HDP)?
Big SQL
Big Replicate
Big Match
Question 44
Question 45
What is the name of the Hadoop-related Apache project that utilizes an in-memory architecture to
run applications faster than MapReduce?
Spark
Question 46
Apache Spark provides a single, unifying platform for which three of the following types of
operations?
batch processing
machine learning
graph operations
Question 47
Python
Java
Scala
Question 48
Which Apache Hadoop component can potentially replace an RDBMS as a large Hadoop datastore
and is particularly good for "sparse data"?
HBase
Question 49
Question 50
Question 51
Under the MapReduce v1 programming model, which optional phase is executed simultaneously
with the Shuffle phase?
Combiner
Question 52
What are two ways the command-line parameters for a Sqoop invocation can be simplified?
Question 53
If a Hadoop node goes down, which Ambari component will notify the Administrator?
Proxying services.
Question 55
Question 56
Which Apache Hadoop application provides an SQL-like interface to allow abstraction of data on
semi-structured data in a Hadoop datastore?
Hive
Question 57
Which data encoding format supports exact storage of all data in binary representations such as
VARBINARY columns?
SequenceFiles
Question 58
MapReduce
Question 59
org.apache.hadoop.mapred
Question 60
Which Hadoop ecosystem tool can import data into a Hadoop cluster from a DB2, MySQL, or other
databases?
Sqoop
Introduction to Big Data
• A tsunami of Big Data: huge volumes, data of different types and formats, impacting the
business at new and ever-increasing speeds.
Big Data refers to non-conventional strategies and innovative technologies used
by businesses and organizations to capture, manage, process, and make sense of
a large volume of data
The four classic dimensions of Big Data (4 Vs)
Volume: the main characteristic of big data is its huge volume, collected through various
sources.
Velocity: the speed or frequency at which data is collected, in various forms and from
different sources, for processing.
Veracity: the need to filter clean, relevant data out of big data in order to make accurate
decisions.
Value: the value of the information procured serves the whole purpose of big data: smart
decision making.
Or even more…
• Volume - how much data is there?
• Velocity - how quickly is the data being created, moved, or accessed?
• Variety - how many different types of sources are there?
• Veracity - can we trust the data?
• Validity - is the data accurate and correct?
• Viability - is the data relevant to the use case at hand?
• Volatility - how often does the data change?
• Vulnerability - can we keep the data secure?
• Visualization - how can the data be presented to the user?
• Value - can this data produce a meaningful return on investment?
Data workflow
Sqoop is a tool for moving data between structured (relational) databases and the related
Hadoop system. It works both ways: you can take data in your RDBMS and move it to
HDFS, and move data from HDFS to some other RDBMS.
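A hedged sketch of each direction (the JDBC connection string, credentials, table names, and directories are hypothetical):
sqoop import --connect jdbc:db2://dbserver:50000/sales --username dbuser \
  --table CUSTOMERS --target-dir /user/hadoop/customers
sqoop export --connect jdbc:db2://dbserver:50000/sales --username dbuser \
  --table RESULTS --export-dir /user/hadoop/results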
Flume: essentially, when you have large amounts of data, such as log files, that need to be
moved from one location to another, Flume is the tool.
Kafka : is a messaging system used for real-time data pipelines
Data Access
Hive is a data warehouse system built on top of Hadoop. Hive supports easy data
summarization, ad-hoc queries, and analysis of large data sets in Hadoop (Includes HCatalog)
Apache Pig is a platform for analyzing large data sets. Pig has its own language, called Pig
Latin, whose purpose is to simplify MapReduce programming. Pig Latin is a simple scripting
language that, once compiled, becomes MapReduce jobs to run against Hadoop data.
HBase is a columnar datastore, which means that the data is organized in columns, as opposed
to the traditional rows that a traditional RDBMS is based upon. HBase is modeled after Google's
BigTable and provides BigTable-like capabilities on top of Hadoop and HDFS. HBase is a
NoSQL datastore.
Accumulo is similar to HBase. You can think of Accumulo as a "highly secure HBase".
Phoenix enables online transaction processing and operational analytics in Hadoop for low-
latency applications. Essentially, it is SQL for a NoSQL database.
Storm is designed for fast, real-time computation. It is used to process large volumes of
high-velocity data, and is useful when milliseconds of latency matter and Spark isn't fast enough.
Solr is a search platform built on the Apache Lucene Java search library. It is designed for
full-text indexing and searching.
Spark : is a fast and general engine for large-scale data processing.
Druid is a datastore designed for business intelligence (OLAP) queries. Druid provides real-
time data ingestion, query, and fast aggregations. It integrates with Apache Hive to build
OLAP cubes and run sub-second queries.
Data Lifecycle and Governance
Falcon : is used for managing the data life cycle in Hadoop clusters
Atlas : It provides features for data classification, centralized auditing, centralized lineage,
and security and policy engine. It integrates with the whole enterprise data ecosystem.
Security
Ranger is used to control data security across the entire Hadoop platform. The Ranger console
can manage policies for access to files, folders, databases, tables and columns. The policies
can be set for individual users or groups.
Knox is a gateway for the Hadoop ecosystem. It provides perimeter level security for
Hadoop. You can think of Knox like the castle walls, where within walls is your Hadoop
cluster.
OPERATIONS
Ambari : For provisioning, managing, and monitoring Apache Hadoop clusters. Provides
intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs
Cloudbreak is a tool for managing clusters in the cloud. Cloudbreak is a Hortonworks project,
and is currently not a part of Apache. It automates the launch of clusters into various cloud
infrastructure platforms.
ZooKeeper provides a centralized service for maintaining configuration information, naming,
providing distributed synchronization and providing group services across your Hadoop
cluster
Oozie is a workflow scheduler system to manage Hadoop jobs. Oozie is integrated with the
rest of the Hadoop stack. Oozie workflow jobs are Directed Acyclic Graphs (DAGs)
of actions. At the heart of this is YARN.
TOOLS
Zeppelin is a web-based notebook designed for data scientists to easily and quickly
explore datasets through collaboration. Zeppelin allows for interaction and visualization of
large datasets.
Ambari Views provides a built-in set of views for Hive, Pig, Tez, Capacity Scheduler, Files,
and HDFS, which allows developers to monitor and manage the cluster.
IBM value-add components
• Big SQL : SQL processing engine for the Hadoop cluster
• BigQuality : platform for data integration, data quality, and governance that is unified by a
common metadata layer and scalable architecture.
• BigIntegrate: a big data integration solution that provides superior connectivity, fast
transformation and reliable, easy-to-use data delivery features that execute on the data nodes
of a Hadoop cluster.
• Big Match
UNIT 4
What hardware is not used for Hadoop?
RAID , Linux Logical Volume Manager (LVM), Solid-state disk (SSD)
Parallel data processing is the answer
▪ GRID computing: spreads the processing load
▪ distributed workload: hard to manage applications, overhead on developer
▪ parallel databases: Db2 DPF, Teradata, Netezza, etc. (distribute the data)
What is Hadoop?
Hadoop is an open source project of the Apache Foundation. It is a framework originally
written in Java. Hadoop uses Google's MapReduce and Google File System (GFS) technologies as
its foundation
• Consists of 4 sub projects:
▪ MapReduce
▪ Hadoop Distributed File System (HDFS)
▪ YARN
▪ Hadoop Common
• Supported by many Apache/Hadoop-related projects:
▪ HBase, ZooKeeper, Avro, etc.
Hadoop is used for neither OLTP nor OLAP, but for big data, and it complements
these two to manage data. Hadoop is not a replacement for an RDBMS.
• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high
throughput access to application data.( The Hadoop Distributed File System (HDFS) is where
Hadoop stores its data. This file system spans all the nodes in a cluster. Effectively, HDFS
links together the data that resides on many local nodes, making the data part of one big file
system. You can use other file systems with Hadoop, but HDFS is quite common)
• Hadoop YARN: A framework for job scheduling and cluster resource
management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large
data sets.
• Ambari™: A web-based tool for provisioning, managing, and monitoring Hadoop
clusters. It also provides a dashboard for viewing cluster health and ability to
view MapReduce, Pig and Hive applications visually.
• Avro™: A data serialization system.
• Cassandra™: A scalable multi-master database with no single points of failure.
• Chukwa™: A data collection system for managing large distributed systems.
• HBase™: A scalable, distributed database that supports structured data
storage for large tables.
• Hive™: A data warehouse infrastructure that provides data summarization and
ad hoc querying.
• Mahout™: A scalable machine learning and data mining library.
• Pig™: A high-level data-flow language and execution framework for parallel
computation.
• Spark™: A fast and general compute engine for Hadoop data. Spark provides a
simple and expressive programming model that supports a wide range of
applications, including ETL, machine learning, stream processing, and graph
computation.
• Tez™: A generalized data-flow programming framework, built on Hadoop
YARN, which provides a powerful and flexible engine to execute an arbitrary
DAG of tasks to process data for both batch and interactive use-cases.
• ZooKeeper™: A high-performance coordination service for distributed
applications.
Advantages and disadvantages of Hadoop
Hadoop is good for:
▪ processing massive amounts of data through parallelism
▪ handling a variety of data (structured, unstructured, semi-structured)
▪ using inexpensive commodity hardware
Hadoop is not good for:
▪ processing transactions (random access)
▪ when work cannot be parallelized
▪ low latency data access
▪ processing lots of small files
▪ intensive calculations with small amounts of data
HDFS architecture
Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed
throughout the cluster. In this way, the map and reduce functions can be executed on smaller
subsets of your larger data sets, and this provides the scalability that is needed for big data
processing.
Hadoop Distributed File System (HDFS) principles
• Distributed, scalable, fault tolerant, high throughput
• Data access through MapReduce
• Files split into blocks (aka splits)
• 3 replicas for each piece of data by default
• Can create, delete, and copy, but cannot update
• Designed for streaming reads, not random access
• Data locality is an important concept: processing data on or near the
physical storage to decrease transmission of data
UNIT 5 MapReduce and YARN
The driving principle of MapReduce is a simple one: spread your data out across a huge
cluster of machines and then, rather than bringing the data to your programs as you do in
traditional programming, you write your program in a specific way that allows the program to
be moved to the data.
A Distributed File System (DFS) is at the heart of MapReduce. It is responsible for spreading
data across the cluster, by making the entire cluster look like one giant file system
MapReduce v1 explained
MapReduce v1 engine
• Master/Slave architecture
▪ Single master (JobTracker) controls job execution on multiple slaves
(TaskTrackers).
• JobTracker
▪ Accepts MapReduce jobs submitted by clients
▪ Pushes map and reduce tasks out to TaskTracker nodes
▪ Keeps the work as physically close to data as possible
▪ Monitors tasks and TaskTracker status
• TaskTracker
▪ Runs map and reduce tasks
▪ Reports status to JobTracker
▪ Manages storage and transmission of intermediate output
The MapReduce programming model
"Map" step
▪ Input is split into pieces (HDFS blocks or "splits")
▪ Worker nodes process the individual pieces in parallel
(under global control of a Job Tracker)
▪ Each worker node stores its result in its local file system where a reducer
is able to access it
"Reduce" step
▪ Data is aggregated ("reduced" from the map steps) by worker nodes
(under control of the Job Tracker)
▪ Multiple reduce tasks parallelize the aggregation
▪ Output is stored in HDFS (and thus replicated)
MapReduce 1 overview
Map phase : A mapper is typically a relatively small program with a relatively simple task: it
is responsible for reading a portion of the input data, interpreting, filtering or transforming the
data as necessary and then finally producing a stream of <key, value> pairs
Shuffle phase: the output of each mapper is locally grouped together by key. One node is
chosen to process the data for each unique key. All of this movement (shuffle) of data is
transparently orchestrated by MapReduce.
Reduce phase: small programs (typically) that aggregate all of the values for the key that
they are responsible for. Each reducer writes output to its own file.
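To make the phases concrete, here is a minimal word-count sketch in Python via Hadoop Streaming (an alternative to the native Java API; the file names and jar path are assumptions). The mapper emits <word, 1> pairs; the shuffle groups and sorts them by key; the reducer sums the values for each word:

# mapper.py: read raw input lines from stdin, emit tab-separated <key, value> pairs
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# reducer.py: input arrives sorted by key, so all pairs for one word are contiguous
import sys

current, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current:
        if current is not None:
            print(current + "\t" + str(count))
        current, count = word, 0
    count += int(value)
if current is not None:
    print(current + "\t" + str(count))

A typical (install-specific) invocation resembles:
hadoop jar hadoop-streaming.jar -input /in -output /out \
  -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py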
Classes
• There are three main Java classes provided in Hadoop to read data
in MapReduce:
▪ InputSplitter dividing a file into splits
-Splits are normally the block size, but depend on the number of requested Map tasks, whether
any compression allows splitting, etc.
▪ RecordReader takes a split and reads the files into records
-For example, one record per line (LineRecordReader)
-But note that a record can be split across splits
▪ InputFormat takes each record and transforms it into a <key, value> pair that is then passed
to the Map task
The primary way that Hadoop achieves fault tolerance is through restarting tasks
The most serious limitations of classical MapReduce are:
▪ Scalability
▪ Resource utilization
▪ Support of workloads different from MapReduce.
YARN features
• Scalability
• Multi-tenancy
• Compatibility
• Serviceability
• Higher cluster utilization
• Reliability/Availability
UNIT 6: APACHE SPARK
Apache Spark was designed as a computing platform to be fast, general-purpose, and easy to
use. It extends the MapReduce model and takes it to a whole other level.
Spark Core contains basic Spark functionalities required for running jobs and needed
by other components. The most important of these is the RDD concept, or resilient
distributed dataset, the main element of the Spark API.
• Spark SQL is designed for working with Spark data via SQL and HiveQL (a Hive
variant of SQL). Spark SQL allows developers to intermix SQL with Spark's supported
programming languages: Python, Scala, Java, and R (see the sketch after this list).
• Spark Streaming provides processing of live streams of data. The Spark
Streaming API closely matches that of the Spark Core API, making it easy for
developers to move between applications that process data stored in memory
vs arriving in real-time. It also provides the same degree of fault tolerance,
throughput, and scalability that the Spark Core provides.
• MLlib is the machine learning library that provides multiple types of machine
learning algorithms. These algorithms are designed to scale out across the
cluster as well. Supported algorithms include logistic regression, naive Bayes classification,
SVM, decision trees, random forests, linear regression, k-means
clustering, and others.
• GraphX is a graph processing library with APIs to manipulate graphs and
perform graph-parallel computations. Graphs are data structures comprised
of vertices and edges connecting them. GraphX provides functions for building
graphs and implementations of the most important algorithms of the graph
theory, like page rank, connected components, shortest paths, and others.
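As a concrete illustration of intermixing SQL with a supported language, a minimal PySpark sketch (the file name and columns are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()
df = spark.read.json("people.json")    # load semi-structured data into a DataFrame
df.createOrReplaceTempView("people")   # register it so SQL can reference it by name
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.filter(adults.age < 65).show()  # continue with programmatic DataFrame operations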
Two types of RDD operations (an RDD is essentially just a distributed collection of elements
that is parallelized across the cluster); a sketch follows the list below:
▪ Transformations
-Creates a directed acyclic graph (DAG)
-Lazy evaluations
-No return value
▪ Actions
-Performs the transformations
-The action that follows returns a value
• RDD provides fault tolerance
• Has in-memory caching (with overflow to disk)
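A minimal PySpark sketch of the transformation/action distinction (the numbers are arbitrary):

from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")
rdd = sc.parallelize(range(1, 1001))          # distribute a collection as an RDD
squares = rdd.map(lambda x: x * x)            # transformation: extends the DAG, runs nothing
evens = squares.filter(lambda x: x % 2 == 0)  # another lazy transformation
evens.cache()                                 # request in-memory caching for reuse
print(evens.count())                          # action: executes the DAG and returns a value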
UNIT 7 :
Common data representation formats used for big data include:
▪ Row- or record-based encodings:
-Flat files / text files
-CSV and delimited files
-Avro / SequenceFile
-JSON
-Other formats: XML, YAML
▪ Column-based storage formats:
-RC / ORC file was developed to support Hive
-Parquet (developed by Cloudera and Twitter)
▪ NoSQL datastores
• Compression of data
WHY NoSQL?
Briefly:
• Increased complexity of SQL (massive data sets exhaust the capacity and scale of existing
RDBMSs)
• Sharding introduces complexity (distributing the RDBMS is operationally challenging and
often technically impossible)
• Single point of failure
• Failover servers are more complex
• Backups are more complex
• Operational complexity is added
WHY HBASE ?
• Highly Scalable
▪ Automatic partitioning (sharding)
▪ Scale linearly and automatically with new nodes
• Low Latency
▪ Support random read/write, small range scan
• Highly Available
• Strong Consistency
• Very good for "sparse data" (no fixed columns)
HBase and ACID Properties
• Atomicity
▪ All reading and writing of data in one region is done by the assigned Region Server
▪ All clients have to talk to the assigned Region Server to get to the data
▪ Provides row level atomicity
• Consistency and Isolation
▪ All rows returned via any access API will consist of a complete row that existed at
some point in the table's history
▪ A scan is not a consistent view of a table. Scans do not exhibit snapshot isolation. Any
row returned by the scan will be a consistent view (for example, that version of the
complete row existed at some point in time)
• Durability
▪ All visible data is also durable data. That is to say, a read will never return
data that has not been made durable on disk
HBase data model
• Data is stored in HBase table(s)
• Tables are made of rows and columns
• All columns in HBase belong to a particular column family
• Table schema only defines column families
▪ Can have large, variable number of columns per row
▪ (row key, column key, timestamp) → value
▪ A {row, column, version} tuple exactly specifies a cell
• Each cell value has a version
▪ Timestamp
• Row stored in order by row keys
▪ Row keys are byte arrays; lexicographically sorted
• Technically HBase is a multidimensional sorted map
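As an illustration only, a sketch of the {row, column, version} addressing using happybase, a third-party Python client that reaches HBase through its Thrift gateway (the host, table, and column names are hypothetical):

import happybase

connection = happybase.Connection("hbase-host")
table = connection.table("mytable")
table.put(b"row1", {b"cf:col1": b"value1"})          # a cell is {row key, family:qualifier}
print(table.row(b"row1"))                            # latest version of each cell in the row
print(table.cells(b"row1", b"cf:col1", versions=3))  # up to 3 timestamped versions of one cell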
Pig
• Pig runs in two modes:
▪ Local mode: on a single machine without requirement for HDFS
▪ MapReduce/Hadoop mode: execution on an HDFS cluster, with the Pig script converted
to a MapReduce job
• When Pig runs in an interactive shell, the prompt is grunt>
• Pig scripts have, by convention, a suffix of .pig
• Pig scripts are written in the language Pig Latin (a short sketch follows)
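A minimal Pig Latin sketch (the file name and schema are hypothetical):
A = LOAD 'users.csv' USING PigStorage(',') AS (name:chararray, age:int);
B = FILTER A BY age > 21;
C = GROUP B BY name;
DUMP C;  -- triggers compilation into MapReduce jobs and prints the result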
Pig vs. SQL
• In contrast to SQL, Pig:
▪ uses lazy evaluation
▪ uses ETL techniques
▪ is able to store data at any point during a pipeline
▪ declares execution plans
▪ supports pipeline split
• Pig Latin is a procedural language with a pipeline paradigm
• SQL is a declarative language
What is Hive?
• tools to enable easy data extract/transform/load (ETL)
• a mechanism to impose structure on a variety of data formats
• access to files stored either directly in Apache HDFS or in other data storage systems such
as Apache HBase
• query execution via MapReduce
Components of Hive include HCatalog and WebHCat:
• HCatalog is a component of Hive. It is a table and storage management layer for Hadoop
that enables users with different data processing tools – including Pig and MapReduce - to
more easily read and write data on the grid.
• WebHCat provides a service that you can use to run Hadoop MapReduce (or YARN), Pig,
Hive jobs or perform Hive metadata operations using an http (REST style) interface.
ZooKeeper
ZooKeeper is a centralized service for maintaining configuration information, naming,
providing distributed synchronization, and providing group services.
ZooKeeper is a distributed, open-source coordination service for distributed applications. It
exposes a simple set of primitives that distributed applications can build upon to implement
higher level services for synchronization, configuration maintenance, and groups and
naming. It is designed to be easy to program to, and uses a data model styled after the
familiar directory tree structure of file systems. It runs in Java and has bindings for both Java
and C.
• ZooKeeper provides support for writing distributed applications in the Hadoop ecosystem
• ZooKeeper addresses the issue of partial failure
▪ Partial failure is intrinsic to distributed systems
▪ ZK provides a set of tools to build distributed applications that can safely handle partial
failures
• ZooKeeper has the following characteristics:
▪ simple
▪ expressive
▪ highly available
▪ facilitates loosely coupled interactions
▪ is a library
• Apache ZooKeeper is an open source server that enables highly reliable distributed
coordination
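As an illustration only, a minimal sketch using kazoo, a third-party Python ZooKeeper client (the host and znode path are hypothetical):

from kazoo.client import KazooClient

zk = KazooClient(hosts="zk-host:2181")
zk.start()
zk.ensure_path("/app/config")            # znodes form a file-system-like tree
zk.set("/app/config", b"max_workers=4")  # znodes hold small configuration payloads
data, stat = zk.get("/app/config")
print(data, stat.version)                # reads are versioned and consistent
zk.stop()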
Distributed systems :
Slider
• Apache Slider is a YARN application to deploy existing distributed applications on YARN,
monitor them and make them larger or smaller as desired, even while the application is
running.
• Some of the features are:
▪ Allows users to create on-demand applications in a YARN cluster
▪ Allows different users/applications to run different versions of the
application.
▪ Allows users to configure different application instances differently
▪ Stop/restart application instances as needed
▪ Expand/shrink application instances as needed
• The Slider tool is a Java command line application.
Knox
• The Apache Knox Gateway is an extensible reverse proxy framework for securely exposing
REST APIs and HTTP based services at a perimeter
• Different types of REST access supported: HTTP(S) client, cURL, Knox Shell (DSL), SSL, …
Big SQL is SQL on Hadoop
• Big SQL builds on Apache Hive foundation
▪ Integrates with the Hive metastore
▪ Instead of MapReduce, uses powerful native C/C++ MPP engine
• View on your data residing in the Hadoop FileSystem
• No proprietary storage format
• Modern SQL:2011 capabilities
• Same SQL can be used on your warehouse data with little or no modifications
Many Db2 technologies you already know exist in Big SQL, including
• "Native Tables" with full transactional support on the Head Node
• Row oriented, traditional Db2 tables
• BLU Columnar, In-memory tables (on Head Node Only)
• Materialized Query Tables
• GET SNAPSHOT / snapshot table functions
• RUNSTATS command (db2) → ANALYZE command (Big SQL)
• Row and Column Security
• Federation / Fluid Query
• Views
• SQL PL Stored Procedures & UDFs
• Workload Manager
• System Temporary Table Spaces to support sort overflows
• User Temporary Table Spaces for Declared Global Temporary Tables
JSqsh
• Big SQL comes with a CLI pronounced as "jay-skwish" - Java SQL Shell
▪ Open source command client
▪ Query history and query recall
▪ Multiple result set display styles
▪ Multiple active sessions
• Started under /usr/ibmpacks/common-utils/current/jsqsh/bin
CREATE VIEW
create view my_users as
select fname, lname from bigsql.users where id > 100;
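Continuing in the same style, a hedged sketch of creating a Big SQL table over HDFS data and loading it (the column names, delimiter, and file URL are hypothetical; see the Big SQL reference for the exact LOAD HADOOP options):
create hadoop table bigsql.users
(
  id    int,
  fname varchar(20),
  lname varchar(20)
)
row format delimited fields terminated by ','
stored as textfile;

load hadoop using file url '/tmp/users.csv'
with source properties ('field.delimiter' = ',')
into table bigsql.users;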
Data types
• Big SQL uses HCatalog (Hive Metastore) as its underlying data
representation and access method
• SQL type
• Hive type
DATE type
DATE can be stored two ways:
• DATE STORED AS TIMESTAMP
• DATE STORED AS DATE
FILE FORMAT
Here are the Big SQL file formats that will be covered in detail in upcoming
slides:
• Delimited
• Sequence
• Binary
• Parquet
• ORC
• RC
• Avro
SEQUENCE
Unit 4:
Clients (users and applications) that access the database and data sources
• Characteristics:
▪ Transparent: appears to be one source
▪ High Function: full query support against all data (e.g. scalar functions, stored procedures)
▪ High Performance: optimization of distributed queries