Hadoop Interview Questions
http://www.tutorialspoint.com/hadoop/hadoop_interview_questions.htm
Copyright tutorialspoint.com
Dear readers, these Hadoop Interview Questions have been designed specially to get you
acquainted with the nature of questions you may encounter during your interview for the subject
of Hadoop. In my experience, good interviewers rarely plan to ask any particular question
during your interview; normally questions start with some basic concept of the subject and then
continue based on further discussion and what you answer.
What does the jps command do?
It lists the Java processes running on the machine, which shows whether the daemons of the Hadoop
cluster are up. On a Hadoop node the output mentions the namenode, datanode, secondary namenode,
job tracker and task tracker processes that are running.
How to restart Namenode?
Option 1. Run stop-all.sh and then run start-all.sh, OR
Option 2. Type sudo hdfs (press enter), su-hdfs (press enter), /etc/init.d/ha (press enter) and
then /etc/init.d/hadoop-0.20-namenode start (press enter).
Which are the three modes in which Hadoop can be run?
The three modes in which Hadoop can be run are
1. Standalone (local) mode
2. Pseudo-distributed mode
3. Fully distributed mode
What does /etc/init.d do?
/etc/init.d is the directory where daemon (service) scripts are placed; the scripts are also used to
see the status of these daemons. It is very Linux specific and has nothing to do with Hadoop.
What if a Namenode has no data?
It cannot be part of the Hadoop cluster.
What happens to job tracker when Namenode is down?
When the Namenode is down, your cluster is OFF, because the Namenode is the single point of
failure in HDFS.
What is Big Data?
Big Data is nothing but an assortment of such a huge and complex data that it becomes very
tedious to capture, store, process, retrieve and analyze it with the help of on-hand database
management tools or traditional data processing techniques.
What are the four characteristics of Big Data?
The four characteristics of Big Data are
Volume
Velocity
Variety
Veracity
How is analysis of Big Data useful for organizations?
Effective analysis of Big Data provides a lot of business advantage as organizations will learn
which areas to focus on and which areas are less important. Big data analysis provides some early
key indicators that can prevent the company from a huge loss or help in grasping a great
opportunity with open hands! A precise analysis of Big Data helps in decision making! For
instance, nowadays people rely so much on Facebook and Twitter before buying any product or
service. All thanks to the Big Data explosion.
Why do we need Hadoop?
Every day a large amount of unstructured data is dumped into our machines. The major
challenge is not storing large data sets in our systems but retrieving and analyzing the big data in
organizations, especially when the data sits on different machines at different locations. This is
where the need for Hadoop arises. Hadoop can analyze data present on different machines at
different locations quickly and in a cost-effective way. It uses the concept of MapReduce, which
enables it to divide the query into small parts and process them in parallel. This is also known as
parallel computing.
What is the basic difference between traditional RDBMS and Hadoop?
Traditional RDBMS is used for transactional systems to report and archive the data,
whereas Hadoop is an approach to store a huge amount of data in a distributed file system and
process it. An RDBMS is useful when you want to seek one record from Big data, whereas
Hadoop is useful when you want Big data in one shot and perform analysis on it later.
What is Fault Tolerance?
Suppose you have a file stored in a system, and due to some technical problem that file gets
destroyed. Then there is no chance of getting back the data that was in that file. To avoid such
situations, Hadoop has introduced the feature of fault tolerance in HDFS. In Hadoop, when we
store a file, it automatically gets replicated at two other locations as well. So even if one or two of
the systems collapse, the file is still available on the third system.
Replication causes data redundancy, then why is it pursued in HDFS?
HDFS works with commodity hardware (systems with average configurations) that has a high chance
of crashing at any time. Thus, to make the entire system highly fault-tolerant, HDFS replicates and
stores data in different places. By default, any data on HDFS is stored at 3 different locations. So,
even if one of them is corrupted and another is unavailable for some time for any reason, the data
can be accessed from the third one. Hence, the chance of losing the data is very small. This
replication factor helps us attain the Hadoop feature called fault tolerance.
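The reasoning above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the node names and the helper is_block_readable are hypothetical and are not part of any Hadoop API. It simply shows that a block with three replicas stays readable until every node holding a replica has failed.

```python
# Toy model: a block stays readable while at least one replica's node is alive.
# Node names and this helper are hypothetical, not Hadoop APIs.

def is_block_readable(replica_nodes, failed_nodes):
    """Return True if any node holding a replica has not failed."""
    return any(node not in failed_nodes for node in replica_nodes)

replicas = ["node-a", "node-b", "node-c"]  # replication factor 3

print(is_block_readable(replicas, failed_nodes={"node-a"}))            # one failure
print(is_block_readable(replicas, failed_nodes={"node-a", "node-b"}))  # two failures
print(is_block_readable(replicas, failed_nodes=set(replicas)))         # all three fail
```

Only the last call reports the block as unreadable, which is why a higher replication factor trades extra storage for availability.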
Since the data is replicated thrice in HDFS, does it mean that any calculation done on one node
will also be replicated on the other two?
No, calculations will be done only on the original data. The master node knows exactly which node
has that particular data. If one of the nodes is not responding, it is assumed to have
failed. Only then will the required calculation be done on the second replica.
What is a Namenode?
Namenode is the master node on which the job tracker runs, and it holds the metadata. It maintains
and manages the blocks which are present on the datanodes. It is a high-availability machine and
the single point of failure in HDFS.
Is Namenode also a commodity hardware?
No. Namenode can never be commodity hardware because the entire HDFS relies on it. It is the
single point of failure in HDFS. Namenode has to be a high-availability machine.
What is a Datanode?
Datanodes are the slaves which are deployed on each machine and provide the actual storage.
These are responsible for serving read and write requests for the clients.
Why do we use HDFS for applications having large data sets and not when there are a lot of small
files?
HDFS is more suitable for a large amount of data in a single file than for a small amount
of data spread across multiple files. This is because the Namenode is a very expensive,
high-performance system, so it is not prudent to occupy the space in the Namenode with the
unnecessary metadata that is generated for multiple small files. When there is a large amount of
data in a single file, the metadata occupies less space in the Namenode. Hence, for optimized
performance, HDFS supports large data sets instead of multiple small files.
What is a job tracker?
Job tracker is a daemon that runs on a namenode for submitting and tracking MapReduce jobs in
Hadoop. It assigns tasks to the different task trackers. In a Hadoop cluster, there will be only one
job tracker but many task trackers. It is the single point of failure for the Hadoop MapReduce
service. If the job tracker goes down, all the running jobs are halted. It receives heartbeats from
the task trackers, based on which it decides whether an assigned task is completed or not.
What is a task tracker?
Task tracker is also a daemon that runs on datanodes. Task trackers manage the execution of
individual tasks on the slave nodes. When a client submits a job, the job tracker initializes the job,
divides the work and assigns it to different task trackers to perform MapReduce tasks. While
performing this work, the task tracker simultaneously communicates with the job tracker by
sending heartbeats. If the job tracker does not receive a heartbeat from a task tracker within the
specified time, it assumes that the task tracker has crashed and assigns that task to another task
tracker in the cluster.
What is a heartbeat in HDFS?
A heartbeat is a signal indicating that the sender is alive. A datanode sends heartbeats to the
Namenode, and a task tracker sends its heartbeats to the job tracker. If the Namenode or job tracker
does not receive a heartbeat, it decides that there is some problem with the datanode, or that the
task tracker is unable to perform the assigned task.
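The timeout logic described above might look roughly like the following Python sketch. The class HeartbeatMonitor, the worker names and the 10-second timeout are all illustrative assumptions; they do not reflect Hadoop's actual classes or default settings.

```python
# Hypothetical heartbeat monitor; the class name and timeout value are illustrative only.

class HeartbeatMonitor:
    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # worker name -> time of its last heartbeat

    def record_heartbeat(self, worker, now):
        """Remember when this worker last reported in."""
        self.last_seen[worker] = now

    def dead_workers(self, now):
        """Workers whose last heartbeat is older than the timeout."""
        return sorted(
            worker for worker, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )

monitor = HeartbeatMonitor(timeout_seconds=10)
monitor.record_heartbeat("tasktracker-1", now=0)
monitor.record_heartbeat("tasktracker-2", now=0)
monitor.record_heartbeat("tasktracker-1", now=8)  # tasktracker-2 stays silent
print(monitor.dead_workers(now=12))  # ['tasktracker-2']
```

A master that sees a worker in this list would then reassign that worker's tasks, as the answer above describes.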
What is a block in HDFS?
A block is the minimum amount of data that can be read or written. In HDFS, the default block
size is 64 MB, in contrast to the block size of 8192 bytes in Unix/Linux. Files in HDFS are broken
down into block-sized chunks, which are stored as independent units. HDFS blocks are large
compared to disk blocks, particularly to minimize the cost of seeks.
If a particular file is 50 MB, will the HDFS block still consume 64 MB as the default size?
No, not at all! 64 MB is just a unit where the data will be stored. In this particular situation, only
50 MB will be consumed by the HDFS block and 14 MB will be free to store something else. It is the
MasterNode that does data allocation in an efficient manner.
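The block arithmetic can be checked with a small Python sketch. It assumes the 64 MB block size quoted above; the function split_into_blocks is a hypothetical helper written for illustration, not an HDFS call.

```python
BLOCK_SIZE_MB = 64  # default block size quoted in the text

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file of this size occupies."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover data
    return sizes

print(split_into_blocks(50))   # [50]
print(split_into_blocks(200))  # [64, 64, 64, 8]
```

A 50 MB file therefore occupies one 50 MB block, and a 200 MB file occupies three full blocks plus one 8 MB block.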
What are the benefits of block transfer?
A file can be larger than any single disk in the network. There's nothing that requires the blocks
from a file to be stored on the same disk, so they can take advantage of any of the disks in the
cluster.
Making the unit of abstraction a block rather than a file simplifies the storage subsystem.
Blocks provide fault tolerance and availability. To insure against corrupted blocks and
disk and machine failure, each block is replicated to a small number of physically separate
machines (typically three). If a block becomes unavailable, a copy can be read from another
location in a way that is transparent to the client.
How is indexing done in HDFS?
Hadoop has its own way of indexing. Depending upon the block size, once the data is stored, HDFS
keeps storing the last part of the data, which says where the next part of the data will be.
Are job tracker and task trackers present in separate machines?
Yes, job tracker and task tracker are present in different machines. The reason is job tracker is a
single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are
halted.
What is the communication channel between client and namenode/datanode?
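The communication runs over TCP/IP: clients talk to the Namenode through Hadoop's RPC protocol
and stream block data to and from the datanodes. SSH is used only by the cluster start and stop
scripts to launch daemons on remote machines, not for client communication.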
What is a rack?
A rack is a physical collection of datanodes which are stored at a single location. There can be
multiple racks in a single location, and the racks of one cluster can also be placed at different
locations.
What is a Secondary Namenode? Is it a substitute to the Namenode?
The secondary Namenode constantly reads the data from the RAM of the Namenode and writes it
into the hard disk or the file system. It is not a substitute to the Namenode, so if the Namenode
fails, the entire Hadoop system goes down.
Explain how map and reduce work.
Namenode takes the input, divides it into parts and assigns them to datanodes. These datanodes
process the tasks assigned to them, produce key-value pairs and return the intermediate output
to the reducer. The reducer collects the key-value pairs from all the datanodes, combines them
and generates the final output.
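The flow above can be illustrated with a word count written in plain Python. This is a single-process sketch of the idea, not the Hadoop MapReduce API; the function names map_phase, shuffle and reduce_phase and the sample input are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group intermediate values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a final result."""
    return key, sum(values)

# Each string stands in for one part of the input handed to a separate mapper.
input_splits = ["big data needs hadoop", "hadoop stores big files"]
intermediate = [pair for line in input_splits for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(word_counts)
```

In a real cluster the map calls run on different machines in parallel, and the grouping step moves data across the network before the reducers run.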
Why is reading done in parallel but writing is not in HDFS?
Through a MapReduce program, a file can be read in parallel by splitting it into its blocks. But
while writing, the incoming values are not yet known to the system, so MapReduce cannot be applied
and no parallel writing is possible.
How do you copy a directory from one node in the cluster to another?
Use the distcp command (hadoop distcp) to copy.
The default replication factor for a file is 3.
Use the -setrep command to change the replication factor of a file to 2.
hadoop fs -setrep -w 2 apache_hadoop/sample.txt
What is rack awareness?
Rack awareness is the way in which the namenode decides how to place blocks based on the rack
definitions. Hadoop will try to keep network traffic between datanodes within the same rack
and will only contact remote racks if it has to. The namenode is able to control this due to rack
awareness.
Which file holds the Hadoop core configuration?
core-default.xml
Is there an hdfs command to see available free space in hdfs?
hadoop dfsadmin -report
The requirement is to add a new data node to a running Hadoop cluster; how do I start services on
just one data node?
You do not need to shutdown and/or restart the entire cluster in this case.
First, add the new node's DNS name to the conf/slaves file on the master node.
Then log in to the new slave node and execute
$ cd path/to/hadoop
$ bin/hadoop-daemon.sh start datanode
$ bin/hadoop-daemon.sh start tasktracker
Then issue hadoop dfsadmin -refreshNodes and hadoop mradmin -refreshNodes so that
the NameNode and JobTracker know of the additional node that has been added.
It will restart the task again on some other TaskTracker, and only if the task fails more than four
times (the default setting, which can be changed) will it kill the job.
What are the problems with small files and HDFS?
HDFS is not good at handling a large number of small files. Every file, directory and block in
HDFS is represented as an object in the namenode's memory, each of which occupies approximately
150 bytes. So 10 million files, each using one block, would use about 3 gigabytes of memory (10
million file objects plus 10 million block objects, at 150 bytes each). When we go to a billion files,
the memory requirement of the namenode cannot be met.
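The estimate can be reproduced with a short Python calculation. It assumes the approximate 150 bytes per object quoted above and counts one object per file plus one per block, ignoring directories; namenode_memory_bytes is a hypothetical helper, not a Hadoop function.

```python
BYTES_PER_OBJECT = 150  # approximate figure from the text

def namenode_memory_bytes(num_files, blocks_per_file=1):
    """Estimate: one object per file plus one per block (directories ignored)."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

print(namenode_memory_bytes(10_000_000) / 1e9)     # 3.0 (GB)
print(namenode_memory_bytes(1_000_000_000) / 1e9)  # 300.0 (GB)
```

Ten million single-block files need about 3 GB, while a billion such files would need about 300 GB of namenode memory under the same assumptions.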
What is speculative execution in Hadoop?
If a node appears to be running slow, the master node can redundantly execute another instance
of the same task, and the first output to arrive will be taken. This process is called speculative
execution.
Can Hadoop handle streaming data?
Yes, through technologies like Apache Kafka, Apache Flume, and Apache Spark it is possible to do
large-scale streaming.
Why is Checkpointing Important in Hadoop?
As more and more files are added, the namenode creates large edit logs, which can substantially
delay NameNode startup as the NameNode reapplies all the edits. Checkpointing is a process that
takes an fsimage and edit log and compacts them into a new fsimage. This way, instead of
replaying a potentially unbounded edit log, the NameNode can load the final in-memory state
directly from the fsimage. This is a far more efficient operation and reduces NameNode startup
time.
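The idea of compacting an fsimage and an edit log can be sketched with a toy model in Python. Here the namespace is just a set of paths and each edit is an (operation, path) pair; these structures and the function names are assumptions for illustration and do not match the real on-disk formats.

```python
def apply_edit(state, edit):
    """Apply one edit-log entry to the in-memory namespace (a set of paths)."""
    op, path = edit
    if op == "create":
        state.add(path)
    elif op == "delete":
        state.discard(path)

def load_namespace(fsimage, edit_log):
    """Start from the fsimage snapshot and replay every logged edit."""
    state = set(fsimage)
    for edit in edit_log:
        apply_edit(state, edit)
    return state

def checkpoint(fsimage, edit_log):
    """Fold the edit log into a new fsimage and return it with an empty log."""
    return load_namespace(fsimage, edit_log), []

fsimage = {"/data/a.txt"}
edit_log = [("create", "/data/b.txt"), ("delete", "/data/a.txt"), ("create", "/data/c.txt")]

new_fsimage, new_edit_log = checkpoint(fsimage, edit_log)
print(sorted(new_fsimage))  # ['/data/b.txt', '/data/c.txt']
print(load_namespace(new_fsimage, new_edit_log) == load_namespace(fsimage, edit_log))  # True
```

After the checkpoint the same namespace is recovered without replaying any edits, which is the startup saving the answer describes.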
What is Twitter Bootstrap?
Bootstrap is a sleek, intuitive, and powerful mobile first front-end framework for faster and easier
web development. It uses HTML, CSS and JavaScript.
Why use Bootstrap?
Bootstrap can be used for the following reasons:
Mobile first approach: Since Bootstrap 3, the framework consists of mobile first styles
throughout the entire library instead of in separate files.
Browser support: It is supported by all popular browsers.
Easy to get started: With just the knowledge of HTML and CSS anyone can get started
with Bootstrap. Also, the official Bootstrap site has good documentation.
Responsive design: Bootstrap's responsive CSS adjusts to desktops, tablets and mobiles.
It provides a clean and uniform solution for building an interface for developers.
It contains beautiful and functional built-in components which are easy to customize.
It also provides web-based customization.
And best of all, it is open source.
What does the Bootstrap package include?
The Bootstrap package includes
Scaffolding: Bootstrap provides a basic structure with Grid System, link styles and
background. This is covered in detail in the section Bootstrap Basic Structure.
CSS: Bootstrap comes with the feature of global CSS settings, fundamental HTML elements
styled and enhanced with extensible classes, and an advanced grid system. This is covered in
detail in the section Bootstrap with CSS.
Components: Bootstrap contains over a dozen reusable components built to provide
iconography, dropdowns, navigation, alerts, popovers, and much more. This is covered in
detail in the section Layout Components.
JavaScript Plugins: Bootstrap contains over a dozen custom jQuery plugins. You can
easily include them all, or one by one. This is covered in detail in the section Bootstrap
Plugins.
Customize: You can customize Bootstrap's components, LESS variables, and jQuery
plugins to get your very own version.
What are the contextual classes of a table in Bootstrap?
The contextual classes allow you to change the background color of your table rows or individual
cells. The classes are .active, .success, .warning and .danger.
A modal is a child window that is layered over its parent window. Typically, the purpose is to
display content from a separate source that can have some interaction without leaving the parent
window. Child windows can provide information, interaction, or more.
How do you use the Dropdown plugin?
You can toggle the dropdown plugin's hidden content:
Via data attributes: Add data-toggle="dropdown" to a link or button to toggle a
dropdown as shown below
<div class="dropdown">
  <a data-toggle="dropdown" href="#">Dropdown trigger</a>
  <ul class="dropdown-menu">
    ...
  </ul>
</div>
Via JavaScript: To call the dropdown toggle via JavaScript, use the following method:
$('.dropdown-toggle').dropdown()
Input groups are extended Form Controls. Using input groups you can easily prepend and append
text or buttons to the text-based inputs.
By adding prepended and appended content to an input field, you can add common elements to
the user's input. For example, you can add the dollar symbol, the @ for a Twitter username, or
anything else that might be common for your application interface.
To prepend or append elements to a .form-control:
Wrap it in a <div> with class .input-group.
As a next step, within that same <div>, place your extra content inside a <span> with class
.input-group-addon.
Now place this <span> either before or after the <input> element.
How will you create a tabbed navigation menu?
To create a tabbed navigation menu:
Start with a basic unordered list with the base class of .nav.
Add class .nav-tabs.
How will you create a pills navigation menu?
To create a pills navigation menu:
Start with a basic unordered list with the base class of .nav.
Add class .nav-pills.
How will you create a vertical pills navigation menu?
You can stack the pills vertically using the class .nav-stacked along with the classes .nav and
.nav-pills.
What is the Bootstrap navbar?
The navbar is one of the prominent features of Bootstrap sites. Navbars are responsive 'meta'
components that serve as navigation headers for your application or site. Navbars collapse in
mobile views and become horizontal as the available viewport width increases. At its core, the
navbar includes styling for site names and basic navigation.
How do you create a navbar in Bootstrap?
To create a default navbar:
Add the classes .navbar and .navbar-default to the <nav> tag.
Add role="navigation" to the above element, to help with accessibility.
Add a header class .navbar-header to the <div> element. Include an <a> element with class
.navbar-brand. This will give the text a slightly larger size.
To add links to the navbar, simply add an unordered list with the classes .nav and .navbar-nav.
What is a Bootstrap breadcrumb?
Breadcrumbs are a great way to show hierarchy-based information for a site. In the case of blogs,
breadcrumbs can show the dates of publishing, categories, or tags. They indicate the current
page's location within a navigational hierarchy.
A breadcrumb in Bootstrap is simply an unordered list with a class of .breadcrumb. The separator
is automatically added by CSS (bootstrap.min.css).
Which class is used for basic pagination?
The .pagination class is used to add pagination to a page.
The goal of media objects (light markup, easy extendability) is achieved by applying classes to
some simple markup.
What is the purpose of the .media class in Bootstrap?
This class allows you to float a media object (images, video, and audio) to the left or right of a
content block.
What is the purpose of the .media-list class in Bootstrap?
If you are preparing a list where the items will be part of an unordered list, use the .media-list
class. It is useful for comment threads or article lists.
What are Bootstrap panels?
Panel components are used when you want to put your DOM component in a box. To get a basic
panel, just add class .panel to the <div> element. Also add class .panel-default to this element.
How will you create a Bootstrap panel with a heading?
There are two ways to add a panel heading:
Use the .panel-heading class to easily add a heading container to your panel.
Use any <h1>-<h6> with a .panel-title class to add a pre-styled heading.
How will you create a Bootstrap panel with a footer?
You can add footers to panels by wrapping buttons or secondary text in a <div> containing class
.panel-footer.
What contextual classes are available to style the panels?
Use contextual state classes such as .panel-primary, .panel-success, .panel-info, .panel-warning
and .panel-danger to make a panel more meaningful to a particular context.
Can you put a table within a Bootstrap panel?
Yes! To get a non-bordered table within a panel, use the class .table within the panel. If
there is a <div> containing .panel-body, an extra border is added to the top of the table for
separation. If there is no <div> containing .panel-body, then the component moves from panel
header to table without interruption.
Can you put a list group within a Bootstrap panel?
Yes! You can include list groups within any panel. Create a panel by adding class .panel to the
<div> element. Also add class .panel-default to this element. Now within this panel include your
list groups.
What is a Bootstrap well?
A well is a container in a <div> that causes the content to appear sunken, giving an inset effect on
the page. To create a well, simply wrap the content that you would like to appear in the well with a
<div> containing the class .well.
What is the Scrollspy plugin?
The Scrollspy (auto-updating nav) plugin allows you to target sections of the page based on the
scroll position. In its basic implementation, as you scroll, you can add .active classes to the navbar
based on the scroll position.
What is the affix plugin?
The affix plugin allows a <div> to become affixed to a location on the page. You can also toggle
its pinning on and off using this plugin. A common example of this is social icons. They will start
in a location, but as the page hits a certain mark, the <div> will be locked in place and will stop
scrolling with the rest of the page.
What is Next?
Further, you can go through the past assignments you have done on the subject and make sure
you are able to speak confidently about them. If you are a fresher, the interviewer does not expect
you to answer very complex questions; rather, you have to make your basic concepts very strong.
Second, it really doesn't matter much if you could not answer a few questions, but it matters that
whatever you answered, you answered with confidence. So just feel confident during
your interview. We at tutorialspoint wish you the best of luck in having a good interviewer and all
the very best for your future endeavors. Cheers :-)