Hadoop Virtualization
by Courtney Webster
Copyright © 2015 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA
95472.
O’Reilly books may be purchased for educational, business, or sales promotional use.
Online editions are also available for most titles (http://safaribooksonline.com). For
more information, contact our corporate/institutional sales department: 800-998-9938
or corporate@oreilly.com.
Editors: Julie Steele and Jenn Webb
Illustrator: Rebecca Demarest
ISBN: 978-1-491-90674-3
The Benefits of Deploying Hadoop in a Private Cloud
Abstract
Hadoop is a popular framework used for nimble, cost-effective analysis of unstructured data. The global Hadoop market, valued at $1.5 billion in 2012, is estimated to reach $50 billion by 2020.1 Companies can now choose to deploy a Hadoop cluster in a physical server environment, a private cloud environment, or in the public cloud. We have yet to see which deployment model will predominate during this growth period; however, the security and granular control offered by private clouds may lead this model to dominate for medium to large enterprises. When compared to other deployment models, a private cloud Hadoop cluster offers unique benefits: competitive performance, rapid deployment, scalability, improved efficiency, and flexibility.
Introduction
Today, we are capable of collecting more data (and various forms of
data) than ever before.2 It may be the most valuable intangible asset of
our time. The sheer volume ("big data") and need for flexible, low-
latency analysis can overwhelm traditional management systems like
structured relational databases. As a result, new tools have emerged to
store and mine large collections of unstructured data.
MapReduce
In 2004, Google engineers described a scalable programming model for processing large, distributed datasets.3 This model, MapReduce, abstracts away complicated tasks like data distribution, failure handling, and parallelization so that developers can focus on the computation itself. They specify a processing ("map") function that behaves as an independent, modular operation on blocks of local data. The resulting analyses can then be consolidated (or “reduced”) to provide an aggregate result. This model of local computation is particularly useful for big data, where the transfer time required to move the data to a centralized computing module is limiting.
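To make the division of labor concrete, the sketch below follows the canonical word-count pattern written against the standard Hadoop MapReduce Java API. It is a minimal illustration of the map and reduce roles described above, not a tuned production job.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // The "map" function: runs independently on local blocks of input data,
  // emitting a (word, 1) pair for every token it sees.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // The "reduce" function: consolidates the per-block results into an
  // aggregate count for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The framework handles splitting the input, scheduling map tasks near the data, and shuffling intermediate pairs to the reducers; the developer writes only the two functions above.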
Hadoop
Doug Cutting and others at Yahoo! combined the computational pow‐
er of MapReduce with a distributed filesystem prototyped by Google
in 2003.4 This evolved into Hadoop—an open source system made of
MapReduce and the Hadoop Distributed File System (HDFS). HDFS
makes several replica copies of the data blocks for resilience against
server failure and is best used on high I/O bandwidth storage devices.
In Hadoop 1.0, two master roles (the JobTracker and the Namenode)
direct MapReduce and HDFS, respectively.
Hadoop was originally built to use local data storage on a dedicated
group of commodity hardware. In a Hadoop cluster, each server is
considered a node. A “master” node hosts either the JobTracker of MapReduce or the Namenode of HDFS (although in a small cluster, as shown in Figure 1, one master node could host both). The remaining
servers ("worker" nodes) store blocks of data and run local computa‐
tion on that data.
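As a rough illustration of how HDFS replication is controlled, the sketch below uses the standard HDFS Java client API to write a file and adjust its replication factor. It assumes the cluster configuration (fs.defaultFS and friends) is available on the classpath, and the path and values are hypothetical, chosen only for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationExample {
  public static void main(String[] args) throws Exception {
    // Loads core-site.xml/hdfs-site.xml from the classpath; dfs.replication
    // controls how many replica copies of each block HDFS keeps (3 by default).
    Configuration conf = new Configuration();
    conf.set("dfs.replication", "3");

    FileSystem fs = FileSystem.get(conf);

    // Hypothetical path used for illustration only.
    Path file = new Path("/data/example.txt");
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("hello hdfs");
    }

    // Raise the replication factor for this one file to 4 replicas.
    fs.setReplication(file, (short) 4);

    fs.close();
  }
}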
Hadoop 2.0
In the newest version of Hadoop, the JobTracker is no longer solely responsible for managing the MapReduce programming framework. Its scheduling role passes to a new central scheduler, the ResourceManager, while framework-specific duties are distributed to a new Hadoop component called the ApplicationMaster. In order to run tasks, ApplicationMasters request resources from the ResourceManager, and developers can construct ApplicationMasters to encapsulate any knowledge of the programming framework, such as MapReduce. This architectural redesign improves scalability and efficiency, bypassing some of the limitations of Hadoop 1.0.
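The sketch below (assuming the standard YARN AMRMClient Java API, and omitting the container launch and failure handling a real ApplicationMaster must perform) illustrates the negotiation between an ApplicationMaster and the ResourceManager:

import java.util.List;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ResourceRequestSketch {
  public static void main(String[] args) throws Exception {
    // This code only makes sense inside an ApplicationMaster container
    // that YARN has already launched for a submitted application.
    AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
    rm.init(new YarnConfiguration());
    rm.start();
    rm.registerApplicationMaster("", 0, "");

    // Ask the central scheduler for one container: 1 GB of memory, 1 vcore.
    Resource capability = Resource.newInstance(1024, 1);
    rm.addContainerRequest(
        new ContainerRequest(capability, null, null, Priority.newInstance(0)));

    // A single heartbeat; a real ApplicationMaster polls until the
    // ResourceManager grants the requested containers.
    List<Container> granted = rm.allocate(0.0f).getAllocatedContainers();

    // ...launch tasks in the granted containers, then deregister...
    rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
    rm.stop();
  }
}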
Virtualizing Hadoop
As physically deployed Hadoop clusters grew in size, developers asked
a familiar question: can we virtualize it?
Like other enterprise (and Java-based) applications, development ef‐
forts moved to virtualization as Hadoop matured. A virtualized private
cloud uses a group of hardware on the same hypervisor (such as
vSphere [by VMware], XenServer [by Citrix], KVM [by Red Hat], or
Hyper-V [by Microsoft]). Instead of individual servers, nodes are vir‐
tual machines (VMs) designated with master or worker roles. Each
VM is allocated specific computing and storage resources from the
physical host, and as a result, a Hadoop cluster can be consolidated onto far fewer physical servers. There is an up-front cost for virtualization licenses and supported or enterprise-level software, but this can be offset by the cluster’s decreased operating expenses over time.
Virtualizing Hadoop created the infrastructure required to run Ha‐
doop in the cloud, leading major players to offer web-service Hadoop.
The first, Amazon Web Services, began beta testing their Elastic Map‐
Reduce service as early as 2009. Though public cloud deployment is
not the focus of this review, it’s worth noting that it can be useful for
ad hoc or batch processing, especially if your data is already stored in
the cloud. For a stable, live cluster, a company might find that building
its own private cloud is more cost effective. Additionally, regulated
industries may prefer the security of a private hosting facility.
In 2012, VMware released Project Serengeti—an open source man‐
agement and deployment platform on vSphere for private cloud en‐
vironments. Soon thereafter, they released Big Data Extensions (BDE),
the advanced commercial version of Project Serengeti (run on vSphere
Enterprise Edition). Other offerings, like OpenStack’s Project Sahara
on KVM (formerly called Project Savanna), were also released in the
past two years.
Cluster managers
Cluster managers and management tools work on the application level
to manage containers and schedule tasks in an aggregated environ‐
ment. Many cluster managers, like Apache Mesos (backed by Meso‐
sphere) and StackIQ, are designed to support analytics (like Hadoop)
alongside other services.
Hadoop on Mesos
Apache Mesos provides a foundational layer for running a variety of
distributed applications by pooling resources. Mesos allows a cluster
to elastically provision resources to multiple applications (including
more than one Hadoop cluster). Mesosphere aims to commercialize
Mesos for Hadoop and other enterprise applications while building
add-on frameworks (like distributed schedulers) along the way. In
2013, they released Elastic Mesos to easily provision a Mesos cluster
on Amazon Web Services, allowing companies to run Hadoop 1.0 on
Mesos in bare-metal, virtualized, and now public cloud environ‐
ments.
Competitive performance
Since a hypervisor demands some amount of computational resour‐
ces, initial concerns about virtual Hadoop focused on performance.
The virtualization layer requires some CPU, memory, and other re‐
sources in order to manage its hosted VMs,7 though the impact is
dependent on the characteristics of the hypervisor used. Over the past
5 to 10 years, however, the performance of VMs has significantly improved (especially for Java-based applications).
Many independent reports show that when using best practices, a virtual Hadoop cluster performs competitively with a physical system.8,9
Increasing the number of VMs per host can even lead to enhanced
performance (up to 13%).
Container-based clusters (like Linux VServer, OpenVZ, and LXC) can
also provide near-native performance on Hadoop benchmarking tests
like WordCount and TeraSuite.10
With such results, performance concerns are generally outweighed by
the numerous other benefits provided by a private-cloud deployment.
Rapid deployment
To deploy a cluster, Hadoop administrators must navigate a compli‐
cated setup and configuration procedure. Clusters can be composed
Scalability
Modifying a physical cluster—removing or adding physical nodes—
requires a reshuffling of the data within the entire system. Load bal‐
ancing (ensuring that all worker nodes store approximately the same
amount of data) is one of the most important tasks when scaling and
maintaining a cluster. Some hypervisors, like vSphere Enterprise Ed‐
Improved Efficiency
Flexibility
Hardware flexibility
With its use of commodity hardware and built-in failure protection, Hadoop was designed for flexibility. Virtualization takes this a step further by abstracting away from the hardware completely. A private cloud can use direct-attached storage (DAS), a storage area network (SAN), or network-attached storage (NAS). SAN/NAS storage can be more costly but offers enhanced scalability, performance, and data isolation. If a company has already invested in non-local storage, its Hadoop cluster can strategically employ both direct- and network-
attached devices. The storage or VMDK files for the Namenode and
JobTracker can be placed on SAN for maximum reliability (as they are
memory- but not storage-intensive) while worker nodes store their
data on DAS.5 The temporary data generated during MapReduce can
be stored wherever I/O bandwidth is maximized.
Configurational flexibility
As previously described, organizing computation tasks to run on
blocks of local data (data locality) is the key to Hadoop’s performance.
In a physical deployment, this necessitates that worker nodes host data
and compute roles in a fixed 1:1 ratio (a "combined" model). For this
model to be mimicked in a virtual cluster, each hypervisor server
would host one or more VMs that contained data and compute pro‐
cesses (see A in Figure 4). These configurations are valid, but difficult
to scale under practical circumstances. Since each VM stores data, the
ease of adding or removing nodes (simply copying from a template or
using live migration capabilities) would be offset by the need to rebalance
the cluster.
If instead, compute and data roles on the same hypervisor server were
in separate VMs (see B in Figure 4), compute operations could be
scaled according to demand without redistributing any data.14 Like‐
wise in an aggregated cloud, Apache Mesos only spins up TaskTracker
nodes as a job runs. When the task is complete, the TaskTrackers are
killed, and their capacity is placed back in the consolidated pool of
resources.
This "separated" model is fairly intuitive (since the TaskTracker con‐
trols MapReduce and the Namenode controls the data storage) and
the flexibility of virtualized or aggregated clusters makes it relatively
simple to configure.
In addition to creating this elastic system (where compute processes
can be easily cloned or launched to increase throughput), the separated
model also allows you to build a multi-tenant system where multiple,
isolated compute clusters can operate on the same data. The same
cloud could host a production Hadoop cluster as well as a development
and QA environment.
Conclusions
Apply the Resources and Best Practices You
Already Know
Whether you are planning your first cluster or a deployment overhaul, it is important to consider the following: