A Review of Hadoop Security Issues, Threats and Solutions
Abstract— Hadoop projects now treat security as a top agenda item, one that is rightly classified as critical. From financial applications that are deemed sensitive to healthcare initiatives, Hadoop is traversing new territories that demand security-sensitive environments. With the growing acceptance of Hadoop, there is an increasing trend to incorporate more and more enterprise security features. Over time, we have seen Hadoop gradually develop to address important issues pertaining to what we summarize as 3ADE (authentication, authorization, auditing, and encryption) within a cluster. There is no dearth of production environments that run Hadoop clusters. In this paper, we study "Big Data" security at the environmental level, probe the built-in protections and the Achilles' heel of these systems, assess some of the issues faced today in securing contemporary Big Data, and propose security solutions and commercially available techniques to address them.

Keywords—Big Data, SASL, delegation, sniffing, cell level, variety, unauthorized

I. INTRODUCTION
So, what exactly is "Big Data"? Put simply, it describes mammoth volumes of data, both structured and unstructured, generally so gigantic that processing it with conventional database and software techniques is a challenge. As witnessed in enterprise scenarios, three observations can be inferred:
1. The data is stupendous in terms of volume.
2. It moves at a very fast pace.
3. It outpaces prevailing processing capacity.
The volumes of Big Data keep growing, which can be inferred from the fact that as far back as 2012 a single dataset held a few dozen terabytes of data, a figure that has since been catapulted to many petabytes. To cater to the demands of the industry, new approaches to manipulating "Big Data" are being commissioned.
Quick fact: 5 exabytes (1 exabyte = 1.1529x10^18 bytes) of data were created by humans up to 2003; today that amount of information is created in two days [8, 16]. In 2012, the digital world of data expanded to 2.72 zettabytes (10^21 bytes), and it is predicted to double every two years, reaching about 8 zettabytes by 2015 [8, 16]. With the increase in data comes a corresponding increase in the applications and frameworks that administer it, giving rise to new vulnerabilities that need to be responded to.
Big Data is not only about the size of data; it also includes data variety and data velocity. Together, these three attributes form the three V's of Big Data.

Fig. 1 Three V's of Big Data [17]

Each of the V's represented in Figure 1 is described below:
Volume: the size of data in present times is larger than terabytes and petabytes. Because data comes from machines, networks and human interaction on systems such as social media, the volume of data to be analysed is very large [8].
Velocity: the speed of data processing, required not only for big data but for all processing, and involving both real-time and batch processing.
Variety: the different types of data from different and many sources, both structured and unstructured. In the past, data was stored in sources such as spreadsheets and databases; now data comes in the form of emails, pictures, audio, video, monitoring devices, PDFs, etc. This multifariousness of unstructured data creates problems for storing, mining and analysing the data [8].
To process such large volumes of data from different sources quickly, Hadoop is used. Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Hadoop allows applications to run on systems with thousands of nodes holding thousands of terabytes of data [2]. Its distributed file system supports fast data-transfer rates among nodes and allows the system to continue operating uninterrupted when a node fails.
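The map/shuffle/reduce flow that underlies this processing model can be sketched in a few lines. This is a conceptual Python sketch of the programming model only, not Hadoop's actual Java API:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in an input split.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key.
    return {key: sum(values) for key, values in groups.items()}

# Two "splits" standing in for blocks of a large distributed file.
splits = ["big data big security", "data security"]
pairs = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle(pairs))
# counts == {'big': 2, 'data': 2, 'security': 2}
```

In a real cluster each map and reduce call runs as a task on a different node; the sketch only shows the data flow between the phases.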
www.ijcsit.com 2126
Priya P. Sharma et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (2) , 2014, 2126-2131
Hadoop consists of a distributed file system, data storage and analytics platforms, and a layer that handles parallel computation, workflow and configuration administration [8]. HDFS runs across the nodes in a Hadoop cluster and connects the file systems on many input and output data nodes into one big file system [2]. The present Hadoop ecosystem (as shown in Fig. 2) consists of the Hadoop kernel, MapReduce, the Hadoop Distributed File System (HDFS) and a number of related components such as Apache Hive, HBase, Oozie, Pig and ZooKeeper. These components are explained below [7, 8]:

• HDFS: a highly fault-tolerant distributed file system responsible for storing data on the clusters.
• MapReduce: a powerful parallel programming technique for distributed processing of vast amounts of data on clusters.
• HBase: a column-oriented distributed NoSQL database for random read/write access.
• Pig: a high-level data programming language for analysing Hadoop data.
• Hive: a data warehousing application that provides SQL-like access and a relational model.
• Sqoop: a project for transferring/importing data between relational databases and Hadoop.
• Oozie: an orchestration and workflow manager for dependent Hadoop jobs.

Fig. 2 Hadoop Architecture

The paper is organised as follows: in Section II we describe Big Data Hadoop's traditional security and discuss its weaknesses and the associated security threats; Section III describes various security issues; in Section IV we present our analysis of a security solution for each of the Hadoop components in tabular form; Section V analyses security technologies used to secure Hadoop; finally, we conclude in Section VI.

II. BIG DATA HADOOP'S TRADITIONAL SECURITY

A. Hadoop Security Overview
Originally, Hadoop was developed without security in mind: there was no security model, no authentication of users and services, and no data privacy, so anybody could submit arbitrary code for execution. Although auditing and authorization controls (HDFS file permissions and ACLs) were used in earlier distributions, such access control was easily evaded because any user could impersonate any other user. Because impersonation was frequent and practised by most users, the security controls that did exist were not very effective. Authorization and authentication were added later, but they too had weaknesses. Because there were so few security controls within the Hadoop ecosystem, many accidents and security incidents happened in such environments. Well-meaning users can make mistakes (e.g. deleting massive amounts of data within seconds with a distributed delete). All users and programmers had the same level of access privilege to all the data in the cluster, any job could access any of the data in the cluster, and any user could read any data set [4]. Because MapReduce had no concept of authentication or authorization, a mischievous user could lower the priorities of other Hadoop jobs in order to make his own job complete faster or be executed first; worse, he could kill the other jobs.

Hadoop is an entire ecosystem of applications involving Hive, HBase, ZooKeeper, Oozie and the JobTracker, not just a single technology, and each of these applications requires hardening. To add security capabilities to a big data environment, the security functions need to scale with the data; bolted-on security does not scale well and simply cannot keep up [6].

The Hadoop community supports some security features through the current Kerberos implementation, the use of firewalls, and basic HDFS permissions and ACLs [5]. Kerberos is not a compulsory requirement for a Hadoop cluster, making it possible to run entire clusters without deploying or implementing any security. Kerberos is also not easy to install and configure on the cluster, nor to integrate with Active Directory (AD) and Lightweight Directory Access Protocol (LDAP) services [6]. This makes security difficult to implement and thus limits the adoption of even the most basic security functions by users of Hadoop. Nor is Hadoop security properly addressed by firewalls: once a firewall is breached, the cluster is wide open to attack. Firewalls offer no protection for data at rest or in motion within the cluster, and no protection against security failures that originate inside the firewall perimeter [6]. An attacker who can enter the data centre, physically or electronically, can steal any data they want, since the data is unencrypted and no authentication is enforced for access [6, 10].

B. Security Threats
We have identified three categories of security violation: unauthorized release of information, unauthorized modification of information, and denial of resources.
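Section II.A noted that Kerberos is optional and that entire clusters can run with no security deployed. As context for the threats enumerated next, the switch from an open cluster to a Kerberized one is governed by a couple of core-site.xml properties. The property names below are the standard Hadoop ones, but this is a minimal sketch, not a complete secure configuration:

```xml
<!-- core-site.xml: minimal switches for a Kerberized cluster.
     The default value "simple" trusts whatever username the client claims. -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <!-- enable service-level authorization checks -->
  <value>true</value>
</property>
```

A full deployment additionally requires keytabs and Kerberos principals for every Hadoop service, which is precisely the operational burden discussed above.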
The following are the related areas of threat we identify in Hadoop [7]:
• An unauthorized user may access an HDFS file via the RPC or HTTP protocols and could execute arbitrary code or carry out further attacks.
• An unauthorized client may read/write a data block of a file at a DataNode via the pipeline-streaming data-transfer protocol.
• An unauthorized client may gain access privileges and may submit a job to a queue, or delete or change the priority of a job.
• An unauthorized user may access the intermediate data of a Map job via its TaskTracker's HTTP shuffle protocol.
• A task in execution may use the host OS interfaces to access other tasks, or to access local data, including intermediate Map output or the local storage of the DataNode that runs on the same physical node.
• An unauthorized user may eavesdrop on/sniff data packets being sent by DataNodes to a client.
• A task or node may masquerade as a Hadoop service component such as a DataNode, NameNode, JobTracker, TaskTracker, etc.
• A user may submit a workflow to Oozie as another user.
• Because DataNodes imposed no access control, an unauthorized user could read arbitrary data blocks from DataNodes, bypassing access-control restrictions, or write garbage data to a DataNode [10].

III. SECURITY ISSUES
Hadoop presents a unique set of security issues for data-centre managers and security professionals. The core security issues are described below [5, 6]:
1) Fragmented data: Big Data clusters contain data that is fluid, allowing multiple copies to move to and fro between nodes to ensure redundancy and resiliency. The data can be fragmented and shared across multiple servers. This fragmentation adds complexity and poses a security issue, given the absence of a security model.
2) Distributed computing: Since the availability of resources means data is processed wherever it happens to reside, large levels of parallel computation arise. This creates complicated environments that are at higher risk of attack than centrally managed, monolithic repositories, which allow a simpler security posture.
3) Controlling data access: Commissioned data environments provide access at the schema level, devoid of finer granularity for addressing users in terms of roles and access scenarios. Many of the available database security schemas provide only role-based access.
4) Node-to-node communication: A concern with Hadoop and the variety of players in this field is that they do not implement secure communication; they use RPC (Remote Procedure Call) over TCP/IP.
5) Client interaction: A client communicates with the resource manager and the data nodes. However, there is a catch: even though this model facilitates efficient communication, it makes it cumbersome to shield nodes from clients and vice versa, and likewise to shield name servers from nodes. Compromised clients tend to propagate malicious data or links to either service.
6) Virtually no security: Big data stacks were designed with little or no security in mind. Prevailing big data installations are built on the web-services model, with few or no facilities for preventing common web threats, making them highly susceptible.

IV. HADOOP SECURITY SOLUTION
Hadoop is a distributed system that allows us to store huge amounts of data and process that data in parallel. Hadoop is used as a multi-tenant service and stores sensitive data such as personally identifiable information and financial data. Organizations, including financial ones, are beginning to store such sensitive data on Hadoop clusters. As a result, strong authentication and authorization are necessary [7].
The Hadoop ecosystem consists of various components, and we need to secure all of them. In this section we look at the security of each ecosystem component and the security solution for each one. Every component has its own security challenges and issues and needs to be configured properly, based on its architecture, to be secured. Each of these Hadoop components either has end users directly accessing it or has a backend service accessing the core Hadoop components (HDFS and MapReduce).
We have carried out a security analysis of the Hadoop components and a brief study of the built-in security of the Hadoop ecosystem, and we find that Hadoop security is not very strong. In this paper we therefore provide a security solution built around the four security pillars, i.e. authentication, authorization, encryption and audits (which we summarize as 3ADE), for each of the ecosystem components. This section describes the four pillars (sufficient and necessary) that help secure the Hadoop cluster; we narrow our focus and take a deep dive into the built-in and our proposed security solutions for the Hadoop ecosystem.

A. Authentication
Authentication is verifying the identity of a user or system accessing the system. Hadoop provides Kerberos as the primary authentication mechanism. Initially SASL/GSSAPI was used to implement Kerberos and mutually authenticate users, their applications and Hadoop services over RPC connections [7]. Hadoop also supports "pluggable" authentication for HTTP web consoles, meaning that implementers of web applications and web consoles can implement their own authentication mechanism for HTTP connections. This includes, but is not limited to, HTTP SPNEGO authentication.
The Hadoop components support the SASL framework: the RPC layer can be changed to support SASL-based mutual authentication, viz. SASL Digest-MD5 authentication or SASL GSSAPI/Kerberos authentication. MapReduce supports Kerberos authentication, SASL Digest-MD5 authentication and delegation-token authentication on RPC connections. In HDFS, communication between the NameNode and the DataNodes is over RPC connections, and mutual Kerberos authentication is performed between them [15]. HBase supports SASL Kerberos secure client authentication via RPC and HTTP. Hive supports Kerberos and LDAP authentication of users, as well as authentication via Apache Knox, explained in Section V.
Pig uses the user's credentials to submit the job to Hadoop, so no additional Kerberos authentication is required; before starting Pig, however, the user should authenticate with the KDC and obtain a valid Kerberos ticket [15]. Oozie provides user authentication to the Oozie web services; it also provides Kerberos HTTP Simple and Protected GSSAPI Negotiation Mechanism (SPNEGO) authentication for web clients. The SPNEGO protocol is used when a client application wants to authenticate to a remote server but is unsure which authentication protocol to use. ZooKeeper supports SASL Kerberos authentication on RPC connections. Hue offers SPNEGO and LDAP authentication, and now also supports SAML SSO authentication [15].
A number of mechanisms are involved in Hadoop authentication: the Kerberos RPC mechanism is used for authenticating users, applications and Hadoop services; HTTP SPNEGO authentication is used for the web consoles; and delegation tokens are used as well [10]. A delegation token is a two-party authentication protocol used between the user and the NameNode for authenticating users; it is simpler and more effective than the three-party protocol used by Kerberos [7, 15]. Oozie, HDFS and MapReduce support delegation tokens.

B. Authorization and ACLs
Authorization is the process of specifying access-control privileges for a user or system. In Hadoop, access control is implemented using file-based permissions that follow the UNIX permissions model. Access control to files in HDFS is enforced by the NameNode based on file permissions and on ACLs of users and groups. MapReduce provides ACLs for job queues, which define which users or groups can submit jobs to a queue and change queue properties. Hadoop offers fine-grained authorization using file permissions in HDFS, resource-level access control using ACLs for MapReduce, and coarser-grained access control at the service level [13]. HBase offers user authorization on tables and column families; this authorization is implemented using coprocessors, which are like database triggers in HBase [15]: they intercept every request to a table before and after it executes. Project Rhino [V] can now be used to extend HBase with support for cell-level ACLs. In Hive, authorization is implemented using Apache Sentry [V]. Pig provides authorization using ACLs for job queues; ZooKeeper also offers authorization, using node ACLs; and Hue provides access control via file-system permissions, along with ACLs for job queues.
Although Hadoop can be set up to perform access control via user and group permissions and Access Control Lists (ACLs), this may not be sufficient for every organization. Nowadays many organizations use flexible and dynamic access-control policies based on XACML and Attribute-Based Access Control (ABAC) [10, 13]. Hadoop can now be configured to support RBAC and ABAC access control using third-party frameworks or tools, some of which are discussed in this section and in Section V. Some of Hadoop's components can offer ABAC (for example HDFS, using Apache Knox), and Hive can support role-based access control using Apache Sentry. Zettaset Orchestrator, a product by Zettaset, provides role-based access-control support and enables Kerberos to be seamlessly integrated into the Hadoop ecosystem [6, 15].
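The UNIX-style permission checks that the NameNode applies to HDFS files can be modelled in a few lines. This is a toy Python sketch of the owner/group/other model only; the real NameNode logic (sticky bits, extended ACLs, superuser) is more involved:

```python
from collections import namedtuple

# A file carries an owner, a group and an octal mode; the check selects
# the owner, group or "other" permission bits depending on who is asking.
FileStatus = namedtuple("FileStatus", "owner group mode")

READ, WRITE, EXECUTE = 4, 2, 1

def is_permitted(status, user, user_groups, wanted):
    if user == status.owner:
        bits = (status.mode >> 6) & 7   # owner bits
    elif status.group in user_groups:
        bits = (status.mode >> 3) & 7   # group bits
    else:
        bits = status.mode & 7          # other bits
    return bits & wanted == wanted

f = FileStatus(owner="alice", group="analysts", mode=0o640)
assert is_permitted(f, "alice", {"analysts"}, READ | WRITE)  # owner: rw-
assert is_permitted(f, "bob", {"analysts"}, READ)            # group: r--
assert not is_permitted(f, "eve", {"marketing"}, READ)       # other: ---
```

As the text notes, this model alone cannot express attribute-based or role-based policies, which is why XACML/ABAC and tools such as Sentry are layered on top.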
Table 1. Security capabilities of the Hadoop ecosystem components

HDFS
  Authentication: SASL framework (GSSAPI/Kerberos), delegation tokens
  Authorization (resource level): POSIX permissions, ABAC
  Encryption of data at rest: third-party solution
  Encryption of data in transit: RPC (SASL), data-transfer protocol
  Audit trails: yes (base audit)

MapReduce
  Authentication: Kerberos, Digest-MD5, delegation tokens
  Authorization (resource level): job and queue ACLs
  Encryption of data at rest: third-party solution
  Encryption of data in transit: RPC (SASL), HTTPS
  Audit trails: yes (base audit)

HBase
  Authentication: SASL Kerberos (secure client authentication)
  Authorization (resource level): HBase ACLs on tables, columns, column families
  Encryption of data at rest: AES, OS level
  Encryption of data in transit: SASL (secure RPC)
  Audit trails: no (but a third-party solution can be used)

Hive
  Authentication: Kerberos, LDAP, Apache Knox
  Authorization (resource level): Apache Sentry
  Encryption of data at rest: third-party solution
  Encryption of data in transit: SASL
  Audit trails: yes (Hive metastore)

Pig
  Authentication: user-level permissions (Kerberos ticket of the submitting user)
  Authorization (resource level): ACLs for job queues
  Encryption of data at rest: ---
  Encryption of data in transit: third-party solution
  Audit trails: third-party solution

Oozie
  Authentication: Kerberos (SPNEGO), delegation tokens
  Authorization (resource level): ACLs
  Encryption of data at rest: third-party solution
  Encryption of data in transit: HTTPS
  Audit trails: yes (services)

ZooKeeper
  Authentication: SASL Kerberos authentication at the RPC layer
  Authorization (resource level): node ACLs
  Encryption of data at rest: N/A
  Encryption of data in transit: third-party solution
  Audit trails: third-party solution

Hue
  Authentication: Kerberos (pluggable), LDAP
  Authorization (resource level): ACLs and file-system permissions
  Encryption of data at rest: third-party solution
  Encryption of data in transit: SSL/TLS
  Audit trails: yes (Hue logs)
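Several entries above rely on delegation tokens, described earlier as a two-party alternative to Kerberos's three-party exchange. The idea can be sketched conceptually: the service issues a token whose password is an HMAC over the token identifier under a master key known only to the service, so later requests can be verified without contacting a third party. Field names and key handling below are illustrative, not Hadoop's actual token format:

```python
import hashlib
import hmac
import json

# Master secret held by the token-issuing service (illustrative).
MASTER_KEY = b"cluster-master-secret"

def issue_token(owner, renewer, expiry):
    # Serialize the token identifier deterministically, then derive its
    # password as an HMAC under the service's master key.
    ident = json.dumps({"owner": owner, "renewer": renewer,
                        "expiry": expiry}, sort_keys=True).encode()
    password = hmac.new(MASTER_KEY, ident, hashlib.sha256).digest()
    return ident, password

def verify_token(ident, password):
    # Recompute the HMAC and compare in constant time: no KDC round trip.
    expected = hmac.new(MASTER_KEY, ident, hashlib.sha256).digest()
    return hmac.compare_digest(expected, password)

ident, password = issue_token("alice", "oozie", expiry=1700000000)
assert verify_token(ident, password)       # genuine token accepted
assert not verify_token(ident, b"forged")  # forged password rejected
```

This is why delegation tokens suit long-running jobs: tasks can authenticate to the NameNode repeatedly without each task holding the user's Kerberos credentials.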
VI. CONCLUSION
In the Big Data era, where data is accumulated from various sources, security is a major concern and a critical requirement, since there is no single fixed source of data. With Hadoop gaining wider acceptance within the industry, a natural concern over its security has spread, and a growing need to adopt and assimilate these security solutions and commercial security features has surfaced. In this paper we have tried to cover the security solutions available to secure the Hadoop ecosystem.
REFERENCES
[1] Cloud Security Alliance, "Top Ten Big Data Security and Privacy Challenges".
[2] T. White, "Hadoop: The Definitive Guide", O'Reilly | Yahoo! Press.
[3] O. O'Malley, K. Zhang, S. Radia, R. Marti and C. Harrell, "Hadoop Security Design".
[4] M. Ferguson, "Enterprise Information Protection: The Impact of Big Data".
[5] Vormetric, "Securing Big Data: Security Recommendations for Hadoop and NoSQL Environments", October 12, 2012.
[6] Zettaset, "The Big Data Security Gap: Protecting the Hadoop Cluster".
[7] D. Das, O. O'Malley, S. Radia and K. Zhang, "Adding Security to Apache Hadoop".
[8] S. Sagiroglu and D. Sinanc, "Big Data: A Review", Collaboration Technologies and Systems (CTS), 2013 International Conference, May 2013.
[9] Hortonworks, "Technical Preview for Apache Knox Gateway".
[10] K. T. Smith, "Big Data Security: The Evolution of Hadoop's Security Model".
[11] M. Tim Jones, "Hadoop Security and Sentry".
[12] V. L. Voydock and S. T. Kent, "Security Mechanisms in High-Level Network Protocols", ACM Comput. Surv., 1983.
[13] V. Shukla, "Hadoop Security: Today and Tomorrow".
[14] M. Satyanarayanan, "Integrating Security in a Large Distributed System", ACM Trans. Comput. Syst., 1989.
[15] S. Narayanan, "Securing Hadoop: Implement Robust End-to-End Security for Your Hadoop Ecosystem", Packt Publishing.
[16] S. Singh and N. Singh, "Big Data Analytics", 2012 International Conference on Communication, Information & Computing Technology, Mumbai, India, IEEE, October 2011.
[17] jeffhurtblog.com, "three-vs-of-big-data-as-applied-conferences", July 7, 2012.