Big Data Management Challenges: Article
Article in International Journal of Advanced Trends in Computer Science and Engineering · February 2020
DOI: 10.30534/ijatcse/2020/102912020
All content following this page was uploaded by Sabrina Šuman on 16 April 2020.
ABSTRACT

The emergence of new data types in the big data era implies the need to analyse and exploit them to gain valuable business insight. Traditional platforms cannot fully meet the analytical needs of a company if support for unstructured data types is needed. This paper gives an overview and synthesis of areas related to big data technologies, with a series of guidelines for adopting the appropriate software, storage structure, and efficient deployment for big data management. A broad data management context is presented through a conceptual model of business performance management in a modern data management era.

Key words: Big data, Big data management, Hadoop, Big data processing, Data lake

1. INTRODUCTION

The ICT field is extremely dynamic, with the frequent emergence of new technologies. In the last decade there have been major changes in data management with the emergence of new types and sources of data, summed up in the big data concept. Big data are complex, layered, large amounts of data, and big data technologies are usually characterised by the standard 3Vs: Volume, Variety (many different data types from different sources) and Velocity, with an additional 2Vs: Veracity (data quality) and Value (value for business). Big data can also be structured, but they are primarily semi-structured and unstructured. It is primarily the size and diversity of the data that drives new analytics approaches [1] and imposes special ways of retrieving, transforming and preparing, storing and analysing data [2][3].

Utilizing big data technology has great potential for improving both personal life and business competitiveness [4][5]. At the same time, the large variety of data types makes it difficult to extract the right values from data [6], and data management in big data is extremely complex. The value of big data lies in the way that information obtained through analytical processing is used (for example, to reduce costs, reduce time in business processes, respond to queries faster, develop new products faster and better, eliminate failures and errors, improve customer relationships, assess business risk and make better quality decisions). Problems related to big data most commonly concern storage, processing, and management in general [7][8]. There are also issues related to data ownership, privacy and security (see [9] for security issues and algorithms), quality (a large amount of data from different sources should be readily available for analysis in a short time) and timeliness (large amounts of data mean longer analyses and streaming issues) [10]. Big data also brings new storage structures (e.g. data lakes, distributed file systems, non-relational databases, etc.), the need for new competencies of existing experts, and new expert profiles (big data analyst, big data engineer, big data architect and so on) [3].

In order to maintain and increase a company's ability to use technology for successful decision-making, it is necessary to build an innovative management platform that handles all new data types, new methods of storage and processing, and the application of intelligent methods (knowledge and information retrieval, pattern recognition, machine learning, optimization methods, etc.) [11]. As Russom's research [12] shows, employees are aware of changes and issues in big data: 82% of the 225 respondents believe that data in their company evolves in terms of diversity in structure, type, sources, the way the data are managed and how they are used in business (20% consider the change drastic, while 62% consider it moderate). In the same survey, one of the major issues highlighted was the incompatibility of large data types and structures with relational databases (68% of respondents). This requires a revision of the data management strategy and good information regarding the potential and disadvantages of big data technology, tools, platforms and ways of implementation. The purpose of this paper is to provide arguments and guidelines for the adoption of an appropriate big data management strategy, selection of software tools, storage structure and efficient deployment, through an overview of the field of big data technologies. The aim of the paper is to provide a synthetic
Sabrina Šuman et al., International Journal of Advanced Trends in Computer Science and Engineering, 9(1), January – February 2020, 717 – 723
overview of the area relevant to the establishment of modern data architecture. After the introduction, a review of concepts important for understanding the wider area of modern data architecture is given - big data technology from the aspect of data types, storage types, analytical processing, and overall big data management strategy. The concepts of warehouse modernization, the Hadoop ecosystem and the data lake are discussed in particular. In the results and discussion section, examples of modern data architecture are given, as well as a synthetic representation of the big data processing phases, with a description of each phase and examples of the tools used.

The motivation for this research stems from the current issues of managing new forms of data and concerns all data management phases: from data sources, through cleansing, analysis and visualization, to storage. Companies are also faced with the problem of choosing the right solutions to manage their data, and a review of possible solutions with a series of guidelines can be beneficial.

2. DATA MANAGEMENT OVERVIEW IN BIG DATA ERA

In this section, we review some traditional and modern elements of data management, such as data warehouses, the Hadoop framework, the data lake, Spark and MapReduce.

2.1. Data warehouses role in a big data era
Data warehouses were created due to the need to integrate the contents of different databases and other data sources over time and to access these data effectively for analytical processing. A warehouse is a centralized, cleaned and integrated organization of data from different sources. Today, the following questions can be asked: can data warehouses meet the data management needs of big data? Is a data warehouse needed in the big data era? Are solutions related to big data management a replacement for data warehouses?

First of all, a big data solution is a technology that involves the storage and management of big data, while a data warehouse is an architecture. Today, some companies have both a big data solution and a data warehouse, some have a data warehouse but no big data management solution, and some have a big data management solution but no data warehouse. Depending on the business activity and business model of the company, each of these scenarios may or may not be successful. As long as a company needs reliable, consolidated, relevant data for decision-making and management at all levels, it also needs a data warehouse [13]. Yet the traditional data warehouse, while providing the basis for reporting and analytics on structured data, probably does not represent the most cost-effective way to store all kinds of data [14]. The data warehouse concept should evolve in the big data era because it does not solve all issues related to analytics and decision support.

One of the changes in business requirements is the need for analysing unstructured and semi-structured data, performing streaming analytics, network analytics, and other phenomena and needs related to big data. The data are generated in huge quantities, completely incompatible with the relational model, often with unclear ownership, so it is difficult to manage and store them in a rigid structure. Often, many resources were spent on building a warehouse without thinking about data and analytical needs, and many implementations were unsuccessful, mostly as a result of bad, rigid designs. More modern and agile design and implementation techniques show greater success, customer satisfaction, and greater return on investment, because they allow for flexible upgrades and changes that follow changes in business requirements.

In response to the need to analyse new types of data in the big data era, many innovative tools and techniques have been developed to store and process these data. The basic idea behind these tools and techniques is that each company accesses the data in a customized, personalized way that meets its specifics and requirements. Thus, data warehousing needs to evolve to adapt to and coexist with other analytical solutions that include the use of new data types and new data sources. This does not mean that there will be no need to store and manage structured data in relational structures, but that companies will use different forms of storage, management, and data processing. Using and managing big data is a separate need and an additional business potential that cannot replace the need for warehouses and warehousing [13].

2.2. Hadoop framework
The Hadoop framework is becoming the standard technology for storing and processing large amounts of different data (big data). The Hadoop platform allows tasks to be performed on multiple computers (a cluster) and is optimized for processing large amounts of data. The Hadoop architecture consists of:
• Hadoop Common Package – contains the Java Archive Files (JARs) and scripts required to run and manage all Hadoop modules;
• MapReduce processing mechanism - a YARN-based system for parallel processing of large data sets;
• Hadoop distributed file system (HDFS) - a distributed, scalable and portable file system which stores large amounts of data (GB and TB) on multiple computers;
• YARN - a central platform for managing operations, security and data across Hadoop clusters [15], [16].
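The HDFS idea of splitting large files into blocks and replicating them across cluster nodes can be made concrete with a toy Python sketch. This is a conceptual illustration only, not the Hadoop API; the block size, node names and round-robin placement policy are simplifying assumptions for the example (real HDFS uses 128 MB blocks and rack-aware placement):

```python
# Toy illustration of HDFS-style block splitting and replication.
# NOT the Hadoop API - a conceptual sketch with invented names.

def split_into_blocks(data: bytes, block_size: int) -> list:
    """Split raw data into fixed-size blocks (HDFS default is 128 MB)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(num_blocks: int, nodes: list, replication: int = 3) -> dict:
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

data = b"x" * 1000                      # stand-in for a large file
blocks = split_into_blocks(data, 256)   # 4 blocks: 256+256+256+232 bytes
layout = place_blocks(len(blocks), ["node1", "node2", "node3", "node4"])
print(len(blocks))   # 4
print(layout[0])     # ['node1', 'node2', 'node3']
```

Because each block lives on several nodes, the loss of one machine does not lose data, and computation can be scheduled on any node holding a copy - the property the text refers to as moving processing to the data.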
Hadoop is a platform that offers a solution for many limitations of past technologies, such as storage constraints and large-volume data processing [17]. It supports multiple data formats – structured, semi-structured and unstructured. It is open source software, with low implementation costs and a low learning curve. Using NoSQL solutions, such as Apache Cassandra, helps eliminate performance, cost, and availability issues in big data applications. To manage such data, it is convenient to use Apache Hive, data warehousing software built on Hadoop, which allows reading, writing and managing large databases stored in distributed storage with SQL support [3].

The Hadoop ecosystem provides a variety of open source and commercial technologies for fast, interactive BI queries, data retrieval and exploration, and sophisticated analytical processes such as testing predictive models [18]. Today, these technologies enable users to run queries and analysis within the Hadoop cluster and/or clouds without having to move data to a data warehouse, data marts, or a stand-alone BI server [18]. The Hadoop cluster consists of many parallel machines where large data sets are stored and processed. Client computers send tasks to this "cloud" of computers and get results; storage and data processing take place within it. Different users can send computing tasks to Hadoop from individual clients (their own machines at locations remote from the Hadoop cluster). The linear scalability offered by Hadoop clusters, combined with flexible cloud storage scalability, can enable organizations to be agile and flexible in expanding computing power in response to immediate BI analytic needs. Companies can also improve security management by using security procedures that are set at the sources, as the preparation and processing take place where the data are located [18].

2.3. MapReduce
MapReduce is the programming framework for applications handling huge amounts of both structured and unstructured data stored in the Hadoop Distributed File System (HDFS). Each MapReduce job works in two phases: the map phase, which maps the input data to key/value sets, and the reduce phase, which takes key/value pairs and produces the desired output by applying its own algorithms [19]. MapReduce jobs can be written in many languages (Java, C++, or Python) and are easy to run. MapReduce can handle petabytes of data from HDFS on one cluster at maximum processing speeds. It takes care of failures and allows retrieval of a redundant copy of the data being processed. It moves the processing to the data in HDFS, and not vice versa: processing tasks can run on the physical node where the data are found, which contributes significantly to Hadoop's processing speed [20].

2.4. Spark
Apache Spark is a distributed, clustered open source computing platform and a big data processing framework. It is designed for the fast processing of large amounts of data and can be used for essentially all types of data processing (batch processing, interactive analysis, stream processing, machine learning and graph computing) [21]. It provides a fast general-purpose computing environment with sophisticated analytical capabilities that enable the development of analytic applications written in Java, Scala, Python or R [21]. Spark can run independently, on a Hadoop cluster or in the Mesos environment, and can connect at run time to HDFS, Amazon S3, Cassandra, and Apache HBase.

Apache Spark stands out for its speed due to its ability to perform in-memory processing, while MapReduce is more effective for disk-based processing. If linear processing of large data sets is required, the advantage should be given to Hadoop MapReduce, while Spark provides fast performance, real-time analytics, graph processing, machine learning and more; in many cases, Spark can surpass Hadoop MapReduce. Spark is also fully compatible with the Hadoop ecosystem [22]. It is suitable for use by a data scientist or a statistician with no or limited knowledge of cluster computing, usually through interactive shells similar to those of MATLAB or R. It is particularly suitable for interactive data mining of large data sets in clusters [23].

2.5. Data lake
The data lake is a central repository for all organization data (without a predefined data schema). The goal is to collect all the data before they are potentially lost. It provides data consolidation and a highly customizable analytical approach. The lake consists of a distributed, scalable file system (HDFS or Amazon S3) and one or more dedicated processing and query tools such as Apache Spark, Drill, Impala or Presto [24]. Upgrading an existing data warehouse with a data lake (creating a hybrid architecture) is a positive change for businesses. The advantages of augmenting the data warehouse include:
• Great savings in storage costs - scale-out architectures (e.g. Hadoop, AWS S3) can store unprocessed data in any format at a much lower cost than data warehouses;
• Significantly accelerated processing - a flexible data lake architecture enables faster data loading and parallel processing, resulting in faster analytical insight;
• Maximized efficiency - spending less time on low-value business activities (ETL, for example) and making better use of resources for strategic goals of high business value;
• More valuable business insights, obtained faster and from a larger amount of data (lower storage costs allow more data to be stored, leading to more accurate trends, better forecasts, etc.) [25].

A comparison of the classical data warehouse and the data lake, based on research data from [25][26], is given in Table 1.
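The two MapReduce phases described in Section 2.3 can be sketched in plain Python. This is a toy, single-process illustration of the map → shuffle/group → reduce data flow, not the Hadoop API; in a real cluster the map and reduce tasks run in parallel on the nodes holding the data:

```python
# Toy single-process word count illustrating the two MapReduce phases.
# Real MapReduce distributes this across a Hadoop cluster; this sketch
# only shows the map -> shuffle/group -> reduce data flow.
from collections import defaultdict

def map_phase(line: str):
    """Map phase: emit a (key, value) pair for every word in the line."""
    for word in line.lower().split():
        yield (word, 1)

def reduce_phase(key: str, values: list) -> tuple:
    """Reduce phase: combine all values emitted for one key."""
    return (key, sum(values))

def word_count(lines):
    # Shuffle step: group all intermediate values by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

result = word_count(["big data big tools", "data lake"])
print(result)  # {'big': 2, 'data': 2, 'tools': 1, 'lake': 1}
```

The same job in Spark would be expressed as a chain of transformations over an RDD or DataFrame held in memory, which is where Spark's speed advantage over disk-based MapReduce comes from.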
Russom’s research [12] shows that out of 225 respondents, 90% know the concept of data lakes, 24% think that a Hadoop data lake has a very high impact on the success of data management strategies in their company, 32% consider the impact moderate, while 44% of respondents do not consider the question at all. The users consider the following items (in terms of consequences) as the ones that would benefit the most from a data lake implementation (Table 2).

Table 2: Data lake's implementation positive impact [12]
Advanced Analytics (data mining, Machine Learning, Complex SQL)  49%
Data Exploration and Knowledge Discovery  49%
Sources of big data for analytics  45%
Data warehouse widening  39%
Data retention for storage  36%
Reducing Cost of Data Storage  34%
A possibility of using the data also for nontechnical user types  24%
To accept unstructured data  21%

As barriers to the implementation of Hadoop data lakes, users most often reported reasons related to data management, security, and a lack of knowledge and skills related to Hadoop and big data technology.

2.6. Other elements of a modern data architecture
Analytics / Sandbox Environments - This environment almost completely contradicts the easy-to-manage BI/DW environment, with its predictable workload that supports classical managerial reporting answering "what happened" business questions. It represents a research environment with very unpredictable load and usage patterns. In this environment, data scientists should have the freedom to experiment with new data types from different sources, new methods of data transformation, and new analytical models, in order to get new valuable insights from data and build predictive business models. It is loosely managed and allows data scientists to use whatever tools they prefer for research, analysis, and analytical modelling [27].

Data Lab - Big data and advanced analytics require different technologies and approaches. The analysis may require data that are not available in the warehouse. Models can be CPU-intensive and can create problems for other applications running at the same time. There may be conflicts between a warehouse administrator, who wants a carefully controlled environment, and analysts, especially data scientists, who want maximum flexibility. That is why data labs can be created, where users can make changes, unlike in the original warehouse tables. The data warehouse administrator creates lab owners for specific areas, such as marketing and sales, and each lab owner identifies the people who have access to the data lab. The administrator, in collaboration with the users, determines how much workspace is allocated to each lab and sets its expiration date [27].

3. RESULTS AND DISCUSSION
Based on the changes and challenges reported in the previous section in areas related to modern data management, a visualised and summarized overview and a more detailed insight into modern data architecture is given in Figure 1. It starts with the general data management cycle (with some modern architecture elements added). Then the possible benefits from
introducing modern data platform elements, and a number of possible strategies and implementation solutions, are discussed. At the end of this chapter, the big data processing phases are described in detail, along with descriptions of the basic processes and of the software support needed for their realization.

3.1. Data management context in big data era
In order to show the wider context of data management in the big data era, a synthetic graphical overview of the performance management cycle of an enterprise is given. A conceptual model, i.e. a framework (Figure 1), has been developed that specifies the components, that is, the areas that participate in the efficient business of the company.

The model is separated into three interconnected areas or subsystems: a subsystem representing internal or external data and/or the systems generating such data; a subsystem for "preparing" and storing data (cleaning, consolidating, structuring, aggregating, storing) to serve the company's analytical needs as efficiently as possible; and an analytic subsystem where data are "exploited" to achieve previously defined goals. In the last subsystem, a whole spectrum of different tools is used in a number of activities to extract the potential of different types of data.

The need for continuity in running such performance management cycles is stressed, so that the company adjusts its goals and KPIs and selects the tools that deliver the best results. The storage methods and analytical tools and methods used are divided into traditional BI (storage in data warehouses and analytical processes mostly over structured data) and storage and analytical processes typical of big data (data lakes and sandboxes). For each analytical category there is a full range of techniques and tools [28][29].
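The three subsystems of the conceptual model can be read as a simple pipeline: sources produce raw records, the preparation subsystem cleans and aggregates them, and the analytic subsystem derives KPIs from the prepared data. The following minimal Python sketch illustrates that flow; the record layout, the cleaning rule and the revenue KPI are invented for the example, not taken from the paper's model:

```python
# Minimal sketch of the three-subsystem cycle:
# sources -> preparation/storage -> analytics.
# The record fields and the KPI are invented for illustration.

def source_subsystem():
    """Subsystem 1: internal/external systems generating raw data."""
    return [
        {"customer": "A", "amount": "100"},
        {"customer": "B", "amount": None},      # dirty record
        {"customer": "B", "amount": "70"},
        {"customer": "A", "amount": "50"},
    ]

def preparation_subsystem(raw):
    """Subsystem 2: clean, structure and aggregate data for storage."""
    clean = [r for r in raw if r["amount"] is not None]
    totals = {}
    for r in clean:
        totals[r["customer"]] = totals.get(r["customer"], 0) + int(r["amount"])
    return totals

def analytic_subsystem(totals):
    """Subsystem 3: exploit the prepared data, e.g. compute KPIs."""
    return {"total_revenue": sum(totals.values()),
            "top_customer": max(totals, key=totals.get)}

kpis = analytic_subsystem(preparation_subsystem(source_subsystem()))
print(kpis)  # {'total_revenue': 220, 'top_customer': 'A'}
```

Running the cycle repeatedly, and adjusting goals and KPIs between runs, corresponds to the continuity of performance management cycles stressed above.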