Clouds and Big Data Computing
Highlights
Survey of solutions for carrying out analytics and Big Data on Clouds.
Identification of gaps in technology for Cloud-based analytics.
Recommendations of research directions for Cloud-based analytics and Big Data.
Corresponding authors. E-mail addresses: assuncao@acm.org (M.D. Assunção), rbuyya@unimelb.edu.au (R. Buyya).
http://dx.doi.org/10.1016/j.jpdc.2014.08.003
0743-7315/© 2014 Elsevier Inc. All rights reserved.
M.D. Assunção et al. / J. Parallel Distrib. Comput. 79–80 (2015) 3–15

1. Introduction

Society is becoming increasingly more instrumented and, as a result, organisations are producing and storing vast amounts of data. Managing and gaining insights from the produced data is a challenge and key to competitive advantage. Analytics solutions that mine structured and unstructured data are important as they can help organisations gain insights not only from their privately acquired data, but also from large amounts of data publicly available on the Web [118]. The ability to cross-relate private information on consumer preferences and products with information from tweets, blogs, product evaluations, and data from social networks opens a wide range of possibilities for organisations to understand the needs of their customers, predict their wants and demands, and optimise the use of resources. This paradigm is popularly termed Big Data.

Despite the popularity of analytics and Big Data, putting them into practice is still a complex and time-consuming endeavour. As Yu [136] points out, Big Data offers substantial value to organisations willing to adopt it, but at the same time poses a considerable number of challenges for the realisation of such added value. An organisation willing to use analytics technology frequently acquires expensive software licences, employs large computing infrastructure, and pays for consulting hours of analysts who work with the organisation to better understand its business, organise its data, and integrate it for analytics [120]. This joint effort of organisation and analysts often aims to help the organisation understand its customers' needs, behaviours, and future demands for new products or marketing strategies. Such effort, however, is generally costly and often lacks flexibility. Nevertheless, research on and application of Big Data are being extensively explored by governments, as evidenced by initiatives in the USA [20] and the UK [106]; by academics, such as the bigdata@csail initiative at MIT [19]; and by companies such as Intel [122].

Cloud computing has been revolutionising the IT industry by adding flexibility to the way IT is consumed, enabling organisations to pay only for the resources and services they use. In an effort to
reduce IT capital and operational expenditures, organisations of all sizes are using Clouds to provide the resources required to run their applications. Clouds vary significantly in their specific technologies and implementation, but often provide infrastructure, platform, and software resources as services [25,13].

The most often claimed benefits of Clouds include offering resources in a pay-as-you-go fashion, improved availability and elasticity, and cost reduction. Clouds can prevent organisations from spending money on maintaining peak-provisioned IT infrastructure that they are unlikely to use most of the time. Whilst at first glance the value proposition of Clouds as a platform to carry out analytics is strong, there are many challenges that need to be overcome to make Clouds an ideal platform for scalable analytics.

In this article we survey approaches, environments, and technologies in areas that are key to Big Data analytics capabilities and discuss how they help build analytics solutions for Clouds. We focus on the most important technical issues in enabling Cloud analytics, but also highlight some of the non-technical challenges faced by organisations that want to provide analytics as a service in the Cloud. In addition, we describe a set of gaps and recommendations for the research community on future directions in Cloud-supported Big Data computing.

2. Background and methodology

Organisations are increasingly generating large volumes of data as a result of instrumented business processes, monitoring of user activity [14,127], web site tracking, sensors, finance, accounting, among other reasons. With the advent of social network Web sites, users create records of their lives by daily posting details of activities they perform, events they attend, places they visit, pictures they take, and things they enjoy and want. This data deluge is often referred to as Big Data [99,55,17]; a term that conveys the challenges it poses to existing infrastructure with respect to storage, management, interoperability, governance, and analysis of the data.

In today's competitive market, being able to explore data to understand customer behaviour, segment the customer base, offer customised services, and gain insights from data provided by multiple sources is key to competitive advantage. Although decision makers would like to base their decisions and actions on insights gained from this data [43], making sense of data, extracting non-obvious patterns, and using these patterns to predict future behaviour are not new topics. Knowledge Discovery in Data (KDD) [50] aims to extract non-obvious information using careful and detailed analysis and interpretation. Data mining [133,84], more specifically, aims to discover previously unknown interrelations among apparently unrelated attributes of data sets by applying methods from several areas, including machine learning, database systems, and statistics. Analytics comprises techniques of KDD, data mining, text mining, statistical and quantitative analysis, explanatory and predictive models, and advanced and interactive visualisation to drive decisions and actions [43,42,63].

Fig. 1 depicts the common phases of a traditional analytics workflow for Big Data. Data from various sources, including databases, streams, marts, and data warehouses, are used to build models. The large volume and different types of the data can demand pre-processing tasks for integrating the data, cleaning it, and filtering it. The prepared data is used to train a model and to estimate its parameters. Once the model is estimated, it should be validated before its consumption. Normally this phase requires the use of the original input data and specific methods to validate the created model. Finally, the model is consumed and applied to data as it arrives. This phase, called model scoring, is used to generate predictions, prescriptions, and recommendations. The results are interpreted and evaluated, used to generate new models or calibrate existing ones, or are integrated with pre-processed data.

Analytics solutions can be classified as descriptive, predictive, or prescriptive, as illustrated in Fig. 2. Descriptive analytics uses historical data to identify patterns and create management reports; it is concerned with modelling past behaviour. Predictive analytics attempts to predict the future by analysing current and historical data. Prescriptive solutions assist analysts in decisions by determining actions and assessing their impact regarding business objectives, requirements, and constraints.

Despite the hype, using analytics is still a labour-intensive endeavour. This is because current solutions for analytics are often based on proprietary appliances or software systems built for general purposes. Thus, significant effort is needed to tailor such solutions to the specific needs of the organisation, which includes integrating different data sources and deploying the software on the company's hardware (or, in the case of appliances, integrating the appliance hardware with the rest of the company's systems) [120]. Such solutions are usually developed and hosted on the customer's premises, are generally complex, and their operations can take hours to execute. Cloud computing provides an interesting model for analytics, where solutions can be hosted on the Cloud and consumed by customers in a pay-as-you-go fashion. For this delivery model to become reality, however, several technical issues must be addressed, such as data management, tuning of models, privacy, data quality, and data currency.

This work highlights technical issues and surveys existing work on solutions to provide analytics capabilities for Big Data on the Cloud. Considering the traditional analytics workflow presented in Fig. 1, we focus on key issues in the phases of an analytics solution. With Big Data it is evident that many of the challenges of Cloud analytics concern data management, integration, and processing. Previous work has focused on issues such as data formats, data representation, storage, access, privacy, and data quality. Section 3 presents existing work addressing these challenges in Cloud environments. In Section 4, we elaborate on existing models to provide and evaluate data models on the Cloud. Section 5 describes solutions for data visualisation and customer interaction with analytics solutions provided by a Cloud. We also highlight some of the business challenges posed by this delivery model when we discuss service structures, service level agreements, and business models. Security is certainly a key challenge for hosting analytics solutions on public Clouds. We consider, however, that security is an extensive topic and would hence deserve a study of its own. Therefore, security and evaluation of data correctness [130] are out of the scope of this survey.

3. Data management

One of the most time-consuming and labour-intensive tasks of analytics is the preparation of data for analysis; a problem often exacerbated by Big Data as it stretches existing infrastructure to its limits. Performing analytics on large volumes of data requires efficient methods to store, filter, transform, and retrieve the data. Some of the challenges of deploying data management solutions on Cloud environments have been known for some time [1,113,82], and solutions to perform analytics on the Cloud face similar challenges. Cloud analytics solutions need to consider the multiple Cloud deployment models adopted by enterprises, where Clouds can be, for instance:

Private: deployed on a private network, managed by the organisation itself or by a third party. A private Cloud is suitable for businesses that require the highest level of control of security and data privacy. In such conditions, this type of Cloud infrastructure can be used to share services and data more efficiently across the different departments of a large enterprise.

Public: deployed off-site over the Internet and available to the general public. A public Cloud offers high efficiency and shared resources at low cost. The analytics services and data management are handled by the provider, and the quality of service (e.g. privacy, security, and availability) is specified in a contract. Organisations can leverage these Clouds to carry out analytics at a reduced cost or share insights of public analytics results.

Hybrid: combines both Clouds, where additional resources from a public Cloud can be provided as needed to a private Cloud. Customers can develop and deploy analytics applications using a private environment, thus reaping benefits from elasticity and a higher degree of security than using only a public Cloud.

Considering the Cloud deployments, the following scenarios are generally envisioned regarding the availability of data and analytics models [87]: (i) data and models are private; (ii) data is public, models are private; (iii) data and models are public; and (iv) data is private, models are public. Jensen et al. [79] advocate deployment models for Cloud analytics solutions that vary from solutions using privately hosted software and infrastructure, to private analytics hosted on third-party infrastructure, to a public model where the solutions are hosted on a public Cloud.

Different from traditional Cloud services, analytics deals with high-level capabilities that often demand very specialised resources, such as data and the analysis skills of domain experts. For this reason, we advocate that under certain business models, especially those where data and models reside on the provider's premises, not only ordinary Cloud services but also the skills of data experts need to be managed. To achieve economies of scale and elasticity, Cloud-enabled Big Data analytics needs to explore means to allocate and utilise these specialised resources in a proper manner. The rest of this section discusses existing solutions on data management irrespective of where data experts are physically located, focusing on storage and retrieval of data for analytics; data diversity, velocity, and integration; and resource scheduling for data processing tasks.

3.1. Data variety and velocity

Fig. 3. Some Vs of Big Data.

Big Data is characterised by what is often referred to as a multi-V model, as depicted in Fig. 3. Variety represents the data types, velocity refers to the rate at which the data is produced and processed, and volume defines the amount of data. Veracity refers to how much the data can be trusted given the reliability of its source [136], whereas value corresponds to the monetary worth that a company can derive from employing Big Data computing. Although the choice of Vs used to explain Big Data is often arbitrary and varies across reports and articles on the Web (e.g., as of writing, viability is becoming a new V), variety, velocity, and volume [112,140] are the items most commonly mentioned.

Regarding variety, it can be observed that over the years a substantial amount of data has been made publicly available for scientific and business uses. Examples include repositories with government statistics;¹ historical weather information and forecasts; DNA sequencing; information on traffic conditions in large metropolitan areas; product reviews and comments; demographics [105]; comments, pictures, and videos posted on social network Web sites; information gathered using citizen-science platforms [22]; and data collected by a multitude of sensors measuring various environmental conditions such as temperature, air humidity, air quality, and precipitation.

An example illustrating the need for such variety within a single analytics application is the Eco-Intelligence [139] platform. Eco-Intelligence was designed to analyse large amounts of data to support city planning and promote more sustainable development. The platform aims to efficiently discover and process data from several sources, including sensors, news, Web sites, television and radio, and to exploit this information to help urban stakeholders cope with the highly dynamic nature of urban development. In a related scenario, the Mobile Data Challenge (MDC) was created to generate innovations in smartphone-based research and to enable community evaluation of mobile data analysis methodologies [90]. Data from around 200 users of mobile phones was collected over a year as part of the Lausanne Data Collection Campaign. Another related area benefiting from analytics is Massively Multiplayer Online Games (MMOGs). CAMEO [78] is an architecture for continuous analytics for MMOGs that uses Cloud resources for analysis tasks. The architecture provides mechanisms for data collection and continuous analytics on several factors, such as understanding the needs of the game community.

1 http://www.data.gov.
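The traditional analytics workflow of Section 2 (pre-processing, model estimation, validation, and model scoring) can be made concrete with a minimal, self-contained sketch. The toy data set, the simple least-squares model, and the error tolerance below are illustrative assumptions, not part of any system surveyed here:

```python
# Minimal sketch of the analytics workflow phases: pre-process,
# estimate (train), validate, and score a model. Pure Python; the
# data and the least-squares model are illustrative assumptions.

def preprocess(records):
    """Integrate, clean, and filter: drop records with missing fields."""
    return [(x, y) for x, y in records if x is not None and y is not None]

def estimate(data):
    """Estimate parameters of y = a*x + b by ordinary least squares."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    cov = sum((x - mx) * (y - my) for x, y in data)
    var = sum((x - mx) ** 2 for x, _ in data)
    a = cov / var
    return a, my - a * mx

def validate(model, data, tolerance):
    """Accept the model only if mean absolute error is within tolerance."""
    a, b = model
    mae = sum(abs(y - (a * x + b)) for x, y in data) / len(data)
    return mae <= tolerance

def score(model, x):
    """Model scoring: apply the validated model to newly arriving data."""
    a, b = model
    return a * x + b

raw = [(1, 2.1), (2, 3.9), (None, 5.0), (3, 6.1), (4, 8.0)]
data = preprocess(raw)          # pre-processing phase
model = estimate(data)          # model estimation phase
if validate(model, data, tolerance=0.5):
    prediction = score(model, 5)  # scoring phase on new input
```

In a Cloud-hosted analytics service, each of these phases would typically be a separately provisioned (and billed) stage of the pipeline, which is why the following sections treat data management, model building, and scoring as distinct concerns.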
Tantisiriroj et al. [124] compared the Parallel Virtual File System (PVFS) [28] against HDFS, observing that PVFS did not present significant improvement in completion time and throughput compared to HDFS.

Although a large part of the data produced nowadays is unstructured, relational databases have been the choice most organisations have made to store data about their customers, sales, and products, among other things. As data managed by traditional DBMSs ages, it is moved to data warehouses for analysis and for sporadic retrieval. Models such as MapReduce are generally not the most appropriate to analyse such relational data. Attempts have been made to provide hybrid solutions that incorporate MapReduce to perform some of the queries and data processing required by DBMSs [1]. Cohen et al. [39] provide a parallel database design for analytics that supports SQL and MapReduce scripting on top of a DBMS to integrate multiple data sources. A few providers of analytics and data mining solutions, by exploring models such as MapReduce, are migrating some of the processing tasks closer to where the data is stored, thus trying to minimise surpluses of siloed data preparation, storage, and processing [85]. Data processing and analytics capabilities are moving towards Enterprise Data Warehouses (EDWs), or are being deployed in data hubs [79] to facilitate reuse across various data sets.

With respect to EDWs, some Cloud providers offer solutions that promise to scale to one petabyte of data or more. Amazon Redshift [2], for instance, offers columnar storage and data compression, and aims to deliver high query performance by exploring a series of features, including a massively parallel processing architecture using high-performance hardware, mesh networks, locally attached storage, and zone maps to reduce the I/O required by queries. Amazon Data Pipeline [3] allows a customer to move data across different Amazon Web Services, such as Elastic MapReduce (EMR) [4] and DynamoDB [46], and hence compose the required analytics capabilities.

Another distinctive trend in Cloud computing is the increasing use of NoSQL databases as the preferred method for storing and retrieving information. NoSQL adopts a non-relational model for data storage. Leavitt argues that non-relational models have been available for more than 50 years in forms such as object-oriented, hierarchical, and graph databases, but recently this paradigm started to attract more attention with models such as key-value, column-oriented, and document-based stores [92]. The causes of this rise in interest, according to Leavitt, are better performance, the capacity to handle unstructured data, and suitability for distributed environments [92].

Han et al. [68] presented a survey of NoSQL databases with emphasis on their advantages and limitations for Cloud computing. The survey classifies NoSQL systems according to their capacity to address different pairs of the CAP properties (consistency, availability, partitioning), and also explores the data models that the studied NoSQL systems support.

Hecht and Jablonski [69] compared different NoSQL systems with regard to supported data models, types of queries supported, and support for concurrency, consistency, replication, and partitioning. They concluded that there are big differences among the features of different technologies and that there is no single system that would be the most suitable for every need. Therefore, it is important for adopters to understand the requirements of their applications and the capabilities of different systems so that the system whose features better match their needs is selected [69].

3.3. Data integration solutions

Forrester Research published a technical report that discusses some of the problems that traditional Business Intelligence (BI) faces [85], highlighting that there is often a surplus of siloed data preparation, storage, and processing. The authors of the report envision some data processing and Big Data analytics capabilities being migrated to the EDW, hence freeing organisations from unnecessary data transfer and replication and from the use of disparate data-processing and analysis solutions. Moreover, as discussed earlier, they advocated that analytics solutions will increasingly expose data processing and analysis features via MapReduce and SQL/MR-like interfaces. SAP HANA One [115], as an example, is an in-memory platform hosted by Amazon Web Services that provides real-time analytics for SAP applications. HANA One also offers a SAP data integrator to load data from HDFS and Hive-accessible databases.

EDWs or Cloud-based data warehouses, however, create certain issues with respect to data integration and the addition of new data sources. Standard formats and interfaces can be essential to achieve economies of scale and meet the needs of a large number of customers [52]. Some solutions attempt to address some of these issues [105,21]. Birst [21] provides composite spaces and space inheritance, where a composite space integrates data from one or more parent spaces with additional data added to the composite space. Birst provides a Software as a Service (SaaS) solution that offers analytics functionalities on a subscription model, and appliances with the business analytics infrastructure, hence providing a model that allows a customer to migrate gradually from on-premise analytics to a scenario with Cloud-provided analytics infrastructure. To improve the market penetration of analytics solutions in emerging markets such as India, Deepak et al. [48] propose a multi-flow solution for analytics that can be deployed on the Cloud. The multi-flow approach provides a range of possible analytics operators and flows to compose analytics solutions, viewed as workflows or instantiations of a multi-flow solution. IVOCA [18] is a tool aimed at Customer Relationship Management (CRM) that ingests both structured and unstructured data and provides data linking, classification, and text mining tools to facilitate analysts' tasks and reduce the time to insight.

Habich et al. [67] propose Web services that co-ordinate data Clouds for exchanging massive data sets. The Business Process Execution Language (BPEL) data transition approach is used for data exchange, passing references to data between services to reduce the execution time and guarantee the correct data processing of an analytics process. A generic data Cloud layer is introduced to handle heterogeneous data Clouds and is responsible for mapping generic operations to each Cloud implementation. DataDirect Cloud [41] also provides generic interfaces by offering JDBC/ODBC drivers for applications to execute SQL queries against different databases stored on a Cloud. Users are not required to deal with different APIs and query languages specific to each Cloud storage solution.

PivotLink's AnalyticsCLOUD [105] handles both structured and unstructured data, providing data integration features. PivotLink also provides DataCLOUD, with information on over 350 demographic, hobby, and interest data fields for 120 million US households. This information can be used by customers to perform brand sentiment analysis [51] and verify how weather affects their product performance.

3.4. Data processing and resource management

MapReduce [45] is one of the most popular programming models to process large amounts of data on clusters of computers. Hadoop [10] is the most used open-source MapReduce implementation, also made available by several Cloud providers [4,16,77,132]. Amazon EMR [4] enables customers to instantiate Hadoop clusters to process large amounts of data using the Amazon Elastic Compute Cloud (EC2) and other Amazon Web Services for data storage and transfer.
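The MapReduce model underlying Hadoop and EMR can be illustrated with a minimal in-process word count. The map, shuffle, and reduce functions below mimic the phases conceptually; they are illustrative and do not use the Hadoop or EMR APIs:

```python
# Minimal in-process sketch of the MapReduce model (map, shuffle,
# reduce) applied to a word count. Conceptual only: a real framework
# distributes these phases across cluster nodes.
from collections import defaultdict

def map_phase(documents):
    # Map: emit one ("word", 1) pair per word occurrence.
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a final count.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data on clouds", "big data analytics"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts["big"] == 2, counts["data"] == 2, counts["clouds"] == 1
```

In Hadoop the shuffle is performed transparently by the framework between map and reduce tasks; it is made explicit here to show where data movement, and hence much of the cost discussed below, occurs.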
8 M.D. Assuno et al. / J. Parallel Distrib. Comput. 7980 (2015) 315
Hadoop uses the HDFS file system to partition and replicate data The eXtreme Analytics Platform (XAP) [15] enables analytics
sets across multiple nodes, such that when running a MapReduce supporting multiple data sources, data types (structured and un-
application, a mapper is likely to access data that is locally stored structured), and multiple types of analyses. The target infrastruc-
on the cluster node where it is executing. Although Hadoop pro- ture of the architecture is a cluster running a distributed file
vides a set of APIs that allows developers to implement MapReduce system. A modified version of Hadoop, deployed in the cluster, con-
applications, very often a Hadoop workflow is composed of jobs tains an application scheduler (FLEX) able to better utilise the avail-
that use high-level query languages such as Hive and Pig Latin, cre- able resources than the default Hadoop scheduler. The analytics
ated to facilitate search and specification of processing tasks. Lee jobs are created via a high-level language script, called Jaql, that
et al. [94] present a survey about the features, benefits, and limita- converts the high-level descriptive input into an analytics MapRe-
tions of MapReduce for parallel data analytics. They also discuss ex- duce workflow that is executed in the target infrastructure.
tensions proposed for this programming model to overcome some Previous work has also considered other models for performing
of its limitations. analytics, such as scientific workflows and Online Analytical Pro-
Hadoop provides data parallelism and its data and task repli- cessing (OLAP). Rahman et al. [108] propose a hybrid heuristic for
cation schemes enable fault tolerance, but what is often criticised scheduling data analytics workflows on heterogeneous Cloud en-
about it is the time required to load data into HDFS and the lack of vironments; a heuristic that optimises cost of workflow execution
reuse of data produced by mappers. MapReduce is a model created and satisfies users requirements, such as budget, deadline, and data
to exploit commodity hardware, but when executed on reliable placement.
infrastructure, the mechanisms it provides to deal with failures In the field of simulation-enabled analytics, Li et al. [97]
may not be entirely essential. Some of the provided features can developed an analytical application, modelled as a Direct Acyclic
Graph (DAG), for predicting the spread of dengue fever outbreaks
be disabled in certain scenarios. Herodotou and Babu [71] present
in Singapore. The analytics workflow receives data from multiple
techniques for profiling MapReduce applications, identifying bot-
sources, including current and past data about climate and weather
tlenecks and simulating what-if scenarios. Previous work has
from meteorological agencies and historical information about
also proposed optimisations to handle these shortcomings [66].
dengue outbreaks in the country. This data, with user-supplied
Cuzzocrea et al. [40] discuss issues concerning analytics over big
input about the origin of the infection, is used to generate a map
multidimensional data and the difficulties in building multidimen-
of the spread of the disease in the country in a day-by-day basis. A
sional structures in HDFS and integrating multiple data sources to
hybrid Cloud is used to speed up the application execution. Other
Hadoop.
characteristics of the application are security features and cost-
Starfish [72], a data analytics system built atop Hadoop, focuses
effective exploration of Cloud resources: the system keeps the
on improving the performance of clusters throughout the data utilisation of public Cloud resources to a minimum to enable the
lifecycle in analytics, without requiring users to understand the analytics to complete in the specified time and budget. A public
available configuration options. Starfish employs techniques at Cloud has also been used in a similar scenario to simulate the
several levels to optimise the execution of MapReduce jobs. It uses impact of public transport disruptions on urban mobility [81].
dynamic instrumentation to profile jobs and optimises workflows Chohan et al. [36] evaluated the support of OLAP for Google
by minimising the impact of data unbalance and by balancing the App Engine (GAE) [58] highlighting limitations and assessing their
load of executions. Starfishs Elastisizer automates provisioning impact on cost and performance of applications. A hybrid approach
decisions using a mix of simulation and model-based estimation to perform OLAP using GAE and AppScale [24] was provided, using
to address what-if questions on workload performance. two methods for data synchronisation, namely bulk data transfer
Lee et al. [93] present an approach that allocates resources and and incremental data transfer. Moreover, Jung et al. [80] propose
schedules jobs considering data analytics workloads, in order to optimisations for scheduling and processing of Big Data analysis
enable consolidation of a cluster workload, reducing the number on federated Clouds.
of machines allocated for processing the workload during periods Chang et al. [29] examined different data analytics workloads,
of small load. The approach uses Hadoop and works with two pools where results show significant diversity of resource usage (CPU,
of machines core and accelerator and dynamically adjusts the I/O and, network). They recommend the use of transformation
size of each pool according to the observed load. mechanisms such as indexing, compression, and approximation to
Daytona [16], a MapReduce runtime for Windows Azure, lever- provide a balanced system and improve efficiency of data analysis.
ages the scalable storage services provided by Azures Cloud infras- The Cloud can also be used to extend the capabilities of analyses
tructure as the source and destination of data. It uses Cloud features initially started on the customers premises. CloudComet, for
to provide load balancing and fault tolerance. The system relies on example, is an autonomic computing engine that supports Cloud
a masterslave architecture where the master is responsible for bursts that has been used to provide the programming and runtime
scheduling tasks and the slaves for carrying out map and reduce infrastructure to scale out/in certain on-line risk analyses [83].
operations. Section 5 discusses the visualisation features that Day- CloudComet and commercial technologies such as Aneka [27] can
tona provides. utilise both private resources and resources from a public Cloud
Previous work shows that there is an emerging class of MapRe- provider to handle peaks in the demands of online risk analytics.
duce applications that feature small, short, and highly interactive Some analytics applications including stock quotes and weather
jobs [31,54]. As highlighted in Section 5, the visualisation commu- prediction have stringent time constraints, usually falling in the
nity often criticises the lack of interactivity of MapReduce-based near-time and stream categories described earlier. Request pro-
analytics solutions. Over the past few years, however, several at- cessing time is important to deliver results in a timely fash-
tempts have been made to tackle this issue. Borthakur et al. [23], for ion. Chen et al. [33] investigate Continuous analytics as a Service
instance, describe optimisations implemented in HDFS and HBase8 (CaaaS) that blends stream processing and relational data tech-
to make them more responsive to the realtime requirements of niques to extend the DBMS model and enable real-time continu-
Facebook applications. Chen et al. [30] propose energy efficiency ous analytics service provisioning. The dynamic stream processing
improvements to Hadoop by maintaining two distinct pools of re- and static data management for data intensive analytics are uni-
sources, namely interactive and batch jobs. fied by providing an SQL-like interface to access both static and
stream data. The proposed cycle-based query model and transac-
tion model allow SQL queries to run and to commit per cycle whilst
analysing stream data per chunk. The analysis results are made
8 http://hbase.apache.org. visible to clients whilst a continued query for results generation
M.D. Assunção et al. / J. Parallel Distrib. Comput. 79–80 (2015) 3–15
is still running. Existing work on stream and near-time processing attempts to leverage strategies to predict user or service behaviour [137]. In this way, an analytics service can pre-fetch data to anticipate a user's behaviour, hence selecting the appropriate applications and methods before the user's request arrives.

Real-time analysis of Big Data is a hot topic, with Cloud providers increasingly offering solutions that can be used as building blocks of stream and complex event processing systems. AWS Kinesis [5] is an elastic system for real-time processing of streaming data that can handle multiple sources, be used to build dashboards, handle events, and generate alerts. It allows for integration with other AWS services. In addition, stream processing frameworks including Apache S4 [9], Storm [121] and IBM InfoSphere Streams [75] can be deployed on existing Cloud offerings. Software systems such as storm-deploy, a Clojure project based on Pallet (http://github.com/pallet/pallet), aim to ease deployment of Storm topologies on Cloud offerings including AWS EC2. Suro, a data pipeline system used by Netflix to collect events generated by its applications, has recently been made available to the broader community as an open source project [8]. Aiming to address similar requirements, Apache Kafka [62] is a real-time publish-subscribe infrastructure initially used at LinkedIn to process activity data and later released as an open source project. Incubated by the Apache Software Foundation, Samza [12] is a distributed stream processing framework that blends Kafka and Apache Hadoop YARN. Whilst Samza provides a model where streams are the input and output to jobs, execution is completely handled by YARN.

3.5. Challenges in big data management

In this section, we discuss current research targeting the issue of Big Data management for analytics. There are still, however, many open challenges in this topic. The list below is not exhaustive, and as more research in this field is conducted, more challenging issues will arise.

- Data variety: How to handle an ever-increasing volume of data? Especially when the data is unstructured, how to quickly extract meaningful content out of it? How to aggregate and correlate streaming data from multiple sources?
- Data storage: How to efficiently recognise and store important information extracted from unstructured data? How to store large volumes of information in a way it can be timely retrieved? Are current file systems optimised for the volume and variety demanded by analytics applications? If not, what new capabilities are needed? How to store information in a way that it can be easily migrated/ported between data centres/Cloud providers?
- Data integration: New protocols and interfaces for integration of data that are able to manage data of different nature (structured, unstructured, semi-structured) and sources.
- Data processing and resource management: New programming models optimised for streaming and/or multidimensional data; new backend engines that manage optimised file systems; engines able to combine applications from multiple programming models (e.g. MapReduce, workflows, and bag-of-tasks) on a single solution/abstraction. How to optimise resource usage and energy consumption when executing the analytics application?

4. Model building and scoring

The data storage and Data as a Service (DaaS) capabilities provided by Clouds are important, but for analytics it is equally relevant to use the data to build models that can be utilised for forecasts and prescriptions. Moreover, as models are built based on the available data, they need to be tested against new data in order to evaluate their ability to forecast future behaviour. Existing work has discussed means to offload such activities, termed here as model building and scoring, to Cloud providers and ways to parallelise certain machine learning algorithms [126,11,74]. This section describes work on the topic. Table 1 summarises the analysed work, its goals, and target infrastructures.

Table 1. Summary of works on model building and scoring (columns: Work, Goal, Service model, Deployment model).

Guazzelli et al. [64] use Amazon EC2 as a hosting platform for the Zementis ADAPA [138] model scoring engine. Predictive models, expressed in Predictive Model Markup Language (PMML) [65], are deployed in the Cloud and exposed via Web Services interfaces. Users can access the models with Web browser technologies to compose their data mining solutions. Existing work also advocates the use of PMML as a language to exchange information about predictive models [73].

Zementis [138] also provides technologies for data analysis and model building that can run either on a customer's premises or be allocated as SaaS using Infrastructure as a Service (IaaS) provided by solutions such as Amazon EC2 and IBM SmartCloud Enterprise [76].

Google Prediction API [59] allows users to create machine learning models to predict numeric values for a new item based on values of previously submitted training data, or to predict a category that best describes an item. The Prediction API allows users to submit training data as comma-separated files following certain conventions, create models, share their models, or use models that others have shared. With the Google Prediction API, users can develop applications to perform analytics tasks such as sentiment analysis [51], purchase prediction, recommendation, churn analysis, and spam detection. The Apache Mahout project [11] aims to provide tools to build scalable machine learning libraries on top of Hadoop using the MapReduce paradigm. The provided libraries can be deployed on a Cloud and be explored to build solutions that require clustering, recommendation mining, document categorisation, among others.

By trying to ease the complexity of building trained systems such as IBM's Watson, Apple's Siri and Google Knowledge Graph, the Hazy project [88] focuses on identifying and validating two categories of abstractions for building trained systems, namely programming abstractions and infrastructure abstractions. It is argued that, by providing such abstractions, it would be easier for one to assemble existing solutions and build trained systems. To achieve a small and compoundable programming interface, Hazy employs a data model that combines the relational data model and a probabilistic rule-based language. For infrastructure abstraction, Hazy
leverages the observation that many statistical analysis algorithms behave as a user-defined aggregate in a Relational Database Management System (RDBMS). Hazy then explores features of the underlying infrastructure to improve the performance of these aggregates.

4.1. Open challenges

The key challenge in the area of model building and scoring is the discovery of techniques that are able to explore the rapid elasticity and large scale of Cloud systems. Given that the amount of data available for Big Data analytics is increasing, timely processing of such data for building and scoring would give a relevant advantage to businesses able to explore such a capability.

In the same direction, standards and interfaces for these activities are also required, as they would help to disseminate prediction and analytics as services, with providers competing for customers. If the use of such services does not incur vendor lock-in (via utilisation of standard APIs and formats), customers can choose a service provider based only on the cost and performance of its services, enabling the emergence of a new competitive market.

5. Visualisation and user interaction

With the increasing amounts of data with which analyses need to cope, good visualisation tools are crucial. These tools should consider the quality of data and presentation to facilitate navigation [44]. The type of visualisation may need to be selected according to the amount of data to be displayed, to improve both display and performance. Visualisation can assist in the three major types of analytics: descriptive, predictive, and prescriptive. Many visualisation tools do not cover advanced aspects of analytics, but there has been an effort to explore visualisation to help with predictive and prescriptive analytics, using for instance sophisticated reports and storytelling [86]. A key aspect to be considered in visualisation and user interaction in the Cloud is that the network is still a bottleneck in several scenarios [123]. Users would ideally like to visualise data processed in the Cloud with the same experience and feel as though the data were processed locally. Some solutions have been tackling this requirement.

For example, as Fisher et al. [52] point out, many Cloud platforms available to process data analytics tasks still resemble the batch-job model used in the early times of the computing era. Users typically submit their jobs and wait until the execution is complete to download and analyse sample results to validate full runs. As this back and forth of data is not well supported by the Cloud, the authors issue a call to arms for both research and development of better interactive interfaces for Big Data analytics, where users iteratively pose queries and see rapid responses. Fisher et al. introduce sampleAction [53] to explore whether interactive techniques acting over only incremental samples can be considered sufficiently trustworthy by analysts to make closer-to-real-time decisions about their queries. Interviews with three teams of analysts suggest that representations of incremental query results were robust enough that analysts were prepared either to abandon a query, refine it, or formulate new queries. King [84] also highlights the importance of making the analytics process iterative, with multiple checkpoints for assessment and adjustment.

In this line, existing work aims to use the batch-job model provided by solutions such as MapReduce as a backend to features provided in interfaces with which users are more familiar. Trying to leverage the popularity of spreadsheets as a tool to manipulate data and perform analysis, Barga et al. proposed an Excel ribbon connected to Daytona [16], a Cloud service for data storage and analytics. Users manipulate data sets in Excel, and plugins use Microsoft's Azure infrastructure [26] to run MapReduce applications. In addition, as described earlier, several improvements have been proposed to MapReduce frameworks to handle interactive applications [23,30,100]. However, most of these solutions are not yet available for general use in the Cloud.

Several projects attempt to provide a range of visualisation methods from which users can select a set that suits their requirements. ManyEyes [129] from IBM allows users to upload their data, select a visualisation method, varying from basic to advanced, and publish their results. Users may also navigate through existing visualisations and discuss their findings and experience with peers. Selecting data sources automatically or semi-automatically is also an important feature to help users perform analytics. panXpan [104] is an example of a tool that automatically identifies the fields in structured data sets based on the analytics module the user selects. FusionCharts [56] is another tool that allows users to visually select a subset of the plotted data points to be submitted back to the server for further processing. CloudVista [35,135] is a software system that helps with visual data selection for further analysis refinement.

Existing work also provides means for users to aggregate data from multiple sources and employ various visualisation models, including dashboards, widgets, line and bar charts, and demographics, among other models [105,98,60,61,103]. Some of these features can be leveraged to perform several tasks, including creating reports; tracking which sections of a site are performing well and what kind of content creates a better user experience; determining how information sharing on a social network impacts web site usage; tracking mobile usage [14,127]; and evaluating the impact of advertising campaigns.

Choo and Park [37] argue that the reason why Big Data visualisation is not real time is the computational complexity of the analytics operations. In this direction, the authors discuss strategies to reduce the computational complexity of data analytics operations by, for instance, decreasing the precision of calculations.

Apart from software optimisation, dedicated hardware for visualisation is becoming key for Big Data analytics. For example, Reda et al. [109] discuss that, although existing tools are able to provide data belonging to a range of classes, their dimensionality and volume exceed the capacity of visualisation provided by standard displays. This requires the utilisation of large-scale visualisation environments, such as CyberCommons and CAVE2, which are composed of a large display wall with resolution three orders of magnitude higher than that achieved by commercial displays [109]. Remote visualisation systems, such as Nautilus from XSEDE (Extreme Science and Engineering Discovery Environment, the new NSF TeraGrid project replacement), are becoming more common to supply the high demand for memory and graphical processors of very large data visualisation [134].

Besides visualisation of raw data, summarised content in the form of reports is essential to perform predictive and prescriptive analytics. Several solutions have explored report generation and visualisation. For instance, SAP Crystal Solutions [116] provides BI functionalities via which customers can explore available data to build reports with interactive charts, what-if scenarios, and dashboards. The produced reports can be visualised on the Web, in e-mail, in Microsoft Office, or be embedded into enterprise applications. Another example of report visualisation is Cloud9 Analytics [38], which aims to automate reports and dashboards based on data from CRM and other systems. It provides features for sales reports, sales analytics, and sales forecasts and pipeline management. By exploring historical data and using the notion of risk, it offers customers clues on which projects they should invest their resources in and which projects or products require immediate action. Other companies also offer solutions that provide sales forecasts, change analytics, and customised reports [111,21]. Salesforce [114] supports customisable dashboards through collaborative analytics. The platform allows authorised users to share their charts and information with other users. Another trend in visualisation to help with predictive and prescriptive analytics is storytelling [86], which aims at presenting data with a narrative visualisation.
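The incremental-query idea raised by Fisher et al. can be illustrated with a small, self-contained sketch (illustrative only, not the sampleAction implementation): a query over a large data set is answered progressively, with an approximate confidence interval that narrows as more records are scanned, so an analyst can decide early whether to abandon, refine, or keep a long-running query.

```python
import math
import random

def progressive_mean(stream, batch_size=1000, z=1.96):
    """Yield (rows_seen, running_mean, ci_half_width) after each batch.

    The half-width is a normal-approximation confidence interval; it
    shrinks roughly as 1/sqrt(n), giving the analyst a trustworthiness
    signal long before the full scan completes.
    """
    n, total, total_sq, pending = 0, 0.0, 0.0, 0
    for x in stream:
        n += 1
        total += x
        total_sq += x * x
        pending += 1
        if pending == batch_size:
            mean = total / n
            var = max(total_sq / n - mean * mean, 0.0)
            yield n, mean, z * math.sqrt(var / n)
            pending = 0
    if pending:  # flush the final partial batch, if any
        mean = total / n
        var = max(total_sq / n - mean * mean, 0.0)
        yield n, mean, z * math.sqrt(var / n)

if __name__ == "__main__":
    random.seed(42)
    # Simulated large data set streamed from storage.
    data = (random.gauss(100.0, 15.0) for _ in range(50_000))
    for n, mean, half in progressive_mean(data, batch_size=10_000):
        print(f"after {n:>6} rows: mean = {mean:7.2f} +/- {half:.2f}")
```

The same pattern generalises to other distributive or algebraic aggregates (sums, counts, variances), which is one reason incremental sampling fits naturally over batch backends such as MapReduce.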
There are also visualisation tools specific to a given domain. For instance, in the field of climate modelling, Lee et al. [95] developed a tool for visualisation of the simulated Madden-Julian Oscillation, an important meteorological event that influences raining patterns from South America to Southeast Asia. The tool enables tracking of the event and its visualisation using Google Earth. In the area of computer network management, Liao et al. [96] evaluated five approaches for visualisation of anomalies in large-scale computer networks. Each method has its own applications depending on the specific type of anomaly to be visualised and the scale of the managed system. There are also solutions that provide means to visualise demographic information. Andrienko et al. [7] proposed an interactive visual display for analysis of the movement behaviour of people, vehicles, and animals. The visualisation tool displays the movement data, information about the time spent in a place, and the time interval from one place to another.

5.1. Open challenges

There are many research challenges in the field of Big Data visualisation. First, more efficient data processing techniques are required in order to enable real-time visualisation. Choo and Park [37] point out some techniques that can be employed with this objective, such as reducing the accuracy of results, coarse processing of data points compatible with the resolution of the visualisation device, reduced convergence, and data scale confinement. Methods considering each of these techniques could be further researched and improved.

Cost-effective devices for large-scale visualisation are another hot topic for analytics visualisation, as they enable finer resolution than simple screens. Visualisation for the management of computer networks and software analytics [101] is also an area that is attracting the attention of researchers and practitioners for its extreme relevance to the management of large-scale infrastructure (such as Clouds) and software, with implications for global software development, open source software development, and software quality improvements.

6. Business models and non-technical challenges

In addition to providing tools that customers can use to build their Big Data analytics solutions on the Cloud, models for delivering analytics capabilities as services on a Cloud have been discussed in previous work [120]. Sun et al. [119] provide an overview of the current state of the art on the development of customised analytics solutions on customers' premises and elaborate on some of the challenges to enabling analytics and analytics as a service on the Cloud. Some of the potential business models proposed in their work include:

- Hosting customer analytics jobs in a shared platform: suitable for an enterprise or organisation that has multiple analytics departments. Traditionally, these departments have to develop their own analytics solutions and maintain their own clusters. With a shared platform they can upload their solutions to execute on a shared infrastructure, therefore reducing operation and maintenance costs. As discussed beforehand, techniques have been proposed for resource allocation and scheduling of Big Data analytics tasks on the Cloud [93,108].
- A full stack designed to provide customers with end-to-end solutions: appropriate for companies that do not have expertise in analysis. In this model, analytical service providers publish domain-specific analytical stream templates as services. The provider is responsible for hosting the software stack and managing the resources necessary to perform the analyses. Customers who subscribe to the services just need to upload their data, configure the templates, receive models, and perform the proper model scoring.
- Expose analytics models as hosted services: analytics capabilities are hosted on the Cloud and exposed to customers as services. This model is proposed for companies that do not have enough data to make good predictions. Providers upload their models, which are consumed by customers via scoring services provided by the Cloud.

To make Big Data analytics solutions more affordable, Sun et al. [119] also propose cost-effective approaches that enable multi-tenancy at several levels. They discuss the technical challenges of isolating analytical artefacts. Hsueh et al. [73] discuss issues related to pricing and Service Level Agreements (SLAs) on a platform for personalisation in a wellness management ecosystem built atop a Cloud infrastructure. Krishna and Varma [87] envision two types of services for Cloud analytics: (i) Analytics as a Service (AaaS), where analytics is provided to clients on demand and they can pick the solutions required for their purposes; and (ii) Model as a Service (MaaS), where models are offered as building blocks for analytics solutions.

Bhattacharya et al. [18] introduced IVOCA, a solution for providing managed analytics services for CRM. IVOCA provides functionalities that help analysts better explore data analysis tools to reduce the time to insight and improve the repeatability of CRM analytics. Also in the CRM realm, KXEN [89] offers a range of products for performing analytics, some of which can run on the Cloud. Cloud Prediction is a predictive analytics solution for Salesforce.com. With its Predictive Lead Scoring, Predictive Offers, and Churn Prediction, customers can leverage the CRM, mobile, and social data available in the Cloud to score leads based on which ones can create sales opportunities; create offers that have a higher likelihood of being accepted, based on a prediction of offers and promotions; and gain insights into which customers a company is at risk of losing.

Cloud-enabled Big Data analytics poses several challenges with respect to the replicability of analyses. When not delivered by a Cloud, analytics solutions are customer-specific, and models often have to be updated to consider new data. Cloud solutions for analytics need to balance generality and usefulness. Previous work also discusses the difficulty of replicating activities of text analytics [107]. An analytical pathway is proposed to link business objectives to an analytical flow, with the goal of establishing a methodology that illustrates and possibly supports the repeatability of analytical processes when using complex analytics. King [84], whilst discussing some of the problems in buying predictive analytics, provides a best-practice framework based on five steps, namely training, assessment, strategy, implementation, and iteration.

Chen et al. [34] envision an analytics ecosystem where data services aggregate, integrate, and provide access to public and private data by enabling partnerships among data providers, integrators, aggregators, and clients; these services are termed DaaS. Atop DaaS, a range of analytics functionalities that explore the data services are offered to customers to boost productivity and create value. This layer is viewed as AaaS. Similar to the previously described work, they discuss a set of possible business models that range from proprietary, where both data and models are kept private, to co-developing models, where both data and analytics models are shared among the parties involved in the development of the analytics strategy or services.

7. Other challenges

In business models where high-level analytics services may be delivered by the Cloud, human expertise cannot be easily replaced by machine learning and Big Data analysis [99]; in certain scenarios, there may be a need for human analysts to remain in the loop [91]. Management should adapt to Big Data scenarios and deal with challenges such as how to assist human analysts in gaining insights and how to explore methods that can help managers make quicker decisions.

Application profiling is often necessary to estimate the costs of running analytics on a Cloud platform. Users need to develop their
applications to target Cloud platforms, an effort that should be carried out only after estimating the costs of transferring data to the Cloud, allocating virtual machines, and running the analysis. This cost estimation is not a trivial task to perform in current Cloud offerings. Although best practices for using some data processing services are available [49], there should be tools that assist customers in estimating the costs and risks of performing analytics on the Cloud.

Data ingestion by Cloud solutions is often a weak point, whereas debugging and validation of developed solutions is a challenging and tedious process. As discussed earlier, the manner in which analytics is executed on Cloud platforms resembles the batch-job scenario: users submit a job, wait until tasks are executed, and then download the results. Once an analysis is complete, they download sample results that are enough to validate the analysis task and after that perform further analysis. Current Cloud environments lack this interactive process, and techniques should be developed to facilitate interactivity and to include analysts in the loop by providing means to reduce their time to insight. Systems and techniques that iteratively refine answers to queries and give users more control of processing are desired [70].

Furthermore, market research shows that inadequate staffing and skills, lack of business support, and problems with analytics software are some of the barriers faced by corporations when performing analytics [112]. These issues can be exacerbated by the Cloud, as the resources and analysts involved in certain analytics tasks may be offered by a Cloud provider and may move from one customer engagement to another. In addition, based on survey responses, most analytics updates and scoring of methods currently occur daily to annually, which can become an issue for analytics on streaming data. Russom [112] also highlights the importance of advanced data visualisation techniques and advanced analytics, such as analysis of unstructured, large data sets and streams, to organisations in the next few years.

Chen et al. [32] foresee the emergence of what they term Business Intelligence and Analytics (BI&A) 3.0, which will require underlying mobile analytics and location- and context-aware techniques for collecting, processing, analysing, and visualising large-scale mobile and sensor data. Many of these tools are still to be developed. Moreover, moving to BI&A 3.0 will demand efforts on integrating data from multiple sources to be processed by Cloud resources, and on using the Cloud to assist decisions by mobile device users.

More recently, terms such as Analytics as a Service (AaaS) and Big Data as a Service (BDaaS) are becoming popular. They comprise services for data analysis in a similar way as IaaS offers computing resources. However, these analytics services still lack well-defined contracts, since it may be difficult to measure the quality and reliability of results and input data, provide promises on execution times, and give guarantees on the methods and experts responsible for analysing the data. Therefore, there are fundamental gaps in tools to assist service providers and clients in performing these tasks and in facilitating the definition of contracts for both parties.

8. Summary and conclusions

The amount of data currently generated by the various activities of society has never been so big, and it is being generated at an ever-increasing speed. This Big Data trend is seen by industries as a way of obtaining advantage over their competitors: if a business is able to make sense of the information contained in the data reasonably quicker than its rivals, it will be able to get more customers, increase the revenue per customer, optimise its operation, and reduce its costs. Nevertheless, Big Data analytics is still a challenging and time-demanding task that requires expensive software, large computational infrastructure, and effort.

Cloud computing helps to alleviate these problems by providing resources on demand with costs proportional to the actual usage. Furthermore, it enables infrastructures to be scaled up and down rapidly, adapting the system to the actual demand.

Although Cloud infrastructure offers such elastic capacity to supply computational resources on demand, the area of Cloud-supported analytics is still in its early days. In this paper, we discussed the key stages of analytics workflows and surveyed the state of the art of each stage in the context of Cloud-supported analytics. Surveyed work was classified in three key groups: Data Management (which encompasses data variety, data storage, data integration solutions, and data processing and resource management), Model Building and Scoring, and Visualisation and User Interaction. For each of these areas, ongoing work was analysed and key open challenges were discussed. The survey concluded with an analysis of business models for Cloud-assisted data analytics and other non-technical challenges.

The area of Big Data computing using Cloud resources is moving fast, and after surveying the current solutions we identified some key lessons:

- There are plenty of solutions for Big Data related to Cloud computing. Such a large number of solutions have been created because of the wide range of analytics requirements, but they may sometimes overwhelm non-experienced users. Analytics can be descriptive, predictive, or prescriptive; Big Data can have various levels of variety, velocity, volume, and veracity. Therefore, it is important to understand the requirements in order to choose appropriate Big Data tools;
- It is also clear that analytics is a complex process that demands people with expertise in cleaning up data, understanding and selecting proper methods, and analysing results. Tools are fundamental to help people perform these tasks. In addition, depending on the complexity and costs involved in carrying out these tasks, providers who offer Analytics as a Service or Big Data as a Service can be a promising alternative to performing these tasks in-house;
- Cloud computing plays a key role for Big Data, not only because it provides infrastructure and tools, but also because it is a business model that Big Data analytics can follow (e.g. Analytics as a Service (AaaS) or Big Data as a Service (BDaaS)). However, AaaS/BDaaS brings several challenges, because the customer's and the provider's staff are much more involved in the loop than in traditional Cloud providers offering infrastructure/platform/software as a service.

Recurrent themes among the observed future work include (i) the development of standards and APIs enabling users to easily switch among solutions and (ii) the ability to get the most out of the elastic capacity of the Cloud infrastructure. The latter includes expressive languages that enable users to describe a problem in simple terms whilst decomposing such a high-level description into highly concurrent subtasks and keeping good performance efficiency even for large numbers of computing resources. If this can be achieved, the only limitations to an arbitrarily short processing time would be market issues, namely the relation between the cost of running the analytics and the financial return brought by the obtained knowledge.

References

[1] D.J. Abadi, Data management in the cloud: Limitations and opportunities, IEEE Data Engineering Bulletin 32 (1) (2009) 3–12.
[2] Amazon Redshift, http://aws.amazon.com/redshift/.
[3] Amazon Data Pipeline, http://aws.amazon.com/datapipeline/.
[4] Amazon Elastic MapReduce (EMR), http://aws.amazon.com/elasticmapreduce/.
[5] Amazon Kinesis, http://aws.amazon.com/kinesis/developer-resources/.
[6] R. Ananthanarayanan, K. Gupta, P. Pandey, H. Pucha, P. Sarkar, M. Shah, R. Tewari, Cloud Analytics: Do We Really Need to Reinvent the Storage Stack?, in: Proceedings of the Conference on Hot Topics in Cloud Computing (HotCloud 2009), USENIX Association, Berkeley, USA, 2009.
[7] G. Andrienko, N. Andrienko, S. Wrobel, Visual analytics tools for analysis of movement data, SIGKDD Explor. Newsl. 9 (2) (2007) 38–46.
[8] Announcing Suro: Backbone of Netflix's Data Pipeline, http://techblog.netflix.com/2013/12/announcing-suro-backbone-of-netflixs.html.
[9] Apache S4: distributed stream computing platform, http://incubator.apache.org/s4/.
M.D. Assuno et al. / J. Parallel Distrib. Comput. 7980 (2015) 315 13
[72] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F.B. Cetin, S. Babu, Starfish: a self-tuning system for big data analytics, in: Proceedings of the 5th Biennial Conference on Innovative Data Systems Research (CIDR 2011), 2011, pp. 261–272.
[73] P.-Y.S. Hsueh, R.J. Lin, M.J. Hsiao, L. Zeng, S. Ramakrishnan, H. Chang, Cloud-based platform for personalization in a wellness management ecosystem: why, what, and how, in: Proceedings of the 6th International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom 2010), 2010, pp. 1–8.
[74] B. Huang, S. Babu, J. Yang, Cumulon: optimizing statistical data analysis in the cloud, in: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2013), ACM, New York, USA, 2013, pp. 1–12.
[75] IBM InfoSphere Streams, http://www.ibm.com/software/products/en/infosphere-streams.
[76] IBM SmartCloud Enterprise, http://www-935.ibm.com/services/us/en/cloud-enterprise/ (2012).
[77] Infochimps cloud overview, http://www.infochimps.com/infochimps-cloud/overview/.
[78] A. Iosup, A. Lascateu, N. Tapus, CAMEO: enabling social networks for massively multiplayer online games through continuous analytics and cloud computing, in: Proceedings of the 9th Annual Workshop on Network and Systems Support for Games (NetGames 2010), 2010, pp. 1–6.
[79] D. Jensen, K. Konkel, A. Mohindra, F. Naccarati, E. Sam, Business analytics in the cloud, White paper IBW03004-USEN-00, IBM (April 2012).
[80] G. Jung, N. Gnanasambandam, T. Mukherjee, Synchronous parallel processing of big-data analytics services to optimize performance in federated clouds, in: Proceedings of the IEEE 5th International Conference on Cloud Computing (Cloud 2012), 2012, pp. 811–818.
[81] H. Kasim, T. Hung, E.F.T. Legara, K.K. Lee, X. Li, B.-S. Lee, V. Selvam, S. Lu, L. Wang, C. Monterola, V. Jayaraman, Scalable complex system modeling for sustainable city, in: The 6th IEEE International Scalable Computing Challenge (SCALE 2013), in conjunction with the 13th International Symposium on Cluster, Cloud and the Grid (CCGrid 2013), 2013.
[82] D.S. Katz, S. Jha, M. Parashar, O. Rana, J.B. Weissman, Survey and analysis of production distributed computing infrastructures, CoRR abs/1208.2649.
[83] H. Kim, S. Chaudhari, M. Parashar, C. Marty, Online risk analytics on the cloud, in: Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2009), IEEE Computer Society, Washington, USA, 2009, pp. 484–489.
[84] E.A. King, How to buy data mining: a framework for avoiding costly project pitfalls in predictive analytics, DMReview 15 (10).
[85] J. Kobielus, In-database analytics: the heart of the predictive enterprise, Technical report, Forrester Research, Inc., Cambridge, USA (Nov. 2009).
[86] R. Kosara, J. Mackinlay, Storytelling: the next step for visualization, Computer 46 (5) (2013) 44–50.
[87] P.R. Krishna, K.I. Varma, Cloud analytics: a path towards next generation affordable BI, White paper, Infosys (2012).
[88] A. Kumar, F. Niu, C. Ré, Hazy: making it easier to build and maintain big-data analytics, Communications of the ACM 56 (3) (2013) 40–49.
[89] KXEN, http://www.kxen.com.
[90] J.K. Laurila, D. Gatica-Perez, I. Aad, J. Blom, O. Bornet, T.-M.-T. Do, O. Dousse, J. Eberle, M. Miettinen, The Mobile Data Challenge: big data for mobile computing research (2012). URL: http://research.nokia.com/files/public/MDC2012_Overview_LaurilaGaticaPerezEtAl.pdf.
[91] D. Lazer, R. Kennedy, G. King, A. Vespignani, The parable of Google Flu: traps in big data analysis, Science 343 (2014) 1203–1205.
[92] N. Leavitt, Will NoSQL databases live up to their promise? Computer 43 (2) (2010) 12–14.
[93] G. Lee, B.-G. Chun, R.H. Katz, Heterogeneity-aware resource allocation and scheduling in the cloud, in: Proceedings of the 3rd USENIX Conference on Hot Topics in Cloud Computing (HotCloud 2011), USENIX Association, Berkeley, USA, 2011.
[94] K.-H. Lee, Y.-J. Lee, H. Choi, Y.D. Chung, B. Moon, Parallel data processing with MapReduce: a survey, SIGMOD Record 40 (4) (2011) 11–20.
[95] T.-Y. Lee, X. Tong, H.-W. Shen, P.C. Wong, S. Hagos, L.R. Leung, Feature tracking and visualization of the Madden–Julian Oscillation in climate simulation, IEEE Computer Graphics and Applications 33 (4) (2013) 29–37.
[96] Q. Liao, L. Shi, C. Wang, Visual analysis of large-scale network anomalies, IBM J. Res. Dev. 57 (3/4) (2013) 13:1–13:12.
[97] X. Li, R.N. Calheiros, S. Lu, L. Wang, H. Palit, Q. Zheng, R. Buyya, Design and development of an adaptive workflow-enabled spatial-temporal analytics framework, in: Proceedings of the IEEE 18th International Conference on Parallel and Distributed Systems (ICPADS 2012), IEEE Computer Society, Singapore, 2012, pp. 862–867.
[98] S. Lu, R.M. Li, W.C. Tjhi, K.K. Lee, L. Wang, X. Li, D. Ma, A framework for cloud-based large-scale data analytics and visualization: case study on multiscale climate data, in: Proceedings of the IEEE 3rd International Conference on Cloud Computing Technology and Science (CloudCom 2011), IEEE Computer Society, Washington, USA, 2011, pp. 618–622.
[99] A. McAfee, E. Brynjolfsson, Big data: the management revolution, Harv. Bus. Rev. (2012) 60–68.
[100] S. Melnik, A. Gubarev, J.J. Long, G. Romer, S. Shivakumar, M. Tolton, T. Vassilakis, Dremel: interactive analysis of web-scale datasets, Proceedings of the VLDB Endowment 3 (1–2) (2010) 330–339.
[101] T. Menzies, T. Zimmermann, Software analytics: so what? IEEE Software 30 (4) (2013) 31–37.
[102] L. Moreau, P. Groth, S. Miles, J. Vazquez-Salceda, J. Ibbotson, S. Jiang, S. Munroe, O. Rana, A. Schreiber, V. Tan, L. Varga, The provenance of electronic data, Communications of the ACM 51 (4) (2008) 52–58.
[103] S. Murray, Interactive Data Visualization for the Web, O'Reilly Media, 2013.
[104] panXpan, https://www.panxpan.com.
[105] PivotLink AnalyticsCLOUD, http://www.pivotlink.com/products/analyticscloud.
[106] Prime Minister joins Sir Ka-shing Li for launch of £90m initiative in big data and drug discovery at Oxford, http://www.cs.ox.ac.uk/news/639-full.html (May 2013).
[107] L. Proctor, C.A. Kieliszewski, A. Hochstein, S. Spangler, Analytical pathway methodology: simplifying business intelligence consulting, in: Proceedings of the Annual SRII Global Conference (SRII 2011), 2011, pp. 495–500.
[108] M. Rahman, X. Li, H. Palit, Hybrid heuristic for scheduling data analytics workflow applications in hybrid cloud environment, in: Proceedings of the IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011, pp. 966–974.
[109] K. Reda, A. Febretti, A. Knoll, J. Aurisano, J. Leigh, A. Johnson, M.E. Papka, M. Hereld, Visualizing large, heterogeneous data in hybrid-reality environments, IEEE Computer Graphics and Applications 33 (4) (2013) 38–48.
[110] T.C. Redman, Data Quality for the Information Age, Artech House, 1997.
[111] Right90, http://www.right90.com.
[112] P. Russom, Big data analytics, TDWI best practices report, The Data Warehousing Institute (TDWI) Research (2011).
[113] S. Sakr, A. Liu, D. Batista, M. Alomari, A survey of large scale data management approaches in cloud environments, IEEE Communications Surveys & Tutorials 13 (3) (2011) 311–336.
[114] SalesForce, http://www.salesforce.com.
[115] SAP HANA One, http://www.saphana.com/community/solutions/cloud-info (2013).
[116] SAP Crystal Solutions, http://www.crystalreports.com/.
[117] F. Schmuck, R. Haskin, GPFS: a shared-disk file system for large computing clusters, in: Proceedings of the 1st Conference on File and Storage Technologies (FAST '02), Monterey, USA, 2002, pp. 231–244.
[118] F. Schomm, F. Stahl, G. Vossen, Marketplaces for data: an initial survey, SIGMOD Record 42 (1) (2013) 15–26.
[119] X. Sun, B. Gao, L. Fan, W. An, A cost-effective approach to delivering analytics as a service, in: Proceedings of the 19th IEEE International Conference on Web Services (ICWS 2012), Honolulu, USA, 2012, pp. 512–519.
[120] X. Sun, B. Gao, Y. Zhang, W. An, H. Cao, C. Guo, W. Sun, Towards delivering analytical solutions in cloud: business models and technical challenges, in: Proceedings of the IEEE 8th International Conference on e-Business Engineering (ICEBE 2011), IEEE Computer Society, Washington, USA, 2011, pp. 347–351.
[121] Storm: distributed and fault-tolerant realtime computation, http://storm.incubator.apache.org.
[122] The Intel Science and Technology Center for Big Data, http://istc-bigdata.org.
[123] Tabor Communications, Inc., The UberCloud HPC Experiment: compendium of case studies, Tech. rep. (2013).
[124] W. Tantisiriroj, S.W. Son, S. Patil, S.J. Lang, G. Gibson, R.B. Ross, On the duality of data-intensive file system design: reconciling HDFS and PVFS, in: Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2011), ACM, New York, NY, USA, 2011, pp. 67:1–67:12.
[125] A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J.S. Sarma, R. Murthy, H. Liu, Data warehousing and analytics infrastructure at Facebook, in: Proceedings of the 2010 International Conference on Management of Data, ACM, New York, NY, USA, 2010, pp. 1013–1020.
[126] S.R. Upadhyaya, Parallel approaches to machine learning – a comprehensive survey, Journal of Parallel and Distributed Computing 73 (3) (2013) 284–292.
[127] Unlocking game-changing wireless capabilities: Cisco and SITA help Copenhagen Airport develop new services for transforming the passenger experience, Customer case study, CISCO (2012). URL: http://www.cisco.com/en/US/prod/collateral/wireless/c36_696714_00_copenhagen_airport_cs.pdf.
[128] S. Venugopal, R. Buyya, K. Ramamohanarao, A taxonomy of data grids for distributed data sharing, management and processing, ACM Comput. Surv. 38 (1) (2006) 1–53.
[129] F.B. Viegas, M. Wattenberg, F. van Ham, J. Kriss, M. McKeon, ManyEyes: a site for visualization at internet scale, IEEE Trans. Vis. Comput. Graphics 13 (6) (2007) 1121–1128.
[130] H. Wang, Integrity verification of cloud-hosted data analytics computations, in: Proceedings of the 1st International Workshop on Cloud Intelligence (Cloud-I 2012), ACM, New York, NY, USA, 2012.
[131] C. Wang, K. Schwan, V. Talwar, G. Eisenhauer, L. Hu, M. Wolf, A flexible architecture integrating monitoring and analytics for managing large-scale data centers, in: Proceedings of the 8th ACM International Conference on Autonomic Computing (ICAC 2011), ACM, New York, USA, 2011, pp. 141–150.
[132] Windows Azure HDInsight, http://www.windowsazure.com/en-us/documentation/services/hdinsight/.
[133] I.H. Witten, E. Frank, M.A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, third ed., Morgan Kaufmann, 2011.
[134] XSEDE, http://www.xsede.org/.
[135] H. Xu, Z. Li, S. Guo, K. Chen, CloudVista: interactive and economical visual cluster analysis for big data in the cloud, Proceedings of the VLDB Endowment 5 (12) (2012) 1886–1889.
[136] P.S. Yu, On mining big data, in: J. Wang, H. Xiong, Y. Ishikawa, J. Xu, J. Zhou (Eds.), Web-Age Information Management, in: Lecture Notes in Computer Science, vol. 7923, Springer-Verlag, Berlin, Heidelberg, 2013, p. XIV.
Dr. Rodrigo N. Calheiros is a Research Fellow in the Department of Computing and Information Systems, the University of Melbourne, Australia. Since 2010, he has been a member of the CLOUDS Lab of the University of Melbourne, where he researches various aspects of cloud computing. He has worked in the field of Cloud computing since 2008. His research interests also include virtualization, grid computing, and simulation and emulation of distributed systems.

Dr. Rajkumar Buyya is Professor of Computer Science and Software Engineering and Director of the Cloud Computing and Distributed Systems (CLOUDS) Laboratory at the University of Melbourne, Australia. He is also the founding CEO of Manjrasoft, a spin-off company of the University, commercialising its innovations in Cloud Computing. He has authored 400 publications and four textbooks. He is one of the most highly cited authors in computer science and software engineering worldwide.