Big Data Analytics For Wireless and Wired Network Design: A Survey

Mohammed S. Hadi 1,*, Ahmed Q. Lawey 1, Taisir E. H. El-Gorashi 1 and Jaafar M. H. Elmirghani 1
1 School of Electronic and Electrical Engineering, University of Leeds, United Kingdom

* Corresponding author.
E-mail addresses: elmsha@leeds.ac.uk (M. Hadi), a.q.lawey@leeds.ac.uk (A. Q. Lawey),
T.E.H.Elgorashi@leeds.ac.uk (T. E. H. El-Gorashi), J.M.H.Elmirghani@leeds.ac.uk (J. M. H. Elmirghani)

Abstract—Currently, the world is witnessing a mounting avalanche of data due to the increasing
number of mobile network subscribers, Internet websites, and online services. This trend is
continuing to develop in a quick and diverse manner in the form of big data. Big data analytics
can process large amounts of raw data and extract useful, smaller-sized information, which can
be used by different parties to make reliable decisions.
In this paper, we conduct a survey on the role that big data analytics can play in the design of
data communication networks. Integrating the latest advances that employ big data analytics
with the networks’ control/traffic layers might be the best way to build robust data
communication networks with refined performance and intelligent features. First, the survey
starts with the introduction of the big data basic concepts, framework, and characteristics.
Second, we illustrate the main network design cycle employing big data analytics. This cycle
represents the umbrella concept that unifies the surveyed topics. Third, there is a detailed
review of the current academic and industrial efforts toward network design using big data
analytics. Fourth, we identify the challenges confronting the utilization of big data analytics
in network design. Finally, we highlight several future research directions. To the best of our
knowledge, this is the first survey that addresses the use of big data analytics techniques for
the design of a broad range of networks.

Index Terms—Big data analytics, network design, self-optimization, self-configuration,
self-healing network.

1. Introduction
Networks generate traffic in rapid, large, and diverse ways, which leads to an estimate of 2.5
exabytes created per day [1]. There are many contributors to the increasing size of the data.
For instance, scientific experiments can generate lots of data, such as CERN’s Large Hadron
Collider (LHC) that generates over 40 petabytes each year [2]. Social media also has its share,
with over 1 billion users, spending an average of 2.5 hours daily, liking, tweeting, posting, and
sharing their interests on Facebook and Twitter [3]. It is without a doubt that using this
activity-generated data can affect many aspects, such as intelligence, e-commerce, biomedical,
and data communication network design. However, harnessing the powers of this data is not an
easy task. To accommodate the data explosion, data centers are being built with massive storage
and processing capabilities, an example of which is the National Security Agency (NSA) Utah data
centre that can store up to 1 yottabyte of data [4], and with a processing power that exceeds
100 petaflops [5]. Due to the increasing need to scale up databases to data volumes that
exceeded processing and/or storage capabilities, systems that ran on computer clusters started
to emerge. Perhaps the first milestone took place in June 1986 when Teradata [6] used the first
parallel database system (hardware and software), with one terabyte storage capacity, in the
Kmart data warehouse to have all their business data saved and available for relational queries
and business analysis [7, 8]. Other examples include the Gamma system of the University of
Wisconsin [9] and the GRACE system of the University of Tokyo [10].
In light of the above, the term “Big Data” emerged, and it can be defined as high-volume,
high-velocity, and high-variety data that provides substantial opportunities for cost-effective
decision-making and enhanced insight through advanced processing which extracts information and
knowledge from data [11]. Another way to define big data is by saying it is the amount of data
that is beyond traditional technology capabilities to store, manage, and process in an efficient
and easy way [12]. Big data is already being employed by digital-born companies like Google and
Amazon to help these companies with data-driven decisions [13]. It also helps in the development
of smart cities and campuses [14], as well as in other fields like agriculture, healthcare,
finance [15], and transportation [16]. Big data has the following characteristics:
1- Volume: This is a representation of the data size [17].
2- Variety: Generating data from a variety of sources results in a range of data types. These
   data types can be structured (e.g. e-mails), semi-structured (e.g. log file data from a
   webpage), unstructured (e.g. customer feedback), or hybrid [18].
3- Velocity: This is an indication of the speed of the data when being generated, streamed, and
   aggregated [19]. It can also refer to the speed at which the data has to be analyzed to
   maintain relevance [17].
Depending on the research area and the problem space, other terms or Vs can be added. For
example, is this data of any value? How long can we consider this data accurate and valid?
Since we are conducting a survey, we find it compelling to briefly introduce other Vs as well.
Typically, a single paper analyzes 3 to 7 Vs (e.g. 6V+C [20], where C represents complexity).
However, different papers analyze different sets of Vs, and the union of all the Vs analyzed
across the surveyed papers is 8V and a C, as shown in Table 1.
4- Value: This is a measure of data usefulness when it comes to decision making [19], or how
   much added value is brought by the collected data to the intended process, activity, or
   predictive analysis/hypothesis [21].
5- Veracity: Refers to the authenticity and trustworthiness of the collected data against
   unauthorized access and manipulation [21, 22].
6- Volatility: An indication of the period in which the data can still be regarded as valid and
   for how long that data should be kept and stored [23].
7- Validity: This might appear similar to veracity; however, the difference is that validity
   deals with data accuracy and correctness regarding the intended usage. Thus, certain data
   might be valid for one application but invalid for another.
8- Variability: This refers to the inconsistency of the data. This is due to the high number of
   distributed autonomous data sources [24]. Other researchers refer to the variability as the
   consistency of the data over time [22].
9- Complexity: A measure of the degree of interdependence and inter-connectedness in big data
   [20]. That is, a system may witness a substantial, low, or no effect due to very small
   changes that ripple across the system [19]. Also, complexity can be considered in terms of
   relationship, correlation, and connectivity of data. It can further manifest in terms of
   multiple data linkages and hierarchies. Complexity and its mentioned attributes can, however,
   help better organize big data. It should be noted that complexity was included among the big
   data attributes (Vs) in [20], where big data was characterized as having 6V + complexity.
   This is how we will arrange it in Table 1.
The process of extracting hidden, valuable patterns and useful information from big data is
called big data analytics [44]. This is done through applying advanced analytics techniques on
large data sets [28]. Before commencing the analytics process, data sets may comprise certain
consistency and redundancy problems affecting their quality. These problems arise due to the
diverse sources from which the data originated. Data pre-processing techniques are used to
address these problems. The techniques include integration, cleansing (or cleaning), and
redundancy elimination, and they were discussed by the authors in [39].
Big data analytics can be carried out using a number of frameworks (shown below) that usually
require an upgradeable cluster dedicated solely for that purpose [17]. Although the cluster can
be formed using a number of commodity servers [45], this still forms an impediment for
limited-budget users who want to analyze their data. The solution is presented through the
democratization of computing. This made it possible for companies and business owners of any
size to analyze their data using cloud computing platforms for big data analytics. Consequently,
the use of big data analytics is not limited to enterprise-level companies. Furthermore,
business owners do not have to invest heavily in expensive hardware dedicated to analyzing their
data [1]. Amazon is one of the companies that provide ‘cloud-computed’ big data analytics for
its customers. The service is called Amazon EMR (Elastic MapReduce), and it enables users to
process their data in the cloud at a considerably lower cost in a pay-as-you-use fashion. The
user is able to shrink or expand the size of the computing clusters to control the data volume
handled and the response time [1, 46].
Dealing with big amounts of data is not an easy task, especially if there is a certain goal in
mind. Since data arrives quickly, it is vital to provide fast collection, sorting, and
processing speeds. Apache Hadoop was created by Doug Cutting [47] for this purpose. It was later
adopted, developed, and released by Yahoo [48]. Apache Hadoop can be defined as a top-level,
Java-written, open source framework. It utilizes clusters of commodity hardware [49].
Hadoop V1.x (shown in Fig. 1) consists of two parts: a storage part, the Hadoop Distributed File
System (HDFS), and a data processing and management part (MapReduce). The master node has two
processes, a Job Tracker that manages the processing tasks and a Name Node that manages the
storage tasks [50].
When a Job Tracker takes job requests, it splits the accepted job into tasks and pushes them to
the Task Trackers located in the slave nodes [51]. The Name Node resembles the master part,
while the Data Nodes represent the slave part [12]. There is more explanation in the HDFS part
below.
Many projects were developed in a quest to either complement or replace the above parts, and not
all projects are hosted by the Apache Software Foundation, which is the reason for the emergence
of the term Hadoop ecosystem [47].
Table 1: Various big data dimensions.

Dimensions (characteristics): Volume, Velocity, Variety, Veracity, Value, Variability,
Volatility, Validity, Complexity

No. of Vs   References              Dimensions marked
3Vs         [25-31]                 √ √ √
4Vs         [4, 32-34]              √ √ √ √
4Vs         [35-39]                 √ √ √ √
5Vs         [3, 11, 21, 40, 41]     √ √ √ √ √
6Vs         [20, 22, 24, 42]        √ √ √ √ √ √ √
7Vs         [23, 43]                √ √ √ √ √ √ √
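
The pre-processing techniques named in this section (integration, cleansing, and redundancy
elimination) can be illustrated with a minimal Python sketch. It is not taken from [39]; the
record layout, the field names, and the function names are hypothetical and chosen only for
illustration. Records are assumed to be small dictionaries of strings.

```python
from typing import Dict, Iterable, List, Optional, Sequence

Record = Dict[str, Optional[str]]  # hypothetical record layout: field name -> raw value


def integrate(sources: Iterable[Iterable[Record]]) -> List[Record]:
    """Integration: merge the records of several sources into one list."""
    merged: List[Record] = []
    for source in sources:
        merged.extend(dict(record) for record in source)
    return merged


def cleanse(records: Iterable[Record], required: Sequence[str]) -> List[Record]:
    """Cleansing: trim string values and drop records missing a required field."""
    cleaned: List[Record] = []
    for record in records:
        tidy = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
        if all(tidy.get(field) not in (None, "") for field in required):
            cleaned.append(tidy)
    return cleaned


def eliminate_redundancy(records: Iterable[Record], key_fields: Sequence[str]) -> List[Record]:
    """Redundancy elimination: keep the first record seen for each key."""
    seen = set()
    unique: List[Record] = []
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def preprocess(sources: Iterable[Iterable[Record]], key_fields: Sequence[str]) -> List[Record]:
    """Run the three steps in order: integrate, cleanse, de-duplicate."""
    return eliminate_redundancy(cleanse(integrate(sources), key_fields), key_fields)
```

Calling `preprocess([source_a, source_b], ["cell_id", "kpi"])` on two overlapping sources would
return one cleaned record per distinct (cell_id, kpi) pair. In a real deployment each step would
run on the cluster rather than in a single process.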
Fig. 1. Hadoop V1.x architecture.

Hadoop V2.x is viewed as a three-layered model. These layers are classified as storage,
processing, and management, as shown in Fig. 2. The current Hadoop project has four components
(modules), which are MapReduce, the HDFS, Yet Another Resource Negotiator (YARN), and Common
utilities [17].

Fig. 2. Hadoop V2.x architecture.

1- MapReduce: As a programming model, MapReduce is used as a data processing engine and for
   cluster resource management. With the emergence of Hadoop v2.0, the resource management task
   became YARN’s responsibility [17]. WordCount is an example illustrating how MapReduce works.
   As the name implies, it calculates the number of times a specific word is repeated within a
   document. Tuples ⟨𝑤, 1⟩ are produced by the map function, where 𝑤 and 1 represent the word
   and the times it appeared in the document, respectively. The reduce function groups the
   tuples that share the same word and sums their occurrences to reach the concluding result
   [61].
2- HDFS: HDFS represents the storage file-system component in the Hadoop ecosystem. Its main
   feature is to store huge amounts of data over multiple nodes and stream those data sets to
   user applications at high bandwidth. Large files are split into smaller 128 MB blocks, with
   three copies of each block of data to achieve fault tolerance in the case of disk failure
   [17, 52, 53].
3- YARN: YARN was introduced in Hadoop version 2.0, and it simply took over the tasks of cluster
   resource management from MapReduce and separated it from the programming model, thus making
   a more generalized Hadoop capable of selecting programming models, like Spark [54], Storm
   [55], and Dryad [56, 57].
4- Common utilities: To operate Hadoop’s sub-projects or modules, a set of common utilities or
   components is needed. Shared libraries support operations like error detection, Java
   implementation for compression codes, and I/O utilities [17, 58].
Over the last few years, researchers in telecommunication networks started to consider big data
analytics in their design toolbox. Characterized by hundreds of tunable parameters, wireless
network design informed by big data analytics received most of the attention; however, other
types of networks received increasing attention as well.
The vast amount of data that can be collected from the networks, along with the distributed
modern high-performance computing platforms, can lead to a new cost-effective design space (e.g.
reducing total cost of ownership by employing dynamic Virtual Network Topology adaptation) when
compared to classical approaches (i.e. static Virtual Network Topologies) [59]. This new
paradigm is promising to convert networks from being sightless tubes for data into insightful
context-aware networks. Our contributions in this paper are as follows:
1- We show in this paper the role big data analytics can play in wireless and wired network
   design.
2- The above role is corroborated through the illustration of case studies in Section 2.
3- The significance of this paper lies in helping academic researchers save much effort by
   understanding the state-of-the-art and identifying the opportunities, as well as the
   challenges facing the use of big data analytics in network design.
4- In addition to academic approaches, we surveyed network equipment manufacturing companies,
   highlighting network solutions based on big data analytics. We also identified the common
   areas of interest among these solutions, and thus this survey can benefit both academic and
   industrial-oriented readers.
5- This paper provides insights on potential research directions as illustrated in Section 9.

This paper is organized as follows: Section 2 presents several case studies that use big data
analytics in wireless and wired networks. Sections 3-6 illustrate the research conducted in the
direction of employing big data analytics in the fields of cellular, SDN & intra-data center,
optical networks, and network security, respectively. Section 7 summarizes some of
the main big data-based network solutions offered by industry. Section 8 discusses the network
design cycle based on big data analytics and highlights the challenges encountered in big
data-powered network design. In Section 9 we propose open directions for future research.
Finally, the paper ends with the conclusions in Section 10.

2. Case studies of the use of big data analytics for wireless and wired networks.

2.1 Detection of sleeping cells in 5G SON
A wireless cell may cease to provide service with no alarm triggered at the Operation and
Maintenance Center (OMC) side. Such cells are referred to as sleeping cells in self-organizing
networks (SON). The authors in [60] tackled this problem and presented a case study on the
identification of the Sleeping Cells (SC). The simulation scenario comprised 27 macro sites,
each with three sectors. The user equipment (UE) is configured to send radio measurement and
cell identification data of the serving and neighboring cells to the base station, in addition
to event-based measurements. The above-mentioned measurements are sent periodically (i.e. every
240 ms). The simulation considered two scenarios: a reference scenario (a normally-operating
network) and an SC scenario. The latter was simulated by dropping the antenna gain from 15 dBi
(reference scenario) to -50 dBi (SC scenario). Measurements reported from UEs are then collected
from each scenario and stored in a database. The reference scenario provided measurements used
by an anomaly detection model that is based on the k-nearest-neighbor algorithm to provide a
network model with normal behavior. Multidimensional Scaling (MDS) is used to produce a
minimalistic Key Performance Index (KPI) representation. Thus the interrelationship between
Performance Indexes (PIs) is reflected and an embedded space is constructed. Consequently,
similar measurements (i.e. normal network behavior) lie within close distances while dissimilar
measurements (i.e. anomalous network behavior) are far-scattered and hence easily identified.
The model attained 94 percent detection accuracy with a training time of 7 minutes.

2.2 A proposed architecture for fully automated MNO reporting system.
Mobile Network Operators (MNOs) collect vast amounts of data from a number of sources, as these
data can offer actionable plans in terms of service optimization. Visibility and availability
of information are vital for MNOs due to their role in decision making. Employing a reporting
system is pivotal in the cycle of transforming data to information, knowledge, and lastly to
actionable plans. The authors in [61] presented a case study aimed at illustrating the potential
role of big data analytics in the development of a fully automated reporting system. A Moroccan
MNO is to benefit from the alternative architecture. The authors highlighted the shortcomings of
the existing automatic reporting system that uses traditional technologies. Moreover, they
inferred that using big data analytics can provide the opportunity to overcome those
shortcomings.
The authors chose Apache Flink [61] in their proposed architecture to serve as their big data
analytics framework. Several reasons contributed towards this choice, including Apache Flink’s
ability to process data in both stream and batch modes, ease of deployment, and fast execution
when compared to other frameworks such as Spark. Furthermore, Apache Flink can be integrated
with other projects like HDFS for data storage purposes. Moreover, Apache Flink is scalable,
which makes it an optimal choice for this system.

2.3 Network anomaly detection using NetFlow data
Big data analytics can support the efforts in the subject of network anomaly and intrusion
detection. Towards that end, the authors in [62] proposed an unsupervised network anomaly
detection method powered by an Apache Spark cluster in Azure HDInsight. The proposed solution
uses a network protocol called NetFlow that collects traffic information that can be utilized
for the detection of network anomalies. The procedure starts by dividing the NetFlow data
embedded in the raw data stream into 1-minute intervals. NetFlows are then aggregated according
to the source IP, and data standardization is carried out. Afterwards, a k-means algorithm is
employed to cluster the aggregated NetFlows according to normal or abnormal traffic behavior.
The following step is to calculate the Euclidean distance between the cluster center and its
elements. The procedure concludes by evaluating the success criteria. The authors considered a
dataset containing 4.75 hours of records captured from CTU University to analyze botnet traffic.
The proposed approach attained 96% accuracy and the results were visualized in 3D after
employing Principal Component Analysis (PCA) to attain dimension reduction.

3. Role of big data analytics in cellular network design
In this section, we review the research done on the use of big data analytics for the design of
cellular networks. Compared to other network design topics, we observed that the wireless field
has received the most attention, as measured by its share of research papers. These papers can
be classified according to the application or area under investigation. Consequently, we have
classified those papers into the following:
1- Counter-failure-related: This includes fault tolerance (i.e. detection and correction),
   prediction, and prevention techniques that use big data analytics in cellular networks.
2- Network monitoring: This illustrates how big data analytics can be beneficial as a
   large-scale tool for data traffic monitoring in cellular networks.
3- Cache-related: Investigates how big data analytics can be used for content delivery, cache
   node placement and distribution, location-specific content caching, and proactive caching.
4- Network optimization: Big data analytics can be involved in several topics including
   predictive wireless resource allocation, interference avoidance, optimizing the network in
   light of Quality of Experience (QoE), and flexible network planning in light of consumption
   prediction.
It should be noted that Table 2 provides further detailed classification, with the chance to
compare the role played by big data analytics across different network types and applications.
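
To make the counter-failure category more concrete, the NetFlow procedure summarized in Section
2.3 (1-minute windows, aggregation by source IP, standardization, k-means, Euclidean distance
to the cluster center) can be sketched in plain Python. This is only an illustrative sketch and
not the Spark implementation of [62]. The flow tuple layout, the three features, the
deterministic start of k-means, and the rule that the smaller cluster is treated as abnormal
are all assumptions made here.

```python
import math
from collections import defaultdict


def aggregate(flows, window=60):
    """Sum bytes and packets and count flows per (time window, source IP)."""
    agg = defaultdict(lambda: [0.0, 0.0, 0.0])
    for ts, src_ip, n_bytes, n_packets in flows:  # assumed flow layout
        row = agg[(int(ts // window), src_ip)]
        row[0] += n_bytes
        row[1] += n_packets
        row[2] += 1
    keys = sorted(agg)
    return keys, [agg[k] for k in keys]


def standardize(rows):
    """Z-score every feature column; a constant column maps to zeros."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]


def kmeans(points, k=2, iterations=50):
    """Lloyd's k-means with a deterministic farthest-point initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append([sum(c) / len(members) for c in zip(*members)]
                               if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def detect(flows, window=60):
    """Return (key, is_abnormal, distance_to_center) per aggregated record."""
    keys, rows = aggregate(flows, window)
    points = standardize(rows)
    labels, centers = kmeans(points, k=2)
    distances = [math.dist(p, centers[lab]) for p, lab in zip(points, labels)]
    # Assumption of this sketch: the smaller cluster holds the abnormal traffic.
    sizes = [labels.count(j) for j in range(2)]
    abnormal = sizes.index(min(sizes))
    return [(key, lab == abnormal, d) for key, lab, d in zip(keys, labels, distances)]
```

The returned distances could feed whatever success criteria are chosen for evaluation. The
PCA-based 3D visualization mentioned in Section 2.3 is left out of the sketch.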
3.1 Failure prediction, detection, recovery, and prevention

3.1.1 Inter-technology failed handover analysis using big data
One of the most frustrating encounters happens when a mobile subscriber gets surprised by a
sudden call drop. Many of these incidents occur when the user is at the edge of a coverage area
and moving towards another, technologically-different area, e.g., moving from a 3G Base Station
(BS) to a 2G BS. The common solutions to address such shortcomings are either conducting drive
tests or performing network simulation. However, another solution that leverages the power of
big data was proposed by the authors in [63]. The proposed solution uses big data analytics
(Hadoop platform) to analyze the Base Station System Application Part (BSSAP) messages exchanged
between the Base Station Subsystem (BSS) and Mobile Switching Center (MSC) nodes. Location
updates (only those involved in the inter-technology handover) are identified, and the
geographic locations where the 3G-service disconnections occur are identified by relying on the
provided target Cell ID.
The results of the above method were then compared with the results of a drive test (which is an
expensive and time-consuming approach), where coherence between the two results was
demonstrated. Another comparison was conducted with the Key Performance Index (KPI)-based
approach and the results were in favor of the proposed approach.

3.1.2 Signaling data-based intelligent LTE network optimization
By utilizing the combination of all-around signaling and user and wireless environment data,
combined with Self-Organized Network technologies (SON), full-scale automatic network
optimization could be realized.
The authors of [27] developed an intelligent cellular network optimization platform based on
signaling data. This system involves three main stages:
1- Defining network performance indicators through the extraction of XDR keywords: The External
   Data Representation (XDR) contains the key information of the signaling (e.g., the causes of
   the process failures and signaling types). The status of a complete signaling process can
   also be identified by the XDR (e.g., the success or failure of signaling establishment and
   release). A number of performance indicators are defined by relying on this information.
   Querying these indicators is possible from multiple dimensions and levels (e.g., user, cell,
   and grid level).
2- Problem discovery: Service establishment rate, the handover success rate, and drop rate are
   among the network signaling-plane statuses that can be reflected by the XDR-based network
   performance indicators. Network equipment with unsatisfactory performance indicators can be
   further analyzed, and this can be done by conducting a further excavation of the
   corresponding indicators’ original signaling.
3- Providing best practice solutions: Identified and solved problems can provide an optimization
   experience. As a consequence, a variety of network problems can be verified. For example,
   when a cell has a low handover success rate, according to the definition of the associated
   indicators, the reason is suggested to be the low success rate of the handover preparation.
   The solution would be to adjust the overlapping coverage areas formed between the source and
   the target cells and the parameters (e.g., the decision threshold offset and the handover
   initiation).
A recommended solution can be provided when a deteriorating indicator surfaces, and this is
simply done by clicking the index query that caused the deterioration.

3.1.3 Anomaly detection in cellular networks
When a certain problem occurs in the cellular network, the user would usually be the first who
feels the service disruption and suffers the impact. An abnormal and disrupted service may be
identified by examining the Call Detail Record (CDR) of the users in a specific area. CDR files
are generated upon making a call, and include, among other information, the caller and called
numbers, the call duration, the caller location, and the cell ID where the call was initiated or
received.
A CDR-based Anomaly Detection Method (CADM) was proposed by the authors in [64]. CADM was used
to detect the anomalous behavior of user movements in a cellular network. This was done, first,
with the CDR data being collected from the network nodes and stored in a mediation department.
Then, the second phase starts by distributing the collected CDRs to the relevant departments
(e.g., data warehouse, billing, and charging departments). After that, the Hadoop platform is
used to detect the anomalies. The discovered anomalies are then fed back to the mediation
department for adequate actions.
The use of big data analytics was essential in this case. Large datasets that require
distributed processing across computer clusters were processed by the Hadoop platform. The
result was an improved system that is able to detect location-based anomalies and improve the
cellular system’s performance.

3.1.4 Self-healing in cellular networks
The idea to develop a system that is capable of monitoring itself, detecting the faults,
performing diagnoses, issuing a compensation procedure, and conducting a recovery is very
appealing. However, the self-healing process has another factor to keep in mind, which is time.
The process should be carried out within a reasonable amount of time so it would not degrade the
quality of the delivered services.
Three use cases were presented by the authors in [65] for a self-healing process in cellular
networks:
1- Data Reduction: The Operation and Maintenance (O&M) database can be used for troubleshooting
   purposes. However, the database size is relatively large as it contains the data related to
   both normal and degraded intervals, which makes it difficult to process. Separating the
   intervals to just keep the degraded intervals will help in reducing that size. The authors
   proposed parallelizing this process independently by analyzing each BS separately.
   They chose the degraded interval detection algorithm of [66] (a degraded interval is the time
   where the BS behavior is degraded), and these intervals were detected by comparing the BS’s
   KPIs to a certain threshold. This algorithm was parallelized by implementing it as a map
   function, a field is added to identify each BS, and all the fields are added by a reduce
   function.
2- Detecting Sleeping Cells: Cell outage or sleeping cells is a common problem in mobile
   networks. Users are directed to neighboring cells instead of the nearest and optimal cell.
   According to the algorithm described in [67], sleeping cells can be detected through the
   utilization of neighboring BS measurements, hence calculating the impact of the sleeping cell
   outage. The detection process relies on the Resource Output Period (ROP), where each BS
   produces Configuration Management (CM), Fault Management (FM), and Performance Management
   (PM) data every 15 minutes. For each BS, incoming handovers from neighboring BSs are
   aggregated for the current and previous ROP. If the number of handovers suddenly drops to
   zero, and a malfunction is indicated by the cell’s Performance Indicators (PIs), the cell is
   regarded as a sleeping cell.
   The authors in [65] proposed the use of the above-mentioned algorithm under the big data
   principle. They proposed to divide the terrain into partitions that are the maximum distance
   between neighbors, where each BS within the partitioned area is sequentially tested by an
   instance of the algorithm, and this is done by examining the data of its neighbors.
   This approach was compared to other methods (e.g., lack of KPIs and availability of KPIs),
   and most of the simulated outages were detected (5.9% false negatives and 0% false
   positives). In contrast, the lack-of-KPIs and availability-of-KPIs methods showed a high
   percentage of false negatives.
3- KPI Correlation-Based Diagnosis: The authors in [65] used a method that utilizes the most
   correlated KPIs to identify the problem cause. To simplify the analysis task, the algorithm
   considers the PIs of both the affected BS and the neighboring sectors.
   MapReduce was used to implement this algorithm in a parallelized manner; the correlation
   process and the creation of a PIs list arranged by correlation were implemented as map and
   reduce functions, respectively.

3.1.5 Cell site equipment failure prediction
A sudden outage of services might have serious consequences, and this is why keeping
communication equipment, like cell sites, in a good working state is of high importance. The
challenge identified by the authors in [68] is to analyze the user’s bandwidth on the cell
level. Equipment failure and infrastructure faults can be predicted by analyzing the bandwidth
trends in a particular cell.
Due to the size and diversity of the collected data, it is essential to use big data analytics
to process it. Thus, the customers’ received bandwidth can be acquired over a particular time
period (e.g., a month or a year). Next, the data from diverse data sources are integrated and
then analyzed to know the bandwidth trends.

3.2 Network monitoring

3.2.1 Large-scale cellular network traffic monitoring and analysis
Large cellular networks have relatively high data rate links and high requirements to meet.
Usually these networks use a high-performance and large capacity server to perform traffic
monitoring and analysis.
However, with the continuous expansion in data rates, data volumes, and the requirements for
detailed analysis, this approach seems to have limited scalability. Hence, the authors of [69]
proposed a system to undertake that task, utilizing Hadoop MapReduce, HDFS, and HBase (a
distributed storage system that manages the storage of structured data and stores them in a
key/value pair) as an advanced distributed computing platform. They exploited its capability of
dealing with large data volumes while operating on commodity hardware. The proposed system was
deployed in the core side of a commercial cellular network, and it was capable of handling 4.2
TB of data per day supplied through 123 Gbps links with low cost and high performance.

3.2.2 Mobile internet big data operator
China Unicom, China’s largest WCDMA 3G mobile operator with 250 million subscribers in 2012,
introduced an industry ecosystem. The researchers in [70] highlighted this as a telecom
operator-centric ecosystem that is based on a big data platform.
The above-mentioned big data platform is developed for retrieving and analyzing data generated
by mobile Internet users. In an aim to optimize the storage, enhance the performance, and
accelerate the database transactions, the authors proposed a platform that uses HDFS for
distributed storage. The cluster had 188 nodes used to store data, perform statistical data
analyses, and as management nodes. The approximate storage space was 1.9 PB. HBase has the role
of the distributed database, with a writing rate that can reach 145k records per second; HBase
stores the structured data located on the HDFS.
Compared with the Oracle database, it is noted that the system achieved a four times lower
insertion rate. The query rate was compared to an Oracle database as well, and HBase showed a
better performance when taking into consideration the impact imposed by the records’ size.

3.3 Cache and content delivery

3.3.1 Optimized bandwidth allocation for content delivery
Mobile networks usually have a large number of users, and with the increase in Internet-based
applications, it has become essential to allocate the required bandwidth that meets the user
expectations, as well as to ensure a competitive level of service quality. Cellular networks can
provide Internet connectivity to their users at any time; however, video (especially high
quality) contents are still slow and relatively expensive. From the base station’s point of
view, the impact of forwarding the same video content to several users on the same base station
is massive. The LTE system addressed this through multicast techniques. However, multicast is
still regarded as a big challenge in cellular networks. To overcome
the above problem, the authors of [26] proposed a solution that
can dynamically allocate bandwidth. The idea is based on
sharing the base station’s wireless channel among a user cluster
that wishes to download the contents. This saves base
station resources, provides a better data rate for the
clustered users, and gives the users who did not join the
cluster an opportunity to benefit from the saved
resources (bandwidth). It should be noted that the clustered
users can receive the contents from the cluster head by using
short range communication techniques like Wi-Fi Direct [71]
and Device to Device (D2D).
Two conditions have to be satisfied before forming a user
cluster. First, the users who request the same content are the
ones who form the cluster. Second, the users should be or will
be within a short range of each other. For that reason, the
authors suggested using big data analytics to identify the
users’ closeness and to group the users into cluster(s). A
cluster head is then selected among the nearby users, and the
process is repeated among the base station users until there is
either a cluster of users or a free (un-clustered) user(s). The
simulation was carried out for a single base station network and
the results showed faster content delivery and improved
throughput at the user level.

3.3.2 Improve cache node determination, allocation, and
distribution accuracy in cognitive radio networks
In cognitive radio networks, Secondary Users (SUs) have to
leave the licensed spectrum when their activity starts to affect
the QoS level of the licensed users. This move would require
the existence of a cache node to compensate for the
interrupted data transactions during the SU switch to the
unlicensed spectrum.
The author of [72] proposed the use of big data analytics to
process the data accumulated over time within the nodes. The
goal was to utilize this data to reach a decision on the cache
node distribution in a cluster network.
The author selected two out of three categories (open and
selectively open systems) of cognitive radio networks. Due to
the nature of the open systems, every SU willingly shares its
information to be processed, which results in a large amount
of data, so the prediction accuracy is high.
For the selectively open systems, the SU selectively shares
its information with either some cache nodes, with the cluster
head for a particular time interval, or with specific SUs in a
cluster. This results in a variable amount of shared data, thus
resulting in variable accuracy.

3.3.3 Tracking and caching popular data
The number of social network (e.g., Facebook and Twitter)
users is massive. The multimedia contents of these networks
are normally shared between common interest groups.
However, big and important events attract a lot of attention
and consequently a lot of content is shared across these
networks. When a certain video or event goes viral, this
sharing will eventually burden the network as the requested
content would have to travel along the network on its way to
the servers. A solution to such a problem was suggested by
the authors of [68], who proposed monitoring popular and
social media websites, analyzing the data, identifying if there
is a growing interest in certain content, by which age category,
and caching the popular data for a specific base station. Big
data analytics can be of major use in this situation by
performing the required analysis. The result would be
cached content available to the users faster (reduced
provisioning delay) and without burdening the network.

3.3.4 Proactive caching in 5G networks
Cache-enabled base stations can serve cellular subscribers
by predicting the most strategic contents and
storing them in their cache. This minimizes both the amount
of time and the consumed network bandwidth, which can
pay off in other ways (e.g., less congestion and less resource
utilization).
An approach proposed by the authors in [34] used big data
analytics and machine learning to develop a proactive caching
mechanism by predicting the popularity distribution of the
content in 5G cellular networks. They demonstrated that this
approach can achieve efficient utilization of network resources
(backhaul offloading) and an enhanced user experience.
After collecting the raw data, i.e., the user traffic, the big
data platform (Hadoop) has the task of predicting the user
demands by extracting the useful information, like Location
Area Code (LAC), Hyper Text Transfer Protocol (HTTP)
request-Uniform Resource Identifier (URI), Tunnel Endpoint
Identifier (TEID)-DATA, and TEID for control and data
planes. This information is then used to evaluate the content
popularity from the previously collected raw data.
Experimentally testing this work on 16 base stations, as part of
an operational cellular network, resulted in 100% request
satisfaction and 98% backhaul offloading.

3.4 Network optimization

3.4.1 Big data-driven mobile network optimization framework
When thinking about optimizing a cellular network, it is
important to collect as much information as possible. Large
networks, as well as their users, generate a plethora of data,
for which the use of big data analytics is vital to analyze the
colossal amount.
The authors in [73] proposed a mobile network optimization
framework that is Big Data Driven (BDD). This framework
includes several stages: the collection of big
data, storage management, data analytics, and, as the last
stage of the process, network optimization.
Three case studies were used to show that the proposed
framework could be used for mobile network optimization.
1- Managing resources in HetNets:
The Mobile Network Operators (MNOs) may use big data
to provide real time and history analysis across users, mobile
networks, and service providers. MNOs can benefit from BDD
approaches in the operation and deployment of their network,
and this can be done in several stages:
A) Network Planning: Due to a lack of
sufficient statistical data, evolved Node B (eNB) sites are
not optimized; this can be dealt with if an
adequate amount of information (user and network) is
provided for analyses. Big data analytics can help MNOs
reach better decisions concerning the deployment of eNB
in the mobile network. The authors in [73] suggested the
use of the network and anonymous users’ data (e.g.,
dynamic position information and other service features).
Providing a relation between the data and their events can
offer a better understanding of the traffic trends. Big data
sets provide actionable knowledge to reach an optimal
decision concerning how and where to deploy eNBs in the
network. Another important feature is the ability to
prepare for future investments depending on the predicted
traffic trends.
B) Predictive Resource Allocation: Resource requirements
change depending on the density and usage patterns of
mobile network subscribers. Predicting where and when
mobile users are using the network can help in preparing
for sudden significant traffic fluctuations. The authors in
[73] suggested the use of big data analytics to examine
behavioral and sentiment data from social networks and
other sources. They also showed an interest in utilizing
current and historical data to predict the traffic in highly
populated areas within the network.
Using the cloud RAN architecture [74], the right place at
the right time can be served through the predictive
resource allocation, so that minimal service disruption can
be achieved.
C) Interference Coordination: HetNets with small cells can
be used to conduct interference coordination among
macro and small cells. This coordination has to be carried
out in the time domain instead of the frequency domain.
Schemes like the enhanced Inter-Cell Interference
Coordination (eICIC) in LTE-Advanced [75] efficiently
enable resource allocation among interfering cells and
improve the inter-cell load balancing in the HetNets.
eICIC allows a Macro cell evolved Node B (MeNB) and
its neighboring Small cell eNBs (SeNBs) to have data
transmitted in isolated subframes, so that interference from
the MeNB to the SeNBs can be avoided. To implement eICIC, a
special type of subframe, named an Almost Blank
Subframe (ABS), that carries minimum (and most
essential) control information was defined. It is worth
noting that the ABS subframes are transmitted with
reduced power [75], and that the network operator can
control the configuration of that subframe.
Many factors contribute to the determination of the ABS
ratio of the macro cell to the small cell, such as the traffic
load in a specific area, the service type, and so on. The
optimal ABS ratio varies dynamically because
inter-cell interference changes with time due to
the factors mentioned above.
In a BDD system, optimizing the radio resource allocation
can be accomplished through the use of network
analytics. The deployment of BDD optimization functions
at the MeNB would enable it to collect and analyze
eNB-originated raw big data (e.g., service characteristics
and traffic features) in real time, thus enabling a quick
response. As a result, the performance optimization of
each cell and the users can be fulfilled.
Optimizing ICIC parameters (e.g., ABS ratio) can be
achieved by processing raw data in a periodic manner to
acquire statistics and to detect traffic variations
automatically.
Furthermore, the location and user traffic demands of
multiple eNBs can be optimized, offering the deactivation
of a SeNB due to elevated Signal-to-Interference-plus-
Noise Ratio (SINR) to avoid the interference caused by a
nearby SeNB, which would also reduce the
energy consumption.

2- Deployment of cache servers in mobile CDN:
Popular content (e.g., movies) can be delivered through
a Content Delivery Network (CDN), which is a method
that is considered efficient by many MNOs. Distributed
cache servers should be located near the users to achieve a
fast response as well as to reduce the delivery cost. In a
hierarchical CDN, it is vital to place cache servers in an
optimal location. Due to the unique features of the RAN,
it was the primary interest of the authors in [73].
It is expected that there will be an enhanced backhaul
capability in 5G networks, and this would result in
minimizing the concerns related to the latency and traffic
load of backhaul transmissions. Therefore, not all MeNBs
would require a dedicated distributed cache server. In
addition, a SeNB can have a distributed cache server.
Optimal cache server placement depends on several
factors, such as the features and load of traffic in a given
area, as well as the cost of storage and streaming
equipment. To help the MNOs decide where to deploy
their cache servers, data analytics methods can be regarded
as a feasible solution. However, this would require the
collection of all the above-mentioned factors over a long
period in the related coverage area.

3- QoE modelling for the support of network optimization:
The authors of [73] believed that the management of
services and applications needed more than just relying on
the QoS parameters. Instead, they suggested that the
quality as perceived by the end users (i.e., QoE) be
regarded as the optimization objective. Accurate and
automatic real time QoE estimation is important to realize
the optimization objective. In addition to the technical
factors, non-technical factors (e.g., user emotions, habits,
and expectations) can affect the QoE. A profile for
each particular user comprising the above non-technical
factors would help in the QoE evaluation.
Since answering the questions that would lead to a clear
profile is not a task that a typical user would welcome,
the authors suggested installing a profile collection engine
on the users’ mobile devices. User activities are compared
and tracked to recognize differences and similarities, and
then they are stored in a database for additional processing.
After profiling, the next step is to use
machine learning to identify the relationship between QoE
and the influencing factors.
Data analytics can be used to discover what impacts the
QoE in users’ devices, as well as the services and network
resources. The next step is for the network optimization
functions to determine what caused the problem
and select the optimal action accordingly.
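As an illustration of the QoE modelling case study above, the following minimal Python sketch ranks candidate influencing factors by the strength of their linear correlation with a measured QoE score. It is not the method of [73], which only states that machine learning is used; Pearson correlation is swapped in here as a simple stand-in, and the factor names, sample values, and the function rank_qoe_factors are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no linear information
    return cov / sqrt(var_x * var_y)


def rank_qoe_factors(profiles, qoe_scores):
    """Rank factors by |correlation| with the QoE score, strongest first."""
    factors = sorted(profiles[0])
    scored = [(f, pearson([p[f] for p in profiles], qoe_scores)) for f in factors]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical per-session records mixing technical and non-technical factors.
profiles = [
    {"latency_ms": 20, "throughput_mbps": 10, "expectation": 3},
    {"latency_ms": 40, "throughput_mbps": 10, "expectation": 4},
    {"latency_ms": 60, "throughput_mbps": 10, "expectation": 3},
    {"latency_ms": 80, "throughput_mbps": 10, "expectation": 5},
]
qoe_scores = [4.5, 4.0, 3.0, 2.0]  # made-up opinion scores, one per record

for factor, r in rank_qoe_factors(profiles, qoe_scores):
    print(f"{factor}: r = {r:+.2f}")
```

The factor ranked first would be the first candidate for the network optimization functions to act on; a deployed system would replace the correlation with a trained model.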
3.4.2 Improve QoS in cellular networks through self-
configured cells and self-optimized handover
Cellular networks have a crucial element on which the
concept of mobility depends. This element is the handover
success rate, which ensures call continuity while the user
moves from one cell to another. Failing in that particular
element would impact the quality of the service, thus putting
the operator into a questionable situation.
Operators try to make sure that each cell has a list of
manually configured and optimized neighbor cells. It is
important to note, however, the high probability of these cells
failing to adapt when a rapid response is required due to a
sudden network change.
The authors in [76] presented two methods that used big
data analytics to introduce a self-configured and self-
optimized handover process: the first was associated with
newly introduced cells, while the second was concerned with
the already existing cells. The analysis started by collecting
and archiving predefined handover KPIs. A dispatcher process
is run after the collection period, and its aim is to check the
files to see if they were marked as new cells (where Self-
Configuration Analytics is started) or not (where Self-
Optimization Analytics is started):

1- NCL self-configuration for new cells:
Newly installed base stations require a Neighbor Cell List
(NCL) to be configured on the new cells. The selection
process takes into consideration the antenna type, the
azimuth angle (for directional cells), and the geographic
location of the candidate cells; the process concludes
by selecting cells with a minimum distance and maximum
traffic load to be the top candidate cells. The NCL is
configured via Network Management System (NMS)
Configuration Management (CM) tools.
2- NCL self-optimization for existing cells:
The process starts by collecting KPI measurement
statistics for the failed and successful handovers, and this
task is done by the Performance Management (PM) or the
NMS.
Cells with a handover success rate below a predefined
threshold are excluded from the NCL, while unlisted
neighboring cells with a success rate above a predefined
threshold are considered as new neighboring cells.

3.4.3 Optimizing the resource allocation in LTE-A/5G
networks
The overall system performance evaluation in advanced
wireless systems, like LTE, depends on KPIs. In a quest to
enhance the user experience, the authors of [77] proposed an
approach that utilizes user and network data, such as
configuration and log files, alarms, and database
entries/updates. This approach relies on the use of big data
analytics to process the above-mentioned data. The ultimate
goal is to provide an optimal solution to the problem of
allocating radio resources to RAN users, and to guarantee a
minimal latency between requesting the resource and
allocating it. This is done through user and network behavior
identification, which is a task well-matched for big data
analytics.
The proposed framework involves three stages:
First stage: This process is carried out in the eNB system,
processing the data from the cellular and core network side.
Binary values are acquired by comparing cellular level KPIs to
their respective threshold values, thus keeping the binary
matrix updated. This procedure is repeated at fixed intervals.
Second stage: The same steps as in the first stage are repeated.
However, this process is carried out on subscriber level data to
acquire subscriber KPIs and maintain a binary matrix.
Third stage: This is activated when a user initiates a resource
allocation request. A binary pattern is generated based on the
user requirements. This pattern is later handed over to stage 2
to update the binary matrix (if required) and incorporate the
new values in the row that represents the requested bandwidth.
After generating the updated row, it is transferred to the first
stage for comparisons with the current Physical Resource
Block (PRB) groups. To identify which PRBs suit the user, the
fuzzy binary pattern-matching algorithm [78] was used.
Using this algorithm, the execution time increased linearly
for an exponential increase in the number of comparison
patterns.

3.4.4 Framework development for big data empowered SON
for 5G
The authors of [79] proposed a framework called Big data
empowered SON (BSON) for 5G cellular networks.
Developing an end-to-end network visibility is the core idea of
BSON. This is realized by employing appropriate machine
learning tools to obtain intelligence from big data.
According to the authors, what makes BSON distinct from
SON are three main features:
• Having complete intelligence on the status of the current
network.
• Having the ability to predict user behavior.
• Having the ability to link between network response and
network parameters.
The proposed framework contains operational and
functional blocks, and it involves the following steps:
1- Data Gathering: An aggregate data set is formed from all
the information sources in the network (e.g., subscriber,
cell, and core network levels).
2- Data Transformation: This involves transforming the big
data to the right data. This process has several steps,
starting from:
a. Classifying the data according to key Operational and
Business Objectives (OBO), such as accessibility,
retainability, integrity, mobility, and business
intelligence.
b. Unify/diffuse stage, and the result of this stage is more
significant KPIs, which are obtained by unifying
multiple Performance Indicators (PIs).
c. According to the KPI impact on each OBO, the KPIs
are ranked.
d. KPIs whose impact on the OBO is below a pre-defined
threshold are filtered out.
e. Relate: for each KPI, find the Network Parameters
(NPs) that affect it.
f. Order the associated NPs for each KPI according to their
association strength.
g. Cross-correlate each NP by finding a vector that
quantifies its association with each KPI.
3- Model: Learn from the right data acquired in step 2 that will
contribute to the development of a network behavior
model.
4- Run SON engine: New NPs are determined and new KPIs
are identified using the SON engine on the model.
5- Validate: If a new NP can be evaluated by expert
knowledge or previous operator experience, proceed with
the changes. Otherwise, the network simulated behavior
for new NPs is determined. If the simulated behavior
tallies with the KPIs, proceed with the new NPs.
6- Relearn/improve: If the validation in step 5 was
unsuccessful, feedback is sent to the concept drift [80] block,
which will update the behavior model. To maintain model
accuracy, concept drift can be triggered periodically even
if there was a positive outcome in the validation step.

3.4.5 User-centric 5G access network design
Enhancing the user experience, giving a higher data rate,
and reducing the latency are considered the key goals of a 5G
system. The authors of [27] expect the following to be the key
elements in the design of a user-centric 5G access network.

3.4.5.1 Personalized local content provisioning
It is important for the access network to evolve from being
user and service agnostic, by acting merely as a blind pipe that
connects the user to the core network, to becoming user-
centric. The latter term would direct the network access to the
right path of being user and service aware, thus facilitating a
key technology in 5G, which is local content provisioning. This,
in turn, would pay off in the form of reduced end-to-end
latency and an enhanced user experience.
The role of big data analytics becomes very clear, as it
provides a necessary ability of predicting user requirements.
These requirements can be met in the case of local availability.
The authors of [27] proposed several steps to achieve
content provisioning, and they are as follows:
• Traffic and User Information Acquisition: Traffic attributes
(e.g., application type, server address, and port number) are
collected through packet analysis, and analyzed using a
clustering algorithm (e.g., k-means [81]) to perform traffic
labelling (e.g., news, sports, and romance).
• Analyzing and Predicting User Requirements: This step is
achieved by a big data analysis algorithm (an algorithm
like collaborative filtering was recommended [82]), which
utilizes the traffic attributes mentioned above, and thus
recommends content based on the user’s interests.
• Local Caching and Content Management: Popular content
is provided in the form of local copies. Content that might
be of interest to the users is to be locally cached after
being downloaded from the application server.
• Content Provisioning: When an application request is
initiated by the user, the system will check if the content is
already locally cached, so it can be sent directly.
Furthermore, big data analysis can give content
recommendations, thus the system will check if these
contents are cached locally, and push them directly to the
user.

3.4.5.2 Providing flexible network and functionality
deployment
The authors of [27] noted that processing the characteristics of
both the regional user and service using big data analytics can
be of great help in flexible network and functionality
deployment in the following ways:
• Flexible Network Deployment
Since 5G will support diverse low-cost Access Points
(APs), such as coverage APs and hot spot APs, using big
data analytics can be useful to forecast the traffic
characteristic, and hence establish the basis for achieving a
dynamic network deployment for the APs.
• Flexible Functionality Deployment
Analyzing and predicting regional user and service
requirements can be realized using big data analytics. This
will form the foundation of dynamic functionality
deployment, which will help decide where to deploy
certain functionality modules (e.g., safety modules where
there are security requirements).

3.4.5.3 Using user behavior awareness to optimize wireless
resources in 5G networks
Optimization of wireless resources should be carried out
according to the user and 5G service requirements. This is
done to enhance the user experience and improve the
efficiency.
According to what the authors of [27] proposed, big data
can be used to analyze user mobility patterns, predict the
motion trajectory, and hence pre-configure the network
accordingly. Each user’s historical Access Point (AP) list is
recorded by the APs. This data can be uploaded to a central
module for processing, or to a target AP in case the service AP
was altered. Using a big data analytics algorithm, the collected
data is analyzed to forecast the motion trajectories.

3.4.5.4 Big data-based network operation system
The authors of [27] proposed a system that can maximize the
efficiency of big data based network operation. This can be
fulfilled by optimally allocating the network resources to each
AP and user. The proposed system consists of two parts:
1- Decision making domain
This is responsible for collecting and managing the user,
network upgrades, configuration, service, and terminal
state information. This domain exploits big data analytics
to provide the basic configuration essential for network
initialization. For this domain to function properly, it
would need to realize the whole picture of the user and
service requirements, as well as the functionality
distribution in the network.
2- Implementation domain
It is mainly responsible for status reporting of the
user, terminal, and network, dynamic deployment, and
network configuration. Depending on the requirements
after acquiring the personalized configuration, this domain
can use the dynamic APs functionality and configuration to
build multi-connectivity bearers with terminals.
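The mobility prediction described in Section 3.4.5.3 can be illustrated with a small Python sketch. The authors of [27] do not specify the algorithm; a first-order transition count over each user's historical AP list is assumed here purely for illustration, and the AP identifiers and function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_transition_counts(ap_histories):
    """Count how often each AP is followed by each other AP."""
    counts = defaultdict(Counter)
    for history in ap_histories:
        for current_ap, next_ap in zip(history, history[1:]):
            counts[current_ap][next_ap] += 1
    return counts


def predict_next_ap(counts, current_ap):
    """Return the most frequently observed successor, or None if unseen."""
    followers = counts.get(current_ap)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical historical AP lists, one per user, as recorded by the APs.
histories = [
    ["AP1", "AP2", "AP3"],
    ["AP1", "AP2", "AP4"],
    ["AP5", "AP2", "AP3"],
]
counts = build_transition_counts(histories)
print(predict_next_ap(counts, "AP2"))
```

The predicted AP is where the network could be pre-configured ahead of the user's arrival; the counting could run in the central module or at a target AP, as the text describes.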
3.4.5.5 Multi-RAT or HetNet energy saving
Small cells are used in multi-Radio Access Technology
(RAT) or HetNet mobile networks. For energy saving
purposes, when traffic falls below a certain threshold, small
cells may enter into a dormant state, and this forces the small
neighboring cells to serve the dormant cell’s coverage area.
Several energy saving schemes were discussed in [83].
However, these schemes failed to adapt to the dynamic
temporal and spatial traffic variations, as they operate under a
relatively long time scale leading to unacceptable delay. This
happens because when a number of newly arrived User
Equipment (UE) seek access to the network, where cell
activation is essential, the corresponding small cells need
to be activated first, before operating normally. Only at that
point can the UEs access the network using the standard
procedure. UEs, however, would suffer from an unacceptable
delay when trying to access the cells.
The authors of [27] proposed a user-centric approach that is
based on big data analytics. The proposed solution claims to
achieve optimal implementation for the activation of small
cells as well as UE access to the network. Thus, joint
optimization for both UEs’ access and energy saving can be
achieved.

3.4.6 Network flexibility using consumption prediction
Consumption analysis is concerned with two factors:
customer locations and type of service. Consumption trends
can be classified temporally into long-term, seasonal,
and short-term.
To reach an accurate prediction, the authors in [68] implied
that user data (e.g., GPS location and service usage) can be
correlated with other data (e.g., news, social network, events,
and weather conditions). Using big data analytics to analyze
these correlations, operators would be able to decide when and
where to place their nodes without affecting the subscribers’
satisfaction.

4. The role of big data analytics in SDN & intra-data
center networks

SDN offers the ability to program the network with a
centralized controller; this controller is capable of
programming several data planes using one standardized open
interface, thus providing flexible architectural support [1]. The
following research topics utilized the properties of both
Software Defined Network (SDN) and big data analytics by
employing the analysis results to program the network. Those
topics can be classified according to the area under discussion
as follows:
1- Traffic Prediction: The papers surveyed in this section
employ traffic prediction to optimize network resource
allocation.
2- Traffic reduction: Pushing the aggregation from the edge
towards the network.

4.1 Traffic Prediction

4.1.1 Cognitive routing resources in SDN networks
The network’s next generation has to be smart and flexible,
with the ability to modify its strategy according to the network
status in an automatic manner. To simplify the network
management, SDN has made the above requirement possible
by decoupling the control and forwarding planes through the
OpenFlow protocol [84].
OpenFlow is considered to be the first standardized protocol
in SDN; it is also identified as the enabler of SDN.
SDN/OpenFlow has influenced Google to switch to OpenFlow
in its inter-data center network, which resulted in an
approximate 99% increase in the average Google WAN link
utilization [85].
The architecture proposed by the authors of [85] included
the following parts:
1- User preference analysis server:
The authors adopted the Hadoop platform to realize the
prediction functionality. They utilized the analyses of both
network traffic and user application information to find
each application flow distribution. For each data flow, they
found a specific distribution law. They analyzed that law,
and for different applications and areas they developed a
preliminary general prediction model to fit the case of the
same application but in different areas.
2- Interface design between SDN controller and database:
A cloud platform is responsible for calculating and
predicting the flow distribution values of each OpenFlow
switch. In addition, this platform will read the link
information and perform traffic prediction. A database will
hold the recorded values and the last predicted values will
be updated. To ensure that the allocation of resources
accommodates the traffic variation, Floodlight (a Java-
based SDN controller that can accommodate different
applications by loading different modules) will read the
newest predicted values from the database regularly.
3- SDN controller-based routing module:
The predicted values are used as preferences to select
the best route. The researchers used an improvement on the
Dijkstra algorithm.
Application awareness and the prediction of user preferences
were integrated into SDN through the newly proposed
architecture, which would facilitate and enhance network
resources allocation and provide better application
classification.
The role of big data is exemplified by its ability to use
network flow analyses and users’ behavior to forecast the type
and rate of the incoming traffic flows.
The procedure proposed by the authors of [85] is as follows:
1- Current network load (e.g., traffic volume and type) is
read by a cloud platform.
2- The overall traffic is predicted in advance and gathered in a
database. This is done using a big data-powered prediction
algorithm.
3- The SDN controller accesses the database and reads the
stored data mentioned in step two.
4- A resource allocation scheme is created by the SDN
controller using the above-mentioned routing algorithm,
and this scheme is sent later to the related switches.
Big data analytics can use the users’ requirements to provide
the network with a dynamic resource allocation and
application classification, hence providing the network with
better load balancing techniques.
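The routing step of the procedure in [85] can be sketched in Python as follows. The improvement on the Dijkstra algorithm used in [85] is not detailed in this survey, so the sketch uses plain Dijkstra and assumes, hypothetically, that each link's predicted load is read from the database and used directly as its cost; the switch names and load values are made up.

```python
import heapq


def least_loaded_path(predicted_load, source, target):
    """Plain Dijkstra over undirected links weighted by predicted load.

    predicted_load maps (switch_a, switch_b) to a non-negative cost.
    Returns the list of switches on the cheapest path, or None.
    """
    graph = {}
    for (a, b), load in predicted_load.items():
        graph.setdefault(a, []).append((b, load))
        graph.setdefault(b, []).append((a, load))

    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for neighbor, load in graph.get(node, []):
            candidate = d + load
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))

    if target not in dist:
        return None  # no route between the two switches
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


# Hypothetical predicted link loads (fraction of capacity) from the database.
predicted_load = {
    ("s1", "s2"): 0.9,
    ("s1", "s3"): 0.2,
    ("s3", "s2"): 0.3,
    ("s2", "s4"): 0.1,
}
print(least_loaded_path(predicted_load, "s1", "s4"))
```

In this toy topology the heavily loaded direct link between s1 and s2 is avoided in favor of the detour through s3, which is the kind of load-aware choice the controller would then push to the switches as flow entries.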
The results from the implemented testbed showed the ability of the proposed solution to self-adapt towards flow variation by dynamically issuing traffic tables to the related switches, which can increase the resource utilization and attain an improved overall load balance.

4.1.2 Predicting data communication volume at runtime in data center networks using SDN
Networks that deal with big data applications may suffer from the size and speed of data. For example, the networks' overall response time is affected by MapReduce's heavy-communication phase. This problem can be intensified if the communication patterns experience a heavy skew impact. The authors of [86] have proposed Pythia, which is a system that can optimize the data center network at runtime by utilizing the real-time communication prediction of Hadoop. It also maps the end-to-end flows to the underlying network. Pythia utilizes the SDN-offered programmability to achieve efficient and timely network resource allocation for shuffle transfers.
Depending on the network workload and blocking ratio, the Hadoop workload saw a consistent acceleration when using Pythia, and job completion time was reduced by between 3% and 46% in comparison with MapReduce benchmarks.

4.2 Traffic Reduction

4.2.1 Increasing network performance through traffic reduction
Large networks, such as those of Google and Facebook, or even small and medium sized enterprise networks suffer from a plethora of traffic. This happens due to the colossal amount of data being processed in either batch or real-time applications.
A common solution is to increase the available bandwidth in the enterprise clusters. However, the authors of [87] proposed another approach that improves the network performance by pushing the data aggregation from the edge towards the network, thus decreasing the traffic.
A platform called CamCube [88] was used; it substitutes the use of dedicated switches by distributing the functionality of the switch across the servers. It is worth noting that CamCube offers the ability to intercept and modify packets at each hop. In addition, it uses a direct-connect topology, in which 1 Gbps Ethernet cross-over cables connect servers directly to each other.
To exploit CamCube's properties and realize high performance, Camdoop, which is a CamCube service that runs MapReduce-like jobs, was used. It offers full on-path data stream aggregation. Camdoop builds aggregation trees where the children are represented by the intermediate data sources, while the roots are located at the servers performing the final reduction in traffic.
A small prototype of Camdoop running on CamCube was tested, and a simulation was used to show that the same properties still hold at scale. The results showed a significant traffic reduction with the proposed system when compared to Camdoop running on a switch and when compared to systems like Hadoop and Dryad/DryadLINQ [56] [89].

5. The Role of Big Data Analytics in Optical Networks
This section discusses research papers that employ big data analytics for optical network design. The topics are classified as follows:
1- Network optimization: Here the parallel processing merits of Hadoop are utilized to reduce the execution time of different (bin-packing) optimization algorithms.
2- Traffic Prediction: Using big data analytics to dynamically reconfigure the network according to predicted traffic.

5.1 Providing solutions to network optimization problems

5.1.1 Solving the RWA problem
The Routing and Wavelength Assignment (RWA) algorithm [90] plays an important role in optical networks. The authors in [91] considered the RWA algorithm to be an illustration of the bin-packing problem that is listed as a classical NP-hard problem [92].
The authors in [91] used a Hadoop cloud computing system that consisted of 10 low-end desktop computers to independently run an instance of the RWA algorithm on each of them for a certain number of demand sequences. The goal is to evaluate a sufficient number of demand sequences within a short period. The procedure is as follows:
1- An input file that incorporates the information of the lightpath demand requests is fed to the HDFS.
2- The file is read by the map function, where the demand list is regarded as a value and combined with different keys ranging from 0 to 19 that later serve as random seeds in the reduce functions. It is worth noting that the authors set two reduce functions per computer (i.e., a total of 20 reduce functions).
3- The key-value pairs are then forwarded to the 20 reduce functions where parallel computing is conducted. Within each reduce function, the lightpath demand list is randomly shuffled 250 times (i.e., 5000 shuffled demand sequences in total across the 20 reduce functions). To acquire the number of required wavelengths, the RWA heuristic is run for each of the shuffled demand sequences. The best result of each reduce function is then compared against the remaining 19 to find the global optimum.
Different test networks (ranging from 20 to 500 nodes) were used to evaluate the performance of the Hadoop system. The optimality of the results was evaluated by comparing them against the results of an Integer Linear Programming (ILP) optimization model, and they showed a close proximity to the optima (except for two cases). It is worth noting that the ILP approach assumes full wavelength conversion, which plays the role of the performance's lower bound in the present evaluation.

5.1.2 Solving multiple optimization problems using Hadoop
The authors in [93] proposed to solve several optimization problems in the optical network paradigm. The problems are:
1- Energy minimization problem [94], where the goal is to minimize the overall network power consumption from non-renewable energy sources.
2- Shared Backup Path Protection (SBPP)-based elastic optical network planning problem [95], where a heuristic used the concept of Spectrum Windows Planes (SWPs). The objective was to minimize the maximum number of Frequency Slots (FSs) in the network.
3- Adaptive Forward Error Correction (FEC) assignment problem [96], where the goal is to maximize the total number of FSs utilized for user data transmission. A heuristic based on SWPs was developed to solve the Routing and Spectrum Allocation (RSA) problem.
The above problems are of a bin-packing type and classified as NP-hard. Several aspects (i.e. demand size and route) should be considered when serving network traffic demands. Due to the high computational complexity and the dependence on the order of served demands, the performance of heuristic algorithms trying to solve these problems cannot be guaranteed when the simple largest-to-smallest ordering strategy is used. Good performance can be attained by randomly shuffling demand sequences, then running a heuristic algorithm for each sequence and choosing the one with the optimum performance. To shorten the computation time and to overcome the computational complexity, a Hadoop cloud computing system consisting of seven computers was proposed by the authors in [93]. This way, a heuristic algorithm can be executed for multiple shuffled demand sequences in a parallel manner.
The Hadoop MapReduce platform makes it possible to evaluate multiple shuffled demand sequences in parallel. A heuristic algorithm serves each of the shuffled demand sequences and a result is produced each time. The results are then compared and the one with the best performance is chosen. The same procedure is repeated on each reduce function. The final global optimum is found by comparing the results across all reduce functions.
Performance evaluation is done by employing two test networks: the 24-node, 43-link USNET network (adopted for problems 1 and 3) and the 11-node, 26-link COST239 network (used for problem 2). For the first optimization problem, the total consumption of non-renewable energy decreased by 8% (when the number of shuffled demand sequences increased from 1000 to 10,000). As for the second optimization problem, the number of required FSs was significantly reduced. In the third problem, the total number of FSs for user data transmission was increased and a 3% performance improvement was noted when compared with the case of running Hadoop on a single machine. The computation time for all three problems was significantly shorter compared to a single-Hadoop machine.

5.2 Dynamic network reconfiguration

5.2.1 VNT adaptability using traffic prediction
The emergence of new services is placing new demands on networks in terms of large and dynamic bit rate requirements. This caused network operators to look for a Virtual Network Topology (VNT) architecture that can cope with the anticipated traffic in a dynamic fashion. One solution is realized by overprovisioning the network; however, the downside is the increase in Total Cost of Ownership (TCO). Another solution saves power by using threshold-based capacity reconfiguration [97]. The drawback is that there is no saving in the number of transponders that need to be installed in each IP router compared to overprovisioning.
An alternative solution is proposed in [59] where VNT reconfiguration can be attained regularly using big data analytics. This is done by periodically analyzing Origin-Destination (OD) traffic so that VNT reconfiguration can be performed accordingly. Collection of traffic monitoring data takes place at edge IP routers regularly. A set of traffic samples is collected by every edge router for every other destination router. These sets are stored in a collected data repository. According to a predefined time period, the collected data is then summarized for each OD pair by periodically retrieving the collected data repository and performing data stream mining sketches. The result of this stage is a modeled data repository which includes, among others, maximum, average, and minimum bit rate for every OD pair. Using machine learning techniques (i.e. Artificial Neural Networks), a prediction module generates the predicted OD traffic matrix for the upcoming period. The decision on whether to perform VNT reconfiguration or not is determined by a decision-maker module by relying on the above-mentioned matrix. If a reconfiguration is required, a VNT optimizer is provided with both current and predicted OD traffic matrices. The solution is fed back to the network controller to implement the required changes in the data plane.
The performance of the proposed solution was compared against the static and the threshold-based methods. An overall saving between 8% and 42% in the number of installed transponders was achieved. The proposed solution has the ability to react during low traffic hours by deactivating transponders; thus, energy saving is attained. Also, the advantage of cost reduction is attained by releasing lightpaths from the underlying optical layer.

6. The role of big data analytics in network security

6.1 Peer-to-Peer Botnet detection
Many security problems on the Internet are caused by Botnets, which can be defined as networks of malware-infected machines controlled by an adversary [98]. Botnet attacks form a real security concern; with the ability to utilize 90,000 IPs in an attack [99], they are a challenge on an international scale, especially when taking into consideration the financial damage they can inflict.
To detect and neutralize such attacks, security researchers and network analysts consider packet capturing and network tracing to be amongst the most appreciated resources. However, analyzing these massive sized datasets is not an easy task for today's computers. To overcome that challenge, the authors of [100] proposed a scalable threat detection framework that uses the following components:
1- Traffic sniffer: Dumpcap [101] is used for packet sniffing while Tshark [101] is used for field extraction, and the fields are then submitted to the HDFS for storage.
2- Feature Extraction Module: The HDFS-submitted files are then processed by Apache Hive [102] for feature extraction.
3- Machine Learning Module: Scalability is a requirement when it comes to the machine learning module. This
requirement is met using a machine learning library called Mahout [103], thus harnessing the cluster's high computational power to achieve optimized results. It is worth noting that Mahout is built on top of Hadoop, and its classification and clustering core algorithms are run as MapReduce jobs.
The proposed approach achieved a detection time within tens of seconds, and the authors claimed that this time can be reduced to less than 10 seconds after some Hadoop tweaking and additions to the cluster.

6.2 Improving network security by discovering multi-pronged attacks
Networks are considered a target for intruders who would try to infiltrate them. Multi-pronged attacks may spread over network subnets; the spreading might target several scattered network points or take place in different events over time.
To discover and predict such attacks, the authors of [104] proposed a system named Big-distributed Intrusion Detection System (B-dIDS) that relies on two components:
1- HAMR: An in-memory MapReduce engine used for big data processing. It is worth noting that HAMR supports both batch and streaming analytics in a seamless manner.
2- An analytics engine: Residing on top of HAMR, the analytics engine includes a novel ensemble algorithm. Its basic principle relies on using clusters with multiple IDS alarms to extract the training data.
The proposed system scans the IDS log data, checking for alarms that might be treated as unthreatening at first glance (when examined separately) but that may result in an opposite judgment after combining them with other alerts.

6.3 Device fingerprinting in wireless networks
Big data analytics and machine learning have several algorithms in common [105]. More security concerns can be addressed through the use of machine learning algorithms such as device fingerprinting. The authors in [106] have conducted a survey on wireless device fingerprinting methods in wireless networks. They illustrated the main features and techniques used towards this end. Device fingerprinting can be defined as the process of generating device-specific signatures by gathering device information. This is done through analyzing the information across the protocol stack, and it can be used to counter the vulnerability of wireless networks to insider attacks and node forgery. Two types of fingerprinting algorithms were discussed: white-list based (i.e. supervised learning) and unsupervised learning based approaches.
The device fingerprinting process is broken into three main steps. Step one is concerned with identifying relevant features found in all layers across the protocol stack. Step two is where features are extracted and modeled. The features tend to be stochastic in nature due to the dynamic nature of wireless channels; consequently, the models will be stochastic as well. Step three is where device identification takes place by employing machine learning algorithms (supervised and unsupervised).
The authors reviewed the existing algorithms and concluded that despite the high computational complexity of unsupervised learning methods, their role is limited to detecting the presence and the likely culprit involved in the attack while failing to pinpoint the malicious devices in an exact manner. In spite of this limitation, unsupervised learning approaches showed further practicality when compared to white-list based approaches, as they require no pre-registration process or human intervention.

6.4 Machine learning methods for cybersecurity intrusion detection
The authors in [107] surveyed the topic of intrusion detection methods based on data mining and machine learning algorithms. They compared different methods taking into account complexity, accuracy, understandability of the final solution, and classification time of an unknown instance using a trained model. They referred to the availability of labeled data as the biggest gap that, if bridged, can lead to significant advances in machine learning and data mining methods in the field of cyber security. They also highlighted an open area for research, namely the investigation of fast incremental learning methods to update misuse and anomaly detection models on a daily basis.

Finally, we summarise the research outcomes related to big data analytics-based network design in Fig. 3. It is clear that the wireless field is getting most of the researchers' attention compared to the other fields. This may be attributed to the more significant challenges faced by wireless networks compared to wired networks and hence the more significant level of opportunities. The larger number of papers addressing the use of big data analytics and methods in wireless may also be a reflection of the larger number of researchers that focus on wireless networks compared to wired networks. Furthermore, we present a summary in Table 2 for all the research topics illustrated between sections 2 and 5.

Fig. 3. Percentage of surveyed research topics according to subject area.
Table 2: Research summary

Network Type | Research Category | Reference | Proposed or Deployed Technique
Wireless | Failure Prediction, Detection, Recovery, and Prevention | [63] | Analyzed inter-technology (2G-3G) failed handovers.
Wireless | Failure Prediction, Detection, Recovery, and Prevention | [27] | Used XDR data to discover network failures and present a solution advice.
Wireless | Failure Prediction, Detection, Recovery, and Prevention | [64] | Developed CADM which uses CDRs to identify anomalous sites.
Wireless | Failure Prediction, Detection, Recovery, and Prevention | [65] | Presented three case studies of self-healing using big data analytics.
Wireless | Failure Prediction, Detection, Recovery, and Prevention | [68] | Suggested the analysis of the bandwidth trends to predict equipment failure.
Wireless | Network Monitoring | [69] | Developed a Hadoop-based system to monitor and analyze network traffic.
Wireless | Network Monitoring | [70] | Developed a solution powered by big data platforms with distributed storage and distributed database to solve the issues of data analysis and acquisition.
Wireless | Cache and Content Delivery | [26] | Utilized big data to form a cluster made up of nearby users that share the base station's wireless channel.
Wireless | Cache and Content Delivery | [72] | Analyzed the data that resides within the cache nodes to enhance the determination, allocation, and distribution of cache nodes.
Wireless | Cache and Content Delivery | [68] | Suggested monitoring and analyzing social media and popular sites, to predict and cache certain contents, according to age category, at the predicted locations where these contents are highly demanded.
Wireless | Cache and Content Delivery | [34] | Proposed the use of big data analytics and machine learning techniques to proactively cache popular content in 5G networks.
Wireless | Network Optimization | [73] | Presented three case studies in which a proposed network optimization framework is efficiently utilized. In particular, the work suggested: 1) The use of big data analytics to manage resources in HetNets. This is done in three stages (network planning, resource allocation, and interference coordination). 2) The deployment of cache servers in mobile CDN. 3) The optimization of networks with QoE in mind.
Wireless | Network Optimization | [76] | Proposed NCL self-configuration/optimization algorithms to achieve an automatic, self-optimized handover. The work relied on the processing of CM and PM KPIs using big data analytics platform.
Wireless | Network Optimization | [77] | Developed a three-stage framework that utilizes the network and user KPIs to reach an optimal allocation of radio resources (PRBs).
Wireless | Network Optimization | [60] | Presented a framework that uses big data collected from the cellular network to empower SON. They also presented a case study on how to detect sleeping cells using this framework.
Wireless | Network Optimization | [27] | Investigated the impact of big data on 5G networks in terms of: 1) Efficient content provisioning. 2) Flexibility in functionality and network deployment. 3) Utilizing user behavior in wireless resource optimization. 4) Achieving highly efficient network operation. 5) Saving energy in HetNet or Multi-RAT networks.
Wireless | Network Optimization | [68] | Correlated location data, service usage, and other contextual data to predict the consumption trends and select the optimal node location.
SDN and Inter-Data Center | Traffic Prediction | [85] | Dynamic allocation of network resources by relying on traffic predicted by employing Hadoop platform.
SDN and Inter-Data Center | Traffic Prediction | [86] | Developed Pythia, a system that uses Hadoop's properties to predict the volume of data communication at runtime in data center networks.
SDN and Inter-Data Center | Traffic Reduction | [87] | Proposed Camdoop and run it over CamCube, the performance surpassed that of Camdoop running on a conventional switch.
Optical | Network Optimization | [91] | Used Hadoop to find a solution for the RWA problem.
Optical | Traffic Prediction | [59] | Predicted future traffic by using Big Data Analytics to reconfigure VNT regularly.
Optical | Network Optimization | [93] | Employed Hadoop MapReduce to solve multiple optimization problems of bin-packing nature.
Security | | [100] | Proposed a threat detection framework to detect peer-to-peer Botnet attacks.
Security | | [104] | Developed B-dIDS, a system that scans IDS log files to detect multi-pronged attacks in distributed networks.
Security | | [106] | Surveyed the topic of wireless device fingerprinting methods and how machine learning algorithms can be used for device identification.
Security | | [107] | Provided a survey discussing the role of data mining and machine learning algorithms in intrusion detection methods.

7. Big data analytics in the industry
Throughout our survey, we came across several companies that offer network solutions based on big data analytics. These companies and solutions are highlighted in Table 3. It should be noted that these solutions are enabled by research conducted in their corresponding areas. We have added academic research papers related to each solution in Table 3. Due to the proprietary nature of industrial products, the exact algorithms or methods behind these products are not available in the open literature. Therefore, academic papers with related concept(s) are cited. NetReflex IP and NetReflex MPLS utilize big data analytics [27, 73, 108] to provide services like anomaly analysis and traffic analysis. Nokia provided several solutions targeting the wireless field. For example, Traffica introduces itself as a real-time traffic monitoring tool that analyzes user behavior to gain network insights; similar approaches were presented in academia by the authors of [69, 109]. The Wireless Network Guardian detects user anomalies in mobile networks, where a comparable topic was discussed in [110]. Preventive Complaint Analysis makes use of big data analytics to detect behavioral anomalies in mobile network elements, where the authors in [111] provided a similar approach. Predictive Care utilizes big data analytics to identify anomalies in network elements before they affect the user; a comparable academic approach is presented in [110, 112]. HP presented Vertica, a solution that exploits CDRs for network planning, optimization, and fault prediction purposes.
Table 3: Big data analytics-powered industrial solutions.

No. | Manufacturer | Solution Name | Related Academic Papers | Usage, Functions and Capabilities
1 | Juniper | NetReflex IP | [27, 73, 108] | Eliminates network errors. Monitors QoS/QoE. Capacity planning, traffic routing, caching, and other optimizations.
1 | Juniper | NetReflex MPLS | [27, 73, 108] | Segments and trends MPLS and VPN usage to plan for congestion. Identifies traffic utilization and trends to optimize operational cost. Ability to slice network performance according to VPN, Class of Service (CoS), and Provider Edge (PE)-PE, enabling more efficient planning.
2 | Nokia | Traffica | [69, 109] | Real-time issues detection and network troubleshooting. Gains real-time, end-to-end insight on traffic, network, devices, and subscribers.
2 | Nokia | Wireless Network Guardian | [110] | Improves end-to-end network analytics and reporting with real-time subscriber-level information. Detects anomalies and reports airtime, signaling, and bandwidth resource consumption. Proactive detection of issues, including automatic detection of user anomalies and low QoE score alerts.
2 | Nokia | Preventive Complaint Analysis | [111] | Detects network elements' behavior anomalies. Predicts where customer complaints might arise and prioritizes network optimization accordingly.
2 | Nokia | Predictive Care | [110, 112] | Used for network elements, and proved its effectiveness by helping Shanghai Mobile become more agile and responsive. Accuracy of the simplified alerts is around 98 percent, reducing operational workload.
3 | HP (HPE) | Vertica | [64, 113] | Provides CDR analysis that can help Communication Service Providers (CSPs). Examines dropped call records and other maintenance data to determine where to invest in infrastructure. Failure prediction and proactive maintenance.
4 | Amdocs | Deep Network Analytics | [114] | Combines RAN information with BSS and customer data to deploy the network proactively. Predictive maintenance.
5 | Apervi | Apervi's Real-time Log Analytics Solution (ARLAS) | [115-117] | Collects, aggregates, and stores log data in real-time.

The authors in [64, 113] researched similar approaches. Amdocs' Deep Network Analytics provides predictive maintenance and proactive network deployment for cellular networks. The authors in [114] presented a similar approach. Log analytics can be used for a variety of purposes. Apervi's ARLAS solution provides real-time collection and storage of network logs. Related academic research was presented by the authors in [115-117]. Examining the above solutions, one can note that the majority of the solutions are in the wireless field. This, in fact, coincides with the orientation of the academically-researched topics. Sampling through the offered solutions, we noticed the increased interest in anomaly prediction and network node deployment, which offers the customer a service that is as close to optimal as possible while minimizing network expansion expenditures.

8. Big data analytics-powered design cycle and challenges
In this section, we highlight a common theme among most of the surveyed papers, depicted in Fig. 4. We also illustrate the challenges facing the implementation of big data analytics in network design and operation.

Fig. 4. Big-data-powered network design cycle.

8.1 Big data analytics design cycle
The quest for a well-designed communication network is never-ending. Researchers in the big data era rely on the capabilities offered by big data analytics to transform the way networks are being designed. This includes employing big data analytics to predict and minimize the bandwidth utilization, anticipate and prepare for upcoming failures, and predict the precise energy requirements, hence creating a network with fewer outages, higher user satisfaction, and enhanced performance.
The network design process using big data can be outlined as shown in Fig. 4. Big data is collected from the network, stored, and processed in a big data cluster to extract useful information, such as trends, patterns, and correlations (step 1). The resulting information is then transferred to the decision-making platforms where a new design decision for the network is evaluated by algorithms based on the inferred knowledge (step 2). Finally, the new design decision is sent as feedback configuration parameters to the network where re-configuration is implemented (step 3).
It should be noted that the duration of the above-mentioned cycle might vary depending on the application type of the network, e.g., enterprise, healthcare, agriculture, or transportation. For instance, enterprise networks can generate large amounts of data over a short period, and configuration faults can usually be undone at any time. On the other hand, healthcare networks usually generate less monitoring data over time, and they should not be re-configured until there is sufficient data available, as frequent reconfigurations may result in failures with severe impacts on people's health.

8.2 Challenges facing the use of Big Data Analytics in Network Design

8.2.1 Network size vs Big Data Analytics gains
The ease of redesigning a network through the feedback cycle shown in Fig. 4 is highly affected by the network size, i.e., the number of nodes.
For instance, large data streams can be generated from the mass deployment of small Wireless Sensor Network (WSN) nodes and IoT [118]. The collected data may not carry a meaningful value until it is effectively analyzed. However, analyzing or mining that immense amount of data demands finely tuned big data analytical capabilities, which turns out to be a challenging task [119]. Furthermore, these massive amounts of data require hierarchical communication and data processing solutions. The planning of such deployments in conjunction with the data processing framework is a challenging task [118].
Comparing optical to IoT networks, the former have a small number of nodes, hence they are easier to redesign, while the latter have a larger number of connected objects, and that can pose a problem.

8.2.2 Security and privacy
Users' common patterns can be of great help. Network users can share certain patterns, like downloading some popular videos, retweeting about an upcoming game that would take place downtown, or even checking the same online channels. This information can be of great value when used for network planning or optimization. However, to use this information, access to user data has to be obtained, which is a thought that may cause unrest for many.
When dealing with user data, there is always a flag raised, and that flag carries two issues: the security and the privacy of the data. This is why big data has to be protected from unauthorized access and release [35].
Big data security is a vital topic. If we want to label a system as "secured", it must meet the data security requirements, which are [120]:
1- Confidentiality: This implies the means to protect the data from unapproved disclosure.
2- Integrity: This implies the measures taken to protect the data from being modified improperly or without permission.
3- Availability: This is the system's ability to prevent and recover from hardware as well as software failures that might result in the database system being unavailable.
Privacy of data is an increasing concern. As a matter of fact, having accessible data does not mean it is ethical to access it [121]. Electronic health records have strict laws that precisely identify what can and cannot be accessed.
As an example, a user's location information can be tracked through cell towers, and after a while "a trail of crumbs" is going to be left by the user that could be used to link the user to their residence or office location, and eventually to determine the user's identity. Private health information (e.g., attending a cancer treatment center) or religious preferences (e.g., attending a church) may be discovered by tracking the user's movement over time [122], especially when we take into consideration the close correlation between an individual's identity and their movement patterns [123]. Some user data can be very valuable. For example, the estimated value of all global personal location data could reach $100 billion in revenue during the next 10 years for service providers, and when it comes to consumers and business end users, that figure can reach up to $700 billion [39].
With no obvious and secure way to handle the collected user data, big data analytics cannot be considered a reliable system. The security issues related to big data analytics can be divided into four concerns, starting with an input (e.g., handheld device, sensor, or even IoT device) where protecting the sensors from being compromised by attacks is regarded as an important security issue, as well as the other areas of data analysis, output, and communication with other systems [124]. It should be noted that these concerns are present in all steps throughout the design cycle shown in Fig. 4.
A solution that has been designed to address the big data security and privacy challenge is the integrated Rule-Oriented Data System (iRODS) [125]. This technology was designed to ensure security and privacy in big data, and it has technological features such as a federated data grid or "intelligent clouds", a distributed rule engine, the "iCAT" metadata catalogue, a storage access layer that facilitates common access, two ways of interfacing (graphical and command line), and APIs to interact with the iRODS data grid [35, 125].
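As a small illustration of the integrity requirement listed among the data security requirements above, the following sketch attaches a keyed hash (HMAC-SHA256) to a collected record so that later modification can be detected. It is not taken from any of the surveyed systems: the record fields, the key, and the helper names `sign_record` and `verify_record` are hypothetical, and key management is out of scope.

```python
import hashlib
import hmac
import json


def sign_record(record: dict, key: bytes) -> str:
    """Return an HMAC-SHA256 tag over a canonical JSON encoding of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_record(record: dict, tag: str, key: bytes) -> bool:
    """Check the record against its tag using a constant-time comparison."""
    return hmac.compare_digest(sign_record(record, key), tag)


# Hypothetical monitoring record and key, for illustration only.
key = b"demo-key-not-for-production"
record = {"cell_id": "A17", "bytes": 1024}
tag = sign_record(record, key)

tampered = dict(record, bytes=1)
```

Here `verify_record(record, tag, key)` succeeds, whereas the same call on `tampered` fails, because any change to the record changes the tag that the key holder would compute.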
In a position paper, the authors of [126] noted a number of
privacy-preserving challenges in the realm of big data
analytics, and these challenges are classified as follows:
1- Individuals’ Interaction:
a. Transparency: Big data analytics is mostly associated
with information collection and processing of specific
individuals’ data. This means that each
individual is entitled to know about the data processing
operations conducted on his/her data, and the
challenging part is in locating the specific piece of
information linked to that person’s identity.
b. Individual’s Consent: According to many privacy laws,
an individual is entitled to the right to be asked for
his/her informed consent, and such consent is a way of
ensuring the individual is aware of the type of
processing that is conducted. Obtaining this type of consent,
along with the explanation it requires, is in fact
considered challenging.
c. Consent Cancellation and Discarding Personal Data:
Granting consent should also come with the
right to revoke it. However, if an individual wishes
for his/her consent to be canceled, then all
personal data has to be erased as well. This is a
challenging requirement when considering the fact that
the data might have been spread to various data
collectors and data analysts.
2- Re-Identification Attacks: A user’s identity may be
compromised when correlating different types of datasets,
and this type of attack was further classified into:
a. Correlation Attacks.
b. Arbitrary Identification Attacks.
c. Targeted Identification Attacks.
3- Probable vs. Provable Results: Different results can be
produced by different queries conducted upon datasets. In
this way, a provable link can turn out to be merely a
probable one.
4- Economic Outcomes: Providing huge amounts of datasets
in advance is essential for big data analytics to work. One
way to provide such datasets is by buying them from data
providers who offer to sell their users’ data to their
customers, thus privacy threats might appear. Context
faults along with confusion and distraction are just two
examples of other threats (i.e., fraud, censorship, and
surveillance).

8.2.3 Data center scalability
In the big data paradigm, data centers are not only a
platform to concentrate data storage, but can also carry out
further responsibilities, such as acquiring, managing,
organizing, processing, and leveraging data values and
functions. That would encourage the growth of the
infrastructure and related software [36].
The continuous expansion in data volume, the
ever greater demand for faster processing speeds, and the
increasing complexity of Relational Database Management
Systems (RDBMSs) are considered the main elements
motivating the search for expandable (scalable) data centers to
handle the data volume and parallel processing requirements;
hence, a number of technical challenges have to be taken into
consideration when we try to design a scalable data center that
can efficiently store, process, and analyze big data. These
challenges can be mapped to the middle octagon (big data
cluster) shown in Fig. 4, and they are:
• Given the variety and sheer volume of
the disparate data sources, merely collecting and integrating
data from scattered locations in a scalable manner is a difficult
task to accomplish.
• Massive datasets must be mined by big data analytics at
different levels and in either a real-time or near-real-time
fashion.
• Massive and heterogeneous datasets are to be stored and
managed by big data systems while providing the function
and performance guarantees needed in terms of fast
retrieval, scalability, and privacy protection. Facebook is
a clear example in this regard, as it needs to
store, access, and analyze over 30 petabytes of
user-generated data [39].
Although some might claim that the current problem is not
about storage (large volume) but about online
processing ability [11], a scalable data center should also
incorporate a scalable storage system.
Non-volatile memory (NVM) technologies are expected to have a
promising role in future memory/storage designs [127].
An ideal storage platform has three vital constraints
to meet: it should support efficient data access in case of
failure (network partitions and node failures), offer its clients a
consistent view of the data, and provide high availability.
However, according to Brewer’s CAP theorem [128], this
ideal system cannot exist, because it is
impossible to guarantee consistency and offer high
availability at the same time in the presence of network partitions.
As a result, one of the above constraints has to be relaxed by
distributed storage systems [127].
When it comes to securing the required processing speed,
Chip Multiprocessors (CMPs) are expected to be the
computational workhorses for big data analytics [127]. Targeting
this emerging trend, Datacenter-on-Chip (DoC) architectures
were proposed by the authors of [39], with four usage models
that depend on whether the consolidated applications
cooperate or not. Key scalability challenges, such as
cache hierarchies and the lack of performance isolation,
were identified and addressed [127, 129].

9. Open research directions
1- Processing minimization: The first step in the processing of
big data is the collection of data and its
pre-processing. Data cleaning is one form of data
pre-processing. One particular example where pre-processing
might be implemented is using Computational Radio
Frequency Identification (CRFID) sensors. In this
approach, sensors can be wirelessly powered
using technologies like magnetic resonant coupling
[130], upon proximity to a moving collector object (e.g., a
vehicle). This would enable the movement of some of the
pre-processing tasks towards the CRFID sensors’ side, thus
collecting an already cleaned and reduced amount of data
that is transferred to the relay before moving it to the data
center for final processing. This would allow more
efficiency, reduce the analysis time, allow for better
storage utilization, and facilitate real-time analytics. As a
result, it would lead to faster decision making and a
better-optimized network.
2- Facilitating satellite-based Internet connectivity in highly
populated and poor areas: Projects like SpaceX are
already emerging, with more than 4000 satellites and more
than $1 billion in combined funding. The project, announced
by Elon Musk [131], intends to provide high-speed
satellite Internet worldwide. By utilizing big data
analytics in the field of satellite communication networks,
more power can be focused in a selective fashion. The
result is lower signal-reception requirements, e.g., smaller
antenna size and lower Block Up Converter (BUC) power
in the above-mentioned areas. Big data analytics can be
used to correlate ground data, e.g., geographical information and
weather conditions, along with economy-related data to
help identify these areas.
3- Efficient use of idle time: Big data analytics can be used by
operators to help them run their own data and discover
patterns that would facilitate service and network
optimization. However, analytics may not be a 24/7 job,
especially if it is a batch process. Hence, the equipment
and the software would be left in an idle state. An
operator may offer the use of his/her equipment to his/her
clients from medium and small sized businesses. They
could run their data during the idle time, which would
offer better energy utilization, provide big data analytics
for everyone, and create another source of revenue from which
everyone benefits. Game theory approaches can be
harnessed here to coordinate resource provisioning among
several providers.
4- Analytics reuse: Cellular networks have high similarity in
terms of equipment capabilities, specifications, subscriber
requirements, and subscriber geographic distribution.
Operators can therefore benefit from other operators’ big data
analytics: the result of running the data can be applied
directly, or after small modifications, for example by
omitting the parts associated with features that differ between the
two networks. This would reduce the purchasing cost,
minimize the energy consumption, and reduce the
optimization time by adopting a proven solution. Another
challenge here is to provide standard APIs between the
different operators’ equipment so they can access each
other’s data in an agreed-upon manner.
5- Big data and IoT node placement: The main cause for the
increase in IoT sensors is the desire to collect more data,
which, in turn, will result in reaching an enhanced control
or comprehension. According to HP, by 2030, IoT sensors
will reach one trillion, and this will make IoT data the most
significant part of big data [36]. However, gathering data
efficiently requires placing the IoT sensors where they can
harvest as much data as possible. Many sensors are simply
wasted due to placing them in the wrong location (a
location that will not be helpful in providing a valuable
amount / type of data). Big data can be used to identify
these IoT sensors and simply propose better locations,
especially when coupled with other information, like
weather conditions, social activities, and movement
patterns. To reach the optimal IoT network design, big data
analytics can correlate several parameters (e.g., traffic
patterns, social events, network parameters, and weather
conditions) to determine where the best locations are to
place the sensors.
6- Providing test environments for critical applications:
Collecting large amounts of processed data may not be
enough to proceed with network reconfiguration. This has
to be considered for some critical applications (e.g., health
care, military, and aerospace) where human lives could be
jeopardized. The design cycle has to comprise an
additional test environment in which the proposed design
modification has to undergo a certain test cycle before
being put to work, although this might postpone the
ratification of the newly proposed design. There will
always be a trade-off between accuracy and speed. It is
true that waiting for sufficient data to be accumulated
would pay off as a better decision-making step, but that
rule is not suitable when it comes to critical applications
(e.g., medical networks). The design cycle has to undergo
a thorough test first. Identifying these applications and
providing suitable test environments is a very important
task.
7- Selecting the most efficient energy source for network
nodes: Another aspect that can be added for a greener
network is the ability to selectively utilize energy sources
based on the correlation of energy source attributes and
their ability to serve a particular task. For example, a solar
energy source can be ideal for outdoor usage during the
day when it is sunny, with a backup plan to switch to other
sources (e.g., electrical) during special events or bad
weather conditions. This can be the case for IoT devices
scattered in a business district, where they are mostly
utilized during the day, while running idle after the usual
office hours.

10. Conclusions
There are many areas in which big data analytics can be
utilized in the network design process. The concept of
gathering network data and correlating them with user trends
and service requirements can indeed create an adaptive and
user-centric network design.
Throughout our survey, we noticed considerable focus on the
field of wireless communication network design using big
data. Delving deeper reveals that the field of 5G is getting the
majority of the researchers’ attention due to the new
opportunities it has to offer. The optical networking, inter-DC
and SDN fields, on the other hand, have yet further research
challenges to tackle. We also note that the integration of SDN
and big data analytics would facilitate the perfection of the
design cycle. The field of network security also has its share
where big data analytics is utilized to detect security threats.
Industrial efforts toward optimizing networks based on big
data analytics reflect the increasing trend toward employing
AI-like approaches, such as pattern recognition and machine
learning for network design.
Some of the considered solutions handle big data in a batch
manner while others are capable of performing real-time
processing. Handling big data in a batch mode can offer more
accurate information at the expense of delayed results due to
the size of the processed data, while real-time processing
offers fast results at the expense of accuracy. Hence, it would
be an application-dependent decision whether to choose the
former or the latter option.
We predict that the field of network design based on big data
analytics will continue to flourish in the near future as more
data are collected from the networks and processed to extract
useful information regarding network behavior. In the far
future, or maybe quite soon, as some claim, employing
quantum computing for machine learning purposes could help
in dethroning Moore’s law and provide more processing space
per unit time. This extra space can be harnessed for big data
analytics employed in network design.

11. Acknowledgment
The authors would like to acknowledge funding from the
Engineering and Physical Sciences Research Council
(EPSRC), INTERNET (EP/H040536/1) and STAR
(EP/K016873/1) projects.

12. References
[1] J. Qadir, N. Ahad, E. Mushtaq, M. Bilal, SDNs, Clouds, and Big Data: New Opportunities, 2014 12th International Conference on Frontiers of Information Technology, IEEE, 2014, pp. 28-33.
[2] R. Tudoran, A. Costan, G. Antoniu, OverFlow: Multi-Site Aware Big Data Management for Scientific Workflows on Clouds, IEEE Transactions on Cloud Computing, 4 (2015) 76-89.
[3] S. Gole, A survey of Big Data in social media using data mining techniques, 2015 International Conference on Advanced Computing and Communication Systems, IEEE, 2015, pp. 1-6.
[4] Z. Nyikes, Z. Rajnai, Big Data, As Part of the Critical Infrastructure, (2015) 217-222.
[5] L. Null, J. Lobur, The essentials of computer organization and architecture, Jones & Bartlett Publishers, 2014.
[6] J. Shemer, P. Neches, The genesis of a database computer, Computer, 17 (1984) 42-56.
[7] V.R. Borkar, M.J. Carey, C. Li, Big data platforms, XRDS: Crossroads, The ACM Magazine for Students, 19 (2012) 44.
[8] S. Ghemawat, H. Gobioff, S.-t. Leung, The Google File System, (2003).
[9] D.J. DeWitt, B. Gerber, G. Graefe, M. Heytens, K. Kumar, G.A. Muralikrishna, A High Performance Dataflow Database Machine, Computer Science Department, University of Wisconsin, 1986.
[10] S. Fushimi, M. Kitsuregawa, H. Tanaka, An Overview of The System Software of A Parallel Relational Database Machine GRACE, VLDB, 1986, pp. 209-219.
[11] S. Yin, O. Kaynak, Big Data for Modern Industry :, Proceedings of the IEEE, 103 (2015) 143-146.
[12] A.S. Alghamdi, I. Ahmad, T. Hussain, Big Data for C4I Systems: Goals, Applications, Challenges and Tools, International Conference on Innovative Computing Technology (INTECH), 2015, pp. 89-93.
[13] A. McAfee, E. Brynjolfsson, Big Data. The management revolution, Harvard Business Review, 90 (2012) 61-68.
[14] V. Moreno-Cano, F. Terroso-Saenz, A.F. Skarmeta-Gomez, Big data for IoT services in smart cities, 2015 IEEE 2nd World Forum on Internet of Things (WF-IoT), IEEE, 2015, pp. 418-423.
[15] K. Sravanthi, T. Subba Redy, Applications of BIG Data in Various Fields, International Journal of Computer Science and Technologies, 6 (2015) 4629-4632.
[16] Y. Lv, Y. Duan, W. Kang, Z. Li, F.-Y. Wang, Traffic Flow Prediction With Big Data: A Deep Learning Approach, IEEE Transactions on Intelligent Transportation Systems, 16 (2014) 865-873.
[17] S. Landset, T.M. Khoshgoftaar, A.N. Richter, T. Hasanin, A survey of open source tools for machine learning with big data in the Hadoop ecosystem, Journal of Big Data, 2 (2015) 24.
[18] H. Baek, S.-K. Park, Sustainable Development Plan for Korea through Expansion of Green IT: Policy Issues for the Effective Utilization of Big Data, Sustainability, 7 (2015) 1308-1328.
[19] S. Kaisler, F. Armour, J.A. Espinosa, W. Money, Big Data: Issues and Challenges Moving Forward, 2013 46th Hawaii International Conference on System Sciences, (2013) 995-1004.
[20] A. Gani, A. Siddiqa, S. Shamshirband, F. Hanum, A survey on indexing techniques for big data: taxonomy and performance evaluation, Knowledge and Information Systems, 46 (2016) 241-284.
[21] Y. Demchenko, P. Grosso, C. De Laat, P. Membrey, Addressing big data issues in Scientific Data Infrastructure, Proceedings of the 2013 International Conference on Collaboration Technologies and Systems, CTS 2013, IEEE, 2013, pp. 48-55.
[22] J. Andreu-Perez, C.C. Poon, R.D. Merrifield, S.T. Wong, G.Z. Yang, Big data for health, IEEE Journal of Biomedical and Health Informatics, 19 (2015) 1193-1208.
[23] L. Zhang, A framework to model big data driven complex cyber physical control systems, 2014 20th International Conference on Automation and Computing, IEEE, 2014, pp. 283-288.
[24] P.D.C.d. Almeida, J. Bernardino, Big Data Open Source Platforms, 2015 IEEE International Congress on Big Data, IEEE, 2015, pp. 268-275.
[25] C. Senbalci, S. Altuntas, Z. Bozkus, T. Arsan, Big data platform development with a domain specific language for telecom industries, 2013 High Capacity Optical Networks and Emerging/Enabling Technologies, HONET-CNS 2013, (2013) 116-120.
[26] B. Fan, S. Leng, K. Yang, A dynamic bandwidth allocation algorithm in mobile networks with big data of users and networks, IEEE Network, 30 (2016) 6-10.
[27] C.-L. I, Y. Liu, S. Han, S. Wang, G. Liu, On Big data Analytics for Greener and Softer RAN, IEEE Access, 3 (2015) 3068-3075.
[28] P. Russom, Big data analytics, TDWI Best Practices Report, (2011) 38.
[29] A. Belle, R. Thiagarajan, S.M. Soroushmehr, F. Navidi, D.A. Beard, K. Najarian, Big Data Analytics in Healthcare, Biomed Res Int, 2015 (2015) 370194.
[30] R. Buyya, K. Ramamohanarao, C. Leckie, R.N. Calheiros, A.V. Dastjerdi, S. Versteeg, Big Data Analytics-Enhanced Cloud Computing: Challenges, Architectural Elements, and Future Directions, (2015) 75-84.
[31] P. Gölzer, L. Simon, P. Cato, M. Amberg, Designing Global Manufacturing Networks Using Big Data, Procedia CIRP, 33 (2015) 191-196.
[32] R. Kapdoskar, S. Gaonkar, N. Shelar, A. Surve, P.S. Gavhane, Big Data Analytics, 4 (2015) 518-520.
[33] C. Hu, H. Li, Y. Jiang, Y. Cheng, P. Heegaard, Deep semantics inspection over big network data at wire speed, IEEE Network, 30 (2016) 18-23.
[34] E. Bastug, M. Bennis, E. Zeydan, M.A. Kader, I.A. Karatepe, A.S. Er, M. Debbah, Big data meets telcos: A proactive caching perspective, Journal of Communications and Networks, 17 (2015) 549-557.
[35] B. Matturdi, X. Zhou, S. Li, F. Lin, Big Data security and privacy: A review, China Communications, 11 (2014) 135-145.
[36] M. Chen, S. Mao, Y. Liu, Big data: A survey, Mobile Networks and Applications, 19 (2014) 171-209.
[37] A. Asahara, H. Hayashi, N. Ishimaru, R. Shibasaki, H. Kanasugi, International standard “OGC® moving features” to address “4Vs” on locational bigdata, 2015 IEEE International Conference on Big Data (Big Data), IEEE, 2015, pp. 1958-1966.
[38] L. He, P. Yue, Moving towards intelligent giservices, 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), IEEE, 2015, pp. 1373-1376.
[39] H. Hu, Y. Wen, T.-S. Chua, X. Li, Toward Scalable Systems for Big Data Analytics: A Technology Tutorial, IEEE Access, 2 (2014) 652-687.
[40] B. Cyganek, M. Grana, A. Kasprzak, K. Walkowiak, M. Wozniak, Selected aspects of electronic health record analysis from the big data perspective, 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), IEEE, 2015, pp. 1391-1396.
[41] L. Cui, F.R. Yu, Q. Yan, When big data meets software-defined networking: SDN for big data and big data for SDN, IEEE Network, 30 (2016) 58-65.
[42] Y. Demchenko, E. Gruengard, S. Klous, Instructional Model for Building Effective Big Data Curricula for Online and Campus Education, 2014 IEEE
6th International Conference on Cloud Computing Technology and Science, IEEE, 2014, pp. 935-941.
[43] M.A.-u.-d. Khan, M.F. Uddin, N. Gupta, Seven V's of Big Data understanding Big Data to extract value, Proceedings of the 2014 Zone 1 Conference of the American Society for Engineering Education, IEEE, 2014, pp. 1-5.
[44] M.K. Pusala, M.A. Salehi, J.R. Katukuri, Y. Xie, V. Raghavan, Massive Data Analysis: Tasks, Tools, Applications, and Challenges, Big Data Analytics, Springer, 2016, pp. 11-40.
[45] S. Pyne, B.P. Rao, S.B. Rao, Big Data Analytics: Methods and Applications, Springer, 2016.
[46] K. Lee, K. Jung, J. Park, D. Kwon, ARLS: A MapReduce-based output analysis tool for large-scale simulations, Advances in Engineering Software, 95 (2016) 28-37.
[47] T. White, Hadoop: The definitive guide, O'Reilly Media, Inc., 2012.
[48] M. Lemoudden, B.E. Ouahidi, Managing cloud-generated logs using big data technologies, 2015 International Conference on Wireless Networks and Mobile Communications (WINCOM), IEEE, 2015, pp. 1-7.
[49] D. Singh, C.K. Reddy, A survey on platforms for big data analytics, Journal of Big Data, 2 (2014) 8.
[50] A.B. Ayed, M.B. Halima, A.M. Alimi, MapReduce Based Text Detection in Big Data Natural Scene Videos, Procedia Computer Science, 53 (2015) 216-223.
[51] N. Zhu, X. Liu, J. Liu, Y. Hua, Towards a cost-efficient MapReduce: mitigating power peaks for Hadoop clusters, Tsinghua Science and Technology, 19 (2014) 24-32.
[52] Big Data in the Enterprise: Network Design Considerations, White Paper, (2011) 1-33.
[53] K. Shvachko, H. Kuang, S. Radia, R. Chansler, The Hadoop Distributed File System, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), IEEE, 2010, pp. 1-10.
[54] M. Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker, I. Stoica, Spark: Cluster Computing with Working Sets, HotCloud'10 Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, (2010) 10.
[55] N. Marz, Storm-distributed and fault-tolerant realtime computation, 2013, URL http://www.storm-project.net.
[56] M. Isard, M. Budiu, Y. Yu, A. Birrell, D. Fetterly, Dryad: distributed data-parallel programs from sequential building blocks, ACM SIGOPS Operating Systems Review, ACM, 2007, pp. 59-72.
[57] V.K. Vavilapalli, S. Seth, B. Saha, C. Curino, O. O'Malley, S. Radia, B. Reed, E. Baldeschwieler, A.C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, Apache Hadoop YARN, Proceedings of the 4th annual Symposium on Cloud Computing - SOCC '13, ACM Press, New York, New York, USA, 2013, pp. 1-16.
[58] P. Zikopoulos, C. Eaton, D. DeRoos, Understanding big data, New York et al: McGraw …, (2012) 166.
[59] F. Morales, M. Ruiz, L. Gifre, L.M. Contreras, V. López, L. Velasco, Virtual network topology adaptability based on data analytics for traffic prediction, Journal of Optical Communications and Networking, 9 (2017) A35-A45.
[60] A. Imran, A. Zoha, Challenges in 5G: how to empower SON with big data for enabling 5G, IEEE Network, 28 (2014) 27-33.
[61] H. Daki, A. El Hannani, A. Aqqal, A. Haidine, A. Dahbi, H. Ouahmane, Towards adopting Big Data technologies by mobile networks operators: A Moroccan case study, Cloud Computing Technologies and Applications (CloudTech), 2016 2nd International Conference on, IEEE, 2016, pp. 154-161.
[62] D.S. Terzi, R. Terzi, S. Sagiroglu, Big data analytics for network anomaly detection from netflow data, International Conference on Computer Science and Engineering (UBMK), IEEE, 2017, pp. 592-597.
[63] Ö.F. Çelebi, E. Zeydan, Ö.F. Kurt, Ö. Dedeoglu, Ö. Iieri, B.A. Sungur, A. Akan, S. Ergut, On use of big data for enhancing network coverage analysis, 2013 20th International Conference on Telecommunications, ICT 2013, IEEE, 2013, pp. 1-5.
[64] I.A. Karatepe, E. Zeydan, Anomaly Detection In Cellular Network Data Using Big Data Analytics, European Wireless 2014; 20th European Wireless Conference; Proceedings of, 2014, pp. 1-5.
[65] E.J. Khatib, R. Barco, P. Munoz, I.D. La Bandera, I. Serrano, Self-healing in mobile networks with big data, IEEE Communications Magazine, 54 (2016) 114-120.
[66] E.J. Khatib, R. Barco, I. Serrano, P. Munoz, LTE performance data reduction for knowledge acquisition, Globecom Workshops (GC Wkshps), 2014, IEEE, 2014, pp. 270-274.
[67] I. de la Bandera, R. Barco, P. Munoz, I. Serrano, Cell Outage Detection Based on Handover Statistics, Communications Letters, IEEE, 19 (2015) 1189-1192.
[68] A. Sahni, D. Marwah, R. Chadha, Real time monitoring and analysis of available bandwidth in cellular network-using big data analytics, Computing for Sustainable Global Development (INDIACom), 2015 2nd International Conference on, (2015) 1743-1747.
[69] J. Liu, F. Liu, N. Ansari, Monitoring and analyzing big traffic data of a large-scale cellular network with Hadoop, IEEE Network, 28 (4) (2014) 32-39.
[70] W. Huang, Z. Chen, W. Dong, H. Li, B. Cao, J. Cao, Mobile Internet big data platform in China Unicom, Tsinghua Science and Technology, 19 (2014) 95-101.
[71] Wi-Fi direct | Wi-Fi Alliance, URL: http://www.wi-fi.org/discover-wi-fi/wi-fi-direct.
[72] A. Omar, Improving Data Extraction Efficiency of Cache Nodes in Cognitive Radio Networks Using Big Data Analysis, 2015 9th International Conference on Next Generation Mobile Applications, Services and Technologies, IEEE, 2015, pp. 305-310.
[73] K. Zheng, Z. Yang, K. Zhang, P. Chatzimisios, K. Yang, W. Xiang, Big data-driven optimization for mobile networks toward 5G, IEEE Network, 30 (1) (2016) 44-51.
[74] A. Checko, H.L. Christiansen, Y. Yan, L. Scolari, G. Kardaras, M.S. Berger, L. Dittmann, Cloud RAN for mobile networks—a technology overview, Communications Surveys & Tutorials, IEEE, 17 (2015) 405-426.
[75] K.I. Pedersen, Y. Wang, S. Strzyz, F. Frederiksen, Enhanced inter-cell interference coordination in co-channel multi-layer LTE-advanced networks, Wireless Communications, IEEE, 20 (2013) 120-127.
[76] C.-L. Lee, W.-S. Su, K.-A. Tang, W.-I. Chao, Design of handover self-optimization using big data analytics, The 16th Asia-Pacific Network Operations and Management Symposium, IEEE, 2014, pp. 1-5.
[77] P. Kiran, M.G. Jibukumar, C.V. Premkumar, Resource allocation optimization in LTE-A/5G networks using big data analytics, 2016 International Conference on Information Networking (ICOIN), IEEE, 2016, pp. 254-259.
[78] M. Cayrol, H. Farreny, H. Prade, Fuzzy pattern matching, Kybernetes, 11 (1982) 103-116.
[79] A. Imran, A. Zoha, A. Abu-Dayya, Challenges in 5G: How to Empower SON with Big Data for Enabling 5G, IEEE Network, 28 (2014) 27-33.
[80] G. Widmer, M. Kubat, Learning in the presence of concept drift and hidden contexts, Machine learning, 23 (1996) 69-101.
[81] J.A. Hartigan, M.A. Wong, Algorithm AS 136: A K-Means Clustering Algorithm, Journal of the Royal Statistical Society. Series C (Applied Statistics), 28 (1979) 100-108.
[82] D.M. Blei, A.Y. Ng, M.I. Jordan, Latent dirichlet allocation, The Journal of Machine Learning Research, 3 (2003) 993-1022.
[83] Z. Niu, Y. Wu, J. Gong, Z. Yang, Cell zooming for cost-efficient green cellular networks, IEEE Communications Magazine, 48 (2010) 74-79.
[84] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, J. Turner, OpenFlow: enabling innovation in campus networks, ACM SIGCOMM Computer Communication Review, 38 (2008) 69-74.
[85] H. Cui, Y. Zhang, C. Ma, W. Lai, N.C. Beaulieu, S. Sobolevsky, Y. Liu, Design and Realization of Cognitive Routing Resources Using Big Data Analysis in SDN, 2015 IEEE International Congress on Big Data, 2 (2015) 424-429.
[86] M.V. Neves, C.A.F.D. Rose, K. Katrinis, H. Franke, Pythia: Faster Big Data in Motion through Predictive Software-Defined Network Optimization at Runtime, (2014) 82-90.
[87] P. Costa, A. Donnelly, A. Rowstron, G. O'Shea, Camdoop: Exploiting In-network Aggregation for Big Data Applications, Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), 2012, pp. 29-42.
[88] P. Costa, A. Donnelly, G. O'Shea, A. Rowstron, CamCube: a key-based data center, Microsoft Res., Redmond, WA, USA, Technical Report MSR TR-2010-74, (2010).
[89] Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P.K. Gunda, J. Currey, DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language, OSDI, 2008, pp. 1-14.
[90] R. Ramaswami, K.N. Sivarajan, Routing and wavelength assignment in all-optical networks, IEEE/ACM Transactions on Networking (TON), 3 (1995) 489-500.
[91] G. Shen, Y. Li, L. Peng, Almost-optimal design for optical networks with hadoop cloud computing: Ten ordinary desktops solve 500-node, 1000-link, and 4000-request RWA problem within three hours, Transparent Optical
Networks (ICTON), 2013 15th International Conference on, IEEE, 2013, pp. Personal Multimedia Communications (WPMC), 2013 16th International
1-4. Symposium on, IEEE, 2013, pp. 1-6.
[92] R.G. Michael, S.J. David, Computers and intractability: a guide to the [116] G. Qi, W.-T. Tsai, W. Li, Z. Zhu, Y. Luo, A cloud-based triage log
theory of NP-completeness, WH Free. Co., San Fr, (1979). analysis and recovery framework, Simulation Modelling Practice and Theory,
[93] Y. Li, G. Shen, B. Chen, M. Gao, X. Fu, Applying Hadoop Cloud 77 (2017) 292-316.
Computing Technique to Optimal Design of Optical Networks, Asia [117] B.H. Park, S. Hukerikar, R. Adamson, C. Engelmann, Big Data Meets
Communications and Photonics Conference, Optical Society of America, HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme
2015, pp. ASu3H. 3. Scale, Cluster Computing (CLUSTER), 2017 IEEE International Conference
[94] G. Shen, Y. Lui, S.K. Bose, “Follow the Sun, Follow the Wind” on, IEEE, 2017, pp. 758-765.
Lightpath Virtual Topology Reconfiguration in IP Over WDM Network, [118] C. Jardak, P. Mähönen, J. Riihijärvi, Spatial big data and wireless
Journal of Lightwave Technology, 32 (2014) 2094-2105. networks: experiences, applications, and research challenges, IEEE Network,
[95] C. Wang, G. Shen, S.K. Bose, Distance adaptive dynamic routing and 28 (2014) 26-31.
spectrum allocation in elastic optical networks with shared backup path [119] L. Xu, W. He, S. Li, Internet of Things in Industries: A Survey, IEEE
protection, Journal of Lightwave Technology, 33 (2015) 2955-2964. Transactions on Industrial Informatics, PP (2014) 1-11.
[96] Y. Li, H. Dai, G. Shen, S.K. Bose, Adaptive FEC selection for lightpaths [120] E. Bertino, Big Data - Security and Privacy, Proceedings of the 5th
in elastic optical networks, Optical Fiber Communication Conference, ACM Conference on Data and Application Security and Privacy, (2015) 757-
Optical Society of America, 2014, pp. W3A. 7. 761.
[97] A. Aguado, M. Davis, S. Peng, M.V. Alvarez, V. López, T. Szyrkowiec, A. Autenrieth, R. Vilalta, A. Mayoral, R. Muñoz, Dynamic virtual network reconfiguration over SDN orchestrated multitechnology optical transport domains, Journal of Lightwave Technology, 34 (2016) 1933-1938.
[98] B. Stone-Gross, M. Cova, L. Cavallaro, B. Gilbert, M. Szydlowski, R. Kemmerer, C. Kruegel, G. Vigna, Your botnet is my botnet: analysis of a botnet takeover, Proceedings of the 16th ACM conference on Computer and communications security, ACM, 2009, pp. 635-647.
[99] WordPress Sites Targeted by Mass Brute-force Botnet Attack | US-CERT, U.S. Department of Homeland Security Seal. United States Computer Emergency Readiness Team US-CERT, 2013.
[100] K. Singh, S.C. Guntuku, A. Thakur, C. Hota, Big Data Analytics framework for Peer-to-Peer Botnet detection using Random Forests, Information Sciences, 278 (2014) 488-497.
[101] C. Sanders, J. Smith, Applied network security monitoring: collection, detection, and analysis, Elsevier, 2013.
[102] Y. Liu, S. Guo, S. Hu, T. Rabl, H.-A. Jacobsen, J. Li, J. Wang, Performance Evaluation and Optimization of Multi-dimensional Indexes in Hive, IEEE Transactions on Services Computing, pp (2016) 1-1.
[103] S. Ramírez-Gallego, A. Fernández, S. García, M. Chen, F. Herrera, Big Data: Tutorial and guidelines on information and process fusion for analytics algorithms with MapReduce, Information Fusion, 42 (2018) 51-61.
[104] V.P. Janeja, A. Azari, J.M. Namayanja, B. Heilig, B-dids: Mining anomalies in a Big-distributed Intrusion Detection System, 2014 IEEE International Conference on Big Data (Big Data), IEEE, 2014, pp. 32-34.
[105] V. Ayma, R. Ferreira, P. Happ, D. Oliveira, R. Feitosa, G. Costa, A. Plaza, P. Gamba, Classification Algorithms for Big Data Analysis, a Map Reduce Approach, The International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, 40 (2015) 17.
[106] Q. Xu, R. Zheng, W. Saad, Z. Han, Device fingerprinting in wireless networks: Challenges and opportunities, IEEE Communications Surveys & Tutorials, 18 (2016) 94-104.
[107] A.L. Buczak, E. Guven, A survey of data mining and machine learning methods for cyber security intrusion detection, IEEE Communications Surveys & Tutorials, 18 (2016) 1153-1176.
[108] M. Molina, I. Paredes-Oliva, W. Routly, P. Barlet-Ros, Operational experiences with anomaly detection in backbone networks, Computers & Security, 31 (3) (2012) 273-285.
[109] F. Ricciato, Traffic monitoring and analysis for the optimization of a 3G network, IEEE Wireless Communications, 13 (4) (2006) 42-49.
[110] M.S. Parwez, D. Rawat, M. Garuba, Big Data Analytics for User Activity Analysis and User Anomaly Detection in Mobile Wireless Network, IEEE Transactions on Industrial Informatics, 13 (2017) 2058-2065.
[111] J. Spiess, Y. T'Joens, R. Dragnea, P. Spencer, L. Philippart, Using big data to improve customer experience and business performance, Bell Labs Technical Journal, 18 (4) (2014) 3-17.
[112] J. Zhong, W. Guo, Z. Wang, Study on network failure prediction based on alarm logs, Big Data and Smart City (ICBDSC), 2016 3rd MEC International Conference on, IEEE, 2016, pp. 1-7.
[113] L.H. Shuan, T.Y. Fei, S.W. King, G. Xiaoning, L.Z. Mein, Network Equipment Failure Prediction with Big Data Analytics, International Journal of Advances in Soft Computing & Its Applications, 8 (3) (2016) 59-69.
[114] K. Yang, R. Liu, Y. Sun, J. Yang, X. Chen, Deep Network Analyzer (DNA): A Big Data Analytics Platform for Cellular Networks, IEEE Internet of Things Journal, 4 (6) (2017) 2019-2027.
[115] Y. Qiao, Z. Lei, J. Yang, G. Cheng, FLAS: Traffic analysis of emerging applications on Mobile Internet using cloud computing tools, Wireless
[121] K. Crawford, Six provocations for big data, (2011) 1-17.
[122] A. Labrinidis, H.V. Jagadish, Challenges and opportunities with big data, Proceedings of the VLDB Endowment, 5 (2012) 2032-2033.
[123] M.C. González, C.A. Hidalgo, A.-L. Barabási, Understanding individual human mobility patterns, Nature, 453 (2008) 779-782.
[124] C.-W. Tsai, C.-F. Lai, H.-C. Chao, A.V. Vasilakos, Big data analytics: a survey, Journal of Big Data, 2 (2015) 21.
[125] A. Rajasekar, R. Moore, C.-Y. Hou, C.a. Lee, R. Marciano, A. de Torcy, M. Wan, W. Schroeder, S.-Y. Chen, L. Gilbert, P. Tooby, B. Zhu, iRODS Primer: Integrated Rule-Oriented Data System, Synthesis Lectures on Information Concepts, Retrieval, and Services, 2 (2010) 1-143.
[126] M. Jensen, Challenges of Privacy Protection in Big Data Analytics, 2013 IEEE International Congress on Big Data, IEEE, 2013, pp. 235-238.
[127] K. Kambatla, G. Kollias, V. Kumar, A. Grama, Trends in big data analytics, Journal of Parallel and Distributed Computing, 74 (2014) 2561-2573.
[128] E.A. Brewer, Towards robust distributed systems, PODC, 2000.
[129] R. Iyer, Datacenter-on-Chip Architectures Terascale Opportunities and Challenges, Intel Technology Journal, 11 (2007).
[130] T. Imura, Y. Hori, Maximizing Air Gap and Efficiency of Magnetic Resonant Coupling for Wireless Power Transfer Using Equivalent Circuit and Neumann Formula, IEEE Transactions on Industrial Electronics, 58 (2011) 4746-4752.
[131] K. Finley, Internet by Satellite Is a Space Race With No Winners, Wired, 2015.

Mohammed S. Hadi received the B.Sc. and M.Sc. degrees in computer engineering from Al-Nahrain University, Baghdad, Iraq, in 2003 and 2009 respectively. He is currently working toward the Ph.D. in Electrical Engineering at the University of Leeds, Leeds, U.K. From 2010 to 2015 he was an assistant lecturer at Al-Mansour University College, Baghdad, Iraq and, prior to that (2007–2010), he was an Intelligent Network (IN), Short Message System (SMS), and Public Switched Telephone Network (PSTN) engineer with ZTE Corporation for Telecommunication, Iraq. His research interests include big data analytics, network design and energy efficiency in networks.

Ahmed Q. Lawey received the BS degree (first-class Honors) in computer engineering from the University of Al-Nahrain, Iraq, in 2002, the MSc degree (with distinction) in computer engineering from the University of Al-Nahrain, Iraq, in 2005, and the PhD degree in communication networks from the University of Leeds, UK, in 2015. From 2005 to 2010 he was a core network engineer in ZTE Corporation for Telecommunication, Iraq branch. He is currently a lecturer in communication networks in the School of Electronic and Electrical Engineering, University of Leeds. His current research interests include energy efficiency in optical and wireless networks, big data, cloud computing and Internet of Things.
Taisir E. H. El-Gorashi received the B.S. degree (first-class Hons.) in electrical and electronic engineering from the University of Khartoum, Khartoum, Sudan, in 2004, the M.Sc. degree (with distinction) in photonic and communication systems from the University of Wales, Swansea, UK, in 2005, and the PhD degree in optical networking from the University of Leeds, Leeds, UK, in 2010. She is currently a Lecturer in optical networks in the School of Electrical and Electronic Engineering, University of Leeds. Previously, she held a Postdoctoral Research post at the University of Leeds (2010–2014), where she focused on the energy efficiency of optical networks investigating the use of renewable energy in core networks, green IP over WDM networks with datacenters, energy efficient physical topology design, energy efficiency of content distribution networks, distributed cloud computing, network virtualization and Big Data. In 2012, she was a BT Research Fellow, where she developed energy efficient hybrid wireless-optical broadband access networks and explored the dynamics of TV viewing behavior and program popularity. The energy efficiency techniques developed during her postdoctoral research contributed 3 out of the 8 carefully chosen core network energy efficiency improvement measures recommended by the GreenTouch consortium for every operator network worldwide. Her work led to several invited talks at GreenTouch, Bell Labs, Optical Network Design and Modelling conference, Optical Fiber Communications conference, International Conference on Computer Communications and EU Future Internet Assembly and collaboration with Alcatel Lucent and Huawei.

Jaafar M. H. Elmirghani (M’92–SM’99) is the Director of the Institute of Communication and Power Networks within the School of Electronic and Electrical Engineering, University of Leeds, UK. He joined Leeds in 2007 and prior to that (2000–2007) as chair in optical communications at the University of Wales Swansea he founded, developed and directed the Institute of Advanced Telecommunications and the Technium Digital (TD), a technology incubator/spin-off hub. He has provided outstanding leadership in a number of large research projects at the IAT and TD. He received the BSc in Electrical Engineering, First Class Honours from the University of Khartoum in 1989 and was awarded all 4 prizes in the department for academic distinction. He received the PhD in the synchronization of optical systems and optical receiver design from the University of Huddersfield UK in 1994 and the DSc in Communication Systems and Networks from University of Leeds, UK, in 2014. He has co-authored Photonic switching Technology: Systems and Networks, (Wiley) and has published over 450 papers. He has research interests in optical systems and networks.

Prof. Elmirghani is Fellow of the IET, Chartered Engineer, Fellow of the Institute of Physics and Senior Member of IEEE. He was Chairman of IEEE Comsoc Transmission Access and Optical Systems technical committee and was Chairman of IEEE Comsoc Signal Processing and Communications Electronics technical committee, and an editor of IEEE Communications Magazine. He was founding Chair of the Advanced Signal Processing for Communication Symposium which started at IEEE GLOBECOM’99 and has continued since at every ICC and GLOBECOM. Prof. Elmirghani was also founding Chair of the first IEEE ICC/GLOBECOM optical symposium at GLOBECOM’00, the Future Photonic Network Technologies, Architectures and Protocols Symposium. He chaired this Symposium, which continues to date under different names. He was the founding chair of the first Green Track at ICC/GLOBECOM at GLOBECOM 2011, and is Chair of the IEEE Green ICT initiative within the IEEE Technical Activities Board (TAB) Future Directions Committee (FDC), a pan IEEE Societies initiative responsible for Green ICT activities across IEEE, 2012-present. He is and has been on the technical program committee of 34 IEEE ICC/GLOBECOM conferences between 1995 and 2016 including 15 times as Symposium Chair. He has given over 55 invited and keynote talks over the past 8 years.

He received the IEEE Communications Society Hal Sobol award, the IEEE Comsoc Chapter Achievement award for excellence in chapter activities (both in international competition in 2005), the University of Wales Swansea Outstanding Research Achievement Award, 2006; and received in international competition: the IEEE Communications Society Signal Processing and Communication Electronics outstanding service award, 2009, a best paper award at IEEE ICC’2013. Related to Green Communications he received (i) the IEEE Comsoc Transmission Access and Optical Systems outstanding Service award 2015 in recognition of “Leadership and Contributions to the Area of Green Communications”, (ii) the GreenTouch 1000x award in 2015 for “pioneering research contributions to the field of energy efficiency in telecommunications”, (iii) the IET 2016 Premium Award for best paper in IET Optoelectronics and (iv) shared the 2016 Edison Award in the collective disruption category with a team of 6 from GreenTouch for their joint work on the GreenMeter.

He is currently an editor of: IET Optoelectronics and Journal of Optical Communications, and was editor of IEEE Communications Surveys and Tutorials and IEEE Journal on Selected Areas in Communications series on Green Communications and Networking. He was Co-Chair of the GreenTouch Wired, Core and Access Networks Working Group, an adviser to the Commonwealth Scholarship Commission, member of the Royal Society International Joint Projects Panel and member of the Engineering and Physical Sciences Research Council (EPSRC) College. He has been awarded in excess of £22 million in grants to date from EPSRC, the EU and industry and has held prestigious fellowships funded by the Royal Society and by BT. He was an IEEE Comsoc Distinguished Lecturer 2013-2016.
